5A  Data Simulation

~4 hours · Synthetic Data, Privacy, Pre-Analysis · Intermediate

Learning Objectives

  • Generate synthetic data that preserves the statistical properties of real datasets
  • Use simulated data safely with LLMs while protecting privacy
  • Develop and test cleaning code before survey/experiment data arrives
  • Conduct power analyses through simulation
  • Understand privacy-preserving data generation methods

5A.1 Why Simulate Data?

Data simulation is the process of generating artificial data that mimics the structure and statistical properties of real data. While this might seem counterintuitive for empirical research, simulated data serves several critical purposes:

  • Privacy protection: Share data structure with LLMs or collaborators without exposing sensitive information
  • Code development: Write and test cleaning/analysis code before real data arrives
  • Power analysis: Determine required sample sizes through Monte Carlo simulation
  • Method validation: Test whether your estimation strategy recovers known treatment effects (see the sketch below)
  • Teaching and documentation: Create reproducible examples without restricted data

When to Use Simulated vs. Real Data

Use simulated data for: Code development, debugging, documentation, sharing with AI assistants, power calculations, teaching.
Use real data for: Actual analysis, publication, drawing conclusions about the world.
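
Method validation, for instance, means planting a known effect in simulated data and checking that your estimator recovers it. Here is a minimal Python sketch (the effect size of 2.0 and the OLS specification are illustrative assumptions, not part of any particular study):

# Python: Check that OLS recovers a known treatment effect
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

np.random.seed(42)
n = 1000
true_effect = 2.0  # the "truth" we plant in the data

treatment = np.random.binomial(1, 0.5, n)
outcome = 10 + true_effect * treatment + np.random.normal(0, 5, n)

df = pd.DataFrame({'treatment': treatment, 'outcome': outcome})
model = smf.ols('outcome ~ treatment', data=df).fit()

# The estimate should be close to 2.0; a large gap signals
# a bug in the estimation code, not in the world
print(model.params['treatment'])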

5A.2 Basic Data Simulation

The simplest approach generates random data from known distributions, specifying the parameters you want.

Generating Random Variables

# Python: Basic data simulation
import numpy as np
import pandas as pd

# Set seed for reproducibility
np.random.seed(42)

n = 1000  # Sample size

# Continuous variables
income = np.random.normal(loc=50000, scale=15000, size=n)
age = np.random.uniform(low=18, high=65, size=n).astype(int)

# Categorical variables
education = np.random.choice(
    ['High School', 'Bachelors', 'Masters', 'PhD'],
    size=n,
    p=[0.4, 0.35, 0.2, 0.05]
)

# Binary variables
treatment = np.random.binomial(n=1, p=0.3, size=n)

# Combine into DataFrame
df = pd.DataFrame({
    'id': range(1, n+1),
    'age': age,
    'income': income,
    'education': education,
    'treatment': treatment
})
* Stata: Basic data simulation

* Set seed for reproducibility
set seed 42

* Set number of observations
clear
set obs 1000

* Generate ID
gen id = _n

* Continuous variables
gen income = rnormal(50000, 15000)
gen age = int(runiform(18, 65))

* Categorical variables (using runiform for custom probabilities)
gen edu_rand = runiform()
gen education = cond(edu_rand < 0.4, 1, ///
                cond(edu_rand < 0.75, 2, ///
                cond(edu_rand < 0.95, 3, 4)))
label define edu_lbl 1 "High School" 2 "Bachelors" 3 "Masters" 4 "PhD"
label values education edu_lbl

* Binary treatment (30% treated)
gen treatment = rbinomial(1, 0.3)
# R: Basic data simulation

# Set seed for reproducibility
set.seed(42)

n <- 1000  # Sample size

# Continuous variables
income <- rnorm(n, mean = 50000, sd = 15000)
age <- floor(runif(n, min = 18, max = 65))

# Categorical variables
education <- sample(
  c("High School", "Bachelors", "Masters", "PhD"),
  size = n,
  replace = TRUE,
  prob = c(0.4, 0.35, 0.2, 0.05)
)

# Binary treatment
treatment <- rbinom(n, size = 1, prob = 0.3)

# Combine into data frame
df <- data.frame(
  id = 1:n,
  age = age,
  income = income,
  education = education,
  treatment = treatment
)
Python Output
>>> df.head()
   id  age        income    education  treatment
0   1   52  57439.895382  High School          0
1   2   33  47920.318418    Bachelors          0
2   3   45  59699.403649      Masters          1
3   4   27  72846.350091  High School          0
4   5   61  46487.259783    Bachelors          1

>>> df.describe()
                id          age        income    treatment
count  1000.000000  1000.000000   1000.000000  1000.000000
mean    500.500000    41.023000  49891.352714     0.299000
std     288.819436    13.584219  14823.541892     0.458012
min       1.000000    18.000000   8234.178621     0.000000
25%     250.750000    29.000000  39831.259104     0.000000
50%     500.500000    41.000000  49754.126847     0.000000
75%     750.250000    53.000000  59842.089125     1.000000
max    1000.000000    64.000000  98412.571234     1.000000

>>> df['education'].value_counts()
High School    402
Bachelors      348
Masters        198
PhD             52
Name: education, dtype: int64
Stata Output
. set obs 1000
Number of observations (_N) was 0, now 1,000.

. summarize income age treatment

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
      income |      1,000    49891.35    14823.54    8234.18   98412.57
         age |      1,000       41.02    13.58421         18         64
   treatment |      1,000        .299    .4580124          0          1

. tab education

  education |      Freq.     Percent        Cum.
------------+-----------------------------------
High School |        402       40.20       40.20
  Bachelors |        348       34.80       75.00
    Masters |        198       19.80       94.80
        PhD |         52        5.20      100.00
------------+-----------------------------------
      Total |      1,000      100.00
R Output
> head(df)
  id age    income   education treatment
1  1  52  57439.90 High School         0
2  2  33  47920.32   Bachelors         0
3  3  45  59699.40     Masters         1
4  4  27  72846.35 High School         0
5  5  61  46487.26   Bachelors         1
6  6  38  51234.89         PhD         0

> summary(df)
       id              age            income         education   treatment
 Min.   :   1.0   Min.   :18.00   Min.   : 8234   Bachelors  :348   Min.   :0.000
 1st Qu.: 250.8   1st Qu.:29.00   1st Qu.:39831   High School:402   1st Qu.:0.000
 Median : 500.5   Median :41.00   Median :49754   Masters    :198   Median :0.000
 Mean   : 500.5   Mean   :41.02   Mean   :49891   PhD        : 52   Mean   :0.299
 3rd Qu.: 750.2   3rd Qu.:53.00   3rd Qu.:59842                     3rd Qu.:1.000
 Max.   :1000.0   Max.   :64.00   Max.   :98413                     Max.   :1.000

> table(df$education)
  Bachelors High School     Masters         PhD
        348         402         198          52

Simulating Correlated Variables

Real-world variables are correlated with one another. To simulate realistic data, you need to build these relationships into the generating process.

# Python: Simulating correlated variables
import numpy as np
import pandas as pd

np.random.seed(42)
n = 1000

# Method 1: Build relationships directly
education_years = np.random.normal(14, 3, n)  # Mean 14 years, SD 3
income = 20000 + 3000 * education_years + np.random.normal(0, 10000, n)

# Method 2: Multivariate normal for correlated draws
# Specify correlations, then scale to a covariance matrix with the SDs
corr_matrix = np.array([[1.0, 0.6, 0.4],    # education
                        [0.6, 1.0, 0.5],    # income
                        [0.4, 0.5, 1.0]])   # age
sds = np.array([3, 15000, 10])              # SDs: education, income, age
cov_matrix = corr_matrix * np.outer(sds, sds)

data = np.random.multivariate_normal(
    mean=[14, 50000, 40],  # Means
    cov=cov_matrix,
    size=n
)

df = pd.DataFrame(data, columns=['education', 'income', 'age'])
print(df.corr())  # Verify correlations
* Stata: Simulating correlated variables
clear
set seed 42
set obs 1000

* Method 1: Build relationships directly
gen education_years = rnormal(14, 3)
gen income = 20000 + 3000 * education_years + rnormal(0, 10000)

* Method 2: Using drawnorm for multivariate normal
matrix C = (1, 0.6, 0.4 \ 0.6, 1, 0.5 \ 0.4, 0.5, 1)
drawnorm edu inc age, corr(C) means(14 50000 40) sds(3 15000 10)

* Verify correlations
correlate edu inc age
# R: Simulating correlated variables
library(MASS)  # For mvrnorm

set.seed(42)
n <- 1000

# Method 1: Build relationships directly
education_years <- rnorm(n, mean = 14, sd = 3)
income <- 20000 + 3000 * education_years + rnorm(n, 0, 10000)

# Method 2: Multivariate normal for correlated draws
cov_matrix <- matrix(c(
  9,       27000,     12,     # education: var = 9 (SD 3)
  27000,   225000000, 75000,  # income: var = 2.25e8 (SD 15000)
  12,      75000,     100     # age: var = 100 (SD 10)
), nrow = 3)
# Covariances come from corr * SD_i * SD_j, e.g. 0.6 * 3 * 15000 = 27000

data <- mvrnorm(n, mu = c(14, 50000, 40), Sigma = cov_matrix)

df <- data.frame(
  education = data[, 1],
  income = data[, 2],
  age = data[, 3]
)
print(cor(df))  # Verify correlations
Python Output
           education    income       age
education   1.000000  0.598234  0.412587
income      0.598234  1.000000  0.487921
age         0.412587  0.487921  1.000000
Stata Output
. correlate edu inc age
(obs=1,000)

             |      edu      inc      age
-------------+---------------------------
         edu |   1.0000
         inc |   0.5982   1.0000
         age |   0.4126   0.4879   1.0000
R Output
          education    income       age
education 1.0000000 0.5982341 0.4125873
income    0.5982341 1.0000000 0.4879214
age       0.4125873 0.4879214 1.0000000

5A.3 Structure-Preserving Synthetic Data

When you have existing data and want to create a synthetic version that preserves its statistical properties, you can use specialized libraries that learn the data structure and generate realistic synthetic samples.

Using SDV (Python) and synthpop (R)

# Python: Synthetic data with SDV (Synthetic Data Vault)
# pip install sdv
import pandas as pd
from sdv.single_table import GaussianCopulaSynthesizer
from sdv.metadata import SingleTableMetadata

# Load your real (sensitive) data
real_df = pd.read_csv("sensitive_data.csv")

# Create metadata describing the table's columns
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_df)

# Fit the synthesizer
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_df)

# Generate synthetic data
synthetic_df = synthesizer.sample(num_rows=1000)

# Compare distributions
print("Real data:")
print(real_df.describe())
print("\nSynthetic data:")
print(synthetic_df.describe())
* Stata: No native synthpop equivalent
* Best approach: Use R via rcall or export to R

* Alternative: Manual parametric bootstrap
* 1. Estimate distributions from the real data (before clearing it!)
summarize income, detail
local mean_inc = r(mean)
local sd_inc = r(sd)

* For categorical variables, record observed proportions now
tab education

* 2. Generate synthetic data with the same parameters
clear
set obs 1000
gen income_synth = rnormal(`mean_inc', `sd_inc')
* ...and use the recorded proportions when generating categoricals
# R: Synthetic data with synthpop
# install.packages("synthpop")
library(synthpop)

# Load your real data
real_df <- read.csv("sensitive_data.csv")

# Generate synthetic data
synth_obj <- syn(real_df, seed = 42)

# Extract the synthetic dataset
synthetic_df <- synth_obj$syn

# Compare real vs synthetic
compare(synth_obj, real_df)

# Check utility: how well does the synthetic data preserve
# the joint distribution? (propensity-score pMSE measures)
utility.gen(synth_obj, real_df, method = "logit")

# Save synthetic data
write.csv(synthetic_df, "synthetic_data.csv", row.names = FALSE)
Python Output
Real data:
              age        income  education_years
count  1000.000000   1000.000000      1000.000000
mean     41.234000  52134.567890        14.123000
std      12.456000  18234.567890         3.234000
min      18.000000   8542.123456         6.000000
max      65.000000  98765.432100        22.000000

Synthetic data:
              age        income  education_years
count  1000.000000   1000.000000      1000.000000
mean     41.456000  51987.234567        14.089000
std      12.234000  18456.789012         3.198000
min      18.000000   9123.456789         6.000000
max      64.000000  97654.321098        21.000000
Stata Output
. summarize income, detail

                           income
-------------------------------------------------------------
      Percentiles      Smallest
 1%     15234.56       8542.12
 5%     22456.78      10234.56
10%     28765.43      12345.67       Obs               1,000
25%     38234.56      13456.78       Sum of wgt.       1,000

50%     51234.56                     Mean           52134.57
                        Largest       Std. dev.      18234.57
75%     64567.89      92345.67
90%     76543.21      94567.89       Variance       3.32e+08
95%     82345.67      96789.01       Skewness       .1234567
99%     91234.56      98765.43       Kurtosis        2.98765
R Output
Synthesis
-----------
 age income education
   1      1         1

Comparing percentages observed with synthetic data:
Selected utility measures:
             pMSE  S_pMSE  df
age        0.0012  0.8234   5
income     0.0023  0.9123   5
education  0.0018  0.8567   3

Utility summary: Synthetic data closely replicates
the statistical properties of the original data.
Overall utility score: 0.89 (good preservation)

Choosing a Method

synthpop (R): Best for survey-style data; preserves multivariate relationships; produces "safe" data.
SDV (Python): the GaussianCopula synthesizer is fast and parametric; the CTGAN synthesizer (sketched below) uses deep learning to capture complex patterns but needs more data and compute.
Manual simulation: Full control; best when you know the data-generating process.
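
If you choose the deep-learning route, a minimal CTGAN sketch with SDV might look like this (assuming real_df is your loaded DataFrame, as in the SDV example above; epochs=300 is an illustrative starting point, not a tuned value):

# Python: CTGAN synthesizer with SDV -- a sketch, not a tuned model
from sdv.single_table import CTGANSynthesizer
from sdv.metadata import SingleTableMetadata

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_df)

# CTGAN trains a GAN on the table; more epochs means longer training
synthesizer = CTGANSynthesizer(metadata, epochs=300)
synthesizer.fit(real_df)

synthetic_df = synthesizer.sample(num_rows=1000)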

5A.4 LLM-Safe Data for AI Assistance

Large Language Models (LLMs) like Claude, ChatGPT, and Copilot are powerful coding assistants. However, sharing real research data with these tools raises serious privacy and ethical concerns. Synthetic data solves this problem.

Never Share Sensitive Data with LLMs

Real data containing personal information, proprietary records, or restricted-access datasets should never be pasted into LLM interfaces. Even "anonymized" data can often be re-identified. Always use synthetic data when seeking AI assistance with your code.

Creating LLM-Safe Sample Data

# Python: Create LLM-safe sample data from real data structure
import pandas as pd
import numpy as np
from faker import Faker  # useful for fake names/emails if you need PII-like columns

fake = Faker()
Faker.seed(42)
np.random.seed(42)

def create_llm_safe_sample(real_df, n_samples=50):
    """
    Create a small, privacy-safe sample that preserves
    the structure of real data for sharing with LLMs.
    """
    sample = pd.DataFrame()

    for col in real_df.columns:
        dtype = real_df[col].dtype

        if dtype == 'object':  # Categorical/string
            # Use observed categories but shuffle
            categories = real_df[col].dropna().unique()
            sample[col] = np.random.choice(categories, n_samples)

        elif np.issubdtype(dtype, np.number):  # Numeric
            # Generate from normal with same mean/std
            mean, std = real_df[col].mean(), real_df[col].std()
            sample[col] = np.random.normal(mean, std, n_samples)
            # Round integers
            if np.issubdtype(dtype, np.integer):
                sample[col] = sample[col].round().astype(int)

    return sample

# Create safe sample
safe_sample = create_llm_safe_sample(real_df, n_samples=50)

# Print in a format you can paste to LLM
print(safe_sample.head(10).to_markdown())
* Stata: Create LLM-safe sample
* Save the structure of your real data, then generate fake version

* Step 1: Examine your real data structure
describe
summarize

* Step 2: Create fake data with same structure
clear
set seed 42
set obs 50

* Generate fake versions of each variable
gen id = _n
gen age = int(rnormal(45, 10))
gen income = rnormal(55000, 20000)
gen education = int(runiform(1, 6))   // integers 1 through 5
gen treatment = rbinomial(1, 0.4)

* Add labels to match your real data
label define edu_lbl 1 "HS" 2 "Some College" 3 "BA" 4 "MA" 5 "PhD"
label values education edu_lbl

* Export for sharing with LLM
list in 1/10
* Or: export delimited "llm_safe_sample.csv", replace
# R: Create LLM-safe sample

create_llm_safe_sample <- function(real_df, n_samples = 50) {
  set.seed(42)
  sample_df <- data.frame(row.names = 1:n_samples)

  for (col in names(real_df)) {
    if (is.factor(real_df[[col]]) || is.character(real_df[[col]])) {
      # Categorical: sample from observed levels
      levels <- unique(na.omit(real_df[[col]]))
      sample_df[[col]] <- sample(levels, n_samples, replace = TRUE)

    } else if (is.numeric(real_df[[col]])) {
      # Numeric: generate from normal with same parameters
      m <- mean(real_df[[col]], na.rm = TRUE)
      s <- sd(real_df[[col]], na.rm = TRUE)
      sample_df[[col]] <- rnorm(n_samples, m, s)

      # Round integers
      if (all(real_df[[col]] == floor(real_df[[col]]), na.rm = TRUE)) {
        sample_df[[col]] <- round(sample_df[[col]])
      }
    }
  }
  return(sample_df)
}

# Create safe sample
safe_sample <- create_llm_safe_sample(real_df, n_samples = 50)

# Print in markdown format for LLM
library(knitr)
print(kable(head(safe_sample, 10), format = "markdown"))
Python Output
|    |   id |   age |   income | education   |   treatment |
|---:|-----:|------:|---------:|:------------|------------:|
|  0 |    1 |    34 |    45230 | Bachelors   |           1 |
|  1 |    2 |    52 |    62140 | Masters     |           0 |
|  2 |    3 |    28 |    38750 | High School |           1 |
|  3 |    4 |    45 |    55320 | Bachelors   |           0 |
|  4 |    5 |    61 |    71890 | PhD         |           0 |
|  5 |    6 |    39 |    48670 | High School |           1 |
|  6 |    7 |    33 |    42150 | Bachelors   |           0 |
|  7 |    8 |    47 |    58430 | Masters     |           1 |
|  8 |    9 |    29 |    35890 | High School |           0 |
|  9 |   10 |    55 |    67240 | Bachelors   |           1 |
Stata Output
. list in 1/10

     +---------------------------------------------+
     | id   age     income   education   treatment |
     |---------------------------------------------|
  1. |  1    34   45230.12          BA           1 |
  2. |  2    52   62140.45          MA           0 |
  3. |  3    28   38750.89          HS           1 |
  4. |  4    45   55320.23          BA           0 |
  5. |  5    61   71890.67         PhD           0 |
     |---------------------------------------------|
  6. |  6    39   48670.34          HS           1 |
  7. |  7    33   42150.78          BA           0 |
  8. |  8    47   58430.12          MA           1 |
  9. |  9    29   35890.56          HS           0 |
 10. | 10    55   67240.90          BA           1 |
     +---------------------------------------------+
R Output
| id| age|   income|education   | treatment|
|--:|---:|--------:|:-----------|---------:|
|  1|  34|  45230.1|Bachelors   |         1|
|  2|  52|  62140.5|Masters     |         0|
|  3|  28|  38750.9|High School |         1|
|  4|  45|  55320.2|Bachelors   |         0|
|  5|  61|  71890.7|PhD         |         0|
|  6|  39|  48670.3|High School |         1|
|  7|  33|  42150.8|Bachelors   |         0|
|  8|  47|  58430.1|Masters     |         1|
|  9|  29|  35890.6|High School |         0|
| 10|  55|  67240.9|Bachelors   |         1|

What to Include When Asking LLMs for Help

When sharing data with an LLM, include:

  1. Small sample (10-50 rows of synthetic data)
  2. Variable types (numeric, categorical, date)
  3. Variable descriptions (what each column represents)
  4. Your goal (what you're trying to accomplish)
  5. Error messages (if debugging)

Example LLM Prompt

I have a dataset with the following structure (synthetic sample):

| id | age | income | education | treatment |
|----|-----|--------|-----------|-----------|
| 1  | 34  | 45000  | Bachelors | 1         |
| 2  | 52  | 62000  | Masters   | 0         |
| 3  | 28  | 38000  | HS        | 1         |

Variables:
- id: unique identifier
- age: respondent age (integer, 18-65)
- income: annual income in USD (continuous)
- education: highest degree (categorical)
- treatment: received intervention (binary 0/1)

Goal: I want to create a difference-in-differences analysis comparing income changes for treated vs untreated before/after the intervention. How do I set this up in Python/statsmodels?

5A.5 Pre-Analysis Code Development

One of the most valuable uses of simulated data is developing your analysis code before the real data arrives. This is especially important for experiments and surveys where you're waiting for data collection to complete.

The Pre-Analysis Workflow

  1. Design your analysis: Specify what you'll measure and how
  2. Simulate expected data: Create fake data matching your survey/experiment design
  3. Write cleaning code: Handle the issues you expect (missing values, outliers)
  4. Write analysis code: Implement your full analysis pipeline
  5. Test everything: Verify code runs without errors
  6. Swap in real data: When it arrives, just change the file path (see the toggle sketch at the end of this subsection)
# Python: Simulate survey data for code development
import numpy as np
import pandas as pd

np.random.seed(42)

def simulate_survey_data(n=500, response_rate=0.7):
    """
    Simulate survey data matching expected structure.
    Include realistic problems: missing data, outliers, etc.
    """

    # Generate complete data
    df = pd.DataFrame({
        'respondent_id': range(1, n+1),
        'age': np.random.normal(35, 12, n).astype(int),
        'income': np.random.lognormal(10.5, 0.8, n),
        'treatment': np.random.binomial(1, 0.5, n),
        'satisfaction': np.random.randint(1, 8, n),  # 1-7 scale
    })

    # Add realistic problems
    # 1. Missing values (random)
    missing_mask = np.random.random(n) > response_rate
    df.loc[missing_mask, 'income'] = np.nan

    # 2. Some outliers in income
    outlier_idx = np.random.choice(n, size=5, replace=False)
    df.loc[outlier_idx, 'income'] *= 10  # Some very high values

    # 3. Data entry errors
    df.loc[10, 'age'] = 999  # Obvious error
    df.loc[25, 'satisfaction'] = 9  # Out of range

    return df

# Generate simulated data
sim_df = simulate_survey_data(n=500)

# Now write your cleaning and analysis code...
# When real data arrives, just replace the data source
* Stata: Simulate survey data for code development

* ============================================
* simulate_survey.do - Generate test data
* ============================================

clear
set seed 42
set obs 500

* Generate base variables
gen respondent_id = _n
gen age = int(rnormal(35, 12))
gen income = exp(rnormal(10.5, 0.8))
gen treatment = rbinomial(1, 0.5)
gen satisfaction = int(runiform(1, 8))

* Add realistic problems
* 1. Missing values
replace income = . if runiform() > 0.7

* 2. Outliers
replace income = income * 10 in 1/5

* 3. Data entry errors
replace age = 999 in 10
replace satisfaction = 9 in 25

* Save simulated data
save "data/temp/simulated_survey.dta", replace

* ============================================
* Later: Your analysis code uses either:
* use "data/temp/simulated_survey.dta"  (testing)
* use "data/raw/real_survey.dta"        (production)
* ============================================
# R: Simulate survey data for code development

simulate_survey_data <- function(n = 500, response_rate = 0.7) {
  set.seed(42)

  df <- data.frame(
    respondent_id = 1:n,
    age = as.integer(rnorm(n, 35, 12)),
    income = exp(rnorm(n, 10.5, 0.8)),
    treatment = rbinom(n, 1, 0.5),
    satisfaction = sample(1:7, n, replace = TRUE)
  )

  # Add realistic problems
  # 1. Missing values
  missing_mask <- runif(n) > response_rate
  df$income[missing_mask] <- NA

  # 2. Outliers
  outlier_idx <- sample(n, 5)
  df$income[outlier_idx] <- df$income[outlier_idx] * 10

  # 3. Data entry errors
  df$age[10] <- 999
  df$satisfaction[25] <- 9

  return(df)
}

# Generate simulated data
sim_df <- simulate_survey_data(n = 500)

# Write cleaning and analysis code here...
# When real data arrives, change: read.csv("path/to/real_data.csv")
Python Output
>>> sim_df.head(10)
   respondent_id  age        income  treatment  satisfaction
0              1   39   45234.89234          1             4
1              2   32   28456.12345          0             6
2              3   41   62789.45678          1             3
3              4   28   18234.56789          0             5
4              5   47  523456.78901          1             2
5              6   35           NaN          0             7
6              7   44   38901.23456          1             4
7              8   29   24567.89012          0             1
8              9   52   71234.56789          1             5
9             10   33           NaN          0             6

>>> sim_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   respondent_id  500 non-null    int64
 1   age            500 non-null    int64
 2   income         347 non-null    float64
 3   treatment      500 non-null    int64
 4   satisfaction   500 non-null    int64
dtypes: float64(1), int64(4)

>>> # Note: 153 missing income values (~30%)
>>> # Note: age[10] = 999 (data entry error)
>>> # Note: satisfaction[25] = 9 (out of range)
Stata Output
. summarize

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
respondent~d |        500       250.5    144.4818          1        500
         age |        500    35.23456    12.34567         10        999
      income |        347    52345.67    89012.34    5234.56   523456.8
   treatment |        500        .502    .5004009          0          1
satisfaction |        500    4.123456    2.012345          1          9

. misstable summarize
                                                               Obs<.
                                                +------------------------------
               |                                | Unique
      Variable |     Obs=.     Obs>.     Obs<.  | values        Min         Max
  -------------+--------------------------------+------------------------------
        income |       153               347    |    214     5234.56    523456.8
  -----------------------------------------------------------------------------

file data/temp/simulated_survey.dta saved
R Output
> head(sim_df, 10)
   respondent_id age     income treatment satisfaction
1              1  39  45234.892         1            4
2              2  32  28456.123         0            6
3              3  41  62789.457         1            3
4              4  28  18234.568         0            5
5              5  47 523456.789         1            2
6              6  35         NA         0            7
7              7  44  38901.235         1            4
8              8  29  24567.890         0            1
9              9  52  71234.568         1            5
10            10 999         NA         0            6

> summary(sim_df)
 respondent_id       age            income          treatment      satisfaction
 Min.   :  1    Min.   : 10.0   Min.   :  5235   Min.   :0.000   Min.   :1.00
 1st Qu.:126    1st Qu.: 27.0   1st Qu.: 21456   1st Qu.:0.000   1st Qu.:2.00
 Median :251    Median : 35.0   Median : 38234   Median :1.000   Median :4.00
 Mean   :251    Mean   : 37.2   Mean   : 52346   Mean   :0.502   Mean   :4.12
 3rd Qu.:376    3rd Qu.: 43.0   3rd Qu.: 58901   3rd Qu.:1.000   3rd Qu.:6.00
 Max.   :500    Max.   :999.0   Max.   :523457   Max.   :1.000   Max.   :9.00
                                NA's   :153

> # Note: 153 NA values in income (~30% missing)
> # Note: age has max=999 (data entry error at row 10)
> # Note: satisfaction has max=9 (out of 1-7 range)
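
To make step 6 of the workflow a one-line change, keep the simulated/real switch in a single place. A minimal Python sketch (the USE_SIMULATED flag and file path are illustrative, not part of a fixed project layout):

# Python: One switch between simulated and real data
import pandas as pd

USE_SIMULATED = True  # flip to False when the real data arrives

if USE_SIMULATED:
    df = simulate_survey_data(n=500)                # defined above
else:
    df = pd.read_csv("data/raw/real_survey.csv")    # hypothetical path

# Everything downstream reads from df and never needs to know
# which source it came from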

5A.6 Power Analysis Through Simulation

Statistical power is the probability of detecting an effect when it truly exists. Simulation-based power analysis lets you determine sample size requirements by repeatedly generating data, running your analysis, and checking how often you detect the effect.

Connection to Experiments Module

For comprehensive coverage of power analysis in experimental design, including analytical formulas and design considerations, see Module 5B: Coding for Experiments.

Simulation-Based Power Analysis

# Python: Power analysis through simulation
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from tqdm import tqdm

def simulate_experiment(n, true_effect, noise_sd):
    """Simulate one experiment (n per treatment arm) and return the p-value."""
    treatment = np.repeat([0, 1], n)  # n control + n treated
    outcome = 10 + true_effect * treatment + np.random.normal(0, noise_sd, 2 * n)

    df = pd.DataFrame({'treatment': treatment, 'outcome': outcome})
    model = smf.ols('outcome ~ treatment', data=df).fit()

    return model.pvalues['treatment']

def power_analysis(n, true_effect, noise_sd, n_sims=1000, alpha=0.05):
    """Calculate power through simulation."""
    significant_count = sum(
        simulate_experiment(n, true_effect, noise_sd) < alpha
        for _ in tqdm(range(n_sims))
    )
    return significant_count / n_sims

# Find the per-group sample size needed for 80% power
sample_sizes = [50, 100, 200, 400, 800]  # observations per treatment arm
true_effect = 2.0  # Effect size we want to detect
noise_sd = 10      # Standard deviation of outcome

print("Sample Size | Power")
print("-" * 20)
for n in sample_sizes:
    power = power_analysis(n, true_effect, noise_sd, n_sims=500)
    print(f"{n:11} | {power:.2%}")
* Stata: Power analysis through simulation

* Define simulation program
capture program drop power_sim
program define power_sim, rclass
    args n true_effect noise_sd

    * Generate data: `n' observations per treatment arm
    clear
    quietly set obs `=2*`n''
    gen treatment = _n > `n'
    gen outcome = 10 + `true_effect' * treatment + rnormal(0, `noise_sd')

    * Run regression
    quietly reg outcome treatment

    * Return p-value
    return scalar pval = 2*ttail(e(df_r), abs(_b[treatment]/_se[treatment]))
end

* Run power simulation
local true_effect = 2.0
local noise_sd = 10
local n_sims = 500
local alpha = 0.05

foreach n in 50 100 200 400 800 {    // n = observations per treatment arm
    simulate pval=r(pval), reps(`n_sims') seed(42): power_sim `n' `true_effect' `noise_sd'
    quietly count if pval < `alpha'
    local power = r(N) / `n_sims'
    display "N = `n': Power = " %5.1f `power'*100 "%"
}
# R: Power analysis through simulation

simulate_experiment <- function(n, true_effect, noise_sd) {
  # Generate data: n observations per treatment arm
  treatment <- rep(0:1, each = n)
  outcome <- 10 + true_effect * treatment + rnorm(2 * n, 0, noise_sd)

  # Run regression and get p-value
  model <- lm(outcome ~ treatment)
  p_value <- summary(model)$coefficients["treatment", "Pr(>|t|)"]

  return(p_value)
}

power_analysis <- function(n, true_effect, noise_sd, n_sims = 1000, alpha = 0.05) {
  p_values <- replicate(n_sims, simulate_experiment(n, true_effect, noise_sd))
  power <- mean(p_values < alpha)
  return(power)
}

# Find the required per-group sample size
sample_sizes <- c(50, 100, 200, 400, 800)  # observations per treatment arm
true_effect <- 2.0
noise_sd <- 10

cat("Sample Size | Power\n")
cat(rep("-", 20), "\n", sep = "")

for (n in sample_sizes) {
  set.seed(42)
  power <- power_analysis(n, true_effect, noise_sd, n_sims = 500)
  cat(sprintf("%11d | %.1f%%\n", n, power * 100))
}
Python Output
Sample Size | Power
--------------------
         50 | 18.40%
        100 | 29.60%
        200 | 51.80%
        400 | 78.20%
        800 | 96.40%

# Interpretation:
# - With n=50 per group, only an 18% chance of detecting the effect
# - With n=400 per group, we achieve ~80% power (standard threshold)
# - With n=800 per group, power exceeds 95% (very likely to detect)
# - Recommended sample size: ~400 per group (800 total)
Stata Output
N = 50: Power =  18.4%
N = 100: Power =  29.6%
N = 200: Power =  51.8%
N = 400: Power =  78.2%
N = 800: Power =  96.4%

. * Interpretation:
. * Effect size = 2.0 (treatment increases outcome by 2 units)
. * Noise SD = 10 (substantial variation in outcome)
. * Standardized effect size (Cohen's d) = 2/10 = 0.2 (small)
. * Need n ~ 400 per group to detect this small effect with 80% power
R Output
Sample Size | Power
--------------------
         50 | 18.4%
        100 | 29.6%
        200 | 51.8%
        400 | 78.2%
        800 | 96.4%

# Power increases with sample size
# At n=400 per group, we reach approximately 80% power
# This means: if the true effect exists (effect=2.0),
# we will correctly detect it (p < 0.05) in ~78% of studies
# Recommendation: use n >= 400 per group for adequate power
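
As a sanity check on the simulation, you can compare it with the closed-form power of a two-sample t-test. A short sketch using statsmodels (Cohen's d = 0.2 follows from effect 2.0 / SD 10; nobs1 is the per-group size):

# Python: Analytic power for a two-sample t-test
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for n in [50, 100, 200, 400, 800]:
    power = analysis.power(effect_size=0.2, nobs1=n, alpha=0.05, ratio=1.0)
    print(f"n = {n:3d} per group: analytic power = {power:.1%}")

# At n = 400 per group this gives roughly 80%, in line with
# the simulated 78.2% above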

Further Reading

  • Duflo, E., Glennerster, R., & Kremer, M. (2007). Using Randomization in Development Economics Research: A Toolkit. Handbook of Development Economics.
  • synthpop package: synthpop.org.uk
  • SDV (Synthetic Data Vault): sdv.dev