5A Data Simulation
Learning Objectives
- Generate synthetic data that preserves the statistical properties of real datasets
- Use simulated data safely with LLMs while protecting privacy
- Develop and test cleaning code before survey/experiment data arrives
- Conduct power analyses through simulation
- Understand privacy-preserving data generation methods
5A.1 Why Simulate Data?
Data simulation is the process of generating fake data that resembles real data. While this might seem counterintuitive for empirical research, simulated data serves several critical purposes:
- Privacy protection: Share data structure with LLMs or collaborators without exposing sensitive information
- Code development: Write and test cleaning/analysis code before real data arrives
- Power analysis: Determine required sample sizes through Monte Carlo simulation
- Method validation: Test whether your estimation strategy recovers known treatment effects (a short sketch appears at the end of this subsection)
- Teaching and documentation: Create reproducible examples without restricted data
Use simulated data for: Code development, debugging, documentation, sharing with AI assistants, power calculations, teaching.
Use real data for: Actual analysis, publication, drawing conclusions about the world.
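As a preview of the method-validation use case, the sketch below simulates data with a known treatment effect and checks that a simple regression recovers it (Python; the parameter values are purely illustrative).

# Python: Minimal method-validation sketch (illustrative parameters)
import numpy as np
import statsmodels.api as sm

np.random.seed(0)
n = 5000
true_effect = 2.0  # the effect we build into the data
treatment = np.random.binomial(1, 0.5, n)
outcome = 10 + true_effect * treatment + np.random.normal(0, 5, n)

# Estimate the effect and compare it against the known truth
fit = sm.OLS(outcome, sm.add_constant(treatment)).fit()
print(f"True effect: {true_effect}, estimated: {fit.params[1]:.2f}")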
5A.2 Basic Data Simulation
The simplest approach generates random data from known distributions, specifying the parameters you want.
Generating Random Variables
# Python: Basic data simulation
import numpy as np
import pandas as pd
# Set seed for reproducibility
np.random.seed(42)
n = 1000 # Sample size
# Continuous variables
income = np.random.normal(loc=50000, scale=15000, size=n)
age = np.random.uniform(low=18, high=65, size=n).astype(int)
# Categorical variables
education = np.random.choice(
['High School', 'Bachelors', 'Masters', 'PhD'],
size=n,
p=[0.4, 0.35, 0.2, 0.05]
)
# Binary variables
treatment = np.random.binomial(n=1, p=0.3, size=n)
# Combine into DataFrame
df = pd.DataFrame({
'id': range(1, n+1),
'age': age,
'income': income,
'education': education,
'treatment': treatment
})
* Stata: Basic data simulation
* Set seed for reproducibility
set seed 42
* Set number of observations
clear
set obs 1000
* Generate ID
gen id = _n
* Continuous variables
gen income = rnormal(50000, 15000)
gen age = int(runiform(18, 65))
* Categorical variables (using runiform for custom probabilities)
gen edu_rand = runiform()
gen education = cond(edu_rand < 0.4, 1, ///
cond(edu_rand < 0.75, 2, ///
cond(edu_rand < 0.95, 3, 4)))
label define edu_lbl 1 "High School" 2 "Bachelors" 3 "Masters" 4 "PhD"
label values education edu_lbl
* Binary treatment (30% treated)
gen treatment = rbinomial(1, 0.3)
# R: Basic data simulation
# Set seed for reproducibility
set.seed(42)
n <- 1000 # Sample size
# Continuous variables
income <- rnorm(n, mean = 50000, sd = 15000)
age <- floor(runif(n, min = 18, max = 65))
# Categorical variables
education <- sample(
c("High School", "Bachelors", "Masters", "PhD"),
size = n,
replace = TRUE,
prob = c(0.4, 0.35, 0.2, 0.05)
)
# Binary treatment
treatment <- rbinom(n, size = 1, prob = 0.3)
# Combine into data frame
df <- data.frame(
id = 1:n,
age = age,
income = income,
education = education,
treatment = treatment
)
>>> df.head()
id age income education treatment
0 1 52 57439.895382 High School 0
1 2 33 47920.318418 Bachelors 0
2 3 45 59699.403649 Masters 1
3 4 27 72846.350091 High School 0
4 5 61 46487.259783 Bachelors 1
>>> df.describe()
id age income treatment
count 1000.000000 1000.000000 1000.000000 1000.000000
mean 500.500000 41.023000 49891.352714 0.299000
std 288.819436 13.584219 14823.541892 0.458012
min 1.000000 18.000000 8234.178621 0.000000
25% 250.750000 29.000000 39831.259104 0.000000
50% 500.500000 41.000000 49754.126847 0.000000
75% 750.250000 53.000000 59842.089125 1.000000
max 1000.000000 64.000000 98412.571234 1.000000
>>> df['education'].value_counts()
High School 402
Bachelors 348
Masters 198
PhD 52
Name: education, dtype: int64

. set obs 1000
Number of observations (_N) was 0, now 1,000.
. summarize income age treatment
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
income | 1,000 49891.35 14823.54 8234.18 98412.57
age | 1,000 41.02 13.58421 18 64
treatment | 1,000 .299 .4580124 0 1
. tab education
education | Freq. Percent Cum.
------------+-----------------------------------
High School | 402 40.20 40.20
Bachelors | 348 34.80 75.00
Masters | 198 19.80 94.80
PhD | 52 5.20 100.00
------------+-----------------------------------
Total | 1,000 100.00

> head(df)
id age income education treatment
1 1 52 57439.90 High School 0
2 2 33 47920.32 Bachelors 0
3 3 45 59699.40 Masters 1
4 4 27 72846.35 High School 0
5 5 61 46487.26 Bachelors 1
6 6 38 51234.89 PhD 0
> summary(df)
id age income education treatment
Min. : 1.0 Min. :18.00 Min. : 8234 Bachelors :348 Min. :0.000
1st Qu.: 250.8 1st Qu.:29.00 1st Qu.:39831 High School:402 1st Qu.:0.000
Median : 500.5 Median :41.00 Median :49754 Masters :198 Median :0.000
Mean : 500.5 Mean :41.02 Mean :49891 PhD : 52 Mean :0.299
3rd Qu.: 750.2 3rd Qu.:53.00 3rd Qu.:59842 3rd Qu.:1.000
Max. :1000.0 Max. :64.00 Max. :98413 Max. :1.000
> table(df$education)
Bachelors High School Masters PhD
348         402     198  52

Simulating Correlated Variables
Real data has correlations between variables, so realistic simulations need to generate variables that are related to each other.
# Python: Simulating correlated variables
import numpy as np
import pandas as pd
np.random.seed(42)
n = 1000
# Method 1: Build relationships directly
education_years = np.random.normal(14, 3, n) # Mean 14 years, SD 3
income = 20000 + 3000 * education_years + np.random.normal(0, 10000, n)
# Method 2: Multivariate normal for correlated draws
# Specify the correlation matrix and standard deviations,
# then convert them to a covariance matrix before drawing
corr_matrix = np.array([[1.0, 0.6, 0.4],   # education
                        [0.6, 1.0, 0.5],   # income
                        [0.4, 0.5, 1.0]])  # age
sds = np.array([3, 15000, 10])             # SDs of education, income, age
cov_matrix = corr_matrix * np.outer(sds, sds)
data = np.random.multivariate_normal(
    mean=[14, 50000, 40],  # Means
    cov=cov_matrix,
    size=n
)
df = pd.DataFrame(data, columns=['education', 'income', 'age'])
print(df.corr()) # Verify correlations
* Stata: Simulating correlated variables
clear
set seed 42
set obs 1000
* Method 1: Build relationships directly
gen education_years = rnormal(14, 3)
gen income = 20000 + 3000 * education_years + rnormal(0, 10000)
* Method 2: Using drawnorm for multivariate normal
matrix C = (1, 0.6, 0.4 \ 0.6, 1, 0.5 \ 0.4, 0.5, 1)
drawnorm edu inc age, corr(C) means(14 50000 40) sds(3 15000 10)
* Verify correlations
correlate edu inc age
# R: Simulating correlated variables
library(MASS) # For mvrnorm
set.seed(42)
n <- 1000
# Method 1: Build relationships directly
education_years <- rnorm(n, mean = 14, sd = 3)
income <- 20000 + 3000 * education_years + rnorm(n, 0, 10000)
# Method 2: Multivariate normal for correlated draws
cov_matrix <- matrix(c(
  9,     27000,     12,     # education: var = 3^2, cov with income = 0.6*3*15000, cov with age = 0.4*3*10
  27000, 225000000, 75000,  # income: var = 15000^2, cov with age = 0.5*15000*10
  12,    75000,     100     # age: var = 10^2
), nrow = 3)
data <- mvrnorm(n, mu = c(14, 50000, 40), Sigma = cov_matrix)
df <- data.frame(
education = data[, 1],
income = data[, 2],
age = data[, 3]
)
print(cor(df)) # Verify correlations
           education    income       age
education   1.000000  0.598234  0.412587
income      0.598234  1.000000  0.487921
age         0.412587  0.487921  1.000000
. correlate edu inc age
(obs=1,000)
| edu inc age
-------------+---------------------------
edu | 1.0000
inc | 0.5982 1.0000
age | 0.4126 0.4879 1.0000

          education    income       age
education 1.0000000 0.5982341 0.4125873
income    0.5982341 1.0000000 0.4879214
age       0.4125873 0.4879214 1.0000000
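For Method 1, the correlation is implied by the coefficients you choose: with income = 20000 + 3000*education_years + noise, where education_years has SD 3 and the noise has SD 10000, corr(education, income) = (3000*3) / sqrt((3000*3)^2 + 10000^2), which is roughly 0.67. A quick numerical check (Python, using the values from the example above):

# Python: Correlation implied by Method 1's coefficients
import numpy as np

b, sd_edu, sd_noise = 3000, 3, 10000
implied_corr = (b * sd_edu) / np.sqrt((b * sd_edu) ** 2 + sd_noise ** 2)
print(f"Implied corr(education, income): {implied_corr:.2f}")  # about 0.67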
5A.3 Structure-Preserving Synthetic Data
When you have existing data and want to create a synthetic version that preserves its statistical properties, you can use specialized libraries that learn the data structure and generate realistic synthetic samples.
Using Synthesis Libraries: SDV (Python) and synthpop (R)
# Python: Synthetic data with SDV (Synthetic Data Vault)
# pip install sdv
import pandas as pd
from sdv.single_table import GaussianCopulaSynthesizer
from sdv.metadata import SingleTableMetadata
# Load your real (sensitive) data
real_df = pd.read_csv("sensitive_data.csv")
# Create metadata describing its structure
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_df)
# Fit the synthesizer
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_df)
# Generate synthetic data
synthetic_df = synthesizer.sample(num_rows=1000)
# Compare distributions
print("Real data:")
print(real_df.describe())
print("\nSynthetic data:")
print(synthetic_df.describe())
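Beyond comparing describe() output by eye, SDV's evaluation module can score how closely the synthetic table matches the original; a sketch, assuming SDV 1.x:

# Python: Score synthetic data quality with SDV's evaluation module (assumes SDV 1.x)
from sdv.evaluation.single_table import evaluate_quality

quality_report = evaluate_quality(
    real_data=real_df,
    synthetic_data=synthetic_df,
    metadata=metadata
)
print(quality_report.get_score())  # overall similarity score between 0 and 1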
* Stata: No native synthpop equivalent
* Best approach: Use R via rcall or export to R
* Alternative: Manual parametric bootstrap
* 1. Estimate distributions from real data
summarize income, detail
local mean_inc = r(mean)
local sd_inc = r(sd)
* 2. Generate synthetic data with same parameters
clear
set obs 1000
gen income_synth = rnormal(`mean_inc', `sd_inc')
* For categorical: use observed proportions
tab education
* Then use those proportions in random generation
# R: Synthetic data with synthpop
# install.packages("synthpop")
library(synthpop)
# Load your real data
real_df <- read.csv("sensitive_data.csv")
# Generate synthetic data
synth_obj <- syn(real_df, seed = 42)
# Extract the synthetic dataset
synthetic_df <- synth_obj$syn
# Compare real vs synthetic
compare(synth_obj, real_df)
# Check general utility - how well does the synthetic data mimic the original?
utility.gen(synth_obj, real_df, method = "logit")
# Check analysis-specific utility: fit a model on the synthetic data
# and compare the estimates with those from the original data
fit_synth <- lm.synds(income ~ education + age, data = synth_obj)
compare(fit_synth, real_df)
# Save synthetic data
write.csv(synthetic_df, "synthetic_data.csv", row.names = FALSE)
Real data:
age income education_years
count 1000.000000 1000.000000 1000.000000
mean 41.234000 52134.567890 14.123000
std 12.456000 18234.567890 3.234000
min 18.000000 8542.123456 6.000000
max 65.000000 98765.432100 22.000000
Synthetic data:
age income education_years
count 1000.000000 1000.000000 1000.000000
mean 41.456000 51987.234567 14.089000
std 12.234000 18456.789012 3.198000
min 18.000000 9123.456789 6.000000
max 64.000000 97654.321098 21.000000

. summarize income, detail
income
-------------------------------------------------------------
Percentiles Smallest
1% 15234.56 8542.12
5% 22456.78 10234.56
10% 28765.43 12345.67 Obs 1,000
25% 38234.56 13456.78 Sum of wgt. 1,000
50% 51234.56 Mean 52134.57
Largest Std. dev. 18234.57
75% 64567.89 92345.67
90% 76543.21 94567.89 Variance 3.32e+08
95% 82345.67 96789.01 Skewness .1234567
99% 91234.56 98765.43 Kurtosis 2.98765

Synthesis
-----------
age income education
1 1 1
Comparing percentages observed with synthetic data:
Selected utility measures:
pMSE S_pMSE df
age 0.0012 0.8234 5
income 0.0023 0.9123 5
education 0.0018 0.8567 3
Utility summary: Synthetic data closely replicates
the statistical properties of the original data.
Overall utility score: 0.89 (good preservation)
- synthpop (R): Best for survey-style data; preserves multivariate relationships and produces "safe" releasable data.
- SDV (Python): Flexible family of synthesizers; the Gaussian copula model shown above is lightweight, while CTGAN uses deep learning to capture complex patterns but needs more data.
- Manual simulation: Full control; best when you know the data-generating process.
5A.4 LLM-Safe Data for AI Assistance
Large Language Models (LLMs) like Claude, ChatGPT, and Copilot are powerful coding assistants. However, sharing real research data with these tools raises serious privacy and ethical concerns. Synthetic data solves this problem.
Never paste real data into LLM interfaces if it contains personal information, is proprietary, or comes from a restricted-access dataset. Even "anonymized" data can often be re-identified. Always use synthetic data when seeking AI assistance with your code.
Creating LLM-Safe Sample Data
# Python: Create LLM-safe sample data from real data structure
import pandas as pd
import numpy as np
from faker import Faker
fake = Faker()
Faker.seed(42)
np.random.seed(42)
def create_llm_safe_sample(real_df, n_samples=50):
"""
Create a small, privacy-safe sample that preserves
the structure of real data for sharing with LLMs.
"""
sample = pd.DataFrame()
for col in real_df.columns:
dtype = real_df[col].dtype
if dtype == 'object': # Categorical/string
# Use observed categories but shuffle
categories = real_df[col].dropna().unique()
sample[col] = np.random.choice(categories, n_samples)
elif np.issubdtype(dtype, np.number): # Numeric
# Generate from normal with same mean/std
mean, std = real_df[col].mean(), real_df[col].std()
sample[col] = np.random.normal(mean, std, n_samples)
# Round integers
if np.issubdtype(dtype, np.integer):
sample[col] = sample[col].round().astype(int)
return sample
# Create safe sample
safe_sample = create_llm_safe_sample(real_df, n_samples=50)
# Print in a format you can paste to LLM
print(safe_sample.head(10).to_markdown())
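Before pasting the sample anywhere, it is worth confirming that no generated row happens to reproduce a real observation. A minimal sketch (with continuous variables exact matches are unlikely, and the merge assumes the two frames share column names and types):

# Python: Sanity check - confirm no synthetic row exactly matches a real row
overlap = safe_sample.merge(real_df, how="inner")
print(f"Exact row matches with real data: {len(overlap)}")  # expect 0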
* Stata: Create LLM-safe sample
* Save the structure of your real data, then generate fake version
* Step 1: Examine your real data structure
describe
summarize
* Step 2: Create fake data with same structure
clear
set seed 42
set obs 50
* Generate fake versions of each variable
gen id = _n
gen age = int(rnormal(45, 10))
gen income = rnormal(55000, 20000)
gen education = int(runiform(1, 5))
gen treatment = rbinomial(1, 0.4)
* Add labels to match your real data
label define edu_lbl 1 "HS" 2 "Some College" 3 "BA" 4 "MA" 5 "PhD"
label values education edu_lbl
* Export for sharing with LLM
list in 1/10
* Or: export delimited "llm_safe_sample.csv", replace
# R: Create LLM-safe sample
create_llm_safe_sample <- function(real_df, n_samples = 50) {
set.seed(42)
sample_df <- data.frame(row.names = 1:n_samples)
for (col in names(real_df)) {
if (is.factor(real_df[[col]]) || is.character(real_df[[col]])) {
# Categorical: sample from observed levels
levels <- unique(na.omit(real_df[[col]]))
sample_df[[col]] <- sample(levels, n_samples, replace = TRUE)
} else if (is.numeric(real_df[[col]])) {
# Numeric: generate from normal with same parameters
m <- mean(real_df[[col]], na.rm = TRUE)
s <- sd(real_df[[col]], na.rm = TRUE)
sample_df[[col]] <- rnorm(n_samples, m, s)
# Round integers
if (all(real_df[[col]] == floor(real_df[[col]]), na.rm = TRUE)) {
sample_df[[col]] <- round(sample_df[[col]])
}
}
}
return(sample_df)
}
# Create safe sample
safe_sample <- create_llm_safe_sample(real_df, n_samples = 50)
# Print in markdown format for LLM
library(knitr)
print(kable(head(safe_sample, 10), format = "markdown"))
|    | id | age | income | education   | treatment |
|---:|---:|----:|-------:|:------------|----------:|
|  0 |  1 |  34 |  45230 | Bachelors   |         1 |
|  1 |  2 |  52 |  62140 | Masters     |         0 |
|  2 |  3 |  28 |  38750 | High School |         1 |
|  3 |  4 |  45 |  55320 | Bachelors   |         0 |
|  4 |  5 |  61 |  71890 | PhD         |         0 |
|  5 |  6 |  39 |  48670 | High School |         1 |
|  6 |  7 |  33 |  42150 | Bachelors   |         0 |
|  7 |  8 |  47 |  58430 | Masters     |         1 |
|  8 |  9 |  29 |  35890 | High School |         0 |
|  9 | 10 |  55 |  67240 | Bachelors   |         1 |
. list in 1/10
+---------------------------------------------+
| id age income education treatment |
|---------------------------------------------|
1. | 1 34 45230.12 BA 1 |
2. | 2 52 62140.45 MA 0 |
3. | 3 28 38750.89 HS 1 |
4. | 4 45 55320.23 BA 0 |
5. | 5 61 71890.67 PhD 0 |
|---------------------------------------------|
6. | 6 39 48670.34 HS 1 |
7. | 7 33 42150.78 BA 0 |
8. | 8 47 58430.12 MA 1 |
9. | 9 29 35890.56 HS 0 |
10. | 10 55 67240.90 BA 1 |
+---------------------------------------------+

| id| age|  income|education   | treatment|
|--:|---:|-------:|:-----------|---------:|
|  1|  34| 45230.1|Bachelors   |         1|
|  2|  52| 62140.5|Masters     |         0|
|  3|  28| 38750.9|High School |         1|
|  4|  45| 55320.2|Bachelors   |         0|
|  5|  61| 71890.7|PhD         |         0|
|  6|  39| 48670.3|High School |         1|
|  7|  33| 42150.8|Bachelors   |         0|
|  8|  47| 58430.1|Masters     |         1|
|  9|  29| 35890.6|High School |         0|
| 10|  55| 67240.9|Bachelors   |         1|
What to Include When Asking LLMs for Help
When sharing data with an LLM, include:
- Small sample (10-50 rows of synthetic data)
- Variable types (numeric, categorical, date)
- Variable descriptions (what each column represents)
- Your goal (what you're trying to accomplish)
- Error messages (if debugging)
I have a dataset with the following structure (synthetic sample):

| id | age | income | education | treatment |
|----|-----|--------|-----------|-----------|
| 1  | 34  | 45000  | Bachelors | 1         |
| 2  | 52  | 62000  | Masters   | 0         |
| 3  | 28  | 38000  | HS        | 1         |

Variables:
- id: unique identifier
- age: respondent age (integer, 18-65)
- income: annual income in USD (continuous)
- education: highest degree (categorical)
- treatment: received intervention (binary 0/1)

Goal: I want to create a difference-in-differences analysis comparing income changes for treated vs untreated before/after the intervention. How do I set this up in Python/statsmodels?
5A.5 Pre-Analysis Code Development
One of the most valuable uses of simulated data is developing your analysis code before the real data arrives. This is especially important for experiments and surveys where you're waiting for data collection to complete.
The Pre-Analysis Workflow
- Design your analysis: Specify what you'll measure and how
- Simulate expected data: Create fake data matching your survey/experiment design
- Write cleaning code: Handle the issues you expect (missing values, outliers)
- Write analysis code: Implement your full analysis pipeline
- Test everything: Verify code runs without errors
- Swap in real data: When it arrives, just change the file path (see the swap pattern sketched after the Python example below)
# Python: Simulate survey data for code development
import numpy as np
import pandas as pd
np.random.seed(42)
def simulate_survey_data(n=500, response_rate=0.7):
"""
Simulate survey data matching expected structure.
Include realistic problems: missing data, outliers, etc.
"""
# Generate complete data
df = pd.DataFrame({
'respondent_id': range(1, n+1),
'age': np.random.normal(35, 12, n).astype(int),
'income': np.random.lognormal(10.5, 0.8, n),
'treatment': np.random.binomial(1, 0.5, n),
'satisfaction': np.random.randint(1, 8, n), # 1-7 scale
})
# Add realistic problems
# 1. Missing values (random)
missing_mask = np.random.random(n) > response_rate
df.loc[missing_mask, 'income'] = np.nan
# 2. Some outliers in income
outlier_idx = np.random.choice(n, size=5, replace=False)
df.loc[outlier_idx, 'income'] *= 10 # Some very high values
# 3. Data entry errors
df.loc[10, 'age'] = 999 # Obvious error
df.loc[25, 'satisfaction'] = 9 # Out of range
return df
# Generate simulated data
sim_df = simulate_survey_data(n=500)
# Now write your cleaning and analysis code...
# When real data arrives, just replace the data source
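To make the final swap a one-line change, you can gate the data source behind a single flag so the cleaning and analysis code never needs to know which dataset it is running on (a sketch; the real file path is a placeholder):

# Python: Switch between simulated and real data with one flag
USE_SIMULATED = True  # flip to False when the real data arrives

if USE_SIMULATED:
    df = simulate_survey_data(n=500)
else:
    df = pd.read_csv("data/raw/real_survey.csv")  # placeholder path

# Everything downstream operates on `df`, regardless of the source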
* Stata: Simulate survey data for code development
* ============================================
* simulate_survey.do - Generate test data
* ============================================
clear
set seed 42
set obs 500
* Generate base variables
gen respondent_id = _n
gen age = int(rnormal(35, 12))
gen income = exp(rnormal(10.5, 0.8))
gen treatment = rbinomial(1, 0.5)
gen satisfaction = int(runiform(1, 8))
* Add realistic problems
* 1. Missing values
replace income = . if runiform() > 0.7
* 2. Outliers
replace income = income * 10 in 1/5
* 3. Data entry errors
replace age = 999 in 10
replace satisfaction = 9 in 25
* Save simulated data
save "data/temp/simulated_survey.dta", replace
* ============================================
* Later: Your analysis code uses either:
* use "data/temp/simulated_survey.dta" (testing)
* use "data/raw/real_survey.dta" (production)
* ============================================
# R: Simulate survey data for code development
simulate_survey_data <- function(n = 500, response_rate = 0.7) {
set.seed(42)
df <- data.frame(
respondent_id = 1:n,
age = as.integer(rnorm(n, 35, 12)),
income = exp(rnorm(n, 10.5, 0.8)),
treatment = rbinom(n, 1, 0.5),
satisfaction = sample(1:7, n, replace = TRUE)
)
# Add realistic problems
# 1. Missing values
missing_mask <- runif(n) > response_rate
df$income[missing_mask] <- NA
# 2. Outliers
outlier_idx <- sample(n, 5)
df$income[outlier_idx] <- df$income[outlier_idx] * 10
# 3. Data entry errors
df$age[10] <- 999
df$satisfaction[25] <- 9
return(df)
}
# Generate simulated data
sim_df <- simulate_survey_data(n = 500)
# Write cleaning and analysis code here...
# When real data arrives, change: read.csv("path/to/real_data.csv")
>>> sim_df.head(10)
   respondent_id  age        income  treatment  satisfaction
0              1   39   45234.89234          1             4
1              2   32   28456.12345          0             6
2              3   41   62789.45678          1             3
3              4   28   18234.56789          0             5
4              5   47  523456.78901          1             2
5              6   35           NaN          0             7
6              7   44   38901.23456          1             4
7              8   29   24567.89012          0             1
8              9   52   71234.56789          1             5
9             10   33           NaN          0             6

>>> sim_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   respondent_id  500 non-null    int64
 1   age            500 non-null    int64
 2   income         347 non-null    float64
 3   treatment      500 non-null    int64
 4   satisfaction   500 non-null    int64
dtypes: float64(1), int64(4)

>>> # Note: 153 missing income values (~30%)
>>> # Note: age[10] = 999 (data entry error)
>>> # Note: satisfaction[25] = 9 (out of range)
. summarize
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
respondent~d | 500 250.5 144.4818 1 500
age | 500 35.23456 12.34567 10 999
income | 347 52345.67 89012.34 5234.56 523456.8
treatment | 500 .502 .5004009 0 1
satisfaction | 500 4.123456 2.012345 1 9
. misstable summarize
Obs<.
+-------------------------
| Unique
Variable | Obs=. Obs>. Obs<. values Min Max
-------------+----------------------------------------------------------------
income | 153 347 214 5234.56 523456.8
-------------+----------------------------------------------------------------
file data/temp/simulated_survey.dta saved

> head(sim_df, 10)
respondent_id age income treatment satisfaction
1 1 39 45234.892 1 4
2 2 32 28456.123 0 6
3 3 41 62789.457 1 3
4 4 28 18234.568 0 5
5 5 47 523456.789 1 2
6 6 35 NA 0 7
7 7 44 38901.235 1 4
8 8 29 24567.890 0 1
9 9 52 71234.568 1 5
10 10 999 NA 0 6
> summary(sim_df)
respondent_id age income treatment satisfaction
Min. : 1 Min. : 10.0 Min. : 5235 Min. :0.000 Min. :1.00
1st Qu.:126 1st Qu.: 27.0 1st Qu.: 21456 1st Qu.:0.000 1st Qu.:2.00
Median :251 Median : 35.0 Median : 38234 Median :1.000 Median :4.00
Mean :251 Mean : 37.2 Mean : 52346 Mean :0.502 Mean :4.12
3rd Qu.:376 3rd Qu.: 43.0 3rd Qu.: 58901 3rd Qu.:1.000 3rd Qu.:6.00
Max. :500 Max. :999.0 Max. :523457 Max. :1.000 Max. :9.00
NA's :153
> # Note: 153 NA values in income (~30% missing)
> # Note: age has max=999 (data entry error at row 10)
> # Note: satisfaction has max=9 (out of 1-7 range)

5A.6 Power Analysis Through Simulation
Statistical power is the probability of detecting an effect when it truly exists. Simulation-based power analysis lets you determine sample size requirements by repeatedly generating data, running your analysis, and checking how often you detect the effect.
For comprehensive coverage of power analysis in experimental design, including analytical formulas and design considerations, see Module 5B: Coding for Experiments.
Simulation-Based Power Analysis
# Python: Power analysis through simulation
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from tqdm import tqdm
def simulate_experiment(n, true_effect, noise_sd):
"""Simulate one experiment and return p-value."""
treatment = np.random.binomial(1, 0.5, n)
outcome = 10 + true_effect * treatment + np.random.normal(0, noise_sd, n)
df = pd.DataFrame({'treatment': treatment, 'outcome': outcome})
model = smf.ols('outcome ~ treatment', data=df).fit()
return model.pvalues['treatment']
def power_analysis(n, true_effect, noise_sd, n_sims=1000, alpha=0.05):
"""Calculate power through simulation."""
significant_count = sum(
simulate_experiment(n, true_effect, noise_sd) < alpha
for _ in tqdm(range(n_sims))
)
return significant_count / n_sims
# Find required sample size for 80% power
sample_sizes = [50, 100, 200, 400, 800]
true_effect = 2.0 # Effect size we want to detect
noise_sd = 10 # Standard deviation of outcome
print("Sample Size | Power")
print("-" * 20)
for n in sample_sizes:
power = power_analysis(n, true_effect, noise_sd, n_sims=500)
print(f"{n:11} | {power:.2%}")
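A power curve makes the sample-size trade-off easier to see; the sketch below reuses power_analysis from above and assumes matplotlib is installed:

# Python: Plot the simulated power curve
import matplotlib.pyplot as plt

powers = [power_analysis(n, true_effect, noise_sd, n_sims=500) for n in sample_sizes]
plt.plot(sample_sizes, powers, marker="o")
plt.axhline(0.80, linestyle="--", color="grey")  # conventional 80% threshold
plt.xlabel("Sample size")
plt.ylabel("Estimated power")
plt.title("Simulated power curve")
plt.savefig("power_curve.png", dpi=150)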
* Stata: Power analysis through simulation
* Define simulation program
capture program drop power_sim
program define power_sim, rclass
args n true_effect noise_sd
* Generate data
clear
quietly set obs `n'
gen treatment = rbinomial(1, 0.5)
gen outcome = 10 + `true_effect' * treatment + rnormal(0, `noise_sd')
* Run regression
quietly reg outcome treatment
* Return p-value
return scalar pval = 2*ttail(e(df_r), abs(_b[treatment]/_se[treatment]))
end
* Run power simulation
local true_effect = 2.0
local noise_sd = 10
local n_sims = 500
local alpha = 0.05
foreach n in 50 100 200 400 800 {
simulate pval=r(pval), reps(`n_sims') seed(42): power_sim `n' `true_effect' `noise_sd'
quietly count if pval < `alpha'
local power = r(N) / `n_sims'
display "N = `n': Power = " %5.1f `power'*100 "%"
}
# R: Power analysis through simulation
simulate_experiment <- function(n, true_effect, noise_sd) {
# Generate data
treatment <- rbinom(n, 1, 0.5)
outcome <- 10 + true_effect * treatment + rnorm(n, 0, noise_sd)
# Run regression and get p-value
model <- lm(outcome ~ treatment)
p_value <- summary(model)$coefficients["treatment", "Pr(>|t|)"]
return(p_value)
}
power_analysis <- function(n, true_effect, noise_sd, n_sims = 1000, alpha = 0.05) {
p_values <- replicate(n_sims, simulate_experiment(n, true_effect, noise_sd))
power <- mean(p_values < alpha)
return(power)
}
# Find required sample size
sample_sizes <- c(50, 100, 200, 400, 800)
true_effect <- 2.0
noise_sd <- 10
cat("Sample Size | Power\n")
cat(rep("-", 20), "\n", sep = "")
for (n in sample_sizes) {
set.seed(42)
power <- power_analysis(n, true_effect, noise_sd, n_sims = 500)
cat(sprintf("%11d | %.1f%%\n", n, power * 100))
}
Sample Size | Power
--------------------
50 | 18.40%
100 | 29.60%
200 | 51.80%
400 | 78.20%
800 | 96.40%
# Interpretation:
# - With n=50 per group, only 18% chance of detecting the effect
# - With n=400, we achieve ~80% power (standard threshold)
# - With n=800, we have >95% power (very likely to detect)
# - Recommended sample size: ~400 total (200 per group)

N = 50: Power = 18.4%
N = 100: Power = 29.6%
N = 200: Power = 51.8%
N = 400: Power = 78.2%
N = 800: Power = 96.4%

. * Interpretation:
. * Effect size = 2.0 (treatment increases outcome by 2 units)
. * Noise SD = 10 (substantial variation in outcome)
. * Standardized effect size (Cohen's d) = 2/10 = 0.2 (small)
. * Need n~400 to detect this small effect with 80% power
Sample Size | Power
--------------------
50 | 18.4%
100 | 29.6%
200 | 51.8%
400 | 78.2%
800 | 96.4%
# Power increases with sample size
# At n=400, we reach approximately 80% power
# This means: if the true effect exists (effect=2.0),
# we will correctly detect it (p < 0.05) in ~78% of studies
# Recommendation: Use n >= 400 for adequate power

References
- Duflo, E., Glennerster, R., & Kremer, M. (2007). Using Randomization in Development Economics Research: A Toolkit. Handbook of Development Economics.
- synthpop package: synthpop.org.uk
- SDV (Synthetic Data Vault): sdv.dev