7 Estimation Methods
Learning Objectives
- Distinguish between identification and estimation
- Choose appropriate standard errors for your data structure
- Handle panel data with fixed effects efficiently
- Implement nonlinear models (logit, probit, Poisson)
- Understand maximum likelihood and GMM estimation
Table of Contents
7.1 Identification vs. Estimation
A common source of confusion: identification and estimation are different problems.
Identification (Module 6)
Can we learn the causal effect from data?
- Research design and assumptions
- What variation identifies the effect?
- Parallel trends, exclusion restriction, etc.
- Whether your estimate is causal or just correlational
Estimation (This Module)
How do we compute the estimate from data?
- OLS, MLE, GMM, nonparametric methods
- Standard errors and inference
- Handling large datasets efficiently
- Correcting for clustering, heteroskedasticity
You can have correct identification but bad estimation (wrong standard errors), or good estimation but no identification (precise estimate of a biased number). Both matter.
7.2 Standard Errors and Inference
Choosing the right standard errors is essential for valid inference. Using incorrect standard errors can lead to dramatically wrong conclusions—confidence intervals that are too narrow (false precision) or too wide (lost power). The two main considerations are:
- Heteroskedasticity: When the variance of the error term varies across observations (e.g., income variance differs by wealth level)
- Clustering: When observations are correlated within groups (e.g., students in the same classroom, repeated observations of the same person)
Robust (Heteroskedasticity-Consistent) Standard Errors
Classical OLS standard errors assume homoskedasticity: constant error variance across all observations. In practice, this rarely holds. Robust standard errors (also called Huber-White or sandwich standard errors) remain valid under heteroskedasticity without requiring you to model its form.
The key parameter is cov_type='HC1' (Python), robust or vce(robust) (Stata), or vcov = vcovHC() (R). HC0-HC3 are variants with different small-sample corrections—HC1 is the most common. In the output, compare standard errors: robust SEs are often larger than classical OLS SEs when heteroskedasticity is present, leading to wider confidence intervals and more conservative inference.
In Python's statsmodels, you request robust standard errors by passing the cov_type argument to the .fit() method. Without this argument, .fit() uses classical (homoskedastic) standard errors. The argument name cov_type stands for "covariance type" — it tells the estimator which variance-covariance matrix formula to use when computing standard errors. The value 'HC1' selects the Huber-White sandwich estimator with a small-sample correction factor of n/(n-k), matching Stata's ,robust option exactly. The key thing to remember: coefficients stay identical regardless of cov_type — only the standard errors change.
The four HC variants differ only in their small-sample corrections: HC0 applies no correction (biased downward in small samples), HC1 applies n/(n-k) (matches Stata's ,robust), HC2 adjusts by leverage (better when some observations are influential), and HC3 applies the strongest correction (jackknife-like, conservative). For most applied work, HC1 is the standard choice.
# Python: Robust Standard Errors
import statsmodels.formula.api as smf
# Default (homoskedastic) standard errors
model = smf.ols('outcome ~ treatment + controls', data=df).fit()
# Robust (heteroskedasticity-consistent) HC1
model_robust = smf.ols('outcome ~ treatment + controls', data=df).fit(
cov_type='HC1'
)
print(model_robust.summary())
* Stata: Robust Standard Errors
* Default standard errors
reg outcome treatment controls
* Robust (heteroskedasticity-consistent) - ALWAYS use this
reg outcome treatment controls, robust
* Equivalent: vce(robust)
reg outcome treatment controls, vce(robust)
# R: Robust Standard Errors
library(sandwich)
library(lmtest)
# Fit model
model <- lm(outcome ~ treatment + controls, data = df)
# Robust standard errors (HC1)
coeftest(model, vcov = vcovHC(model, type = "HC1"))
# Using fixest (cleaner syntax)
library(fixest)
model <- feols(outcome ~ treatment + controls, data = df, vcov = "hetero")
summary(model)
OLS Regression Results (Robust)
==============================================================================
Dep. Variable: outcome R-squared: 0.342
Model: OLS Adj. R-squared: 0.339
Method: Least Squares F-statistic: 127.4
Date: Mon, 27 Jan 2026 Prob (F-statistic): 1.23e-47
No. Observations: 500
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 2.4531 0.312 7.861 0.000 1.839 3.067
treatment 1.8724 0.187 10.014 0.000 1.505 2.240
controls 0.4215 0.098 4.301 0.000 0.229 0.614
==============================================================================
Notes: Standard errors are heteroskedasticity robust (HC1)
Linear regression Number of obs = 500
F(2, 497) = 127.38
Prob > F = 0.0000
R-squared = 0.3418
Root MSE = 2.1847
------------------------------------------------------------------------------
| Robust
outcome | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
treatment | 1.872431 .1869847 10.01 0.000 1.505087 2.239775
controls | .4215328 .0980127 4.30 0.000 .2289842 .6140814
_cons | 2.453124 .3120458 7.86 0.000 1.839434 3.066814
------------------------------------------------------------------------------
t test of coefficients (Heteroskedasticity-consistent):
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.453124 0.312046 7.8612 2.35e-14 ***
treatment 1.872431 0.186985 10.0138 < 2.2e-16 ***
controls 0.421533 0.098013 4.3008 2.01e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Clustered Standard Errors
When observations are correlated within groups, standard errors that ignore this correlation will be too small—often dramatically so. Common examples include:
- Students within schools: Students in the same school share teachers, facilities, and peer effects
- Workers within firms: Workers at the same company share management, culture, and industry shocks
- Repeated observations: The same individual observed multiple times over a panel
- Geographic clustering: Individuals in the same state/region share local policies and economic conditions
The clustering variable is specified via cov_kwds={'groups': df['cluster_var']} (Python), cluster(cluster_var) (Stata), or cluster = ~cluster_var (R/fixest). The output will report the number of clusters—a rule of thumb is that you need at least 30-50 clusters for clustered standard errors to be reliable. With few clusters, consider wild cluster bootstrap (shown in Section 7.6).
# Python: Clustered Standard Errors
import statsmodels.formula.api as smf
# Cluster at the school level
model = smf.ols('outcome ~ treatment', data=df).fit(
cov_type='cluster',
cov_kwds={'groups': df['school_id']}
)
# Two-way clustering (e.g., state and year)
from linearmodels.panel import PanelOLS
df_panel = df.set_index(['state', 'year'])
model = PanelOLS.from_formula('outcome ~ treatment', data=df_panel)
results = model.fit(cov_type='clustered', cluster_entity=True, cluster_time=True)
* Stata: Clustered Standard Errors
* One-way clustering
reg outcome treatment controls, cluster(school_id)
* Two-way clustering
* ssc install reghdfe
reghdfe outcome treatment, absorb(state year) cluster(state year)
* Check number of clusters
tab school_id, matrow(clusters)
display r(r)
# R: Clustered Standard Errors
library(fixest)
# One-way clustering
model <- feols(outcome ~ treatment + controls,
data = df,
cluster = ~ school_id)
summary(model)
# Two-way clustering
model <- feols(outcome ~ treatment | state + year,
data = df,
cluster = ~ state + year)
# With sandwich package
library(sandwich)
model <- lm(outcome ~ treatment, data = df)
vcovCL(model, cluster = ~ school_id)
PanelOLS Estimation Summary
================================================================================
Dep. Variable: outcome R-squared: 0.2847
Estimator: PanelOLS R-squared (Between): 0.3124
No. Observations: 2500 R-squared (Within): 0.2847
Date: Mon, 27 Jan 2026 R-squared (Overall): 0.2956
Cov. Estimator: Clustered
F-statistic: 98.237
Entities: 50 P-value 0.0000
Time periods: 50 Distribution: F(1,2448)
Cluster Var: Entity & Time
Parameter Estimates
================================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
--------------------------------------------------------------------------------
treatment 2.1543 0.2174 9.9098 0.0000 1.7276 2.5810
================================================================================
Number of entity clusters: 50
Number of time clusters: 50
Linear regression Number of obs = 2,500
F(2, 49) = 65.82
Prob > F = 0.0000
R-squared = 0.2847
Root MSE = 1.9234
(Std. Err. adjusted for 50 clusters in school_id)
------------------------------------------------------------------------------
| Robust
outcome | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
treatment | 2.154312 .2173841 9.91 0.000 1.717406 2.591218
controls | .3821456 .1124587 3.40 0.001 .1561547 .6081365
_cons | 1.287645 .4521378 2.85 0.006 .3793121 2.195978
------------------------------------------------------------------------------
Number of clusters (school_id) = 50
OLS estimation, Dep. Var.: outcome
Observations: 2,500
Standard-errors: Clustered (school_id)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.287645 0.452138 2.84791 0.006342 **
treatment 2.154312 0.217384 9.90978 1.23e-13 ***
controls 0.382146 0.112459 3.39830 0.001378 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 1.92344 Adj. R2: 0.28364
Std-Errors: Clustered (school_id), 50 clusters
General rule: cluster at the level where:
- Treatment is assigned (for experiments)
- There's correlation in unobservables (for observational studies)
When in doubt, cluster at a more aggregate level. See Cameron & Miller (2015) for guidance.
Comparing Models: estimates store + esttab
A common workflow is to estimate several specifications (e.g., classical SEs vs. robust SEs, or progressively adding controls) and present them side-by-side. In Stata, the pattern is: run a regression, store its results with estimates store, repeat for each model, then display them together with esttab. In Python, the Stargazer package serves a similar role; in R, modelsummary or etable from fixest.
When comparing models side-by-side, always use the same type of standard errors across all columns. Switching from classical SEs in one model to robust SEs in another makes it impossible to tell whether a change in significance is due to the added controls or the different SE type.
# Python: Compare models with Stargazer
from stargazer.stargazer import Stargazer
import statsmodels.formula.api as smf
# Model 1: baseline
m1 = smf.ols('wage ~ education', data=df).fit(cov_type='HC1')
# Model 2: add controls
m2 = smf.ols('wage ~ education + experience', data=df).fit(cov_type='HC1')
# Model 3: full specification
m3 = smf.ols('wage ~ education + experience + age', data=df).fit(cov_type='HC1')
# Side-by-side comparison
table = Stargazer([m1, m2, m3])
print(table.render_text())
* Stata: Compare models with esttab
* ssc install estout, replace
* Model 1: baseline
regress wage education, robust
estimates store baseline
* Model 2: add controls
regress wage education experience, robust
estimates store controls
* Model 3: full specification
regress wage education experience age, robust
estimates store full
* Side-by-side comparison table
esttab baseline controls full, se star(* 0.10 ** 0.05 *** 0.01)
* Export to LaTeX
esttab baseline controls full using "output/tables/results.tex", ///
se replace star(* 0.10 ** 0.05 *** 0.01)
# R: Compare models with fixest::etable or modelsummary
library(fixest)
# Estimate three specifications
m1 <- feols(wage ~ education, data = df, vcov = "hetero")
m2 <- feols(wage ~ education + experience, data = df, vcov = "hetero")
m3 <- feols(wage ~ education + experience + age, data = df, vcov = "hetero")
# Side-by-side comparison
etable(m1, m2, m3, se.below = TRUE)
# Export to LaTeX
etable(m1, m2, m3, tex = TRUE, file = "output/tables/results.tex")
=====================================================================
Dependent variable: wage
(1) (2) (3)
---------------------------------------------------------------------
education 1.247*** 1.182*** 1.174***
(0.089) (0.082) (0.083)
experience 0.534*** 0.487***
(0.067) (0.078)
age 0.052
(0.041)
Intercept 8.234*** 5.127*** 4.823***
(1.124) (1.087) (1.142)
---------------------------------------------------------------------
Observations 500 500 500
R-squared 0.284 0.371 0.373
Adj. R-squared 0.283 0.369 0.369
=====================================================================
Note: *p<0.1; **p<0.05; ***p<0.01
Robust standard errors in parentheses
------------------------------------------------------------
(1) (2) (3)
wage wage wage
------------------------------------------------------------
education 1.247*** 1.182*** 1.174***
(0.089) (0.082) (0.083)
experience 0.534*** 0.487***
(0.067) (0.078)
age 0.052
(0.041)
_cons 8.234*** 5.127*** 4.823***
(1.124) (1.087) (1.142)
------------------------------------------------------------
N 500 500 500
R-sq 0.284 0.371 0.373
------------------------------------------------------------
Standard errors in parentheses
* p<0.10, ** p<0.05, *** p<0.01
m1 m2 m3
Dependent Var.: wage wage wage
education 1.247*** 1.182*** 1.174***
(0.0893) (0.0821) (0.0834)
experience 0.534*** 0.487***
(0.0672) (0.0781)
age 0.0524
(0.0412)
Constant 8.234*** 5.127*** 4.823***
(1.124) (1.087) (1.142)
___________________________ ____________ ____________ ____________
S.E. type Heteroskedasticity-robust
Observations 500 500 500
R2 0.28437 0.37082 0.37257
Adj. R2 0.28293 0.36829 0.36878
Regression Diagnostics
After estimation, diagnostic tests help verify model assumptions and identify potential problems. Three key diagnostics are: variance inflation factors (detect multicollinearity), heteroskedasticity tests (validate SE choice), and residual plots (check model specification).
# Python: Regression diagnostics
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.diagnostic import het_breuschpagan
import matplotlib.pyplot as plt
model = smf.ols('wage ~ education + experience + age', data=df).fit()
# 1. Variance Inflation Factors (VIF) — detect multicollinearity
X = model.model.exog
for i, name in enumerate(model.model.exog_names):
print(f"{name}: VIF = {variance_inflation_factor(X, i):.2f}")
# 2. Breusch-Pagan test for heteroskedasticity
bp_stat, bp_pval, _, _ = het_breuschpagan(model.resid, model.model.exog)
print(f"Breusch-Pagan p-value: {bp_pval:.4f}")
# 3. Residual vs. Fitted plot — check specification
plt.scatter(model.fittedvalues, model.resid, alpha=0.3)
plt.axhline(y=0, color='red', linestyle='--')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.title('Residual vs. Fitted Plot')
plt.savefig('diagnostics.pdf', bbox_inches='tight')
* Stata: Regression diagnostics
reg wage education experience age, robust
* 1. Variance Inflation Factors
estat vif
* 2. Breusch-Pagan heteroskedasticity test
* Must re-estimate without robust for this test
quietly reg wage education experience age
estat hettest
* 3. Residual vs. Fitted plot
rvfplot, yline(0)
# R: Regression diagnostics
library(car) # for vif()
model <- lm(wage ~ education + experience + age, data = df)
# 1. Variance Inflation Factors
vif(model)
# 2. Breusch-Pagan test
library(lmtest)
bptest(model)
# 3. Residual vs. Fitted plot
plot(model, which = 1)
Intercept: VIF = 1.02 education: VIF = 2.84 experience: VIF = 7.12 age: VIF = 6.98 Breusch-Pagan p-value: 0.0023 [Residual vs. Fitted plot saved to diagnostics.pdf]
. estat vif
Variable | VIF 1/VIF
-------------+----------------------
experience | 7.12 0.140449
age | 6.98 0.143266
education | 2.84 0.352113
-------------+----------------------
Mean VIF | 5.65
. estat hettest
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
Variables: fitted values of wage
chi2(1) = 9.34
Prob > chi2 = 0.0022
> vif(model)
education experience age
2.84 7.12 6.98
> bptest(model)
studentized Breusch-Pagan test
data: model
BP = 9.3412, df = 3, p-value = 0.02508
[Residual vs. Fitted plot displayed]
VIF > 10: Multicollinearity -- coefficients are unstable. Drop or combine correlated variables.
Breusch-Pagan p < 0.05: Heteroskedasticity detected -- use robust SEs (cov_type='HC1').
Curved residual plot: Misspecification -- try adding polynomial or log terms.
Funnel-shaped residuals: Heteroskedasticity -- confirms the need for robust SEs.
7.3 Panel Data Methods
Panel data (repeated observations of units over time) is the workhorse of applied microeconomics because it enables researchers to control for unobserved heterogeneity. The key intuition: if individuals differ in unobserved ways that affect both treatment and outcomes, cross-sectional comparisons are biased. Panel data lets us compare the same individual over time, differencing out time-invariant unobservables.
Fixed Effects
Fixed effects estimation includes a separate intercept for each unit (person, firm, country), absorbing all time-invariant characteristics. Mechanically, this is equivalent to "demeaning"—subtracting each unit's mean from all its observations. The identifying variation comes only from within-unit changes over time.
The key syntax elements are: entity_effects=True (Python linearmodels), fe or absorb(unit_id) (Stata), and | unit_id after the formula (R fixest). Two-way fixed effects add time fixed effects as well, controlling for common shocks (e.g., recessions) that affect all units.
Important: You can only estimate coefficients on variables that vary within units over time. Time-invariant characteristics (gender, race, birth cohort) are absorbed by the fixed effects and cannot be estimated—but they are controlled for.
# Python: Fixed Effects with linearmodels
from linearmodels.panel import PanelOLS
# Set panel index
df = df.set_index(['unit_id', 'year'])
# Unit fixed effects only
model = PanelOLS.from_formula(
'outcome ~ treatment + time_varying_controls',
data=df,
entity_effects=True
)
results = model.fit(cov_type='clustered', cluster_entity=True)
# Two-way fixed effects (unit + time)
model = PanelOLS.from_formula(
'outcome ~ treatment',
data=df,
entity_effects=True,
time_effects=True
)
results = model.fit(cov_type='clustered', cluster_entity=True)
* Stata: Fixed Effects
* Declare panel structure
xtset unit_id year
* Unit fixed effects
xtreg outcome treatment time_varying_controls, fe cluster(unit_id)
* Two-way fixed effects with reghdfe (RECOMMENDED for speed)
reghdfe outcome treatment, absorb(unit_id year) cluster(unit_id)
* With multiple sets of fixed effects
reghdfe outcome treatment, absorb(unit_id year industry_id) cluster(unit_id)
# R: Fixed Effects with fixest
library(fixest)
# Unit fixed effects
model <- feols(outcome ~ treatment + controls | unit_id,
data = df,
cluster = ~ unit_id)
summary(model)
# Two-way fixed effects
model <- feols(outcome ~ treatment | unit_id + year,
data = df,
cluster = ~ unit_id)
# Multiple sets of fixed effects
model <- feols(outcome ~ treatment | unit_id + year + industry_id,
data = df,
cluster = ~ unit_id)
# With plm package (older, slower)
library(plm)
pdata <- pdata.frame(df, index = c("unit_id", "year"))
model <- plm(outcome ~ treatment, data = pdata, model = "within")
PanelOLS Estimation Summary
================================================================================
Dep. Variable: outcome R-squared: 0.4218
Estimator: PanelOLS R-squared (Between): 0.0147
No. Observations: 5000 R-squared (Within): 0.4218
Date: Mon, 27 Jan 2026 R-squared (Overall): 0.3156
Time: 14:32:17 Log-likelihood -8547.2
Cov. Estimator: Clustered
F-statistic: 182.45
Entities: 500 P-value 0.0000
Time periods: 10 Distribution: F(2,4497)
Parameter Estimates
================================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------------
treatment 1.4872 0.1102 13.4954 0.0000 1.2712 1.7032
time_varying_ctrl 0.8234 0.0876 9.3995 0.0000 0.6516 0.9952
================================================================================
F-test for Poolability: 4.2341
P-value: 0.0000
Distribution: F(499,4497)
Included effects: Entity, Time
. xtset unit_id year
panel variable: unit_id (strongly balanced)
time variable: year, 2010 to 2019
delta: 1 unit
. xtreg outcome treatment time_varying_controls, fe cluster(unit_id)
Fixed-effects (within) regression Number of obs = 5,000
Group variable: unit_id Number of groups = 500
R-sq: Obs per group:
within = 0.4218 min = 10
between = 0.0147 avg = 10.0
overall = 0.3156 max = 10
F(2,499) = 182.45
corr(u_i, Xb) = 0.0234 Prob > F = 0.0000
(Std. Err. adjusted for 500 clusters in unit_id)
----------------------------------------------------------------------------------
| Robust
outcome | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-----------------+----------------------------------------------------------------
treatment | 1.487218 .1102341 13.50 0.000 1.270598 1.703838
time_varying_ctr | .8234127 .0876234 9.40 0.000 .6512847 .9955407
_cons | 2.134567 .2341876 9.11 0.000 1.674451 2.594683
-----------------+----------------------------------------------------------------
sigma_u | 1.8723456
sigma_e | 1.2345678
rho | .69712345 (fraction of variance due to u_i)
----------------------------------------------------------------------------------
OLS estimation, Dep. Var.: outcome
Observations: 5,000
Fixed-effects: unit_id: 500, year: 10
Standard-errors: Clustered (unit_id)
Estimate Std. Error t value Pr(>|t|)
treatment 1.487218 0.110234 13.49536 < 2.2e-16 ***
time_varying_ctrl 0.823413 0.087623 9.39950 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 1.23457 Adj. R2: 0.42018
Fixef resid. var.: 1.52416 (500 + 10 FEs)
Random Effects
Random effects (RE) treats unit-specific effects as random draws from a distribution rather than fixed parameters. This is more efficient than fixed effects (smaller standard errors) but requires a stronger assumption: the unit effects must be uncorrelated with the regressors. If this assumption fails, RE estimates are inconsistent.
In practice, RE is rarely preferred in economics research because the key assumption (unit effects uncorrelated with regressors) is hard to justify. The Hausman test compares FE and RE: if they differ significantly, RE is inconsistent and FE should be used. RE remains useful when you need to estimate coefficients on time-invariant variables, or when sample sizes are small and efficiency gains matter.
# Python: Random Effects
from linearmodels.panel import RandomEffects
df = df.set_index(['unit_id', 'year'])
model = RandomEffects.from_formula(
'outcome ~ treatment + controls',
data=df
)
results = model.fit()
# Hausman test: FE vs RE
from linearmodels.panel import compare
fe_model = PanelOLS.from_formula('outcome ~ treatment', data=df, entity_effects=True).fit()
re_model = RandomEffects.from_formula('outcome ~ treatment', data=df).fit()
print(compare({'FE': fe_model, 'RE': re_model}))
* Stata: Random Effects
* Random effects model
xtreg outcome treatment controls, re
* Hausman test
xtreg outcome treatment controls, fe
estimates store fe
xtreg outcome treatment controls, re
hausman fe .
# R: Random Effects
library(plm)
pdata <- pdata.frame(df, index = c("unit_id", "year"))
# Random effects model
re_model <- plm(outcome ~ treatment + controls,
data = pdata,
model = "random")
summary(re_model)
# Hausman test
fe_model <- plm(outcome ~ treatment + controls, data = pdata, model = "within")
phtest(fe_model, re_model)
RandomEffects Estimation Summary
================================================================================
Dep. Variable: outcome R-squared: 0.3524
Estimator: RandomEffects R-squared (Between): 0.4012
No. Observations: 5000 R-squared (Within): 0.3187
Date: Mon, 27 Jan 2026 R-squared (Overall): 0.3524
Parameter Estimates
================================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
--------------------------------------------------------------------------------
Intercept 1.8923 0.1876 10.0865 0.0000 1.5246 2.2600
treatment 1.5234 0.0987 15.4346 0.0000 1.3299 1.7169
controls 0.7821 0.0654 11.9587 0.0000 0.6539 0.9103
================================================================================
Model Comparison
================================================================================
FE RE
--------------------------------------------------------------------------------
Dep. Variable outcome outcome
Estimator PanelOLS RandomEffects
No. Observations 5000 5000
Cov. Estimator Unadjusted Unadjusted
R-squared 0.3187 0.3524
FE RE
--------------------------------------------------------------------------------
treatment 1.4872 1.5234
(0.1102) (0.0987)
================================================================================
Hausman test statistic: 8.234
P-value: 0.0041
Conclusion: Reject H0 at 5% level - use Fixed Effects
. xtreg outcome treatment controls, re
Random-effects GLS regression Number of obs = 5,000
Group variable: unit_id Number of groups = 500
R-sq: Obs per group:
within = 0.3187 min = 10
between = 0.4012 avg = 10.0
overall = 0.3524 max = 10
Wald chi2(2) = 476.23
corr(u_i, X) = 0 (assumed) Prob > chi2 = 0.0000
------------------------------------------------------------------------------
outcome | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
treatment | 1.523412 .0987234 15.43 0.000 1.329919 1.716905
controls | .7821345 .0653876 11.96 0.000 .6539769 .9102921
_cons | 1.892345 .1876234 10.09 0.000 1.524612 2.260078
-------------+----------------------------------------------------------------
sigma_u | 1.4523456
sigma_e | 1.2345678
rho | .58012345 (fraction of variance due to u_i)
------------------------------------------------------------------------------
. hausman fe .
---- Coefficients ----
| (b) (B) (b-B) sqrt(diag(V_b-V_B))
| fe . Difference S.E.
-------------+----------------------------------------------------------------
treatment | 1.487218 1.523412 -.0361943 .0412187
controls | .8234127 .7821345 .0412782 .0287654
------------------------------------------------------------------------------
b = consistent under Ho and Ha
B = inconsistent under Ha, efficient under Ho
Test: Ho: difference in coefficients not systematic
chi2(2) = (b-B)'[(V_b-V_B)^(-1)](b-B)
= 8.23
Prob>chi2 = 0.0041
(V_b-V_B is not positive definite)
Oneway (individual) effect Random Effect Model
(Swamy-Arora's transformation)
Call:
plm(formula = outcome ~ treatment + controls, data = pdata, model = "random")
Balanced Panel: n = 500, T = 10, N = 5000
Effects:
var std.dev share
idiosyncratic 1.5235 1.2343 0.42
individual 2.1098 1.4525 0.58
theta: 0.7821
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-4.23456 -0.82345 0.01234 0.84567 4.56789
Coefficients:
Estimate Std. Error z-value Pr(>|z|)
(Intercept) 1.892345 0.187623 10.086 < 2.2e-16 ***
treatment 1.523412 0.098723 15.435 < 2.2e-16 ***
controls 0.782135 0.065388 11.959 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 12345.6
Residual Sum of Squares: 7987.23
R-Squared: 0.35243
Adj. R-Squared: 0.35217
Chisq: 476.23 on 2 DF, p-value: < 2.22e-16
Hausman Test
data: outcome ~ treatment + controls
chisq = 8.234, df = 2, p-value = 0.004123
alternative hypothesis: one model is inconsistent
First-Difference Estimator
First-differencing is an alternative to the within (demeaning) transformation that eliminates unit fixed effects by subtracting the previous period's observation. It is equivalent to FE when T=2, but differs with longer panels. First-differencing is particularly useful with unit-root processes.
# Python: First-Difference Estimator
from linearmodels.panel import FirstDifferenceOLS
import pandas as pd
df = df.set_index(['unit_id', 'year'])
# First-difference estimation
model_fd = FirstDifferenceOLS.from_formula(
'outcome ~ treatment + time_varying_controls',
data=df
)
results_fd = model_fd.fit(cov_type='clustered', cluster_entity=True)
print(results_fd.summary)
* Stata: First-Difference Estimator
xtset unit_id year
* First-difference: D.y on D.x
reg D.outcome D.treatment D.controls, cluster(unit_id)
* Using xtreg, fe with first-difference explicitly
xtreg outcome treatment controls, fd cluster(unit_id)
# R: First-Difference Estimator
library(plm)
# First-difference estimation
model_fd <- plm(outcome ~ treatment + controls,
data = df,
index = c("unit_id", "year"),
model = "fd")
summary(model_fd)
# Compare FE vs FD
model_fe <- plm(outcome ~ treatment + controls,
data = df,
index = c("unit_id", "year"),
model = "within")
cat("FE estimate:", coef(model_fe)["treatment"], "\n")
cat("FD estimate:", coef(model_fd)["treatment"], "\n")
FirstDifferenceOLS Estimation Summary
================================================================================
Dep. Variable: outcome R-squared: 0.2134
Estimator: FirstDifferenceOLS R-squared (Between): 0.0082
No. Observations: 4500 R-squared (Within): 0.2134
Date: Mon, 27 Jan 2026 R-squared (Overall): 0.1876
Cov. Estimator: Clustered
F-statistic: 87.234
Entities: 500 P-value 0.0000
Time periods: 9 Distribution: F(2,3997)
Parameter Estimates
================================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
----------------------------------------------------------------------------------------
D.treatment 1.4123 0.1523 9.2731 0.0000 1.1137 1.7109
D.time_varying_ctrl 0.7845 0.1012 7.7521 0.0000 0.5861 0.9829
================================================================================
Number of entity clusters: 500
. xtset unit_id year
panel variable: unit_id (strongly balanced)
time variable: year, 2010 to 2019
delta: 1 unit
. reg D.outcome D.treatment D.controls, cluster(unit_id)
Linear regression Number of obs = 4,500
F(2, 499) = 87.23
Prob > F = 0.0000
R-squared = 0.2134
Root MSE = 1.5423
(Std. Err. adjusted for 500 clusters in unit_id)
------------------------------------------------------------------------------
| Robust
D.outcome | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
D.treatment | 1.412312 .1523456 9.27 0.000 1.113048 1.711576
D.controls | .7845123 .1012345 7.75 0.000 .5857821 .9832425
_cons | .0234567 .0312456 0.75 0.453 -.0379125 .0848259
------------------------------------------------------------------------------
Oneway (individual) effect First-Difference Model
Call:
plm(formula = outcome ~ treatment + controls, data = df,
model = "fd", index = c("unit_id", "year"))
Balanced Panel: n = 500, T = 10, N = 5000
Observations used in estimation: 4500
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-4.12345 -0.78234 0.00987 0.79876 4.34567
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
(Intercept) 0.023457 0.031246 0.7507 0.4530
treatment 1.412312 0.152346 9.2731 < 2.2e-16 ***
controls 0.784512 0.101235 7.7521 1.23e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 13456.7
Residual Sum of Squares: 10587.2
R-Squared: 0.21337
Adj. R-Squared: 0.21302
FE estimate: 1.4872
FD estimate: 1.4123
Hausman Test: FE vs. RE
The Hausman test compares Fixed Effects and Random Effects estimates. Under the null hypothesis, both are consistent but RE is efficient; under the alternative, RE is inconsistent (the unit effects are correlated with regressors) and only FE is consistent. A significant test favors FE.
# Python: Hausman test
from linearmodels.panel import PanelOLS, RandomEffects
from linearmodels.panel import compare
df = df.set_index(['unit_id', 'year'])
# FE model
fe = PanelOLS.from_formula('outcome ~ treatment + controls',
data=df, entity_effects=True).fit()
# RE model
re = RandomEffects.from_formula('outcome ~ treatment + controls',
data=df).fit()
# Compare: Hausman-like test
print(compare({'FE': fe, 'RE': re}))
* Stata: Hausman test
* Estimate FE and store
xtreg outcome treatment controls, fe
estimates store fe_model
* Estimate RE and store
xtreg outcome treatment controls, re
estimates store re_model
* Hausman test
hausman fe_model re_model
* Robust Hausman test (with clustered SEs)
xtoverid
# R: Hausman test
library(plm)
# Estimate both models
model_fe <- plm(outcome ~ treatment + controls,
data = df, index = c("unit_id", "year"),
model = "within")
model_re <- plm(outcome ~ treatment + controls,
data = df, index = c("unit_id", "year"),
model = "random")
# Hausman test
hausman_test <- phtest(model_fe, model_re)
print(hausman_test)
# Decision rule:
# p < 0.05 → use Fixed Effects (unit effects correlated with X)
# p > 0.05 → Random Effects is more efficient
Model Comparison
================================================================================
FE RE
--------------------------------------------------------------------------------
Dep. Variable outcome outcome
Estimator PanelOLS RandomEffects
No. Observations 5000 5000
Cov. Estimator Unadjusted Unadjusted
R-squared 0.4218 0.3524
R-squared (Between) 0.0147 0.4012
R-squared (Within) 0.4218 0.3187
R-squared (Overall) 0.3156 0.3524
FE RE
--------------------------------------------------------------------------------
treatment 1.4872 1.5234
(0.1102) (0.0987)
controls 0.8234 0.7821
(0.0876) (0.0654)
================================================================================
T-stat for Hausman: 2.869
P-value: 0.0041
. xtreg outcome treatment controls, fe
(output omitted)
. estimates store fe_model
. xtreg outcome treatment controls, re
(output omitted)
. estimates store re_model
. hausman fe_model re_model
---- Coefficients ----
| (b) (B) (b-B) sqrt(diag(V_b-V_B))
| fe_model re_model Difference S.E.
-------------+----------------------------------------------------------------
treatment | 1.487218 1.523412 -.036194 .041219
controls | .8234127 .7821345 .041278 .028765
------------------------------------------------------------------------------
b = consistent under Ho and Ha
B = inconsistent under Ha, efficient under Ho
Test: Ho: difference in coefficients not systematic
chi2(2) = (b-B)'[(V_b-V_B)^(-1)](b-B)
= 8.23
Prob>chi2 = 0.0041
. xtoverid
Test of overidentifying restrictions: fixed vs random effects
Cross-section time-series model: xtreg re
Sargan-Hansen statistic 8.452 Chi-sq(2) P-value = 0.0036
Hausman Test data: outcome ~ treatment + controls chisq = 8.234, df = 2, p-value = 0.004123 alternative hypothesis: one model is inconsistent # Interpretation: # p-value = 0.004 < 0.05 # Reject H0: RE is inconsistent # Conclusion: Use Fixed Effects
Arellano-Bond: Dynamic Panel Data
When the model includes a lagged dependent variable (yt-1 on the right-hand side), fixed effects estimation is biased because the demeaned lagged dependent variable is correlated with the demeaned error. The Arellano-Bond (1991) estimator uses deeper lags of y as instruments for the differenced equation, providing consistent estimates in dynamic panels.
# Python: Arellano-Bond (using linearmodels or ivgmm)
# Note: Python support for Arellano-Bond is limited
# Manual implementation via first-difference + IV
from linearmodels.iv import IV2SLS
# First-difference the equation
df['dy'] = df.groupby('unit_id')['outcome'].diff()
df['dlag_y'] = df.groupby('unit_id')['lag_outcome'].diff()
df['dx'] = df.groupby('unit_id')['treatment'].diff()
df['lag2_y'] = df.groupby('unit_id')['outcome'].shift(2)
# 2SLS on differenced equation with lag-2 instruments
model = IV2SLS.from_formula(
'dy ~ 1 + dx + [dlag_y ~ lag2_y]',
data=df.dropna()
).fit()
print(model.summary)
* Stata: Arellano-Bond (xtabond)
xtset unit_id year
* Arellano-Bond one-step estimator
xtabond outcome treatment controls, lags(1) vce(robust)
* System GMM (Blundell-Bond): more efficient
xtdpdsys outcome treatment controls, lags(1) vce(robust)
* Arellano-Bond test for serial correlation
* AR(1) should be significant; AR(2) should NOT be
estat abond
# R: Arellano-Bond with plm
library(plm)
# Arellano-Bond one-step GMM
model_ab <- pgmm(
outcome ~ lag(outcome, 1) + treatment + controls |
lag(outcome, 2:99), # use y_{t-2} and deeper as instruments
data = df,
index = c("unit_id", "year"),
effect = "twoways",
model = "onestep"
)
summary(model_ab)
# Sargan test (overidentification)
# p > 0.05 means instruments are valid
# AR(2) test: should be insignificant
# (no second-order serial correlation in differences)
IV-2SLS Estimation Summary
==============================================================================
Dep. Variable: dy R-squared: 0.1234
Estimator: IV-2SLS Adj. R-squared: 0.1198
No. Observations: 800 F-statistic: 45.67
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
dx 0.3456 0.087 3.972 0.000 0.175 0.516
dlag_y 0.4123 0.134 3.077 0.002 0.150 0.675
==============================================================================
Endogeneity test (Wu-Hausman): 8.234 p-value: 0.0041
Instrument relevance (F-stat): 34.56
Note: Manual Arellano-Bond via first-difference + IV.
Use deeper lags as instruments for the differenced lagged dependent variable.
. xtset unit_id year
panel variable: unit_id (strongly balanced)
time variable: year, 2000 to 2010
delta: 1 unit
. xtabond outcome treatment controls, lags(1) vce(robust)
Arellano-Bond dynamic panel-data estimation Number of obs = 800
Group variable: unit_id Number of groups = 100
Time variable: year
Obs per group:
min = 8
avg = 8
max = 8
Number of instruments = 30 Wald chi2(3) = 156.78
Prob > chi2 = 0.0000
One-step results
(Std. Err. adjusted for clustering on unit_id)
------------------------------------------------------------------------------
| Robust
outcome | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
outcome |
L1. | .4123456 .0876543 4.70 0.000 .2405453 .5841459
treatment | .3456789 .0654321 5.28 0.000 .2174344 .4739234
controls | .1234567 .0432109 2.86 0.004 .0387648 .2081486
_cons | .5678901 .2345678 2.42 0.016 .1081459 1.027634
------------------------------------------------------------------------------
. estat abond
Arellano-Bond test for zero autocorrelation in first-differenced errors
+-----------------------+
|Order | z Prob > z|
|------+----------------|
| 1 |-3.4567 0.0005 |
| 2 |-0.8765 0.3809 |
+-----------------------+
H0: no autocorrelation
AR(1): z = -3.46, p = 0.001 (expected — reject)
AR(2): z = -0.88, p = 0.381 (good — do not reject)
Onestep Generalized Method of Moments Estimation
Call:
pgmm(outcome ~ lag(outcome, 1) + treatment + controls | lag(outcome,
2:99), data = df, index = c("unit_id", "year"), effect = "twoways",
model = "onestep")
Unbalanced Panel: n = 100, T = 8-10, N = 800
Number of Observations Used: 800
Residuals:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-4.23456 -0.87654 -0.01234 0.00123 0.89012 3.98765
Coefficients:
Estimate Std. Error z-value Pr(>|z|)
lag(outcome, 1) 0.41235 0.09123 4.5199 6.19e-06 ***
treatment 0.34568 0.06789 5.0915 3.56e-07 ***
controls 0.12346 0.04567 2.7037 0.00686 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Sargan test: chisq(26) = 28.456 (p-value = 0.3367)
Autocorrelation test (1): normal = -3.4567 (p-value = 0.00055)
Autocorrelation test (2): normal = -0.8765 (p-value = 0.38091)
Wald test for coefficients: chisq(3) = 156.78 (p-value = < 2.22e-16)
Arellano-Bond works best with "large N, small T" panels (many units, few time periods). With long panels, the proliferation of instruments can cause overfitting and weak identification. Always report the Sargan/Hansen test for overidentification (should be insignificant) and the AR(2) test for serial correlation (should also be insignificant to validate the moment conditions).
7.4 Nonlinear Models
Linear models assume the conditional mean of Y given X is a linear function of parameters. This breaks down when:
- Binary outcomes: Linear probability models can predict probabilities outside [0,1]
- Count data: Counts are non-negative integers, but OLS can predict negative values
- Bounded/skewed outcomes: Expenditure data, durations, and other inherently positive variables
Nonlinear models address these issues by modeling a transformed version of the outcome or using appropriate distributional assumptions.
Logit and Probit
For binary outcomes (0/1), logit and probit models estimate the probability P(Y=1|X). They differ in the link function: logit uses the logistic CDF, probit uses the normal CDF. In practice, they give very similar results—the choice is often a matter of convention (probit is traditional in labor economics, logit in epidemiology).
Raw logit/probit coefficients are in log-odds (logit) or z-scores (probit), which are hard to interpret. Always compute marginal effects—the change in probability for a unit change in X. Average marginal effects (AME) average across all observations; marginal effects at the mean (MEM) evaluate at sample means. AME is generally preferred as it doesn't depend on a potentially unrepresentative "average" person.
# Python: Logit and Probit
import statsmodels.formula.api as smf
# Logit model (binary outcome)
logit = smf.logit('hired ~ education + experience + age', data=df).fit()
print(logit.summary())
# Probit model
probit = smf.probit('hired ~ education + experience + age', data=df).fit()
# Marginal effects (more interpretable)
margeff = logit.get_margeff()
print(margeff.summary())
# Average marginal effects at specific values
margeff_at_means = logit.get_margeff(at='mean')
* Stata: Logit and Probit
* Logit
logit hired education experience age
* Probit
probit hired education experience age
* Marginal effects (average marginal effects)
margins, dydx(*)
* Marginal effects at specific values
margins, dydx(*) at(education=16 experience=5)
* Odds ratios (for logit)
logit hired education experience age, or
# R: Logit and Probit
# Logit
logit <- glm(hired ~ education + experience + age,
data = df,
family = binomial(link = "logit"))
summary(logit)
# Probit
probit <- glm(hired ~ education + experience + age,
data = df,
family = binomial(link = "probit"))
# Marginal effects with margins package
library(margins)
margeff <- margins(logit)
summary(margeff)
# Odds ratios
exp(coef(logit))
Optimization terminated successfully.
Current function value: 0.512345
Iterations 6
Logit Regression Results
==============================================================================
Dep. Variable: hired No. Observations: 1000
Model: Logit Df Residuals: 996
Method: MLE Df Model: 3
Date: Mon, 27 Jan 2026 Pseudo R-squ.: 0.2187
Converged: True Log-Likelihood: -512.35
LLR p-value: 1.23e-42 LL-Null: -655.78
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -4.2341 0.521 -8.127 0.000 -5.255 -3.213
education 0.3124 0.038 8.221 0.000 0.238 0.387
experience 0.1876 0.024 7.817 0.000 0.141 0.235
age -0.0234 0.012 -1.950 0.051 -0.047 0.000
==============================================================================
Logit Marginal Effects
=====================================
Dep. Variable: hired
Method: dydx
At: overall
==============================================================================
dy/dx std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
education 0.0654 0.008 8.175 0.000 0.050 0.081
experience 0.0393 0.005 7.860 0.000 0.030 0.049
age -0.0049 0.003 -1.633 0.102 -0.011 0.001
==============================================================================
. logit hired education experience age
Iteration 0: log likelihood = -655.78123
Iteration 1: log likelihood = -518.23456
Iteration 2: log likelihood = -512.45678
Iteration 3: log likelihood = -512.34567
Iteration 4: log likelihood = -512.34561
Logistic regression Number of obs = 1,000
LR chi2(3) = 286.87
Prob > chi2 = 0.0000
Log likelihood = -512.34561 Pseudo R2 = 0.2187
------------------------------------------------------------------------------
hired | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
education | .3124127 .0380123 8.22 0.000 .2379101 .3869153
experience | .1876234 .0240123 7.82 0.000 .1405601 .2346867
age | -.0234123 .0120123 -1.95 0.051 -.0469561 .0001315
_cons | -4.234127 .5210234 -8.13 0.000 -5.255315 -3.212939
------------------------------------------------------------------------------
. margins, dydx(*)
Average marginal effects Number of obs = 1,000
Model VCE : OIM
Expression : Pr(hired), predict()
dy/dx w.r.t. : education experience age
------------------------------------------------------------------------------
| Delta-method
| dy/dx Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
education | .0654231 .0080012 8.18 0.000 .0497411 .0811051
experience | .0392876 .0049987 7.86 0.000 .0294904 .0490848
age | -.0049012 .0030012 -1.63 0.102 -.0107834 .0009810
------------------------------------------------------------------------------
. logit hired education experience age, or
Logistic regression Number of obs = 1,000
LR chi2(3) = 286.87
Log likelihood = -512.34561 Pseudo R2 = 0.2187
------------------------------------------------------------------------------
hired | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
education | 1.366789 .0519432 8.22 0.000 1.268798 1.472756
experience | 1.206423 .0289634 7.82 0.000 1.150923 1.264567
age | .9768543 .0117234 -1.95 0.051 .9541234 .9999876
_cons | .0145234 .0075645 -8.13 0.000 .0052345 .0402567
------------------------------------------------------------------------------
Call:
glm(formula = hired ~ education + experience + age, family = binomial(link = "logit"),
data = df)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.3456 -0.8765 0.3456 0.7654 2.1234
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.234127 0.521023 -8.127 4.4e-16 ***
education 0.312413 0.038012 8.221 < 2e-16 ***
experience 0.187623 0.024012 7.817 5.4e-15 ***
age -0.023412 0.012012 -1.950 0.0512 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1311.6 on 999 degrees of freedom
Residual deviance: 1024.7 on 996 degrees of freedom
AIC: 1032.7
Number of Fisher Scoring iterations: 4
factor AME SE z p lower upper
education 0.0654 0.0080 8.1750 0.0000 0.0497 0.0811
experience 0.0393 0.0050 7.8600 0.0000 0.0295 0.0491
age -0.0049 0.0030 -1.6333 0.1024 -0.0108 0.0010
Odds ratios:
(Intercept) education experience age
0.0145234 1.3667894 1.2064231 0.9768543
Manual Marginal Effects for Probit/Logit
Raw logit/probit coefficients are not directly interpretable as the effect on the probability of the outcome. Because the link function is nonlinear, the marginal effect of a variable depends on the values of all covariates. There are three standard approaches to summarizing marginal effects:
- Average Marginal Effect (AME): Compute the marginal effect for each observation, then average. For probit: AMEj = (1/N) Σ φ(Xiβ) · βj. For logit: AMEj = (1/N) Σ Λ(Xiβ)(1 − Λ(Xiβ)) · βj.
- Marginal Effect at the Mean (MEM): Evaluate the marginal effect at the sample means of all covariates: φ(x̄β) · βj.
- Marginal Effect at Representative values (MER): Evaluate at chosen covariate profiles (e.g., education = 16, experience = 5).
AME is generally preferred because it does not depend on a potentially unrepresentative "average" person (e.g., 2.3 children). MEM is convenient but can be misleading if the mean profile is not meaningful. MER is useful for policy questions ("what is the effect for a 30-year-old with 16 years of education?"). All three coincide at the mean only if the model is linear—they typically differ for logit/probit.
# Python: Manual Marginal Effects for Probit
import statsmodels.api as sm
from scipy.stats import norm
import numpy as np
# Probit estimation
probit = sm.Probit(df['hired'], sm.add_constant(df[['education', 'experience', 'age']]))
result = probit.fit()
# Manual AME: mean of individual marginal effects
xb = result.predict(which='linear') # linear predictor
ame = np.mean(norm.pdf(xb).values[:, None] * result.params.values[1:], axis=0)
print("Manual AME:", ame)
# Automatic marginal effects — AME
mfx = result.get_margeff(at='overall')
mfx.summary()
# Automatic marginal effects — MEM
mfx_mean = result.get_margeff(at='mean')
mfx_mean.summary()
* Stata: Marginal Effects for Probit
* Probit estimation
probit hired education experience age
* Average Marginal Effects (AME) — default
margins, dydx(*)
* Marginal Effects at the Mean (MEM)
margins, dydx(*) atmeans
* Marginal effects at specific values (MER)
margins, dydx(education) at(experience=5 age=30)
# R: Manual Average Marginal Effect (AME) — Probit
probit <- glm(hired ~ education + experience + age,
family = binomial(link = "probit"), data = df)
beta <- coef(probit)
xb <- predict(probit, type = "link") # X * beta
# AME = mean of individual marginal effects
ame_education <- mean(dnorm(xb)) * beta["education"]
ame_experience <- mean(dnorm(xb)) * beta["experience"]
cat("AME (education):", round(ame_education, 4), "\n")
# Marginal Effect at the Mean (MEM)
x_bar <- colMeans(model.matrix(probit))
xb_bar <- sum(x_bar * beta)
mem_education <- dnorm(xb_bar) * beta["education"]
# Using margins package (automatic AME)
library(margins)
m <- margins(probit)
summary(m)
Optimization terminated successfully.
Current function value: 0.498231
Iterations 5
Manual AME: [0.0712 0.0421 -0.0053]
Probit Marginal Effects (AME)
=====================================
Dep. Variable: hired
Method: dydx
At: overall
==============================================================================
dy/dx std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
education 0.0712 0.009 7.911 0.000 0.054 0.089
experience 0.0421 0.005 8.420 0.000 0.032 0.052
age -0.0053 0.003 -1.767 0.077 -0.011 0.001
==============================================================================
Probit Marginal Effects (MEM)
=====================================
Dep. Variable: hired
Method: dydx
At: mean
==============================================================================
dy/dx std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
education 0.0698 0.009 7.756 0.000 0.052 0.087
experience 0.0413 0.005 8.260 0.000 0.031 0.051
age -0.0052 0.003 -1.733 0.083 -0.011 0.001
==============================================================================
. probit hired education experience age
Iteration 0: log likelihood = -655.78123
Iteration 1: log likelihood = -500.12345
Iteration 2: log likelihood = -498.23456
Iteration 3: log likelihood = -498.23100
Probit regression Number of obs = 1,000
LR chi2(3) = 315.10
Prob > chi2 = 0.0000
Log likelihood = -498.231 Pseudo R2 = 0.2403
------------------------------------------------------------------------------
hired | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
education | .1912345 .0234567 8.15 0.000 .1452601 .2372089
experience | .1131234 .0141234 8.01 0.000 .0854421 .1408047
age | -.0142345 .0074567 -1.91 0.056 -.0288494 .0003804
_cons | -2.567890 .3201234 -8.02 0.000 -3.195321 -1.940459
------------------------------------------------------------------------------
. margins, dydx(*)
Average marginal effects Number of obs = 1,000
Expression : Pr(hired), predict()
dy/dx w.r.t. : education experience age
------------------------------------------------------------------------------
| Delta-method
| dy/dx Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
education | .0712345 .0090123 7.91 0.000 .0535707 .0888983
experience | .0421234 .0050012 8.42 0.000 .0323212 .0519256
age | -.0053012 .0030012 -1.77 0.077 -.0111835 .0005811
------------------------------------------------------------------------------
. margins, dydx(*) atmeans
Conditional marginal effects Number of obs = 1,000
Expression : Pr(hired), predict()
dy/dx w.r.t. : education experience age
at : education = 14.23 (mean)
experience = 8.76 (mean)
age = 35.12 (mean)
------------------------------------------------------------------------------
| Delta-method
| dy/dx Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
education | .0698123 .0090012 7.76 0.000 .0521702 .0874544
experience | .0412876 .0049987 8.26 0.000 .0314904 .0510848
age | -.0051912 .0029987 -1.73 0.083 -.0110685 .0006861
------------------------------------------------------------------------------
AME (education): 0.0712
factor AME SE z p lower upper
education 0.0712 0.0090 7.9111 0.0000 0.0536 0.0889
experience 0.0421 0.0050 8.4200 0.0000 0.0323 0.0519
age -0.0053 0.0030 -1.7667 0.0773 -0.0112 0.0006
MEM (education): 0.0698
Comparison of AME vs MEM:
AME MEM Difference
education 0.0712 0.0698 0.0014
experience 0.0421 0.0413 0.0008
age -0.0053 -0.0052 0.0001
Multinomial Logit
When the outcome is categorical with more than two unordered categories (e.g., occupation choice, transport mode, brand preference), we use the multinomial logit model. It estimates the probability of each category relative to a base category: log(P(Y=j) / P(Y=base)) = Xβj. The model produces one set of coefficients per non-base outcome. Coefficients are in terms of relative risk ratios (exponentiated) or log-odds relative to the base.
Multinomial logit assumes that the ratio of probabilities between any two alternatives is independent of the other alternatives available (IIA property). This is the famous "red bus/blue bus" problem: adding a blue bus should not change the relative odds of taking a car vs. a red bus, but IIA forces it to. If IIA is implausible, consider nested logit, mixed logit, or multinomial probit instead. You can test IIA with the Hausman-McFadden test or the Small-Hsiao test.
# Python: Multinomial Logit
import statsmodels.api as sm
# Multinomial logit using statsmodels
mlogit = sm.MNLogit(df['occupation'],
sm.add_constant(df[['education', 'experience', 'age']]))
result = mlogit.fit()
print(result.summary())
# Predicted probabilities for each category
pred_probs = result.predict()
print(pred_probs.head())
# Marginal effects
mfx = result.get_margeff()
mfx.summary()
* Stata: Multinomial Logit
* Multinomial logit with base outcome 1
mlogit occupation education experience age, baseoutcome(1)
* Relative risk ratios (exponentiated coefficients)
mlogit, rrr
* Predicted probabilities
predict p1 p2 p3, pr
* Marginal effects for a specific outcome
margins, dydx(*) predict(outcome(2))
# R: Multinomial Logit
library(nnet)
# Multinomial logit: choice among J alternatives
mlogit <- multinom(occupation ~ education + experience + age, data = df)
summary(mlogit)
# Predicted probabilities
pred_probs <- predict(mlogit, type = "probs")
head(pred_probs)
# Relative risk ratios (exponentiated coefficients)
exp(coef(mlogit))
# Marginal effects: dP_j/dx_k = P_j * (beta_jk - sum_m P_m * beta_mk)
library(margins)
m <- margins(mlogit)
Optimization terminated successfully.
MNLogit Regression Results
==============================================================================
Dep. Variable: occupation No. Observations: 1000
Model: MNLogit Df Residuals: 992
Method: MLE Df Model: 6
Date: Mon, 27 Jan 2026 Pseudo R-squ.: 0.1456
Log-Likelihood: -876.54 LL-Null: -1025.8
LLR p-value: 2.34e-61
==============================================================================
occupation=2 coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const -2.1234 0.412 -5.155 0.000 -2.931 -1.316
education 0.2345 0.031 7.565 0.000 0.174 0.295
experience 0.0876 0.021 4.171 0.000 0.047 0.129
age -0.0123 0.009 -1.367 0.172 -0.030 0.005
------------------------------------------------------------------------------
occupation=3 coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const -3.4567 0.523 -6.610 0.000 -4.482 -2.432
education 0.3456 0.038 9.095 0.000 0.271 0.420
experience 0.1234 0.025 4.936 0.000 0.074 0.173
age 0.0087 0.011 0.791 0.429 -0.013 0.030
==============================================================================
Predicted probabilities (first 5 obs):
occupation=1 occupation=2 occupation=3
0 0.4523 0.3214 0.2263
1 0.3876 0.3567 0.2557
2 0.5123 0.2876 0.2001
3 0.2987 0.3876 0.3137
4 0.4234 0.3123 0.2643
. mlogit occupation education experience age, baseoutcome(1)
Iteration 0: log likelihood = -1025.8123
Iteration 1: log likelihood = -882.4567
Iteration 2: log likelihood = -876.5678
Iteration 3: log likelihood = -876.5432
Multinomial logistic regression Number of obs = 1,000
LR chi2(6) = 298.54
Prob > chi2 = 0.0000
Log likelihood = -876.5432 Pseudo R2 = 0.1456
------------------------------------------------------------------------------
occupation | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
1 | (base outcome)
-------------+----------------------------------------------------------------
2 |
education | .2345123 .0310234 7.57 0.000 .1737076 .2953170
experience | .0876234 .0210123 4.17 0.000 .0464399 .1288069
age | -.0123456 .0090234 -1.37 0.172 -.0300312 .0053400
_cons | -2.123456 .4120234 -5.16 0.000 -2.930987 -1.315925
-------------+----------------------------------------------------------------
3 |
education | .3456789 .0380123 9.10 0.000 .2711760 .4201818
experience | .1234567 .0250123 4.94 0.000 .0744334 .1724800
age | .0087654 .0110234 0.80 0.426 -.0128400 .0303708
_cons | -3.456789 .5230123 -6.61 0.000 -4.481874 -2.431704
------------------------------------------------------------------------------
Call: multinom(formula = occupation ~ education + experience + age, data = df) Coefficients: (Intercept) education experience age 2 -2.123456 0.2345123 0.0876234 -0.012346 3 -3.456789 0.3456789 0.1234567 0.008765 Std. Errors: (Intercept) education experience age 2 0.4120234 0.0310234 0.0210123 0.009023 3 0.5230123 0.0380123 0.0250123 0.011023 Residual Deviance: 1753.086 AIC: 1769.086 Relative risk ratios: (Intercept) education experience age 2 0.1196234 1.2643567 1.0915234 0.987716 3 0.0316789 1.4130234 1.1313567 1.008804
Ordered Logit/Probit
When the outcome is ordinal—categories with a natural ordering but unknown spacing (e.g., satisfaction: 1=very dissatisfied to 5=very satisfied, credit ratings, self-reported health)—we use ordered logit (proportional odds model) or ordered probit. These models estimate a latent continuous variable Y* = Xβ + ε, with cutpoints α1 < α2 < … < αJ−1 that map the latent variable to observed categories. The key assumption is proportional odds: the effect of X is the same across all cutpoints.
The proportional odds (parallel lines) assumption means that the coefficient β shifts the entire latent distribution uniformly, regardless of which cutpoint you consider. Test it with the Brant test. If it fails, consider a generalized ordered logit (where coefficients vary by cutpoint) or a multinomial logit that does not impose ordering.
# Python: Ordered Logit/Probit
from statsmodels.miscmodels.ordinal_model import OrderedModel
# Ordered logit
ologit = OrderedModel(df['satisfaction'],
df[['income', 'education', 'age']],
distr='logit')
result = ologit.fit(method='bfgs')
print(result.summary())
# Ordered probit
oprobit = OrderedModel(df['satisfaction'],
df[['income', 'education', 'age']],
distr='probit')
result_p = oprobit.fit(method='bfgs')
# Predicted probabilities for each category
pred_probs = result.predict()
print(pred_probs.head())
* Stata: Ordered Logit/Probit
* Ordered logit
ologit satisfaction income education age
* Ordered probit
oprobit satisfaction income education age
* Predicted probabilities for each category
predict p1 p2 p3 p4 p5, pr
* Marginal effects for a specific outcome category
margins, dydx(*) predict(outcome(3))
* Brant test for proportional odds assumption
brant
# R: Ordered Logit/Probit
library(MASS)
# Ordered logit (proportional odds model)
ologit <- polr(factor(satisfaction) ~ income + education + age,
data = df, method = "logistic")
summary(ologit)
# Ordered probit
oprobit <- polr(factor(satisfaction) ~ income + education + age,
data = df, method = "probit")
# Predicted probabilities for each category
pred_probs <- predict(ologit, type = "probs")
# Brant test for proportional odds assumption
library(brant)
brant(ologit)
Optimization terminated successfully.
OrderedModel Results
==============================================================================
Dep. Variable: satisfaction Log-Likelihood: -1234.5
Model: OrderedModel AIC: 2483.0
Method: Maximum Likelihood BIC: 2517.3
No. Observations: 1000
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
income 0.0234 0.004 5.850 0.000 0.016 0.031
education 0.1567 0.028 5.596 0.000 0.102 0.212
age 0.0098 0.006 1.633 0.103 -0.002 0.022
1/2 -1.2345 0.234 -5.276 0.000 -1.693 -0.776
2/3 -0.1234 0.198 -0.623 0.533 -0.512 0.265
3/4 0.8765 0.201 4.361 0.000 0.483 1.270
4/5 2.1234 0.267 7.952 0.000 1.600 2.647
==============================================================================
Predicted probabilities (first 5 obs):
P(Y=1) P(Y=2) P(Y=3) P(Y=4) P(Y=5)
0 0.0876 0.1543 0.2876 0.3123 0.1582
1 0.0654 0.1234 0.2567 0.3456 0.2089
2 0.1123 0.1876 0.3012 0.2765 0.1224
3 0.0432 0.0987 0.2234 0.3567 0.2780
4 0.0765 0.1456 0.2789 0.3234 0.1756
. ologit satisfaction income education age
Iteration 0: log likelihood = -1456.7890
Iteration 1: log likelihood = -1238.4567
Iteration 2: log likelihood = -1234.5678
Iteration 3: log likelihood = -1234.5432
Ordered logistic regression Number of obs = 1,000
LR chi2(3) = 444.49
Prob > chi2 = 0.0000
Log likelihood = -1234.5432 Pseudo R2 = 0.1525
------------------------------------------------------------------------------
satisfaction | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
income | .0234123 .0040012 5.85 0.000 .0155702 .0312544
education | .1567234 .0280123 5.60 0.000 .1018202 .2116266
age | .0098765 .0060456 1.63 0.103 -.0019726 .0217256
-------------+----------------------------------------------------------------
/cut1 | -1.234567 .2340234 -1.693244 -.775890
/cut2 | -.1234567 .1980234 -.511575 .264662
/cut3 | .8765432 .2010234 .482545 1.270541
/cut4 | 2.123456 .2670234 1.600099 2.646813
------------------------------------------------------------------------------
. brant
Brant Test of Parallel Regression Assumption
chi2 p>chi2 df
-----------------------------------------
All 8.234 0.2216 6
income 2.123 0.5467 2
education 3.456 0.1776 2
age 1.876 0.3912 2
-----------------------------------------
A non-significant test indicates that the proportional
odds assumption has NOT been violated.
Call:
polr(formula = factor(satisfaction) ~ income + education + age,
data = df, method = "logistic")
Coefficients:
Value Std. Error t value
income 0.02341 0.004001 5.8510
education 0.15672 0.028012 5.5950
age 0.00988 0.006046 1.6340
Intercepts:
Value Std. Error t value
1|2 -1.2346 0.2340 -5.2762
2|3 -0.1235 0.1980 -0.6234
3|4 0.8765 0.2010 4.3610
4|5 2.1234 0.2670 7.9520
Residual Deviance: 2469.09
AIC: 2483.09
--------------------------------------------
Brant Test (H0: Parallel Regression)
X2 df probability
Omnibus Test 8.234 6 0.2216
income 2.123 2 0.5467
education 3.456 2 0.1776
age 1.876 2 0.3912
--------------------------------------------
Parallel regression assumption holds (p > 0.05)
Bivariate Probit
The bivariate probit model jointly estimates two binary outcomes whose error terms may be correlated: Y1* = X1β1 + ε1 and Y2* = X2β2 + ε2, where corr(ε1, ε2) = ρ. Key applications include: (1) selection models where the outcome equation and selection equation are correlated, (2) IV estimation when both the endogenous variable and the outcome are binary, and (3) modeling any two inherently related binary decisions (e.g., labor force participation and home ownership).
When you have a binary endogenous regressor and a binary outcome, standard 2SLS can still be used and is consistent (though not efficient) if the first stage is well-specified. However, bivariate probit is more efficient under correct specification because it models the joint distribution. The test of ρ = 0 serves as an endogeneity test. If ρ is not significantly different from zero, single-equation probit suffices.
# Python: Bivariate Probit via custom MLE
# No standard package — implement via maximum likelihood
from scipy.optimize import minimize
from scipy.stats import mvn
import numpy as np
def biprobit_ll(params, X1, X2, y1, y2):
k1 = X1.shape[1]
k2 = X2.shape[1]
beta1 = params[:k1]
beta2 = params[k1:k1+k2]
rho = np.tanh(params[-1]) # ensure -1 < rho < 1
xb1 = X1 @ beta1
xb2 = X2 @ beta2
# Log-likelihood using bivariate normal CDF
ll = 0
for i in range(len(y1)):
lower = [-np.inf if y1[i]==0 else 0,
-np.inf if y2[i]==0 else 0]
upper = [0 if y1[i]==0 else np.inf,
0 if y2[i]==0 else np.inf]
# Shift by linear predictor
lower_adj = [lower[0]-xb1[i], lower[1]-xb2[i]]
upper_adj = [upper[0]-xb1[i], upper[1]-xb2[i]]
cov = np.array([[1, rho],[rho, 1]])
prob, _ = mvn.mvnun(lower_adj, upper_adj, [0,0], cov)
ll += np.log(max(prob, 1e-10))
return -ll
# In practice, use Stata or R for bivariate probit
* Stata: Bivariate Probit
* Bivariate probit with exclusion restriction
biprobit (y1 = x1 x2 z1) (y2 = x1 x2)
* Test rho = 0 (exogeneity test)
* Reported automatically in output as Wald test of rho=0
* Predicted probabilities
predict p11, p11 /* Pr(Y1=1, Y2=1) */
predict p1, pmarg1 /* Pr(Y1=1) marginal */
# R: Bivariate Probit using GJRM
library(GJRM)
# Specify two equations (with exclusion restriction in eq 1)
f.list <- list(y1 ~ x1 + x2 + z1, # equation 1
y2 ~ x1 + x2) # equation 2
# Bivariate probit estimation
biprobit <- gjrm(f.list, data = df,
margins = c("probit", "probit"),
Model = "B")
summary(biprobit)
# Alternatively, using mvProbit (simulated ML)
# library(mvProbit)
# mvProbit(cbind(y1, y2) ~ x1 + x2, data = df)
# Bivariate probit typically estimated in Stata or R.
# Python custom MLE output:
Optimization terminated successfully.
Estimated parameters:
Equation 1 (y1):
const: -0.8765 (0.2345)
x1: 0.4567 (0.0876)
x2: 0.2345 (0.0654)
z1: 0.3210 (0.0987)
Equation 2 (y2):
const: -0.5432 (0.1987)
x1: 0.3456 (0.0765)
x2: 0.1876 (0.0543)
rho: 0.3456 (0.0876)
Wald test of rho=0: chi2(1) = 15.56, p = 0.0001
. biprobit (y1 = x1 x2 z1) (y2 = x1 x2)
Fitting comparison equation 1:
Iteration 0: log likelihood = -567.8901
Iteration 1: log likelihood = -523.4567
Fitting comparison equation 2:
Iteration 0: log likelihood = -612.3456
Iteration 1: log likelihood = -578.9012
Fitting full model:
Iteration 0: log likelihood = -1098.7654
Iteration 1: log likelihood = -1045.6789
Iteration 2: log likelihood = -1042.3456
Iteration 3: log likelihood = -1042.3412
Bivariate probit regression Number of obs = 1,000
Wald chi2(7) = 187.65
Log likelihood = -1042.3412 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
y1 |
x1 | .4567123 .0876234 5.21 0.000 .2849736 .6284510
x2 | .2345678 .0654321 3.58 0.000 .1063233 .3628123
z1 | .3210456 .0987654 3.25 0.001 .1274690 .5146222
_cons | -.8765432 .2345678 -3.74 0.000 -1.336287 -.416799
-------------+----------------------------------------------------------------
y2 |
x1 | .3456789 .0765432 4.51 0.000 .1956570 .4957008
x2 | .1876543 .0543210 3.45 0.001 .0811870 .2941216
_cons | -.5432109 .1987654 -2.73 0.006 -.9327839 -.1536379
-------------+----------------------------------------------------------------
/athrho | .3598765 .0912345 3.95 0.000 .1810601 .5386929
-------------+----------------------------------------------------------------
rho | .3456123 .0812345 .1793456 .4924567
------------------------------------------------------------------------------
Wald test of rho=0: chi2(1) = 15.5612 Prob > chi2 = 0.0001
GJRM Bivariate Probit Model
Equation 1: y1 ~ x1 + x2 + z1
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.876543 0.234568 -3.737 0.0002 ***
x1 0.456712 0.087623 5.213 0.0000 ***
x2 0.234568 0.065432 3.584 0.0003 ***
z1 0.321046 0.098765 3.250 0.0012 **
Equation 2: y2 ~ x1 + x2
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.543211 0.198765 -2.733 0.0063 **
x1 0.345679 0.076543 4.515 0.0000 ***
x2 0.187654 0.054321 3.454 0.0006 ***
Estimated rho: 0.3456 (SE: 0.0812)
Wald test of rho=0: chi2 = 15.56, p-value = 0.0001
n = 1000, Log-Lik = -1042.34
Control Function in Nonlinear Models
Standard 2SLS does not produce consistent estimates in nonlinear models (probit, logit, Poisson) because the endogeneity bias does not average out as it does in the linear case. The control function approach (Rivers & Vuong, 1988) provides a solution: (1) estimate a first-stage OLS regression of the endogenous variable on instruments and controls, (2) include the first-stage residuals as an additional regressor in the nonlinear second stage. The coefficient on the residuals serves as an endogeneity test: if significantly different from zero, the variable is endogenous.
Naive standard errors from step 2 are incorrect because they ignore the estimation uncertainty from step 1 (the generated regressor problem). Always bootstrap the entire two-step procedure to obtain valid standard errors and confidence intervals. This means re-estimating both the first stage and second stage in each bootstrap iteration.
For the linear version of the control function approach, see Module 6D: Control Function.
# Python: Control Function Approach for Probit
import statsmodels.api as sm
import numpy as np
from sklearn.utils import resample
# Step 1: First-stage OLS
X_fs = sm.add_constant(df[['instrument', 'controls']])
fs = sm.OLS(df['endogenous_x'], X_fs).fit()
df['v_hat'] = fs.resid
# Step 2: Control function probit (include first-stage residuals)
X_cf = sm.add_constant(df[['endogenous_x', 'controls', 'v_hat']])
cf_probit = sm.Probit(df['y'], X_cf).fit()
print(cf_probit.summary())
# Test endogeneity: check significance of v_hat
print(f"v_hat coef: {cf_probit.params['v_hat']:.4f}")
print(f"v_hat p-value: {cf_probit.pvalues['v_hat']:.4f}")
# Step 3: Bootstrap for correct SEs
boot_coefs = []
for _ in range(999):
boot_df = resample(df)
X_fs_b = sm.add_constant(boot_df[['instrument', 'controls']])
fs_b = sm.OLS(boot_df['endogenous_x'], X_fs_b).fit()
boot_df['v_hat'] = fs_b.resid
X_cf_b = sm.add_constant(boot_df[['endogenous_x', 'controls', 'v_hat']])
cf_b = sm.Probit(boot_df['y'], X_cf_b).fit(disp=0)
boot_coefs.append(cf_b.params.values)
boot_se = np.std(boot_coefs, axis=0)
print("Bootstrap SEs:", boot_se)
* Stata: Control Function in Probit
* Step 1: First stage OLS
reg endogenous_x instrument controls
predict v_hat, resid
* Step 2: Probit with first-stage residuals
probit y endogenous_x controls v_hat
* Test endogeneity (is v_hat significant?)
test v_hat
* Step 3: Bootstrap for correct SEs
bootstrap, reps(999): ///
probit_cf y endogenous_x controls, ///
endog(endogenous_x) inst(instrument)
* Or use ivprobit directly (MLE-based)
ivprobit y controls (endogenous_x = instrument)
# R: Control Function Approach for Probit
# Step 1: First-stage OLS
first_stage <- lm(endogenous_x ~ instrument + controls, data = df)
df$v_hat <- residuals(first_stage)
# Step 2: Probit with residuals (Rivers-Vuong, 1988)
cf_probit <- glm(y ~ endogenous_x + controls + v_hat,
family = binomial(link = "probit"), data = df)
summary(cf_probit)
# Test endogeneity: is v_hat significant?
# H0: endogenous_x is exogenous (v_hat coef = 0)
# Step 3: Bootstrap for correct standard errors
library(boot)
cf_boot <- function(data, indices) {
d <- data[indices, ]
fs <- lm(endogenous_x ~ instrument + controls, data = d)
d$v_hat <- residuals(fs)
ss <- glm(y ~ endogenous_x + controls + v_hat,
family = binomial(link = "probit"), data = d)
coef(ss)
}
boot_results <- boot(df, cf_boot, R = 999)
# Bootstrap SEs
apply(boot_results$t, 2, sd)
Optimization terminated successfully.
Current function value: 0.467234
Iterations 6
Control Function Probit Results
==============================================================================
Dep. Variable: y No. Observations: 1000
Model: Probit Df Residuals: 996
Method: MLE Df Model: 3
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const -1.2345 0.312 -3.957 0.000 -1.846 -0.623
endogenous_x 0.4567 0.098 4.660 0.000 0.265 0.649
controls 0.1234 0.045 2.742 0.006 0.035 0.212
v_hat -0.2876 0.087 -3.306 0.001 -0.458 -0.117
==============================================================================
v_hat coef: -0.2876
v_hat p-value: 0.0009 (endogeneity confirmed)
Bootstrap SEs (999 replications):
const: 0.3345 endogenous_x: 0.1123 controls: 0.0498 v_hat: 0.0956
Note: Naive SEs underestimate uncertainty. Bootstrap SEs are larger.
. reg endogenous_x instrument controls
Source | SS df MS Number of obs = 1,000
-------------+---------------------------------- F(2, 997) = 123.45
Model | 234.567890 2 117.283945 Prob > F = 0.0000
Residual | 876.543210 997 .879182 R-squared = 0.2112
-------------+---------------------------------- Root MSE = .93762
Total | 1111.11110 999 1.1122233
------------------------------------------------------------------------------
endogenous_x | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
instrument | .5678901 .0456789 12.43 0.000 .4782345 .6575457
controls | .2345678 .0345678 6.79 0.000 .1667543 .3023813
_cons | 1.234567 .1234567 10.00 0.000 .9923456 1.476789
------------------------------------------------------------------------------
. probit y endogenous_x controls v_hat
Probit regression Number of obs = 1,000
Wald chi2(3) = 67.89
Prob > chi2 = 0.0000
Log likelihood = -467.2345 Pseudo R2 = 0.2567
------------------------------------------------------------------------------
y | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
endogenous_x | .4567234 .0980123 4.66 0.000 .2646228 .6488240
controls | .1234567 .0450123 2.74 0.006 .0352342 .2116792
v_hat | -.2876543 .0870234 -3.31 0.001 -.4582171 -.1170915
_cons | -1.234567 .3120345 -3.96 0.000 -1.846143 -.622991
------------------------------------------------------------------------------
. test v_hat
( 1) [y]v_hat = 0
chi2( 1) = 10.93
Prob > chi2 = 0.0009 (endogeneity confirmed)
. ivprobit y controls (endogenous_x = instrument)
Probit model with endogenous regressors Number of obs = 1,000
Wald chi2(2) = 45.67
Log likelihood = -467.8901 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
y | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
endogenous_x | .4523456 .1123456 4.03 0.000 .2321523 .6725389
controls | .1198765 .0498765 2.40 0.016 .0221202 .2176328
_cons | -1.212345 .3345678 -3.62 0.000 -1.868086 -.556604
------------------------------------------------------------------------------
Wald test of exogeneity: chi2(1) = 10.89 Prob > chi2 = 0.0010
Step 1 — First Stage:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.23457 0.12346 10.000 <2e-16 ***
instrument 0.56789 0.04568 12.432 <2e-16 ***
controls 0.23457 0.03457 6.786 1.8e-11 ***
---
F-statistic: 123.5 on 2 and 997 DF, p-value: < 2.2e-16
Step 2 — Control Function Probit:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.234567 0.312035 -3.957 0.00008 ***
endogenous_x 0.456723 0.098012 4.660 3.2e-06 ***
controls 0.123457 0.045012 2.743 0.00609 **
v_hat -0.287654 0.087023 -3.306 0.00095 ***
Endogeneity test (v_hat): z = -3.306, p = 0.00095
=> Reject H0 of exogeneity
Bootstrap Standard Errors (999 replications):
Naive SE Boot SE Ratio
(Intercept) 0.3120 0.3345 1.072
endogenous_x 0.0980 0.1123 1.146
controls 0.0450 0.0498 1.107
v_hat 0.0870 0.0956 1.099
Poisson and Negative Binomial (Count Data)
Count outcomes (number of patents, doctor visits, accidents) require models that respect their non-negative, integer nature. The Poisson model assumes E[Y|X] = exp(Xβ) and Var(Y|X) = E[Y|X]—the mean equals the variance. This "equidispersion" assumption often fails in practice; when variance exceeds the mean (overdispersion), use negative binomial instead.
An important insight: Poisson regression with robust standard errors is consistent even if the data aren't Poisson-distributed, as long as the conditional mean is correctly specified. This "Poisson Pseudo-MLE" is widely used in trade economics for gravity equations (Santos Silva & Tenreyro, 2006) because it handles zeros naturally and is robust to heteroskedasticity.
# Python: Poisson Regression
import statsmodels.formula.api as smf
# Poisson (count data)
poisson = smf.poisson('num_patents ~ rd_spending + firm_size', data=df).fit()
print(poisson.summary())
# Negative binomial (allows overdispersion)
from statsmodels.discrete.discrete_model import NegativeBinomial
nb = smf.negativebinomial('num_patents ~ rd_spending + firm_size', data=df).fit()
# Robust Poisson (PPML for trade gravity equations)
poisson_robust = smf.poisson('trade ~ distance + gdp_origin + gdp_dest', data=df).fit(
cov_type='HC1'
)
* Stata: Count Data Models
* Poisson
poisson num_patents rd_spending firm_size
* Negative binomial
nbreg num_patents rd_spending firm_size
* Poisson with robust SEs (PPML)
ppmlhdfe trade distance gdp_origin gdp_dest, absorb(origin dest) cluster(pair)
* Check for overdispersion
estat gof
# R: Count Data Models
# Poisson
poisson <- glm(num_patents ~ rd_spending + firm_size,
data = df,
family = poisson())
summary(poisson)
# Negative binomial
library(MASS)
nb <- glm.nb(num_patents ~ rd_spending + firm_size, data = df)
summary(nb)
# With fixest (fast, with fixed effects)
library(fixest)
poisson_fe <- fepois(num_patents ~ rd_spending | firm_id + year,
data = df)
Optimization terminated successfully.
Current function value: 2.345678
Iterations 5
Poisson Regression Results
==============================================================================
Dep. Variable: num_patents No. Observations: 800
Model: Poisson Df Residuals: 797
Method: MLE Df Model: 2
Date: Mon, 27 Jan 2026 Pseudo R-squ.: 0.1876
Converged: True Log-Likelihood: -1876.5
LLR p-value: 3.45e-38 LL-Null: -2310.2
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.4523 0.087 5.199 0.000 0.282 0.623
rd_spending 0.0234 0.003 7.800 0.000 0.018 0.029
firm_size 0.0876 0.012 7.300 0.000 0.064 0.111
==============================================================================
Interpretation: A 1 million increase in R&D spending is associated with
exp(0.0234) = 1.024 times more patents (2.4% increase).
Mean of num_patents: 4.32
Variance of num_patents: 12.87
Dispersion ratio: 2.98 (evidence of overdispersion)
. poisson num_patents rd_spending firm_size
Iteration 0: log likelihood = -2310.2345
Iteration 1: log likelihood = -1889.4567
Iteration 2: log likelihood = -1876.7890
Iteration 3: log likelihood = -1876.5432
Iteration 4: log likelihood = -1876.5432
Poisson regression Number of obs = 800
LR chi2(2) = 867.38
Prob > chi2 = 0.0000
Log likelihood = -1876.5432 Pseudo R2 = 0.1876
------------------------------------------------------------------------------
num_patents | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
rd_spending | .0234123 .0030012 7.80 0.000 .0175301 .0292945
firm_size | .0876234 .0120012 7.30 0.000 .0640816 .1111652
_cons | .4523456 .0870123 5.20 0.000 .2818036 .6228876
------------------------------------------------------------------------------
. estat gof
Goodness-of-fit chi2 = 2345.678
Prob > chi2(797) = 0.0000
Note: Large chi2 suggests overdispersion. Consider negative binomial.
. nbreg num_patents rd_spending firm_size
Negative binomial regression Number of obs = 800
LR chi2(2) = 234.56
Dispersion = mean Prob > chi2 = 0.0000
Log likelihood = -1654.3210 Pseudo R2 = 0.0662
------------------------------------------------------------------------------
num_patents | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
rd_spending | .0198765 .0042345 4.70 0.000 .0115771 .0281759
firm_size | .0756432 .0165432 4.57 0.000 .0432191 .1080673
_cons | .5234567 .1123456 4.66 0.000 .3032615 .7436519
-------------+----------------------------------------------------------------
/lnalpha | .4567891 .0876543 .2849888 .6285894
-------------+----------------------------------------------------------------
alpha | 1.578912 .1384567 1.329876 1.874987
------------------------------------------------------------------------------
Likelihood-ratio test of alpha=0: chibar2(01) = 444.44 Prob >= chibar2 = 0.000
Call:
glm(formula = num_patents ~ rd_spending + firm_size, family = poisson(),
data = df)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.4567 -1.2345 -0.2345 0.8765 4.5678
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.452346 0.087012 5.199 2.01e-07 ***
rd_spending 0.023412 0.003001 7.800 6.21e-15 ***
firm_size 0.087623 0.012001 7.300 2.87e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 4620.5 on 799 degrees of freedom
Residual deviance: 3753.1 on 797 degrees of freedom
AIC: 3759.1
Overdispersion test: ratio = 2.98 (variance/mean)
Consider negative binomial regression.
Call: glm.nb(formula = num_patents ~ rd_spending + firm_size)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.523457 0.112346 4.659 3.17e-06 ***
rd_spending 0.019877 0.004235 4.695 2.67e-06 ***
firm_size 0.075643 0.016543 4.573 4.80e-06 ***
Theta: 1.5789
Std. Err.: 0.1385
2 x log-likelihood: -3304.642
7.5 MLE and GMM
Maximum Likelihood Estimation (MLE) and Generalized Method of Moments (GMM) are the two general frameworks underlying most econometric estimators. Understanding them helps you: (1) implement custom models not available in standard packages, (2) understand the properties of standard estimators, and (3) diagnose estimation problems.
Maximum Likelihood (MLE)
Find parameters that maximize the probability of observing the data.
- Requires specifying the full distribution of Y|X
- Efficient if the distribution is correct
- Inconsistent if distribution is misspecified
- Examples: Logit, Probit, Tobit, Poisson
Generalized Method of Moments (GMM)
Find parameters that make moment conditions hold in the sample.
- Only requires specifying moment conditions E[g(Y, X, θ)] = 0
- More robust but less efficient than MLE
- Includes OLS and IV as special cases
- Examples: IV/2SLS, panel GMM, difference GMM
Custom MLE
Sometimes you need to estimate a model that isn't available in standard packages. The approach is straightforward: (1) write down the log-likelihood function, (2) use numerical optimization to find the maximum. The code below shows a simple example—linear regression estimated via MLE rather than OLS.
The key components are: (1) defining the negative log-likelihood function (negative because optimizers minimize by default), (2) choosing starting values, and (3) selecting an optimization algorithm (BFGS is a good default). Standard errors come from the inverse of the Hessian matrix at the optimum. For complex models, check that the optimizer converged and try multiple starting values.
# Python: Custom MLE
from scipy.optimize import minimize
import numpy as np
def neg_log_likelihood(params, X, y):
"""Negative log-likelihood for linear regression."""
beta = params[:-1]
sigma = params[-1]
if sigma <= 0:
return np.inf
y_hat = X @ beta
residuals = y - y_hat
# Normal log-likelihood
ll = -0.5 * len(y) * np.log(2 * np.pi * sigma**2)
ll -= np.sum(residuals**2) / (2 * sigma**2)
return -ll # negative because we minimize
# Estimate
X = np.column_stack([np.ones(len(df)), df['x1'], df['x2']])
y = df['y'].values
initial = np.zeros(X.shape[1] + 1)
initial[-1] = 1 # initial sigma
result = minimize(neg_log_likelihood, initial, args=(X, y), method='BFGS')
print("MLE estimates:", result.x)
* Stata: Custom MLE with ml
* Define the log-likelihood function
program mylogit_ll
args lnf xb
quietly replace `lnf' = $ML_y1 * `xb' - ln(1 + exp(`xb'))
end
* Estimate
ml model lf mylogit_ll (outcome = x1 x2)
ml maximize
# R: Custom MLE
# Define negative log-likelihood
neg_ll <- function(params, X, y) {
beta <- params[-length(params)]
sigma <- params[length(params)]
if (sigma <= 0) return(Inf)
y_hat <- X %*% beta
resid <- y - y_hat
ll <- -0.5 * length(y) * log(2 * pi * sigma^2)
ll <- ll - sum(resid^2) / (2 * sigma^2)
return(-ll)
}
# Estimate
X <- cbind(1, df$x1, df$x2)
y <- df$y
initial <- c(rep(0, ncol(X)), 1)
result <- optim(initial, neg_ll, X = X, y = y, method = "BFGS", hessian = TRUE)
print(result$par)
# Standard errors from Hessian
se <- sqrt(diag(solve(result$hessian)))
Optimization terminated successfully.
Current function value: 567.234567
Iterations: 23
Function evaluations: 45
Gradient evaluations: 45
MLE estimates: [ 2.34567123 1.87654321 0.54321098 1.23456789]
Parameter interpretation:
beta_0 (intercept): 2.3457
beta_1 (x1): 1.8765
beta_2 (x2): 0.5432
sigma: 1.2346
Comparison with OLS:
MLE OLS
Intercept 2.3457 2.3457
x1 1.8765 1.8765
x2 0.5432 0.5432
Sigma/RMSE 1.2346 1.2346
Note: MLE and OLS give identical point estimates for linear regression
with normally distributed errors. MLE uses n in denominator for sigma,
OLS uses (n-k) giving slightly larger estimate.
. ml model lf mylogit_ll (outcome = x1 x2)
. ml maximize
initial: log likelihood = -345.67891
alternative: log likelihood = -234.56789
rescale: log likelihood = -189.01234
Iteration 0: log likelihood = -189.01234
Iteration 1: log likelihood = -156.78901
Iteration 2: log likelihood = -145.67890
Iteration 3: log likelihood = -145.23456
Iteration 4: log likelihood = -145.23451
Iteration 5: log likelihood = -145.23451
Maximum likelihood estimation Number of obs = 500
Wald chi2(2) = 123.45
Log likelihood = -145.23451 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
outcome | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x1 | 1.234567 .1234567 10.00 0.000 .9925922 1.476542
x2 | .5678901 .0876543 6.48 0.000 .3960899 .7396903
_cons | -2.345678 .3456789 -6.79 0.000 -3.023197 -1.668159
------------------------------------------------------------------------------
Note: Custom MLE converged successfully.
Log-likelihood at maximum: -145.23451
$par
[1] 2.34567123 1.87654321 0.54321098 1.23456789
$value
[1] 567.2346
$counts
function gradient
45 45
$convergence
[1] 0
$message
NULL
Parameter estimates:
Intercept: 2.3457
x1: 1.8765
x2: 0.5432
sigma: 1.2346
Standard errors (from Hessian):
SE(Intercept): 0.2134
SE(x1): 0.1023
SE(x2): 0.0876
SE(sigma): 0.0392
95% Confidence intervals:
Estimate SE Lower Upper
Intercept 2.3457 0.2134 1.9274 2.7640
x1 1.8765 0.1023 1.6760 2.0770
x2 0.5432 0.0876 0.3715 0.7149
sigma 1.2346 0.0392 1.1578 1.3114
Log-likelihood at maximum: -567.2346
AIC: 1142.469
BIC: 1159.234
Newton-Raphson Implementation
Newton-Raphson is the iterative algorithm behind most MLE estimators. Understanding it helps diagnose convergence issues. The update rule is: θt+1 = θt − H−1(θt) · g(θt), where g is the gradient (score) and H is the Hessian.
import numpy as np
from scipy.special import expit # logistic function
def newton_raphson(X, y, tol=1e-8, max_iter=100):
"""Newton-Raphson for logistic regression."""
beta = np.zeros(X.shape[1])
for iteration in range(max_iter):
# Predicted probabilities
p = expit(X @ beta)
# Score (gradient)
score = X.T @ (y - p)
# Hessian
W = np.diag(p * (1 - p))
hessian = -X.T @ W @ X
# Update
delta = np.linalg.solve(hessian, score)
beta = beta - delta
if np.max(np.abs(delta)) < tol:
print(f"Converged in {iteration + 1} iterations")
break
se = np.sqrt(np.diag(np.linalg.inv(-hessian)))
return {'coef': beta, 'se': se, 'iterations': iteration + 1}
# Apply
X = np.column_stack([np.ones(len(df)), df['education'],
df['experience'], df['age']])
y = df['hired'].values
result = newton_raphson(X, y)
* Stata uses Newton-Raphson by default for ml
* You can see iterations:
logit hired education experience age, iterate(100) trace
* Mata implementation of Newton-Raphson
mata:
void newton_raphson(string scalar yvar, string scalar xvars) {
y = st_data(., yvar)
X = st_data(., tokens(xvars))
X = J(rows(X),1,1), X // add constant
beta = J(cols(X),1,0) // starting values
for (iter=1; iter<=100; iter++) {
p = invlogit(X*beta)
score = X'*(y - p)
W = diag(p :* (1:-p))
H = -X'*W*X
delta = lusolve(H, score)
beta = beta - delta
if (max(abs(delta)) < 1e-8) break
}
se = sqrt(diagonal(luinv(-H)))
beta, se
}
newton_raphson("hired", "education experience age")
end
# Newton-Raphson for logistic regression (manual implementation)
newton_raphson <- function(X, y, tol = 1e-8, max_iter = 100) {
beta <- rep(0, ncol(X)) # starting values
for (iter in 1:max_iter) {
# Predicted probabilities
p <- plogis(X %*% beta) # 1 / (1 + exp(-Xb))
# Score (gradient of log-likelihood)
score <- t(X) %*% (y - p)
# Hessian (second derivative of log-likelihood)
W <- diag(as.vector(p * (1 - p)))
hessian <- -t(X) %*% W %*% X
# Newton-Raphson update: beta_new = beta - H^{-1} * score
delta <- solve(hessian) %*% score
beta <- beta - delta
# Check convergence
if (max(abs(delta)) < tol) {
cat("Converged in", iter, "iterations\n")
break
}
}
# Standard errors from information matrix
se <- sqrt(diag(solve(-hessian)))
list(coefficients = as.vector(beta), se = se,
iterations = iter, hessian = hessian)
}
# Apply to data
X <- cbind(1, df$education, df$experience, df$age)
y <- df$hired
nr_result <- newton_raphson(X, y)
# Compare with glm()
glm_result <- glm(hired ~ education + experience + age,
family = binomial, data = df)
cbind(NR = nr_result$coefficients, GLM = coef(glm_result))
# Alternative: nlm() — uses Newton-type algorithm
neg_ll <- function(beta, X, y) {
p <- plogis(X %*% beta)
-sum(y * log(p) + (1 - y) * log(1 - p))
}
nlm_result <- nlm(neg_ll, rep(0, 4), X = X, y = y, hessian = TRUE)
Converged in 6 iterations
Newton-Raphson Logit Estimates:
==============================================================================
coef se z P>|z|
------------------------------------------------------------------------------
Intercept -4.2341 0.5210 -8.127 0.000
education 0.3124 0.0380 8.221 0.000
experience 0.1876 0.0240 7.817 0.000
age -0.0234 0.0120 -1.950 0.051
==============================================================================
Comparison with statsmodels logit():
NR statsmodels
Intercept -4.2341 -4.2341
education 0.3124 0.3124
experience 0.1876 0.1876
age -0.0234 -0.0234
Note: Estimates match exactly — Newton-Raphson is the algorithm
statsmodels uses internally for logit/probit.
. logit hired education experience age, iterate(100) trace
Iteration 0: log likelihood = -655.78123
education: beta = 0.00000000
experience: beta = 0.00000000
age: beta = 0.00000000
_cons: beta = 0.00000000
Iteration 1: log likelihood = -518.23456
education: beta = 0.23456789
experience: beta = 0.14567890
age: beta = -0.01567890
_cons: beta = -3.12345678
Iteration 2: log likelihood = -512.45678
Iteration 3: log likelihood = -512.34567
Iteration 4: log likelihood = -512.34561
Iteration 5: log likelihood = -512.34561
Logistic regression Number of obs = 1,000
LR chi2(3) = 286.87
Prob > chi2 = 0.0000
Log likelihood = -512.34561 Pseudo R2 = 0.2187
------------------------------------------------------------------------------
hired | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
education | .3124127 .0380123 8.22 0.000 .2379101 .3869153
experience | .1876234 .0240123 7.82 0.000 .1405601 .2346867
age | -.0234123 .0120123 -1.95 0.051 -.0469561 .0001315
_cons | -4.234127 .5210234 -8.13 0.000 -5.255315 -3.212939
------------------------------------------------------------------------------
. mata:
: newton_raphson("hired", "education experience age")
Converged in 5 iterations.
beta se
+--------------------------+
1 | .3124127 .0380123 |
2 | .1876234 .0240123 |
3 | -.0234123 .0120123 |
4 | -4.234127 .5210234 |
+--------------------------+
: end
Converged in 6 iterations
$coefficients
[1] -4.23412700 0.31241270 0.18762340 -0.02341230
$se
[1] 0.52102340 0.03801230 0.02401230 0.01201230
$iterations
[1] 6
Comparison with glm():
NR GLM
Intercept -4.234127 -4.234127
education 0.312413 0.312413
experience 0.187623 0.187623
age -0.023412 -0.023412
Note: Manual NR and glm() produce identical estimates.
When optim(method='BFGS') or nlm() fail to converge, understanding Newton-Raphson helps you diagnose the problem: bad starting values, flat likelihood, or numerical Hessian issues. Try: (1) better starting values (e.g., from a simpler model), (2) BFGS which approximates the Hessian, (3) rescaling variables.
Nonlinear Least Squares
NLS minimizes the sum of squared residuals for nonlinear functions. Unlike OLS, it requires starting values and iterative optimization. This makes it suitable for models like the CES production function, exponential growth, or any specification that cannot be linearized.
from scipy.optimize import curve_fit
import numpy as np
# Define nonlinear function
def ces_production(X, a, alpha, rho):
K, L = X
return a * (alpha * K**rho + (1 - alpha) * L**rho)**(1/rho)
# Fit NLS
popt, pcov = curve_fit(ces_production,
(df['capital'], df['labor']),
df['output'],
p0=[1, 0.5, -0.5])
print(f"a={popt[0]:.4f}, alpha={popt[1]:.4f}, rho={popt[2]:.4f}")
print(f"SEs: {np.sqrt(np.diag(pcov))}")
# Exponential growth
def exp_growth(x, a, r):
return a * np.exp(r * x)
popt2, pcov2 = curve_fit(exp_growth, df['year'], df['gdp'], p0=[100, 0.02])
* Nonlinear least squares
nl (output = {a} * ({alpha} * capital^{rho} + ///
(1-{alpha}) * labor^{rho})^(1/{rho})), ///
initial(a 1 alpha 0.5 rho -0.5)
* Simpler exponential growth
nl (gdp = {a} * exp({r} * year)), initial(a 100 r 0.02)
* Display results
nlcom _b[/a] _b[/r]
# Nonlinear Least Squares
# Example: CES production function Y = A * (alpha*K^rho + (1-alpha)*L^rho)^(1/rho)
# NLS for truly nonlinear model
nls_fit <- nls(output ~ a * (alpha * capital^rho + (1 - alpha) * labor^rho)^(1/rho),
data = df,
start = list(a = 1, alpha = 0.5, rho = -0.5),
control = nls.control(maxiter = 200))
summary(nls_fit)
# Simpler example: exponential growth
nls_growth <- nls(gdp ~ a * exp(r * year), data = df,
start = list(a = 100, r = 0.02))
summary(nls_growth)
coef(nls_growth)
confint(nls_growth) # profile likelihood CIs
# Residual diagnostics
plot(residuals(nls_fit) ~ fitted(nls_fit))
CES Production Function — NLS Estimates: a=1.2345, alpha=0.4567, rho=-0.3210 SEs: [0.0876 0.0345 0.0543] Interpretation: A (total factor productivity): 1.2345 alpha (capital share): 0.4567 rho (substitution parameter): -0.3210 Elasticity of substitution: 1/(1-rho) = 1/(1+0.321) = 0.757 Exponential Growth — NLS Estimates: a (initial level): 102.3456 r (growth rate): 0.0234 (2.34% annual growth)
. nl (output = {a} * ({alpha} * capital^{rho} + (1-{alpha}) * labor^{rho})^(1/{rho})), ///
> initial(a 1 alpha 0.5 rho -0.5)
(obs = 500)
Iteration 0: residual SS = 12345.678
Iteration 1: residual SS = 8765.432
Iteration 2: residual SS = 6543.210
Iteration 3: residual SS = 5678.901
Iteration 4: residual SS = 5654.321
Iteration 5: residual SS = 5654.123
Source | SS df MS
-------------+----------------------------------
Model | 34567.890 3 11522.630 Number of obs = 500
Residual | 5654.123 497 11.376 R-squared = 0.8594
-------------+---------------------------------- Adj R-squared = 0.8586
Total | 40222.013 500 80.444 Root MSE = 3.3729
------------------------------------------------------------------------------
output | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
/a | 1.234567 .0876543 14.09 0.000 1.062605 1.406529
/alpha | .4567890 .0345678 13.22 0.000 .3889025 .5246755
/rho | -.3210123 .0543210 -5.91 0.000 -.4276874 -.2143372
------------------------------------------------------------------------------
. nl (gdp = {a} * exp({r} * year)), initial(a 100 r 0.02)
Source | SS df MS
-------------+----------------------------------
Model | 9876543.2 2 4938271.6 Number of obs = 50
Residual | 12345.7 48 257.2 R-squared = 0.9988
-------------+---------------------------------- Root MSE = 16.037
Total | 9888888.9 50 197777.8
------------------------------------------------------------------------------
gdp | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
/a | 102.3456 4.567890 22.40 0.000 93.16123 111.52997
/r | .023412 .001234 18.97 0.000 .020930 .025894
------------------------------------------------------------------------------
Formula: output ~ a * (alpha * capital^rho + (1 - alpha) * labor^rho)^(1/rho)
Parameters:
Estimate Std. Error t value Pr(>|t|)
a 1.23457 0.08765 14.085 < 2e-16 ***
alpha 0.45679 0.03457 13.214 < 2e-16 ***
rho -0.32101 0.05432 -5.909 5.87e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.373 on 497 degrees of freedom
Number of iterations to convergence: 5
Achieved convergence tolerance: 4.567e-07
Exponential growth model:
Formula: gdp ~ a * exp(r * year)
Parameters:
Estimate Std. Error t value Pr(>|t|)
a 102.3456 4.5679 22.404 < 2e-16 ***
r 0.0234 0.0012 18.969 < 2e-16 ***
Waiting for profiling to be done...
2.5% 97.5%
a 93.234567 111.567890
r 0.020934 0.025912
NLS is sensitive to starting values. If estimation fails, try: (1) grid search over parameter space, (2) starting from OLS on a linearized version, (3) simulated annealing for global search. The nls() Gauss-Newton algorithm can fail if the model is nearly linear at the starting values.
GMM Estimation
Generalized Method of Moments (GMM) defines moment conditions E[g(Y, X, θ)] = 0. When you have more moments than parameters (overidentification), GMM finds the optimal combination. OLS and IV are special cases of GMM.
# GMM via linearmodels or manual implementation
from linearmodels.iv import IV2SLS
import numpy as np
# IV-GMM using linearmodels
iv_gmm = IV2SLS(df['y'], df[['exo_x']], df[['endo_x']],
df[['z1', 'z2']]).fit(cov_type='robust')
print(iv_gmm.summary)
# Manual two-step GMM
def gmm_objective(theta, Y, X, Z, W):
e = Y - X @ theta
g_bar = Z.T @ e / len(Y) # sample moment conditions
return g_bar.T @ W @ g_bar
# Step 1: Identity weight matrix
from scipy.optimize import minimize
W1 = np.eye(Z.shape[1])
res1 = minimize(gmm_objective, x0=np.zeros(X.shape[1]),
args=(Y, X, Z, W1))
# Step 2: Optimal weight matrix
e1 = Y - X @ res1.x
S = (Z * e1[:, None]).T @ (Z * e1[:, None]) / len(Y)
W2 = np.linalg.inv(S)
res2 = minimize(gmm_objective, x0=res1.x,
args=(Y, X, Z, W2))
print("Two-step GMM:", res2.x)
* GMM estimation — IV example
gmm (y - {b0} - {b1}*endo_x - {b2}*exo_x), ///
instruments(exo_x z1 z2) ///
twostep winitial(unadjusted, independent)
* With HAC standard errors
gmm (y - {b0} - {b1}*endo_x - {b2}*exo_x), ///
instruments(exo_x z1 z2) ///
twostep vce(hac nwest 3)
* Compare with ivreg2
ivreg2 y exo_x (endo_x = z1 z2), gmm2s robust
* Hansen J-test reported automatically
library(gmm)
# Example 1: IV estimation via GMM
# Moment conditions: E[Z * (Y - X*beta)] = 0
# With more instruments than endogenous variables
# Define moment function
g <- function(theta, x) {
# x columns: Y, endogenous_X, exogenous_X, instrument1, instrument2
Y <- x[, 1]
X <- cbind(1, x[, 2:3]) # constant, endogenous, exogenous
Z <- cbind(1, x[, 3:5]) # constant, exogenous, instruments
# Moment conditions: Z' * (Y - X*beta) = 0
e <- Y - X %*% theta
Z * as.vector(e) # each column is a moment condition
}
# GMM estimation
data_mat <- as.matrix(df[, c("Y", "endo_x", "exo_x", "z1", "z2")])
gmm_result <- gmm(g, x = data_mat, t0 = c(0, 0, 0),
type = "twoStep",
vcov = "HAC")
summary(gmm_result)
# J-test for overidentification
# Reported automatically in summary — should NOT reject H0
# Example 2: Simple method of moments
# Estimate mean and variance using moment conditions
g_simple <- function(theta, x) {
m1 <- x - theta[1] # E[X - mu] = 0
m2 <- (x - theta[1])^2 - theta[2] # E[(X-mu)^2 - sigma2] = 0
cbind(m1, m2)
}
gmm_simple <- gmm(g_simple, x = df$Y, t0 = c(0, 1))
summary(gmm_simple)
IV-2SLS Estimation Summary
==============================================================================
Dep. Variable: y R-squared: 0.2345
Estimator: IV-2SLS Adj. R-squared: 0.2314
No. Observations: 500 F-statistic: 67.89
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
exo_x 0.4567 0.087 5.250 0.000 0.286 0.627
endo_x 1.2345 0.234 5.276 0.000 0.776 1.693
==============================================================================
Endogeneity test: 8.765 (p = 0.003)
Overidentification (J-stat): 1.234 (p = 0.267)
H0: All instruments are valid — do not reject.
Two-step GMM (manual):
Step 1 (identity W): [0.4321, 1.1876, 0.4512]
Step 2 (optimal W): [0.4567, 1.2345, 0.4567]
Two-step GMM: [0.4567 1.2345 0.4567]
. gmm (y - {b0} - {b1}*endo_x - {b2}*exo_x), ///
> instruments(exo_x z1 z2) ///
> twostep winitial(unadjusted, independent)
Step 1
Iteration 0: GMM criterion Q(b) = .12345678
Iteration 1: GMM criterion Q(b) = .00234567
Iteration 2: GMM criterion Q(b) = .00012345
Step 2
Iteration 0: GMM criterion Q(b) = .00234567
Iteration 1: GMM criterion Q(b) = .00001234
Iteration 2: GMM criterion Q(b) = .00001234
GMM estimation
Number of parameters = 3
Number of moments = 4
Initial weight matrix: Unadjusted Number of obs = 500
(Std. Err. adjusted for
2 steps)
------------------------------------------------------------------------------
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
/b0 | .4567890 .2345678 1.95 0.052 -.0029554 .9165334
/b1 | 1.234567 .2340123 5.28 0.000 .7759108 1.693223
/b2 | .4567123 .0876543 5.21 0.000 .2849120 .6285126
------------------------------------------------------------------------------
Instruments for equation 1: exo_x z1 z2 _cons
Hansen's J chi2(1) = 1.23456 (p = 0.2665)
H0: All instruments are valid — do not reject.
. ivreg2 y exo_x (endo_x = z1 z2), gmm2s robust
Number of obs = 500
Wald chi2(2) = 135.67
Prob > chi2 = 0.0000
R-squared = 0.2345
Robust
------------------------------------------------------------------------------
y | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
endo_x | 1.234567 .2340123 5.28 0.000 .7759108 1.693223
exo_x | .4567123 .0876543 5.21 0.000 .2849120 .6285126
_cons | .4567890 .2345678 1.95 0.052 -.0029554 .9165334
------------------------------------------------------------------------------
Hansen J statistic: 1.235 Chi-sq(1) P-val = 0.2665
Call:
gmm(g = g, x = data_mat, t0 = c(0, 0, 0), type = "twoStep",
vcov = "HAC")
Method: twoStep
Coefficients:
Estimate Std. Error t value Pr(>|t|)
Theta[1] 0.4567890 0.2345678 1.9472 0.0515
Theta[2] 1.2345670 0.2340123 5.2757 1.33e-07 ***
Theta[3] 0.4567123 0.0876543 5.2105 1.89e-07 ***
J-Test: degrees of freedom is 1
J-test P-value
Test E(g)=0: 1.23456 0.26654
Note: Do not reject H0 — instruments are valid.
Simple Method of Moments (mean and variance):
Coefficients:
Estimate Std. Error t value Pr(>|t|)
Theta[1] 4.567890 0.098765 46.2456 < 2e-16 ***
Theta[2] 2.345678 0.187654 12.4987 < 2e-16 ***
(Mean estimate matches sample mean; variance matches sample variance)
MLE
- Requires full distributional assumption (e.g., Y ~ Normal)
- Most efficient estimator if the distribution is correctly specified
- Inconsistent if the distribution is misspecified
- Examples: Logit, Probit, Tobit, Poisson
GMM
- Only requires moment conditions E[g(Y, X, θ)] = 0
- Less efficient than correctly-specified MLE
- More robust: consistent even without full distributional assumptions
- Includes OLS and IV as special cases
Quantile Regression
OLS estimates the conditional mean E[Y|X]. Quantile regression estimates conditional quantiles Qτ[Y|X]. This is useful for studying heterogeneous effects across the distribution. LAD (Least Absolute Deviations) is quantile regression at τ = 0.5 (the median).
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
# Quantile regression
qr_50 = smf.quantreg('wage ~ education + experience + female', df).fit(q=0.5)
print(qr_50.summary())
# Multiple quantiles
quantiles = [0.1, 0.25, 0.5, 0.75, 0.9]
results = {}
for q in quantiles:
results[q] = smf.quantreg('wage ~ education + experience + female', df).fit(q=q)
# Plot coefficients across quantiles
import matplotlib.pyplot as plt
education_coefs = [results[q].params['education'] for q in quantiles]
education_ci_low = [results[q].conf_int().loc['education', 0] for q in quantiles]
education_ci_high = [results[q].conf_int().loc['education', 1] for q in quantiles]
plt.fill_between(quantiles, education_ci_low, education_ci_high, alpha=0.2)
plt.plot(quantiles, education_coefs, 'o-')
plt.axhline(y=ols.params['education'], color='r', linestyle='--', label='OLS')
plt.xlabel('Quantile'); plt.ylabel('Education coefficient')
plt.legend(); plt.show()
* Quantile regression
qreg wage education experience female, quantile(0.50)
* Multiple quantiles
sqreg wage education experience female, quantiles(25 50 75 90) reps(200)
* Bootstrap standard errors
bsqreg wage education experience female, quantile(0.50) reps(200)
* Interquantile range regression
iqreg wage education experience female, quantiles(25 75) reps(200)
* Plot coefficients across quantiles
grqreg education experience female, ci ols olsci
library(quantreg)
# Quantile regression at different quantiles
rq_25 <- rq(wage ~ education + experience + female, data = df, tau = 0.25)
rq_50 <- rq(wage ~ education + experience + female, data = df, tau = 0.50)
rq_75 <- rq(wage ~ education + experience + female, data = df, tau = 0.75)
rq_90 <- rq(wage ~ education + experience + female, data = df, tau = 0.90)
summary(rq_50) # LAD / median regression
summary(rq_50, se = "boot") # bootstrap standard errors
# Multiple quantiles at once
rq_all <- rq(wage ~ education + experience + female, data = df,
tau = seq(0.1, 0.9, by = 0.1))
plot(summary(rq_all)) # coefficient plots across quantiles
# Compare OLS vs quantile regression
ols <- lm(wage ~ education + experience + female, data = df)
cat("OLS education effect:", coef(ols)["education"], "\n")
cat("Q25 education effect:", coef(rq_25)["education"], "\n")
cat("Q75 education effect:", coef(rq_75)["education"], "\n")
# Test equality across quantiles (Wald test)
anova(rq_25, rq_75) # are Q25 and Q75 coefficients equal?
QuantReg Regression Results (q=0.50)
==============================================================================
Dep. Variable: wage Pseudo R-squared: 0.2814
Model: QuantReg Bandwidth: 1.234
Method: Least Squares Sparsity: 3.456
No. Observations: 2000 Df Model: 3
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 1.2345 0.342 3.610 0.000 0.564 1.905
education 0.8721 0.045 19.380 0.000 0.784 0.960
experience 0.0342 0.005 6.840 0.000 0.024 0.044
female -2.1543 0.287 -7.508 0.000 -2.717 -1.591
==============================================================================
Education coefficient across quantiles:
Q10: 0.6234 Q25: 0.7512 Q50: 0.8721 Q75: 1.0234 Q90: 1.2187
OLS: 0.8945
Generated figures:
Simultaneous quantile regression Number of obs = 2,000
bootstrap(200) SEs .25 Pseudo R2 = 0.2156
.50 Pseudo R2 = 0.2814
.75 Pseudo R2 = 0.2534
.90 Pseudo R2 = 0.2189
------------------------------------------------------------------------------
| Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
q25 |
education | .7512345 .0523456 14.35 0.000 .6485123 .8539567
experience | .0287654 .0056234 5.12 0.000 .0177321 .0397987
female | -1.876543 .3124567 -6.01 0.000 -2.489123 -1.263963
_cons | .8234567 .3765432 2.19 0.029 .0847123 1.562201
-------------+----------------------------------------------------------------
q50 |
education | .8721234 .0452345 19.28 0.000 .7833456 .9609012
experience | .0341876 .0048765 7.01 0.000 .0246123 .0437629
female | -2.154312 .2876543 -7.49 0.000 -2.718567 -1.590057
_cons | 1.234567 .3421098 3.61 0.000 .5632456 1.905888
-------------+----------------------------------------------------------------
q75 |
education | 1.023456 .0587654 17.41 0.000 .9081234 1.138789
experience | .0398765 .0061234 6.51 0.000 .0278654 .0518876
female | -2.432156 .3234567 -7.52 0.000 -3.066567 -1.797745
_cons | .5678901 .4012345 1.42 0.157 -.2190876 1.354868
------------------------------------------------------------------------------
Call: rq(formula = wage ~ education + experience + female, tau = 0.5, data = df)
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) 1.2345 0.3420 3.6100 0.0003
education 0.8721 0.0452 19.2920 0.0000
experience 0.0342 0.0049 6.9800 0.0000
female -2.1543 0.2877 -7.4890 0.0000
OLS education effect: 0.8945
Q25 education effect: 0.7512
Q75 education effect: 1.0235
If the education coefficient increases across quantiles (e.g., larger at Q90 than Q10), it suggests education has a bigger wage effect for high earners--consistent with skill complementarity or superstar effects. If it decreases, education is an equalizer. OLS only captures the average effect and misses this heterogeneity.
7.6 Bootstrap Methods
Bootstrap is a resampling method that estimates the sampling distribution of a statistic by repeatedly drawing samples (with replacement) from the original data. It's especially useful when:
- Analytical formulas don't exist: Complex estimators, ratios, quantiles
- Asymptotic approximations are poor: Small samples, heavy-tailed distributions
- Testing with few clusters: Wild cluster bootstrap for reliable inference with 10-50 clusters
Nonparametric: Resample observations with replacement (shown below). Works for most situations.
Cluster bootstrap: Resample entire clusters, not individual observations. Essential for clustered data.
Wild cluster bootstrap: A variant that works well with few clusters (Cameron, Gelbach & Miller, 2008). Particularly important for difference-in-differences with few treated states/periods.
# Python: Bootstrap
import numpy as np
import statsmodels.formula.api as smf
def bootstrap_coef(df, formula, n_bootstrap=1000):
"""Bootstrap standard errors for regression coefficients."""
n = len(df)
coefs = []
for _ in range(n_bootstrap):
# Resample with replacement
sample = df.sample(n, replace=True)
model = smf.ols(formula, data=sample).fit()
coefs.append(model.params.values)
coefs = np.array(coefs)
return {
'mean': coefs.mean(axis=0),
'se': coefs.std(axis=0),
'ci_lower': np.percentile(coefs, 2.5, axis=0),
'ci_upper': np.percentile(coefs, 97.5, axis=0)
}
results = bootstrap_coef(df, 'y ~ x1 + x2')
print("Bootstrap SEs:", results['se'])
* Stata: Bootstrap
* Set seed for reproducibility (alternative to inline seed())
set seed 42
* Simple bootstrap
bootstrap _b, reps(1000) seed(12345): reg outcome treatment controls
* Post-estimation: display all bootstrap statistics
estat bootstrap, all
* Extract percentile confidence intervals
matrix list e(ci_percentile)
* Cluster bootstrap (for clustered data)
bootstrap _b, reps(1000) seed(12345) cluster(school_id): ///
reg outcome treatment controls
* Pairs cluster bootstrap (Stata 18+)
vce(bootstrap, cluster(school_id) reps(1000))
# R: Bootstrap
library(boot)
# Function to extract coefficients
boot_fn <- function(data, indices) {
d <- data[indices, ]
model <- lm(y ~ x1 + x2, data = d)
return(coef(model))
}
# Run bootstrap
results <- boot(df, boot_fn, R = 1000)
print(results)
# Confidence intervals
boot.ci(results, type = c("perc", "bca"), index = 2)
# Cluster bootstrap
library(fwildclusterboot)
model <- lm(outcome ~ treatment, data = df)
boottest(model, param = "treatment", clustid = ~ school_id, B = 999)
When bootstrapping models with factor (categorical) variables, some resampled datasets may not contain all levels of a factor. For example, if your model includes factor(region) with levels North, South, East, West, a bootstrap resample might contain no observations from "East." This causes problems:
- The model matrix will have fewer dummy columns than the original model
- Coefficient positions may shift, making extraction by name or index unreliable
- Some bootstrap iterations may produce errors or NA values
Fix (R): Pre-define factor levels before the loop: df$region <- factor(df$region, levels = c("North", "South", "East", "West")). R will then create dummies for all levels even if a level is absent in the resample (the dummy column will be all zeros, and its coefficient will be NA). Extract coefficients by name, not position, and handle NAs gracefully.
Fix (Python): Use pd.Categorical(df['region'], categories=[...]) or use C(region) in the statsmodels formula, which handles this automatically.
Running bootstrap with 1000 replications...
[========================================] 100% Complete
Bootstrap Results (1000 replications):
======================================
OLS Est. Bootstrap SE 95% CI Lower 95% CI Upper
Intercept 2.3457 0.2187 1.9234 2.7812
x1 1.8765 0.1034 1.6723 2.0798
x2 0.5432 0.0912 0.3645 0.7234
Comparison of Standard Errors:
OLS SE Bootstrap SE Ratio
Intercept 0.2134 0.2187 1.025
x1 0.1023 0.1034 1.011
x2 0.0876 0.0912 1.041
Note: Bootstrap SEs are slightly larger, suggesting mild heteroskedasticity.
Percentile confidence intervals are asymmetric, which is appropriate for
finite-sample inference.
. bootstrap _b, reps(1000) seed(12345): reg outcome treatment controls
(running regress on estimation sample)
Bootstrap replications (1000)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
.................................................. 50
.................................................. 100
.................................................. 150
.................................................. 200
.................................................. 250
.................................................. 300
.................................................. 350
.................................................. 400
.................................................. 450
.................................................. 500
.................................................. 550
.................................................. 600
.................................................. 650
.................................................. 700
.................................................. 750
.................................................. 800
.................................................. 850
.................................................. 900
.................................................. 950
.................................................. 1000
Linear regression Number of obs = 500
Replications = 1,000
------------------------------------------------------------------------------
| Observed Bootstrap Normal-based
outcome | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
treatment | 1.872431 .1876543 9.98 0.000 1.504635 2.240227
controls | .4215328 .0998765 4.22 0.000 .2257782 .6172874
_cons | 2.453124 .3234567 7.59 0.000 1.819161 3.087087
------------------------------------------------------------------------------
. bootstrap _b, reps(1000) seed(12345) cluster(school_id): ///
> reg outcome treatment controls
Bootstrap replications (1000)
(output omitted)
Linear regression Number of obs = 2,500
Replications = 1,000
(Replications based on 50 clusters in school_id)
------------------------------------------------------------------------------
| Observed Bootstrap Normal-based
outcome | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
treatment | 2.154312 .2456789 8.77 0.000 1.672789 2.635835
controls | .3821456 .1234567 3.10 0.002 .1401729 .6241183
_cons | 1.287645 .4876543 2.64 0.008 .3318595 2.243431
------------------------------------------------------------------------------
Note: Cluster bootstrap with 50 clusters
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = df, statistic = boot_fn, R = 1000)
Bootstrap Statistics :
original bias std. error
t1* 2.345671 0.002345123 0.2187654
t2* 1.876543 -0.001234567 0.1034567
t3* 0.543211 0.000876543 0.0912345
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates
CALL :
boot.ci(boot.out = results, type = c("perc", "bca"), index = 2)
Intervals :
Level Percentile BCa
95% ( 1.672, 2.080 ) ( 1.668, 2.076 )
Calculations and calculation for BCa
estimate se 2.5% 97.5%
x1 1.8765 0.1035 1.6723 2.0798
Wild Cluster Bootstrap (fwildclusterboot):
==========================================
boottest(model, param = "treatment", clustid = ~ school_id, B = 999)
Estimate: 2.154312
Std. Err. (clustered): 0.217384
Wild Cluster Bootstrap Confidence Interval (Rademacher weights):
95% CI: [1.7234, 2.5876]
Bootstrap p-value: 0.000
Number of clusters: 50
Number of bootstrap iterations: 999
Note: Wild cluster bootstrap is recommended when number of clusters < 50.
- Cameron, A.C. & Trivedi, P. (2005). Microeconometrics: Methods and Applications. Cambridge.
- Wooldridge, J. (2010). Econometric Analysis of Cross Section and Panel Data. MIT Press.
- Cameron, A.C. & Miller, D. (2015). "A Practitioner's Guide to Cluster-Robust Inference." JHR.
- Abadie, A., Athey, S., Imbens, G., & Wooldridge, J. (2023). "When Should You Adjust Standard Errors for Clustering?" QJE.