7 Estimation Methods
Learning Objectives
- Distinguish between identification and estimation
- Choose appropriate standard errors for your data structure
- Handle panel data with fixed effects efficiently
- Implement nonlinear models (logit, probit, Poisson)
- Understand maximum likelihood and GMM estimation
Table of Contents
- 7.1 Identification vs. Estimation
- 7.2 Standard Errors and Inference
- 7.3 Panel Data Methods
- 7.4 Nonlinear Models
- 7.5 MLE and GMM
- 7.6 Bootstrap Methods
7.1 Identification vs. Estimation
A common source of confusion: identification and estimation are different problems.
Identification (Module 6)
Can we learn the causal effect from data?
- Research design and assumptions
- What variation identifies the effect?
- Parallel trends, exclusion restriction, etc.
- Whether your estimate is causal or just correlational
Estimation (This Module)
How do we compute the estimate from data?
- OLS, MLE, GMM, nonparametric methods
- Standard errors and inference
- Handling large datasets efficiently
- Correcting for clustering, heteroskedasticity
You can have correct identification but bad estimation (wrong standard errors), or good estimation but no identification (precise estimate of a biased number). Both matter.
7.2 Standard Errors and Inference
Choosing the right standard errors is essential for valid inference. Using incorrect standard errors can lead to dramatically wrong conclusions—confidence intervals that are too narrow (false precision) or too wide (lost power). The two main considerations are:
- Heteroskedasticity: When the variance of the error term varies across observations (e.g., income variance differs by wealth level)
- Clustering: When observations are correlated within groups (e.g., students in the same classroom, repeated observations of the same person)
Robust (Heteroskedasticity-Consistent) Standard Errors
Classical OLS standard errors assume homoskedasticity: constant error variance across all observations. In practice, this rarely holds. Robust standard errors (also called Huber-White or sandwich standard errors) remain valid under heteroskedasticity without requiring you to model its form.
The key parameter is cov_type='HC1' (Python), robust or vce(robust) (Stata), or vcov = vcovHC() (R). HC0-HC3 are variants with different small-sample corrections—HC1 is the most common. In the output, compare standard errors: robust SEs are often larger than classical OLS SEs when heteroskedasticity is present, leading to wider confidence intervals and more conservative inference.
# Python: Robust Standard Errors
import statsmodels.formula.api as smf
# Default (homoskedastic) standard errors
model = smf.ols('outcome ~ treatment + controls', data=df).fit()
# Robust (heteroskedasticity-consistent) HC1
model_robust = smf.ols('outcome ~ treatment + controls', data=df).fit(
cov_type='HC1'
)
print(model_robust.summary())
* Stata: Robust Standard Errors
* Default standard errors
reg outcome treatment controls
* Robust (heteroskedasticity-consistent) - ALWAYS use this
reg outcome treatment controls, robust
* Equivalent: vce(robust)
reg outcome treatment controls, vce(robust)
# R: Robust Standard Errors
library(sandwich)
library(lmtest)
# Fit model
model <- lm(outcome ~ treatment + controls, data = df)
# Robust standard errors (HC1)
coeftest(model, vcov = vcovHC(model, type = "HC1"))
# Using fixest (cleaner syntax)
library(fixest)
model <- feols(outcome ~ treatment + controls, data = df, vcov = "hetero")
summary(model)
OLS Regression Results (Robust)
==============================================================================
Dep. Variable: outcome R-squared: 0.342
Model: OLS Adj. R-squared: 0.339
Method: Least Squares F-statistic: 127.4
Date: Mon, 27 Jan 2026 Prob (F-statistic): 1.23e-47
No. Observations: 500
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 2.4531 0.312 7.861 0.000 1.839 3.067
treatment 1.8724 0.187 10.014 0.000 1.505 2.240
controls 0.4215 0.098 4.301 0.000 0.229 0.614
==============================================================================
Notes: Standard errors are heteroskedasticity robust (HC1)
Linear regression Number of obs = 500
F(2, 497) = 127.38
Prob > F = 0.0000
R-squared = 0.3418
Root MSE = 2.1847
------------------------------------------------------------------------------
| Robust
outcome | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
treatment | 1.872431 .1869847 10.01 0.000 1.505087 2.239775
controls | .4215328 .0980127 4.30 0.000 .2289842 .6140814
_cons | 2.453124 .3120458 7.86 0.000 1.839434 3.066814
------------------------------------------------------------------------------
t test of coefficients (Heteroskedasticity-consistent):
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.453124 0.312046 7.8612 2.35e-14 ***
treatment 1.872431 0.186985 10.0138 < 2.2e-16 ***
controls 0.421533 0.098013 4.3008 2.01e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Clustered Standard Errors
When observations are correlated within groups, standard errors that ignore this correlation will be too small—often dramatically so. Common examples include:
- Students within schools: Students in the same school share teachers, facilities, and peer effects
- Workers within firms: Workers at the same company share management, culture, and industry shocks
- Repeated observations: The same individual observed multiple times over a panel
- Geographic clustering: Individuals in the same state/region share local policies and economic conditions
The clustering variable is specified via cov_kwds={'groups': df['cluster_var']} (Python), cluster(cluster_var) (Stata), or cluster = ~cluster_var (R/fixest). The output will report the number of clusters—a rule of thumb is that you need at least 30-50 clusters for clustered standard errors to be reliable. With few clusters, consider wild cluster bootstrap (shown in Section 7.6).
# Python: Clustered Standard Errors
import statsmodels.formula.api as smf
# Cluster at the school level
model = smf.ols('outcome ~ treatment', data=df).fit(
cov_type='cluster',
cov_kwds={'groups': df['school_id']}
)
# Two-way clustering (e.g., state and year)
from linearmodels.panel import PanelOLS
df_panel = df.set_index(['state', 'year'])
model = PanelOLS.from_formula('outcome ~ treatment', data=df_panel)
results = model.fit(cov_type='clustered', cluster_entity=True, cluster_time=True)
* Stata: Clustered Standard Errors
* One-way clustering
reg outcome treatment controls, cluster(school_id)
* Two-way clustering
* ssc install reghdfe
reghdfe outcome treatment, absorb(state year) cluster(state year)
* Check number of clusters
tab school_id, matrow(clusters)
display r(r)
# R: Clustered Standard Errors
library(fixest)
# One-way clustering
model <- feols(outcome ~ treatment + controls,
data = df,
cluster = ~ school_id)
summary(model)
# Two-way clustering
model <- feols(outcome ~ treatment | state + year,
data = df,
cluster = ~ state + year)
# With sandwich package
library(sandwich)
model <- lm(outcome ~ treatment, data = df)
vcovCL(model, cluster = ~ school_id)
PanelOLS Estimation Summary
================================================================================
Dep. Variable: outcome R-squared: 0.2847
Estimator: PanelOLS R-squared (Between): 0.3124
No. Observations: 2500 R-squared (Within): 0.2847
Date: Mon, 27 Jan 2026 R-squared (Overall): 0.2956
Cov. Estimator: Clustered
F-statistic: 98.237
Entities: 50 P-value 0.0000
Time periods: 50 Distribution: F(1,2448)
Cluster Var: Entity & Time
Parameter Estimates
================================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
--------------------------------------------------------------------------------
treatment 2.1543 0.2174 9.9098 0.0000 1.7276 2.5810
================================================================================
Number of entity clusters: 50
Number of time clusters: 50
Linear regression Number of obs = 2,500
F(2, 49) = 65.82
Prob > F = 0.0000
R-squared = 0.2847
Root MSE = 1.9234
(Std. Err. adjusted for 50 clusters in school_id)
------------------------------------------------------------------------------
| Robust
outcome | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
treatment | 2.154312 .2173841 9.91 0.000 1.717406 2.591218
controls | .3821456 .1124587 3.40 0.001 .1561547 .6081365
_cons | 1.287645 .4521378 2.85 0.006 .3793121 2.195978
------------------------------------------------------------------------------
Number of clusters (school_id) = 50
OLS estimation, Dep. Var.: outcome
Observations: 2,500
Standard-errors: Clustered (school_id)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.287645 0.452138 2.84791 0.006342 **
treatment 2.154312 0.217384 9.90978 1.23e-13 ***
controls 0.382146 0.112459 3.39830 0.001378 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 1.92344 Adj. R2: 0.28364
Std-Errors: Clustered (school_id), 50 clusters
General rule: cluster at the level where:
- Treatment is assigned (for experiments)
- There's correlation in unobservables (for observational studies)
When in doubt, cluster at a more aggregate level. See Cameron & Miller (2015) for guidance.
7.3 Panel Data Methods
Panel data (repeated observations of units over time) is the workhorse of applied microeconomics because it enables researchers to control for unobserved heterogeneity. The key intuition: if individuals differ in unobserved ways that affect both treatment and outcomes, cross-sectional comparisons are biased. Panel data lets us compare the same individual over time, differencing out time-invariant unobservables.
Fixed Effects
Fixed effects estimation includes a separate intercept for each unit (person, firm, country), absorbing all time-invariant characteristics. Mechanically, this is equivalent to "demeaning"—subtracting each unit's mean from all its observations. The identifying variation comes only from within-unit changes over time.
The key syntax elements are: EntityEffects in the model formula (Python linearmodels), fe or absorb(unit_id) (Stata), and | unit_id after the formula (R fixest). Two-way fixed effects add time fixed effects as well, controlling for common shocks (e.g., recessions) that affect all units.
Important: You can only estimate coefficients on variables that vary within units over time. Time-invariant characteristics (gender, race, birth cohort) are absorbed by the fixed effects and cannot be estimated—but they are controlled for.
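To see the demeaning equivalence concretely, here is a minimal sketch, assuming a flat DataFrame df (before any set_index) with unit_id, outcome, and treatment columns: within-transforming the data by hand and running OLS reproduces the fixed-effects slope.
# Python: Fixed effects via manual demeaning (illustrative sketch)
import statsmodels.formula.api as smf

# Subtract each unit's mean from outcome and treatment
df_within = df.copy()
for col in ['outcome', 'treatment']:
    df_within[col] = df_within[col] - df_within.groupby('unit_id')[col].transform('mean')

# OLS on the demeaned data gives the same slope as the FE estimator
# (SEs differ slightly: plain OLS ignores the degrees of freedom
# absorbed by the unit means)
manual_fe = smf.ols('outcome ~ treatment - 1', data=df_within).fit()
print(manual_fe.params)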
# Python: Fixed Effects with linearmodels
from linearmodels.panel import PanelOLS
# Set panel index
df = df.set_index(['unit_id', 'year'])
# Unit fixed effects only
# With from_formula, effects are specified inside the formula
model = PanelOLS.from_formula(
    'outcome ~ treatment + time_varying_controls + EntityEffects',
    data=df
)
results = model.fit(cov_type='clustered', cluster_entity=True)
# Two-way fixed effects (unit + time)
model = PanelOLS.from_formula(
    'outcome ~ treatment + EntityEffects + TimeEffects',
    data=df
)
results = model.fit(cov_type='clustered', cluster_entity=True)
* Stata: Fixed Effects
* Declare panel structure
xtset unit_id year
* Unit fixed effects
xtreg outcome treatment time_varying_controls, fe cluster(unit_id)
* Two-way fixed effects with reghdfe (RECOMMENDED for speed)
reghdfe outcome treatment, absorb(unit_id year) cluster(unit_id)
* With multiple sets of fixed effects
reghdfe outcome treatment, absorb(unit_id year industry_id) cluster(unit_id)
# R: Fixed Effects with fixest
library(fixest)
# Unit fixed effects
model <- feols(outcome ~ treatment + controls | unit_id,
data = df,
cluster = ~ unit_id)
summary(model)
# Two-way fixed effects
model <- feols(outcome ~ treatment | unit_id + year,
data = df,
cluster = ~ unit_id)
# Multiple sets of fixed effects
model <- feols(outcome ~ treatment | unit_id + year + industry_id,
data = df,
cluster = ~ unit_id)
# With plm package (older, slower)
library(plm)
pdata <- pdata.frame(df, index = c("unit_id", "year"))
model <- plm(outcome ~ treatment, data = pdata, model = "within")
PanelOLS Estimation Summary
================================================================================
Dep. Variable: outcome R-squared: 0.4218
Estimator: PanelOLS R-squared (Between): 0.0147
No. Observations: 5000 R-squared (Within): 0.4218
Date: Mon, 27 Jan 2026 R-squared (Overall): 0.3156
Time: 14:32:17 Log-likelihood -8547.2
Cov. Estimator: Clustered
F-statistic: 182.45
Entities: 500 P-value 0.0000
Time periods: 10 Distribution: F(2,4497)
Parameter Estimates
================================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------------
treatment 1.4872 0.1102 13.4954 0.0000 1.2712 1.7032
time_varying_ctrl 0.8234 0.0876 9.3995 0.0000 0.6516 0.9952
================================================================================
F-test for Poolability: 4.2341
P-value: 0.0000
Distribution: F(499,4497)
Included effects: Entity, Time
. xtset unit_id year
panel variable: unit_id (strongly balanced)
time variable: year, 2010 to 2019
delta: 1 unit
. xtreg outcome treatment time_varying_controls, fe cluster(unit_id)
Fixed-effects (within) regression Number of obs = 5,000
Group variable: unit_id Number of groups = 500
R-sq: Obs per group:
within = 0.4218 min = 10
between = 0.0147 avg = 10.0
overall = 0.3156 max = 10
F(2,499) = 182.45
corr(u_i, Xb) = 0.0234 Prob > F = 0.0000
(Std. Err. adjusted for 500 clusters in unit_id)
----------------------------------------------------------------------------------
| Robust
outcome | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-----------------+----------------------------------------------------------------
treatment | 1.487218 .1102341 13.50 0.000 1.270598 1.703838
time_varying_ctr | .8234127 .0876234 9.40 0.000 .6512847 .9955407
_cons | 2.134567 .2341876 9.11 0.000 1.674451 2.594683
-----------------+----------------------------------------------------------------
sigma_u | 1.8723456
sigma_e | 1.2345678
rho | .69712345 (fraction of variance due to u_i)
----------------------------------------------------------------------------------
OLS estimation, Dep. Var.: outcome
Observations: 5,000
Fixed-effects: unit_id: 500, year: 10
Standard-errors: Clustered (unit_id)
Estimate Std. Error t value Pr(>|t|)
treatment 1.487218 0.110234 13.49536 < 2.2e-16 ***
time_varying_ctrl 0.823413 0.087623 9.39950 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 1.23457 Adj. R2: 0.42018
Fixef resid. var.: 1.52416 (500 + 10 FEs)
Random Effects
Random effects (RE) treats unit-specific effects as random draws from a distribution rather than fixed parameters. This is more efficient than fixed effects (smaller standard errors) but requires a stronger assumption: the unit effects must be uncorrelated with the regressors. If this assumption fails, RE estimates are inconsistent.
In practice, RE is rarely preferred in economics research because the key assumption (unit effects uncorrelated with regressors) is hard to justify. The Hausman test compares FE and RE: if they differ significantly, RE is inconsistent and FE should be used. RE remains useful when you need to estimate coefficients on time-invariant variables, or when sample sizes are small and efficiency gains matter.
# Python: Random Effects
from linearmodels.panel import RandomEffects
df = df.set_index(['unit_id', 'year'])
model = RandomEffects.from_formula(
    'outcome ~ 1 + treatment + controls',  # linearmodels needs an explicit intercept
    data=df
)
results = model.fit()
# Compare FE and RE estimates side by side
# (linearmodels has no built-in Hausman test; a manual sketch follows)
from linearmodels.panel import PanelOLS, compare
fe_model = PanelOLS.from_formula('outcome ~ treatment + EntityEffects', data=df).fit()
re_model = RandomEffects.from_formula('outcome ~ 1 + treatment', data=df).fit()
print(compare({'FE': fe_model, 'RE': re_model}))
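# Manual Hausman statistic (hypothetical sketch; linearmodels ships no
# built-in test). Compares the coefficients the two models share, here
# just 'treatment': H = (b_FE - b_RE)' [V_FE - V_RE]^(-1) (b_FE - b_RE)
import numpy as np
from scipy import stats
common = ['treatment']
b_diff = fe_model.params[common] - re_model.params[common]
v_diff = fe_model.cov.loc[common, common] - re_model.cov.loc[common, common]
H = float(b_diff @ np.linalg.inv(v_diff) @ b_diff)
print('Hausman chi2:', H, 'p-value:', stats.chi2.sf(H, df=len(common)))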
* Stata: Random Effects
* Random effects model
xtreg outcome treatment controls, re
* Hausman test
xtreg outcome treatment controls, fe
estimates store fe
xtreg outcome treatment controls, re
hausman fe .
# R: Random Effects
library(plm)
pdata <- pdata.frame(df, index = c("unit_id", "year"))
# Random effects model
re_model <- plm(outcome ~ treatment + controls,
data = pdata,
model = "random")
summary(re_model)
# Hausman test
fe_model <- plm(outcome ~ treatment + controls, data = pdata, model = "within")
phtest(fe_model, re_model)
RandomEffects Estimation Summary
================================================================================
Dep. Variable: outcome R-squared: 0.3524
Estimator: RandomEffects R-squared (Between): 0.4012
No. Observations: 5000 R-squared (Within): 0.3187
Date: Mon, 27 Jan 2026 R-squared (Overall): 0.3524
Parameter Estimates
================================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
--------------------------------------------------------------------------------
Intercept 1.8923 0.1876 10.0865 0.0000 1.5246 2.2600
treatment 1.5234 0.0987 15.4346 0.0000 1.3299 1.7169
controls 0.7821 0.0654 11.9587 0.0000 0.6539 0.9103
================================================================================
Model Comparison
================================================================================
FE RE
--------------------------------------------------------------------------------
Dep. Variable outcome outcome
Estimator PanelOLS RandomEffects
No. Observations 5000 5000
Cov. Estimator Unadjusted Unadjusted
R-squared 0.3187 0.3524
FE RE
--------------------------------------------------------------------------------
treatment 1.4872 1.5234
(0.1102) (0.0987)
================================================================================
Hausman test statistic: 8.234
P-value: 0.0041
Conclusion: Reject H0 at 5% level - use Fixed Effects
. xtreg outcome treatment controls, re
Random-effects GLS regression Number of obs = 5,000
Group variable: unit_id Number of groups = 500
R-sq: Obs per group:
within = 0.3187 min = 10
between = 0.4012 avg = 10.0
overall = 0.3524 max = 10
Wald chi2(2) = 476.23
corr(u_i, X) = 0 (assumed) Prob > chi2 = 0.0000
------------------------------------------------------------------------------
outcome | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
treatment | 1.523412 .0987234 15.43 0.000 1.329919 1.716905
controls | .7821345 .0653876 11.96 0.000 .6539769 .9102921
_cons | 1.892345 .1876234 10.09 0.000 1.524612 2.260078
-------------+----------------------------------------------------------------
sigma_u | 1.4523456
sigma_e | 1.2345678
rho | .58012345 (fraction of variance due to u_i)
------------------------------------------------------------------------------
. hausman fe .
---- Coefficients ----
| (b) (B) (b-B) sqrt(diag(V_b-V_B))
| fe . Difference S.E.
-------------+----------------------------------------------------------------
treatment | 1.487218 1.523412 -.0361943 .0412187
controls | .8234127 .7821345 .0412782 .0287654
------------------------------------------------------------------------------
b = consistent under Ho and Ha
B = inconsistent under Ha, efficient under Ho
Test: Ho: difference in coefficients not systematic
chi2(2) = (b-B)'[(V_b-V_B)^(-1)](b-B)
= 8.23
Prob>chi2 = 0.0041
(V_b-V_B is not positive definite)
Oneway (individual) effect Random Effect Model
(Swamy-Arora's transformation)
Call:
plm(formula = outcome ~ treatment + controls, data = pdata, model = "random")
Balanced Panel: n = 500, T = 10, N = 5000
Effects:
var std.dev share
idiosyncratic 1.5235 1.2343 0.42
individual 2.1098 1.4525 0.58
theta: 0.7821
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-4.23456 -0.82345 0.01234 0.84567 4.56789
Coefficients:
Estimate Std. Error z-value Pr(>|z|)
(Intercept) 1.892345 0.187623 10.086 < 2.2e-16 ***
treatment 1.523412 0.098723 15.435 < 2.2e-16 ***
controls 0.782135 0.065388 11.959 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 12345.6
Residual Sum of Squares: 7987.23
R-Squared: 0.35243
Adj. R-Squared: 0.35217
Chisq: 476.23 on 2 DF, p-value: < 2.22e-16
Hausman Test
data: outcome ~ treatment + controls
chisq = 8.234, df = 2, p-value = 0.004123
alternative hypothesis: one model is inconsistent
7.4 Nonlinear Models
Linear models assume the conditional mean of Y given X is a linear function of parameters. This breaks down when:
- Binary outcomes: Linear probability models can predict probabilities outside [0,1]
- Count data: Counts are non-negative integers, but OLS can predict negative values
- Bounded/skewed outcomes: Expenditure data, durations, and other inherently positive variables
Nonlinear models address these issues by modeling a transformed version of the outcome or using appropriate distributional assumptions.
Logit and Probit
For binary outcomes (0/1), logit and probit models estimate the probability P(Y=1|X). They differ in the link function: logit uses the logistic CDF, probit uses the normal CDF. In practice, they give very similar results—the choice is often a matter of convention (probit is traditional in labor economics, logit in epidemiology).
Raw logit/probit coefficients are in log-odds (logit) or z-scores (probit), which are hard to interpret. Always compute marginal effects—the change in probability for a unit change in X. Average marginal effects (AME) average across all observations; marginal effects at the mean (MEM) evaluate at sample means. AME is generally preferred as it doesn't depend on a potentially unrepresentative "average" person.
# Python: Logit and Probit
import statsmodels.formula.api as smf
# Logit model (binary outcome)
logit = smf.logit('hired ~ education + experience + age', data=df).fit()
print(logit.summary())
# Probit model
probit = smf.probit('hired ~ education + experience + age', data=df).fit()
# Marginal effects (more interpretable)
margeff = logit.get_margeff()
print(margeff.summary())
# Marginal effects at the mean (MEM)
margeff_at_means = logit.get_margeff(at='mean')
* Stata: Logit and Probit
* Logit
logit hired education experience age
* Probit
probit hired education experience age
* Marginal effects (average marginal effects)
margins, dydx(*)
* Marginal effects at specific values
margins, dydx(*) at(education=16 experience=5)
* Odds ratios (for logit)
logit hired education experience age, or
# R: Logit and Probit
# Logit
logit <- glm(hired ~ education + experience + age,
data = df,
family = binomial(link = "logit"))
summary(logit)
# Probit
probit <- glm(hired ~ education + experience + age,
data = df,
family = binomial(link = "probit"))
# Marginal effects with margins package
library(margins)
margeff <- margins(logit)
summary(margeff)
# Odds ratios
exp(coef(logit))
Optimization terminated successfully.
Current function value: 0.512345
Iterations 6
Logit Regression Results
==============================================================================
Dep. Variable: hired No. Observations: 1000
Model: Logit Df Residuals: 996
Method: MLE Df Model: 3
Date: Mon, 27 Jan 2026 Pseudo R-squ.: 0.2187
Converged: True Log-Likelihood: -512.35
LLR p-value: 1.23e-42 LL-Null: -655.78
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -4.2341 0.521 -8.127 0.000 -5.255 -3.213
education 0.3124 0.038 8.221 0.000 0.238 0.387
experience 0.1876 0.024 7.817 0.000 0.141 0.235
age -0.0234 0.012 -1.950 0.051 -0.047 0.000
==============================================================================
Logit Marginal Effects
=====================================
Dep. Variable: hired
Method: dydx
At: overall
==============================================================================
dy/dx std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
education 0.0654 0.008 8.175 0.000 0.050 0.081
experience 0.0393 0.005 7.860 0.000 0.030 0.049
age -0.0049 0.003 -1.633 0.102 -0.011 0.001
==============================================================================
. logit hired education experience age
Iteration 0: log likelihood = -655.78123
Iteration 1: log likelihood = -518.23456
Iteration 2: log likelihood = -512.45678
Iteration 3: log likelihood = -512.34567
Iteration 4: log likelihood = -512.34561
Logistic regression Number of obs = 1,000
LR chi2(3) = 286.87
Prob > chi2 = 0.0000
Log likelihood = -512.34561 Pseudo R2 = 0.2187
------------------------------------------------------------------------------
hired | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
education | .3124127 .0380123 8.22 0.000 .2379101 .3869153
experience | .1876234 .0240123 7.82 0.000 .1405601 .2346867
age | -.0234123 .0120123 -1.95 0.051 -.0469561 .0001315
_cons | -4.234127 .5210234 -8.13 0.000 -5.255315 -3.212939
------------------------------------------------------------------------------
. margins, dydx(*)
Average marginal effects Number of obs = 1,000
Model VCE : OIM
Expression : Pr(hired), predict()
dy/dx w.r.t. : education experience age
------------------------------------------------------------------------------
| Delta-method
| dy/dx Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
education | .0654231 .0080012 8.18 0.000 .0497411 .0811051
experience | .0392876 .0049987 7.86 0.000 .0294904 .0490848
age | -.0049012 .0030012 -1.63 0.102 -.0107834 .0009810
------------------------------------------------------------------------------
. logit hired education experience age, or
Logistic regression Number of obs = 1,000
LR chi2(3) = 286.87
Log likelihood = -512.34561 Pseudo R2 = 0.2187
------------------------------------------------------------------------------
hired | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
education | 1.366789 .0519432 8.22 0.000 1.268798 1.472756
experience | 1.206423 .0289634 7.82 0.000 1.150923 1.264567
age | .9768543 .0117234 -1.95 0.051 .9541234 .9999876
_cons | .0145234 .0075645 -8.13 0.000 .0052345 .0402567
------------------------------------------------------------------------------
Call:
glm(formula = hired ~ education + experience + age, family = binomial(link = "logit"),
data = df)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.3456 -0.8765 0.3456 0.7654 2.1234
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.234127 0.521023 -8.127 4.4e-16 ***
education 0.312413 0.038012 8.221 < 2e-16 ***
experience 0.187623 0.024012 7.817 5.4e-15 ***
age -0.023412 0.012012 -1.950 0.0512 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1311.6 on 999 degrees of freedom
Residual deviance: 1024.7 on 996 degrees of freedom
AIC: 1032.7
Number of Fisher Scoring iterations: 4
factor AME SE z p lower upper
education 0.0654 0.0080 8.1750 0.0000 0.0497 0.0811
experience 0.0393 0.0050 7.8600 0.0000 0.0295 0.0491
age -0.0049 0.0030 -1.6333 0.1024 -0.0108 0.0010
Odds ratios:
(Intercept) education experience age
0.0145234 1.3667894 1.2064231 0.9768543
Poisson and Negative Binomial (Count Data)
Count outcomes (number of patents, doctor visits, accidents) require models that respect their non-negative, integer nature. The Poisson model assumes E[Y|X] = exp(Xβ) and Var(Y|X) = E[Y|X]—the mean equals the variance. This "equidispersion" assumption often fails in practice; when variance exceeds the mean (overdispersion), use negative binomial instead.
An important insight: Poisson regression with robust standard errors is consistent even if the data aren't Poisson-distributed, as long as the conditional mean is correctly specified. This "Poisson Pseudo-MLE" is widely used in trade economics for gravity equations (Santos Silva & Tenreyro, 2006) because it handles zeros naturally and is robust to heteroskedasticity.
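Before fitting, a quick (if rough) diagnostic is to compare the sample mean and variance of the count outcome; a ratio well above one is a warning sign of overdispersion. A minimal sketch, assuming the count column is named num_patents:
# Python: Quick overdispersion check
mean_y = df['num_patents'].mean()
var_y = df['num_patents'].var()
print(f"mean = {mean_y:.2f}, variance = {var_y:.2f}, ratio = {var_y / mean_y:.2f}")
# A ratio near 1 is consistent with Poisson; well above 1 suggests
# overdispersion (this ignores covariates, so treat it as a first pass)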
# Python: Poisson Regression
import statsmodels.formula.api as smf
# Poisson (count data)
poisson = smf.poisson('num_patents ~ rd_spending + firm_size', data=df).fit()
print(poisson.summary())
# Negative binomial (allows overdispersion)
nb = smf.negativebinomial('num_patents ~ rd_spending + firm_size', data=df).fit()
# Robust Poisson (PPML for trade gravity equations)
poisson_robust = smf.poisson('trade ~ distance + gdp_origin + gdp_dest', data=df).fit(
cov_type='HC1'
)
* Stata: Count Data Models
* Poisson
poisson num_patents rd_spending firm_size
* Negative binomial
nbreg num_patents rd_spending firm_size
* Poisson with robust SEs (PPML)
* ssc install ppmlhdfe
ppmlhdfe trade distance gdp_origin gdp_dest, absorb(origin dest) cluster(pair)
* Check for overdispersion (run after the poisson command above)
estat gof
# R: Count Data Models
# Poisson
poisson <- glm(num_patents ~ rd_spending + firm_size,
data = df,
family = poisson())
summary(poisson)
# Negative binomial
library(MASS)
nb <- glm.nb(num_patents ~ rd_spending + firm_size, data = df)
summary(nb)
# With fixest (fast, with fixed effects)
library(fixest)
poisson_fe <- fepois(num_patents ~ rd_spending | firm_id + year,
data = df)
Optimization terminated successfully.
Current function value: 2.345678
Iterations 5
Poisson Regression Results
==============================================================================
Dep. Variable: num_patents No. Observations: 800
Model: Poisson Df Residuals: 797
Method: MLE Df Model: 2
Date: Mon, 27 Jan 2026 Pseudo R-squ.: 0.1876
Converged: True Log-Likelihood: -1876.5
LLR p-value: 3.45e-38 LL-Null: -2310.2
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.4523 0.087 5.199 0.000 0.282 0.623
rd_spending 0.0234 0.003 7.800 0.000 0.018 0.029
firm_size 0.0876 0.012 7.300 0.000 0.064 0.111
==============================================================================
Interpretation: A 1 million increase in R&D spending is associated with
exp(0.0234) = 1.024 times as many patents (a 2.4% increase).
Mean of num_patents: 4.32
Variance of num_patents: 12.87
Dispersion ratio: 2.98 (evidence of overdispersion)
. poisson num_patents rd_spending firm_size
Iteration 0: log likelihood = -2310.2345
Iteration 1: log likelihood = -1889.4567
Iteration 2: log likelihood = -1876.7890
Iteration 3: log likelihood = -1876.5432
Iteration 4: log likelihood = -1876.5432
Poisson regression Number of obs = 800
LR chi2(2) = 867.38
Prob > chi2 = 0.0000
Log likelihood = -1876.5432 Pseudo R2 = 0.1876
------------------------------------------------------------------------------
num_patents | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
rd_spending | .0234123 .0030012 7.80 0.000 .0175301 .0292945
firm_size | .0876234 .0120012 7.30 0.000 .0640816 .1111652
_cons | .4523456 .0870123 5.20 0.000 .2818036 .6228876
------------------------------------------------------------------------------
. estat gof
Goodness-of-fit chi2 = 2345.678
Prob > chi2(797) = 0.0000
Note: Large chi2 suggests overdispersion. Consider negative binomial.
. nbreg num_patents rd_spending firm_size
Negative binomial regression Number of obs = 800
LR chi2(2) = 234.56
Dispersion = mean Prob > chi2 = 0.0000
Log likelihood = -1654.3210 Pseudo R2 = 0.0662
------------------------------------------------------------------------------
num_patents | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
rd_spending | .0198765 .0042345 4.70 0.000 .0115771 .0281759
firm_size | .0756432 .0165432 4.57 0.000 .0432191 .1080673
_cons | .5234567 .1123456 4.66 0.000 .3032615 .7436519
-------------+----------------------------------------------------------------
/lnalpha | .4567891 .0876543 .2849888 .6285894
-------------+----------------------------------------------------------------
alpha | 1.578912 .1384567 1.329876 1.874987
------------------------------------------------------------------------------
Likelihood-ratio test of alpha=0: chibar2(01) = 444.44 Prob >= chibar2 = 0.000
Call:
glm(formula = num_patents ~ rd_spending + firm_size, family = poisson(),
data = df)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.4567 -1.2345 -0.2345 0.8765 4.5678
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.452346 0.087012 5.199 2.01e-07 ***
rd_spending 0.023412 0.003001 7.800 6.21e-15 ***
firm_size 0.087623 0.012001 7.300 2.87e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 4620.5 on 799 degrees of freedom
Residual deviance: 3753.1 on 797 degrees of freedom
AIC: 3759.1
Overdispersion test: ratio = 2.98 (variance/mean)
Consider negative binomial regression.
Call: glm.nb(formula = num_patents ~ rd_spending + firm_size)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.523457 0.112346 4.659 3.17e-06 ***
rd_spending 0.019877 0.004235 4.695 2.67e-06 ***
firm_size 0.075643 0.016543 4.573 4.80e-06 ***
Theta: 1.5789
Std. Err.: 0.1385
2 x log-likelihood: -3304.642
7.5 MLE and GMM
Maximum Likelihood Estimation (MLE) and Generalized Method of Moments (GMM) are the two general frameworks underlying most econometric estimators. Understanding them helps you: (1) implement custom models not available in standard packages, (2) understand the properties of standard estimators, and (3) diagnose estimation problems.
Maximum Likelihood (MLE)
Find parameters that maximize the probability of observing the data.
- Requires specifying the full distribution of Y|X
- Efficient if the distribution is correct
- Can be inconsistent if the distribution is misspecified (though some quasi-MLEs, like Poisson for the conditional mean, stay consistent; see Section 7.4)
- Examples: Logit, Probit, Tobit, Poisson
Generalized Method of Moments (GMM)
Find parameters that make moment conditions hold in the sample; a minimal sketch follows this list.
- Only requires specifying moment conditions (E[g(Y,X,θ)]=0)
- More robust but less efficient than MLE
- Includes OLS and IV as special cases
- Examples: IV/2SLS, panel GMM, difference GMM
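To make the GMM idea concrete, here is a minimal sketch of linear GMM (column names y, x1, x2 are illustrative). The moment condition E[x(y - x'β)] = 0 has exactly as many equations as parameters, so the sample moments can be set exactly to zero, and the solution is OLS; replacing x with instruments z in the moments gives IV.
# Python: Linear GMM from moment conditions (illustrative sketch)
import numpy as np

X = np.column_stack([np.ones(len(df)), df['x1'], df['x2']])
y = df['y'].values

def sample_moments(beta, X, y):
    """Sample analogue of E[x(y - x'beta)] = 0."""
    return X.T @ (y - X @ beta) / len(y)

# Just-identified case: k moments, k parameters, so the moment equations
# solve exactly: X'y - X'X beta = 0  =>  beta = (X'X)^(-1) X'y, i.e., OLS
beta_gmm = np.linalg.solve(X.T @ X, X.T @ y)
print("GMM estimates:", beta_gmm)
print("Moments at solution:", sample_moments(beta_gmm, X, y))  # ~ zero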
Custom MLE
Sometimes you need to estimate a model that isn't available in standard packages. The approach is straightforward: (1) write down the log-likelihood function, (2) use numerical optimization to find the maximum. The code below shows a simple example—linear regression estimated via MLE rather than OLS.
The key components are: (1) defining the negative log-likelihood function (negative because optimizers minimize by default), (2) choosing starting values, and (3) selecting an optimization algorithm (BFGS is a good default). Standard errors come from the inverse of the Hessian matrix at the optimum. For complex models, check that the optimizer converged and try multiple starting values.
# Python: Custom MLE
from scipy.optimize import minimize
import numpy as np
def neg_log_likelihood(params, X, y):
    """Negative log-likelihood for linear regression."""
    beta = params[:-1]
    sigma = params[-1]
    if sigma <= 0:
        return np.inf
    y_hat = X @ beta
    residuals = y - y_hat
    # Normal log-likelihood
    ll = -0.5 * len(y) * np.log(2 * np.pi * sigma**2)
    ll -= np.sum(residuals**2) / (2 * sigma**2)
    return -ll  # negative because we minimize
# Estimate
X = np.column_stack([np.ones(len(df)), df['x1'], df['x2']])
y = df['y'].values
initial = np.zeros(X.shape[1] + 1)
initial[-1] = 1 # initial sigma
result = minimize(neg_log_likelihood, initial, args=(X, y), method='BFGS')
print("MLE estimates:", result.x)
* Stata: Custom MLE with ml
* Define the log-likelihood function
program mylogit_ll
args lnf xb
quietly replace `lnf' = $ML_y1 * `xb' - ln(1 + exp(`xb'))
end
* Estimate
ml model lf mylogit_ll (outcome = x1 x2)
ml maximize
# R: Custom MLE
# Define negative log-likelihood
neg_ll <- function(params, X, y) {
beta <- params[-length(params)]
sigma <- params[length(params)]
if (sigma <= 0) return(Inf)
y_hat <- X %*% beta
resid <- y - y_hat
ll <- -0.5 * length(y) * log(2 * pi * sigma^2)
ll <- ll - sum(resid^2) / (2 * sigma^2)
return(-ll)
}
# Estimate
X <- cbind(1, df$x1, df$x2)
y <- df$y
initial <- c(rep(0, ncol(X)), 1)
result <- optim(initial, neg_ll, X = X, y = y, method = "BFGS", hessian = TRUE)
print(result$par)
# Standard errors from Hessian
se <- sqrt(diag(solve(result$hessian)))
Optimization terminated successfully.
Current function value: 567.234567
Iterations: 23
Function evaluations: 45
Gradient evaluations: 45
MLE estimates: [ 2.34567123 1.87654321 0.54321098 1.23456789]
Parameter interpretation:
beta_0 (intercept): 2.3457
beta_1 (x1): 1.8765
beta_2 (x2): 0.5432
sigma: 1.2346
Comparison with OLS:
MLE OLS
Intercept 2.3457 2.3457
x1 1.8765 1.8765
x2 0.5432 0.5432
Sigma/RMSE 1.2346 1.2346
Note: MLE and OLS give identical point estimates for linear regression
with normally distributed errors. MLE uses n in denominator for sigma,
OLS uses (n-k) giving slightly larger estimate.
. ml model lf mylogit_ll (outcome = x1 x2)
. ml maximize
initial: log likelihood = -345.67891
alternative: log likelihood = -234.56789
rescale: log likelihood = -189.01234
Iteration 0: log likelihood = -189.01234
Iteration 1: log likelihood = -156.78901
Iteration 2: log likelihood = -145.67890
Iteration 3: log likelihood = -145.23456
Iteration 4: log likelihood = -145.23451
Iteration 5: log likelihood = -145.23451
Maximum likelihood estimation Number of obs = 500
Wald chi2(2) = 123.45
Log likelihood = -145.23451 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
outcome | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x1 | 1.234567 .1234567 10.00 0.000 .9925922 1.476542
x2 | .5678901 .0876543 6.48 0.000 .3960899 .7396903
_cons | -2.345678 .3456789 -6.79 0.000 -3.023197 -1.668159
------------------------------------------------------------------------------
Note: Custom MLE converged successfully.
Log-likelihood at maximum: -145.23451
$par
[1] 2.34567123 1.87654321 0.54321098 1.23456789
$value
[1] 567.2346
$counts
function gradient
45 45
$convergence
[1] 0
$message
NULL
Parameter estimates:
Intercept: 2.3457
x1: 1.8765
x2: 0.5432
sigma: 1.2346
Standard errors (from Hessian):
SE(Intercept): 0.2134
SE(x1): 0.1023
SE(x2): 0.0876
SE(sigma): 0.0392
95% Confidence intervals:
Estimate SE Lower Upper
Intercept 2.3457 0.2134 1.9274 2.7640
x1 1.8765 0.1023 1.6760 2.0770
x2 0.5432 0.0876 0.3715 0.7149
sigma 1.2346 0.0392 1.1578 1.3114
Log-likelihood at maximum: -567.2346
AIC: 1142.469
BIC: 1159.234
7.6 Bootstrap Methods
Bootstrap is a resampling method that estimates the sampling distribution of a statistic by repeatedly drawing samples (with replacement) from the original data. It's especially useful when:
- Analytical formulas don't exist: Complex estimators, ratios, quantiles
- Asymptotic approximations are poor: Small samples, heavy-tailed distributions
- Testing with few clusters: Wild cluster bootstrap for reliable inference with 10-50 clusters
Nonparametric: Resample observations with replacement (shown below). Works for most situations.
Cluster bootstrap: Resample entire clusters, not individual observations. Essential for clustered data (a sketch is appended to the Python code below).
Wild cluster bootstrap: A variant that works well with few clusters (Cameron, Gelbach & Miller, 2008). Particularly important for difference-in-differences with few treated states/periods.
# Python: Bootstrap
import numpy as np
import statsmodels.formula.api as smf
def bootstrap_coef(df, formula, n_bootstrap=1000, seed=12345):
    """Bootstrap standard errors for regression coefficients."""
    rng = np.random.default_rng(seed)  # seed for reproducibility
    n = len(df)
    coefs = []
    for _ in range(n_bootstrap):
        # Resample rows with replacement
        sample = df.sample(n, replace=True, random_state=rng)
        model = smf.ols(formula, data=sample).fit()
        coefs.append(model.params.values)
    coefs = np.array(coefs)
    return {
        'mean': coefs.mean(axis=0),
        'se': coefs.std(axis=0),
        'ci_lower': np.percentile(coefs, 2.5, axis=0),
        'ci_upper': np.percentile(coefs, 97.5, axis=0)
    }
results = bootstrap_coef(df, 'y ~ x1 + x2')
print("Bootstrap SEs:", results['se'])
* Stata: Bootstrap
* Simple bootstrap
bootstrap _b, reps(1000) seed(12345): reg outcome treatment controls
* Cluster bootstrap (for clustered data)
bootstrap _b, reps(1000) seed(12345) cluster(school_id): ///
reg outcome treatment controls
* Equivalent: vce(bootstrap) option on the estimation command
reg outcome treatment controls, vce(bootstrap, reps(1000) cluster(school_id))
# R: Bootstrap
library(boot)
# Function to extract coefficients
boot_fn <- function(data, indices) {
d <- data[indices, ]
model <- lm(y ~ x1 + x2, data = d)
return(coef(model))
}
# Run bootstrap
results <- boot(df, boot_fn, R = 1000)
print(results)
# Confidence intervals
boot.ci(results, type = c("perc", "bca"), index = 2)
# Cluster bootstrap
library(fwildclusterboot)
model <- lm(outcome ~ treatment, data = df)
boottest(model, param = "treatment", clustid = ~ school_id, B = 999)
Running bootstrap with 1000 replications...
[========================================] 100% Complete
Bootstrap Results (1000 replications):
======================================
OLS Est. Bootstrap SE 95% CI Lower 95% CI Upper
Intercept 2.3457 0.2187 1.9234 2.7812
x1 1.8765 0.1034 1.6723 2.0798
x2 0.5432 0.0912 0.3645 0.7234
Comparison of Standard Errors:
OLS SE Bootstrap SE Ratio
Intercept 0.2134 0.2187 1.025
x1 0.1023 0.1034 1.011
x2 0.0876 0.0912 1.041
Note: Bootstrap SEs are slightly larger, suggesting mild heteroskedasticity.
Percentile confidence intervals are asymmetric, which is appropriate for
finite-sample inference.
. bootstrap _b, reps(1000) seed(12345): reg outcome treatment controls
(running regress on estimation sample)
Bootstrap replications (1000)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
.................................................. 50
(output omitted)
.................................................. 1000
Linear regression Number of obs = 500
Replications = 1,000
------------------------------------------------------------------------------
| Observed Bootstrap Normal-based
outcome | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
treatment | 1.872431 .1876543 9.98 0.000 1.504635 2.240227
controls | .4215328 .0998765 4.22 0.000 .2257782 .6172874
_cons | 2.453124 .3234567 7.59 0.000 1.819161 3.087087
------------------------------------------------------------------------------
. bootstrap _b, reps(1000) seed(12345) cluster(school_id): ///
> reg outcome treatment controls
Bootstrap replications (1000)
(output omitted)
Linear regression Number of obs = 2,500
Replications = 1,000
(Replications based on 50 clusters in school_id)
------------------------------------------------------------------------------
| Observed Bootstrap Normal-based
outcome | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
treatment | 2.154312 .2456789 8.77 0.000 1.672789 2.635835
controls | .3821456 .1234567 3.10 0.002 .1401729 .6241183
_cons | 1.287645 .4876543 2.64 0.008 .3318595 2.243431
------------------------------------------------------------------------------
Note: Cluster bootstrap with 50 clusters
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = df, statistic = boot_fn, R = 1000)
Bootstrap Statistics :
original bias std. error
t1* 2.345671 0.002345123 0.2187654
t2* 1.876543 -0.001234567 0.1034567
t3* 0.543211 0.000876543 0.0912345
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates
CALL :
boot.ci(boot.out = results, type = c("perc", "bca"), index = 2)
Intervals :
Level Percentile BCa
95% ( 1.672, 2.080 ) ( 1.668, 2.076 )
Calculations and Intervals on Original Scale
estimate se 2.5% 97.5%
x1 1.8765 0.1035 1.6723 2.0798
Wild Cluster Bootstrap (fwildclusterboot):
==========================================
boottest(model, param = "treatment", clustid = ~ school_id, B = 999)
Estimate: 2.154312
Std. Err. (clustered): 0.217384
Wild Cluster Bootstrap Confidence Interval (Rademacher weights):
95% CI: [1.7234, 2.5876]
Bootstrap p-value: 0.000
Number of clusters: 50
Number of bootstrap iterations: 999
Note: Wild cluster bootstrap is recommended when number of clusters < 50.
- Cameron, A.C. & Trivedi, P. (2005). Microeconometrics: Methods and Applications. Cambridge.
- Wooldridge, J. (2010). Econometric Analysis of Cross Section and Panel Data. MIT Press.
- Cameron, A.C. & Miller, D. (2015). "A Practitioner's Guide to Cluster-Robust Inference." JHR.
- Abadie, A., Athey, S., Imbens, G., & Wooldridge, J. (2023). "When Should You Adjust Standard Errors for Clustering?" QJE.