7  Estimation Methods

~6 hours Standard Errors, Panel Data, Nonlinear Models Intermediate-Advanced

Learning Objectives

  • Distinguish between identification and estimation
  • Choose appropriate standard errors for your data structure
  • Handle panel data with fixed effects efficiently
  • Implement nonlinear models (logit, probit, Poisson)
  • Understand maximum likelihood and GMM estimation

7.1 Identification vs. Estimation

A common source of confusion: identification and estimation are different problems.

Identification (Module 6)

Can we learn the causal effect from data?

  • Research design and assumptions
  • What variation identifies the effect?
  • Parallel trends, exclusion restriction, etc.
  • Whether your estimate is causal or just correlational

Estimation (This Module)

How do we compute the estimate from data?

  • OLS, MLE, GMM, nonparametric methods
  • Standard errors and inference
  • Handling large datasets efficiently
  • Correcting for clustering, heteroskedasticity

You can have correct identification but bad estimation (wrong standard errors), or good estimation but no identification (precise estimate of a biased number). Both matter.

7.2 Standard Errors and Inference

Choosing the right standard errors is essential for valid inference. Using incorrect standard errors can lead to dramatically wrong conclusions—confidence intervals that are too narrow (false precision) or too wide (lost power). The two main considerations are:

  • Heteroskedasticity: When the variance of the error term varies across observations (e.g., income variance differs by wealth level)
  • Clustering: When observations are correlated within groups (e.g., students in the same classroom, repeated observations of the same person)

Robust (Heteroskedasticity-Consistent) Standard Errors

Classical OLS standard errors assume homoskedasticity: constant error variance across all observations. In practice, this rarely holds. Robust standard errors (also called Huber-White or sandwich standard errors) remain valid under heteroskedasticity without requiring you to model its form.

What to look for in the code

The key parameter is cov_type='HC1' (Python), robust or vce(robust) (Stata), or vcov = vcovHC() (R). HC0-HC3 are variants with different small-sample corrections—HC1 is the most common. In the output, compare standard errors: robust SEs are often larger than classical OLS SEs when heteroskedasticity is present, leading to wider confidence intervals and more conservative inference.

# Python: Robust Standard Errors
import statsmodels.formula.api as smf

# Default (homoskedastic) standard errors
model = smf.ols('outcome ~ treatment + controls', data=df).fit()

# Robust (heteroskedasticity-consistent) HC1
model_robust = smf.ols('outcome ~ treatment + controls', data=df).fit(
    cov_type='HC1'
)
print(model_robust.summary())
* Stata: Robust Standard Errors

* Default standard errors
reg outcome treatment controls

* Robust (heteroskedasticity-consistent) - ALWAYS use this
reg outcome treatment controls, robust

* Equivalent: vce(robust)
reg outcome treatment controls, vce(robust)
# R: Robust Standard Errors
library(sandwich)
library(lmtest)

# Fit model
model <- lm(outcome ~ treatment + controls, data = df)

# Robust standard errors (HC1)
coeftest(model, vcov = vcovHC(model, type = "HC1"))

# Using fixest (cleaner syntax)
library(fixest)
model <- feols(outcome ~ treatment + controls, data = df, vcov = "hetero")
summary(model)
Python Output
                            OLS Regression Results (Robust)
==============================================================================
Dep. Variable:                outcome   R-squared:                       0.342
Model:                            OLS   Adj. R-squared:                  0.339
Method:                 Least Squares   F-statistic:                     127.4
Date:                Mon, 27 Jan 2026   Prob (F-statistic):           1.23e-47
No. Observations:                 500
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      2.4531      0.312      7.861      0.000       1.839       3.067
treatment      1.8724      0.187     10.014      0.000       1.505       2.240
controls       0.4215      0.098      4.301      0.000       0.229       0.614
==============================================================================
Notes: Standard errors are heteroskedasticity robust (HC1)
Stata Output
Linear regression                               Number of obs     =        500
                                                F(2, 497)         =     127.38
                                                Prob > F          =     0.0000
                                                R-squared         =     0.3418
                                                Root MSE          =     2.1847

------------------------------------------------------------------------------
             |               Robust
     outcome |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   treatment |   1.872431   .1869847    10.01   0.000     1.505087    2.239775
    controls |   .4215328   .0980127     4.30   0.000     .2289842    .6140814
       _cons |   2.453124   .3120458     7.86   0.000     1.839434    3.066814
------------------------------------------------------------------------------
R Output
t test of coefficients (Heteroskedasticity-consistent):

             Estimate Std. Error t value  Pr(>|t|)
(Intercept)  2.453124   0.312046  7.8612 2.35e-14 ***
treatment    1.872431   0.186985 10.0138 < 2.2e-16 ***
controls     0.421533   0.098013  4.3008 2.01e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Clustered Standard Errors

When observations are correlated within groups, standard errors that ignore this correlation will be too small—often dramatically so. Common examples include:

  • Students within schools: Students in the same school share teachers, facilities, and peer effects
  • Workers within firms: Workers at the same company share management, culture, and industry shocks
  • Repeated observations: The same individual observed multiple times over a panel
  • Geographic clustering: Individuals in the same state/region share local policies and economic conditions
What to look for in the code

The clustering variable is specified via cov_kwds={'groups': df['cluster_var']} (Python), cluster(cluster_var) (Stata), or cluster = ~cluster_var (R/fixest). The output will report the number of clusters—a rule of thumb is that you need at least 30-50 clusters for clustered standard errors to be reliable. With few clusters, consider wild cluster bootstrap (shown in Section 7.6).

# Python: Clustered Standard Errors
import statsmodels.formula.api as smf

# Cluster at the school level
model = smf.ols('outcome ~ treatment', data=df).fit(
    cov_type='cluster',
    cov_kwds={'groups': df['school_id']}
)

# Two-way clustering (e.g., state and year)
from linearmodels.panel import PanelOLS
df_panel = df.set_index(['state', 'year'])
model = PanelOLS.from_formula('outcome ~ treatment', data=df_panel)
results = model.fit(cov_type='clustered', cluster_entity=True, cluster_time=True)
* Stata: Clustered Standard Errors

* One-way clustering
reg outcome treatment controls, cluster(school_id)

* Two-way clustering
* ssc install reghdfe
reghdfe outcome treatment, absorb(state year) cluster(state year)

* Check number of clusters
tab school_id, matrow(clusters)
display r(r)
# R: Clustered Standard Errors
library(fixest)

# One-way clustering
model <- feols(outcome ~ treatment + controls,
               data = df,
               cluster = ~ school_id)
summary(model)

# Two-way clustering
model <- feols(outcome ~ treatment | state + year,
               data = df,
               cluster = ~ state + year)

# With sandwich package
library(sandwich)
model <- lm(outcome ~ treatment, data = df)
vcovCL(model, cluster = ~ school_id)
Python Output
                          PanelOLS Estimation Summary
================================================================================
Dep. Variable:                outcome   R-squared:                        0.2847
Estimator:                   PanelOLS   R-squared (Between):              0.3124
No. Observations:                2500   R-squared (Within):               0.2847
Date:                Mon, 27 Jan 2026   R-squared (Overall):              0.2956
Cov. Estimator:             Clustered
                                        F-statistic:                      98.237
Entities:                          50   P-value                           0.0000
Time periods:                      50   Distribution:                  F(1,2448)
Cluster Var:           Entity & Time

                             Parameter Estimates
================================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
--------------------------------------------------------------------------------
treatment      2.1543     0.2174     9.9098     0.0000      1.7276      2.5810
================================================================================
Number of entity clusters: 50
Number of time clusters: 50
Stata Output
Linear regression                               Number of obs     =      2,500
                                                F(2, 49)          =      65.82
                                                Prob > F          =     0.0000
                                                R-squared         =     0.2847
                                                Root MSE          =     1.9234

                               (Std. Err. adjusted for 50 clusters in school_id)
------------------------------------------------------------------------------
             |               Robust
     outcome |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   treatment |   2.154312   .2173841     9.91   0.000     1.717406    2.591218
    controls |   .3821456   .1124587     3.40   0.001     .1561547    .6081365
       _cons |   1.287645   .4521378     2.85   0.006     .3793121    2.195978
------------------------------------------------------------------------------

Number of clusters (school_id) = 50
R Output
OLS estimation, Dep. Var.: outcome
Observations: 2,500
Standard-errors: Clustered (school_id)
             Estimate Std. Error  t value  Pr(>|t|)
(Intercept)  1.287645   0.452138  2.84791  0.006342 **
treatment    2.154312   0.217384  9.90978 1.23e-13 ***
controls     0.382146   0.112459  3.39830  0.001378 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 1.92344   Adj. R2: 0.28364
                 Std-Errors: Clustered (school_id), 50 clusters
When to Cluster

General rule: cluster at the level where:

  • Treatment is assigned (for experiments)
  • There's correlation in unobservables (for observational studies)

When in doubt, cluster at a more aggregate level. See Cameron & Miller (2015) for guidance.

7.3 Panel Data Methods

Panel data (repeated observations of units over time) is the workhorse of applied microeconomics because it enables researchers to control for unobserved heterogeneity. The key intuition: if individuals differ in unobserved ways that affect both treatment and outcomes, cross-sectional comparisons are biased. Panel data lets us compare the same individual over time, differencing out time-invariant unobservables.

Fixed Effects

Fixed effects estimation includes a separate intercept for each unit (person, firm, country), absorbing all time-invariant characteristics. Mechanically, this is equivalent to "demeaning"—subtracting each unit's mean from all its observations. The identifying variation comes only from within-unit changes over time.

What to look for in the code

The key syntax elements are: entity_effects=True (Python linearmodels), fe or absorb(unit_id) (Stata), and | unit_id after the formula (R fixest). Two-way fixed effects add time fixed effects as well, controlling for common shocks (e.g., recessions) that affect all units.

Important: You can only estimate coefficients on variables that vary within units over time. Time-invariant characteristics (gender, race, birth cohort) are absorbed by the fixed effects and cannot be estimated—but they are controlled for.

# Python: Fixed Effects with linearmodels
from linearmodels.panel import PanelOLS

# Set panel index
df = df.set_index(['unit_id', 'year'])

# Unit fixed effects only
model = PanelOLS.from_formula(
    'outcome ~ treatment + time_varying_controls',
    data=df,
    entity_effects=True
)
results = model.fit(cov_type='clustered', cluster_entity=True)

# Two-way fixed effects (unit + time)
model = PanelOLS.from_formula(
    'outcome ~ treatment',
    data=df,
    entity_effects=True,
    time_effects=True
)
results = model.fit(cov_type='clustered', cluster_entity=True)
* Stata: Fixed Effects

* Declare panel structure
xtset unit_id year

* Unit fixed effects
xtreg outcome treatment time_varying_controls, fe cluster(unit_id)

* Two-way fixed effects with reghdfe (RECOMMENDED for speed)
reghdfe outcome treatment, absorb(unit_id year) cluster(unit_id)

* With multiple sets of fixed effects
reghdfe outcome treatment, absorb(unit_id year industry_id) cluster(unit_id)
# R: Fixed Effects with fixest
library(fixest)

# Unit fixed effects
model <- feols(outcome ~ treatment + controls | unit_id,
               data = df,
               cluster = ~ unit_id)
summary(model)

# Two-way fixed effects
model <- feols(outcome ~ treatment | unit_id + year,
               data = df,
               cluster = ~ unit_id)

# Multiple sets of fixed effects
model <- feols(outcome ~ treatment | unit_id + year + industry_id,
               data = df,
               cluster = ~ unit_id)

# With plm package (older, slower)
library(plm)
pdata <- pdata.frame(df, index = c("unit_id", "year"))
model <- plm(outcome ~ treatment, data = pdata, model = "within")
Python Output
                          PanelOLS Estimation Summary
================================================================================
Dep. Variable:                outcome   R-squared:                        0.4218
Estimator:                   PanelOLS   R-squared (Between):              0.0147
No. Observations:                5000   R-squared (Within):               0.4218
Date:                Mon, 27 Jan 2026   R-squared (Overall):              0.3156
Time:                        14:32:17   Log-likelihood                   -8547.2
Cov. Estimator:             Clustered
                                        F-statistic:                      182.45
Entities:                         500   P-value                           0.0000
Time periods:                      10   Distribution:                  F(2,4497)

                             Parameter Estimates
================================================================================
                   Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------------
treatment             1.4872     0.1102    13.4954     0.0000      1.2712      1.7032
time_varying_ctrl     0.8234     0.0876     9.3995     0.0000      0.6516      0.9952
================================================================================

F-test for Poolability: 4.2341
P-value: 0.0000
Distribution: F(499,4497)

Included effects: Entity, Time
Stata Output
. xtset unit_id year
       panel variable:  unit_id (strongly balanced)
        time variable:  year, 2010 to 2019
                delta:  1 unit

. xtreg outcome treatment time_varying_controls, fe cluster(unit_id)

Fixed-effects (within) regression               Number of obs     =      5,000
Group variable: unit_id                         Number of groups  =        500

R-sq:                                           Obs per group:
     within  = 0.4218                                         min =         10
     between = 0.0147                                         avg =       10.0
     overall = 0.3156                                         max =         10

                                                F(2,499)          =     182.45
corr(u_i, Xb)  = 0.0234                         Prob > F          =     0.0000

                                  (Std. Err. adjusted for 500 clusters in unit_id)
----------------------------------------------------------------------------------
                 |               Robust
         outcome |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-----------------+----------------------------------------------------------------
       treatment |   1.487218   .1102341    13.50   0.000     1.270598    1.703838
time_varying_ctr |   .8234127   .0876234     9.40   0.000     .6512847    .9955407
           _cons |   2.134567   .2341876     9.11   0.000     1.674451    2.594683
-----------------+----------------------------------------------------------------
         sigma_u |  1.8723456
         sigma_e |  1.2345678
             rho |  .69712345   (fraction of variance due to u_i)
----------------------------------------------------------------------------------
R Output
OLS estimation, Dep. Var.: outcome
Observations: 5,000
Fixed-effects: unit_id: 500,  year: 10
Standard-errors: Clustered (unit_id)
                     Estimate Std. Error   t value   Pr(>|t|)
treatment            1.487218   0.110234  13.49536  < 2.2e-16 ***
time_varying_ctrl    0.823413   0.087623   9.39950  < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 1.23457     Adj. R2: 0.42018
Fixef resid. var.: 1.52416  (500 + 10 FEs)

Random Effects

Random effects (RE) treats unit-specific effects as random draws from a distribution rather than fixed parameters. This is more efficient than fixed effects (smaller standard errors) but requires a stronger assumption: the unit effects must be uncorrelated with the regressors. If this assumption fails, RE estimates are inconsistent.

When to use Random Effects

In practice, RE is rarely preferred in economics research because the key assumption (unit effects uncorrelated with regressors) is hard to justify. The Hausman test compares FE and RE: if they differ significantly, RE is inconsistent and FE should be used. RE remains useful when you need to estimate coefficients on time-invariant variables, or when sample sizes are small and efficiency gains matter.

# Python: Random Effects
from linearmodels.panel import RandomEffects

df = df.set_index(['unit_id', 'year'])

model = RandomEffects.from_formula(
    'outcome ~ treatment + controls',
    data=df
)
results = model.fit()

# Hausman test: FE vs RE
from linearmodels.panel import compare
fe_model = PanelOLS.from_formula('outcome ~ treatment', data=df, entity_effects=True).fit()
re_model = RandomEffects.from_formula('outcome ~ treatment', data=df).fit()
print(compare({'FE': fe_model, 'RE': re_model}))
* Stata: Random Effects

* Random effects model
xtreg outcome treatment controls, re

* Hausman test
xtreg outcome treatment controls, fe
estimates store fe
xtreg outcome treatment controls, re
hausman fe .
# R: Random Effects
library(plm)

pdata <- pdata.frame(df, index = c("unit_id", "year"))

# Random effects model
re_model <- plm(outcome ~ treatment + controls,
                data = pdata,
                model = "random")
summary(re_model)

# Hausman test
fe_model <- plm(outcome ~ treatment + controls, data = pdata, model = "within")
phtest(fe_model, re_model)
Python Output
                        RandomEffects Estimation Summary
================================================================================
Dep. Variable:                outcome   R-squared:                        0.3524
Estimator:              RandomEffects   R-squared (Between):              0.4012
No. Observations:                5000   R-squared (Within):               0.3187
Date:                Mon, 27 Jan 2026   R-squared (Overall):              0.3524

                             Parameter Estimates
================================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
--------------------------------------------------------------------------------
Intercept      1.8923     0.1876    10.0865     0.0000      1.5246      2.2600
treatment      1.5234     0.0987    15.4346     0.0000      1.3299      1.7169
controls       0.7821     0.0654    11.9587     0.0000      0.6539      0.9103
================================================================================

                        Model Comparison
================================================================================
                                     FE              RE
--------------------------------------------------------------------------------
Dep. Variable                   outcome         outcome
Estimator                      PanelOLS    RandomEffects
No. Observations                   5000            5000
Cov. Estimator                  Unadjusted      Unadjusted
R-squared                         0.3187          0.3524

                                     FE              RE
--------------------------------------------------------------------------------
treatment                        1.4872          1.5234
                               (0.1102)        (0.0987)
================================================================================

Hausman test statistic: 8.234
P-value: 0.0041
Conclusion: Reject H0 at 5% level - use Fixed Effects
Stata Output
. xtreg outcome treatment controls, re

Random-effects GLS regression                   Number of obs     =      5,000
Group variable: unit_id                         Number of groups  =        500

R-sq:                                           Obs per group:
     within  = 0.3187                                         min =         10
     between = 0.4012                                         avg =       10.0
     overall = 0.3524                                         max =         10

                                                Wald chi2(2)      =     476.23
corr(u_i, X)   = 0 (assumed)                    Prob > chi2       =     0.0000

------------------------------------------------------------------------------
     outcome |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   treatment |   1.523412   .0987234    15.43   0.000     1.329919    1.716905
    controls |   .7821345   .0653876    11.96   0.000     .6539769    .9102921
       _cons |   1.892345   .1876234    10.09   0.000     1.524612    2.260078
-------------+----------------------------------------------------------------
     sigma_u |  1.4523456
     sigma_e |  1.2345678
         rho |  .58012345   (fraction of variance due to u_i)
------------------------------------------------------------------------------

. hausman fe .

                 ---- Coefficients ----
             |      (b)          (B)            (b-B)     sqrt(diag(V_b-V_B))
             |       fe           .          Difference          S.E.
-------------+----------------------------------------------------------------
   treatment |    1.487218     1.523412       -.0361943        .0412187
    controls |    .8234127     .7821345        .0412782        .0287654
------------------------------------------------------------------------------
                           b = consistent under Ho and Ha
            B = inconsistent under Ha, efficient under Ho

    Test:  Ho:  difference in coefficients not systematic

                  chi2(2) = (b-B)'[(V_b-V_B)^(-1)](b-B)
                          =        8.23
                Prob>chi2 =      0.0041
                (V_b-V_B is not positive definite)
R Output
Oneway (individual) effect Random Effect Model
   (Swamy-Arora's transformation)

Call:
plm(formula = outcome ~ treatment + controls, data = pdata, model = "random")

Balanced Panel: n = 500, T = 10, N = 5000

Effects:
                  var std.dev share
idiosyncratic 1.5235   1.2343  0.42
individual    2.1098   1.4525  0.58
theta: 0.7821

Residuals:
     Min.   1st Qu.    Median   3rd Qu.      Max.
-4.23456  -0.82345   0.01234   0.84567   4.56789

Coefficients:
             Estimate Std. Error z-value  Pr(>|z|)
(Intercept)  1.892345   0.187623  10.086 < 2.2e-16 ***
treatment    1.523412   0.098723  15.435 < 2.2e-16 ***
controls     0.782135   0.065388  11.959 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    12345.6
Residual Sum of Squares: 7987.23
R-Squared:      0.35243
Adj. R-Squared: 0.35217
Chisq: 476.23 on 2 DF, p-value: < 2.22e-16

	Hausman Test

data:  outcome ~ treatment + controls
chisq = 8.234, df = 2, p-value = 0.004123
alternative hypothesis: one model is inconsistent

7.4 Nonlinear Models

Linear models assume the conditional mean of Y given X is a linear function of parameters. This breaks down when:

  • Binary outcomes: Linear probability models can predict probabilities outside [0,1]
  • Count data: Counts are non-negative integers, but OLS can predict negative values
  • Bounded/skewed outcomes: Expenditure data, durations, and other inherently positive variables

Nonlinear models address these issues by modeling a transformed version of the outcome or using appropriate distributional assumptions.

Logit and Probit

For binary outcomes (0/1), logit and probit models estimate the probability P(Y=1|X). They differ in the link function: logit uses the logistic CDF, probit uses the normal CDF. In practice, they give very similar results—the choice is often a matter of convention (probit is traditional in labor economics, logit in epidemiology).

Interpreting the output

Raw logit/probit coefficients are in log-odds (logit) or z-scores (probit), which are hard to interpret. Always compute marginal effects—the change in probability for a unit change in X. Average marginal effects (AME) average across all observations; marginal effects at the mean (MEM) evaluate at sample means. AME is generally preferred as it doesn't depend on a potentially unrepresentative "average" person.

# Python: Logit and Probit
import statsmodels.formula.api as smf

# Logit model (binary outcome)
logit = smf.logit('hired ~ education + experience + age', data=df).fit()
print(logit.summary())

# Probit model
probit = smf.probit('hired ~ education + experience + age', data=df).fit()

# Marginal effects (more interpretable)
margeff = logit.get_margeff()
print(margeff.summary())

# Average marginal effects at specific values
margeff_at_means = logit.get_margeff(at='mean')
* Stata: Logit and Probit

* Logit
logit hired education experience age

* Probit
probit hired education experience age

* Marginal effects (average marginal effects)
margins, dydx(*)

* Marginal effects at specific values
margins, dydx(*) at(education=16 experience=5)

* Odds ratios (for logit)
logit hired education experience age, or
# R: Logit and Probit

# Logit
logit <- glm(hired ~ education + experience + age,
             data = df,
             family = binomial(link = "logit"))
summary(logit)

# Probit
probit <- glm(hired ~ education + experience + age,
              data = df,
              family = binomial(link = "probit"))

# Marginal effects with margins package
library(margins)
margeff <- margins(logit)
summary(margeff)

# Odds ratios
exp(coef(logit))
Python Output
Optimization terminated successfully.
         Current function value: 0.512345
         Iterations 6
                           Logit Regression Results
==============================================================================
Dep. Variable:                  hired   No. Observations:                 1000
Model:                          Logit   Df Residuals:                      996
Method:                           MLE   Df Model:                            3
Date:                Mon, 27 Jan 2026   Pseudo R-squ.:                  0.2187
Converged:                       True   Log-Likelihood:                -512.35
LLR p-value:                 1.23e-42   LL-Null:                       -655.78
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -4.2341      0.521     -8.127      0.000      -5.255      -3.213
education      0.3124      0.038      8.221      0.000       0.238       0.387
experience     0.1876      0.024      7.817      0.000       0.141       0.235
age           -0.0234      0.012     -1.950      0.051      -0.047       0.000
==============================================================================

        Logit Marginal Effects
=====================================
Dep. Variable:                  hired
Method:                          dydx
At:                           overall
==============================================================================
                dy/dx    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
education      0.0654      0.008      8.175      0.000       0.050       0.081
experience     0.0393      0.005      7.860      0.000       0.030       0.049
age           -0.0049      0.003     -1.633      0.102      -0.011       0.001
==============================================================================
Stata Output
. logit hired education experience age

Iteration 0:   log likelihood = -655.78123
Iteration 1:   log likelihood = -518.23456
Iteration 2:   log likelihood = -512.45678
Iteration 3:   log likelihood = -512.34567
Iteration 4:   log likelihood = -512.34561

Logistic regression                             Number of obs     =      1,000
                                                LR chi2(3)        =     286.87
                                                Prob > chi2       =     0.0000
Log likelihood = -512.34561                     Pseudo R2         =     0.2187

------------------------------------------------------------------------------
       hired |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   education |   .3124127   .0380123     8.22   0.000     .2379101    .3869153
  experience |   .1876234   .0240123     7.82   0.000     .1405601    .2346867
         age |  -.0234123   .0120123    -1.95   0.051    -.0469561    .0001315
       _cons |  -4.234127   .5210234    -8.13   0.000    -5.255315   -3.212939
------------------------------------------------------------------------------

. margins, dydx(*)

Average marginal effects                        Number of obs     =      1,000
Model VCE    : OIM

Expression   : Pr(hired), predict()
dy/dx w.r.t. : education experience age

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   education |   .0654231   .0080012     8.18   0.000     .0497411    .0811051
  experience |   .0392876   .0049987     7.86   0.000     .0294904    .0490848
         age |  -.0049012   .0030012    -1.63   0.102    -.0107834    .0009810
------------------------------------------------------------------------------

. logit hired education experience age, or

Logistic regression                             Number of obs     =      1,000
                                                LR chi2(3)        =     286.87
Log likelihood = -512.34561                     Pseudo R2         =     0.2187

------------------------------------------------------------------------------
       hired | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   education |   1.366789   .0519432     8.22   0.000     1.268798    1.472756
  experience |   1.206423   .0289634     7.82   0.000     1.150923    1.264567
         age |   .9768543   .0117234    -1.95   0.051     .9541234    .9999876
       _cons |   .0145234   .0075645    -8.13   0.000     .0052345    .0402567
------------------------------------------------------------------------------
R Output
Call:
glm(formula = hired ~ education + experience + age, family = binomial(link = "logit"),
    data = df)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.3456  -0.8765   0.3456   0.7654   2.1234

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.234127   0.521023  -8.127  4.4e-16 ***
education    0.312413   0.038012   8.221  < 2e-16 ***
experience   0.187623   0.024012   7.817  5.4e-15 ***
age         -0.023412   0.012012  -1.950   0.0512 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1311.6  on 999  degrees of freedom
Residual deviance: 1024.7  on 996  degrees of freedom
AIC: 1032.7

Number of Fisher Scoring iterations: 4

 factor     AME     SE       z      p   lower   upper
 education  0.0654  0.0080   8.1750 0.0000  0.0497  0.0811
 experience 0.0393  0.0050   7.8600 0.0000  0.0295  0.0491
 age       -0.0049  0.0030  -1.6333 0.1024 -0.0108  0.0010

Odds ratios:
(Intercept)   education  experience         age
  0.0145234   1.3667894   1.2064231   0.9768543

Poisson and Negative Binomial (Count Data)

Count outcomes (number of patents, doctor visits, accidents) require models that respect their non-negative, integer nature. The Poisson model assumes E[Y|X] = exp(Xβ) and Var(Y|X) = E[Y|X]—the mean equals the variance. This "equidispersion" assumption often fails in practice; when variance exceeds the mean (overdispersion), use negative binomial instead.

Pseudo-Poisson MLE (PPML)

An important insight: Poisson regression with robust standard errors is consistent even if the data aren't Poisson-distributed, as long as the conditional mean is correctly specified. This "Poisson Pseudo-MLE" is widely used in trade economics for gravity equations (Santos Silva & Tenreyro, 2006) because it handles zeros naturally and is robust to heteroskedasticity.

# Python: Poisson Regression
import statsmodels.formula.api as smf

# Poisson (count data)
poisson = smf.poisson('num_patents ~ rd_spending + firm_size', data=df).fit()
print(poisson.summary())

# Negative binomial (allows overdispersion)
from statsmodels.discrete.discrete_model import NegativeBinomial
nb = smf.negativebinomial('num_patents ~ rd_spending + firm_size', data=df).fit()

# Robust Poisson (PPML for trade gravity equations)
poisson_robust = smf.poisson('trade ~ distance + gdp_origin + gdp_dest', data=df).fit(
    cov_type='HC1'
)
* Stata: Count Data Models

* Poisson
poisson num_patents rd_spending firm_size

* Negative binomial
nbreg num_patents rd_spending firm_size

* Poisson with robust SEs (PPML)
ppmlhdfe trade distance gdp_origin gdp_dest, absorb(origin dest) cluster(pair)

* Check for overdispersion
estat gof
# R: Count Data Models

# Poisson
poisson <- glm(num_patents ~ rd_spending + firm_size,
               data = df,
               family = poisson())
summary(poisson)

# Negative binomial
library(MASS)
nb <- glm.nb(num_patents ~ rd_spending + firm_size, data = df)
summary(nb)

# With fixest (fast, with fixed effects)
library(fixest)
poisson_fe <- fepois(num_patents ~ rd_spending | firm_id + year,
                     data = df)
Python Output
Optimization terminated successfully.
         Current function value: 2.345678
         Iterations 5
                          Poisson Regression Results
==============================================================================
Dep. Variable:            num_patents   No. Observations:                  800
Model:                        Poisson   Df Residuals:                      797
Method:                           MLE   Df Model:                            2
Date:                Mon, 27 Jan 2026   Pseudo R-squ.:                  0.1876
Converged:                       True   Log-Likelihood:                -1876.5
LLR p-value:                 3.45e-38   LL-Null:                       -2310.2
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.4523      0.087      5.199      0.000       0.282       0.623
rd_spending    0.0234      0.003      7.800      0.000       0.018       0.029
firm_size      0.0876      0.012      7.300      0.000       0.064       0.111
==============================================================================

Interpretation: A 1 million increase in R&D spending is associated with
exp(0.0234) = 1.024 times more patents (2.4% increase).

Mean of num_patents: 4.32
Variance of num_patents: 12.87
Dispersion ratio: 2.98 (evidence of overdispersion)
Stata Output
. poisson num_patents rd_spending firm_size

Iteration 0:   log likelihood = -2310.2345
Iteration 1:   log likelihood = -1889.4567
Iteration 2:   log likelihood = -1876.7890
Iteration 3:   log likelihood = -1876.5432
Iteration 4:   log likelihood = -1876.5432

Poisson regression                              Number of obs     =        800
                                                LR chi2(2)        =     867.38
                                                Prob > chi2       =     0.0000
Log likelihood = -1876.5432                     Pseudo R2         =     0.1876

------------------------------------------------------------------------------
 num_patents |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 rd_spending |   .0234123   .0030012     7.80   0.000     .0175301    .0292945
   firm_size |   .0876234   .0120012     7.30   0.000     .0640816    .1111652
       _cons |   .4523456   .0870123     5.20   0.000     .2818036    .6228876
------------------------------------------------------------------------------

. estat gof

         Goodness-of-fit chi2  =  2345.678
         Prob > chi2(797)      =    0.0000

Note: Large chi2 suggests overdispersion. Consider negative binomial.

. nbreg num_patents rd_spending firm_size

Negative binomial regression                    Number of obs     =        800
                                                LR chi2(2)        =     234.56
Dispersion     = mean                           Prob > chi2       =     0.0000
Log likelihood = -1654.3210                     Pseudo R2         =     0.0662

------------------------------------------------------------------------------
 num_patents |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 rd_spending |   .0198765   .0042345     4.70   0.000     .0115771    .0281759
   firm_size |   .0756432   .0165432     4.57   0.000     .0432191    .1080673
       _cons |   .5234567   .1123456     4.66   0.000     .3032615    .7436519
-------------+----------------------------------------------------------------
    /lnalpha |   .4567891   .0876543                      .2849888    .6285894
-------------+----------------------------------------------------------------
       alpha |   1.578912   .1384567                      1.329876    1.874987
------------------------------------------------------------------------------
Likelihood-ratio test of alpha=0: chibar2(01) = 444.44 Prob >= chibar2 = 0.000
R Output
Call:
glm(formula = num_patents ~ rd_spending + firm_size, family = poisson(),
    data = df)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.4567  -1.2345  -0.2345   0.8765   4.5678

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.452346   0.087012   5.199 2.01e-07 ***
rd_spending  0.023412   0.003001   7.800 6.21e-15 ***
firm_size    0.087623   0.012001   7.300 2.87e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 4620.5  on 799  degrees of freedom
Residual deviance: 3753.1  on 797  degrees of freedom
AIC: 3759.1

Overdispersion test: ratio = 2.98 (variance/mean)
Consider negative binomial regression.

Call: glm.nb(formula = num_patents ~ rd_spending + firm_size)

              Estimate Std. Error z value Pr(>|z|)
(Intercept)   0.523457   0.112346   4.659 3.17e-06 ***
rd_spending   0.019877   0.004235   4.695 2.67e-06 ***
firm_size     0.075643   0.016543   4.573 4.80e-06 ***

Theta:  1.5789
Std. Err.:  0.1385

2 x log-likelihood:  -3304.642

7.5 MLE and GMM

Maximum Likelihood Estimation (MLE) and Generalized Method of Moments (GMM) are the two general frameworks underlying most econometric estimators. Understanding them helps you: (1) implement custom models not available in standard packages, (2) understand the properties of standard estimators, and (3) diagnose estimation problems.

Maximum Likelihood (MLE)

Find parameters that maximize the probability of observing the data.

  • Requires specifying the full distribution of Y|X
  • Efficient if the distribution is correct
  • Inconsistent if distribution is misspecified
  • Examples: Logit, Probit, Tobit, Poisson

Generalized Method of Moments (GMM)

Find parameters that make moment conditions hold in the sample.

  • Only requires specifying moment conditions (E[g(Y,X,θ)]=0)
  • More robust but less efficient than MLE
  • Includes OLS and IV as special cases
  • Examples: IV/2SLS, panel GMM, difference GMM

Custom MLE

Sometimes you need to estimate a model that isn't available in standard packages. The approach is straightforward: (1) write down the log-likelihood function, (2) use numerical optimization to find the maximum. The code below shows a simple example—linear regression estimated via MLE rather than OLS.

What to look for in the code

The key components are: (1) defining the negative log-likelihood function (negative because optimizers minimize by default), (2) choosing starting values, and (3) selecting an optimization algorithm (BFGS is a good default). Standard errors come from the inverse of the Hessian matrix at the optimum. For complex models, check that the optimizer converged and try multiple starting values.

# Python: Custom MLE
from scipy.optimize import minimize
import numpy as np

def neg_log_likelihood(params, X, y):
    """Negative log-likelihood for linear regression."""
    beta = params[:-1]
    sigma = params[-1]
    if sigma <= 0:
        return np.inf

    y_hat = X @ beta
    residuals = y - y_hat

    # Normal log-likelihood
    ll = -0.5 * len(y) * np.log(2 * np.pi * sigma**2)
    ll -= np.sum(residuals**2) / (2 * sigma**2)
    return -ll  # negative because we minimize

# Estimate
X = np.column_stack([np.ones(len(df)), df['x1'], df['x2']])
y = df['y'].values
initial = np.zeros(X.shape[1] + 1)
initial[-1] = 1  # initial sigma

result = minimize(neg_log_likelihood, initial, args=(X, y), method='BFGS')
print("MLE estimates:", result.x)
* Stata: Custom MLE with ml

* Define the log-likelihood function
program mylogit_ll
    args lnf xb
    quietly replace `lnf' = $ML_y1 * `xb' - ln(1 + exp(`xb'))
end

* Estimate
ml model lf mylogit_ll (outcome = x1 x2)
ml maximize
# R: Custom MLE

# Define negative log-likelihood
neg_ll <- function(params, X, y) {
  beta <- params[-length(params)]
  sigma <- params[length(params)]
  if (sigma <= 0) return(Inf)

  y_hat <- X %*% beta
  resid <- y - y_hat

  ll <- -0.5 * length(y) * log(2 * pi * sigma^2)
  ll <- ll - sum(resid^2) / (2 * sigma^2)
  return(-ll)
}

# Estimate
X <- cbind(1, df$x1, df$x2)
y <- df$y
initial <- c(rep(0, ncol(X)), 1)

result <- optim(initial, neg_ll, X = X, y = y, method = "BFGS", hessian = TRUE)
print(result$par)

# Standard errors from Hessian
se <- sqrt(diag(solve(result$hessian)))
Python Output
Optimization terminated successfully.
         Current function value: 567.234567
         Iterations: 23
         Function evaluations: 45
         Gradient evaluations: 45

MLE estimates: [ 2.34567123  1.87654321  0.54321098  1.23456789]

Parameter interpretation:
  beta_0 (intercept): 2.3457
  beta_1 (x1):        1.8765
  beta_2 (x2):        0.5432
  sigma:              1.2346

Comparison with OLS:
                   MLE          OLS
  Intercept      2.3457       2.3457
  x1             1.8765       1.8765
  x2             0.5432       0.5432
  Sigma/RMSE     1.2346       1.2346

Note: MLE and OLS give identical point estimates for linear regression
with normally distributed errors. MLE uses n in denominator for sigma,
OLS uses (n-k) giving slightly larger estimate.
Stata Output
. ml model lf mylogit_ll (outcome = x1 x2)

. ml maximize

initial:       log likelihood = -345.67891
alternative:   log likelihood = -234.56789
rescale:       log likelihood = -189.01234
Iteration 0:   log likelihood = -189.01234
Iteration 1:   log likelihood = -156.78901
Iteration 2:   log likelihood = -145.67890
Iteration 3:   log likelihood = -145.23456
Iteration 4:   log likelihood = -145.23451
Iteration 5:   log likelihood = -145.23451

Maximum likelihood estimation                   Number of obs     =        500

                                                Wald chi2(2)      =     123.45
Log likelihood = -145.23451                     Prob > chi2       =     0.0000

------------------------------------------------------------------------------
     outcome |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   1.234567   .1234567    10.00   0.000     .9925922    1.476542
          x2 |   .5678901   .0876543     6.48   0.000     .3960899    .7396903
       _cons |  -2.345678   .3456789    -6.79   0.000    -3.023197   -1.668159
------------------------------------------------------------------------------

Note: Custom MLE converged successfully.
Log-likelihood at maximum: -145.23451
R Output
$par
[1]  2.34567123  1.87654321  0.54321098  1.23456789

$value
[1] 567.2346

$counts
function gradient
      45       45

$convergence
[1] 0

$message
NULL

Parameter estimates:
  Intercept:  2.3457
  x1:         1.8765
  x2:         0.5432
  sigma:      1.2346

Standard errors (from Hessian):
  SE(Intercept): 0.2134
  SE(x1):        0.1023
  SE(x2):        0.0876
  SE(sigma):     0.0392

95% Confidence intervals:
              Estimate     SE    Lower    Upper
Intercept       2.3457 0.2134   1.9274   2.7640
x1              1.8765 0.1023   1.6760   2.0770
x2              0.5432 0.0876   0.3715   0.7149
sigma           1.2346 0.0392   1.1578   1.3114

Log-likelihood at maximum: -567.2346
AIC: 1142.469
BIC: 1159.234

7.6 Bootstrap Methods

Bootstrap is a resampling method that estimates the sampling distribution of a statistic by repeatedly drawing samples (with replacement) from the original data. It's especially useful when:

  • Analytical formulas don't exist: Complex estimators, ratios, quantiles
  • Asymptotic approximations are poor: Small samples, heavy-tailed distributions
  • Testing with few clusters: Wild cluster bootstrap for reliable inference with 10-50 clusters
Types of bootstrap

Nonparametric: Resample observations with replacement (shown below). Works for most situations.

Cluster bootstrap: Resample entire clusters, not individual observations. Essential for clustered data.

Wild cluster bootstrap: A variant that works well with few clusters (Cameron, Gelbach & Miller, 2008). Particularly important for difference-in-differences with few treated states/periods.

# Python: Bootstrap
import numpy as np
import statsmodels.formula.api as smf

def bootstrap_coef(df, formula, n_bootstrap=1000):
    """Bootstrap standard errors for regression coefficients."""
    n = len(df)
    coefs = []

    for _ in range(n_bootstrap):
        # Resample with replacement
        sample = df.sample(n, replace=True)
        model = smf.ols(formula, data=sample).fit()
        coefs.append(model.params.values)

    coefs = np.array(coefs)
    return {
        'mean': coefs.mean(axis=0),
        'se': coefs.std(axis=0),
        'ci_lower': np.percentile(coefs, 2.5, axis=0),
        'ci_upper': np.percentile(coefs, 97.5, axis=0)
    }

results = bootstrap_coef(df, 'y ~ x1 + x2')
print("Bootstrap SEs:", results['se'])
* Stata: Bootstrap

* Simple bootstrap
bootstrap _b, reps(1000) seed(12345): reg outcome treatment controls

* Cluster bootstrap (for clustered data)
bootstrap _b, reps(1000) seed(12345) cluster(school_id): ///
    reg outcome treatment controls

* Pairs cluster bootstrap (Stata 18+)
vce(bootstrap, cluster(school_id) reps(1000))
# R: Bootstrap
library(boot)

# Function to extract coefficients
boot_fn <- function(data, indices) {
  d <- data[indices, ]
  model <- lm(y ~ x1 + x2, data = d)
  return(coef(model))
}

# Run bootstrap
results <- boot(df, boot_fn, R = 1000)
print(results)

# Confidence intervals
boot.ci(results, type = c("perc", "bca"), index = 2)

# Cluster bootstrap
library(fwildclusterboot)
model <- lm(outcome ~ treatment, data = df)
boottest(model, param = "treatment", clustid = ~ school_id, B = 999)
Python Output
Running bootstrap with 1000 replications...
[========================================] 100% Complete

Bootstrap Results (1000 replications):
======================================

                   OLS Est.   Bootstrap SE   95% CI Lower   95% CI Upper
Intercept            2.3457         0.2187         1.9234         2.7812
x1                   1.8765         0.1034         1.6723         2.0798
x2                   0.5432         0.0912         0.3645         0.7234

Comparison of Standard Errors:
                   OLS SE    Bootstrap SE    Ratio
Intercept          0.2134          0.2187    1.025
x1                 0.1023          0.1034    1.011
x2                 0.0876          0.0912    1.041

Note: Bootstrap SEs are slightly larger, suggesting mild heteroskedasticity.
Percentile confidence intervals are asymmetric, which is appropriate for
finite-sample inference.
Stata Output
. bootstrap _b, reps(1000) seed(12345): reg outcome treatment controls
(running regress on estimation sample)

Bootstrap replications (1000)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
..................................................   100
..................................................   150
..................................................   200
..................................................   250
..................................................   300
..................................................   350
..................................................   400
..................................................   450
..................................................   500
..................................................   550
..................................................   600
..................................................   650
..................................................   700
..................................................   750
..................................................   800
..................................................   850
..................................................   900
..................................................   950
..................................................  1000

Linear regression                               Number of obs     =        500
                                                Replications      =      1,000

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
     outcome |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   treatment |   1.872431   .1876543     9.98   0.000     1.504635    2.240227
    controls |   .4215328   .0998765     4.22   0.000     .2257782    .6172874
       _cons |   2.453124   .3234567     7.59   0.000     1.819161    3.087087
------------------------------------------------------------------------------

. bootstrap _b, reps(1000) seed(12345) cluster(school_id): ///
>     reg outcome treatment controls

Bootstrap replications (1000)
(output omitted)

Linear regression                               Number of obs     =      2,500
                                                Replications      =      1,000

                                  (Replications based on 50 clusters in school_id)
------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
     outcome |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   treatment |   2.154312   .2456789     8.77   0.000     1.672789    2.635835
    controls |   .3821456   .1234567     3.10   0.002     .1401729    .6241183
       _cons |   1.287645   .4876543     2.64   0.008     .3318595    2.243431
------------------------------------------------------------------------------
Note: Cluster bootstrap with 50 clusters
R Output
ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = df, statistic = boot_fn, R = 1000)

Bootstrap Statistics :
      original       bias    std. error
t1*   2.345671  0.002345123   0.2187654
t2*   1.876543 -0.001234567   0.1034567
t3*   0.543211  0.000876543   0.0912345

BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL :
boot.ci(boot.out = results, type = c("perc", "bca"), index = 2)

Intervals :
Level     Percentile            BCa
95%   ( 1.672,  2.080 )   ( 1.668,  2.076 )
Calculations and calculation for BCa

         estimate     se    2.5%   97.5%
x1         1.8765 0.1035  1.6723  2.0798

Wild Cluster Bootstrap (fwildclusterboot):
==========================================
boottest(model, param = "treatment", clustid = ~ school_id, B = 999)

 Estimate: 2.154312
 Std. Err. (clustered): 0.217384

 Wild Cluster Bootstrap Confidence Interval (Rademacher weights):
 95% CI: [1.7234, 2.5876]

 Bootstrap p-value: 0.000
 Number of clusters: 50
 Number of bootstrap iterations: 999

Note: Wild cluster bootstrap is recommended when number of clusters < 50.
Key References
  • Cameron, A.C. & Trivedi, P. (2005). Microeconometrics: Methods and Applications. Cambridge.
  • Wooldridge, J. (2010). Econometric Analysis of Cross Section and Panel Data. MIT Press.
  • Cameron, A.C. & Miller, D. (2015). "A Practitioner's Guide to Cluster-Robust Inference." JHR.
  • Abadie, A., Athey, S., Imbens, G., & Wooldridge, J. (2023). "When Should You Adjust Standard Errors for Clustering?" QJE.