11A Regularization
Learning Objectives
- Understand why penalizing model complexity improves predictions
- Implement Ridge, LASSO, and Elastic Net regression
- Use cross-validation to select the optimal penalty parameter
- Apply post-LASSO methods for valid causal inference after variable selection
The Idea Behind Regularization
In standard OLS regression, we estimate coefficients by minimizing the sum of squared residuals: find the β that makes the predicted values as close to the observed values as possible. This works beautifully when we have a modest number of predictors relative to observations. But when the number of predictors is large — especially when p approaches or exceeds n — OLS overfits badly. It fits the noise in the training data, not the signal, and its predictions on new data can be wildly inaccurate. You saw this problem in the polynomial example in the main module page.
Regularization addresses this by adding a penalty term to the objective function that discourages large coefficients. Instead of simply minimizing residuals, we minimize residuals plus a penalty for model complexity. The penalty forces the model to “keep things simple” by shrinking coefficients toward zero. Larger coefficients are more expensive, so the model must have a good reason (strong signal in the data) to give any predictor a large coefficient.
Think of it like a budget constraint in economics. Without regularization, the model can allocate unlimited “weight” to each predictor — it spends freely. With regularization, the model has a fixed budget of coefficient magnitude to distribute across all predictors, so it must allocate wisely. Predictors with strong signal receive substantial weight; predictors with weak signal or noise get shrunk toward zero. This is exactly the bias-variance tradeoff at work: regularization introduces some bias (coefficients are systematically shrunk) but dramatically reduces variance (predictions are more stable across samples), and the net result is often a substantial improvement in prediction accuracy.
The general form of the regularized objective function is:

β̂ = argmin_β  Σᵢ (yᵢ − xᵢ′β)² + λ · P(β)

where P(β) is the penalty function.
The parameter λ (lambda) controls the strength of the penalty. When λ = 0, there is no penalty and we get standard OLS. As λ increases, the penalty becomes stronger and coefficients are shrunk more aggressively toward zero. The three main regularization methods differ in their choice of penalty function, which leads to fundamentally different behavior.
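To make the effect of λ concrete, here is a small illustrative sketch (simulated data and an arbitrary λ grid, not from the text): refitting the same Ridge model at increasing penalty strengths shrinks the coefficient vector toward zero.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Illustrative sketch: simulated data, arbitrary lambda grid.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
y = X @ np.arange(1, 11, dtype=float) + rng.standard_normal(100)

lambdas = [1e-3, 1.0, 10.0, 100.0, 1000.0]
# Squared L2 norm of the fitted coefficient vector at each penalty strength
norms = [float(np.sum(Ridge(alpha=lam).fit(X, y).coef_ ** 2)) for lam in lambdas]
for lam, nrm in zip(lambdas, norms):
    print(f"lambda = {lam:8.3f} -> sum of squared coefficients = {nrm:8.2f}")
```

The coefficient norm falls monotonically as λ grows: at λ ≈ 0 we recover (nearly) OLS, and at very large λ every coefficient is shrunk close to zero.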
Geometric Intuition: Constraint Regions
[Figure: constraint regions for Ridge and LASSO.
Ridge (L2) — circle: shrinks all coefficients, but none reach exactly zero.
LASSO (L1) — diamond: the OLS contours hit corners, so some coefficients are exactly 0.
The constrained solution is where the OLS contours first touch the constraint region.]
Ridge Regression (L2)
Ridge regression uses the L2 penalty — the sum of squared coefficients. The objective function is:

β̂_ridge = argmin_β  Σᵢ (yᵢ − xᵢ′β)² + λ Σⱼ βⱼ²
The parameter λ controls the penalty strength. When λ = 0, we get OLS. As λ grows, every coefficient is shrunk toward zero — but crucially, no coefficient is set to exactly zero. Ridge regression keeps all predictors in the model; it just makes their coefficients smaller. Geometrically, the constraint region is a circle (or sphere in higher dimensions). The elliptical OLS contours touch the circle smoothly, so the constrained solution lands on the circle’s surface — where all coordinates are nonzero.
Ridge regression is especially useful when you have many correlated predictors. In this setting, OLS coefficients are unstable: small changes in the data can cause large swings in individual coefficients, even though predictions might be similar. Ridge stabilizes the estimates by spreading the weight more evenly across correlated predictors. Think of it as a form of shrinkage that calms down the wild fluctuations that OLS produces when predictors are highly collinear.
One important practical detail: because the penalty depends on the magnitude of coefficients, you must standardize your features before fitting Ridge (and all other regularized models). If one feature is measured in dollars and another in millions of dollars, the latter will have tiny coefficients simply because of its scale, not because it is unimportant. Standardizing ensures that all features are penalized equally.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
# Generate data with 50 features (many irrelevant)
np.random.seed(42)
n = 200
X = np.random.randn(n, 50)
y = 3*X[:, 0] + 2*X[:, 1] - 1.5*X[:, 2] + 0.8*X[:, 3] + np.random.randn(n)*0.5
# Split and standardize
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)
# Fit Ridge with cross-validated alpha selection
alphas = np.logspace(-4, 4, 50)
ridge = RidgeCV(alphas=alphas, cv=5)
ridge.fit(X_train_sc, y_train)
# Evaluate
y_pred = ridge.predict(X_test_sc)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"Best alpha (lambda): {ridge.alpha_:.4f}")
print(f"Test RMSE: {rmse:.4f}")
print(f"Non-zero coefs: {np.sum(ridge.coef_ != 0)} out of {len(ridge.coef_)}")
print(f"(Ridge never sets coefficients to exactly zero)")
* Generate data with 50 features
clear
set seed 42
set obs 200
forvalues i = 1/50 {
gen x`i' = rnormal()
}
gen y = 3*x1 + 2*x2 - 1.5*x3 + 0.8*x4 + rnormal()*0.5
* Fit Ridge regression using lasso2 with alpha(0)
* alpha(0) = Ridge, alpha(1) = LASSO
lasso2 y x1-x50, alpha(0) lambda(0.01 0.1 1 10 100)
* Cross-validated Ridge
cvlasso y x1-x50, alpha(0) lopt
* Predict and evaluate
predict yhat
gen resid_sq = (y - yhat)^2
summarize resid_sq
display "RMSE: " sqrt(r(mean))
library(glmnet)
# Generate data with 50 features
set.seed(42)
n <- 200
X <- matrix(rnorm(n * 50), ncol = 50)
y <- 3*X[,1] + 2*X[,2] - 1.5*X[,3] + 0.8*X[,4] + rnorm(n)*0.5
# Train/test split
train_idx <- sample(1:n, 160)
X_train <- X[train_idx, ]; y_train <- y[train_idx]
X_test <- X[-train_idx, ]; y_test <- y[-train_idx]
# Fit Ridge with cross-validation (alpha = 0 for Ridge)
cv_ridge <- cv.glmnet(X_train, y_train, alpha = 0)
# Best lambda and predictions
best_lambda <- cv_ridge$lambda.min
y_pred <- predict(cv_ridge, X_test, s = best_lambda)
rmse <- sqrt(mean((y_test - y_pred)^2))
cat("Best lambda: ", round(best_lambda, 4), "\n")
cat("Test RMSE: ", round(rmse, 4), "\n")
coefs <- as.numeric(coef(cv_ridge, s = best_lambda))[-1]
cat("Non-zero coefs:", sum(coefs != 0), "out of", length(coefs), "\n")
Best alpha (lambda): 0.2848
Test RMSE: 0.5314
Non-zero coefs: 50 out of 50
(Ridge never sets coefficients to exactly zero)

Cross-validation results (alpha = 0, Ridge):
Optimal lambda: 0.3124
CV MSPE: 0.2856
RMSE: .5347

Best lambda: 0.2713
Test RMSE: 0.5289
Non-zero coefs: 50 out of 50
LASSO (L1)
The LASSO (Least Absolute Shrinkage and Selection Operator), introduced by Tibshirani (1996), uses the L1 penalty — the sum of absolute values of coefficients:

β̂_lasso = argmin_β  Σᵢ (yᵢ − xᵢ′β)² + λ Σⱼ |βⱼ|
The critical difference from Ridge is that LASSO can set coefficients to exactly zero, thereby performing automatic variable selection. This means the LASSO does not just shrink coefficients — it identifies which predictors are important and discards the rest entirely. This makes the LASSO particularly attractive when you believe that only a subset of your many predictors actually matter (a condition called sparsity).
Why does the L1 penalty produce exact zeros while L2 does not? The answer lies in the geometry. The L1 constraint region is shaped like a diamond (in 2D) or a cross-polytope (in higher dimensions). This shape has corners that lie on the coordinate axes — that is, at points where one or more coordinates are exactly zero. The elliptical contours of the OLS objective function tend to first touch the diamond at one of these corners, producing a solution with some coefficients at zero. By contrast, the L2 constraint region (a circle) has no corners, so the contours touch it at a point where all coordinates are generically nonzero.
The LASSO is your go-to method when you have many potential predictors but suspect that only a few are truly relevant. It is widely used in economics for variable selection in high-dimensional settings — for example, selecting which of hundreds of control variables to include in a regression.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# Same data as before
np.random.seed(42)
n = 200
X = np.random.randn(n, 50)
y = 3*X[:, 0] + 2*X[:, 1] - 1.5*X[:, 2] + 0.8*X[:, 3] + np.random.randn(n)*0.5
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)
# Fit LASSO with cross-validation
lasso = LassoCV(cv=5, random_state=42)
lasso.fit(X_train_sc, y_train)
# Evaluate
y_pred = lasso.predict(X_test_sc)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"Best alpha (lambda): {lasso.alpha_:.6f}")
print(f"Test RMSE: {rmse:.4f}")
print(f"Non-zero coefs: {np.sum(lasso.coef_ != 0)} out of {len(lasso.coef_)}")
# Show which features were selected
selected = np.where(lasso.coef_ != 0)[0]
print(f"Selected features: {selected}")
print(f"(True features are 0, 1, 2, 3)")
* LASSO with cross-validation
* alpha(1) = LASSO (default)
cvlasso y x1-x50, lopt
* Show selected variables
lasso2 y x1-x50, lambda(`e(lopt)')
* View selected coefficients
display "Number of selected variables: " e(s)
* Predict and evaluate
predict yhat_lasso
gen resid_sq_lasso = (y - yhat_lasso)^2
summarize resid_sq_lasso
display "RMSE: " sqrt(r(mean))
library(glmnet)
# Same data
set.seed(42)
n <- 200
X <- matrix(rnorm(n * 50), ncol = 50)
y <- 3*X[,1] + 2*X[,2] - 1.5*X[,3] + 0.8*X[,4] + rnorm(n)*0.5
train_idx <- sample(1:n, 160)
X_train <- X[train_idx, ]; y_train <- y[train_idx]
X_test <- X[-train_idx, ]; y_test <- y[-train_idx]
# Fit LASSO with cross-validation (alpha = 1 for LASSO)
cv_lasso <- cv.glmnet(X_train, y_train, alpha = 1)
# Extract results
coef_lasso <- coef(cv_lasso, s = "lambda.min")
nonzero <- sum(coef_lasso[-1] != 0) # exclude intercept
y_pred <- predict(cv_lasso, X_test, s = "lambda.min")
rmse <- sqrt(mean((y_test - y_pred)^2))
cat("Best lambda: ", round(cv_lasso$lambda.min, 6), "\n")
cat("Test RMSE: ", round(rmse, 4), "\n")
cat("Non-zero coefs: ", nonzero, "out of 50\n")
cat("(True features: 1, 2, 3, 4)\n")
Best alpha (lambda): 0.024813
Test RMSE: 0.4987
Non-zero coefs: 5 out of 50
Selected features: [0 1 2 3 27]
(True features are 0, 1, 2, 3)

Cross-validation results (alpha = 1, LASSO):
Optimal lambda: 0.02314
Number of selected variables: 5
RMSE: .5021

Best lambda: 0.021738
Test RMSE: 0.5034
Non-zero coefs: 6 out of 50
(True features: 1, 2, 3, 4)
Notice that the LASSO correctly identified the 4 true predictors (features 0-3) while setting most of the 46 irrelevant features to exactly zero. It may include one or two false positives — this is normal and depends on the particular sample. The key result is that LASSO achieves variable selection automatically, unlike Ridge which keeps all 50 features.
Elastic Net
The Elastic Net, introduced by Zou and Hastie (2005), combines the L1 and L2 penalties into a single objective:

β̂_enet = argmin_β  Σᵢ (yᵢ − xᵢ′β)² + λ [ α Σⱼ |βⱼ| + (1 − α) Σⱼ βⱼ² ]
The parameter α (alpha) controls the mix between L1 and L2: α = 1 gives pure LASSO, α = 0 gives pure Ridge. Values in between provide a compromise. The Elastic Net inherits variable selection from LASSO (through the L1 part) while also benefiting from Ridge’s stability with correlated predictors (through the L2 part).
When should you use Elastic Net instead of pure LASSO? The main situation is when you have groups of correlated predictors. LASSO has a tendency to arbitrarily pick one variable from a group of correlated predictors and zero out the rest — which one it picks can change from sample to sample. The Elastic Net’s L2 component encourages it to keep correlated predictors together: if several variables carry similar information, the Elastic Net will include all of them with similar coefficients rather than picking just one.
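A hedged sketch of this grouping behavior (simulated data and illustrative penalty values, not from the text): x1 and x2 are near-duplicates carrying the same signal, and x3 is irrelevant noise.

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

# Simulated data: x1 and x2 are almost perfectly correlated.
rng = np.random.default_rng(1)
n = 200
x1 = rng.standard_normal(n)
x2 = x1 + 0.01 * rng.standard_normal(n)   # correlation with x1 ~ 0.9999
x3 = rng.standard_normal(n)               # irrelevant noise predictor
X = np.column_stack([x1, x2, x3])
y = 2 * x1 + 2 * x2 + 0.5 * rng.standard_normal(n)

lasso = Lasso(alpha=0.5).fit(X, y)
enet = ElasticNet(alpha=0.5, l1_ratio=0.2).fit(X, y)   # mostly-L2 mix

print("LASSO coefficients:      ", np.round(lasso.coef_, 2))
print("Elastic Net coefficients:", np.round(enet.coef_, 2))
# LASSO tends to concentrate weight on one of the correlated pair, while the
# Elastic Net spreads roughly equal weight across both.
```

Which of the pair LASSO keeps can flip from sample to sample; the Elastic Net's L2 component makes the nearly equal split stable.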
from sklearn.linear_model import ElasticNetCV
# Fit Elastic Net with CV over both alpha and l1_ratio
enet = ElasticNetCV(
l1_ratio=[.1, .5, .7, .9, .95, 1],
cv=5,
random_state=42
)
enet.fit(X_train_sc, y_train)
y_pred_enet = enet.predict(X_test_sc)
rmse_enet = np.sqrt(mean_squared_error(y_test, y_pred_enet))
print(f"Best alpha: {enet.alpha_:.6f}")
print(f"Best l1_ratio: {enet.l1_ratio_:.2f}")
print(f"Test RMSE: {rmse_enet:.4f}")
print(f"Non-zero coefs: {np.sum(enet.coef_ != 0)} out of {len(enet.coef_)}")
* Elastic Net with alpha = 0.5 (equal mix of L1 and L2)
* (lassopack's lasso2 fits the Elastic Net via the alpha() option)
lasso2 y x1-x50, alpha(0.5)
* Cross-validated Elastic Net
cvlasso y x1-x50, alpha(0.5) lopt
* Check results
display "Selected variables: " e(s)
display "Optimal lambda: " e(lopt)
# Elastic Net: alpha = 0.5 (equal mix of L1 and L2)
cv_enet <- cv.glmnet(X_train, y_train, alpha = 0.5)
y_pred_enet <- predict(cv_enet, X_test, s = "lambda.min")
rmse_enet <- sqrt(mean((y_test - y_pred_enet)^2))
coef_enet <- coef(cv_enet, s = "lambda.min")
cat("Best lambda: ", round(cv_enet$lambda.min, 6), "\n")
cat("Test RMSE: ", round(rmse_enet, 4), "\n")
cat("Non-zero coefs:", sum(coef_enet[-1] != 0), "out of 50\n")
Best alpha: 0.024813
Best l1_ratio: 0.95
Test RMSE: 0.5012
Non-zero coefs: 5 out of 50

Cross-validation results (alpha = 0.5, Elastic Net):
Optimal lambda: .04628
Selected variables: 6

Best lambda: 0.043476
Test RMSE: 0.5078
Non-zero coefs: 5 out of 50
Cross-Validated Penalty Selection
The penalty parameter λ is a hyperparameter — it is not estimated from the data by the model itself but must be chosen by the researcher. Choosing λ too small gives us something close to OLS (underfitting risk); choosing it too large shrinks everything to zero (the model predicts the mean for everyone). The standard approach is to use cross-validation to select λ from data.
The procedure works as follows. Define a grid of λ values (typically logarithmically spaced from very small to very large). For each λ, run K-fold cross-validation and compute the average CV error. This produces a curve of CV error as a function of λ. Two values along this curve are commonly reported:
- λ_min: the value of λ that minimizes the cross-validation error. This gives the best-predicting model.
- λ_1se: the largest λ whose CV error lies within one standard error of the minimum. This gives a more parsimonious model that is nearly as accurate as the best one. It is often preferred in practice because it selects fewer variables and is more interpretable, with only a small cost in prediction accuracy.
The choice between λ_min and λ_1se depends on your priorities. If pure prediction accuracy is your goal, use λ_min. If you want a simpler, more interpretable model that is nearly as good, use λ_1se. In economics applications where you will subsequently interpret the selected variables, λ_1se is often the better choice.
import numpy as np
from sklearn.linear_model import LassoCV
# Fit LASSO with detailed CV information
lasso_cv = LassoCV(cv=5, n_alphas=100, random_state=42)
lasso_cv.fit(X_train_sc, y_train)
# The CV curve: mean MSE across folds for each alpha
mse_mean = np.mean(lasso_cv.mse_path_, axis=1)
# Across-fold std (note: glmnet's 1-SE rule uses the standard error, std/sqrt(K))
mse_std = np.std(lasso_cv.mse_path_, axis=1)
# lambda_min (best alpha)
best_idx = np.argmin(mse_mean)
print(f"lambda_min: {lasso_cv.alphas_[best_idx]:.6f}")
print(f"CV MSE at min: {mse_mean[best_idx]:.4f}")
# lambda_1se: largest alpha within 1 SE of minimum
threshold = mse_mean[best_idx] + mse_std[best_idx]
candidates = np.where(mse_mean <= threshold)[0]
idx_1se = candidates[0]  # alphas_ are sorted in descending order, so index 0 is the largest alpha
print(f"lambda_1se: {lasso_cv.alphas_[idx_1se]:.6f}")
print(f"CV MSE at 1se: {mse_mean[idx_1se]:.4f}")
print(f"\nNon-zero at min: {np.sum(lasso_cv.coef_ != 0)}")
* Cross-validated LASSO with lambda path
cvlasso y x1-x50
* Display both lambda_min and lambda_1se
display "Lambda (min CV): " e(lopt)
display "Lambda (1 SE): " e(lse)
* Fit at lambda_1se for a more parsimonious model
lasso2 y x1-x50, lambda(`e(lse)')
display "Variables selected (1se): " e(s)
# The cv.glmnet object contains both lambda.min and lambda.1se
cv_lasso <- cv.glmnet(X_train, y_train, alpha = 1)
cat("lambda.min: ", round(cv_lasso$lambda.min, 6), "\n")
cat("lambda.1se: ", round(cv_lasso$lambda.1se, 6), "\n")
# Compare number of selected variables
coef_min <- coef(cv_lasso, s = "lambda.min")
coef_1se <- coef(cv_lasso, s = "lambda.1se")
cat("Non-zero (min):", sum(coef_min[-1] != 0), "\n")
cat("Non-zero (1se):", sum(coef_1se[-1] != 0), "\n")
# Plot the CV curve
plot(cv_lasso)
title("LASSO Cross-Validation Curve")
lambda_min: 0.024813
CV MSE at min: 0.2714
lambda_1se: 0.068129
CV MSE at 1se: 0.2981
Non-zero at min: 5

Lambda (min CV): .02314
Lambda (1 SE): .06982
Variables selected (1se): 4

lambda.min: 0.021738
lambda.1se: 0.072314
Non-zero (min): 6
Non-zero (1se): 4
Alternative: GridSearchCV for Hyperparameter Tuning
While LassoCV and RidgeCV handle cross-validated penalty selection in one step, you can also use scikit-learn's general-purpose GridSearchCV, which works with any model and any set of hyperparameters. This is particularly useful when tuning multiple hyperparameters simultaneously (e.g., Elastic Net's alpha and l1_ratio), or when you want a unified workflow across different model types.
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Lasso
# Define the parameter grid
param_grid = {'alpha': [0.001, 0.01, 0.1, 1.0, 10.0]}
# GridSearchCV tries every alpha with 5-fold CV
grid_search = GridSearchCV(
Lasso(), param_grid,
cv=5, scoring='neg_mean_squared_error'
)
grid_search.fit(X_train_sc, y_train)
# Best alpha selected by CV
print(f"Best alpha: {grid_search.best_params_['alpha']}")
print(f"Best CV MSE: {-grid_search.best_score_:.4f}")
# Evaluate ONCE on the test set (never touch test set during tuning)
y_pred = grid_search.predict(X_test_sc)
test_rmse = np.sqrt(np.mean((y_pred - y_test)**2))
print(f"Test RMSE: {test_rmse:.4f}")
# R: manual grid search with cv.glmnet
library(glmnet)
# cv.glmnet automatically searches over a lambda grid
cv_lasso <- cv.glmnet(X_train, y_train, alpha = 1, nfolds = 5)
# For multiple hyperparameters, use caret or tidymodels
library(caret)
grid <- expand.grid(alpha = 1, lambda = 10^seq(-3, 1, length = 50))
train_data <- data.frame(y = y_train, X_train)  # assemble a data frame for caret
model <- train(y ~ ., data = train_data,
method = "glmnet",
trControl = trainControl(method = "cv", number = 5),
tuneGrid = grid)
print(model$bestTune)
If you loop over hyperparameter values and pick the one with the best test error, you are effectively overfitting to the test set. The reported test performance will be optimistic. Always use cross-validation on the training set to select hyperparameters, then evaluate on the test set exactly once.
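A toy simulation makes this bias tangible (entirely made up for illustration; every "candidate model" is just an arbitrary coefficient vector and the outcome is pure noise, so no candidate can truly predict). Picking the candidate with the lowest test-set error still makes that error look better than the same model achieves on genuinely fresh data.

```python
import numpy as np

# Toy simulation: selecting the "best" of many candidates on the test set
# produces an optimistically biased estimate of test performance.
rng = np.random.default_rng(0)
n, p, k, reps = 200, 10, 200, 200
wins = 0
for _ in range(reps):
    X_test, X_fresh = rng.standard_normal((2, n, p))
    y_test, y_fresh = rng.standard_normal((2, n))     # pure noise outcomes
    C = 0.1 * rng.standard_normal((k, p))             # k arbitrary "tuned" models
    test_mse = ((X_test @ C.T - y_test[:, None]) ** 2).mean(axis=0)
    b = int(np.argmin(test_mse))                      # winner chosen ON the test set
    fresh_mse = ((X_fresh @ C[b] - y_fresh) ** 2).mean()
    wins += test_mse[b] < fresh_mse                   # did the test set flatter it?
print(f"Selected-on-test error beat fresh-data error in {wins}/{reps} runs")
```

The winner's reported test error is systematically lower than its error on new data — exactly the optimism the warning above describes.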
Post-LASSO and Post-Selection Inference
Warning: Post-Selection Inference is Invalid
Never report LASSO coefficients as causal estimates. LASSO is a prediction tool. If you select variables with LASSO and then run OLS on the selected variables, the resulting confidence intervals and p-values are invalid. For causal inference after variable selection, use post-double-selection or DML methods (see Module 11D).
A common but dangerous practice is the following: run LASSO to select variables, then run OLS on only the selected variables and report the standard OLS confidence intervals and p-values. This procedure produces invalid inference. The reason, established rigorously by Leeb and Pötscher (2005), is that LASSO’s variable selection is data-dependent. Standard inference formulas assume the model was specified a priori — before looking at the data. When the model is chosen from the data itself, the usual standard errors are too small and the confidence intervals are too narrow, leading to spuriously significant results.
The problem is sometimes called the “winner’s curse” of variable selection: variables that happen to look important in the particular sample are more likely to be selected, and their coefficients are biased upward in absolute value. Naively running OLS on the selected variables inherits this bias without correcting for it.
Belloni, Chernozhukov, and Hansen (2014) developed the post-double-selection method as a principled solution. The key insight is that in a causal inference setting with a treatment variable D, an outcome Y, and many potential controls X, you need to select controls that are relevant to both Y and D. The procedure is:
- Run LASSO of Y on X — this selects controls predictive of the outcome.
- Run LASSO of D on X — this selects controls predictive of treatment (potential confounders).
- Take the union of both selected sets.
- Run OLS of Y on D and the union of selected controls. The resulting coefficient on D has valid standard errors.
The double selection step is essential: selecting controls for the outcome alone (step 1) might miss confounders that strongly predict treatment but weakly predict the outcome. Including them in step 2 ensures that the causal estimate of D’s effect is not confounded. This approach is implemented in Stata’s pdslasso command and R’s hdm package. A more general framework — Double/Debiased Machine Learning (DML) — extends these ideas and is covered in detail in Module 11D.
import numpy as np
from sklearn.linear_model import LassoCV
import statsmodels.api as sm
# Simulate causal inference setting
np.random.seed(42)
n = 500
X = np.random.randn(n, 50)
# Treatment depends on X[0] and X[1]
d = 0.5*X[:, 0] + 0.3*X[:, 1] + np.random.randn(n)*0.5
# Outcome depends on treatment, X[0], X[1], and X[2]
# TRUE treatment effect = 1.0
y = 1.0*d + 2*X[:, 0] + 1.5*X[:, 1] + 1*X[:, 2] + np.random.randn(n)*0.5
# Step 1: LASSO of Y on X (select controls for outcome)
lasso_y = LassoCV(cv=5).fit(X, y)
sel_y = set(np.where(lasso_y.coef_ != 0)[0])
# Step 2: LASSO of D on X (select controls for treatment)
lasso_d = LassoCV(cv=5).fit(X, d)
sel_d = set(np.where(lasso_d.coef_ != 0)[0])
# Step 3: Take the union
sel_union = sorted(sel_y | sel_d)
print(f"Selected for Y: {sorted(sel_y)}")
print(f"Selected for D: {sorted(sel_d)}")
print(f"Union: {sel_union}")
# Step 4: OLS of Y on D and union of selected controls
X_selected = X[:, sel_union]
X_ols = sm.add_constant(np.column_stack([d, X_selected]))
ols_result = sm.OLS(y, X_ols).fit()
print(f"\nPost-double-selection estimate of treatment effect:")
print(f" Coefficient: {ols_result.params[1]:.4f}")
print(f" Std error: {ols_result.bse[1]:.4f}")
print(f" 95% CI: [{ols_result.conf_int()[1][0]:.4f}, {ols_result.conf_int()[1][1]:.4f}]")
print(f" (True effect: 1.0)")
* Simulate causal inference setting
clear
set seed 42
set obs 500
forvalues i = 1/50 {
gen x`i' = rnormal()
}
gen d = 0.5*x1 + 0.3*x2 + rnormal()*0.5
gen y = 1.0*d + 2*x1 + 1.5*x2 + 1*x3 + rnormal()*0.5
* Post-double-selection LASSO
* pdslasso automatically performs both selection steps
pdslasso y d (x1-x50)
* The output shows the treatment effect with valid standard errors
display "True treatment effect: 1.0"
library(hdm)
# Simulate causal inference setting
set.seed(42)
n <- 500
X <- matrix(rnorm(n * 50), ncol = 50)
d <- 0.5*X[,1] + 0.3*X[,2] + rnorm(n)*0.5
y <- 1.0*d + 2*X[,1] + 1.5*X[,2] + 1*X[,3] + rnorm(n)*0.5
# Post-double-selection: automatic procedure
result <- rlassoEffect(x = X, y = y, d = d, method = "double selection")
cat("Post-double-selection results:\n")
print(summary(result))
cat("\n(True treatment effect: 1.0)\n")
Selected for Y: [0, 1, 2, 14]
Selected for D: [0, 1]
Union: [0, 1, 2, 14]

Post-double-selection estimate of treatment effect:
  Coefficient: 1.0218
  Std error: 0.0453
  95% CI: [0.9328, 1.1108]
  (True effect: 1.0)
Post-double-selection results:
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-----------+----------------------------------------------------------------
d | 1.01834 .044721 22.77 0.000 .93053 1.10615
-----------+----------------------------------------------------------------
Selected controls: x1 x2 x3
True treatment effect: 1.0

Post-double-selection results:
Estimate: 1.0195
Std. Error: 0.0461
t value: 22.12
Pr(>|t|): < 2e-16
(True treatment effect: 1.0)
Choosing the Right Method
Ridge (L2)
- Many predictors, all potentially relevant
- Highly correlated features
- Does not perform variable selection
- Shrinks coefficients but keeps all
- Generally best when p < n and features are correlated
LASSO (L1)
- Many predictors, only a few matter (sparsity)
- Automatic variable selection
- Sets irrelevant coefficients to exactly zero
- May be unstable with correlated features
- Best when you believe the true model is sparse
Elastic Net
- Many correlated predictors and you want selection
- Compromise between Ridge and LASSO
- Groups correlated variables together
- More stable than LASSO with correlations
- Requires tuning two parameters (λ and α)
Limitations
All three regularization methods share several important limitations that you should keep in mind:
- Linearity assumption: Ridge, LASSO, and Elastic Net all assume a linear relationship between features and the outcome. They cannot capture nonlinearities or interactions unless you manually create polynomial or interaction features. For genuinely nonlinear relationships, consider tree-based methods (Module 11B) or neural networks (Module 11C).
- Feature scaling matters: Because the penalty depends on coefficient magnitudes, you must always standardize your features before fitting regularized models. If you forget, the penalty is applied unevenly: a feature with large numeric values needs only a tiny coefficient and largely escapes the penalty, while a feature with small numeric values needs a large coefficient and is shrunk hardest — regardless of either feature's actual importance.
- Cannot handle categorical variables natively: Categorical variables must be one-hot encoded. When a categorical variable has many levels, LASSO may select some dummies and not others from the same category, which can be hard to interpret. Group LASSO (not covered here) addresses this.
- LASSO is not consistent for variable selection in all settings: The LASSO can fail to select the correct variables when the “irrepresentable condition” is violated — roughly, when irrelevant variables are highly correlated with relevant ones.
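The scaling limitation is easy to demonstrate with a hedged sketch (simulated data and an illustrative penalty value, not from the text): x1 and x2 are equally informative, but x2 is stored in "thousands", so its values are 1000x smaller and its coefficient would have to be 1000x larger.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Simulated data: two equally informative predictors.
rng = np.random.default_rng(2)
n = 200
x1 = rng.standard_normal(n)
x2 = rng.standard_normal(n)
y = 1.0 * x1 + 1.0 * x2 + 0.5 * rng.standard_normal(n)

X_raw = np.column_stack([x1, x2 / 1000.0])     # rescale x2 only
lasso = Lasso(alpha=0.1).fit(X_raw, y)
print("Unstandardized:", lasso.coef_)          # x2's needed coefficient is huge, so it is penalized to zero

X_std = StandardScaler().fit_transform(X_raw)  # put both features on the same scale
lasso_std = Lasso(alpha=0.1).fit(X_std, y)
print("Standardized:  ", lasso_std.coef_)      # both features survive
```

The rescaled feature carries exactly the same information, yet without standardization the LASSO drops it entirely; after standardization both predictors are selected with similar coefficients.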
References
- Tibshirani, R. (1996). “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society: Series B, 58(1), 267–288.
- Hoerl, A. & Kennard, R. (1970). “Ridge Regression: Biased Estimation for Nonorthogonal Problems.” Technometrics, 12(1), 55–67.
- Zou, H. & Hastie, T. (2005). “Regularization and Variable Selection via the Elastic Net.” Journal of the Royal Statistical Society: Series B, 67(2), 301–320.
- Belloni, A., Chernozhukov, V., & Hansen, C. (2014). “Inference on Treatment Effects after Selection among High-Dimensional Controls.” Review of Economic Studies, 81(2), 608–650.
- Leeb, H. & Pötscher, B. (2005). “Model Selection and Inference: Facts and Fiction.” Econometric Theory, 21(1), 21–59.