11A Regularization
Learning Objectives
- Understand why penalizing model complexity improves predictions
- Implement Ridge, LASSO, and Elastic Net regression
- Use cross-validation to select the optimal penalty parameter
- Apply post-LASSO methods for valid causal inference after variable selection
The Idea Behind Regularization
In standard OLS regression, we estimate coefficients by minimizing the sum of squared residuals: find the β that makes the predicted values as close to the observed values as possible. This works beautifully when we have a modest number of predictors relative to observations. But when the number of predictors is large — especially when p approaches or exceeds n — OLS overfits badly. It fits the noise in the training data, not the signal, and its predictions on new data can be wildly inaccurate. You saw this problem in the polynomial example in the main module page.
Regularization addresses this by adding a penalty term to the objective function that discourages large coefficients. Instead of simply minimizing residuals, we minimize residuals plus a penalty for model complexity. The penalty forces the model to “keep things simple” by shrinking coefficients toward zero. Larger coefficients are more expensive, so the model must have a good reason (strong signal in the data) to give any predictor a large coefficient.
Think of it like a budget constraint in economics. Without regularization, the model can allocate unlimited “weight” to each predictor — it spends freely. With regularization, the model has a fixed budget of coefficient magnitude to distribute across all predictors, so it must allocate wisely. Predictors with strong signal receive substantial weight; predictors with weak signal or noise get shrunk toward zero. This is exactly the bias-variance tradeoff at work: regularization introduces some bias (coefficients are systematically shrunk) but dramatically reduces variance (predictions are more stable across samples), and the net result is often a substantial improvement in prediction accuracy.
The general form of the regularized objective function is:

β̂ = argmin_β  Σᵢ (yᵢ − xᵢ′β)² + λ · P(β)

where P(β) is the penalty function.
The parameter λ (lambda) controls the strength of the penalty. When λ = 0, there is no penalty and we get standard OLS. As λ increases, the penalty becomes stronger and coefficients are shrunk more aggressively toward zero. The three main regularization methods differ in their choice of penalty function, which leads to fundamentally different behavior.
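To make the effect of λ concrete, here is a small illustrative sketch (simulated data and an arbitrary λ grid, not from the text): refitting the same Ridge model at increasing penalty strengths shrinks the coefficient vector toward zero.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Illustrative sketch: simulated data, arbitrary lambda grid.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
y = X @ np.arange(1, 11, dtype=float) + rng.standard_normal(100)

lambdas = [1e-3, 1.0, 10.0, 100.0, 1000.0]
# Squared L2 norm of the fitted coefficient vector at each penalty strength
norms = [float(np.sum(Ridge(alpha=lam).fit(X, y).coef_ ** 2)) for lam in lambdas]
for lam, nrm in zip(lambdas, norms):
    print(f"lambda = {lam:8.3f} -> sum of squared coefficients = {nrm:8.2f}")
```

The coefficient norm falls monotonically as λ grows: at λ ≈ 0 we recover (nearly) OLS, and at very large λ every coefficient is shrunk close to zero.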
Geometric Intuition: Constraint Regions
[Figure: constraint regions for Ridge and LASSO.
Ridge (L2) — circle: shrinks all coefficients, but none reach exactly zero.
LASSO (L1) — diamond: the OLS contours hit corners, so some coefficients are exactly 0.
The constrained solution is where the OLS contours first touch the constraint region.]
Ridge Regression (L2)
Ridge regression uses the L2 penalty — the sum of squared coefficients. The objective function is:

β̂_ridge = argmin_β  Σᵢ (yᵢ − xᵢ′β)² + λ Σⱼ βⱼ²
The parameter λ controls the penalty strength. When λ = 0, we get OLS. As λ grows, every coefficient is shrunk toward zero — but crucially, no coefficient is set to exactly zero. Ridge regression keeps all predictors in the model; it just makes their coefficients smaller. Geometrically, the constraint region is a circle (or sphere in higher dimensions). The elliptical OLS contours touch the circle smoothly, so the constrained solution lands on the circle’s surface — where all coordinates are nonzero.
Ridge regression is especially useful when you have many correlated predictors. In this setting, OLS coefficients are unstable: small changes in the data can cause large swings in individual coefficients, even though predictions might be similar. Ridge stabilizes the estimates by spreading the weight more evenly across correlated predictors. Think of it as a form of shrinkage that calms down the wild fluctuations that OLS produces when predictors are highly collinear.
One important practical detail: because the penalty depends on the magnitude of coefficients, you must standardize your features before fitting Ridge (and all other regularized models). If one feature is measured in dollars and another in millions of dollars, the latter will have tiny coefficients simply because of its scale, not because it is unimportant. Standardizing ensures that all features are penalized equally.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
# Generate data with 50 features (many irrelevant)
np.random.seed(42)
n = 200
X = np.random.randn(n, 50)
y = 3*X[:, 0] + 2*X[:, 1] - 1.5*X[:, 2] + 0.8*X[:, 3] + np.random.randn(n)*0.5
# Split and standardize
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)
# Fit Ridge with cross-validated alpha selection
alphas = np.logspace(-4, 4, 50)
ridge = RidgeCV(alphas=alphas, cv=5)
ridge.fit(X_train_sc, y_train)
# Evaluate
y_pred = ridge.predict(X_test_sc)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"Best alpha (lambda): {ridge.alpha_:.4f}")
print(f"Test RMSE: {rmse:.4f}")
print(f"Non-zero coefs: {np.sum(ridge.coef_ != 0)} out of {len(ridge.coef_)}")
print(f"(Ridge never sets coefficients to exactly zero)")
* Generate data with 50 features
clear
set seed 42
set obs 200
forvalues i = 1/50 {
gen x`i' = rnormal()
}
gen y = 3*x1 + 2*x2 - 1.5*x3 + 0.8*x4 + rnormal()*0.5
* Fit Ridge regression using lasso2 with alpha(0)
* alpha(0) = Ridge, alpha(1) = LASSO
lasso2 y x1-x50, alpha(0) lambda(0.01 0.1 1 10 100)
* Cross-validated Ridge
cvlasso y x1-x50, alpha(0) lopt
* Predict and evaluate
predict yhat
gen resid_sq = (y - yhat)^2
summarize resid_sq
display "RMSE: " sqrt(r(mean))
library(glmnet)
# Generate data with 50 features
set.seed(42)
n <- 200
X <- matrix(rnorm(n * 50), ncol = 50)
y <- 3*X[,1] + 2*X[,2] - 1.5*X[,3] + 0.8*X[,4] + rnorm(n)*0.5
# Train/test split
train_idx <- sample(1:n, 160)
X_train <- X[train_idx, ]; y_train <- y[train_idx]
X_test <- X[-train_idx, ]; y_test <- y[-train_idx]
# Fit Ridge with cross-validation (alpha = 0 for Ridge)
cv_ridge <- cv.glmnet(X_train, y_train, alpha = 0)
# Best lambda and predictions
best_lambda <- cv_ridge$lambda.min
y_pred <- predict(cv_ridge, X_test, s = best_lambda)
rmse <- sqrt(mean((y_test - y_pred)^2))
cat("Best lambda: ", round(best_lambda, 4), "\n")
cat("Test RMSE: ", round(rmse, 4), "\n")
coefs <- as.numeric(coef(cv_ridge, s = best_lambda))[-1]
cat("Non-zero coefs:", sum(coefs != 0), "out of", length(coefs), "\n")
Best alpha (lambda): 0.2848
Test RMSE: 0.5314
Non-zero coefs: 50 out of 50
(Ridge never sets coefficients to exactly zero)

Cross-validation results (alpha = 0, Ridge):
Optimal lambda: 0.3124
CV MSPE: 0.2856
RMSE: .5347

Best lambda: 0.2713
Test RMSE: 0.5289
Non-zero coefs: 50 out of 50
LASSO (L1)
The LASSO (Least Absolute Shrinkage and Selection Operator), introduced by Tibshirani (1996), uses the L1 penalty — the sum of absolute values of coefficients:

β̂_lasso = argmin_β  Σᵢ (yᵢ − xᵢ′β)² + λ Σⱼ |βⱼ|
The critical difference from Ridge is that LASSO can set coefficients to exactly zero, thereby performing automatic variable selection. This means the LASSO does not just shrink coefficients — it identifies which predictors are important and discards the rest entirely. This makes the LASSO particularly attractive when you believe that only a subset of your many predictors actually matter (a condition called sparsity).
Why does the L1 penalty produce exact zeros while L2 does not? The answer lies in the geometry. The L1 constraint region is shaped like a diamond (in 2D) or a cross-polytope (in higher dimensions). This shape has corners that lie on the coordinate axes — that is, at points where one or more coordinates are exactly zero. The elliptical contours of the OLS objective function tend to first touch the diamond at one of these corners, producing a solution with some coefficients at zero. By contrast, the L2 constraint region (a circle) has no corners, so the contours touch it at a point where all coordinates are generically nonzero.
The LASSO is your go-to method when you have many potential predictors but suspect that only a few are truly relevant. It is widely used in economics for variable selection in high-dimensional settings — for example, selecting which of hundreds of control variables to include in a regression.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# Same data as before
np.random.seed(42)
n = 200
X = np.random.randn(n, 50)
y = 3*X[:, 0] + 2*X[:, 1] - 1.5*X[:, 2] + 0.8*X[:, 3] + np.random.randn(n)*0.5
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)
# Fit LASSO with cross-validation
lasso = LassoCV(cv=5, random_state=42)
lasso.fit(X_train_sc, y_train)
# Evaluate
y_pred = lasso.predict(X_test_sc)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"Best alpha (lambda): {lasso.alpha_:.6f}")
print(f"Test RMSE: {rmse:.4f}")
print(f"Non-zero coefs: {np.sum(lasso.coef_ != 0)} out of {len(lasso.coef_)}")
# Show which features were selected
selected = np.where(lasso.coef_ != 0)[0]
print(f"Selected features: {selected}")
print(f"(True features are 0, 1, 2, 3)")
* LASSO with cross-validation
* alpha(1) = LASSO (default)
cvlasso y x1-x50, lopt
* Show selected variables
lasso2 y x1-x50, lambda(`e(lopt)')
* View selected coefficients
display "Number of selected variables: " e(s)
* Predict and evaluate
predict yhat_lasso
gen resid_sq_lasso = (y - yhat_lasso)^2
summarize resid_sq_lasso
display "RMSE: " sqrt(r(mean))
library(glmnet)
# Same data
set.seed(42)
n <- 200
X <- matrix(rnorm(n * 50), ncol = 50)
y <- 3*X[,1] + 2*X[,2] - 1.5*X[,3] + 0.8*X[,4] + rnorm(n)*0.5
train_idx <- sample(1:n, 160)
X_train <- X[train_idx, ]; y_train <- y[train_idx]
X_test <- X[-train_idx, ]; y_test <- y[-train_idx]
# Fit LASSO with cross-validation (alpha = 1 for LASSO)
cv_lasso <- cv.glmnet(X_train, y_train, alpha = 1)
# Extract results
coef_lasso <- coef(cv_lasso, s = "lambda.min")
nonzero <- sum(coef_lasso[-1] != 0) # exclude intercept
y_pred <- predict(cv_lasso, X_test, s = "lambda.min")
rmse <- sqrt(mean((y_test - y_pred)^2))
cat("Best lambda: ", round(cv_lasso$lambda.min, 6), "\n")
cat("Test RMSE: ", round(rmse, 4), "\n")
cat("Non-zero coefs: ", nonzero, "out of 50\n")
cat("(True features: 1, 2, 3, 4)\n")
Best alpha (lambda): 0.024813
Test RMSE: 0.4987
Non-zero coefs: 5 out of 50
Selected features: [0 1 2 3 27]
(True features are 0, 1, 2, 3)

Cross-validation results (alpha = 1, LASSO):
Optimal lambda: 0.02314
Number of selected variables: 5
RMSE: .5021

Best lambda: 0.021738
Test RMSE: 0.5034
Non-zero coefs: 6 out of 50
(True features: 1, 2, 3, 4)
Notice that the LASSO correctly identified the 4 true predictors (features 0-3) while setting most of the 46 irrelevant features to exactly zero. It may include one or two false positives — this is normal and depends on the particular sample. The key result is that LASSO achieves variable selection automatically, unlike Ridge which keeps all 50 features.
Elastic Net
The Elastic Net, introduced by Zou and Hastie (2005), combines the L1 and L2 penalties into a single objective:

β̂_enet = argmin_β  Σᵢ (yᵢ − xᵢ′β)² + λ [ α Σⱼ |βⱼ| + (1 − α) Σⱼ βⱼ² ]
The parameter α (alpha) controls the mix between L1 and L2: α = 1 gives pure LASSO, α = 0 gives pure Ridge. Values in between provide a compromise. The Elastic Net inherits variable selection from LASSO (through the L1 part) while also benefiting from Ridge’s stability with correlated predictors (through the L2 part).
When should you use Elastic Net instead of pure LASSO? The main situation is when you have groups of correlated predictors. LASSO has a tendency to arbitrarily pick one variable from a group of correlated predictors and zero out the rest — which one it picks can change from sample to sample. The Elastic Net’s L2 component encourages it to keep correlated predictors together: if several variables carry similar information, the Elastic Net will include all of them with similar coefficients rather than picking just one.
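A hedged sketch of this grouping behavior (simulated data and illustrative penalty values, not from the text): x1 and x2 are near-duplicates carrying the same signal, and x3 is irrelevant noise.

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

# Simulated data: x1 and x2 are almost perfectly correlated.
rng = np.random.default_rng(1)
n = 200
x1 = rng.standard_normal(n)
x2 = x1 + 0.01 * rng.standard_normal(n)   # correlation with x1 ~ 0.9999
x3 = rng.standard_normal(n)               # irrelevant noise predictor
X = np.column_stack([x1, x2, x3])
y = 2 * x1 + 2 * x2 + 0.5 * rng.standard_normal(n)

lasso = Lasso(alpha=0.5).fit(X, y)
enet = ElasticNet(alpha=0.5, l1_ratio=0.2).fit(X, y)   # mostly-L2 mix

print("LASSO coefficients:      ", np.round(lasso.coef_, 2))
print("Elastic Net coefficients:", np.round(enet.coef_, 2))
# LASSO tends to concentrate weight on one of the correlated pair, while the
# Elastic Net spreads roughly equal weight across both.
```

Which of the pair LASSO keeps can flip from sample to sample; the Elastic Net's L2 component makes the nearly equal split stable.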
from sklearn.linear_model import ElasticNetCV
# Fit Elastic Net with CV over both alpha and l1_ratio
enet = ElasticNetCV(
l1_ratio=[.1, .5, .7, .9, .95, 1],
cv=5,
random_state=42
)
enet.fit(X_train_sc, y_train)
y_pred_enet = enet.predict(X_test_sc)
rmse_enet = np.sqrt(mean_squared_error(y_test, y_pred_enet))
print(f"Best alpha: {enet.alpha_:.6f}")
print(f"Best l1_ratio: {enet.l1_ratio_:.2f}")
print(f"Test RMSE: {rmse_enet:.4f}")
print(f"Non-zero coefs: {np.sum(enet.coef_ != 0)} out of {len(enet.coef_)}")
* Elastic Net with alpha = 0.5 (equal mix of L1 and L2)
* (lassopack's lasso2 fits the Elastic Net via the alpha() option)
lasso2 y x1-x50, alpha(0.5)
* Cross-validated Elastic Net
cvlasso y x1-x50, alpha(0.5) lopt
* Check results
display "Selected variables: " e(s)
display "Optimal lambda: " e(lopt)
# Elastic Net: alpha = 0.5 (equal mix of L1 and L2)
cv_enet <- cv.glmnet(X_train, y_train, alpha = 0.5)
y_pred_enet <- predict(cv_enet, X_test, s = "lambda.min")
rmse_enet <- sqrt(mean((y_test - y_pred_enet)^2))
coef_enet <- coef(cv_enet, s = "lambda.min")
cat("Best lambda: ", round(cv_enet$lambda.min, 6), "\n")
cat("Test RMSE: ", round(rmse_enet, 4), "\n")
cat("Non-zero coefs:", sum(coef_enet[-1] != 0), "out of 50\n")
Best alpha: 0.024813
Best l1_ratio: 0.95
Test RMSE: 0.5012
Non-zero coefs: 5 out of 50

Cross-validation results (alpha = 0.5, Elastic Net):
Optimal lambda: .04628
Selected variables: 6

Best lambda: 0.043476
Test RMSE: 0.5078
Non-zero coefs: 5 out of 50
Cross-Validated Penalty Selection
The penalty parameter λ is a hyperparameter — it is not estimated from the data by the model itself but must be chosen by the researcher. Choosing λ too small gives us something close to OLS (underfitting risk); choosing it too large shrinks everything to zero (the model predicts the mean for everyone). The standard approach is to use cross-validation to select λ from data.
The procedure works as follows. Define a grid of λ values (typically logarithmically spaced from very small to very large). For each λ, run K-fold cross-validation and compute the average CV error. This produces a curve of CV error as a function of λ. Two values along this curve are commonly reported:
- λ_min: the value of λ that minimizes the cross-validation error. This gives the best-predicting model.
- λ_1se: the largest λ whose CV error lies within one standard error of the minimum. This gives a more parsimonious model that is nearly as accurate as the best one. It is often preferred in practice because it selects fewer variables and is more interpretable, with only a small cost in prediction accuracy.
The choice between λ_min and λ_1se depends on your priorities. If pure prediction accuracy is your goal, use λ_min. If you want a simpler, more interpretable model that is nearly as good, use λ_1se. In economics applications where you will subsequently interpret the selected variables, λ_1se is often the better choice.
import numpy as np
from sklearn.linear_model import LassoCV
# Fit LASSO with detailed CV information
lasso_cv = LassoCV(cv=5, n_alphas=100, random_state=42)
lasso_cv.fit(X_train_sc, y_train)
# The CV curve: mean MSE across folds for each alpha
mse_mean = np.mean(lasso_cv.mse_path_, axis=1)
# Across-fold std (note: glmnet's 1-SE rule uses the standard error, std/sqrt(K))
mse_std = np.std(lasso_cv.mse_path_, axis=1)
# lambda_min (best alpha)
best_idx = np.argmin(mse_mean)
print(f"lambda_min: {lasso_cv.alphas_[best_idx]:.6f}")
print(f"CV MSE at min: {mse_mean[best_idx]:.4f}")
# lambda_1se: largest alpha within 1 SE of minimum
threshold = mse_mean[best_idx] + mse_std[best_idx]
candidates = np.where(mse_mean <= threshold)[0]
idx_1se = candidates[0]  # alphas_ are sorted in descending order, so index 0 is the largest alpha
print(f"lambda_1se: {lasso_cv.alphas_[idx_1se]:.6f}")
print(f"CV MSE at 1se: {mse_mean[idx_1se]:.4f}")
print(f"\nNon-zero at min: {np.sum(lasso_cv.coef_ != 0)}")
* Cross-validated LASSO with lambda path
cvlasso y x1-x50
* Display both lambda_min and lambda_1se
display "Lambda (min CV): " e(lopt)
display "Lambda (1 SE): " e(lse)
* Fit at lambda_1se for a more parsimonious model
lasso2 y x1-x50, lambda(`e(lse)')
display "Variables selected (1se): " e(s)
# The cv.glmnet object contains both lambda.min and lambda.1se
cv_lasso <- cv.glmnet(X_train, y_train, alpha = 1)
cat("lambda.min: ", round(cv_lasso$lambda.min, 6), "\n")
cat("lambda.1se: ", round(cv_lasso$lambda.1se, 6), "\n")
# Compare number of selected variables
coef_min <- coef(cv_lasso, s = "lambda.min")
coef_1se <- coef(cv_lasso, s = "lambda.1se")
cat("Non-zero (min):", sum(coef_min[-1] != 0), "\n")
cat("Non-zero (1se):", sum(coef_1se[-1] != 0), "\n")
# Plot the CV curve
plot(cv_lasso)
title("LASSO Cross-Validation Curve")
lambda_min: 0.024813
CV MSE at min: 0.2714
lambda_1se: 0.068129
CV MSE at 1se: 0.2981
Non-zero at min: 5

Lambda (min CV): .02314
Lambda (1 SE): .06982
Variables selected (1se): 4

lambda.min: 0.021738
lambda.1se: 0.072314
Non-zero (min): 6
Non-zero (1se): 4
Alternative: GridSearchCV for Hyperparameter Tuning
While LassoCV and RidgeCV handle cross-validated penalty selection in one step, you can also use scikit-learn's general-purpose GridSearchCV, which works with any model and any set of hyperparameters. This is particularly useful when tuning multiple hyperparameters simultaneously (e.g., Elastic Net's alpha and l1_ratio), or when you want a unified workflow across different model types.
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Lasso
# Define the parameter grid
param_grid = {'alpha': [0.001, 0.01, 0.1, 1.0, 10.0]}
# GridSearchCV tries every alpha with 5-fold CV
grid_search = GridSearchCV(
Lasso(), param_grid,
cv=5, scoring='neg_mean_squared_error'
)
grid_search.fit(X_train_sc, y_train)
# Best alpha selected by CV
print(f"Best alpha: {grid_search.best_params_['alpha']}")
print(f"Best CV MSE: {-grid_search.best_score_:.4f}")
# Evaluate ONCE on the test set (never touch test set during tuning)
y_pred = grid_search.predict(X_test_sc)
test_rmse = np.sqrt(np.mean((y_pred - y_test)**2))
print(f"Test RMSE: {test_rmse:.4f}")
# R: manual grid search with cv.glmnet
library(glmnet)
# cv.glmnet automatically searches over a lambda grid
cv_lasso <- cv.glmnet(X_train, y_train, alpha = 1, nfolds = 5)
# For multiple hyperparameters, use caret or tidymodels
library(caret)
grid <- expand.grid(alpha = 1, lambda = 10^seq(-3, 1, length = 50))
train_data <- data.frame(y = y_train, X_train)  # assemble a data frame for caret
model <- train(y ~ ., data = train_data,
method = "glmnet",
trControl = trainControl(method = "cv", number = 5),
tuneGrid = grid)
print(model$bestTune)
If you loop over hyperparameter values and pick the one with the best test error, you are effectively overfitting to the test set. The reported test performance will be optimistic. Always use cross-validation on the training set to select hyperparameters, then evaluate on the test set exactly once.
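A toy simulation makes this bias tangible (entirely made up for illustration; every "candidate model" is just an arbitrary coefficient vector and the outcome is pure noise, so no candidate can truly predict). Picking the candidate with the lowest test-set error still makes that error look better than the same model achieves on genuinely fresh data.

```python
import numpy as np

# Toy simulation: selecting the "best" of many candidates on the test set
# produces an optimistically biased estimate of test performance.
rng = np.random.default_rng(0)
n, p, k, reps = 200, 10, 200, 200
wins = 0
for _ in range(reps):
    X_test, X_fresh = rng.standard_normal((2, n, p))
    y_test, y_fresh = rng.standard_normal((2, n))     # pure noise outcomes
    C = 0.1 * rng.standard_normal((k, p))             # k arbitrary "tuned" models
    test_mse = ((X_test @ C.T - y_test[:, None]) ** 2).mean(axis=0)
    b = int(np.argmin(test_mse))                      # winner chosen ON the test set
    fresh_mse = ((X_fresh @ C[b] - y_fresh) ** 2).mean()
    wins += test_mse[b] < fresh_mse                   # did the test set flatter it?
print(f"Selected-on-test error beat fresh-data error in {wins}/{reps} runs")
```

The winner's reported test error is systematically lower than its error on new data — exactly the optimism the warning above describes.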
Post-LASSO and Post-Selection Inference
Warning: Post-Selection Inference is Invalid
Never report LASSO coefficients as causal estimates. LASSO is a prediction tool. If you select variables with LASSO and then run OLS on the selected variables, the resulting confidence intervals and p-values are invalid. For causal inference after variable selection, use post-double-selection or DML methods (see Module 11D).
A common but dangerous practice is the following: run LASSO to select variables, then run OLS on only the selected variables and report the standard OLS confidence intervals and p-values. This procedure produces invalid inference. The reason, established rigorously by Leeb and Pötscher (2005), is that LASSO’s variable selection is data-dependent. Standard inference formulas assume the model was specified a priori — before looking at the data. When the model is chosen from the data itself, the usual standard errors are too small and the confidence intervals are too narrow, leading to spuriously significant results.
The problem is sometimes called the “winner’s curse” of variable selection: variables that happen to look important in the particular sample are more likely to be selected, and their coefficients are biased upward in absolute value. Naively running OLS on the selected variables inherits this bias without correcting for it.
Belloni, Chernozhukov, and Hansen (2014) developed the post-double-selection method as a principled solution. The key insight is that in a causal inference setting with a treatment variable D, an outcome Y, and many potential controls X, you need to select controls that are relevant to both Y and D. The procedure is:
- Run LASSO of Y on X — this selects controls predictive of the outcome.
- Run LASSO of D on X — this selects controls predictive of treatment (potential confounders).
- Take the union of both selected sets.
- Run OLS of Y on D and the union of selected controls. The resulting coefficient on D has valid standard errors.
The double selection step is essential: selecting controls for the outcome alone (step 1) might miss confounders that strongly predict treatment but weakly predict the outcome. Including them in step 2 ensures that the causal estimate of D’s effect is not confounded. This approach is implemented in Stata’s pdslasso command and R’s hdm package. A more general framework — Double/Debiased Machine Learning (DML) — extends these ideas and is covered in detail in Module 11D.
import numpy as np
from sklearn.linear_model import LassoCV
import statsmodels.api as sm
# Simulate causal inference setting
np.random.seed(42)
n = 500
X = np.random.randn(n, 50)
# Treatment depends on X[0] and X[1]
d = 0.5*X[:, 0] + 0.3*X[:, 1] + np.random.randn(n)*0.5
# Outcome depends on treatment, X[0], X[1], and X[2]
# TRUE treatment effect = 1.0
y = 1.0*d + 2*X[:, 0] + 1.5*X[:, 1] + 1*X[:, 2] + np.random.randn(n)*0.5
# Step 1: LASSO of Y on X (select controls for outcome)
lasso_y = LassoCV(cv=5).fit(X, y)
sel_y = set(np.where(lasso_y.coef_ != 0)[0])
# Step 2: LASSO of D on X (select controls for treatment)
lasso_d = LassoCV(cv=5).fit(X, d)
sel_d = set(np.where(lasso_d.coef_ != 0)[0])
# Step 3: Take the union
sel_union = sorted(sel_y | sel_d)
print(f"Selected for Y: {sorted(sel_y)}")
print(f"Selected for D: {sorted(sel_d)}")
print(f"Union: {sel_union}")
# Step 4: OLS of Y on D and union of selected controls
X_selected = X[:, sel_union]
X_ols = sm.add_constant(np.column_stack([d, X_selected]))
ols_result = sm.OLS(y, X_ols).fit()
print(f"\nPost-double-selection estimate of treatment effect:")
print(f" Coefficient: {ols_result.params[1]:.4f}")
print(f" Std error: {ols_result.bse[1]:.4f}")
print(f" 95% CI: [{ols_result.conf_int()[1][0]:.4f}, {ols_result.conf_int()[1][1]:.4f}]")
print(f" (True effect: 1.0)")
* Simulate causal inference setting
clear
set seed 42
set obs 500
forvalues i = 1/50 {
gen x`i' = rnormal()
}
gen d = 0.5*x1 + 0.3*x2 + rnormal()*0.5
gen y = 1.0*d + 2*x1 + 1.5*x2 + 1*x3 + rnormal()*0.5
* Post-double-selection LASSO
* pdslasso automatically performs both selection steps
pdslasso y d (x1-x50)
* The output shows the treatment effect with valid standard errors
display "True treatment effect: 1.0"
library(hdm)
# Simulate causal inference setting
set.seed(42)
n <- 500
X <- matrix(rnorm(n * 50), ncol = 50)
d <- 0.5*X[,1] + 0.3*X[,2] + rnorm(n)*0.5
y <- 1.0*d + 2*X[,1] + 1.5*X[,2] + 1*X[,3] + rnorm(n)*0.5
# Post-double-selection: automatic procedure
result <- rlassoEffect(x = X, y = y, d = d, method = "double selection")
cat("Post-double-selection results:\n")
print(summary(result))
cat("\n(True treatment effect: 1.0)\n")
Selected for Y: [0, 1, 2, 14]
Selected for D: [0, 1]
Union: [0, 1, 2, 14]

Post-double-selection estimate of treatment effect:
  Coefficient: 1.0218
  Std error: 0.0453
  95% CI: [0.9328, 1.1108]
  (True effect: 1.0)
Post-double-selection results:
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-----------+----------------------------------------------------------------
d | 1.01834 .044721 22.77 0.000 .93053 1.10615
-----------+----------------------------------------------------------------
Selected controls: x1 x2 x3
True treatment effect: 1.0

Post-double-selection results:
Estimate: 1.0195
Std. Error: 0.0461
t value: 22.12
Pr(>|t|): < 2e-16
(True treatment effect: 1.0)
Choosing the Right Method
Ridge (L2)
- Many predictors, all potentially relevant
- Highly correlated features
- Does not perform variable selection
- Shrinks coefficients but keeps all
- Generally best when p < n and features are correlated
LASSO (L1)
- Many predictors, only a few matter (sparsity)
- Automatic variable selection
- Sets irrelevant coefficients to exactly zero
- May be unstable with correlated features
- Best when you believe the true model is sparse
Elastic Net
- Many correlated predictors and you want selection
- Compromise between Ridge and LASSO
- Groups correlated variables together
- More stable than LASSO with correlations
- Requires tuning two parameters (λ and α)
Limitations
All three regularization methods share several important limitations that you should keep in mind:
- Linearity assumption: Ridge, LASSO, and Elastic Net all assume a linear relationship between features and the outcome. They cannot capture nonlinearities or interactions unless you manually create polynomial or interaction features. For genuinely nonlinear relationships, consider tree-based methods (Module 11B) or neural networks (Module 11C).
- Feature scaling matters: Because the penalty depends on coefficient magnitudes, you must always standardize your features before fitting regularized models. If you forget, the penalty is applied unevenly: a feature with large numeric values needs only a tiny coefficient and largely escapes the penalty, while a feature with small numeric values needs a large coefficient and is shrunk hardest — regardless of either feature's actual importance.
- Cannot handle categorical variables natively: Categorical variables must be one-hot encoded. When a categorical variable has many levels, LASSO may select some dummies and not others from the same category, which can be hard to interpret. Group LASSO (not covered here) addresses this.
- LASSO is not consistent for variable selection in all settings: The LASSO can fail to select the correct variables when the “irrepresentable condition” is violated — roughly, when irrelevant variables are highly correlated with relevant ones.
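The scaling limitation is easy to demonstrate with a hedged sketch (simulated data and an illustrative penalty value, not from the text): x1 and x2 are equally informative, but x2 is stored in "thousands", so its values are 1000x smaller and its coefficient would have to be 1000x larger.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Simulated data: two equally informative predictors.
rng = np.random.default_rng(2)
n = 200
x1 = rng.standard_normal(n)
x2 = rng.standard_normal(n)
y = 1.0 * x1 + 1.0 * x2 + 0.5 * rng.standard_normal(n)

X_raw = np.column_stack([x1, x2 / 1000.0])     # rescale x2 only
lasso = Lasso(alpha=0.1).fit(X_raw, y)
print("Unstandardized:", lasso.coef_)          # x2's needed coefficient is huge, so it is penalized to zero

X_std = StandardScaler().fit_transform(X_raw)  # put both features on the same scale
lasso_std = Lasso(alpha=0.1).fit(X_std, y)
print("Standardized:  ", lasso_std.coef_)      # both features survive
```

The rescaled feature carries exactly the same information, yet without standardization the LASSO drops it entirely; after standardization both predictors are selected with similar coefficients.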
References
- Tibshirani, R. (1996). “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society: Series B, 58(1), 267–288.
- Hoerl, A. & Kennard, R. (1970). “Ridge Regression: Biased Estimation for Nonorthogonal Problems.” Technometrics, 12(1), 55–67.
- Zou, H. & Hastie, T. (2005). “Regularization and Variable Selection via the Elastic Net.” Journal of the Royal Statistical Society: Series B, 67(2), 301–320.
- Belloni, A., Chernozhukov, V., & Hansen, C. (2014). “Inference on Treatment Effects after Selection among High-Dimensional Controls.” Review of Economic Studies, 81(2), 608–650.
- Leeb, H. & Pötscher, B. (2005). “Model Selection and Inference: Facts and Fiction.” Econometric Theory, 21(1), 21–59.