11 Machine Learning
Learning Objectives
- Distinguish prediction tasks from causal inference tasks
- Understand the bias-variance tradeoff and why it matters
- Know when to use regularization, trees, or neural networks
- Apply ML methods to economics and social science problems
- Use ML as a tool for causal inference (DML, causal forests)
- Evaluate and compare models properly
Prerequisites
This module assumes familiarity with linear regression, hypothesis testing, and basic causal inference concepts from Modules 5–7 (Data Analysis, Causal Inference, Estimation Methods). You should be comfortable running regressions and interpreting coefficients in at least one of the three course languages.
11.1 Why Machine Learning for Economists?
Economists have traditionally focused on causal inference: understanding why things happen and estimating the effect of specific interventions. This is the domain of the randomized experiment, the instrumental variable, and the difference-in-differences design. But a growing body of research recognizes that many important policy and research questions are fundamentally about prediction, not causation. When a government agency needs to decide which households should receive benefits, the question is not “what is the causal effect of income on poverty?” but rather “which households are most likely to be poor?” When a judge must decide whether to grant bail, the relevant question is “how likely is this person to reoffend?” — a prediction task.
Mullainathan and Spiess (2017) formalized this insight with the concept of prediction policy problems. They argue that whenever the policy-relevant quantity is a prediction — rather than a causal parameter — machine learning methods are the natural tool. Traditional econometric models like OLS are designed to produce unbiased coefficient estimates, not to minimize prediction error. ML methods, by contrast, are explicitly optimized for predictive accuracy. They can capture complex nonlinear relationships, handle hundreds or thousands of predictors, and automatically determine which variables matter most — all without requiring the researcher to specify the functional form in advance.
The range of prediction tasks relevant to economists is broader than you might initially think. Text classification (is this central bank statement hawkish or dovish?), image recognition (can satellite imagery predict economic development?), nowcasting economic indicators (what is GDP this quarter, before official data arrives?), and targeting interventions (which students are at risk of dropping out?) are all examples where ML outperforms traditional approaches. Even in the medical sciences, ML models for diagnosis and risk prediction are proving transformative. In each of these cases, the researcher does not need to know why certain features predict the outcome — only that they do.
Crucially, machine learning complements rather than replaces econometric methods. It is not a substitute for a well-identified causal design when your question is about causation. Instead, ML provides the right toolkit for prediction problems and, increasingly, serves as an input to causal estimation itself. In Module 11D, you will see how methods like Double/Debiased Machine Learning (DML) and causal forests use ML predictions as building blocks for valid causal inference. Understanding ML is therefore essential for the modern applied economist — not as a replacement for your existing skills, but as a powerful addition to your methodological toolkit.
11.2 Prediction vs. Causation
The most important conceptual distinction in this module is between prediction and causation. Prediction asks: “Given what I observe about this individual, what is my best guess for their outcome Y?” Causation asks: “If I change X, what happens to Y?” These are fundamentally different questions, and the tools optimized for one are generally not optimal for the other.
Consider a concrete example. Suppose you have data on student characteristics and their exam scores. A prediction question would be: “Given a new student’s GPA, attendance record, and reported study hours, what score will they get on the final exam?” A causal question would be: “If we increase study hours by one hour per week, how much does the exam score improve?” For the prediction question, it does not matter whether GPA causes higher scores or is merely correlated with them — all that matters is that GPA helps us predict the outcome. For the causal question, we need to worry about confounders (perhaps more motivated students both study more and score higher), reverse causality, and identification strategies.
| | Prediction | Causation |
|---|---|---|
| Goal | Minimize forecast error | Estimate a causal effect |
| Method | Any model that predicts well | Requires an identification strategy |
| Evaluation | Out-of-sample accuracy (RMSE, AUC) | Unbiasedness, consistency |
| Coefficients | Not individually interpretable | Each has structural meaning |
| Confounders | Helpful (more predictive power) | Must be controlled for or eliminated |
In a prediction framework, we do not care about the interpretability of individual coefficients. A model with 500 features and complex interactions can be perfectly valid if it predicts well on new data. What matters is out-of-sample accuracy: how well does the model perform on data it has never seen? This is a fundamentally different criterion than the unbiasedness or consistency that we prize in causal inference.
Athey and Imbens (2019) emphasize that ML methods are explicitly optimized for prediction, not for causal estimation. Using them naively for causal questions — for example, interpreting a random forest’s variable importance as a causal effect — can be deeply misleading. The variable importance might reflect correlation, reverse causation, or confounding rather than a genuine causal relationship. Throughout this module, we will be careful to distinguish when we are using ML for prediction versus when we are using ML-augmented methods that maintain causal validity.
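To see the danger concretely, here is a small simulated illustration (our own construction, not from the paper): a feature with no causal effect on the outcome earns high random-forest importance purely because it is correlated with an unobserved confounder.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical simulation: X has NO causal effect on Y, but both are
# driven by an unobserved confounder C (think: motivation).
rng = np.random.default_rng(0)
n = 2000
C = rng.normal(size=n)                   # unobserved confounder
X = C + rng.normal(scale=0.5, size=n)    # correlated with C; no effect on Y
Z = rng.normal(size=n)                   # pure noise feature
Y = 2 * C + rng.normal(scale=0.5, size=n)

rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(np.column_stack([X, Z]), Y)

# X earns high importance because it predicts Y (through C),
# even though intervening on X would leave Y unchanged.
print("Importance of X (non-causal):", round(rf.feature_importances_[0], 3))
print("Importance of Z (noise):     ", round(rf.feature_importances_[1], 3))
```

The importance score is doing exactly its job, ranking predictive value; the mistake would be reading it as evidence that manipulating X moves Y.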
A Basic Prediction Pipeline
Let us set up the fundamental workflow that underlies every ML task: split the data into training and test sets, fit a model on the training data, and evaluate its performance on the test data. This train/test paradigm is the foundation of everything that follows.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Generate simulated data
np.random.seed(42)
n = 500
X = np.random.randn(n, 10) # 10 features
# True relationship: only first 3 features matter
y = 2 * X[:, 0] + 1.5 * X[:, 1] - 1 * X[:, 2] + np.random.randn(n) * 0.5
# Step 1: Split into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Step 2: Fit model on training data only
model = LinearRegression()
model.fit(X_train, y_train)
# Step 3: Predict on test data
y_pred = model.predict(X_test)
# Step 4: Evaluate out-of-sample performance
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"Training observations: {len(X_train)}")
print(f"Test observations: {len(X_test)}")
print(f"Test RMSE: {rmse:.4f}")
print(f"R-squared (test): {model.score(X_test, y_test):.4f}")
* Generate simulated data
clear
set seed 42
set obs 500
* Create 10 features
forvalues i = 1/10 {
gen x`i' = rnormal()
}
* True relationship: only first 3 features matter
gen y = 2*x1 + 1.5*x2 - 1*x3 + rnormal()*0.5
* Step 1: Create a train/test split indicator
gen split = runiform()
gen test = (split > 0.8)
* Step 2: Fit model on training data only
regress y x1-x10 if test == 0
* Step 3: Predict on all data (including test)
predict yhat
* Step 4: Evaluate on test set only
gen resid_sq = (y - yhat)^2 if test == 1
summarize resid_sq if test == 1
display "Test RMSE: " sqrt(r(mean))
count if test == 0
display "Training obs: " r(N)
count if test == 1
display "Test obs: " r(N)
library(rsample)
# Generate simulated data
set.seed(42)
n <- 500
X <- matrix(rnorm(n * 10), ncol = 10)
colnames(X) <- paste0("x", 1:10)
y <- 2 * X[, 1] + 1.5 * X[, 2] - 1 * X[, 3] + rnorm(n) * 0.5
df <- data.frame(y = y, X)
# Step 1: Split into training (80%) and test (20%)
split <- initial_split(df, prop = 0.8)
train_data <- training(split)
test_data <- testing(split)
# Step 2: Fit model on training data
model <- lm(y ~ ., data = train_data)
# Step 3: Predict on test data
y_pred <- predict(model, newdata = test_data)
# Step 4: Evaluate out-of-sample performance
rmse_val <- sqrt(mean((test_data$y - y_pred)^2))
cat("Training observations:", nrow(train_data), "\n")
cat("Test observations: ", nrow(test_data), "\n")
cat("Test RMSE: ", round(rmse_val, 4), "\n")
r2 <- 1 - sum((test_data$y - y_pred)^2) / sum((test_data$y - mean(test_data$y))^2)
cat("R-squared (test): ", round(r2, 4), "\n")
Training observations: 400
Test observations: 100
Test RMSE: 0.5182
R-squared (test): 0.9571
Source | SS df MS
-------------+----------------------------------
Model | 2348.12741 10 234.812741
Residual | 96.5488209 389 .248197483
-------------+----------------------------------
Total | 2444.67623 399 6.12700810
Test RMSE: .50943
Training obs: 401
Test obs: 99

Training observations: 400
Test observations: 100
Test RMSE: 0.5237
R-squared (test): 0.9548
11.3 The Bias-Variance Tradeoff
The bias-variance tradeoff is arguably the single most important concept in machine learning. It explains why simple models sometimes outperform complex ones, why fitting your training data perfectly is usually a mistake, and how to think about the fundamental tension between underfitting and overfitting. Every ML method you will learn in this module — from LASSO to random forests to neural networks — can be understood as a particular way of navigating this tradeoff.
The key insight is that the expected prediction error for any model can be decomposed into three components:

Expected Prediction Error = Bias² + Variance + σ²
Bias measures how far off our model’s average prediction is from the truth. A model with high bias makes systematically wrong predictions because it is too simple to capture the true relationship. For instance, fitting a straight line to data that follows a curve produces high bias — the model simply cannot represent the underlying pattern, no matter how much data we give it. This is underfitting.
Variance measures how much our model’s predictions fluctuate across different training samples. A model with high variance is highly sensitive to the particular data points it was trained on. If you were to re-train the model on a slightly different random sample, the predictions would change dramatically. A 20th-degree polynomial that wiggles through every training point has high variance — it has essentially memorized the noise in that particular sample. This is overfitting.
The third term, σ², is the irreducible error — the inherent randomness in the data that no model can eliminate. Even with a perfect model and infinite data, you cannot predict an outcome perfectly if there is genuine noise in the system. This term sets a floor on how well any model can perform.
The fundamental tradeoff is this: as model complexity increases, bias decreases (the model can capture more complex patterns) but variance increases (the model becomes more sensitive to the training data). The optimal model is the one that minimizes the total error — the sweet spot where the sum of bias-squared and variance is at its minimum. Every ML method has one or more tuning parameters (hyperparameters) that control this tradeoff: the penalty parameter λ in LASSO, the maximum depth in decision trees, the number of hidden layers in a neural network.
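A quick Monte Carlo sketch makes the decomposition tangible. Under an assumed setup (true function sin(x) on [0, 2π], noise standard deviation 0.3, prediction evaluated at the single point x0 = 2), we re-fit a degree-1 and a degree-10 polynomial on many fresh training samples and measure the squared bias and variance of the prediction at x0:

```python
import numpy as np

# Monte Carlo sketch of the bias-variance decomposition (assumed setup:
# true function sin(x), noise sd 0.3, prediction evaluated at x0 = 2.0).
rng = np.random.default_rng(0)
x0, sigma = 2.0, 0.3
n_sims, n_train = 500, 30

preds = {1: [], 10: []}                      # degree -> predictions at x0
for _ in range(n_sims):
    x = rng.uniform(0, 2 * np.pi, n_train)
    y = np.sin(x) + rng.normal(scale=sigma, size=n_train)
    for d in preds:
        coefs = np.polyfit(x, y, deg=d)      # refit on a fresh sample
        preds[d].append(np.polyval(coefs, x0))

for d, p in preds.items():
    p = np.array(p)
    bias_sq = (p.mean() - np.sin(x0)) ** 2   # squared bias at x0
    variance = p.var()                       # spread across training samples
    print(f"degree {d:2d}: bias^2 = {bias_sq:.4f}, variance = {variance:.4f}")
```

The degree-1 fit shows large squared bias but small variance; the degree-10 fit shows the reverse, which is the tradeoff in miniature.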
The Bias-Variance Tradeoff
Demonstrating Overfitting with Polynomial Fits
Let us make this concrete. We will generate data from a noisy sine curve, then fit polynomials of increasing degree (1, 5, and 15). You will see that the degree-1 polynomial underfits (too simple), the degree-15 polynomial overfits (wiggles through every point), and the degree-5 polynomial hits a reasonable middle ground.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
np.random.seed(42)
# Generate noisy sine data
n_train, n_test = 30, 200
X_train = np.sort(np.random.uniform(0, 2 * np.pi, n_train)).reshape(-1, 1)
X_test = np.linspace(0, 2 * np.pi, n_test).reshape(-1, 1)
y_train = np.sin(X_train).ravel() + np.random.randn(n_train) * 0.3
y_test = np.sin(X_test).ravel() + np.random.randn(n_test) * 0.3
# Fit polynomials of degree 1, 5, and 15
print(f"{'Degree':<10} {'Train MSE':<15} {'Test MSE':<15}")
print("-" * 40)
for degree in [1, 5, 15]:
poly = PolynomialFeatures(degree)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)
model = LinearRegression()
model.fit(X_train_poly, y_train)
train_mse = mean_squared_error(y_train, model.predict(X_train_poly))
test_mse = mean_squared_error(y_test, model.predict(X_test_poly))
print(f"{degree:<10} {train_mse:<15.4f} {test_mse:<15.4f}")
* Generate noisy sine data
clear
set seed 42
set obs 230
* First 30 obs = training, rest = test
gen id = _n
gen test = (id > 30)
gen x = runiform() * 2 * _pi if test == 0
replace x = (id - 30) / 200 * 2 * _pi if test == 1
gen y = sin(x) + rnormal() * 0.3
* Generate polynomial terms
forvalues p = 2/15 {
gen x`p' = x^`p'
}
* Fit polynomial regressions of increasing degree
display "Degree Train MSE Test MSE"
display "----------------------------------------"
* Degree 1
quietly regress y x if test == 0
predict yhat1
gen se1_train = (y - yhat1)^2 if test == 0
gen se1_test = (y - yhat1)^2 if test == 1
quietly summarize se1_train
local tr1 = r(mean)
quietly summarize se1_test
display "1 " %9.4f `tr1' " " %9.4f r(mean)
* Degree 5
quietly regress y x x2-x5 if test == 0
predict yhat5
gen se5_train = (y - yhat5)^2 if test == 0
gen se5_test = (y - yhat5)^2 if test == 1
quietly summarize se5_train
local tr5 = r(mean)
quietly summarize se5_test
display "5 " %9.4f `tr5' " " %9.4f r(mean)
* Degree 15
quietly regress y x x2-x15 if test == 0
predict yhat15
gen se15_train = (y - yhat15)^2 if test == 0
gen se15_test = (y - yhat15)^2 if test == 1
quietly summarize se15_train
local tr15 = r(mean)
quietly summarize se15_test
display "15 " %9.4f `tr15' " " %9.4f r(mean)
set.seed(42)
# Generate noisy sine data
n_train <- 30; n_test <- 200
x_train <- sort(runif(n_train, 0, 2 * pi))
x_test <- seq(0, 2 * pi, length.out = n_test)
y_train <- sin(x_train) + rnorm(n_train, sd = 0.3)
y_test <- sin(x_test) + rnorm(n_test, sd = 0.3)
# Fit polynomials of degree 1, 5, and 15
degrees <- c(1, 5, 15)
results <- data.frame(Degree = integer(), Train_MSE = numeric(), Test_MSE = numeric())
for (d in degrees) {
# Fit polynomial regression
model <- lm(y_train ~ poly(x_train, d, raw = TRUE))
# Predict on train and test
pred_train <- predict(model)
pred_test <- predict(model, newdata = data.frame(x_train = x_test))
# Compute MSE
train_mse <- mean((y_train - pred_train)^2)
test_mse <- mean((y_test - pred_test)^2)
results <- rbind(results, data.frame(Degree = d, Train_MSE = train_mse, Test_MSE = test_mse))
}
print(results, row.names = FALSE)
Degree     Train MSE       Test MSE
----------------------------------------
1          0.2194          0.2487
5          0.0621          0.0845
15         0.0298          0.5731
Degree Train MSE Test MSE
----------------------------------------
1      0.2237    0.2512
5      0.0654    0.0891
15     0.0312    0.5843
Degree Train_MSE Test_MSE
     1   0.21506  0.24632
     5   0.06348  0.08714
    15   0.03012  0.56928

Notice the pattern: training MSE always decreases as the polynomial degree increases (the model fits the training data better and better). But test MSE first decreases (from degree 1 to degree 5) and then increases sharply (from degree 5 to degree 15). The degree-15 polynomial has essentially memorized the 30 training points, including their noise, and makes wild predictions on new data. This is overfitting in action, and it is precisely the bias-variance tradeoff at work.
11.4 Supervised vs. Unsupervised Learning
Machine learning methods are broadly divided into two families: supervised learning and unsupervised learning. The distinction is straightforward: in supervised learning, we have a labeled outcome variable Y that we want to predict from features X. In unsupervised learning, there is no outcome variable — we are simply looking for patterns and structure in the data. The name “supervised” comes from the idea that the labels “supervise” the learning process by telling the algorithm what the right answer is.
Within supervised learning, there are two main subtypes. Regression problems have a continuous outcome (predicting house prices, GDP growth, or exam scores). Classification problems have a categorical outcome (classifying emails as spam or not, predicting whether a loan will default, or identifying the sentiment of a text). The same algorithms (like decision trees or neural networks) can often handle both types, though the loss function and evaluation metrics differ. For regression, we typically use RMSE or MAE; for classification, accuracy, precision, recall, or AUC.
When one class dominates the data (e.g., 95% of loans do not default), a model that always predicts the majority class achieves 95% accuracy while being completely useless. For imbalanced classification, use precision (of those predicted positive, how many truly are), recall (of actual positives, how many were caught), or AUC (how well the model discriminates across all thresholds). See Module 11E: Model Evaluation for details.
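A minimal simulated illustration of this pitfall (the 5% default rate mirrors the running example; the data are made up):

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical loan data: roughly 5% of borrowers default (y = 1)
rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.05).astype(int)

# A "model" that always predicts the majority class ("no default")
y_naive = np.zeros_like(y_true)

print("Accuracy:", accuracy_score(y_true, y_naive))  # high, near 0.95
print("Recall:  ", recall_score(y_true, y_naive))    # 0.0: catches no defaults
```

The naive classifier scores roughly 95% accuracy yet has zero recall: it identifies none of the defaults, which is exactly the failure mode precision, recall, and AUC are designed to expose.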
Unsupervised learning is used when we want to discover structure without a specific prediction target. Clustering methods (like k-means) group similar observations together — for example, segmenting countries by economic characteristics or grouping consumers by purchasing behavior. Dimensionality reduction methods (like PCA) compress many variables into a smaller number of components that capture the most important variation. For economists, most ML applications in research are supervised — predicting outcomes, classifying text, or using ML predictions as inputs to causal estimation. However, unsupervised methods appear in text analysis (topic models), factor models, and exploratory data analysis.
| | Supervised | Unsupervised |
|---|---|---|
| Goal | Predict Y from X | Find structure in X |
| Data | Labeled (X, Y) | Unlabeled (X only) |
| Examples | LASSO, Random Forest, Neural Nets | K-means, PCA, Topic Models |
| Evaluation | Compare predictions to known labels | Internal metrics (silhouette, variance explained) |
| Economics use | Prediction policy, causal ML | Text clustering, factor models |
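As a minimal sketch of the two unsupervised families in the table, here is k-means and PCA applied to simulated data (the "200 countries with 5 economic indicators" setting is an assumption for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Simulated data: 200 "countries", 5 economic indicators,
# two latent groups differing in their indicator levels
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=0, size=(100, 5)),
               rng.normal(loc=3, size=(100, 5))])

# K-means: group similar observations; no outcome variable involved
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("Cluster sizes:", np.bincount(km.labels_))

# PCA: compress the 5 indicators into 2 components
pca = PCA(n_components=2).fit(X)
print("Share of variance explained:", pca.explained_variance_ratio_.round(3))
```

Note that neither method ever sees a label: k-means recovers the two latent groups, and the first principal component captures the direction along which the groups differ.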
11.5 Cross-Validation & Model Selection
We have established that out-of-sample performance is what matters for prediction. But a single train/test split has two problems. First, it wastes data — the observations in the test set are never used for training. Second, the results depend on the particular random split: you might get a favorable split where the test set happens to be easy, or an unfavorable one where it is unusually hard. K-fold cross-validation solves both problems by systematically rotating which data is used for training and testing.
The procedure works as follows. Divide your data into K equally sized “folds” (groups). For each fold k = 1, ..., K: train the model on all data except fold k, then evaluate it on fold k. After all K iterations, every observation has been used exactly once as a test point. The average of the K test errors gives you a robust estimate of out-of-sample performance. Typical choices are K = 5 or K = 10. The extreme case K = N (called leave-one-out cross-validation, or LOOCV) trains the model N times, each time leaving out a single observation. LOOCV has the lowest bias but can be computationally expensive and sometimes has higher variance than K = 10.
5-Fold Cross-Validation
Each observation is used as a test point exactly once. The final CV score is the average of all 5 test errors.
Cross-validation serves two critical purposes. First, it gives you a reliable estimate of how your model will perform on truly new data. Second, it allows you to choose hyperparameters — settings that control the model’s complexity. For instance, in LASSO regression (Module 11A), the penalty parameter λ controls how much regularization to apply. You can fit the LASSO for many different values of λ, compute the CV error for each, and choose the λ that minimizes the CV error. This data-driven approach to model selection is far more principled than eyeballing the results or relying on arbitrary rules of thumb.
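As a preview of Module 11A, scikit-learn's LassoCV automates exactly this loop: it fits the LASSO over a grid of penalty values (called alpha in scikit-learn rather than λ) and picks the one with the lowest 5-fold CV error. A sketch on the same simulated design used earlier in this module:

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Same simulated design as before: 10 features, only the first 3 matter
np.random.seed(42)
n = 500
X = np.random.randn(n, 10)
y = 2 * X[:, 0] + 1.5 * X[:, 1] - 1 * X[:, 2] + np.random.randn(n) * 0.5

# LassoCV fits the model along a grid of penalties and selects,
# by 5-fold cross-validation, the penalty with the lowest CV error
lasso = LassoCV(cv=5, random_state=42).fit(X, y)
print(f"Chosen penalty (alpha): {lasso.alpha_:.4f}")
print("Nonzero coefficients:", int(np.sum(lasso.coef_ != 0)))
```

With a strong signal on the first three features, the CV-chosen penalty keeps those coefficients close to their true values while shrinking the irrelevant ones toward zero.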
An important practical point: always do your cross-validation before looking at the test set. The test set should be touched only once, at the very end, to report your model’s final performance. If you repeatedly evaluate on the test set and adjust your model, you are effectively training on the test set, and your reported performance will be overly optimistic. Some practitioners advocate for a three-way split: training set (for fitting), validation set (for hyperparameter tuning), and test set (for final evaluation). Cross-validation on the training set replaces the need for a separate validation set.
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression
# Use the same simulated data
np.random.seed(42)
n = 500
X = np.random.randn(n, 10)
y = 2 * X[:, 0] + 1.5 * X[:, 1] - 1 * X[:, 2] + np.random.randn(n) * 0.5
# Set up 5-fold cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)
model = LinearRegression()
# cross_val_score returns negative MSE (sklearn convention: higher = better)
scores = cross_val_score(model, X, y, cv=kf, scoring='neg_mean_squared_error')
# Convert to positive RMSE
rmse_scores = np.sqrt(-scores)
print("RMSE for each fold:")
for i, score in enumerate(rmse_scores, 1):
print(f" Fold {i}: {score:.4f}")
print(f"\nMean RMSE: {rmse_scores.mean():.4f}")
print(f"Std RMSE: {rmse_scores.std():.4f}")
* Manual 5-fold cross-validation in Stata
clear
set seed 42
set obs 500
* Generate data
forvalues i = 1/10 {
gen x`i' = rnormal()
}
gen y = 2*x1 + 1.5*x2 - 1*x3 + rnormal()*0.5
* Assign observations to 5 folds randomly
gen u = runiform()
sort u
gen fold = mod(_n - 1, 5) + 1
* Perform 5-fold CV
gen cv_resid_sq = .
forvalues k = 1/5 {
quietly regress y x1-x10 if fold != `k'
predict yhat_temp if fold == `k'
replace cv_resid_sq = (y - yhat_temp)^2 if fold == `k'
drop yhat_temp
}
* Report RMSE per fold and overall
display "RMSE for each fold:"
forvalues k = 1/5 {
quietly summarize cv_resid_sq if fold == `k'
display " Fold `k': " %7.4f sqrt(r(mean))
}
quietly summarize cv_resid_sq
display _newline "Mean RMSE: " %7.4f sqrt(r(mean))
library(rsample)
# Generate data
set.seed(42)
n <- 500
X <- matrix(rnorm(n * 10), ncol = 10)
colnames(X) <- paste0("x", 1:10)
y <- 2 * X[, 1] + 1.5 * X[, 2] - 1 * X[, 3] + rnorm(n) * 0.5
df <- data.frame(y = y, X)
# Create 5-fold CV splits
folds <- vfold_cv(df, v = 5)
# Compute RMSE for each fold
rmse_vals <- sapply(1:5, function(i) {
train_data <- analysis(folds$splits[[i]])
test_data <- assessment(folds$splits[[i]])
model <- lm(y ~ ., data = train_data)
preds <- predict(model, newdata = test_data)
sqrt(mean((test_data$y - preds)^2))
})
cat("RMSE for each fold:\n")
for (i in 1:5) {
cat(sprintf(" Fold %d: %.4f\n", i, rmse_vals[i]))
}
cat(sprintf("\nMean RMSE: %.4f\n", mean(rmse_vals)))
cat(sprintf("Std RMSE: %.4f\n", sd(rmse_vals)))
RMSE for each fold:
  Fold 1: 0.4973
  Fold 2: 0.5241
  Fold 3: 0.5108
  Fold 4: 0.4856
  Fold 5: 0.5322

Mean RMSE: 0.5100
Std RMSE: 0.0173

RMSE for each fold:
  Fold 1: 0.5012
  Fold 2: 0.5198
  Fold 3: 0.5067
  Fold 4: 0.4921
  Fold 5: 0.5284

Mean RMSE: 0.5096

RMSE for each fold:
  Fold 1: 0.5031
  Fold 2: 0.5187
  Fold 3: 0.5095
  Fold 4: 0.4889
  Fold 5: 0.5299

Mean RMSE: 0.5100
Std RMSE: 0.0139
11.6 Module Roadmap
This module is organized into five subpages, each covering a major family of ML methods. Work through them in order, as later sections build on concepts from earlier ones.
- 11A: Regularization. Ridge, LASSO, Elastic Net, Post-LASSO for causal inference. (~2.5 hours)
- 11B: Tree-Based Methods. Decision Trees, Random Forests, Gradient Boosting (XGBoost). (~2.5 hours)
- 11C: Neural Networks. Perceptrons, backpropagation, deep learning fundamentals. (~2.5 hours)
- 11D: Causal ML. Double/Debiased ML, Causal Forests, heterogeneous treatment effects. (~2.5 hours)
- 11E: Model Evaluation. Metrics, confusion matrices, ROC curves, calibration, comparing models. (~2 hours)
Comparing the Major Method Families
Regularization (11A)
- Linear models with penalty terms
- Fast, interpretable, well-understood theory
- Best when relationship is approximately linear
- LASSO provides automatic variable selection
Trees (11B)
- Nonparametric, capture interactions naturally
- Random forests and boosting are very powerful
- No need to specify functional form
- Often best “out-of-the-box” performance
Neural Networks (11C)
- Universal function approximators
- Excel with unstructured data (text, images)
- Require large datasets and careful tuning
- Less common in applied economics (so far)
References
- Mullainathan, S. & Spiess, J. (2017). “Machine Learning: An Applied Econometric Approach.” Journal of Economic Perspectives, 31(2), 87–106.
- Athey, S. & Imbens, G. (2019). “Machine Learning Methods That Economists Should Know About.” Annual Review of Economics, 11, 685–725.
- James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning. Springer.
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning. 2nd ed. Springer.