10 Machine Learning
Learning Objectives
- Distinguish prediction from causal inference
- Understand the bias-variance tradeoff
- Implement supervised learning algorithms
- Evaluate models properly with cross-validation
- Apply ML to economics and social science problems
This module assumes familiarity with causal inference concepts (Module 5) and estimation methods (Module 6). Understanding the difference between prediction and causation is central to this material.
Table of Contents
- 10.1 Prediction vs. Causation
- 10.2 The Bias-Variance Tradeoff
- 10.3 Supervised Learning
- 10.4 Model Evaluation
- 10.5 Regularization
- 10.6 Tree-Based Methods
- 10.7 Neural Networks
- 10.8 ML for Causal Inference
- 10.9 Exercises
10.1 Prediction vs. Causation
Machine learning optimizes for prediction: given inputs X, predict output Y as accurately as possible. This is fundamentally different from causal inference, which asks: what happens to Y if we change X?
| Aspect | Prediction (ML) | Causal Inference |
|---|---|---|
| Goal | Minimize prediction error | Estimate treatment effect |
| Question | "What is Y, given X?" | "What if we changed X?" |
| Coefficients | Unimportant if prediction works | The whole point |
| Confounders | Include if predictive | Must address carefully |
| Overfitting | Primary concern | Less central |
A model that perfectly predicts hospital deaths might find that "being on a ventilator" is strongly associated with dying. Should we remove ventilators? Of course not: ventilated patients are sicker to begin with, so the association reflects severity of illness rather than a causal effect. ML finds correlations; interpreting them causally requires additional assumptions.
10.2 The Bias-Variance Tradeoff
Prediction error can be decomposed into three components:
- Bias: Error from wrong assumptions (underfitting)
- Variance: Error from sensitivity to training data (overfitting)
- Irreducible: Noise inherent in the data
| | Simple Model | Complex Model |
|---|---|---|
| Bias | High (misses patterns) | Low (captures patterns) |
| Variance | Low (stable) | High (sensitive to data) |
| Risk | Underfitting | Overfitting |
10.3 Supervised Learning
Supervised learning learns from labeled examples: given (X, Y) pairs, learn f such that f(X) is approximately Y.
Linear and Logistic Regression
# Python: Supervised learning with scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error, accuracy_score
import pandas as pd
# Load and prepare data
df = pd.read_csv("data.csv")
X = df[['age', 'education', 'experience']]
y = df['income']
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Fit linear regression
model = LinearRegression()
model.fit(X_train, y_train)
# Predict and evaluate
y_pred = model.predict(X_test)
rmse = mean_squared_error(y_test, y_pred) ** 0.5  # take the square root for RMSE
print(f"RMSE: {rmse:.2f}")
# For classification, build a binary target (e.g., above-median income) and refit
y_binary = (df['income'] > df['income'].median()).astype(int)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X, y_binary, test_size=0.2, random_state=42
)
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_b, y_train_b)
accuracy = accuracy_score(y_test_b, clf.predict(X_test_b))
print(f"Accuracy: {accuracy:.2f}")
# R: Supervised learning with tidymodels
library(tidymodels)
library(readr)  # read_csv() comes from readr, which tidymodels does not attach
# Load data
df <- read_csv("data.csv")
# Split data
set.seed(42)
split <- initial_split(df, prop = 0.8)
train <- training(split)
test <- testing(split)
# Define model
lm_spec <- linear_reg() %>%
  set_engine("lm")
# Fit model
lm_fit <- lm_spec %>%
  fit(income ~ age + education + experience, data = train)
# Predict and evaluate
predictions <- predict(lm_fit, test)
results <- test %>% bind_cols(predictions)
rmse_result <- results %>%
  rmse(truth = income, estimate = .pred)  # yardstick::rmse(); avoid naming the result "rmse"
print(rmse_result)
10.4 Model Evaluation
Train-Test Split
Never evaluate on training data! A flexible model can effectively memorize its training examples, so training accuracy overstates real-world performance. Always hold out a test set that the model never sees during training.
Cross-Validation
K-fold cross-validation provides more robust estimates by training K models, each holding out a different fold.
# Python: K-fold cross-validation
from sklearn.model_selection import cross_val_score, KFold
# 5-fold cross-validation
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring='neg_mean_squared_error')
rmse_scores = (-scores) ** 0.5  # convert negative MSE to per-fold RMSE
print(f"CV RMSE: {rmse_scores.mean():.2f} (+/- {rmse_scores.std():.2f})")
Metrics
| Task | Metric | Description |
|---|---|---|
| Regression | RMSE | Root Mean Squared Error |
| Regression | MAE | Mean Absolute Error |
| Regression | R^2 | Variance explained |
| Classification | Accuracy | Percent correct |
| Classification | Precision | TP / (TP + FP) |
| Classification | Recall | TP / (TP + FN) |
| Classification | AUC-ROC | Area under the ROC curve |
10.5 Regularization
Regularization prevents overfitting by penalizing model complexity.
| Method | Penalty | Effect |
|---|---|---|
| Ridge (L2) | lambda * sum(beta^2) | Shrinks coefficients toward zero (rarely exactly zero) |
| LASSO (L1) | lambda * sum(|beta|) | Sets some coefficients exactly to zero (variable selection) |
| Elastic Net | alpha(L1) + (1-alpha)(L2) | Combines both |
# Python: LASSO with cross-validated lambda
from sklearn.linear_model import LassoCV
# LASSO with automatic penalty selection (scikit-learn calls lambda "alpha");
# standardizing features first is usually advisable so the penalty treats them comparably
lasso = LassoCV(cv=5, random_state=42)
lasso.fit(X_train, y_train)
print(f"Best lambda: {lasso.alpha_:.4f}")
print(f"Non-zero coefficients: {(lasso.coef_ != 0).sum()}")
10.6 Tree-Based Methods
Decision trees partition the feature space into regions, predicting the mean (regression) or mode (classification) in each region.
Random Forests
Random forests average many trees, each trained on a bootstrap sample with random feature subsets. This reduces variance dramatically.
Gradient Boosting
Gradient boosting builds trees sequentially, with each tree correcting the errors of previous trees. XGBoost and LightGBM are popular implementations.
# Python: Random Forest and XGBoost
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
# Random Forest
rf = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
rf.fit(X_train, y_train)
# Feature importance
importance = pd.DataFrame({
    'feature': X.columns,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)
# XGBoost
xgb_model = xgb.XGBRegressor(n_estimators=100, learning_rate=0.1, max_depth=5)
xgb_model.fit(X_train, y_train)
10.7 Neural Networks
Neural networks learn hierarchical representations through layers of nonlinear transformations. Deep learning refers to networks with many layers.
# Python: Neural network with PyTorch
import torch
import torch.nn as nn
class SimpleNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.layer1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.layer2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = self.layer1(x)
        x = self.relu(x)
        x = self.layer2(x)
        return x
# Create model
model = SimpleNN(input_size=10, hidden_size=64, output_size=1)
# Loss and optimizer
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
10.8 ML for Causal Inference
A growing field combines ML's predictive power with causal inference frameworks. Key approaches:
| Method | Use Case |
|---|---|
| Double ML (DML) | Use ML to partial out confounders, then estimate treatment effect |
| Causal Forests | Estimate heterogeneous treatment effects |
| LASSO for Controls | Variable selection for control variables |
| Synthetic Control | ML-weighted comparison units |
Key references:
- Athey, S., & Imbens, G. (2019). Machine Learning Methods That Economists Should Know About. Annual Review of Economics.
- Chernozhukov, V., et al. (2018). Double/Debiased Machine Learning for Treatment and Structural Parameters. The Econometrics Journal.
- Wager, S., & Athey, S. (2018). Estimation and Inference of Heterogeneous Treatment Effects Using Random Forests. Journal of the American Statistical Association.
10.9 Exercises
Exercise 10.1: Machine Learning Workflow
Practice a basic supervised learning workflow. Complete these tasks:
- Split data into training and test sets
- Fit a model (linear regression or random forest)
- Evaluate model performance (RMSE or accuracy)