11E Model Evaluation
Learning Objectives
- Understand why evaluating on training data is misleading and how overfitting distorts performance estimates
- Implement train-test splits and K-fold cross-validation in Python, Stata, and R
- Compute and interpret regression metrics: RMSE, MAE, R-squared, and MAPE
- Compute and interpret classification metrics: accuracy, precision, recall, F1, and AUC-ROC
- Recognize and prevent data leakage -- the most common source of invalid ML results
- Use learning curves to diagnose underfitting vs. overfitting
- Use proper evaluation strategies to compare and select among competing models
Why Evaluation Matters
Every model you have encountered in this course -- from OLS regression to LASSO, random forests, and neural networks -- needs to be evaluated. But evaluated on what? The answer to this question is one of the most important ideas in all of machine learning, and it separates rigorous empirical work from misleading results.
The central problem is overfitting. A sufficiently flexible model -- a deep neural network, a random forest with many trees, or even an OLS regression with hundreds of interaction terms -- can always achieve excellent performance on the data it was trained on. It does this by memorizing the noise, the outliers, and the quirks of that specific dataset. The model's apparent "accuracy" is an illusion: it reflects the model's ability to recall the training data, not its ability to predict new, unseen observations.
The Exam Analogy
Think of model evaluation like preparing for an exam. A student who memorizes every question and answer from past exams (training data) will score perfectly on those same exams -- but may fail completely when given new questions (test data) that require genuine understanding. The only honest measure of learning is performance on questions the student has never seen before. Machine learning models work the same way.
Consider a simple example. Suppose you want to predict GDP growth using 20 macroeconomic indicators, and your dataset contains 50 country-year observations. If you fit a polynomial regression with enough interaction terms, you can achieve an R-squared of 1.0 on the training data -- a "perfect" fit. But this model has essentially memorized your 50 data points. Give it data from a new year or a new country, and its predictions will be wildly inaccurate. The training-set R-squared of 1.0 told you nothing useful about the model's real predictive ability.
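The gap between training fit and real predictive ability is easy to reproduce. The sketch below uses synthetic data (not the GDP example itself): it fits a straight line and a degree-9 polynomial to 30 noisy points, then scores both on fresh draws from the same process. The flexible model wins in-sample and loses out-of-sample.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# 30 noisy training points from a simple linear process
X_train = rng.uniform(-3, 3, size=(30, 1))
y_train = 2 * X_train.ravel() + rng.normal(0, 2, size=30)

# Fresh data from the SAME process -- the "new year / new country"
X_new = rng.uniform(-3, 3, size=(200, 1))
y_new = 2 * X_new.ravel() + rng.normal(0, 2, size=200)

for degree in (1, 9):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    r2_in = r2_score(y_train, model.predict(X_train))
    r2_out = r2_score(y_new, model.predict(X_new))
    print(f"degree {degree}: train R2 = {r2_in:.3f}, new-data R2 = {r2_out:.3f}")
```

The degree-9 polynomial always posts the higher training R-squared; the only way to see that this is memorization rather than learning is to score it on data it never saw.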
The Cardinal Rule of Model Evaluation
Never evaluate a model on the same data used to train it. Training-set performance is always optimistically biased. The more flexible the model, the larger the gap between training performance and true out-of-sample performance. This is not a minor technicality -- it is the difference between a model that works and one that only appears to work.
The solution is conceptually simple: set aside data that the model has never seen during training, and evaluate the model's predictions on that held-out data. The specific strategies for doing this -- train-test splits and cross-validation -- are the subject of the next two sections. But the underlying principle is always the same: the only honest measure of a model's quality is its performance on data it has not been trained on.
This principle matters just as much in economics as in computer science. If you are building a model to predict firm bankruptcy, forecast inflation, or identify households eligible for a targeted policy, you need to know how well it will perform when deployed on new data. Reporting only training-set performance is analogous to judging a student by giving them the exact same exam they used to study -- it tells you about memorization, not understanding.
Train-Test Split
The simplest evaluation strategy is to split your data into two parts: a training set used to fit the model, and a test set (sometimes called a "hold-out set") used to evaluate it. The model never sees the test data during training -- it is kept completely separate, like sealing an exam in an envelope until the student has finished studying.
A typical split is 80% training and 20% test, though the exact ratio depends on the size of your dataset. With very large datasets (hundreds of thousands of observations), even a 90/10 split leaves plenty of test data. With small datasets (a few hundred observations), you may need cross-validation instead (covered in the next section), because holding out 20% leaves too few observations for both reliable training and reliable evaluation.
Two important implementation details. First, the split should be random -- if your data is sorted (by time, by country, by outcome), a non-random split would systematically exclude certain types of observations from training or testing, leading to biased results. Second, you should set a random seed so that your results are reproducible. Anyone running your code should get the same split and therefore the same results.
When Random Splits Are Wrong
If your data has a time dimension (panel data, time series), a random split can leak future information into the training set. For example, if you randomly split yearly GDP data, the model might train on 2020 data and predict 2018 -- it has seen the future. In these cases, use a temporal split: train on earlier periods and test on later ones. This mimics the real prediction task, where you only have past data available.
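A temporal split needs no special tooling -- it is just a filter on the time variable. A minimal sketch with a hypothetical yearly panel (placeholder values, pandas assumed):

```python
import pandas as pd

# Hypothetical yearly panel: 3 units per year, 2000-2019 (values are placeholders)
df = pd.DataFrame({
    "year": sorted(list(range(2000, 2020)) * 3),
    "gdp_growth": range(60),
})

# Temporal split: train strictly on the past, test on the future
train = df[df["year"] <= 2015]
test = df[df["year"] > 2015]

print(f"Train years: {train['year'].min()}-{train['year'].max()} (n={len(train)})")
print(f"Test years:  {test['year'].min()}-{test['year'].max()} (n={len(test)})")
# Train years: 2000-2015 (n=48)
# Test years:  2016-2019 (n=12)
```

For rolling-origin evaluation of time series, scikit-learn's `TimeSeriesSplit` generalizes this idea to multiple expanding train/test windows.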
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing
# Load California Housing data
housing = fetch_california_housing(as_frame=True)
X = housing.data # 20,640 observations, 8 features
y = housing.target # Median house value (in $100k)
# Split: 80% training, 20% test
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.2,
random_state=42
)
print(f"Full dataset: {X.shape[0]:,} observations")
print(f"Training set: {X_train.shape[0]:,} observations ({X_train.shape[0]/X.shape[0]:.0%})")
print(f"Test set: {X_test.shape[0]:,} observations ({X_test.shape[0]/X.shape[0]:.0%})")
print(f"\nFeatures: {X.shape[1]}")
print(f"Target mean: {y.mean():.3f} (train: {y_train.mean():.3f}, test: {y_test.mean():.3f})")
* Stata: Train-test split using random number generation
* Load example data
sysuse auto, clear
* Set seed for reproducibility
set seed 42
* Generate a uniform random variable and split at 80th percentile
gen rand = runiform()
gen byte split = cond(rand <= 0.8, 1, 0)
label define splitlbl 1 "Train" 0 "Test"
label values split splitlbl
* Verify the split
tab split
* Summarize the outcome in each split
tabstat price, by(split) stat(n mean sd min max) format(%9.1f)
* Fit model on training data only
reg price mpg weight length if split == 1
* Generate predictions for ALL observations (train + test)
predict yhat
* Evaluate on TEST set only
gen resid_sq = (price - yhat)^2 if split == 0
quietly summarize resid_sq
display "Test RMSE: " sqrt(r(mean))
# R: Train-test split using rsample (tidymodels framework)
library(rsample)
# Load built-in mtcars data (32 observations, 10 features)
data(mtcars)
# Create an 80/20 split (stratified splits are also possible)
set.seed(42)
split_obj <- initial_split(mtcars, prop = 0.8)
# Extract training and test sets
train_data <- training(split_obj)
test_data <- testing(split_obj)
cat("Full dataset: ", nrow(mtcars), "observations\n")
cat("Training set: ", nrow(train_data), "observations",
sprintf("(%0.0f%%)\n", 100 * nrow(train_data) / nrow(mtcars)))
cat("Test set: ", nrow(test_data), "observations",
sprintf("(%0.0f%%)\n", 100 * nrow(test_data) / nrow(mtcars)))
cat("\nTarget (mpg) mean -- train:", round(mean(train_data$mpg), 2),
" test:", round(mean(test_data$mpg), 2), "\n")
Full dataset: 20,640 observations
Training set: 16,512 observations (80%)
Test set: 4,128 observations (20%)

Features: 8
Target mean: 2.069 (train: 2.072, test: 2.055)
split | Freq. Percent Cum.
-----------+-----------------------------------
Train | 59 79.73 79.73
Test | 15 20.27 100.00
-----------+-----------------------------------
Total | 74 100.00
split | N Mean S.D. Min Max
----------+-----------------------------------------------------
Train | 59 6165.3 2949.5 3291 15906
Test | 15 6229.9 3217.1 3748 14500
----------+-----------------------------------------------------
Total | 74 6178.4 2993.0 3291 15906
Test RMSE: 2461.38

Full dataset:  32 observations
Training set:  25 observations (78%)
Test set:  7 observations (22%)

Target (mpg) mean -- train: 19.96  test: 20.93
K-Fold Cross-Validation
A single train-test split has an obvious limitation: the evaluation depends on which observations happened to land in the test set. With a small dataset, a different random split could give substantially different results. You might get lucky (easy test observations) or unlucky (hard ones), and your performance estimate will be noisy.
The Practice Exam Analogy
Cross-validation is like studying with practice exams. Instead of taking one practice test and hoping it covers the right material, you take five different practice tests, each covering a different subset of the material. Your average score across all five gives a much more reliable estimate of how you will perform on the real exam than any single practice test alone.
K-fold cross-validation solves this by using every observation for both training and testing, just not at the same time. The procedure is straightforward: divide the data into K equally-sized subsets (called "folds"). Then, for each fold in turn, train the model on the other K-1 folds and evaluate it on the held-out fold. After K rounds, every observation has been in the test set exactly once. The final performance metric is the average across all K folds.
The most common choice is K = 5 or K = 10. With 5-fold CV, each fold contains 20% of the data, and the model is trained 5 times on 80% of the data. This gives a more reliable performance estimate than a single split, because it averages over 5 different train-test configurations. The cost is computational: you must train the model K times instead of once.
5-Fold Cross-Validation
Each observation appears in the test set exactly once. The final score is the average across all folds -- a more reliable estimate than any single split.
An important variant is stratified K-fold cross-validation, which ensures that each fold has roughly the same distribution of the target variable. This is particularly important for classification problems with imbalanced classes -- if only 5% of firms in your dataset go bankrupt, a non-stratified split might create a fold with no bankrupt firms at all, making evaluation meaningless.
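The failure mode is easy to see with scikit-learn's `KFold` and `StratifiedKFold`. The sketch below uses synthetic labels -- 10 "bankrupt" firms out of 200, deliberately listed first so that a plain contiguous split dumps them all into one fold:

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Synthetic imbalanced labels: 10 "bankrupt" firms out of 200, listed first
y = np.array([1] * 10 + [0] * 190)
X = np.zeros((200, 1))  # features are irrelevant to the split itself

for name, splitter in [("KFold", KFold(n_splits=5)),
                       ("StratifiedKFold", StratifiedKFold(n_splits=5))]:
    counts = [int(y[test_idx].sum()) for _, test_idx in splitter.split(X, y)]
    print(f"{name:16s} positives per test fold: {counts}")
# KFold            positives per test fold: [10, 0, 0, 0, 0]
# StratifiedKFold  positives per test fold: [2, 2, 2, 2, 2]
```

With plain (unshuffled) K-fold, four of the five folds contain no positive cases at all; stratification spreads the rare class evenly.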
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
# Load data
housing = fetch_california_housing()
X, y = housing.data, housing.target
# Define two competing models
ols = LinearRegression()
rf = RandomForestRegressor(n_estimators=100, random_state=42)
# 5-fold cross-validation for both models
# scoring="r2" computes R-squared; scoring="neg_mean_squared_error" gives -MSE
ols_scores = cross_val_score(ols, X, y, cv=5, scoring="r2")
rf_scores = cross_val_score(rf, X, y, cv=5, scoring="r2")
print("5-Fold Cross-Validation: R-squared")
print("=" * 45)
print(f"OLS Regression:")
print(f" Per-fold scores: {ols_scores.round(3)}")
print(f" Mean: {ols_scores.mean():.3f} (Std: {ols_scores.std():.3f})")
print(f"\nRandom Forest:")
print(f" Per-fold scores: {rf_scores.round(3)}")
print(f" Mean: {rf_scores.mean():.3f} (Std: {rf_scores.std():.3f})")
* Stata: K-fold cross-validation using the crossfold command
* Install if needed: ssc install crossfold
sysuse auto, clear
set seed 42
* crossfold performs k-fold CV for regression models
* It reports the out-of-sample RMSE for each fold
crossfold reg price mpg weight length, k(5)
* The output shows RMSE for each fold and the overall average
* You can also do this manually for more control:
* Manual 5-fold CV
gen fold = ceil(5 * runiform())
forvalues k = 1/5 {
quietly reg price mpg weight length if fold != `k'
quietly predict double yhat_`k' if fold == `k'
quietly gen double se_`k' = (price - yhat_`k')^2 if fold == `k'
quietly summarize se_`k'
display "Fold `k': RMSE = " sqrt(r(mean))
}
* Compute overall CV RMSE
egen all_se = rowtotal(se_1-se_5), missing
quietly summarize all_se
display "Overall CV RMSE: " sqrt(r(mean))
# R: K-fold cross-validation using tidymodels
library(rsample)
library(parsnip)
library(yardstick)
library(tune)
library(workflows)
# Load data
data(mtcars)
# Create 5-fold cross-validation object
set.seed(42)
folds <- vfold_cv(mtcars, v = 5)
# Define two models
lm_spec <- linear_reg() |> set_engine("lm")
rf_spec <- rand_forest(trees = 100) |>
set_engine("ranger") |>
set_mode("regression")
# Create workflows (model + formula)
lm_wf <- workflow() |> add_model(lm_spec) |> add_formula(mpg ~ .)
rf_wf <- workflow() |> add_model(rf_spec) |> add_formula(mpg ~ .)
# Fit both models using 5-fold CV
lm_results <- fit_resamples(lm_wf, resamples = folds, metrics = metric_set(rsq, rmse))
rf_results <- fit_resamples(rf_wf, resamples = folds, metrics = metric_set(rsq, rmse))
# Collect and display average metrics
cat("=== OLS Regression ===\n")
print(collect_metrics(lm_results))
cat("\n=== Random Forest ===\n")
print(collect_metrics(rf_results))
5-Fold Cross-Validation: R-squared
=============================================
OLS Regression:
  Per-fold scores: [0.596 0.609 0.593 0.607 0.592]
  Mean: 0.599 (Std: 0.007)

Random Forest:
  Per-fold scores: [0.805 0.826 0.812 0.817 0.809]
  Mean: 0.814 (Std: 0.007)
Fold 1: Root Mean Squared Error = 2372.019
Fold 2: Root Mean Squared Error = 2781.445
Fold 3: Root Mean Squared Error = 2196.584
Fold 4: Root Mean Squared Error = 2534.717
Fold 5: Root Mean Squared Error = 2688.103

Fold 1: RMSE = 2410.52
Fold 2: RMSE = 2693.18
Fold 3: RMSE = 2247.91
Fold 4: RMSE = 2581.36
Fold 5: RMSE = 2732.40
Overall CV RMSE: 2541.37
=== OLS Regression ===
# A tibble: 2 x 6
  .metric .estimator  mean     n std_err .config
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>
1 rmse    standard   3.46      5  0.485  Preprocessor1_Model1
2 rsq     standard   0.773     5  0.0631 Preprocessor1_Model1

=== Random Forest ===
# A tibble: 2 x 6
  .metric .estimator  mean     n std_err .config
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>
1 rmse    standard   2.87      5  0.398  Preprocessor1_Model1
2 rsq     standard   0.843     5  0.0412 Preprocessor1_Model1
Regression Metrics
Once you have a held-out test set (or cross-validation folds), you need a metric to quantify how well the model's predictions match the actual values. For regression problems -- where the outcome is a continuous variable like income, GDP, or house prices -- the most commonly used metrics are RMSE, MAE, R-squared, and MAPE. Each measures a different aspect of prediction quality, and the right choice depends on your application.
What are regression metrics for?
Regression metrics answer one question: how far off are my predictions from reality? If you predict that a house is worth $300,000 and it sells for $350,000, your error is $50,000. Regression metrics aggregate these individual errors across all test observations into a single number that summarizes the model's overall prediction quality. Different metrics emphasize different things -- some penalize big mistakes more, some are easier to interpret, and some allow comparison across different scales.
Root Mean Squared Error (RMSE)
RMSE is the most widely used regression metric. It computes the square root of the average squared prediction error:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

where $y_i$ is the actual value and $\hat{y}_i$ the model's prediction for observation $i$.
RMSE is in the same units as the outcome variable -- if you are predicting house prices in dollars, the RMSE is also in dollars. This makes it directly interpretable: an RMSE of $25,000 means the model's predictions are typically about $25,000 away from the true values. Because it squares the errors before averaging, RMSE penalizes large errors disproportionately. A single prediction that is off by $100,000 contributes more to the RMSE than ten predictions that are each off by $10,000. This is often desirable -- in many applications, big mistakes are much worse than small ones.
Mean Absolute Error (MAE)
MAE takes a simpler approach -- it averages the absolute prediction errors without squaring:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$
MAE is also in the same units as the outcome and is easier to interpret: it is literally the average size of the prediction errors. Unlike RMSE, MAE treats all errors equally -- a $100,000 error contributes exactly 10 times as much as a $10,000 error. MAE is more robust to outliers than RMSE. If your data has a few extreme values that are hard to predict, RMSE will be dominated by those outliers, while MAE gives you a better sense of typical prediction quality.
R-squared (Coefficient of Determination)
R-squared measures the proportion of variance in the outcome that is explained by the model's predictions:

$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$
R-squared is unitless and ranges from negative infinity to 1. An R-squared of 1.0 means perfect prediction. An R-squared of 0.0 means the model does no better than simply predicting the mean of y for every observation. An R-squared below zero means the model is worse than predicting the mean -- this can happen with out-of-sample evaluation when the model has overfit badly. The key advantage of R-squared is that it is scale-free -- you can compare R-squared values across different outcome variables (e.g., predicting income vs. predicting GDP growth) in a way that RMSE does not allow.
Mean Absolute Percentage Error (MAPE)
MAPE expresses errors as a percentage of the actual values:

$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$$
MAPE is intuitive -- a MAPE of 8% means predictions are off by about 8% on average. However, MAPE has a serious limitation: it is undefined when y_i = 0 and becomes extremely large when y_i is close to zero. For this reason, MAPE works best when the outcome variable is strictly positive and does not have values near zero (e.g., GDP, house prices, firm revenue).
When to Use RMSE
- Large errors are disproportionately costly
- You want a metric in the same units as the outcome
- Standard default for most ML applications
- Example: predicting house prices where a $200k miss is far worse than four $50k misses
When to Use MAE
- All error sizes are equally important
- Data contains outliers or extreme values
- You want a robust, easily interpretable metric
- Example: predicting delivery times where all delays matter equally
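The contrast between the two metrics is easiest to see numerically. The sketch below compares two hypothetical error patterns with the same total absolute error: ten $10k misses versus one $100k miss.

```python
import numpy as np

# Two hypothetical error patterns with the same total absolute error:
#   A: ten predictions each off by $10k
#   B: one prediction off by $100k, nine exactly right
errors_a = np.array([10_000.0] * 10)
errors_b = np.array([100_000.0] + [0.0] * 9)

for name, e in (("A (many small)", errors_a), ("B (one large) ", errors_b)):
    mae = np.mean(np.abs(e))
    rmse = np.sqrt(np.mean(e ** 2))
    print(f"{name}: MAE = ${mae:,.0f}, RMSE = ${rmse:,.0f}")
# A (many small): MAE = $10,000, RMSE = $10,000
# B (one large) : MAE = $10,000, RMSE = $31,623
```

MAE cannot tell the two patterns apart; RMSE penalizes the single large miss more than three-fold.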
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import (
mean_squared_error,
mean_absolute_error,
r2_score,
mean_absolute_percentage_error
)
# Load data and split
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
housing.data, housing.target, test_size=0.2, random_state=42
)
# Train a random forest
rf = RandomForestRegressor(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
# Generate predictions on the TEST set
y_pred = rf.predict(X_test)
# Compute all regression metrics
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # sqrt of MSE; the squared= argument was removed in recent scikit-learn
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
mape = mean_absolute_percentage_error(y_test, y_pred)
print("Regression Metrics (Test Set)")
print("=" * 40)
print(f"RMSE: {rmse:.4f}")
print(f"MAE: {mae:.4f}")
print(f"R-squared: {r2:.4f}")
print(f"MAPE: {mape:.2%}")
print(f"\nInterpretation:")
print(f" Predictions are off by ~${mae * 100000:,.0f} on average (MAE)")
print(f" The model explains {r2:.1%} of variance in house values")
* Stata: Computing regression metrics manually on the test set
sysuse auto, clear
set seed 42
* Create train/test split
gen rand = runiform()
gen byte is_test = (rand > 0.8)
* Fit model on training data
quietly reg price mpg weight length if is_test == 0
predict double yhat
* Compute metrics on TEST set only
gen double error = price - yhat if is_test == 1
gen double error_sq = error^2 if is_test == 1
gen double error_ab = abs(error) if is_test == 1
gen double pct_err = abs(error / price) if is_test == 1
* RMSE
quietly summarize error_sq
scalar rmse = sqrt(r(mean))
* MAE
quietly summarize error_ab
scalar mae = r(mean)
* R-squared (out-of-sample)
quietly summarize price if is_test == 1
scalar ybar = r(mean)
gen double ss_tot = (price - ybar)^2 if is_test == 1
quietly summarize ss_tot
scalar ss_total = r(sum)
quietly summarize error_sq
scalar ss_res = r(sum)
scalar r2_oos = 1 - ss_res / ss_total
* MAPE
quietly summarize pct_err
scalar mape = r(mean) * 100
* Display results
display "Regression Metrics (Test Set)"
display "========================================"
display "RMSE: " %9.2f rmse
display "MAE: " %9.2f mae
display "R-squared: " %9.4f r2_oos
display "MAPE: " %9.2f mape "%"
# R: Computing regression metrics with yardstick
library(rsample)
library(randomForest)
library(dplyr)      # needed for summarise() below
library(yardstick)
# Load data and split
data(mtcars)
set.seed(42)
split <- initial_split(mtcars, prop = 0.8)
train <- training(split)
test <- testing(split)
# Train a random forest on the training set
rf_model <- randomForest(mpg ~ ., data = train, ntree = 200)
# Predict on the TEST set
test$predicted <- predict(rf_model, newdata = test)
# Compute metrics manually
results <- test |>
summarise(
RMSE = sqrt(mean((mpg - predicted)^2)),
MAE = mean(abs(mpg - predicted)),
R2 = 1 - sum((mpg - predicted)^2) / sum((mpg - mean(mpg))^2),
MAPE = mean(abs((mpg - predicted) / mpg)) * 100
)
# Display results
cat("Regression Metrics (Test Set)\n")
cat("========================================\n")
cat(sprintf("RMSE: %.4f\n", results$RMSE))
cat(sprintf("MAE: %.4f\n", results$MAE))
cat(sprintf("R-squared: %.4f\n", results$R2))
cat(sprintf("MAPE: %.2f%%\n", results$MAPE))
Regression Metrics (Test Set)
========================================
RMSE: 0.5010
MAE: 0.3272
R-squared: 0.8098
MAPE: 17.25%

Interpretation:
  Predictions are off by ~$32,720 on average (MAE)
  The model explains 81.0% of variance in house values
Regression Metrics (Test Set)
========================================
RMSE: 2461.38
MAE: 1894.52
R-squared: 0.3271
MAPE: 29.87%
Regression Metrics (Test Set)
========================================
RMSE: 2.7134
MAE: 2.1026
R-squared: 0.8523
MAPE: 10.84%
Classification Metrics
When the outcome is categorical -- whether a firm defaults, whether a household participates in a program, whether an email is spam -- the evaluation metrics are fundamentally different from regression. Prediction errors are no longer continuous deviations; they are discrete mistakes: the model said "yes" but the truth was "no," or vice versa. Classification metrics quantify these discrete errors in different ways, and choosing the right metric is critical because they often disagree about which model is "best."
What are classification metrics for?
Classification metrics answer: how often does the model get the category right, and what kind of mistakes does it make? Unlike regression where errors are "how far off," classification errors come in distinct flavors -- and the cost of each flavor depends entirely on the application. Flagging a legitimate email as spam (false positive) is annoying; missing a fraudulent transaction (false negative) can cost thousands of dollars. The metrics below let you measure and compare these different types of errors.
Economics Example: The Accuracy Trap
When predicting loan defaults (suppose only 1% of loans default), accuracy is useless. A model that predicts "no default" for every single loan achieves 99% accuracy while being completely useless for the task -- it never identifies any defaults. This is why you need precision, recall, and F1: they measure how well the model handles the rare but important class. In economics, the interesting class is almost always the rare one: firm bankruptcy, program take-up, policy violations, financial crises.
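The trap is easy to verify with scikit-learn. The sketch below uses synthetic labels matching the 1% default scenario and a "model" that never predicts a default:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels: 100 defaults out of 10,000 loans (the 1% scenario)
y_true = np.array([1] * 100 + [0] * 9900)

# A "model" that predicts "no default" for every single loan
y_pred = np.zeros_like(y_true)

print(f"Accuracy: {accuracy_score(y_true, y_pred):.1%}")  # 99.0%
print(f"Recall:   {recall_score(y_true, y_pred):.1%}")    # 0.0% -- it finds no defaults
```

Accuracy looks excellent; recall exposes that the model identifies none of the cases you actually care about.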
The foundation of all classification metrics is the confusion matrix. For a binary classification problem, it is a 2x2 table with four cells:
Confusion Matrix
Green cells = correct predictions. Red cells = errors. Every classification metric is computed from these four counts.
Accuracy
Accuracy is the simplest metric: the fraction of all predictions that are correct.

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
Accuracy is intuitive but deeply misleading with imbalanced classes. As shown in the loan default example above, a model that always predicts the majority class can achieve very high accuracy while being completely useless. For this reason, accuracy should almost never be used as the primary metric for imbalanced classification problems -- and in economics, most interesting classification tasks are imbalanced.
Precision and Recall
The Spam Filter Analogy
Precision vs. recall is like a spam filter: high precision means almost everything the filter flags IS actually spam, but you might miss some spam in your inbox. High recall means the filter catches almost ALL spam, but some real emails get flagged too. You cannot maximize both at once -- there is always a tradeoff, and the right balance depends on how costly each type of mistake is.
Precision answers: "Of all observations the model labeled as positive, how many actually were positive?" It measures the reliability of positive predictions.

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
Recall (also called sensitivity or true positive rate) answers: "Of all observations that were actually positive, how many did the model identify?" It measures the completeness of positive detection.

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
There is an inherent trade-off between precision and recall. Lowering the classification threshold (being more willing to predict "positive") increases recall but decreases precision, and vice versa. The right balance depends on the costs of each type of error.
Precision-Recall Tradeoff
As you raise the threshold, precision increases (fewer false alarms) but recall decreases (more misses). The F1 score balances the two.
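The tradeoff can be seen directly by sweeping the threshold over one fixed set of predicted scores. The sketch below uses synthetic scores in which positives tend to score higher than negatives:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)

# Synthetic scores: 100 positives centered higher than 900 negatives
y_true = np.array([1] * 100 + [0] * 900)
scores = np.concatenate([rng.normal(0.7, 0.15, 100),
                         rng.normal(0.3, 0.15, 900)])

results = {}
for threshold in (0.3, 0.5, 0.7):
    y_pred = (scores >= threshold).astype(int)
    p = precision_score(y_true, y_pred, zero_division=0)
    r = recall_score(y_true, y_pred)
    results[threshold] = (p, r)
    print(f"threshold {threshold:.1f}: precision = {p:.2f}, recall = {r:.2f}")
```

Raising the threshold shrinks the set of predicted positives, so recall can only fall while precision typically rises. scikit-learn's `precision_recall_curve` traces the same sweep over every threshold at once.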
F1 Score
The F1 score is the harmonic mean of precision and recall, providing a single metric that balances both:

$$F_1 = 2 \cdot \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
F1 ranges from 0 to 1. It equals 1 only when both precision and recall are perfect. The harmonic mean penalizes imbalance: if either precision or recall is very low, F1 will be low even if the other is high. This makes F1 a better summary metric than accuracy for imbalanced problems.
Precision-Focused Problems
- Spam filtering (users hate false alarms)
- Drug approval (false positives have high costs)
- Targeting scarce program resources
- When acting on a prediction is expensive
Recall-Focused Problems
- Fraud detection (missing fraud is very costly)
- Disease screening (missing cases is dangerous)
- Identifying at-risk students or firms
- When missing a positive case is the worst outcome
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
classification_report,
confusion_matrix,
roc_auc_score,
accuracy_score
)
# Load breast cancer data (binary classification)
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
data.data, data.target, test_size=0.2, random_state=42,
stratify=data.target
)
# Train a random forest classifier
clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_train, y_train)
# Hard predictions and predicted probabilities
y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)[:, 1] # Probability of positive class
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(f" TN={cm[0,0]} FP={cm[0,1]}")
print(f" FN={cm[1,0]} TP={cm[1,1]}")
# Full classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred,
target_names=["Malignant", "Benign"]))
# AUC-ROC (uses probabilities, not hard labels)
auc = roc_auc_score(y_test, y_prob)
print(f"AUC-ROC: {auc:.4f}")
* Stata: Classification metrics after logistic regression
webuse lbw, clear
* Binary outcome: low birth weight (low = 1 if birth weight < 2500g)
set seed 42
gen rand = runiform()
gen byte is_test = (rand > 0.8)
* Fit logistic regression on training data
logit low age lwt i.race smoke ptl ht ui if is_test == 0
* Predict probabilities for all observations
predict double phat, pr
* Classify using 0.5 threshold
gen byte y_pred = (phat >= 0.5)
* Confusion matrix on test set
display "Confusion Matrix (Test Set):"
tab low y_pred if is_test == 1
* estat classification gives sensitivity, specificity, etc.
quietly logit low age lwt i.race smoke ptl ht ui if is_test == 0
estat classification if is_test == 1
* ROC curve and AUC
lroc if is_test == 1, nograph
# R: Classification metrics using yardstick
library(rsample)
library(parsnip)
library(dplyr)      # needed for summarise() below
library(yardstick)
# Load data (built-in two-class dataset)
data(two_class_example, package = "yardstick")
# Compute confusion matrix
cat("Confusion Matrix:\n")
print(conf_mat(two_class_example, truth = truth, estimate = predicted))
# Compute all key classification metrics
metrics_result <- two_class_example |>
summarise(
Accuracy = accuracy_vec(truth, predicted),
Precision = precision_vec(truth, predicted),
Recall = recall_vec(truth, predicted),
F1 = f_meas_vec(truth, predicted)
)
cat("\nClassification Metrics:\n")
cat(sprintf(" Accuracy: %.4f\n", metrics_result$Accuracy))
cat(sprintf(" Precision: %.4f\n", metrics_result$Precision))
cat(sprintf(" Recall: %.4f\n", metrics_result$Recall))
cat(sprintf(" F1 Score: %.4f\n", metrics_result$F1))
# AUC-ROC (requires predicted probabilities)
auc_val <- roc_auc_vec(
two_class_example$truth,
two_class_example$Class1
)
cat(sprintf(" AUC-ROC: %.4f\n", auc_val))
Confusion Matrix:
TN=40 FP=3
FN=2 TP=69
Classification Report:
precision recall f1-score support
Malignant 0.95 0.93 0.94 43
Benign 0.96 0.97 0.97 71
accuracy 0.96 114
macro avg 0.96 0.95 0.95 114
weighted avg 0.96 0.96 0.96 114
AUC-ROC: 0.9952

Confusion Matrix (Test Set):
| y_pred
low | 0 1 | Total
-----------+----------------------+----------
0 | 23 4 | 27
1 | 5 6 | 11
-----------+----------------------+----------
Total | 28 10 | 38
Logistic model for low
-------- True --------
Classified | D ~D | Total
-----------+--------------------------+-----------
+ | 6 4 | 10
- | 5 23 | 28
-----------+--------------------------+-----------
Total | 11 27 | 38
Sensitivity Pr( +| D) 54.55%
Specificity Pr( -|~D) 85.19%
Positive predictive value Pr( D| +) 60.00%
Negative predictive value Pr(~D| -) 82.14%
Correctly classified 76.32%
Area under ROC curve = 0.7828

Confusion Matrix:
Truth
Prediction Class1 Class2
Class1 227 27
Class2 23 223
Classification Metrics:
Accuracy: 0.9000
Precision: 0.8937
Recall: 0.9080
F1 Score: 0.9008
AUC-ROC: 0.9406
AUC-ROC and Threshold-Independent Evaluation
The Card-Sorting Analogy
AUC-ROC is like measuring how good you are at sorting a shuffled deck of red and black cards back into the right piles. If you are perfect, every red card gets placed before every black card (AUC = 1.0). If you sort randomly, you do no better than chance (AUC = 0.5). The AUC measures your ranking ability -- whether you can consistently place the positives ahead of the negatives -- regardless of where you draw the cutoff line between the two piles.
The AUC-ROC evaluates the model's ability to discriminate between positive and negative classes across all possible thresholds, rather than at a single threshold. The ROC curve plots the true positive rate (recall) against the false positive rate at every possible classification threshold. The area under this curve (AUC) summarizes overall discrimination ability.
AUC-ROC is particularly useful when you have not yet decided on a classification threshold, or when you want to compare models independently of threshold choice. A model with a higher AUC is better at ranking positive cases above negative cases, regardless of where you eventually draw the line.
ROC Curves: Good Model vs. Random vs. Perfect
The further the curve bows toward the top-left corner, the better the model. The shaded area is the AUC -- the probability that a randomly chosen positive case is ranked above a randomly chosen negative case.
Interpreting AUC Values
- AUC = 0.5: No better than flipping a coin. The model has no discrimination ability.
- AUC = 0.7-0.8: Acceptable discrimination. Common for many real-world problems.
- AUC = 0.8-0.9: Good discrimination. The model reliably ranks positives above negatives.
- AUC > 0.9: Excellent discrimination. Rare in noisy social science data -- check for data leakage if you see this unexpectedly.
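Threshold independence is easy to demonstrate directly. The sketch below is illustrative only (it uses a synthetic dataset, not any dataset from this module): a logistic regression's accuracy shifts as the classification cutoff moves, while its AUC, which depends only on the predicted probabilities, stays fixed.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem (illustrative stand-in)
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]  # predicted P(y = 1)

# AUC is computed from the probability scores -- no threshold involved
print(f"AUC-ROC: {roc_auc_score(y_test, proba):.3f}")

# Accuracy depends on where we draw the line; AUC does not change
for thresh in [0.3, 0.5, 0.7]:
    acc = accuracy_score(y_test, proba >= thresh)
    print(f"Accuracy at threshold {thresh}: {acc:.3f}")
```

Note that `roc_auc_score` takes the probabilities, not the hard 0/1 predictions: passing thresholded predictions silently collapses the ROC curve to a single point.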
Hyperparameter Tuning
Most machine learning models have hyperparameters -- settings that must be chosen before training begins and that can dramatically affect performance. The number of trees in a random forest, the regularization penalty in LASSO, the depth of a decision tree, the learning rate of a neural network: none of these are learned from the data, and all of them matter.
The naive approach -- trying a few values by hand and picking whatever looks best on the test set -- is statistically dangerous. If you try 100 hyperparameter combinations and select the one with the best test-set performance, you have effectively used the test set for model selection, and your reported performance is optimistically biased. The solution is to use a validation set (separate from both training and test) or, better, to use cross-validation within the training set to evaluate hyperparameter choices, reserving the test set for final evaluation only.
Grid Search
Grid search evaluates every combination of a predefined set of hyperparameter values. If you specify 5 values for the regularization parameter and 4 values for the tree depth, grid search trains and evaluates 5 x 4 = 20 models. It is simple, exhaustive, and parallelizable, but becomes prohibitively expensive when the hyperparameter space is large.
Random Search
Random search samples hyperparameter combinations randomly from specified distributions. It is often more efficient than grid search because it explores the space more broadly. A classic result by Bergstra and Bengio (2012) showed that random search finds good hyperparameters in fewer evaluations than grid search, especially when some hyperparameters matter much more than others.
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
# Load and split data
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
housing.data, housing.target, test_size=0.2, random_state=42
)
# Define the hyperparameter grid
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [10, 20, None],
'min_samples_leaf': [1, 5, 10]
}
# Grid search with 5-fold CV (3 x 3 x 3 = 27 combinations)
grid_search = GridSearchCV(
RandomForestRegressor(random_state=42),
param_grid,
cv=5,
scoring='neg_mean_squared_error',
n_jobs=-1,
verbose=1
)
grid_search.fit(X_train, y_train)
# Results
print(f"\nBest hyperparameters:")
for param, val in grid_search.best_params_.items():
print(f" {param}: {val}")
print(f"\nBest CV RMSE: {(-grid_search.best_score_)**0.5:.4f}")
# Evaluate best model on held-out TEST set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
test_rmse = mean_squared_error(y_test, y_pred) ** 0.5  # sqrt of MSE = RMSE (works across sklearn versions)
print(f"Test RMSE (final, unbiased): {test_rmse:.4f}")
* Stata: Manual hyperparameter tuning via cross-validation loop
* Stata does not have a built-in GridSearchCV equivalent,
* so we loop over candidate specifications manually.
sysuse auto, clear
set seed 42
* Create 5 folds for cross-validation
gen fold = ceil(5 * runiform())
* Track best specification
local best_rmse = 99999
local best_spec = ""
* Loop over candidate model specifications
foreach spec in "mpg weight" "mpg weight length" "mpg weight length foreign" "mpg weight length foreign headroom trunk" {
local total_se = 0
local n_test = 0
forvalues k = 1/5 {
quietly reg price `spec' if fold != `k'
quietly predict double yhat_`k' if fold == `k'
quietly gen double se_`k' = (price - yhat_`k')^2 if fold == `k'
quietly summarize se_`k'
local total_se = `total_se' + r(sum)
local n_test = `n_test' + r(N)
drop yhat_`k' se_`k'
}
local cv_rmse = sqrt(`total_se' / `n_test')
display "Spec: `spec' => CV RMSE = " %9.2f `cv_rmse'
if `cv_rmse' < `best_rmse' {
local best_rmse = `cv_rmse'
local best_spec = "`spec'"
}
}
display _newline "Best specification: `best_spec'"
display "Best CV RMSE: " %9.2f `best_rmse'
# R: Hyperparameter tuning with tidymodels tune_grid()
library(tidymodels)
# Load data and split
data(mtcars)
set.seed(42)
split <- initial_split(mtcars, prop = 0.8)
train <- training(split)
test <- testing(split)
folds <- vfold_cv(train, v = 5)
# Define a random forest model with tunable hyperparameters
rf_spec <- rand_forest(
trees = tune(),
min_n = tune()
) |>
set_engine("ranger") |>
set_mode("regression")
# Create a workflow
rf_wf <- workflow() |> add_model(rf_spec) |> add_formula(mpg ~ .)
# Define a hyperparameter grid
rf_grid <- grid_regular(
trees(range = c(50, 300)),
min_n(range = c(2, 10)),
levels = 4
)
# Tune using 5-fold CV
rf_tuned <- tune_grid(
rf_wf,
resamples = folds,
grid = rf_grid,
metrics = metric_set(rmse, rsq)
)
# Show best results
cat("Top 5 configurations by RMSE:\n")
print(show_best(rf_tuned, metric = "rmse", n = 5))
# Finalize workflow with best hyperparameters
best_params <- select_best(rf_tuned, metric = "rmse")
final_wf <- finalize_workflow(rf_wf, best_params)
final_fit <- last_fit(final_wf, split)
cat("\nFinal test set metrics:\n")
print(collect_metrics(final_fit))
Fitting 5 folds for each of 27 candidates, totalling 135 fits

Best hyperparameters:
  max_depth: 20
  min_samples_leaf: 1
  n_estimators: 200

Best CV RMSE: 0.5027
Test RMSE (final, unbiased): 0.4986
Spec: mpg weight => CV RMSE =   2648.93
Spec: mpg weight length => CV RMSE =   2541.37
Spec: mpg weight length foreign => CV RMSE =   2486.12
Spec: mpg weight length foreign headroom trunk => CV RMSE =   2592.81

Best specification: mpg weight length foreign
Best CV RMSE:   2486.12
Top 5 configurations by RMSE:
# A tibble: 5 x 8
  trees min_n .metric .estimator  mean     n std_err .config
  <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>
1   216     2 rmse    standard    2.68     5   0.412 Preprocessor1_Model06
2   300     2 rmse    standard    2.71     5   0.398 Preprocessor1_Model08
3   133     2 rmse    standard    2.73     5   0.425 Preprocessor1_Model04
4   216     5 rmse    standard    2.79     5   0.381 Preprocessor1_Model07
5   300     5 rmse    standard    2.82     5   0.393 Preprocessor1_Model09

Final test set metrics:
# A tibble: 2 x 4
  .metric .estimator .estimate .config
  <chr>   <chr>          <dbl> <chr>
1 rmse    standard       2.54  Preprocessor1_Model1
2 rsq     standard       0.867 Preprocessor1_Model1
Overfitting Diagnostics
How can you tell whether a model is underfitting, overfitting, or well-calibrated? The most informative diagnostic tools are learning curves and validation curves. These plots show you not just the final performance number, but how performance changes as you vary the training set size or a hyperparameter. They make the bias-variance trade-off visible.
What are learning curves for?
Learning curves answer a critical practical question: should I collect more data, or do I need a different model? If your model is underfitting (both training and validation error are high), more data will not help -- you need a more flexible model. If your model is overfitting (training error is low but validation error is high), more data will help, or you can add regularization. Learning curves make this diagnosis visual and concrete.
Learning Curves
A learning curve plots both training-set and validation-set performance as the size of the training set increases. The patterns are diagnostic:
- Underfitting: Both training and validation scores are low and converge quickly. Adding more data will not help -- the model is too simple. You need a more flexible model.
- Overfitting: Training score is high, but validation score is significantly lower. The gap between the two curves is large. Adding more data or regularization can help.
- Good fit: Training score is high, validation score is close to it, and both stabilize as the training set grows.
Learning Curve Patterns
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import learning_curve
# Load data
housing = fetch_california_housing()
X, y = housing.data, housing.target
# Compute learning curve
train_sizes, train_scores, val_scores = learning_curve(
RandomForestRegressor(n_estimators=100, random_state=42),
X, y,
train_sizes=np.linspace(0.1, 1.0, 10),
cv=5,
scoring='neg_mean_squared_error',
n_jobs=-1
)
# Convert to RMSE (scores are negative MSE)
train_rmse = np.sqrt(-train_scores)
val_rmse = np.sqrt(-val_scores)
# Print numerical summary
print("Learning Curve Summary")
print("=" * 55)
print(f"{'Train Size':>12} {'Train RMSE':>12} {'Val RMSE':>12}")
print("-" * 55)
for i in range(len(train_sizes)):
print(f"{train_sizes[i]:>12,} {train_rmse.mean(axis=1)[i]:>12.4f} {val_rmse.mean(axis=1)[i]:>12.4f}")
# Optional: plot with matplotlib
# import matplotlib.pyplot as plt
# plt.plot(train_sizes, train_rmse.mean(axis=1), 'o-', label='Training RMSE')
# plt.plot(train_sizes, val_rmse.mean(axis=1), 'o-', label='Validation RMSE')
# plt.xlabel('Training Set Size'); plt.ylabel('RMSE')
# plt.legend(); plt.show()
* Stata: Manual learning curve computation
* Train on increasing fractions of data, evaluate on held-out test set
sysuse auto, clear
set seed 42
* Reserve 20% as permanent test set
gen rand = runiform()
gen byte is_test = (rand > 0.8)
* Count training observations
quietly count if is_test == 0
local n_train = r(N)
* Generate a random ordering for training observations
gen train_order = runiform() if is_test == 0
sort is_test train_order
* Loop over increasing training set fractions
display "Learning Curve: RMSE by Training Set Size"
display "==========================================="
display "{txt} Train Size Train RMSE Test RMSE"
display "-------------------------------------------"
foreach frac in 0.2 0.4 0.6 0.8 1.0 {
local n_use = floor(`frac' * `n_train')
* Mark which training obs to use
gen byte use_train = (_n <= `n_use') & (is_test == 0)
* Fit model on subset
quietly reg price mpg weight length if use_train == 1
quietly predict double yhat
* Train RMSE
quietly gen double se_train = (price - yhat)^2 if use_train == 1
quietly summarize se_train
local rmse_train = sqrt(r(mean))
* Test RMSE
quietly gen double se_test = (price - yhat)^2 if is_test == 1
quietly summarize se_test
local rmse_test = sqrt(r(mean))
display %12.0f `n_use' %12.2f `rmse_train' %12.2f `rmse_test'
drop use_train yhat se_train se_test
}
# R: Custom learning curve function
library(randomForest)
library(rsample)
# Load data and split
data(mtcars)
set.seed(42)
split <- initial_split(mtcars, prop = 0.8)
train <- training(split)
test <- testing(split)
# Define training set fractions
fractions <- seq(0.3, 1.0, by = 0.1)
results <- data.frame(frac = fractions,
n = integer(length(fractions)),
train_rmse = numeric(length(fractions)),
test_rmse = numeric(length(fractions)))
for (i in seq_along(fractions)) {
n_use <- floor(fractions[i] * nrow(train))
idx <- sample(nrow(train), n_use)
train_sub <- train[idx, ]
# Train model on subset
rf <- randomForest(mpg ~ ., data = train_sub, ntree = 100)
# Train RMSE
train_pred <- predict(rf, train_sub)
results$train_rmse[i] <- sqrt(mean((train_sub$mpg - train_pred)^2))
# Test RMSE
test_pred <- predict(rf, test)
results$test_rmse[i] <- sqrt(mean((test$mpg - test_pred)^2))
results$n[i] <- n_use
}
# Print results
cat("Learning Curve Summary\n")
cat("==========================================\n")
cat(sprintf("%12s %12s %12s\n", "Train Size", "Train RMSE", "Test RMSE"))
cat("------------------------------------------\n")
for (i in 1:nrow(results)) {
cat(sprintf("%12d %12.4f %12.4f\n",
results$n[i], results$train_rmse[i], results$test_rmse[i]))
}
Learning Curve Summary
=======================================================
Train Size Train RMSE Val RMSE
-------------------------------------------------------
1,651 0.1834 0.5823
3,302 0.1921 0.5547
4,953 0.1956 0.5388
6,604 0.1963 0.5285
8,256 0.1976 0.5218
9,907 0.1987 0.5166
11,558 0.1992 0.5121
13,209 0.1998 0.5092
14,860 0.2001 0.5068
      16,512       0.2005       0.5047

Learning Curve: RMSE by Training Set Size
===========================================
Train Size Train RMSE Test RMSE
-------------------------------------------
12 1847.35 3142.68
23 2012.48 2784.51
35 2098.63 2621.34
47 2134.17 2512.89
          59      2156.42      2461.38

Learning Curve Summary
==========================================
Train Size Train RMSE Test RMSE
------------------------------------------
7 1.2341 4.8712
10 1.3028 3.9456
12 1.3894 3.5218
15 1.4216 3.2847
17 1.4587 3.1025
20 1.4812 2.9483
22 1.5034 2.8156
          25       1.5198       2.7134

Data Leakage
The Test-Answers Analogy
Data leakage is like accidentally seeing the test answers before the exam -- your score looks great, but you have not actually learned anything. When you deploy the model in the real world (where the "answers" are not available), performance collapses. Leakage is the #1 mistake that invalidates ML results, and it can be extremely subtle.
Data leakage occurs when information from the test set (or from the future, or from the outcome itself) sneaks into the training process. It is the single most common and most dangerous mistake in applied machine learning. Models with leakage produce spectacular evaluation metrics that evaporate when the model meets real data.
Common Sources of Leakage
1. Preprocessing on Full Data
If you normalize features (subtract mean, divide by standard deviation) using statistics computed on the entire dataset -- including test observations -- the model has indirectly "seen" test data. The fix: compute all preprocessing statistics (mean, std, min, max) on the training set only, then apply those same transformations to the test set. In scikit-learn, this means calling scaler.fit(X_train) and then scaler.transform(X_test) -- never scaler.fit(X) on the full dataset.
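A minimal sketch of the fix, on synthetic data: the scaler's statistics come from the training set only, and wrapping the scaler and the model in a Pipeline goes one step further by refitting the scaler inside every cross-validation fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic classification data (illustrative only)
X, y = make_classification(n_samples=300, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Correct: fit the scaler on the training data only...
scaler = StandardScaler().fit(X_train)   # statistics from train set only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)      # ...then apply the SAME transform to test

# Even safer: a Pipeline refits the scaler within each CV fold, so
# preprocessing never sees the observations used for validation
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X_train, y_train, cv=5)
print(f"Leak-free CV accuracy: {scores.mean():.3f}")
```

The Pipeline approach is the one to prefer whenever preprocessing is combined with cross-validation: manually scaling before calling `cross_val_score` reintroduces exactly the leakage this section warns about, just at the fold level.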
2. Future Information in Features
If you use variables that would not be available at prediction time, you have leakage. For example, using next quarter's GDP to predict this quarter's unemployment, or using a "days until bankruptcy" variable to predict whether a firm will go bankrupt. Any feature that is a consequence of the outcome rather than a cause or predictor is a leakage risk. Always ask: "Would I have this variable available at the time I need to make the prediction?"
3. Duplicates Across Train/Test
If the same observation appears in both the training and test set (due to duplicated rows, overlapping time windows, or group-level correlation), the model can memorize it during training and "predict" it perfectly during testing. This is especially common with panel data where the same individual appears in multiple rows -- if you split randomly, you may train on person X in 2019 and test on person X in 2020, which is not a true out-of-sample test.
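For panel data, the standard remedy is to split by group rather than by row. The sketch below uses a toy panel (hypothetical person IDs, two years each) with scikit-learn's GroupKFold, which guarantees that all rows for a given person land in the same fold.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy panel: 6 individuals, each observed in two years (12 rows total)
person = np.repeat(np.arange(6), 2)   # person id = the grouping variable
X = np.arange(12).reshape(-1, 1)
y = np.arange(12)

# GroupKFold keeps every row for a person inside one fold, so we never
# train on person i in 2019 and "test" on person i in 2020
gkf = GroupKFold(n_splits=3)
for fold, (train_idx, test_idx) in enumerate(gkf.split(X, y, groups=person)):
    train_people = set(person[train_idx])
    test_people = set(person[test_idx])
    assert train_people.isdisjoint(test_people)  # no person straddles the split
    print(f"Fold {fold}: test people = {sorted(test_people)}")
```

The same idea applies to any clustered data (firms, schools, regions): pass the cluster identifier as `groups`, and the out-of-sample claim becomes a genuine out-of-cluster claim.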
How to Detect Leakage
The clearest warning sign is suspiciously good performance. If your model achieves an AUC of 0.99 on a messy social science prediction task, something is probably wrong. Other red flags:
- Test performance is much higher than you expected based on the literature or the difficulty of the problem
- Performance drops dramatically when you switch to a truly temporal split or a new dataset
- The most important features (from feature importance) are variables that seem to "encode" the outcome rather than predict it
- Adding more data does not change performance at all -- the model has already memorized the relationship
Model Selection and Comparison
In practice, model evaluation is not about computing a single metric for a single model. It is a systematic process of comparing multiple candidate models to select the best one for your task. A reliable model comparison workflow follows these steps:
- Split your data into training and test sets (or use nested cross-validation for small datasets). The test set is locked away until the very end.
- Define candidate models: OLS, LASSO, random forest, gradient boosting, neural network, etc. Include simple baselines (e.g., "always predict the mean") to ensure your model adds value.
- Tune hyperparameters for each candidate using cross-validation on the training set only. Each model gets its best hyperparameters.
- Compare cross-validated performance across models. Use the same folds for all models to ensure a fair comparison.
- Select the best model and evaluate it once on the test set. This is your final, unbiased performance estimate. Do not go back and try more models after seeing the test-set result.
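The fold-sharing and baseline steps above can be sketched in a few lines. The dataset and the candidate list below are illustrative stand-ins; the key detail is that a single KFold object (fixed seed) is passed to every model, so all candidates are scored on identical folds.

```python
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (illustrative only)
X, y = make_regression(n_samples=400, n_features=10, noise=15.0, random_state=42)

# One shared KFold object: every candidate sees the exact same folds
folds = KFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    'mean baseline': DummyRegressor(strategy='mean'),  # sanity-check baseline
    'OLS': LinearRegression(),
    'LASSO': Lasso(alpha=1.0),
    'random forest': RandomForestRegressor(n_estimators=100, random_state=42),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=folds,
                             scoring='neg_root_mean_squared_error')
    print(f"{name:>14}: CV RMSE = {-scores.mean():.2f}")
```

If a candidate cannot beat the mean-prediction baseline by a clear margin, that is worth knowing before any hyperparameter tuning; and whichever model wins here still gets exactly one final evaluation on the locked-away test set.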
Prediction Task Evaluation
- Goal: maximize out-of-sample prediction accuracy
- Metrics: RMSE, MAE, AUC-ROC, F1
- Method: cross-validation + held-out test set
- Flexibility is a virtue if it improves predictions
- Examples: forecasting, targeting, classification
Causal Task Evaluation
- Goal: estimate a causal effect (ATE, CATE, ATT)
- Metrics: bias, coverage of confidence intervals, MSE of the treatment effect estimate
- Method: simulation studies, placebo tests, sensitivity analysis
- Interpretability and identification matter more than raw fit
- Examples: policy evaluation, treatment effects, DiD
The distinction above is fundamental. The entire model evaluation framework in this module -- cross-validation, RMSE, AUC-ROC, hyperparameter tuning -- is designed for prediction tasks. When the goal is causal inference, the evaluation criteria are different: you care about whether the estimated treatment effect is unbiased and whether the confidence intervals have correct coverage, not about whether the model predicts the outcome well. A model can be an excellent predictor but a terrible estimator of causal effects, and vice versa. See Module 11D (Causal ML) for evaluation strategies specific to causal tasks.
References
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). Springer. (Available free at hastie.su.domains/ElemStatLearn)
- James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning. Springer. (Available free at statlearning.com)
- Athey, S., & Imbens, G. W. (2019). Machine Learning Methods That Economists Should Know About. Annual Review of Economics, 11, 685-725.
- Bergstra, J., & Bengio, Y. (2012). Random Search for Hyper-Parameter Optimization. Journal of Machine Learning Research, 13, 281-305.
- Mullainathan, S., & Spiess, J. (2017). Machine Learning: An Applied Econometric Approach. Journal of Economic Perspectives, 31(2), 87-106.