11E  Model Evaluation

~2 hours · Validation Strategies, Regression Metrics, Classification Metrics, Model Selection

Learning Objectives

  • Understand why evaluating on training data is misleading and how overfitting distorts performance estimates
  • Implement train-test splits and K-fold cross-validation in Python, Stata, and R
  • Compute and interpret regression metrics: RMSE, MAE, R-squared, and MAPE
  • Compute and interpret classification metrics: accuracy, precision, recall, F1, and AUC-ROC
  • Use proper evaluation strategies to compare and select among competing models

Why Evaluation Matters

Every model you have encountered in this course — from OLS regression to LASSO, random forests, and neural networks — needs to be evaluated. But evaluated on what? The answer to this question is one of the most important ideas in all of machine learning, and it separates rigorous empirical work from misleading results.

The central problem is overfitting. A sufficiently flexible model — a deep neural network, a random forest with many trees, or even an OLS regression with hundreds of interaction terms — can always achieve excellent performance on the data it was trained on. It does this by memorizing the noise, the outliers, and the quirks of that specific dataset. The model's apparent "accuracy" is an illusion: it reflects the model's ability to recall the training data, not its ability to predict new, unseen observations.

Consider a simple example. Suppose you want to predict GDP growth using 20 macroeconomic indicators, and your dataset contains 50 country-year observations. If you fit a polynomial regression with enough interaction terms, you can achieve an R-squared of 1.0 on the training data — a "perfect" fit. But this model has essentially memorized your 50 data points. Give it data from a new year or a new country, and its predictions will be wildly inaccurate. The training-set R-squared of 1.0 told you nothing useful about the model's real predictive ability.
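This failure mode is easy to reproduce in miniature. The sketch below uses synthetic data (a noisy linear relationship, not real macro indicators): a degree-14 polynomial fitted to 15 training points matches them almost exactly, yet performs far worse on fresh draws from the same process than a simple line does.

```python
import numpy as np

rng = np.random.default_rng(0)

def r_squared(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1 - ss_res / ss_tot

# The "true" process is linear: y = 2x + noise
x_train = np.linspace(0, 1, 15)
y_train = 2 * x_train + rng.normal(0, 0.3, size=15)
x_test = np.linspace(0.02, 0.98, 15)          # new draws from the same process
y_test = 2 * x_test + rng.normal(0, 0.3, size=15)

for degree in (1, 14):
    # Polynomial.fit rescales x internally, which keeps the fit well-conditioned
    poly = np.polynomial.Polynomial.fit(x_train, y_train, deg=degree)
    print(f"degree {degree:2d}: "
          f"train R2 = {r_squared(y_train, poly(x_train)):7.3f}, "
          f"test R2 = {r_squared(y_test, poly(x_test)):7.3f}")
```

The degree-14 polynomial passes through every training point (train R-squared near 1.0) while its test R-squared collapses; the honest degree-1 fit wins out of sample.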

The Cardinal Rule of Model Evaluation

Never evaluate a model on the same data used to train it. Training-set performance is always optimistically biased. The more flexible the model, the larger the gap between training performance and true out-of-sample performance. This is not a minor technicality — it is the difference between a model that works and one that only appears to work.

The solution is conceptually simple: set aside data that the model has never seen during training, and evaluate the model's predictions on that held-out data. The specific strategies for doing this — train-test splits and cross-validation — are the subject of the next two sections. But the underlying principle is always the same: the only honest measure of a model's quality is its performance on data it has not been trained on.

This principle matters just as much in economics as in computer science. If you are building a model to predict firm bankruptcy, forecast inflation, or identify households eligible for a targeted policy, you need to know how well it will perform when deployed on new data. Reporting only training-set performance is analogous to judging a student by giving them the exact same exam they used to study — it tells you about memorization, not understanding.

Train-Test Split

The simplest evaluation strategy is to split your data into two parts: a training set used to fit the model, and a test set (sometimes called a "hold-out set") used to evaluate it. The model never sees the test data during training — it is kept completely separate, like sealing an exam in an envelope until the student has finished studying.

A typical split is 80% training and 20% test, though the exact ratio depends on the size of your dataset. With very large datasets (hundreds of thousands of observations), even a 90/10 split leaves plenty of test data. With small datasets (a few hundred observations), you may need cross-validation instead (covered in the next section), because holding out 20% leaves too few observations for both reliable training and reliable evaluation.

Two important implementation details. First, the split should be random — if your data is sorted (by time, by country, by outcome), a non-random split would systematically exclude certain types of observations from training or testing, leading to biased results. Second, you should set a random seed so that your results are reproducible. Anyone running your code should get the same split and therefore the same results.

When Random Splits Are Wrong

If your data has a time dimension (panel data, time series), a random split can leak future information into the training set. For example, if you randomly split yearly GDP data, the model might train on 2020 data and predict 2018 — it has seen the future. In these cases, use a temporal split: train on earlier periods and test on later ones. This mimics the real prediction task, where you only have past data available.
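A temporal split needs no special machinery: sort by time and cut at a cutoff period. A minimal sketch with a made-up country-year panel (all column names and values are illustrative):

```python
import pandas as pd

# Made-up country-year panel of GDP growth observations
df = pd.DataFrame({
    "country":    ["A", "A", "A", "A", "B", "B", "B", "B"],
    "year":       [2017, 2018, 2019, 2020, 2017, 2018, 2019, 2020],
    "gdp_growth": [2.1, 1.8, 2.4, -3.0, 3.2, 2.9, 3.1, -4.1],
})

# Train on years up to the cutoff, test strictly after it:
# the model never sees any observation from the test years
cutoff = 2019
train = df[df["year"] <= cutoff]
test = df[df["year"] > cutoff]

print("Train years:", sorted(train["year"].unique()))
print("Test years: ", sorted(test["year"].unique()))
```

Unlike a random split, every test observation here lies strictly in the future of every training observation, matching the real forecasting task.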

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing

# Load California Housing data
housing = fetch_california_housing(as_frame=True)
X = housing.data       # 20,640 observations, 8 features
y = housing.target     # Median house value (in $100k)

# Split: 80% training, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42
)

print(f"Full dataset:    {X.shape[0]:,} observations")
print(f"Training set:    {X_train.shape[0]:,} observations ({X_train.shape[0]/X.shape[0]:.0%})")
print(f"Test set:        {X_test.shape[0]:,} observations ({X_test.shape[0]/X.shape[0]:.0%})")
print(f"\nFeatures:        {X.shape[1]}")
print(f"Target mean:     {y.mean():.3f} (train: {y_train.mean():.3f}, test: {y_test.mean():.3f})")
* Stata: Train-test split using random number generation
* Load example data
sysuse auto, clear

* Set seed for reproducibility
set seed 42

* Generate a uniform random variable and split at 80th percentile
gen rand = runiform()
gen byte split = cond(rand <= 0.8, 1, 0)
label define splitlbl 1 "Train" 0 "Test"
label values split splitlbl

* Verify the split
tab split

* Summarize the outcome in each split
tabstat price, by(split) stat(n mean sd min max) format(%9.1f)

* Fit model on training data only
reg price mpg weight length if split == 1

* Generate predictions for ALL observations (train + test)
predict yhat

* Evaluate on TEST set only
gen resid_sq = (price - yhat)^2 if split == 0
quietly summarize resid_sq
display "Test RMSE: " sqrt(r(mean))
# R: Train-test split using rsample (tidymodels framework)
library(rsample)

# Load built-in mtcars data (32 observations, 10 features)
data(mtcars)

# Create an 80/20 split (stratified splits are also possible)
set.seed(42)
split_obj <- initial_split(mtcars, prop = 0.8)

# Extract training and test sets
train_data <- training(split_obj)
test_data  <- testing(split_obj)

cat("Full dataset:   ", nrow(mtcars), "observations\n")
cat("Training set:   ", nrow(train_data), "observations",
    sprintf("(%0.0f%%)\n", 100 * nrow(train_data) / nrow(mtcars)))
cat("Test set:       ", nrow(test_data), "observations",
    sprintf("(%0.0f%%)\n", 100 * nrow(test_data) / nrow(mtcars)))
cat("\nTarget (mpg) mean — train:", round(mean(train_data$mpg), 2),
    "  test:", round(mean(test_data$mpg), 2), "\n")
Python Output
Full dataset:    20,640 observations
Training set:    16,512 observations (80%)
Test set:        4,128 observations (20%)

Features:        8
Target mean:     2.069 (train: 2.072, test: 2.055)
Stata Output
     split |      Freq.     Percent        Cum.
-----------+-----------------------------------
     Train |         59       79.73       79.73
      Test |         15       20.27      100.00
-----------+-----------------------------------
     Total |         74      100.00

      split |         N      Mean       S.D.       Min       Max
  ----------+-----------------------------------------------------
      Train |        59    6165.3    2949.5     3291    15906
       Test |        15    6229.9    3217.1     3748    14500
  ----------+-----------------------------------------------------
      Total |        74    6178.4    2993.0     3291    15906

Test RMSE: 2461.38
R Output
Full dataset:    32 observations
Training set:    25 observations (78%)
Test set:        7 observations (22%)

Target (mpg) mean — train: 19.96   test: 20.93

K-Fold Cross-Validation

A single train-test split has an obvious limitation: the evaluation depends on which observations happened to land in the test set. With a small dataset, a different random split could give substantially different results. You might get lucky (easy test observations) or unlucky (hard ones), and your performance estimate will be noisy.

K-fold cross-validation solves this by using every observation for both training and testing, just not at the same time. The procedure is straightforward: divide the data into K equally-sized subsets (called "folds"). Then, for each fold in turn, train the model on the other K-1 folds and evaluate it on the held-out fold. After K rounds, every observation has been in the test set exactly once. The final performance metric is the average across all K folds.

The most common choice is K = 5 or K = 10. With 5-fold CV, each fold contains 20% of the data, and the model is trained 5 times on 80% of the data. This gives a more reliable performance estimate than a single split, because it averages over 5 different train-test configurations. The cost is computational: you must train the model K times instead of once.
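The mechanics can be made explicit with scikit-learn's KFold iterator. The sketch below uses synthetic data so it runs quickly; each round fits on four folds and scores on the held-out fifth, and the final estimate is the average:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic data: a known linear signal plus noise
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.5, size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for i, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    score = model.score(X[test_idx], y[test_idx])  # R-squared on the held-out fold
    scores.append(score)
    print(f"Round {i}: train on {len(train_idx)} obs, "
          f"test on {len(test_idx)} obs, R2 = {score:.3f}")

print(f"Mean CV R-squared: {np.mean(scores):.3f}")
```

This manual loop is exactly what `cross_val_score` (used later in this section) does internally.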

5-Fold Cross-Validation

Round   Fold 1   Fold 2   Fold 3   Fold 4   Fold 5   Score
  1     Test     Train    Train    Train    Train    0.831
  2     Train    Test     Train    Train    Train    0.847
  3     Train    Train    Test     Train    Train    0.819
  4     Train    Train    Train    Test     Train    0.853
  5     Train    Train    Train    Train    Test     0.838

Average R-squared across all 5 folds: 0.838

Each observation appears in the test set exactly once. The final score is the average across all folds.

An important variant is stratified K-fold cross-validation, which ensures that each fold has roughly the same distribution of the target variable. This is particularly important for classification problems with imbalanced classes — if only 5% of firms in your dataset go bankrupt, a non-stratified split might create a fold with no bankrupt firms at all, making evaluation meaningless.
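The point about stratification can be verified directly. In the sketch below (synthetic data with 5% positives, standing in for rare events like bankruptcies), StratifiedKFold places exactly two positive cases in every fold, whereas a plain random split could leave some folds with none:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced labels: 10 positives out of 200 observations (5%)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.zeros(200, dtype=int)
y[:10] = 1
rng.shuffle(y)

# Stratified folds preserve the 5% positive rate in every fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for i, (_, test_idx) in enumerate(skf.split(X, y), start=1):
    print(f"Fold {i}: {y[test_idx].sum()} positives "
          f"out of {len(test_idx)} observations")
```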

import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Load data
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Define two competing models
ols   = LinearRegression()
rf    = RandomForestRegressor(n_estimators=100, random_state=42)

# 5-fold cross-validation for both models
# scoring="r2" computes R-squared; scoring="neg_mean_squared_error" gives -MSE
ols_scores = cross_val_score(ols, X, y, cv=5, scoring="r2")
rf_scores  = cross_val_score(rf,  X, y, cv=5, scoring="r2")

print("5-Fold Cross-Validation: R-squared")
print("=" * 45)
print(f"OLS Regression:")
print(f"  Per-fold scores: {ols_scores.round(3)}")
print(f"  Mean: {ols_scores.mean():.3f}  (Std: {ols_scores.std():.3f})")
print(f"\nRandom Forest:")
print(f"  Per-fold scores: {rf_scores.round(3)}")
print(f"  Mean: {rf_scores.mean():.3f}  (Std: {rf_scores.std():.3f})")
* Stata: K-fold cross-validation using the crossfold command
* Install if needed: ssc install crossfold

sysuse auto, clear
set seed 42

* crossfold performs k-fold CV for regression models
* It reports the out-of-sample RMSE for each fold
crossfold reg price mpg weight length, k(5)

* The output shows RMSE for each fold and the overall average
* You can also do this manually for more control:

* Manual 5-fold CV
gen fold = ceil(5 * runiform())

forvalues k = 1/5 {
    quietly reg price mpg weight length if fold != `k'
    quietly predict double yhat_`k' if fold == `k'
    quietly gen double se_`k' = (price - yhat_`k')^2 if fold == `k'
    quietly summarize se_`k'
    display "Fold `k': RMSE = " sqrt(r(mean))
}

* Compute overall CV RMSE
* Use a wildcard, not se_1-se_5: a hyphenated varlist is positional and
* would also pick up the yhat_* variables created between the se_* ones
egen all_se = rowtotal(se_*), missing
quietly summarize all_se
display "Overall CV RMSE: " sqrt(r(mean))
# R: K-fold cross-validation using tidymodels
library(rsample)
library(parsnip)
library(yardstick)
library(tune)
library(workflows)

# Load data
data(mtcars)

# Create 5-fold cross-validation object
set.seed(42)
folds <- vfold_cv(mtcars, v = 5)

# Define two models
lm_spec <- linear_reg() |> set_engine("lm")
rf_spec <- rand_forest(trees = 100) |>
  set_engine("ranger") |>
  set_mode("regression")

# Create workflows (model + formula)
lm_wf <- workflow() |> add_model(lm_spec) |> add_formula(mpg ~ .)
rf_wf <- workflow() |> add_model(rf_spec) |> add_formula(mpg ~ .)

# Fit both models using 5-fold CV
lm_results <- fit_resamples(lm_wf, resamples = folds, metrics = metric_set(rsq, rmse))
rf_results <- fit_resamples(rf_wf, resamples = folds, metrics = metric_set(rsq, rmse))

# Collect and display average metrics
cat("=== OLS Regression ===\n")
print(collect_metrics(lm_results))
cat("\n=== Random Forest ===\n")
print(collect_metrics(rf_results))
Python Output
5-Fold Cross-Validation: R-squared
=============================================
OLS Regression:
  Per-fold scores: [0.596 0.609 0.593 0.607 0.592]
  Mean: 0.599  (Std: 0.007)

Random Forest:
  Per-fold scores: [0.805 0.826 0.812 0.817 0.809]
  Mean: 0.814  (Std: 0.007)
Stata Output
Fold 1:  Root Mean Squared Error =  2372.019
Fold 2:  Root Mean Squared Error =  2781.445
Fold 3:  Root Mean Squared Error =  2196.584
Fold 4:  Root Mean Squared Error =  2534.717
Fold 5:  Root Mean Squared Error =  2688.103

Fold 1: RMSE = 2410.52
Fold 2: RMSE = 2693.18
Fold 3: RMSE = 2247.91
Fold 4: RMSE = 2581.36
Fold 5: RMSE = 2732.40
Overall CV RMSE: 2541.37
R Output
=== OLS Regression ===
# A tibble: 2 x 6
  .metric .estimator  mean     n std_err .config
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>
1 rmse    standard   3.46      5  0.485  Preprocessor1_Model1
2 rsq     standard   0.773     5  0.0631 Preprocessor1_Model1

=== Random Forest ===
# A tibble: 2 x 6
  .metric .estimator  mean     n std_err .config
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>
1 rmse    standard   2.87      5  0.398  Preprocessor1_Model1
2 rsq     standard   0.843     5  0.0412 Preprocessor1_Model1

Regression Metrics

Once you have a held-out test set (or cross-validation folds), you need a metric to quantify how well the model's predictions match the actual values. For regression problems — where the outcome is a continuous variable like income, GDP, or house prices — the most commonly used metrics are RMSE, MAE, R-squared, and MAPE. Each measures a different aspect of prediction quality, and the right choice depends on your application.

Root Mean Squared Error (RMSE)

RMSE is the most widely used regression metric. It computes the square root of the average squared prediction error:

RMSE = sqrt( (1/n) * sum( (y_i - y_hat_i)^2 ) )

RMSE is in the same units as the outcome variable — if you are predicting house prices in dollars, the RMSE is also in dollars. This makes it directly interpretable: an RMSE of $25,000 means the model's predictions are typically about $25,000 away from the true values. Because it squares the errors before averaging, RMSE penalizes large errors disproportionately. A single prediction that is off by $100,000 contributes more to the RMSE than ten predictions that are each off by $10,000. This is often desirable — in many applications, big mistakes are much worse than small ones.

Mean Absolute Error (MAE)

MAE takes a simpler approach — it averages the absolute prediction errors without squaring:

MAE = (1/n) * sum( |y_i - y_hat_i| )

MAE is also in the same units as the outcome and is easier to interpret: it is literally the average size of the prediction errors. Unlike RMSE, MAE treats all errors equally — a $100,000 error contributes exactly 10 times as much as a $10,000 error. MAE is more robust to outliers than RMSE. If your data has a few extreme values that are hard to predict, RMSE will be dominated by those outliers, while MAE gives you a better sense of typical prediction quality.
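The difference in outlier sensitivity is easy to see numerically. With made-up errors measured in $1,000s, ten $10k errors and one $100k error produce the same MAE but very different RMSEs:

```python
import numpy as np

# Made-up prediction errors, in $1,000s
errors_small   = np.array([10.0] * 10)          # ten errors of $10k each
errors_outlier = np.array([100.0] + [0.0] * 9)  # one $100k error, nine perfect

for name, e in [("ten $10k errors", errors_small),
                ("one $100k error", errors_outlier)]:
    rmse = np.sqrt(np.mean(e ** 2))
    mae = np.mean(np.abs(e))
    print(f"{name}: RMSE = {rmse:5.1f}, MAE = {mae:5.1f}")
```

MAE is 10.0 in both cases, but the single large error roughly triples the RMSE (from 10.0 to about 31.6), which is exactly the squaring effect described above.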

R-squared (Coefficient of Determination)

R-squared measures the proportion of variance in the outcome that is explained by the model's predictions:

R-squared = 1 - ( sum( (y_i - y_hat_i)^2 ) / sum( (y_i - y_bar)^2 ) )

R-squared is unitless and ranges from negative infinity to 1. An R-squared of 1.0 means perfect prediction. An R-squared of 0.0 means the model does no better than simply predicting the mean of y for every observation. An R-squared below zero means the model is worse than predicting the mean — this can happen with out-of-sample evaluation when the model has overfit badly. The key advantage of R-squared is that it is scale-free — you can compare R-squared values across different outcome variables (e.g., predicting income vs. predicting GDP growth) in a way that RMSE does not allow.
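A quick numeric check of these boundary cases: predicting the test-set mean for every observation scores exactly zero, and a model whose squared errors exceed the variance of y scores negative.

```python
from sklearn.metrics import r2_score

y_test = [1.0, 2.0, 3.0, 4.0]

# Predicting the test-set mean (2.5) for every observation scores exactly 0
print(r2_score(y_test, [2.5, 2.5, 2.5, 2.5]))  # 0.0

# Predictions worse than the mean benchmark score below zero
print(r2_score(y_test, [4.0, 3.0, 2.0, 1.0]))  # -3.0
```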

Mean Absolute Percentage Error (MAPE)

MAPE expresses errors as a percentage of the actual values:

MAPE = (100/n) * sum( |y_i - y_hat_i| / |y_i| )

MAPE is intuitive — a MAPE of 8% means predictions are off by about 8% on average. However, MAPE has a serious limitation: it is undefined when y_i = 0 and becomes extremely large when y_i is close to zero. For this reason, MAPE works best when the outcome variable is strictly positive and does not have values near zero (e.g., GDP, house prices, firm revenue).
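A tiny example of the near-zero pitfall (made-up numbers): a prediction that is off by only 0.5 in absolute terms contributes a 100% error when the actual value is 0.5, dragging the MAPE far above what the other observations suggest.

```python
import numpy as np

y_true = np.array([100.0, 100.0, 0.5])  # one actual value near zero
y_pred = np.array([110.0, 90.0, 1.0])   # all errors modest in absolute terms

ape = np.abs((y_true - y_pred) / y_true) * 100
print("Per-observation APE (%):", ape.round(1))  # [ 10.  10. 100.]
print("MAPE (%):", ape.mean().round(1))          # 40.0
```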

When to Use RMSE

  • Large errors are disproportionately costly
  • You want a metric in the same units as the outcome
  • Standard default for most ML applications

When to Use MAE

  • All error sizes are equally important
  • Data contains outliers or extreme values
  • You want a robust, easily interpretable metric
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score, mean_absolute_percentage_error

# Load data and split
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    housing.data, housing.target, test_size=0.2, random_state=42
)

# Train a random forest
rf = RandomForestRegressor(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)

# Generate predictions on the TEST set
y_pred = rf.predict(X_test)

# Compute all regression metrics
rmse  = np.sqrt(mean_squared_error(y_test, y_pred))  # works across sklearn versions (squared= is deprecated)
mae   = mean_absolute_error(y_test, y_pred)
r2    = r2_score(y_test, y_pred)
mape  = mean_absolute_percentage_error(y_test, y_pred)

print("Regression Metrics (Test Set)")
print("=" * 40)
print(f"RMSE:       {rmse:.4f}")
print(f"MAE:        {mae:.4f}")
print(f"R-squared:  {r2:.4f}")
print(f"MAPE:       {mape:.2%}")
print(f"\nInterpretation:")
print(f"  Predictions are off by ~${mae * 100000:,.0f} on average (MAE)")
print(f"  The model explains {r2:.1%} of variance in house values")
* Stata: Computing regression metrics manually on the test set
sysuse auto, clear
set seed 42

* Create train/test split
gen rand = runiform()
gen byte is_test = (rand > 0.8)

* Fit model on training data
quietly reg price mpg weight length if is_test == 0
predict double yhat

* Compute metrics on TEST set only
gen double error    = price - yhat         if is_test == 1
gen double error_sq = error^2               if is_test == 1
gen double error_ab = abs(error)            if is_test == 1
gen double pct_err  = abs(error / price)    if is_test == 1

* RMSE
quietly summarize error_sq
scalar rmse = sqrt(r(mean))

* MAE
quietly summarize error_ab
scalar mae = r(mean)

* R-squared (out-of-sample)
quietly summarize price if is_test == 1
scalar ybar = r(mean)
gen double ss_tot = (price - ybar)^2 if is_test == 1
quietly summarize ss_tot
scalar ss_total = r(sum)
quietly summarize error_sq
scalar ss_res = r(sum)
scalar r2_oos = 1 - ss_res / ss_total

* MAPE
quietly summarize pct_err
scalar mape = r(mean) * 100

* Display results
display "Regression Metrics (Test Set)"
display "========================================"
display "RMSE:       " %9.2f rmse
display "MAE:        " %9.2f mae
display "R-squared:  " %9.4f r2_oos
display "MAPE:       " %9.2f mape "%"
# R: Computing regression metrics on the test set
library(rsample)
library(randomForest)
library(dplyr)
library(yardstick)

# Load data and split
data(mtcars)
set.seed(42)
split <- initial_split(mtcars, prop = 0.8)
train <- training(split)
test  <- testing(split)

# Train a random forest on the training set
rf_model <- randomForest(mpg ~ ., data = train, ntree = 200)

# Predict on the TEST set
test$predicted <- predict(rf_model, newdata = test)

# Compute metrics manually inside summarise()
results <- test |>
  summarise(
    RMSE  = sqrt(mean((mpg - predicted)^2)),
    MAE   = mean(abs(mpg - predicted)),
    R2    = 1 - sum((mpg - predicted)^2) / sum((mpg - mean(mpg))^2),
    MAPE  = mean(abs((mpg - predicted) / mpg)) * 100
  )

# Alternatively, yardstick's rmse_vec(), mae_vec(), rsq_trad_vec(),
# and mape_vec() compute the same quantities
cat("Regression Metrics (Test Set)\n")
cat("========================================\n")
cat(sprintf("RMSE:       %.4f\n", results$RMSE))
cat(sprintf("MAE:        %.4f\n", results$MAE))
cat(sprintf("R-squared:  %.4f\n", results$R2))
cat(sprintf("MAPE:       %.2f%%\n", results$MAPE))
Python Output
Regression Metrics (Test Set)
========================================
RMSE:       0.5010
MAE:        0.3272
R-squared:  0.8098
MAPE:       17.25%

Interpretation:
  Predictions are off by ~$32,720 on average (MAE)
  The model explains 81.0% of variance in house values
Stata Output
Regression Metrics (Test Set)
========================================
RMSE:        2461.38
MAE:         1894.52
R-squared:     0.3271
MAPE:         29.87%
R Output
Regression Metrics (Test Set)
========================================
RMSE:       2.7134
MAE:        2.1026
R-squared:  0.8523
MAPE:       10.84%

Classification Metrics

When the outcome is categorical — whether a firm defaults, whether a household participates in a program, whether an email is spam — the evaluation metrics are fundamentally different from regression. Prediction errors are no longer continuous deviations; they are discrete mistakes: the model said "yes" but the truth was "no," or vice versa. Classification metrics quantify these discrete errors in different ways, and choosing the right metric is critical because they often disagree about which model is "best."

The foundation of all classification metrics is the confusion matrix. For a binary classification problem, it is a 2x2 table with four cells:

Confusion Matrix

                     Predicted Positive            Predicted Negative
Actually Positive    TP (True Positive)            FN (False Negative)
                     correctly predicted +         missed (Type II error)
Actually Negative    FP (False Positive)           TN (True Negative)
                     false alarm (Type I error)    correctly predicted −

TP and TN are correct predictions; FP and FN are errors. Every classification metric is computed from these four counts.

Accuracy

Accuracy is the simplest metric: the fraction of all predictions that are correct.

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Accuracy is intuitive but deeply misleading with imbalanced classes. If only 2% of firms default, a model that always predicts "no default" achieves 98% accuracy while being completely useless for the task. For this reason, accuracy should almost never be used as the primary metric for imbalanced classification problems.
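A two-line demonstration of this point, using a synthetic sample in which 2% of firms default:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Synthetic sample: 20 of 1,000 firms default (2%)
y_true = np.zeros(1000, dtype=int)
y_true[:20] = 1

# A "model" that always predicts no default
y_pred = np.zeros(1000, dtype=int)

print(f"Accuracy: {accuracy_score(y_true, y_pred):.3f}")  # 0.980
print(f"Recall:   {recall_score(y_true, y_pred):.3f}")    # 0.000 -- catches no defaults
```

The 98% accuracy hides the fact that the model identifies exactly zero of the defaults it was built to catch.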

Precision and Recall

Precision answers: "Of all observations the model labeled as positive, how many actually were positive?" It measures the reliability of positive predictions.

Precision = TP / (TP + FP)

Recall (also called sensitivity or true positive rate) answers: "Of all observations that were actually positive, how many did the model identify?" It measures the completeness of positive detection.

Recall = TP / (TP + FN)

There is an inherent trade-off between precision and recall. Lowering the classification threshold (being more willing to predict "positive") increases recall but decreases precision, and vice versa. The right balance depends on the costs of each type of error: in fraud detection, missing a fraud (low recall) is typically worse than flagging a legitimate transaction (low precision); in medical screening, the reverse may be true.
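The trade-off can be traced by sweeping the threshold applied to predicted probabilities. The sketch below reuses the breast cancer data with a logistic regression (the specific model is incidental; any classifier with predict_proba works):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=0, stratify=data.target
)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
prob = clf.predict_proba(X_test)[:, 1]

# Lower threshold -> more predicted positives -> higher recall, lower precision
for threshold in (0.2, 0.5, 0.8):
    pred = (prob >= threshold).astype(int)
    print(f"threshold {threshold}: "
          f"precision={precision_score(y_test, pred):.3f}, "
          f"recall={recall_score(y_test, pred):.3f}")
```

Raising the threshold can only remove predicted positives, so recall falls (weakly) as the threshold rises, while precision tends to rise.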

F1 Score

The F1 score is the harmonic mean of precision and recall, providing a single metric that balances both:

F1 = 2 * (Precision * Recall) / (Precision + Recall)

F1 ranges from 0 to 1. It equals 1 only when both precision and recall are perfect. The harmonic mean penalizes imbalance: if either precision or recall is very low, F1 will be low even if the other is high. This makes F1 a better summary metric than accuracy for imbalanced problems.
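A one-line comparison shows the penalty. With precision 0.9 and recall 0.1, the arithmetic mean looks like a respectable 0.50, while the harmonic mean (F1) is only 0.18:

```python
precision, recall = 0.9, 0.1

arithmetic = (precision + recall) / 2
f1 = 2 * precision * recall / (precision + recall)

print(f"Arithmetic mean: {arithmetic:.2f}")  # 0.50
print(f"F1 (harmonic):   {f1:.2f}")          # 0.18
```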

AUC-ROC

AUC-ROC, the area under the receiver operating characteristic (ROC) curve, evaluates the model's ability to discriminate between positive and negative classes across all possible thresholds, rather than at a single threshold. An AUC of 0.5 means the model is no better than random guessing; an AUC of 1.0 means perfect discrimination. AUC-ROC is threshold-independent, making it useful for comparing models when you have not yet decided on a classification threshold.

Precision-Focused Problems

  • Spam filtering (users hate false alarms)
  • Drug approval (false positives have high costs)
  • Targeting scarce program resources

Recall-Focused Problems

  • Fraud detection (missing fraud is very costly)
  • Disease screening (missing cases is dangerous)
  • Identifying at-risk students or firms
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    roc_auc_score
)

# Load breast cancer data (binary classification)
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42, stratify=data.target
)

# Train a random forest classifier
clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_train, y_train)

# Predictions and predicted probabilities
y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)[:, 1]  # Probability of positive class

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(f"  TN={cm[0,0]}  FP={cm[0,1]}")
print(f"  FN={cm[1,0]}  TP={cm[1,1]}")

# Full classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=["Malignant", "Benign"]))

# AUC-ROC
auc = roc_auc_score(y_test, y_prob)
print(f"AUC-ROC: {auc:.4f}")
* Stata: Classification metrics after logistic regression
webuse lbw, clear

* Binary outcome: low birth weight (low = 1 if birth weight < 2500g)
set seed 42
gen rand = runiform()
gen byte is_test = (rand > 0.8)

* Fit logistic regression on training data
logit low age lwt i.race smoke ptl ht ui if is_test == 0

* Predict probabilities for all observations
predict double phat, pr

* Classify using 0.5 threshold
gen byte y_pred = (phat >= 0.5)

* Confusion matrix on test set
display "Confusion Matrix (Test Set):"
tab low y_pred if is_test == 1

* estat classification reports sensitivity, specificity, etc.
* Refit on training data, then restrict the classification table to the test set
quietly logit low age lwt i.race smoke ptl ht ui if is_test == 0
estat classification if is_test == 1

* ROC curve and AUC
lroc if is_test == 1, nograph
# R: Classification metrics using yardstick
library(dplyr)
library(yardstick)

# Load data (built-in two-class dataset)
data(two_class_example, package = "yardstick")

# Compute confusion matrix
cat("Confusion Matrix:\n")
print(conf_mat(two_class_example, truth = truth, estimate = predicted))

# Compute all key classification metrics
metrics_result <- two_class_example |>
  summarise(
    Accuracy  = accuracy_vec(truth, predicted),
    Precision = precision_vec(truth, predicted),
    Recall    = recall_vec(truth, predicted),
    F1        = f_meas_vec(truth, predicted)
  )

cat("\nClassification Metrics:\n")
cat(sprintf("  Accuracy:  %.4f\n", metrics_result$Accuracy))
cat(sprintf("  Precision: %.4f\n", metrics_result$Precision))
cat(sprintf("  Recall:    %.4f\n", metrics_result$Recall))
cat(sprintf("  F1 Score:  %.4f\n", metrics_result$F1))

# AUC-ROC (requires predicted probabilities)
auc_val <- roc_auc_vec(
  two_class_example$truth,
  two_class_example$Class1
)
cat(sprintf("  AUC-ROC:   %.4f\n", auc_val))
Python Output
Confusion Matrix:
  TN=40  FP=3
  FN=2   TP=69

Classification Report:
              precision    recall  f1-score   support

   Malignant       0.95      0.93      0.94        43
      Benign       0.96      0.97      0.97        71

    accuracy                           0.96       114
   macro avg       0.96      0.95      0.95       114
weighted avg       0.96      0.96      0.96       114

AUC-ROC: 0.9952
Stata Output
Confusion Matrix (Test Set):
           |    y_pred
       low |         0          1 |     Total
-----------+----------------------+----------
         0 |        23          4 |        27
         1 |         5          6 |        11
-----------+----------------------+----------
     Total |        28         10 |        38

Logistic model for low
              -------- True --------
Classified |         D            ~D  |      Total
-----------+--------------------------+-----------
     +     |         6             4  |         10
     -     |         5            23  |         28
-----------+--------------------------+-----------
   Total   |        11            27  |         38

Sensitivity                  Pr( +| D)   54.55%
Specificity                  Pr( -|~D)   85.19%
Positive predictive value    Pr( D| +)   60.00%
Negative predictive value    Pr(~D| -)   82.14%
Correctly classified                     76.32%

Area under ROC curve = 0.7828
R Output
Confusion Matrix:
            Truth
Prediction Class1 Class2
    Class1    227     27
    Class2     23    223

Classification Metrics:
  Accuracy:  0.9000
  Precision: 0.8937
  Recall:    0.9080
  F1 Score:  0.9008
  AUC-ROC:   0.9406

Hyperparameter Tuning

Most machine learning models have hyperparameters — settings that must be chosen before training begins and that can dramatically affect performance. The number of trees in a random forest, the regularization penalty in LASSO, the depth of a decision tree, the learning rate of a neural network: none of these are learned from the data, and all of them matter.

The naive approach — trying a few values by hand and picking whatever looks best on the test set — is statistically dangerous. If you try 100 hyperparameter combinations and select the one with the best test-set performance, you have effectively used the test set for model selection, and your reported performance is optimistically biased. The solution is to use a validation set (separate from both training and test) or, better, to use cross-validation within the training set to evaluate hyperparameter choices, reserving the test set for final evaluation only.

Grid Search

Grid search evaluates every combination of a predefined set of hyperparameter values. If you specify 5 values for the regularization parameter and 4 values for the tree depth, grid search trains and evaluates 5 x 4 = 20 models. It is simple, exhaustive, and parallelizable, but becomes prohibitively expensive when the hyperparameter space is large (the "curse of dimensionality" applies to hyperparameter grids too).

Random Search

Random search samples hyperparameter combinations randomly from specified distributions. It is often more efficient than grid search because it explores the space more broadly. A classic result by Bergstra and Bengio (2012) showed that random search finds good hyperparameters in fewer evaluations than grid search, especially when some hyperparameters matter much more than others.
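In scikit-learn, random search is a near drop-in replacement for grid search via RandomizedSearchCV. A sketch on synthetic data (the parameter ranges here are illustrative, not recommendations):

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data stands in for a real dataset
X, y = make_regression(n_samples=300, n_features=8, n_informative=5,
                       noise=10.0, random_state=42)

# Distributions to sample from, instead of a fixed grid
param_distributions = {
    "n_estimators": randint(50, 300),
    "max_depth": randint(3, 20),
    "min_samples_leaf": randint(1, 10),
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions,
    n_iter=10,          # only 10 sampled combinations, vs. an exhaustive grid
    cv=3,
    scoring="r2",
    random_state=42,
    n_jobs=-1,
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best CV R-squared: {search.best_score_:.3f}")
```

With n_iter=10 the search trains 10 x 3 = 30 models, regardless of how many values each distribution could generate.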

Bayesian Optimization

Bayesian optimization builds a probabilistic model of the objective function (e.g., cross-validated accuracy as a function of hyperparameters) and uses it to intelligently choose the next combination to evaluate. It focuses exploration on promising regions of the hyperparameter space, making it more sample-efficient than grid or random search. Libraries like optuna (Python) and mlrMBO (R) implement this approach.

# Python: Grid search with GridSearchCV
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor

# Load and split data
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    housing.data, housing.target, test_size=0.2, random_state=42
)

# Define the hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [10, 20, None],
    'min_samples_leaf': [1, 5, 10]
}

# Grid search with 5-fold CV (3 x 3 x 3 = 27 combinations)
grid_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_train, y_train)

# Results
print(f"\nBest hyperparameters:")
for param, val in grid_search.best_params_.items():
    print(f"  {param}: {val}")

print(f"\nBest CV RMSE: {(-grid_search.best_score_)**0.5:.4f}")

# Evaluate best model on held-out TEST set
from sklearn.metrics import mean_squared_error
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
test_rmse = mean_squared_error(y_test, y_pred) ** 0.5  # squared=False was removed in newer scikit-learn
print(f"Test RMSE (final, unbiased): {test_rmse:.4f}")
* Stata: Manual hyperparameter tuning via cross-validation loop
* Stata does not have a built-in GridSearchCV equivalent,
* so we loop over candidate values manually.

sysuse auto, clear
set seed 42

* Create 5 folds for cross-validation
gen fold = ceil(5 * runiform())

* Tune which predictors to include by comparing candidate
* specifications on 5-fold cross-validated RMSE
local best_rmse = 99999
local best_spec = ""

* Loop over candidate model specifications
foreach spec in "mpg weight" "mpg weight length" "mpg weight length foreign" "mpg weight length foreign headroom trunk" {

    local total_se = 0
    local n_test = 0

    forvalues k = 1/5 {
        quietly reg price `spec' if fold != `k'
        quietly predict double yhat_`k' if fold == `k'
        quietly gen double se_`k' = (price - yhat_`k')^2 if fold == `k'
        quietly summarize se_`k'
        local total_se = `total_se' + r(sum)
        local n_test = `n_test' + r(N)
        drop yhat_`k' se_`k'
    }

    local cv_rmse = sqrt(`total_se' / `n_test')
    display "Spec: `spec'  =>  CV RMSE = " %9.2f `cv_rmse'

    if `cv_rmse' < `best_rmse' {
        local best_rmse = `cv_rmse'
        local best_spec = "`spec'"
    }
}

display _newline "Best specification: `best_spec'"
display "Best CV RMSE:       " %9.2f `best_rmse'
# R: Hyperparameter tuning with tidymodels tune_grid()
library(tidymodels)

# Load data and split
data(mtcars)
set.seed(42)
split <- initial_split(mtcars, prop = 0.8)
train <- training(split)
test  <- testing(split)
folds <- vfold_cv(train, v = 5)

# Define a random forest model with tunable hyperparameters
rf_spec <- rand_forest(
  trees = tune(),
  min_n = tune()
) |>
  set_engine("ranger") |>
  set_mode("regression")

# Create a workflow
rf_wf <- workflow() |> add_model(rf_spec) |> add_formula(mpg ~ .)

# Define a hyperparameter grid
rf_grid <- grid_regular(
  trees(range = c(50, 300)),
  min_n(range = c(2, 10)),
  levels = 4
)

# Tune using 5-fold CV
rf_tuned <- tune_grid(
  rf_wf,
  resamples = folds,
  grid = rf_grid,
  metrics = metric_set(rmse, rsq)
)

# Show best results
cat("Top 5 configurations by RMSE:\n")
print(show_best(rf_tuned, metric = "rmse", n = 5))

# Finalize workflow with best hyperparameters
best_params <- select_best(rf_tuned, metric = "rmse")
final_wf <- finalize_workflow(rf_wf, best_params)
final_fit <- last_fit(final_wf, split)

cat("\nFinal test set metrics:\n")
print(collect_metrics(final_fit))
Python Output
Fitting 5 folds for each of 27 candidates, totalling 135 fits

Best hyperparameters:
  max_depth: 20
  min_samples_leaf: 1
  n_estimators: 200

Best CV RMSE: 0.5027
Test RMSE (final, unbiased): 0.4986
Stata Output
Spec: mpg weight                             =>  CV RMSE =   2648.93
Spec: mpg weight length                      =>  CV RMSE =   2541.37
Spec: mpg weight length foreign              =>  CV RMSE =   2486.12
Spec: mpg weight length foreign headroom trunk =>  CV RMSE =   2592.81

Best specification: mpg weight length foreign
Best CV RMSE:        2486.12
R Output
Top 5 configurations by RMSE:
# A tibble: 5 x 8
  trees min_n .metric .estimator  mean     n std_err .config
  <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>
1   216     2 rmse    standard    2.68     5   0.412 Preprocessor1_Model06
2   300     2 rmse    standard    2.71     5   0.398 Preprocessor1_Model08
3   133     2 rmse    standard    2.73     5   0.425 Preprocessor1_Model04
4   216     5 rmse    standard    2.79     5   0.381 Preprocessor1_Model07
5   300     5 rmse    standard    2.82     5   0.393 Preprocessor1_Model09

Final test set metrics:
# A tibble: 2 x 4
  .metric .estimator .estimate .config
  <chr>   <chr>          <dbl> <chr>
1 rmse    standard       2.54  Preprocessor1_Model1
2 rsq     standard       0.867 Preprocessor1_Model1

Overfitting Diagnostics

How can you tell whether a model is underfitting, overfitting, or well-calibrated? The most informative diagnostic tools are learning curves and validation curves. These plots show you not just the final performance number, but how performance changes as you vary the training set size or a hyperparameter. They make the bias-variance trade-off visible.

Learning Curves

A learning curve plots both training-set and validation-set performance as the size of the training set increases. The patterns are diagnostic:

  • Underfitting: Both training and validation scores are low and converge quickly. Adding more data will not help — the model is too simple to capture the pattern. You need a more flexible model.
  • Overfitting: Training score is high, but validation score is significantly lower. The gap between the two curves is large. Adding more data can help (the validation curve slowly rises), or you can regularize the model.
  • Good fit: Training score is high, validation score is close to it, and both stabilize as the training set grows. The model has learned the true pattern without memorizing noise.

Validation Curves

A validation curve plots training and validation performance as a function of a single hyperparameter (e.g., tree depth, regularization strength). It reveals the "sweet spot" — the hyperparameter value where validation performance peaks before overfitting takes over.
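A validation curve can be sketched with scikit-learn's validation_curve; here the tuned hyperparameter is tree depth, on the bundled diabetes data (an illustrative choice):

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
depths = np.arange(1, 11)

train_scores, val_scores = validation_curve(
    DecisionTreeRegressor(random_state=42), X, y,
    param_name="max_depth", param_range=depths,
    cv=5, scoring="neg_mean_squared_error",
)
train_rmse = np.sqrt(-train_scores).mean(axis=1)
val_rmse = np.sqrt(-val_scores).mean(axis=1)

print(f"{'depth':>6} {'train RMSE':>11} {'val RMSE':>9}")
for d, tr, vl in zip(depths, train_rmse, val_rmse):
    print(f"{d:>6} {tr:>11.2f} {vl:>9.2f}")
# Training RMSE keeps falling with depth, while validation RMSE reaches
# its minimum at a shallow depth and then rises as the tree overfits.
```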

# Python: Learning curve with scikit-learn
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import learning_curve
import matplotlib.pyplot as plt

# Load data
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Compute learning curve
train_sizes, train_scores, val_scores = learning_curve(
    RandomForestRegressor(n_estimators=100, random_state=42),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5,
    scoring='neg_mean_squared_error',
    n_jobs=-1
)

# Convert to RMSE
train_rmse = np.sqrt(-train_scores)
val_rmse   = np.sqrt(-val_scores)

# Plot
plt.figure(figsize=(8, 5))
plt.plot(train_sizes, train_rmse.mean(axis=1), 'o-', label='Training RMSE')
plt.plot(train_sizes, val_rmse.mean(axis=1), 'o-', label='Validation RMSE')
plt.fill_between(train_sizes,
    val_rmse.mean(axis=1) - val_rmse.std(axis=1),
    val_rmse.mean(axis=1) + val_rmse.std(axis=1), alpha=0.1)
plt.xlabel('Training Set Size')
plt.ylabel('RMSE')
plt.title('Learning Curve — Random Forest')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Print numerical summary
print("Learning Curve Summary")
print("=" * 55)
print(f"{'Train Size':>12} {'Train RMSE':>12} {'Val RMSE':>12}")
print("-" * 55)
for i in range(len(train_sizes)):
    print(f"{train_sizes[i]:>12,} {train_rmse.mean(axis=1)[i]:>12.4f} {val_rmse.mean(axis=1)[i]:>12.4f}")
* Stata: Manual learning curve computation
* Train on increasing fractions of data, evaluate on held-out test set

sysuse auto, clear
set seed 42

* Reserve 20% as permanent test set
gen rand = runiform()
gen byte is_test = (rand > 0.8)

* Count training observations
quietly count if is_test == 0
local n_train = r(N)

* Generate a random ordering for training observations
gen train_order = runiform() if is_test == 0
sort is_test train_order

* Loop over increasing training set fractions
display "Learning Curve: RMSE by Training Set Size"
display "==========================================="
display "{txt}  Train Size   Train RMSE    Test RMSE"
display "-------------------------------------------"

foreach frac in 0.2 0.4 0.6 0.8 1.0 {
    local n_use = floor(`frac' * `n_train')

    * Mark which training obs to use
    gen byte use_train = (_n <= `n_use') & (is_test == 0)

    * Fit model on subset
    quietly reg price mpg weight length if use_train == 1
    quietly predict double yhat

    * Train RMSE
    quietly gen double se_train = (price - yhat)^2 if use_train == 1
    quietly summarize se_train
    local rmse_train = sqrt(r(mean))

    * Test RMSE
    quietly gen double se_test = (price - yhat)^2 if is_test == 1
    quietly summarize se_test
    local rmse_test = sqrt(r(mean))

    display %12.0f `n_use' %12.2f `rmse_train' %12.2f `rmse_test'

    drop use_train yhat se_train se_test
}
# R: Custom learning curve via a loop over training-set sizes
library(randomForest)
library(rsample)

# Load data and split
data(mtcars)
set.seed(42)
split <- initial_split(mtcars, prop = 0.8)
train <- training(split)
test  <- testing(split)

# Define training set fractions
fractions <- seq(0.3, 1.0, by = 0.1)
results <- data.frame(frac = fractions, n = integer(length(fractions)),
                       train_rmse = numeric(length(fractions)),
                       test_rmse = numeric(length(fractions)))

for (i in seq_along(fractions)) {
  n_use <- floor(fractions[i] * nrow(train))
  idx <- sample(nrow(train), n_use)
  train_sub <- train[idx, ]

  # Train model on subset
  rf <- randomForest(mpg ~ ., data = train_sub, ntree = 100)

  # Train RMSE
  train_pred <- predict(rf, train_sub)
  results$train_rmse[i] <- sqrt(mean((train_sub$mpg - train_pred)^2))

  # Test RMSE
  test_pred <- predict(rf, test)
  results$test_rmse[i] <- sqrt(mean((test$mpg - test_pred)^2))

  results$n[i] <- n_use
}

# Print results
cat("Learning Curve Summary\n")
cat("==========================================\n")
cat(sprintf("%12s %12s %12s\n", "Train Size", "Train RMSE", "Test RMSE"))
cat("------------------------------------------\n")
for (i in 1:nrow(results)) {
  cat(sprintf("%12d %12.4f %12.4f\n", results$n[i], results$train_rmse[i], results$test_rmse[i]))
}

# Plot
plot(results$n, results$test_rmse, type = "o", col = "red",
     xlab = "Training Set Size", ylab = "RMSE",
     main = "Learning Curve - Random Forest",
     ylim = c(0, max(results$test_rmse) * 1.1))
lines(results$n, results$train_rmse, type = "o", col = "blue")
legend("topright", legend = c("Training", "Test"),
       col = c("blue", "red"), lty = 1, pch = 1)
Python Output
Learning Curve Summary
=======================================================
  Train Size   Train RMSE     Val RMSE
-------------------------------------------------------
       1,651       0.1834       0.5823
       3,302       0.1921       0.5547
       4,953       0.1956       0.5388
       6,604       0.1963       0.5285
       8,256       0.1976       0.5218
       9,907       0.1987       0.5166
      11,558       0.1992       0.5121
      13,209       0.1998       0.5092
      14,860       0.2001       0.5068
      16,512       0.2005       0.5047
Stata Output
Learning Curve: RMSE by Training Set Size
===========================================
  Train Size   Train RMSE    Test RMSE
-------------------------------------------
          12      1847.35      3142.68
          23      2012.48      2784.51
          35      2098.63      2621.34
          47      2134.17      2512.89
          59      2156.42      2461.38
R Output
Learning Curve Summary
==========================================
  Train Size   Train RMSE    Test RMSE
------------------------------------------
           7       1.2341       4.8712
          10       1.3028       3.9456
          12       1.3894       3.5218
          15       1.4216       3.2847
          17       1.4587       3.1025
          20       1.4812       2.9483
          22       1.5034       2.8156
          25       1.5198       2.7134

Model Comparison Workflow

In practice, model evaluation is not about computing a single metric for a single model. It is a systematic process of comparing multiple candidate models to select the best one for your task. A reliable model comparison workflow follows these steps:

  1. Split your data into training and test sets (or use nested cross-validation for small datasets). The test set is locked away until the very end.
  2. Define candidate models: OLS, LASSO, random forest, gradient boosting, neural network, etc. Include simple baselines (e.g., "always predict the mean") to ensure your model adds value.
  3. Tune hyperparameters for each candidate using cross-validation on the training set only. Each model gets its best hyperparameters.
  4. Compare cross-validated performance across models. Use the same folds for all models to ensure a fair comparison.
  5. Select the best model and evaluate it once on the test set. This is your final, unbiased performance estimate. Do not go back and try more models after seeing the test-set result.
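Steps 2 through 4 can be sketched in scikit-learn; the candidate models, the mean baseline, and the bundled diabetes data are illustrative placeholders. The key detail is a single KFold object with a fixed seed, so every model is scored on identical folds:

```python
from sklearn.datasets import load_diabetes
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_diabetes(return_X_y=True)

# One KFold object with a fixed seed: identical folds for every model
folds = KFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    "Mean baseline": DummyRegressor(strategy="mean"),
    "OLS": LinearRegression(),
    "LASSO": Lasso(alpha=0.5),
    "Random forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=folds,
                             scoring="neg_mean_squared_error")
    results[name] = (-scores.mean()) ** 0.5
    print(f"{name:<14} CV RMSE = {results[name]:.2f}")
```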

Prediction Task Evaluation

  • Goal: maximize out-of-sample prediction accuracy
  • Metrics: RMSE, MAE, AUC-ROC, F1
  • Method: cross-validation + held-out test set
  • Flexibility is a virtue if it improves predictions
  • Examples: forecasting, targeting, classification

Causal Task Evaluation

  • Goal: estimate a causal effect (ATE, CATE, ATT)
  • Metrics: bias, coverage of confidence intervals, MSE of the treatment effect estimate
  • Method: simulation studies, placebo tests, sensitivity analysis
  • Interpretability and identification matter more than raw fit
  • Examples: policy evaluation, treatment effects, DiD

The distinction above is fundamental. The entire model evaluation framework in this module — cross-validation, RMSE, AUC-ROC, hyperparameter tuning — is designed for prediction tasks. When the goal is causal inference, the evaluation criteria are different: you care about whether the estimated treatment effect is unbiased and whether the confidence intervals have correct coverage, not about whether the model predicts the outcome well. A model can be an excellent predictor but a terrible estimator of causal effects, and vice versa. See Module 11D (Causal ML) for evaluation strategies specific to causal tasks.
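A toy simulation (an assumed data-generating process, not course data) makes the last point concrete: conditioning on a variable that is a consequence of the outcome improves the fit but badly biases the estimated treatment effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
d = rng.binomial(1, 0.5, n).astype(float)    # randomized treatment, true effect = 1.0
e = rng.normal(size=n)
y = 1.0 * d + e                              # outcome
m = 0.8 * y + 0.8 * d + rng.normal(size=n)   # a consequence of the outcome

def ols(cols, outcome):
    """OLS with intercept: return coefficients and R-squared."""
    Z = np.column_stack([np.ones(n)] + cols)
    beta, *_ = np.linalg.lstsq(Z, outcome, rcond=None)
    resid = outcome - Z @ beta
    return beta, 1 - resid.var() / outcome.var()

beta1, r2_1 = ols([d], y)      # correct causal specification
beta2, r2_2 = ols([d, m], y)   # better predictor, biased effect

print(f"y ~ d:     effect on d = {beta1[1]:.2f}, R^2 = {r2_1:.2f}")
print(f"y ~ d + m: effect on d = {beta2[1]:.2f}, R^2 = {r2_2:.2f}")
# The second model fits much better, yet its treatment coefficient is
# biased far below the true effect of 1.0.
```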

Common Pitfalls

1. Data Leakage

Data leakage occurs when information from the test set (or from the future, or from the outcome itself) sneaks into the training process. It is the most common and most dangerous mistake in applied ML. Examples include: normalizing features using the mean and standard deviation of the entire dataset (including test observations), using variables that are consequences of the outcome rather than predictors, or including the outcome variable as a feature through an indirect path. Leakage produces models that look spectacular during evaluation but fail completely in deployment. The fix: every preprocessing step (scaling, encoding, imputation) must be fit on the training set only and then applied to the test set.
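One way to sketch the fix in scikit-learn (the dataset is illustrative): place the scaler inside a Pipeline, so cross_val_score refits it on each training fold and only applies it to the corresponding validation fold.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# WRONG: StandardScaler().fit(X) before splitting leaks test-fold
# means and SDs into training. RIGHT: the scaler lives inside the
# pipeline and is refit on each training fold only.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5)
print(f"Leak-free CV accuracy: {scores.mean():.3f}")
```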

2. Ignoring Temporal Structure

If your data has a time dimension — panel data, time series, event data — a random train-test split is invalid. It allows the model to "peek into the future," using observations from 2022 to predict outcomes in 2019. This produces an optimistic performance estimate that does not reflect real-world deployment, where only past data is available. Always use a temporal split: train on data from earlier periods and test on later periods. For cross-validation, use time-series CV (expanding or sliding window), not standard K-fold.
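scikit-learn's TimeSeriesSplit implements the expanding-window scheme; in this toy sketch with 12 time-ordered observations, every fold trains strictly on the past:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)   # 12 time-ordered periods (toy data)

tscv = TimeSeriesSplit(n_splits=4)
for i, (train_idx, test_idx) in enumerate(tscv.split(X), start=1):
    print(f"Fold {i}: train periods {train_idx.min()}-{train_idx.max()}, "
          f"test periods {test_idx.min()}-{test_idx.max()}")
    # The training window always ends before the test window begins
    assert train_idx.max() < test_idx.min()
```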

3. Choosing the Wrong Metric

The choice of evaluation metric should reflect the costs of different types of errors in your application. Accuracy is almost always the wrong metric for imbalanced classification. RMSE penalizes large errors heavily, which may or may not be appropriate. MAPE is undefined for zero values. Using the wrong metric means optimizing for the wrong goal: you may select a model that is "best" according to a metric that does not align with the actual decision problem. Always ask: "What does a prediction error cost in this application?" and choose metrics accordingly.
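A quick illustration with simulated imbalanced data (an assumed 5 percent positive rate): a classifier that never predicts the rare class scores high accuracy while being useless, which F1 immediately exposes.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

rng = np.random.default_rng(42)
y_true = (rng.random(1000) < 0.05).astype(int)   # ~5% positives
y_naive = np.zeros_like(y_true)                  # never predicts the rare class

acc = accuracy_score(y_true, y_naive)
f1 = f1_score(y_true, y_naive, zero_division=0)
print(f"Accuracy of always predicting 0: {acc:.2f}")   # high, looks great
print(f"F1 of always predicting 0:       {f1:.2f}")    # zero, reveals failure
```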

References

  • Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). Springer. (Available free at hastie.su.domains/ElemStatLearn)
  • James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning. Springer. (Available free at statlearning.com)
  • Athey, S., & Imbens, G. W. (2019). Machine Learning Methods That Economists Should Know About. Annual Review of Economics, 11, 685-725.
  • Bergstra, J., & Bengio, Y. (2012). Random Search for Hyper-Parameter Optimization. Journal of Machine Learning Research, 13, 281-305.
  • Mullainathan, S., & Spiess, J. (2017). Machine Learning: An Applied Econometric Approach. Journal of Economic Perspectives, 31(2), 87-106.