11E Model Evaluation
Learning Objectives
- Understand why evaluating on training data is misleading and how overfitting distorts performance estimates
- Implement train-test splits and K-fold cross-validation in Python, Stata, and R
- Compute and interpret regression metrics: RMSE, MAE, R-squared, and MAPE
- Compute and interpret classification metrics: accuracy, precision, recall, F1, and AUC-ROC
- Use proper evaluation strategies to compare and select among competing models
Why Evaluation Matters
Every model you have encountered in this course — from OLS regression to LASSO, random forests, and neural networks — needs to be evaluated. But evaluated on what? The answer to this question is one of the most important ideas in all of machine learning, and it separates rigorous empirical work from misleading results.
The central problem is overfitting. A sufficiently flexible model — a deep neural network, a random forest with many trees, or even an OLS regression with hundreds of interaction terms — can always achieve excellent performance on the data it was trained on. It does this by memorizing the noise, the outliers, and the quirks of that specific dataset. The model's apparent "accuracy" is an illusion: it reflects the model's ability to recall the training data, not its ability to predict new, unseen observations.
Consider a simple example. Suppose you want to predict GDP growth using 20 macroeconomic indicators, and your dataset contains 50 country-year observations. If you fit a polynomial regression with enough interaction terms, you can achieve an R-squared of 1.0 on the training data — a "perfect" fit. But this model has essentially memorized your 50 data points. Give it data from a new year or a new country, and its predictions will be wildly inaccurate. The training-set R-squared of 1.0 told you nothing useful about the model's real predictive ability.
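A small simulation makes the gap concrete. The sketch below uses synthetic data (a linear signal plus noise, not the GDP example itself) and fits polynomials of increasing degree: the flexible model fits the training data almost perfectly while predicting new data worse.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# 50 noisy observations of a simple linear relationship
x = rng.uniform(-3, 3, size=(50, 1))
y = x.ravel() + rng.normal(0, 1, size=50)

x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.4, random_state=0)

r2 = {}
for degree in (1, 15):
    poly = PolynomialFeatures(degree)
    X_tr = poly.fit_transform(x_tr)
    X_te = poly.transform(x_te)
    model = LinearRegression().fit(X_tr, y_tr)
    # Training R2 rises mechanically with flexibility; test R2 does not
    r2[degree] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(f"degree {degree:2d}: train R2 = {r2[degree][0]:.3f}, "
          f"test R2 = {r2[degree][1]:.3f}")
```

The degree-15 fit posts a near-perfect training R-squared but a markedly lower test R-squared, which is exactly the overfitting pattern described above.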
The Cardinal Rule of Model Evaluation
Never evaluate a model on the same data used to train it. Training-set performance is always optimistically biased. The more flexible the model, the larger the gap between training performance and true out-of-sample performance. This is not a minor technicality — it is the difference between a model that works and one that only appears to work.
The solution is conceptually simple: set aside data that the model has never seen during training, and evaluate the model's predictions on that held-out data. The specific strategies for doing this — train-test splits and cross-validation — are the subject of the next two sections. But the underlying principle is always the same: the only honest measure of a model's quality is its performance on data it has not been trained on.
This principle matters just as much in economics as in computer science. If you are building a model to predict firm bankruptcy, forecast inflation, or identify households eligible for a targeted policy, you need to know how well it will perform when deployed on new data. Reporting only training-set performance is analogous to judging a student by giving them the exact same exam they used to study — it tells you about memorization, not understanding.
Train-Test Split
The simplest evaluation strategy is to split your data into two parts: a training set used to fit the model, and a test set (sometimes called a "hold-out set") used to evaluate it. The model never sees the test data during training — it is kept completely separate, like sealing an exam in an envelope until the student has finished studying.
A typical split is 80% training and 20% test, though the exact ratio depends on the size of your dataset. With very large datasets (hundreds of thousands of observations), even a 90/10 split leaves plenty of test data. With small datasets (a few hundred observations), you may need cross-validation instead (covered in the next section), because holding out 20% leaves too few observations for both reliable training and reliable evaluation.
Two implementation details matter. First, the split should be random — if your data is sorted (by time, by country, by outcome), a non-random split would systematically exclude certain types of observations from training or testing, biasing the results. Second, set a random seed so that your results are reproducible: anyone running your code should get the same split and therefore the same results.
When Random Splits Are Wrong
If your data has a time dimension (panel data, time series), a random split can leak future information into the training set. For example, if you randomly split yearly GDP data, the model might train on 2020 data and predict 2018 — it has seen the future. In these cases, use a temporal split: train on earlier periods and test on later ones. This mimics the real prediction task, where you only have past data available.
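A minimal sketch of such a temporal split, using a hypothetical country-year panel (the column names are illustrative):

```python
import pandas as pd

# Hypothetical country-year panel
df = pd.DataFrame({
    "country": ["A", "B"] * 4,
    "year": [2015, 2015, 2016, 2016, 2017, 2017, 2018, 2018],
    "gdp_growth": [2.1, 0.9, 1.8, 1.1, 2.4, 1.3, 2.0, 1.0],
})

# Train on earlier periods, test on later ones — never the reverse
cutoff = 2016
train = df[df["year"] <= cutoff]
test = df[df["year"] > cutoff]

print("Train years:", sorted(train["year"].unique()))
print("Test years: ", sorted(test["year"].unique()))
```

Unlike a random split, every test observation here lies strictly after every training observation, mimicking real forecasting.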
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing
# Load California Housing data
housing = fetch_california_housing(as_frame=True)
X = housing.data # 20,640 observations, 8 features
y = housing.target # Median house value (in $100k)
# Split: 80% training, 20% test
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.2,
random_state=42
)
print(f"Full dataset: {X.shape[0]:,} observations")
print(f"Training set: {X_train.shape[0]:,} observations ({X_train.shape[0]/X.shape[0]:.0%})")
print(f"Test set: {X_test.shape[0]:,} observations ({X_test.shape[0]/X.shape[0]:.0%})")
print(f"\nFeatures: {X.shape[1]}")
print(f"Target mean: {y.mean():.3f} (train: {y_train.mean():.3f}, test: {y_test.mean():.3f})")
* Stata: Train-test split using random number generation
* Load example data
sysuse auto, clear
* Set seed for reproducibility
set seed 42
* Generate a uniform random variable and split at 80th percentile
gen rand = runiform()
gen byte split = cond(rand <= 0.8, 1, 0)
label define splitlbl 1 "Train" 0 "Test"
label values split splitlbl
* Verify the split
tab split
* Summarize the outcome in each split
tabstat price, by(split) stat(n mean sd min max) format(%9.1f)
* Fit model on training data only
reg price mpg weight length if split == 1
* Generate predictions for ALL observations (train + test)
predict yhat
* Evaluate on TEST set only
gen resid_sq = (price - yhat)^2 if split == 0
quietly summarize resid_sq
display "Test RMSE: " sqrt(r(mean))
# R: Train-test split using rsample (tidymodels framework)
library(rsample)
# Load built-in mtcars data (32 observations, 10 features)
data(mtcars)
# Create an 80/20 split (stratified splits are also possible)
set.seed(42)
split_obj <- initial_split(mtcars, prop = 0.8)
# Extract training and test sets
train_data <- training(split_obj)
test_data <- testing(split_obj)
cat("Full dataset: ", nrow(mtcars), "observations\n")
cat("Training set: ", nrow(train_data), "observations",
sprintf("(%0.0f%%)\n", 100 * nrow(train_data) / nrow(mtcars)))
cat("Test set: ", nrow(test_data), "observations",
sprintf("(%0.0f%%)\n", 100 * nrow(test_data) / nrow(mtcars)))
cat("\nTarget (mpg) mean — train:", round(mean(train_data$mpg), 2),
" test:", round(mean(test_data$mpg), 2), "\n")
Full dataset: 20,640 observations
Training set: 16,512 observations (80%)
Test set: 4,128 observations (20%)

Features: 8
Target mean: 2.069 (train: 2.072, test: 2.055)
split | Freq. Percent Cum.
-----------+-----------------------------------
Train | 59 79.73 79.73
Test | 15 20.27 100.00
-----------+-----------------------------------
Total | 74 100.00
split | N Mean S.D. Min Max
----------+-----------------------------------------------------
Train | 59 6165.3 2949.5 3291 15906
Test | 15 6229.9 3217.1 3748 14500
----------+-----------------------------------------------------
Total | 74 6178.4 2993.0 3291 15906
Test RMSE: 2461.38

Full dataset:  32 observations
Training set:  25 observations (78%)
Test set:  7 observations (22%)

Target (mpg) mean — train: 19.96  test: 20.93
K-Fold Cross-Validation
A single train-test split has an obvious limitation: the evaluation depends on which observations happened to land in the test set. With a small dataset, a different random split could give substantially different results. You might get lucky (easy test observations) or unlucky (hard ones), and your performance estimate will be noisy.
K-fold cross-validation solves this by using every observation for both training and testing, just not at the same time. The procedure is straightforward: divide the data into K equally-sized subsets (called "folds"). Then, for each fold in turn, train the model on the other K-1 folds and evaluate it on the held-out fold. After K rounds, every observation has been in the test set exactly once. The final performance metric is the average across all K folds.
The most common choice is K = 5 or K = 10. With 5-fold CV, each fold contains 20% of the data, and the model is trained 5 times on 80% of the data. This gives a more reliable performance estimate than a single split, because it averages over 5 different train-test configurations. The cost is computational: you must train the model K times instead of once.
Figure: 5-fold cross-validation. Each observation appears in the test set exactly once, and the final score is the average across all folds.
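The loop just described can be written out explicitly with scikit-learn's KFold, which makes the train-on-K−1, test-on-one structure visible — a sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.5, size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    model = LinearRegression().fit(X[train_idx], y[train_idx])  # train on the other 4 folds
    score = model.score(X[test_idx], y[test_idx])               # evaluate on the held-out fold
    fold_scores.append(score)
    print(f"Fold {fold}: R2 = {score:.3f}")

print(f"Mean CV R2: {np.mean(fold_scores):.3f}")
```

This is what cross_val_score does internally; writing the loop by hand is useful when you need custom metrics or per-fold diagnostics.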
An important variant is stratified K-fold cross-validation, which ensures that each fold has roughly the same distribution of the target variable. This is particularly important for classification problems with imbalanced classes — if only 5% of firms in your dataset go bankrupt, a non-stratified split might create a fold with no bankrupt firms at all, making evaluation meaningless.
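A quick sketch of what stratification guarantees, using a hypothetical 5% positive class: StratifiedKFold spreads the rare positives evenly, so no fold is left without them.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.array([1] * 10 + [0] * 190)   # 5% positive class, like rare bankruptcies

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
pos_per_fold = []
for fold, (_, test_idx) in enumerate(skf.split(X, y), start=1):
    n_pos = int(y[test_idx].sum())   # positives landing in this fold
    pos_per_fold.append(n_pos)
    print(f"Fold {fold}: {n_pos} positives out of {len(test_idx)} observations")
```

With 10 positives and 5 folds, each fold receives exactly 2 positives; an unstratified split offers no such guarantee.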
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
# Load data
housing = fetch_california_housing()
X, y = housing.data, housing.target
# Define two competing models
ols = LinearRegression()
rf = RandomForestRegressor(n_estimators=100, random_state=42)
# 5-fold cross-validation for both models
# scoring="r2" computes R-squared; scoring="neg_mean_squared_error" gives -MSE
ols_scores = cross_val_score(ols, X, y, cv=5, scoring="r2")
rf_scores = cross_val_score(rf, X, y, cv=5, scoring="r2")
print("5-Fold Cross-Validation: R-squared")
print("=" * 45)
print(f"OLS Regression:")
print(f" Per-fold scores: {ols_scores.round(3)}")
print(f" Mean: {ols_scores.mean():.3f} (Std: {ols_scores.std():.3f})")
print(f"\nRandom Forest:")
print(f" Per-fold scores: {rf_scores.round(3)}")
print(f" Mean: {rf_scores.mean():.3f} (Std: {rf_scores.std():.3f})")
* Stata: K-fold cross-validation using the crossfold command
* Install if needed: ssc install crossfold
sysuse auto, clear
set seed 42
* crossfold performs k-fold CV for regression models
* It reports the out-of-sample RMSE for each fold
crossfold reg price mpg weight length, k(5)
* The output shows RMSE for each fold and the overall average
* You can also do this manually for more control:
* Manual 5-fold CV
gen fold = ceil(5 * runiform())
forvalues k = 1/5 {
quietly reg price mpg weight length if fold != `k'
quietly predict double yhat_`k' if fold == `k'
quietly gen double se_`k' = (price - yhat_`k')^2 if fold == `k'
quietly summarize se_`k'
display "Fold `k': RMSE = " sqrt(r(mean))
}
* Compute overall CV RMSE
egen all_se = rowtotal(se_1-se_5), missing
quietly summarize all_se
display "Overall CV RMSE: " sqrt(r(mean))
# R: K-fold cross-validation using tidymodels
library(rsample)
library(parsnip)
library(yardstick)
library(tune)
library(workflows)
# Load data
data(mtcars)
# Create 5-fold cross-validation object
set.seed(42)
folds <- vfold_cv(mtcars, v = 5)
# Define two models
lm_spec <- linear_reg() |> set_engine("lm")
rf_spec <- rand_forest(trees = 100) |>
set_engine("ranger") |>
set_mode("regression")
# Create workflows (model + formula)
lm_wf <- workflow() |> add_model(lm_spec) |> add_formula(mpg ~ .)
rf_wf <- workflow() |> add_model(rf_spec) |> add_formula(mpg ~ .)
# Fit both models using 5-fold CV
lm_results <- fit_resamples(lm_wf, resamples = folds, metrics = metric_set(rsq, rmse))
rf_results <- fit_resamples(rf_wf, resamples = folds, metrics = metric_set(rsq, rmse))
# Collect and display average metrics
cat("=== OLS Regression ===\n")
print(collect_metrics(lm_results))
cat("\n=== Random Forest ===\n")
print(collect_metrics(rf_results))
5-Fold Cross-Validation: R-squared
=============================================
OLS Regression:
  Per-fold scores: [0.596 0.609 0.593 0.607 0.592]
  Mean: 0.599 (Std: 0.007)

Random Forest:
  Per-fold scores: [0.805 0.826 0.812 0.817 0.809]
  Mean: 0.814 (Std: 0.007)
Fold 1: Root Mean Squared Error = 2372.019
Fold 2: Root Mean Squared Error = 2781.445
Fold 3: Root Mean Squared Error = 2196.584
Fold 4: Root Mean Squared Error = 2534.717
Fold 5: Root Mean Squared Error = 2688.103

Fold 1: RMSE = 2410.52
Fold 2: RMSE = 2693.18
Fold 3: RMSE = 2247.91
Fold 4: RMSE = 2581.36
Fold 5: RMSE = 2732.40
Overall CV RMSE: 2541.37
=== OLS Regression ===
# A tibble: 2 x 6
  .metric .estimator  mean     n std_err .config
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>
1 rmse    standard   3.46      5  0.485  Preprocessor1_Model1
2 rsq     standard   0.773     5  0.0631 Preprocessor1_Model1

=== Random Forest ===
# A tibble: 2 x 6
  .metric .estimator  mean     n std_err .config
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>
1 rmse    standard   2.87      5  0.398  Preprocessor1_Model1
2 rsq     standard   0.843     5  0.0412 Preprocessor1_Model1
Regression Metrics
Once you have a held-out test set (or cross-validation folds), you need a metric to quantify how well the model's predictions match the actual values. For regression problems — where the outcome is a continuous variable like income, GDP, or house prices — the most commonly used metrics are RMSE, MAE, R-squared, and MAPE. Each measures a different aspect of prediction quality, and the right choice depends on your application.
Root Mean Squared Error (RMSE)
RMSE is the most widely used regression metric. It computes the square root of the average squared prediction error:

RMSE = sqrt( (1/n) Σ_i (y_i − ŷ_i)² )
RMSE is in the same units as the outcome variable — if you are predicting house prices in dollars, the RMSE is also in dollars. This makes it directly interpretable: an RMSE of $25,000 means the model's predictions are typically about $25,000 away from the true values. Because it squares the errors before averaging, RMSE penalizes large errors disproportionately. A single prediction that is off by $100,000 contributes more to the RMSE than ten predictions that are each off by $10,000. This is often desirable — in many applications, big mistakes are much worse than small ones.
Mean Absolute Error (MAE)
MAE takes a simpler approach — it averages the absolute prediction errors without squaring:

MAE = (1/n) Σ_i |y_i − ŷ_i|
MAE is also in the same units as the outcome and is easier to interpret: it is literally the average size of the prediction errors. Unlike RMSE, MAE treats all errors equally — a $100,000 error contributes exactly 10 times as much as a $10,000 error. MAE is more robust to outliers than RMSE. If your data has a few extreme values that are hard to predict, RMSE will be dominated by those outliers, while MAE gives you a better sense of typical prediction quality.
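A toy example of the outlier sensitivity just described, comparing two hypothetical prediction scenarios with the same total absolute error:

```python
import numpy as np

actual = np.full(5, 100.0)
uniform_pred = actual + 10.0                             # every prediction off by 10
outlier_pred = actual + np.array([0.0, 0, 0, 0, 50.0])   # one big miss of 50

metrics = {}
for name, pred in (("uniform errors", uniform_pred), ("one outlier", outlier_pred)):
    rmse = np.sqrt(np.mean((actual - pred) ** 2))
    mae = np.mean(np.abs(actual - pred))
    metrics[name] = (rmse, mae)
    print(f"{name}: RMSE = {rmse:.2f}, MAE = {mae:.2f}")
```

Both scenarios have an identical MAE of 10.0, but the single large miss pushes RMSE from 10.0 up to about 22.4 — the squaring makes one big error dominate.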
R-squared (Coefficient of Determination)
R-squared measures the proportion of variance in the outcome that is explained by the model's predictions:

R² = 1 − Σ_i (y_i − ŷ_i)² / Σ_i (y_i − ȳ)²
R-squared is unitless and ranges from negative infinity to 1. An R-squared of 1.0 means perfect prediction. An R-squared of 0.0 means the model does no better than simply predicting the mean of y for every observation. An R-squared below zero means the model is worse than predicting the mean — this can happen with out-of-sample evaluation when the model has overfit badly. The key advantage of R-squared is that it is scale-free — you can compare R-squared values across different outcome variables (e.g., predicting income vs. predicting GDP growth) in a way that RMSE does not allow.
Mean Absolute Percentage Error (MAPE)
MAPE expresses errors as a percentage of the actual values:

MAPE = (100/n) Σ_i |(y_i − ŷ_i) / y_i|
MAPE is intuitive — a MAPE of 8% means predictions are off by about 8% on average. However, MAPE has a serious limitation: it is undefined when y_i = 0 and becomes extremely large when y_i is close to zero. For this reason, MAPE works best when the outcome variable is strictly positive and does not have values near zero (e.g., GDP, house prices, firm revenue).
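The near-zero problem is easy to see numerically (hypothetical values; every prediction is off by the same 5 units):

```python
import numpy as np

def mape(actual, pred):
    # Mean absolute percentage error, in percent
    return float(np.mean(np.abs((actual - pred) / actual)) * 100)

safe = np.array([100.0, 120.0, 80.0])
print(f"Values far from zero: MAPE = {mape(safe, safe - 5):.1f}%")

risky = np.array([100.0, 120.0, 0.1])   # one value near zero
print(f"One value near zero:  MAPE = {mape(risky, risky - 5):.1f}%")
```

The identical 5-unit errors give a modest MAPE in the first case, but the single near-zero actual value blows the second MAPE up past 1,000% on its own.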
When to Use RMSE
- Large errors are disproportionately costly
- You want a metric in the same units as the outcome
- Standard default for most ML applications
When to Use MAE
- All error sizes are equally important
- Data contains outliers or extreme values
- You want a robust, easily interpretable metric
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score, mean_absolute_percentage_error
# Load data and split
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
housing.data, housing.target, test_size=0.2, random_state=42
)
# Train a random forest
rf = RandomForestRegressor(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
# Generate predictions on the TEST set
y_pred = rf.predict(X_test)
# Compute all regression metrics
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # the squared= argument is deprecated in newer sklearn
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
mape = mean_absolute_percentage_error(y_test, y_pred)
print("Regression Metrics (Test Set)")
print("=" * 40)
print(f"RMSE: {rmse:.4f}")
print(f"MAE: {mae:.4f}")
print(f"R-squared: {r2:.4f}")
print(f"MAPE: {mape:.2%}")
print(f"\nInterpretation:")
print(f" Predictions are off by ~${mae * 100000:,.0f} on average (MAE)")
print(f" The model explains {r2:.1%} of variance in house values")
* Stata: Computing regression metrics manually on the test set
sysuse auto, clear
set seed 42
* Create train/test split
gen rand = runiform()
gen byte is_test = (rand > 0.8)
* Fit model on training data
quietly reg price mpg weight length if is_test == 0
predict double yhat
* Compute metrics on TEST set only
gen double error = price - yhat if is_test == 1
gen double error_sq = error^2 if is_test == 1
gen double error_ab = abs(error) if is_test == 1
gen double pct_err = abs(error / price) if is_test == 1
* RMSE
quietly summarize error_sq
scalar rmse = sqrt(r(mean))
* MAE
quietly summarize error_ab
scalar mae = r(mean)
* R-squared (out-of-sample)
quietly summarize price if is_test == 1
scalar ybar = r(mean)
gen double ss_tot = (price - ybar)^2 if is_test == 1
quietly summarize ss_tot
scalar ss_total = r(sum)
quietly summarize error_sq
scalar ss_res = r(sum)
scalar r2_oos = 1 - ss_res / ss_total
* MAPE
quietly summarize pct_err
scalar mape = r(mean) * 100
* Display results
display "Regression Metrics (Test Set)"
display "========================================"
display "RMSE: " %9.2f rmse
display "MAE: " %9.2f mae
display "R-squared: " %9.4f r2_oos
display "MAPE: " %9.2f mape "%"
# R: Computing regression metrics with dplyr (yardstick has equivalents)
library(rsample)
library(randomForest)
library(dplyr)
library(yardstick)
# Load data and split
data(mtcars)
set.seed(42)
split <- initial_split(mtcars, prop = 0.8)
train <- training(split)
test <- testing(split)
# Train a random forest on the training set
rf_model <- randomForest(mpg ~ ., data = train, ntree = 200)
# Predict on the TEST set
test$predicted <- predict(rf_model, newdata = test)
# Compute the metrics manually inside dplyr's summarise()
results <- test |>
summarise(
RMSE = sqrt(mean((mpg - predicted)^2)),
MAE = mean(abs(mpg - predicted)),
R2 = 1 - sum((mpg - predicted)^2) / sum((mpg - mean(mpg))^2),
MAPE = mean(abs((mpg - predicted) / mpg)) * 100
)
# (yardstick's rmse_vec(), mae_vec(), rsq_trad_vec(), and mape_vec() give the same results)
cat("Regression Metrics (Test Set)\n")
cat("========================================\n")
cat(sprintf("RMSE: %.4f\n", results$RMSE))
cat(sprintf("MAE: %.4f\n", results$MAE))
cat(sprintf("R-squared: %.4f\n", results$R2))
cat(sprintf("MAPE: %.2f%%\n", results$MAPE))
Regression Metrics (Test Set)
========================================
RMSE:      0.5010
MAE:       0.3272
R-squared: 0.8098
MAPE:      17.25%

Interpretation:
  Predictions are off by ~$32,720 on average (MAE)
  The model explains 81.0% of variance in house values
Regression Metrics (Test Set)
========================================
RMSE:      2461.38
MAE:       1894.52
R-squared: 0.3271
MAPE:      29.87%
Regression Metrics (Test Set)
========================================
RMSE:      2.7134
MAE:       2.1026
R-squared: 0.8523
MAPE:      10.84%
Classification Metrics
When the outcome is categorical — whether a firm defaults, whether a household participates in a program, whether an email is spam — the evaluation metrics are fundamentally different from regression. Prediction errors are no longer continuous deviations; they are discrete mistakes: the model said "yes" but the truth was "no," or vice versa. Classification metrics quantify these discrete errors in different ways, and choosing the right metric is critical because they often disagree about which model is "best."
The foundation of all classification metrics is the confusion matrix. For a binary classification problem, it is a 2x2 table with four cells:

                     Predicted positive       Predicted negative
Actual positive      True Positive (TP)       False Negative (FN)
Actual negative      False Positive (FP)      True Negative (TN)

The diagonal cells (TP, TN) are correct predictions; the off-diagonal cells (FP, FN) are errors. Every classification metric is computed from these four counts.
Accuracy
Accuracy is the simplest metric: the fraction of all predictions that are correct.

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Accuracy is intuitive but deeply misleading with imbalanced classes. If only 2% of firms default, a model that always predicts "no default" achieves 98% accuracy while being completely useless for the task. For this reason, accuracy should almost never be used as the primary metric for imbalanced classification problems.
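The 98% example in code, with hypothetical default data — the always-"no" classifier scores high accuracy while achieving zero recall:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical sample: 20 of 1,000 firms default (2%)
y_true = np.array([1] * 20 + [0] * 980)
y_pred = np.zeros(1000, dtype=int)   # a "model" that always predicts no default

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
print(f"Accuracy: {acc:.1%}")   # 98.0% — looks impressive
print(f"Recall:   {rec:.1%}")   # 0.0% — catches no defaults at all
```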
Precision and Recall
Precision answers: "Of all observations the model labeled as positive, how many actually were positive?" It measures the reliability of positive predictions.

Precision = TP / (TP + FP)
Recall (also called sensitivity or true positive rate) answers: "Of all observations that were actually positive, how many did the model identify?" It measures the completeness of positive detection.

Recall = TP / (TP + FN)
There is an inherent trade-off between precision and recall. Lowering the classification threshold (being more willing to predict "positive") increases recall but decreases precision, and vice versa. The right balance depends on the costs of each type of error: in fraud detection, missing a fraud (low recall) is typically worse than flagging a legitimate transaction (low precision); in medical screening, the reverse may be true.
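A sketch of this trade-off on synthetic data: sweep the threshold of a logistic classifier and watch recall fall as the threshold rises.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced classification problem (~20% positives)
X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42, stratify=y)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
probs = clf.predict_proba(X_te)[:, 1]

results = {}
for threshold in (0.2, 0.5, 0.8):
    pred = (probs >= threshold).astype(int)   # classify at this cutoff
    results[threshold] = (precision_score(y_te, pred, zero_division=0),
                          recall_score(y_te, pred))
    print(f"threshold {threshold}: precision = {results[threshold][0]:.2f}, "
          f"recall = {results[threshold][1]:.2f}")
```

Recall is mechanically (weakly) decreasing in the threshold, since raising the cutoff can only shrink the set of predicted positives; precision typically moves the other way.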
F1 Score
The F1 score is the harmonic mean of precision and recall, providing a single metric that balances both:

F1 = 2 × (Precision × Recall) / (Precision + Recall)
F1 ranges from 0 to 1. It equals 1 only when both precision and recall are perfect. The harmonic mean penalizes imbalance: if either precision or recall is very low, F1 will be low even if the other is high. This makes F1 a better summary metric than accuracy for imbalanced problems.
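A one-line comparison shows why the harmonic mean is the right choice here (hypothetical precision and recall values for a reliable but very incomplete classifier):

```python
precision, recall = 0.9, 0.1   # hypothetical: precise but misses most positives

arithmetic = (precision + recall) / 2
f1 = 2 * precision * recall / (precision + recall)

print(f"Arithmetic mean: {arithmetic:.2f}")   # 0.50 — hides the weak recall
print(f"F1 (harmonic):   {f1:.2f}")           # 0.18 — exposes it
```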
AUC-ROC
The AUC-ROC evaluates the model's ability to discriminate between positive and negative classes across all possible thresholds, rather than at a single threshold. An AUC of 0.5 means the model is no better than random guessing; an AUC of 1.0 means perfect discrimination. AUC-ROC is threshold-independent, making it useful for comparing models when you have not yet decided on a classification threshold.
Precision-Focused Problems
- Spam filtering (users hate false alarms)
- Drug approval (false positives have high costs)
- Targeting scarce program resources
Recall-Focused Problems
- Fraud detection (missing fraud is very costly)
- Disease screening (missing cases is dangerous)
- Identifying at-risk students or firms
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
classification_report,
confusion_matrix,
roc_auc_score
)
# Load breast cancer data (binary classification)
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
data.data, data.target, test_size=0.2, random_state=42, stratify=data.target
)
# Train a random forest classifier
clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_train, y_train)
# Predictions and predicted probabilities
y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)[:, 1] # Probability of positive class
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(f" TN={cm[0,0]} FP={cm[0,1]}")
print(f" FN={cm[1,0]} TP={cm[1,1]}")
# Full classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=["Malignant", "Benign"]))
# AUC-ROC
auc = roc_auc_score(y_test, y_prob)
print(f"AUC-ROC: {auc:.4f}")
* Stata: Classification metrics after logistic regression
webuse lbw, clear
* Binary outcome: low birth weight (low = 1 if birth weight < 2500g)
set seed 42
gen rand = runiform()
gen byte is_test = (rand > 0.8)
* Fit logistic regression on training data
logit low age lwt i.race smoke ptl ht ui if is_test == 0
* Predict probabilities for all observations
predict double phat, pr
* Classify using 0.5 threshold
gen byte y_pred = (phat >= 0.5)
* Confusion matrix on test set
display "Confusion Matrix (Test Set):"
tab low y_pred if is_test == 1
* estat classification gives sensitivity, specificity, etc.
* Run logit on test set to use estat (or compute manually)
quietly logit low age lwt i.race smoke ptl ht ui if is_test == 0
estat classification if is_test == 1
* ROC curve and AUC
lroc if is_test == 1, nograph
# R: Classification metrics using yardstick
library(dplyr)
library(yardstick)
# Load data (built-in two-class dataset)
data(two_class_example, package = "yardstick")
# Compute confusion matrix
cat("Confusion Matrix:\n")
print(conf_mat(two_class_example, truth = truth, estimate = predicted))
# Compute all key classification metrics
metrics_result <- two_class_example |>
summarise(
Accuracy = accuracy_vec(truth, predicted),
Precision = precision_vec(truth, predicted),
Recall = recall_vec(truth, predicted),
F1 = f_meas_vec(truth, predicted)
)
cat("\nClassification Metrics:\n")
cat(sprintf(" Accuracy: %.4f\n", metrics_result$Accuracy))
cat(sprintf(" Precision: %.4f\n", metrics_result$Precision))
cat(sprintf(" Recall: %.4f\n", metrics_result$Recall))
cat(sprintf(" F1 Score: %.4f\n", metrics_result$F1))
# AUC-ROC (requires predicted probabilities)
auc_val <- roc_auc_vec(
two_class_example$truth,
two_class_example$Class1
)
cat(sprintf(" AUC-ROC: %.4f\n", auc_val))
Confusion Matrix:
TN=40 FP=3
FN=2 TP=69
Classification Report:
precision recall f1-score support
Malignant 0.95 0.93 0.94 43
Benign 0.96 0.97 0.97 71
accuracy 0.96 114
macro avg 0.96 0.95 0.95 114
weighted avg 0.96 0.96 0.96 114
AUC-ROC: 0.9952

Confusion Matrix (Test Set):
| y_pred
low | 0 1 | Total
-----------+----------------------+----------
0 | 23 4 | 27
1 | 5 6 | 11
-----------+----------------------+----------
Total | 28 10 | 38
Logistic model for low
-------- True --------
Classified | D ~D | Total
-----------+--------------------------+-----------
+ | 6 4 | 10
- | 5 23 | 28
-----------+--------------------------+-----------
Total | 11 27 | 38
Sensitivity Pr( +| D) 54.55%
Specificity Pr( -|~D) 85.19%
Positive predictive value Pr( D| +) 60.00%
Negative predictive value Pr(~D| -) 82.14%
Correctly classified 76.32%
Area under ROC curve = 0.7828

Confusion Matrix:
Truth
Prediction Class1 Class2
Class1 227 27
Class2 23 223
Classification Metrics:
Accuracy: 0.9000
Precision: 0.8937
Recall: 0.9080
F1 Score: 0.9008
AUC-ROC: 0.9406

Hyperparameter Tuning
Most machine learning models have hyperparameters — settings that must be chosen before training begins and that can dramatically affect performance. The number of trees in a random forest, the regularization penalty in LASSO, the depth of a decision tree, the learning rate of a neural network: none of these are learned from the data, and all of them matter.
The naive approach — trying a few values by hand and picking whatever looks best on the test set — is statistically dangerous. If you try 100 hyperparameter combinations and select the one with the best test-set performance, you have effectively used the test set for model selection, and your reported performance is optimistically biased. The solution is to use a validation set (separate from both training and test) or, better, to use cross-validation within the training set to evaluate hyperparameter choices, reserving the test set for final evaluation only.
Grid Search
Grid search evaluates every combination of a predefined set of hyperparameter values. If you specify 5 values for the regularization parameter and 4 values for the tree depth, grid search trains and evaluates 5 x 4 = 20 models. It is simple, exhaustive, and parallelizable, but becomes prohibitively expensive when the hyperparameter space is large (the "curse of dimensionality" applies to hyperparameter grids too).
Random Search
Random search samples hyperparameter combinations randomly from specified distributions. It is often more efficient than grid search because it explores the space more broadly. A classic result by Bergstra and Bengio (2012) showed that random search finds good hyperparameters in fewer evaluations than grid search, especially when some hyperparameters matter much more than others.
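scikit-learn implements this as RandomizedSearchCV. A minimal sketch on synthetic data (the parameter distributions below are illustrative, not recommendations):

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)

param_distributions = {
    "n_estimators": randint(50, 300),      # sampled from a range, not enumerated
    "min_samples_leaf": randint(1, 11),
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions,
    n_iter=10,                  # only 10 sampled combinations, vs a full grid
    cv=3,
    scoring="neg_mean_squared_error",
    random_state=42,
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best CV RMSE: {(-search.best_score_) ** 0.5:.2f}")
```

Swapping distributions for lists and RandomizedSearchCV for GridSearchCV recovers exhaustive grid search; the random version scales much better as more hyperparameters are added.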
Bayesian Optimization
Bayesian optimization builds a probabilistic model of the objective function (e.g., cross-validated accuracy as a function of hyperparameters) and uses it to intelligently choose the next combination to evaluate. It focuses exploration on promising regions of the hyperparameter space, making it more sample-efficient than grid or random search. Libraries like optuna (Python) and mlrMBO (R) implement this approach.
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor
# Load and split data
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
housing.data, housing.target, test_size=0.2, random_state=42
)
# Define the hyperparameter grid
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [10, 20, None],
'min_samples_leaf': [1, 5, 10]
}
# Grid search with 5-fold CV (3 x 3 x 3 = 27 combinations)
grid_search = GridSearchCV(
RandomForestRegressor(random_state=42),
param_grid,
cv=5,
scoring='neg_mean_squared_error',
n_jobs=-1,
verbose=1
)
grid_search.fit(X_train, y_train)
# Results
print(f"\nBest hyperparameters:")
for param, val in grid_search.best_params_.items():
print(f" {param}: {val}")
print(f"\nBest CV RMSE: {(-grid_search.best_score_)**0.5:.4f}")
# Evaluate best model on held-out TEST set
from sklearn.metrics import mean_squared_error
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
test_rmse = mean_squared_error(y_test, y_pred) ** 0.5  # avoids the deprecated squared= argument
print(f"Test RMSE (final, unbiased): {test_rmse:.4f}")
* Stata: Manual hyperparameter tuning via cross-validation loop
* Stata does not have a built-in GridSearchCV equivalent,
* so we loop over candidate values manually.
sysuse auto, clear
set seed 42
* Create 5 folds for cross-validation
gen fold = ceil(5 * runiform())
* Define candidate penalty values (ridge/LASSO-like tuning)
* We'll tune the number of predictors included via stepwise
local best_rmse = 99999
local best_spec = ""
* Loop over candidate model specifications
foreach spec in "mpg weight" "mpg weight length" "mpg weight length foreign" "mpg weight length foreign headroom trunk" {
local total_se = 0
local n_test = 0
forvalues k = 1/5 {
quietly reg price `spec' if fold != `k'
quietly predict double yhat_`k' if fold == `k'
quietly gen double se_`k' = (price - yhat_`k')^2 if fold == `k'
quietly summarize se_`k'
local total_se = `total_se' + r(sum)
local n_test = `n_test' + r(N)
drop yhat_`k' se_`k'
}
local cv_rmse = sqrt(`total_se' / `n_test')
display "Spec: `spec' => CV RMSE = " %9.2f `cv_rmse'
if `cv_rmse' < `best_rmse' {
local best_rmse = `cv_rmse'
local best_spec = "`spec'"
}
}
display _newline "Best specification: `best_spec'"
display "Best CV RMSE: " %9.2f `best_rmse'
# R: Hyperparameter tuning with tidymodels tune_grid()
library(tidymodels)
# Load data and split
data(mtcars)
set.seed(42)
split <- initial_split(mtcars, prop = 0.8)
train <- training(split)
test <- testing(split)
folds <- vfold_cv(train, v = 5)
# Define a random forest model with tunable hyperparameters
rf_spec <- rand_forest(
trees = tune(),
min_n = tune()
) |>
set_engine("ranger") |>
set_mode("regression")
# Create a workflow
rf_wf <- workflow() |> add_model(rf_spec) |> add_formula(mpg ~ .)
# Define a hyperparameter grid
rf_grid <- grid_regular(
trees(range = c(50, 300)),
min_n(range = c(2, 10)),
levels = 4
)
# Tune using 5-fold CV
rf_tuned <- tune_grid(
rf_wf,
resamples = folds,
grid = rf_grid,
metrics = metric_set(rmse, rsq)
)
# Show best results
cat("Top 5 configurations by RMSE:\n")
print(show_best(rf_tuned, metric = "rmse", n = 5))
# Finalize workflow with best hyperparameters
best_params <- select_best(rf_tuned, metric = "rmse")
final_wf <- finalize_workflow(rf_wf, best_params)
final_fit <- last_fit(final_wf, split)
cat("\nFinal test set metrics:\n")
print(collect_metrics(final_fit))
Fitting 5 folds for each of 27 candidates, totalling 135 fits

Best hyperparameters:
  max_depth: 20
  min_samples_leaf: 1
  n_estimators: 200

Best CV RMSE: 0.5027
Test RMSE (final, unbiased): 0.4986
Spec: mpg weight => CV RMSE =   2648.93
Spec: mpg weight length => CV RMSE =   2541.37
Spec: mpg weight length foreign => CV RMSE =   2486.12
Spec: mpg weight length foreign headroom trunk => CV RMSE =   2592.81

Best specification: mpg weight length foreign
Best CV RMSE:   2486.12
Top 5 configurations by RMSE:
# A tibble: 5 x 8
  trees min_n .metric .estimator  mean     n std_err .config
  <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>
1   216     2 rmse    standard    2.68     5   0.412 Preprocessor1_Model06
2   300     2 rmse    standard    2.71     5   0.398 Preprocessor1_Model08
3   133     2 rmse    standard    2.73     5   0.425 Preprocessor1_Model04
4   216     5 rmse    standard    2.79     5   0.381 Preprocessor1_Model07
5   300     5 rmse    standard    2.82     5   0.393 Preprocessor1_Model09

Final test set metrics:
# A tibble: 2 x 4
  .metric .estimator .estimate .config
  <chr>   <chr>          <dbl> <chr>
1 rmse    standard       2.54  Preprocessor1_Model1
2 rsq     standard       0.867 Preprocessor1_Model1
Overfitting Diagnostics
How can you tell whether a model is underfitting, overfitting, or fitting well? The most informative diagnostic tools are learning curves and validation curves. These plots show not just a final performance number but how performance changes as you vary the training set size or a hyperparameter. They make the bias-variance trade-off visible.
Learning Curves
A learning curve plots both training-set and validation-set performance as the size of the training set increases. The patterns are diagnostic:
- Underfitting: Both training and validation scores are low and converge quickly. Adding more data will not help — the model is too simple to capture the pattern. You need a more flexible model.
- Overfitting: Training score is high, but validation score is significantly lower. The gap between the two curves is large. Adding more data can help (the validation curve slowly rises), or you can regularize the model.
- Good fit: Training score is high, validation score is close to it, and both stabilize as the training set grows. The model has learned the true pattern without memorizing noise.
Validation Curves
A validation curve plots training and validation performance as a function of a single hyperparameter (e.g., tree depth, regularization strength). It reveals the "sweet spot" — the hyperparameter value where validation performance peaks before overfitting takes over.
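A minimal sketch of this idea using scikit-learn's validation_curve, varying a random forest's max_depth. The synthetic dataset and the candidate depth grid below are illustrative stand-ins, not part of the module's examples:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import validation_curve

# Synthetic regression data stands in for a real dataset
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=42)

# Score the model across a grid of tree depths, with 5-fold CV at each value
depths = [2, 4, 8, 16]
train_scores, val_scores = validation_curve(
    RandomForestRegressor(n_estimators=25, random_state=42),
    X, y,
    param_name="max_depth",
    param_range=depths,
    cv=5,
    scoring="neg_mean_squared_error",
    n_jobs=-1,
)

# Convert negative MSE to RMSE; a widening train/validation gap at large
# depths is the overfitting signature the text describes
for d, tr, va in zip(depths, train_scores, val_scores):
    print(f"max_depth={d:>2}  train RMSE={np.sqrt(-tr.mean()):8.2f}  "
          f"val RMSE={np.sqrt(-va.mean()):8.2f}")
```

Plotting these two RMSE series against max_depth gives the validation curve; the depth where validation RMSE bottoms out is the "sweet spot."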
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import learning_curve
import matplotlib.pyplot as plt
# Load data
housing = fetch_california_housing()
X, y = housing.data, housing.target
# Compute learning curve
train_sizes, train_scores, val_scores = learning_curve(
RandomForestRegressor(n_estimators=100, random_state=42),
X, y,
train_sizes=np.linspace(0.1, 1.0, 10),
cv=5,
scoring='neg_mean_squared_error',
n_jobs=-1
)
# Convert to RMSE
train_rmse = np.sqrt(-train_scores)
val_rmse = np.sqrt(-val_scores)
# Plot
plt.figure(figsize=(8, 5))
plt.plot(train_sizes, train_rmse.mean(axis=1), 'o-', label='Training RMSE')
plt.plot(train_sizes, val_rmse.mean(axis=1), 'o-', label='Validation RMSE')
plt.fill_between(train_sizes,
val_rmse.mean(axis=1) - val_rmse.std(axis=1),
val_rmse.mean(axis=1) + val_rmse.std(axis=1), alpha=0.1)
plt.xlabel('Training Set Size')
plt.ylabel('RMSE')
plt.title('Learning Curve — Random Forest')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Print numerical summary
print("Learning Curve Summary")
print("=" * 55)
print(f"{'Train Size':>12} {'Train RMSE':>12} {'Val RMSE':>12}")
print("-" * 55)
for i in range(len(train_sizes)):
    print(f"{train_sizes[i]:>12,} {train_rmse.mean(axis=1)[i]:>12.4f} {val_rmse.mean(axis=1)[i]:>12.4f}")
* Stata: Manual learning curve computation
* Train on increasing fractions of data, evaluate on held-out test set
sysuse auto, clear
set seed 42
* Reserve 20% as permanent test set
gen rand = runiform()
gen byte is_test = (rand > 0.8)
* Count training observations
quietly count if is_test == 0
local n_train = r(N)
* Generate a random ordering for training observations
gen train_order = runiform() if is_test == 0
sort is_test train_order
* Loop over increasing training set fractions
display "Learning Curve: RMSE by Training Set Size"
display "==========================================="
display "{txt} Train Size Train RMSE Test RMSE"
display "-------------------------------------------"
foreach frac in 0.2 0.4 0.6 0.8 1.0 {
local n_use = floor(`frac' * `n_train')
* Mark which training obs to use
gen byte use_train = (_n <= `n_use') & (is_test == 0)
* Fit model on subset
quietly reg price mpg weight length if use_train == 1
quietly predict double yhat
* Train RMSE
quietly gen double se_train = (price - yhat)^2 if use_train == 1
quietly summarize se_train
local rmse_train = sqrt(r(mean))
* Test RMSE
quietly gen double se_test = (price - yhat)^2 if is_test == 1
quietly summarize se_test
local rmse_test = sqrt(r(mean))
display %12.0f `n_use' %12.2f `rmse_train' %12.2f `rmse_test'
drop use_train yhat se_train se_test
}
# R: Custom learning curve function
library(randomForest)
library(rsample)
# Load data and split
data(mtcars)
set.seed(42)
split <- initial_split(mtcars, prop = 0.8)
train <- training(split)
test <- testing(split)
# Define training set fractions
fractions <- seq(0.3, 1.0, by = 0.1)
results <- data.frame(frac = fractions, n = integer(length(fractions)),
train_rmse = numeric(length(fractions)),
test_rmse = numeric(length(fractions)))
for (i in seq_along(fractions)) {
n_use <- floor(fractions[i] * nrow(train))
idx <- sample(nrow(train), n_use)
train_sub <- train[idx, ]
# Train model on subset
rf <- randomForest(mpg ~ ., data = train_sub, ntree = 100)
# Train RMSE
train_pred <- predict(rf, train_sub)
results$train_rmse[i] <- sqrt(mean((train_sub$mpg - train_pred)^2))
# Test RMSE
test_pred <- predict(rf, test)
results$test_rmse[i] <- sqrt(mean((test$mpg - test_pred)^2))
results$n[i] <- n_use
}
# Print results
cat("Learning Curve Summary\n")
cat("==========================================\n")
cat(sprintf("%12s %12s %12s\n", "Train Size", "Train RMSE", "Test RMSE"))
cat("------------------------------------------\n")
for (i in 1:nrow(results)) {
cat(sprintf("%12d %12.4f %12.4f\n", results$n[i], results$train_rmse[i], results$test_rmse[i]))
}
# Plot
plot(results$n, results$test_rmse, type = "o", col = "red",
xlab = "Training Set Size", ylab = "RMSE",
main = "Learning Curve - Random Forest",
ylim = c(0, max(results$test_rmse) * 1.1))
lines(results$n, results$train_rmse, type = "o", col = "blue")
legend("topright", legend = c("Training", "Test"),
col = c("blue", "red"), lty = 1, pch = 1)
Learning Curve Summary
=======================================================
Train Size Train RMSE Val RMSE
-------------------------------------------------------
1,651 0.1834 0.5823
3,302 0.1921 0.5547
4,953 0.1956 0.5388
6,604 0.1963 0.5285
8,256 0.1976 0.5218
9,907 0.1987 0.5166
11,558 0.1992 0.5121
13,209 0.1998 0.5092
14,860 0.2001 0.5068
16,512 0.2005 0.5047

Learning Curve: RMSE by Training Set Size
===========================================
Train Size Train RMSE Test RMSE
-------------------------------------------
12 1847.35 3142.68
23 2012.48 2784.51
35 2098.63 2621.34
47 2134.17 2512.89
59 2156.42 2461.38

Learning Curve Summary
==========================================
Train Size Train RMSE Test RMSE
------------------------------------------
7 1.2341 4.8712
10 1.3028 3.9456
12 1.3894 3.5218
15 1.4216 3.2847
17 1.4587 3.1025
20 1.4812 2.9483
22 1.5034 2.8156
25 1.5198 2.7134

Model Comparison Workflow
In practice, model evaluation is not about computing a single metric for a single model. It is a systematic process of comparing multiple candidate models to select the best one for your task. A reliable model comparison workflow follows these steps:
- Split your data into training and test sets (or use nested cross-validation for small datasets). The test set is locked away until the very end.
- Define candidate models: OLS, LASSO, random forest, gradient boosting, neural network, etc. Include simple baselines (e.g., "always predict the mean") to ensure your model adds value.
- Tune hyperparameters for each candidate using cross-validation on the training set only. Each model gets its best hyperparameters.
- Compare cross-validated performance across models. Use the same folds for all models to ensure a fair comparison.
- Select the best model and evaluate it once on the test set. This is your final, unbiased performance estimate. Do not go back and try more models after seeing the test-set result.
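The steps above can be sketched in scikit-learn. Synthetic data and the candidate set here are illustrative; the key detail is that a single KFold object is reused so every model is scored on identical folds:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression, LassoCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=400, n_features=10, noise=15.0, random_state=42)
# Step 1: lock away a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 2: candidates, including a trivial baseline
candidates = {
    "mean baseline": DummyRegressor(strategy="mean"),
    "OLS": LinearRegression(),
    "LASSO (CV-tuned)": LassoCV(cv=5),
    "random forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

# Steps 3-4: one KFold object shared by all models => identical folds
folds = KFold(n_splits=5, shuffle=True, random_state=42)
cv_rmse = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_train, y_train, cv=folds,
                             scoring="neg_mean_squared_error")
    cv_rmse[name] = float(np.sqrt(-scores.mean()))
    print(f"{name:<18} CV RMSE = {cv_rmse[name]:8.2f}")

# Step 5: refit the CV winner on all training data, score ONCE on the test set
best_name = min(cv_rmse, key=cv_rmse.get)
best = candidates[best_name].fit(X_train, y_train)
test_rmse = mean_squared_error(y_test, best.predict(X_test)) ** 0.5
print(f"Winner: {best_name}; test RMSE = {test_rmse:.2f}")
```

On this linear synthetic data the linear models should beat the baseline comfortably; the point of the baseline is to make that check explicit rather than assumed.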
Prediction Task Evaluation
- Goal: maximize out-of-sample prediction accuracy
- Metrics: RMSE, MAE, AUC-ROC, F1
- Method: cross-validation + held-out test set
- Flexibility is a virtue if it improves predictions
- Examples: forecasting, targeting, classification
Causal Task Evaluation
- Goal: estimate a causal effect (ATE, CATE, ATT)
- Metrics: bias, coverage of confidence intervals, MSE of the treatment effect estimate
- Method: simulation studies, placebo tests, sensitivity analysis
- Interpretability and identification matter more than raw fit
- Examples: policy evaluation, treatment effects, DiD
The distinction above is fundamental. The entire model evaluation framework in this module — cross-validation, RMSE, AUC-ROC, hyperparameter tuning — is designed for prediction tasks. When the goal is causal inference, the evaluation criteria are different: you care about whether the estimated treatment effect is unbiased and whether the confidence intervals have correct coverage, not about whether the model predicts the outcome well. A model can be an excellent predictor but a terrible estimator of causal effects, and vice versa. See Module 11D (Causal ML) for evaluation strategies specific to causal tasks.
Common Pitfalls
1. Data Leakage
Data leakage occurs when information from the test set (or from the future, or from the outcome itself) sneaks into the training process. It is the most common and most dangerous mistake in applied ML. Examples include: normalizing features using the mean and standard deviation of the entire dataset (including test observations), using variables that are consequences of the outcome rather than predictors, or including the outcome variable as a feature through an indirect path. Leakage produces models that look spectacular during evaluation but fail completely in deployment. The fix: every preprocessing step (scaling, encoding, imputation) must be fit on the training set only and then applied to the test set.
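The fix can be sketched with a scikit-learn Pipeline, which re-fits every preprocessing step inside each training fold; the dataset below is synthetic and the contrast with the leaky version is purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=20, random_state=42)

# WRONG: scaling the full dataset first lets test-fold means and SDs
# influence the training folds
X_leaky = StandardScaler().fit_transform(X)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=5)

# RIGHT: the pipeline fits the scaler on each fold's training portion only,
# then applies it to that fold's held-out portion
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clean = cross_val_score(pipe, X, y, cv=5)

print(f"leaky CV accuracy: {leaky.mean():.4f}")
print(f"clean CV accuracy: {clean.mean():.4f}")
```

With simple standardization the numerical gap may be small; the leakage from fitting a scaler globally is mild compared to, say, global target encoding or imputation, but the pipeline pattern eliminates the whole class of error.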
2. Ignoring Temporal Structure
If your data has a time dimension — panel data, time series, event data — a random train-test split is invalid. It allows the model to "peek into the future," using observations from 2022 to predict outcomes in 2019. This produces an optimistic performance estimate that does not reflect real-world deployment, where only past data is available. Always use a temporal split: train on data from earlier periods and test on later periods. For cross-validation, use time-series CV (expanding or sliding window), not standard K-fold.
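A minimal sketch using scikit-learn's TimeSeriesSplit, which implements the expanding-window scheme described above; the 24-observation series is a placeholder:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n = 24  # e.g., 24 quarterly observations in time order
X = np.arange(n).reshape(-1, 1)

# Each fold trains on an expanding window of earlier observations
# and tests on the block that immediately follows it
tscv = TimeSeriesSplit(n_splits=4)
for i, (train_idx, test_idx) in enumerate(tscv.split(X), start=1):
    print(f"fold {i}: train obs 0-{train_idx[-1]}, "
          f"test obs {test_idx[0]}-{test_idx[-1]}")
    assert train_idx.max() < test_idx.min()  # never peek into the future
```

Note the rows must already be sorted by time; unlike KFold, there is no shuffling, and early observations are never predicted from later ones.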
3. Choosing the Wrong Metric
The choice of evaluation metric should reflect the costs of different types of errors in your application. Accuracy is almost always the wrong metric for imbalanced classification. RMSE penalizes large errors heavily, which may or may not be appropriate. MAPE is undefined for zero values. Using the wrong metric means optimizing for the wrong goal: you may select a model that is "best" according to a metric that does not align with the actual decision problem. Always ask: "What does a prediction error cost in this application?" and choose metrics accordingly.
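A toy illustration of the accuracy trap on imbalanced classes (the 5% positive rate is arbitrary): a "model" that always predicts the majority class looks accurate while missing every positive case.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, f1_score

rng = np.random.default_rng(42)
y_true = (rng.random(1000) < 0.05).astype(int)  # roughly 5% positives
y_pred = np.zeros(1000, dtype=int)              # "always predict negative"

# Accuracy looks excellent because negatives dominate...
print(f"accuracy: {accuracy_score(y_true, y_pred):.3f}")
# ...but the model detects none of the cases we actually care about
print(f"recall:   {recall_score(y_true, y_pred, zero_division=0):.3f}")
print(f"F1:       {f1_score(y_true, y_pred, zero_division=0):.3f}")
```

If positives are the costly errors (defaults, fraud, disease), recall or F1 tracks the real decision problem; accuracy does not.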
References
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). Springer. (Available free at hastie.su.domains/ElemStatLearn)
- James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning. Springer. (Available free at statlearning.com)
- Athey, S., & Imbens, G. W. (2019). Machine Learning Methods That Economists Should Know About. Annual Review of Economics, 11, 685-725.
- Bergstra, J., & Bengio, Y. (2012). Random Search for Hyper-Parameter Optimization. Journal of Machine Learning Research, 13, 281-305.
- Mullainathan, S., & Spiess, J. (2017). Machine Learning: An Applied Econometric Approach. Journal of Economic Perspectives, 31(2), 87-106.