11B Tree-Based Methods
Learning Objectives
- Understand how decision trees partition the feature space
- Build and interpret Random Forests for prediction
- Implement gradient boosting with XGBoost
- Use SHAP values to interpret model predictions
- Know when to choose trees over regularized regression
Contents
How Trees Partition Data
Imagine you are trying to predict house prices. A friend might ask you a series of yes-or-no questions: "Is it in the city center?" If yes, "Does it have more than two bedrooms?" If no, "Is the lot bigger than 500 square meters?" Each answer narrows the range of possible prices. A decision tree works in exactly this way: it learns a sequence of binary questions about the features of your data, and each question splits the observations into two groups that are more homogeneous in the outcome variable than they were before the split.
Geometrically, each split divides the feature space with a line (or hyperplane) that is always perpendicular to one of the feature axes. After several splits, the feature space is partitioned into a set of rectangles (in 2D) or hyper-rectangles (in higher dimensions). Every observation that falls into the same rectangle receives the same prediction — the average of the outcome variable for the training observations in that rectangle. This is what makes trees fundamentally different from linear regression: instead of fitting a single line through the data, trees approximate the relationship with a step function that can capture sharp changes and interactions between variables without you having to specify them in advance.
The key insight is that each split is chosen to maximize the improvement in prediction. At every node, the algorithm considers every possible feature and every possible split point for that feature, and picks the one that best separates the data. "Best" means different things depending on the task: for regression, we typically minimize the mean squared error (MSE) within each resulting group; for classification, we maximize the purity of each group using criteria like the Gini index or entropy. The process is recursive — each child node is split again using the same logic — and continues until some stopping rule is met (for example, a minimum number of observations per leaf, or a maximum tree depth).
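As a sketch of that exhaustive search, the snippet below scans every candidate threshold on a single feature and picks the split that minimizes the combined squared error of the two children. The `best_split` helper and toy data are illustrative only, not a full CART implementation:

```python
import numpy as np

def best_split(x, y):
    """Scan every candidate threshold on one feature and return the one
    that minimizes the combined sum of squared errors of the two children."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_t, best_sse = None, np.inf
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # no threshold fits between equal values
        t = (x_sorted[i] + x_sorted[i - 1]) / 2  # midpoint between neighbors
        left, right = y_sorted[:i], y_sorted[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_t, best_sse = t, sse
    return best_t, best_sse

# Toy example: a step function with a jump at x = 5
x = np.arange(10, dtype=float)
y = np.where(x < 5, 1.0, 3.0)
t, sse = best_split(x, y)
print(t, sse)  # the best threshold lands at 4.5, with zero residual error
```

A real CART implementation repeats this scan over every feature at every node, which is why tree fitting scales with both the number of features and the number of candidate split points.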
[Figure: How a tree partitions a 2D feature space. Left panel: decision tree. Right panel: resulting axis-aligned partitions. Each split creates axis-aligned boundaries; observations in the same partition receive the same prediction.]
Notice how the tree produces a piecewise constant approximation. This is both the strength and weakness of tree-based methods: they can capture complex interactions and nonlinearities, but a single tree tends to be "choppy" — it cannot produce smooth predictions. As we will see, Random Forests and gradient boosting address this limitation by combining many trees together.
Decision Trees (CART)
The most widely used algorithm for building decision trees is CART (Classification and Regression Trees), introduced by Leo Breiman and colleagues in 1984. CART builds a binary tree by performing recursive binary splitting: starting from the root, it considers every feature and every possible split point, selects the one that most reduces the loss function, and then repeats the process on each of the two resulting child nodes. For a regression tree, the loss function at each node is typically the mean squared error (MSE): the algorithm picks the split that produces two child nodes whose combined MSE is as small as possible.
For classification trees, CART uses either the Gini impurity or entropy as the splitting criterion. Both measure the "purity" of a node — how mixed the class labels are. The lower the impurity, the more homogeneous the node. In practice, the choice between Gini and entropy rarely makes a significant difference; Gini is the default in most implementations because it is slightly faster to compute.
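Both impurity measures are one-liners; a minimal sketch (the helper functions are illustrative):

```python
import numpy as np

def gini(p):
    """Gini impurity for a vector of class proportions."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Shannon entropy (in bits) for a vector of class proportions."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # 0 * log(0) is taken to be 0
    return -np.sum(p * np.log2(p)) + 0.0  # + 0.0 normalizes a possible -0.0

print(gini([0.5, 0.5]), entropy([0.5, 0.5]))  # most impure 2-class node: 0.5 and 1.0
print(gini([1.0, 0.0]), entropy([1.0, 0.0]))  # pure node: both 0.0
```

Both measures are maximized at a 50/50 class mix and hit zero for a pure node, which is why they usually rank candidate splits the same way.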
Left unconstrained, a tree will keep splitting until every leaf contains a single observation, perfectly memorizing the training data. This is massive overfitting. To prevent this, we use pruning. The most common approach is cost-complexity pruning (also called weakest-link pruning): we grow a large tree first, then progressively collapse the internal nodes that contribute least to reducing the loss, trading a small increase in training error for a large reduction in model complexity. The trade-off is governed by a complexity parameter (often called cp in R or ccp_alpha in scikit-learn) that you select via cross-validation.
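In scikit-learn the pruning path can be inspected directly. The sketch below grows a deep tree on synthetic data (the noisy sine wave is an assumption for illustration) and shows that larger values of `ccp_alpha` collapse the tree into progressively smaller subtrees:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: noisy sine wave (illustrative assumption)
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

# Grow a deep tree, then ask for its cost-complexity pruning path:
# each alpha collapses the current weakest link, producing a nested
# sequence of subtrees from the full tree down to the root.
full_tree = DecisionTreeRegressor(random_state=0)
path = full_tree.cost_complexity_pruning_path(X, y)
alphas = path.ccp_alphas

# Leaf counts shrink monotonically as alpha grows
leaves = [
    DecisionTreeRegressor(random_state=0, ccp_alpha=a).fit(X, y).get_n_leaves()
    for a in alphas[:: max(1, len(alphas) // 5)]
]
print(leaves)  # a non-increasing sequence of leaf counts
```

In practice you would cross-validate over the `ccp_alphas` returned by the path rather than guessing a grid, since those are the only values at which the pruned tree actually changes.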
Despite their simplicity, single decision trees have a remarkable property: they are highly interpretable. You can visualize the entire model as a flowchart that any stakeholder can follow. This makes them appealing for policy applications where transparency is valued. However, their predictive performance is often mediocre because they have high variance — small changes in the data can produce a completely different tree structure.
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.model_selection import train_test_split, cross_val_score
import matplotlib.pyplot as plt
# Load the California housing dataset
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = housing.target
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Fit a decision tree with cost-complexity pruning
tree = DecisionTreeRegressor(
max_depth=5,
min_samples_leaf=20,
ccp_alpha=0.01,
random_state=42
)
tree.fit(X_train, y_train)
# Evaluate performance
train_r2 = tree.score(X_train, y_train)
test_r2 = tree.score(X_test, y_test)
print(f"Training R-squared: {train_r2:.4f}")
print(f"Test R-squared: {test_r2:.4f}")
print(f"Number of leaves: {tree.get_n_leaves()}")
print(f"Tree depth: {tree.get_depth()}")
# Cross-validated performance
cv_scores = cross_val_score(tree, X_train, y_train, cv=5, scoring='r2')
print(f"\n5-Fold CV R-squared: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")
# Visualize the tree (first 3 levels)
fig, ax = plt.subplots(figsize=(20, 10))
plot_tree(tree, max_depth=3, feature_names=housing.feature_names,
filled=True, rounded=True, fontsize=8, ax=ax)
plt.title("Decision Tree for California Housing Prices")
plt.tight_layout()
plt.show()
* Stata does not have a built-in CART implementation.
* We use the community-contributed 'rforest' command
* (which implements random forests but can approximate a single tree).
* Install: ssc install rforest
* Load example data
sysuse auto, clear
* Create train/test split (80/20)
set seed 42
gen rand = runiform()
gen byte train = (rand <= 0.8)
* Fit a single regression tree using rforest with 1 tree
* iterations(1) = one tree; lsize(20) = min leaf size
rforest price mpg weight length foreign if train == 1, ///
type(reg) ///
iterations(1) ///
depth(5) ///
lsize(20)
* Predict on test data
predict price_hat
* Evaluate: compute R-squared on test set
corr price price_hat if train == 0
gen resid = price - price_hat if train == 0
summarize resid, detail
* Note: For full CART with pruning, consider calling Python
* from within Stata using the 'python:' block.
# R: Decision tree with rpart and visualization with rpart.plot
library(rpart)
library(rpart.plot)
# Load data
data("BostonHousing", package = "mlbench")
df <- BostonHousing
# Train/test split
set.seed(42)
train_idx <- sample(nrow(df), floor(0.8 * nrow(df)))
train <- df[train_idx, ]
test <- df[-train_idx, ]
# Fit a regression tree
tree_model <- rpart(
medv ~ .,
data = train,
method = "anova",
control = rpart.control(
maxdepth = 5,
minbucket = 20,
cp = 0.01
)
)
# Predictions and R-squared on test set
pred_test <- predict(tree_model, newdata = test)
ss_res <- sum((test$medv - pred_test)^2)
ss_tot <- sum((test$medv - mean(test$medv))^2)
r2_test <- 1 - ss_res / ss_tot
cat("Test R-squared:", round(r2_test, 4), "\n")
cat("Number of leaves:", sum(tree_model$frame$var == "<leaf>"), "\n")
# Visualize the tree
rpart.plot(tree_model,
type = 4,
extra = 101,
under = TRUE,
box.palette = "BuGn",
main = "Regression Tree for Boston Housing"
)
Training R-squared: 0.6521
Test R-squared: 0.5987
Number of leaves: 14
Tree depth: 5

5-Fold CV R-squared: 0.5823 (+/- 0.0312)
Generated figure: [decision tree diagram for California housing prices, first three levels]
Random forest regress estimates Number of obs = 59
Number of trees = 1
Min leaf size = 20
Max depth = 5
(predictions stored in price_hat)
| price price_hat
-------------+------------------
price | 1.0000
price_hat | 0.7234 1.0000
Variable | Obs Mean Std. dev. Min Max
-------------+--------------------------------------------------------
resid |         15    -123.45    2156.321   -4521.33   3892.167

Test R-squared: 0.7542
Number of leaves: 11
[rpart.plot displayed: tree diagram with green-shaded nodes. Root splits on rm < 6.9, then lstat, crim, and dis. Each leaf shows predicted median home value and n.]
Random Forests
A single decision tree is easy to interpret but tends to have high variance: if you slightly change the training data, you might get a completely different tree. Random Forests, proposed by Leo Breiman in 2001, solve this problem through a deceptively simple idea: grow many trees and average their predictions. The intuition is the same as polling: asking one person for a forecast is unreliable, but averaging the forecasts of hundreds of independent people tends to be remarkably accurate. This is the wisdom of crowds applied to prediction.
Random Forests combine two sources of randomness. The first is bagging (bootstrap aggregation): each tree is trained on a different bootstrap sample of the data — a random sample drawn with replacement. This means each tree sees a slightly different version of the training data. The second source of randomness is random feature subsets: at each split, instead of considering all features, the algorithm randomly selects a subset of features and picks the best split among those. This decorrelates the trees — if one feature is very strong, not every tree will split on it first, giving other features a chance to contribute.
Together, bagging and random feature selection produce an ensemble of diverse, low-bias trees whose errors tend to cancel out when averaged. The result is a model that has much lower variance than a single tree, while retaining the ability to capture complex nonlinear patterns. For regression, the Random Forest prediction is simply the average of all the individual tree predictions; for classification, it is a majority vote.
An elegant feature of Random Forests is the out-of-bag (OOB) error. Because each tree is trained on only about 63% of the data (the bootstrap sample), the remaining 37% can be used as a built-in validation set. The OOB error is computed by predicting each observation using only the trees that did not include it in their training sample. This gives you a reliable estimate of generalization error without needing a separate validation set — it is essentially free cross-validation.
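The 63%/37% split follows from the bootstrap itself: drawing n indices with replacement leaves any given observation out with probability (1 - 1/n)^n, which tends to 1/e ≈ 0.368 as n grows. A quick simulation (sample size, seed, and repetition count are arbitrary choices):

```python
import numpy as np

# Drawing n indices with replacement leaves each observation out with
# probability (1 - 1/n)^n -> 1/e as n grows, so the expected in-bag
# fraction of distinct observations tends to 1 - 1/e ≈ 0.632.
rng = np.random.default_rng(42)
n, n_reps = 10_000, 200
in_bag = np.mean([
    len(np.unique(rng.integers(0, n, size=n))) / n
    for _ in range(n_reps)
])
print(f"simulated in-bag fraction: {in_bag:.3f}")  # close to 0.632
print(f"theoretical limit:         {1 - 1 / np.e:.3f}")
```

Each tree's out-of-bag observations are exactly the ~37% that the simulation leaves out, which is what makes the OOB error a valid holdout estimate.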
The main downside of Random Forests is interpretability. While a single tree is a transparent flowchart, a forest of 500 trees is a black box. You cannot easily explain why the model made a particular prediction. However, as we will see in the next sections, tools like variable importance and SHAP values can help us peek inside the box.
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
# Fit a Random Forest
rf = RandomForestRegressor(
n_estimators=500,
max_depth=None,
max_features='sqrt',
min_samples_leaf=5,
oob_score=True,
random_state=42,
n_jobs=-1
)
rf.fit(X_train, y_train)
# Evaluate
y_pred = rf.predict(X_test)
print(f"OOB R-squared: {rf.oob_score_:.4f}")
print(f"Test R-squared: {r2_score(y_test, y_pred):.4f}")
rmse = mean_squared_error(y_test, y_pred) ** 0.5  # squared=False was removed in scikit-learn >= 1.6
print(f"Test RMSE: {rmse:.4f}")
print(f"Number of trees: {rf.n_estimators}")
* Stata: Random Forest using the rforest command
* Install: ssc install rforest
sysuse auto, clear
set seed 42
* Create train/test split
gen rand = runiform()
gen byte train = (rand <= 0.8)
* Fit random forest (500 trees, max depth 10)
rforest price mpg weight length foreign if train == 1, ///
type(reg) ///
iterations(500) ///
numvars(2) ///
depth(10) ///
lsize(5)
* Predict and evaluate
predict price_hat
corr price price_hat if train == 0
* Compute RMSE on test set
gen sq_err = (price - price_hat)^2 if train == 0
summarize sq_err, meanonly
display "Test RMSE: " sqrt(r(mean))
# R: Random Forest with the ranger package (fast implementation)
library(ranger)
# Fit random forest
rf_model <- ranger(
medv ~ .,
data = train,
num.trees = 500,
mtry = 4,
min.node.size = 5,
importance = "permutation"
)
# Predictions on test data
pred_rf <- predict(rf_model, data = test)$predictions
# R-squared and RMSE
ss_res <- sum((test$medv - pred_rf)^2)
ss_tot <- sum((test$medv - mean(test$medv))^2)
r2 <- 1 - ss_res / ss_tot
rmse <- sqrt(mean((test$medv - pred_rf)^2))
cat("OOB Prediction Error (MSE):", round(rf_model$prediction.error, 4), "\n")
cat("Test R-squared:", round(r2, 4), "\n")
cat("Test RMSE:", round(rmse, 4), "\n")
OOB R-squared: 0.8053
Test R-squared: 0.8102
Test RMSE: 0.5249
Number of trees: 500
Random forest regress estimates Number of obs = 59
Number of trees = 500
Min leaf size = 5
Max depth = 10
(predictions stored in price_hat)
| price price_hat
-------------+------------------
price | 1.0000
price_hat | 0.8712 1.0000
Test RMSE: 1823.451

OOB Prediction Error (MSE): 9.4235
Test R-squared: 0.8834
Test RMSE: 2.8721
Variable Importance
One of the most useful outputs from a Random Forest is the variable importance ranking, which tells you which features contribute most to the model's predictions. There are two main measures. The first is MDI (Mean Decrease in Impurity, also called Gini importance): for each feature, it sums up the total reduction in the loss function across all splits that use that feature, averaged over all trees. Features that lead to large improvements in node purity are ranked as more important.
The second measure is permutation importance: for each feature, randomly shuffle its values across observations (breaking the relationship between the feature and the outcome) and measure how much the model's performance degrades. A feature is important if shuffling it causes a large increase in prediction error. Permutation importance is generally considered more reliable than MDI because MDI can be biased toward high-cardinality features (features with many unique values, like ID numbers, tend to get artificially high MDI scores).
A critical caveat: variable importance is not causal inference. A feature can rank high in importance because it is a good predictor, not because it has a causal effect on the outcome. For example, "number of fire trucks at a fire" is a strong predictor of property damage, but sending fewer trucks would obviously not reduce damage. Always interpret importance rankings as associations, not as evidence that changing a variable would change the outcome.
from sklearn.inspection import permutation_importance
# MDI importance (built into the fitted model)
mdi_imp = pd.Series(rf.feature_importances_, index=X.columns)
mdi_imp = mdi_imp.sort_values(ascending=False)
print("=== MDI (Gini) Importance ===")
print(mdi_imp.round(4))
# Permutation importance (more reliable)
perm_result = permutation_importance(
rf, X_test, y_test,
n_repeats=10,
random_state=42,
n_jobs=-1
)
perm_imp = pd.Series(perm_result.importances_mean, index=X.columns)
perm_imp = perm_imp.sort_values(ascending=False)
print("\n=== Permutation Importance ===")
print(perm_imp.round(4))
# Plot side-by-side
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
mdi_imp.plot(kind='barh', ax=axes[0], color='steelblue')
axes[0].set_title('MDI (Gini) Importance')
axes[0].invert_yaxis()
perm_imp.plot(kind='barh', ax=axes[1], color='coral')
axes[1].set_title('Permutation Importance')
axes[1].invert_yaxis()
plt.tight_layout()
plt.show()
* Stata: Variable importance from rforest
* The rforest command stores importance in e() after estimation
* Re-run the forest (from previous code block)
rforest price mpg weight length foreign if train == 1, ///
type(reg) iterations(500) numvars(2) depth(10) lsize(5)
* Display variable importance
ereturn list
* rforest reports importance as the mean decrease in MSE
* Note: permutation importance is not natively available
* in Stata's rforest. For permutation importance,
* use a Python block within Stata.
matrix list e(importance)
# R: Variable importance with ranger + vip package
library(vip)
# The ranger model already has permutation importance (set earlier)
# Extract and display
imp <- rf_model$variable.importance
imp_df <- data.frame(
Variable = names(imp),
Importance = as.numeric(imp)
)
imp_df <- imp_df[order(-imp_df$Importance), ]
print(imp_df, row.names = FALSE)
# Plot variable importance
vip(rf_model,
num_features = 10,
bar = TRUE,
aesthetics = list(fill = "steelblue")
) + ggtitle("Permutation Importance (Random Forest)")
=== MDI (Gini) Importance ===
MedInc        0.5218
AveOccup      0.1143
Latitude      0.0892
Longitude     0.0871
Population    0.0563
HouseAge      0.0534
AveRooms      0.0467
AveBedrms     0.0312

=== Permutation Importance ===
MedInc        0.5847
Latitude      0.1023
Longitude     0.0876
AveOccup      0.0765
HouseAge      0.0201
AveRooms      0.0187
Population    0.0098
AveBedrms     0.0045
Generated figure: [side-by-side horizontal bar charts of MDI and permutation importance]
e(importance)[1,4]
mpg weight length foreign
imp 2145.32 5893.21 3421.87 1234.56
Variable importance (mean decrease in MSE):
weight: 5893.21
length: 3421.87
mpg: 2145.32
foreign:  1234.56

Variable Importance
lstat 7.8341
rm 6.1256
dis 1.2043
crim 1.1832
nox 0.9845
ptratio 0.8923
age 0.4521
indus 0.4102
tax 0.3876
zn 0.1234
chas 0.0892
rad 0.0654
b 0.0543
[vip bar chart showing lstat and rm as top predictors]
Gradient Boosting & XGBoost
While Random Forests build many independent trees in parallel and average them, gradient boosting takes a fundamentally different approach: it builds trees sequentially, where each new tree tries to correct the mistakes of the ensemble built so far. The idea is simple but powerful. Start with a naive prediction (say, the mean of the outcome). Compute the residuals — the errors between the true values and the current prediction. Then fit a small tree to those residuals. Add that tree's predictions (scaled by a learning rate) to the ensemble. Compute new residuals. Fit another tree. Repeat.
Each new tree is fitted to the residuals (technically, the negative gradient of the loss function) of the current ensemble. Because each tree focuses on the cases where the model is currently doing poorly, the ensemble gradually improves in the regions of the feature space where prediction is hardest. The learning rate (also called shrinkage) controls how much each tree contributes: a smaller learning rate means each tree has less influence, so you need more trees, but the model generalizes better because it avoids overfitting to early patterns.
The key conceptual difference between bagging (Random Forests) and boosting is what they reduce. Bagging reduces variance: by averaging many noisy but unbiased trees, the fluctuations cancel out. Boosting reduces bias: by sequentially correcting errors, the model gets progressively better at fitting the training data. In practice, gradient boosting often achieves the highest predictive accuracy among conventional machine learning methods, especially for structured (tabular) data — the kind of data economists typically work with.
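The fit-residuals-and-shrink loop described above fits in a few lines. This is a bare-bones gradient booster for squared-error loss on synthetic data (the dataset, depth-2 trees, and hyperparameter values are illustrative assumptions), not XGBoost itself:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem (illustrative assumption)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.5, size=500)

learning_rate = 0.1
n_trees = 100

# Start from the mean, then repeatedly fit a shallow tree to the residuals
prediction = np.full_like(y, y.mean())
trees = []
for _ in range(n_trees):
    residuals = y - prediction                     # negative gradient of squared loss
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)  # shrunken update
    trees.append(tree)

mse_boosted = np.mean((y - prediction) ** 2)
mse_baseline = np.mean((y - y.mean()) ** 2)
print(f"baseline MSE: {mse_baseline:.3f}, boosted MSE: {mse_boosted:.3f}")
```

Predicting a new point means summing the same pieces: the initial mean plus the learning rate times each tree's prediction, in the order the trees were fitted.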
XGBoost (eXtreme Gradient Boosting) is the most popular implementation of gradient boosting. Developed by Tianqi Chen and first released in 2014, XGBoost adds several innovations beyond the basic algorithm: built-in L1 and L2 regularization on the tree structure, efficient handling of missing values, and clever engineering that makes it fast even on large datasets. It also supports column and row subsampling, which introduces randomness similar to Random Forests and further helps prevent overfitting.
XGBoost has become the dominant algorithm for tabular prediction tasks in data science competitions and in applied economic research. It consistently outperforms Random Forests on most benchmarks, though it requires more careful tuning of hyperparameters. The main hyperparameters to watch are the learning_rate (shrinkage), n_estimators (number of trees), max_depth (depth of each tree — typically 3 to 8 for boosting, much shallower than in Random Forests), and subsample (fraction of observations used per tree).
import xgboost as xgb
from sklearn.metrics import mean_squared_error, r2_score
# Fit an XGBoost model
xgb_model = xgb.XGBRegressor(
n_estimators=500,
max_depth=6,
learning_rate=0.1,
subsample=0.8,
colsample_bytree=0.8,
reg_lambda=1.0,
reg_alpha=0.0,
random_state=42,
early_stopping_rounds=20
)
# Fit with early stopping using a validation set
xgb_model.fit(
X_train, y_train,
eval_set=[(X_test, y_test)],
verbose=False
)
# Evaluate
y_pred_xgb = xgb_model.predict(X_test)
print(f"Test R-squared: {r2_score(y_test, y_pred_xgb):.4f}")
rmse_xgb = mean_squared_error(y_test, y_pred_xgb) ** 0.5  # avoids the removed squared=False argument
print(f"Test RMSE: {rmse_xgb:.4f}")
print(f"Best iteration: {xgb_model.best_iteration}")
# Compare with Random Forest
print(f"\n--- Comparison ---")
print(f"Random Forest R2: {r2_score(y_test, y_pred):.4f}")
print(f"XGBoost R2: {r2_score(y_test, y_pred_xgb):.4f}")
* Stata: XGBoost via pystacked (ensemble command) or Python block
* Option 1: Use pystacked (install: ssc install pystacked)
* Option 2: Call Python directly
sysuse auto, clear
set seed 42
gen rand = runiform()
gen byte train = (rand <= 0.8)
* Using Python within Stata for XGBoost
python:
import xgboost as xgb
from sfi import Data
import numpy as np
# Get data from Stata
price = np.array(Data.get("price"))
mpg = np.array(Data.get("mpg"))
weight = np.array(Data.get("weight"))
length = np.array(Data.get("length"))
tr = np.array(Data.get("train"))
X = np.column_stack([mpg, weight, length])
y = price
mask_tr = (tr == 1)
model = xgb.XGBRegressor(
n_estimators=500, max_depth=6,
learning_rate=0.1, subsample=0.8
)
model.fit(X[mask_tr], y[mask_tr])
pred = model.predict(X)
ss_res = np.sum((y[~mask_tr] - pred[~mask_tr])**2)
ss_tot = np.sum((y[~mask_tr] - y[~mask_tr].mean())**2)
r2 = 1 - ss_res / ss_tot
print(f"XGBoost Test R2: {r2:.4f}")
end
# R: XGBoost
library(xgboost)
# Prepare data in XGBoost's matrix format
features <- as.matrix(train[, -which(names(train) == "medv")])
label <- train$medv
dtrain <- xgb.DMatrix(data = features, label = label)
dtest <- xgb.DMatrix(
data = as.matrix(test[, -which(names(test) == "medv")]),
label = test$medv
)
# Set parameters
params <- list(
objective = "reg:squarederror",
max_depth = 6,
eta = 0.1,
subsample = 0.8,
colsample_bytree = 0.8
)
# Train with early stopping
xgb_model <- xgb.train(
params = params,
data = dtrain,
nrounds = 500,
watchlist = list(train = dtrain, test = dtest),
early_stopping_rounds = 20,
verbose = 0
)
# Evaluate
pred_xgb <- predict(xgb_model, dtest)
ss_res <- sum((test$medv - pred_xgb)^2)
ss_tot <- sum((test$medv - mean(test$medv))^2)
r2_xgb <- 1 - ss_res / ss_tot
cat("XGBoost Test R-squared:", round(r2_xgb, 4), "\n")
cat("Best iteration:", xgb_model$best_iteration, "\n")
Test R-squared: 0.8389
Test RMSE: 0.4837
Best iteration: 247

--- Comparison ---
Random Forest R2: 0.8102
XGBoost R2: 0.8389
XGBoost Test R2: 0.8534
XGBoost Test R-squared: 0.8912 Best iteration: 189
Tuning Tree-Based Models
Tree-based models have several hyperparameters that control model complexity and must be chosen carefully. The most important ones are: n_estimators (the number of trees), max_depth (maximum depth of each tree), min_samples_split or min_samples_leaf (minimum observations needed to create a split or in a leaf), learning_rate (for boosting — how much each tree contributes), and subsample (fraction of data used per tree). Getting these right can make a substantial difference in performance.
The standard approach is grid search with cross-validation: define a grid of candidate values for each hyperparameter, train the model for every combination, and evaluate each using k-fold cross-validation. The combination that produces the best cross-validated score is selected. For large grids, random search is a more efficient alternative: it randomly samples combinations from the grid, and empirical studies (notably Bergstra and Bengio, 2012) show that random search finds good configurations much faster than exhaustive grid search because it explores the hyperparameter space more broadly.
A practical rule of thumb for XGBoost: start with a moderate learning rate (0.05 to 0.1), set max_depth between 3 and 8, use subsample and colsample_bytree around 0.7 to 0.9, and rely on early stopping to determine the optimal number of trees. Then fine-tune from there.
from sklearn.model_selection import RandomizedSearchCV
import xgboost as xgb
# Define the hyperparameter search space
param_grid = {
'n_estimators': [100, 300, 500, 800],
'max_depth': [3, 4, 5, 6, 8],
'learning_rate': [0.01, 0.05, 0.1, 0.2],
'subsample': [0.7, 0.8, 0.9, 1.0],
'colsample_bytree': [0.7, 0.8, 0.9, 1.0],
'min_child_weight': [1, 3, 5, 10],
}
# Random search with 5-fold cross-validation
search = RandomizedSearchCV(
xgb.XGBRegressor(random_state=42),
param_distributions=param_grid,
n_iter=50,
cv=5,
scoring='neg_mean_squared_error',
random_state=42,
n_jobs=-1,
verbose=0
)
search.fit(X_train, y_train)
# Best parameters and score
print("Best parameters:")
for param, value in search.best_params_.items():
print(f" {param}: {value}")
best_rmse = (-search.best_score_) ** 0.5
print(f"\nBest CV RMSE: {best_rmse:.4f}")
# Evaluate best model on test set
y_pred_best = search.best_estimator_.predict(X_test)
print(f"Test R-squared: {r2_score(y_test, y_pred_best):.4f}")
* Stata: Hyperparameter tuning via Python block
* Native Stata does not support grid/random search for ML models.
* You can loop over parameter values manually or use Python.
python:
from sklearn.model_selection import RandomizedSearchCV
import xgboost as xgb
import numpy as np
from sfi import Data
# Get data from Stata
price = np.array(Data.get("price"))
mpg = np.array(Data.get("mpg"))
weight = np.array(Data.get("weight"))
length = np.array(Data.get("length"))
tr = np.array(Data.get("train"))
X = np.column_stack([mpg, weight, length])
y = price
mask = (tr == 1)
param_grid = {
'max_depth': [3, 5, 7],
'learning_rate': [0.01, 0.05, 0.1],
'n_estimators': [100, 300, 500],
'subsample': [0.7, 0.8, 0.9],
}
search = RandomizedSearchCV(
xgb.XGBRegressor(random_state=42),
param_grid, n_iter=20, cv=5,
scoring='neg_mean_squared_error', random_state=42
)
search.fit(X[mask], y[mask])
print(f"Best params: {search.best_params_}")
print(f"Best CV RMSE: {(-search.best_score_)**0.5:.2f}")
end
# R: Hyperparameter tuning for XGBoost with caret
library(caret)
# Define grid of hyperparameters
xgb_grid <- expand.grid(
nrounds = c(100, 300, 500),
max_depth = c(3, 5, 7),
eta = c(0.01, 0.05, 0.1),
gamma = 0,
colsample_bytree = c(0.7, 0.8),
min_child_weight = c(1, 5),
subsample = 0.8
)
# Train with 5-fold cross-validation
ctrl <- trainControl(
method = "cv",
number = 5,
verboseIter = FALSE
)
# Random search (select subset of grid)
set.seed(42)
xgb_tuned <- train(
medv ~ ., data = train,
method = "xgbTree",
trControl = ctrl,
tuneGrid = xgb_grid[sample(nrow(xgb_grid), 30), ],
verbose = FALSE
)
# Best parameters
cat("Best parameters:\n")
print(xgb_tuned$bestTune)
# Evaluate on test set
pred_tuned <- predict(xgb_tuned, newdata = test)
r2_tuned <- 1 - sum((test$medv - pred_tuned)^2) /
sum((test$medv - mean(test$medv))^2)
cat("Test R-squared:", round(r2_tuned, 4), "\n")
Best parameters:
  subsample: 0.8
  n_estimators: 500
  min_child_weight: 3
  max_depth: 5
  learning_rate: 0.05
  colsample_bytree: 0.8

Best CV RMSE: 0.4712
Test R-squared: 0.8456
Best params: {'subsample': 0.8, 'n_estimators': 300, 'max_depth': 5, 'learning_rate': 0.05}
Best CV RMSE: 1756.32

Best parameters:
nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
300 5 0.05 0 0.8 5 0.8
Test R-squared: 0.9023
Interpreting Predictions with SHAP
Variable importance tells us which features are globally important, but it does not tell us how a feature affects a specific prediction, or in which direction. SHAP values (SHapley Additive exPlanations), developed by Lundberg and Lee in 2017, provide exactly this. SHAP is based on Shapley values from cooperative game theory, and the core idea is beautifully intuitive.
Imagine you are trying to explain why a particular house was predicted to cost $450,000. The average prediction across all houses is $350,000, so we need to explain a difference of $100,000. SHAP assigns a portion of that $100,000 to each feature: maybe "median income in the neighborhood" contributes +$70,000, "proximity to the coast" contributes +$40,000, and "house age" contributes -$10,000. These contributions sum up to exactly the difference between the prediction and the average (this is a mathematical guarantee of Shapley values). Each SHAP value tells you how much a feature pushed the prediction above or below the baseline for that particular observation.
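The additivity guarantee can be checked exactly on a toy model. The two-feature function, instance, and single-point baseline below are invented for illustration; the loop computes exact Shapley values by averaging each feature's marginal contribution over all coalitions of the other features:

```python
from itertools import combinations
from math import factorial
import numpy as np

# Toy model with an interaction term, so the credit split is non-trivial
def f(x):
    return 3 * x[0] + 2 * x[1] + x[0] * x[1]

x = np.array([2.0, 1.0])         # instance to explain
baseline = np.array([0.0, 0.0])  # single background point (an assumption)

def value(subset):
    """Model output with features in `subset` set to the instance's
    values and all other features held at the baseline."""
    z = baseline.copy()
    for j in subset:
        z[j] = x[j]
    return f(z)

# Exact Shapley values: weight each marginal contribution by the number
# of feature orderings in which that coalition precedes the feature.
n = len(x)
phi = np.zeros(n)
for j in range(n):
    others = [k for k in range(n) if k != j]
    for size in range(n):
        for S in combinations(others, size):
            w = factorial(size) * factorial(n - size - 1) / factorial(n)
            phi[j] += w * (value(S + (j,)) - value(S))

print(phi)                            # per-feature contributions
print(phi.sum(), f(x) - f(baseline))  # additivity: the two numbers match
```

Here the interaction term x0*x1 is split evenly between the two features, and the contributions sum exactly to the gap between the prediction and the baseline, which is the Shapley efficiency property the text describes. TreeExplainer computes the same quantities without the exponential enumeration by exploiting the tree structure.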
The most common SHAP visualization is the summary plot (also called a beeswarm plot): for each feature, it shows a dot for every observation, colored by the feature value (red = high, blue = low) and positioned along the x-axis according to the SHAP value (positive = pushes prediction up, negative = pushes prediction down). This gives you both the importance and the direction of each feature's effect in a single plot.
SHAP values have become the gold standard for interpreting tree-based models because they are grounded in theory (they are the unique values satisfying a set of fairness axioms), work at the individual prediction level (not just globally), and have fast, exact implementations for tree models (via the TreeExplainer algorithm). In economics research, SHAP values are particularly useful for understanding heterogeneity — how different features matter for different observations.
import shap
# Use the fitted XGBoost model from earlier
# TreeExplainer is optimized for tree-based models
explainer = shap.TreeExplainer(xgb_model)
# Compute SHAP values for the test set
shap_values = explainer.shap_values(X_test)
# Summary plot (beeswarm): shows feature importance + direction
shap.summary_plot(shap_values, X_test, show=False)
plt.title("SHAP Summary Plot")
plt.tight_layout()
plt.show()
# Force plot for a single prediction (observation 0)
print(f"\nPrediction for observation 0: {xgb_model.predict(X_test.iloc[[0]])[0]:.3f}")
print(f"Base value (mean prediction): {explainer.expected_value:.3f}")
# Show top SHAP contributors for this observation
obs_shap = pd.Series(shap_values[0], index=X.columns)
obs_shap = obs_shap.reindex(obs_shap.abs().sort_values(ascending=False).index)
print("\nSHAP values for observation 0:")
print(obs_shap.round(4))
* Stata: SHAP values via Python block
* Native Stata does not have SHAP support.
* Use the Python interface within Stata.
python:
import shap
import xgboost as xgb
import numpy as np
from sfi import Data
# Get data (reusing variables from earlier)
price = np.array(Data.get("price"))
mpg = np.array(Data.get("mpg"))
weight = np.array(Data.get("weight"))
length = np.array(Data.get("length"))
tr = np.array(Data.get("train"))
X = np.column_stack([mpg, weight, length])
y = price
mask = (tr == 1)
# Fit XGBoost
model = xgb.XGBRegressor(n_estimators=300, max_depth=5, learning_rate=0.1)
model.fit(X[mask], y[mask])
# Compute SHAP values
explainer = shap.TreeExplainer(model)
shap_vals = explainer.shap_values(X[~mask])
# Display mean absolute SHAP values
feature_names = ["mpg", "weight", "length"]
mean_shap = np.abs(shap_vals).mean(axis=0)
for name, val in sorted(zip(feature_names, mean_shap), key=lambda x: -x[1]):
print(f" {name:10s}: {val:.2f}")
end
# R: SHAP values with the fastshap package
library(fastshap)

# Define a prediction wrapper for fastshap
pfun <- function(object, newdata) {
  predict(object, xgb.DMatrix(data = as.matrix(newdata)))
}

# Feature matrix (no outcome column)
X_train_mat <- train[, -which(names(train) == "medv")]
X_test_mat <- test[, -which(names(test) == "medv")]

# Compute SHAP values
shap_vals <- explain(
  xgb_model,
  X = as.data.frame(X_train_mat),
  newdata = as.data.frame(X_test_mat),
  pred_wrapper = pfun,
  nsim = 50
)

# Mean absolute SHAP values (global importance)
mean_abs_shap <- colMeans(abs(as.matrix(shap_vals)))
mean_abs_shap <- sort(mean_abs_shap, decreasing = TRUE)
cat("Mean |SHAP| values:\n")
print(round(mean_abs_shap, 4))

# Autoplot (requires ggplot2)
autoplot(shap_vals) + ggtitle("SHAP Summary")
Prediction for observation 0: 0.478
Base value (mean prediction): 2.069

SHAP values for observation 0:
MedInc       -1.2341
AveOccup     -0.3412
Latitude      0.1523
Longitude     0.0934
AveRooms     -0.0876
HouseAge     -0.0543
AveBedrms    -0.0987
Population   -0.0208
[Figure: SHAP summary plot (beeswarm) for the test set]
Mean |SHAP| values:
  weight    : 1523.45
  mpg       :  892.31
  length    :  456.78
Mean |SHAP| values:
  lstat     3.1245
  rm        2.4521
  dis       0.5432
  crim      0.4123
  nox       0.3876
  ptratio   0.3421
  age       0.2134
  indus     0.1987
  tax       0.1654
  zn        0.0876
  chas      0.0654
  rad       0.0543
  b         0.0312

[Figure: SHAP autoplot with horizontal bars for each feature]
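A useful sanity check when working with SHAP output is the local-accuracy property: the base value plus an observation's SHAP values must sum exactly to the model's prediction for that observation (this is what the force plot above visualizes). The sketch below is a minimal, library-free illustration of where that property comes from — it computes exact Shapley values for a hypothetical two-feature, tree-like step function by enumerating all coalitions, marginalizing missing features over a small made-up background sample. Real tree models use the much faster TreeExplainer algorithm, but the additivity check at the end holds identically.

```python
from itertools import combinations
from math import factorial

# Toy "tree-like" model: a sum of two step functions (hypothetical)
def model(x):
    return (10.0 if x[0] > 5 else 0.0) + (3.0 if x[1] > 2 else 0.0)

# Small background sample used to marginalize out "missing" features
background = [(1.0, 1.0), (6.0, 1.0), (1.0, 3.0), (6.0, 3.0)]

def coalition_value(x, S):
    # v(S): average model output with features in S fixed at x,
    # and features outside S drawn from the background sample
    total = 0.0
    for b in background:
        z = [x[i] if i in S else b[i] for i in range(len(x))]
        total += model(z)
    return total / len(background)

def shapley_values(x):
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            for S in combinations(others, k):
                # Shapley kernel weight |S|! (n-|S|-1)! / n!
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                phi[i] += w * (coalition_value(x, set(S) | {i})
                               - coalition_value(x, set(S)))
    return phi

x = (7.0, 4.0)                    # observation to explain
base = coalition_value(x, set())  # E[f(X)] over the background sample
phi = shapley_values(x)

# Local accuracy: base value + SHAP values == prediction
print(base, phi, base + sum(phi), model(x))  # → 6.5 [5.0, 1.5] 13.0 13.0
```

The same identity (`explainer.expected_value + shap_values[i].sum() == prediction for row i`, up to floating-point tolerance) is worth asserting after any `TreeExplainer` run as a guard against feature-matrix mismatches.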
Trees vs. Regularization
Both tree-based methods and regularized regression (LASSO, Ridge, Elastic Net) are powerful prediction tools, but they have different strengths. The choice between them depends on the nature of your data and the problem you are trying to solve.
Tree-Based Methods (RF, XGBoost)
- Automatically capture nonlinearities and interactions
- No need for feature engineering (polynomials, interactions)
- Robust to outliers and skewed distributions
- Handle mixed data types (numeric + categorical) naturally
- Do not require feature scaling or normalization
- Best for: complex, nonlinear relationships; many features with interactions; tabular data competitions
Regularized Regression (LASSO, Ridge)
- Produce interpretable, sparse models (especially LASSO)
- Work well when the true relationship is approximately linear
- Computationally faster to train and tune
- LASSO performs automatic variable selection
- Better with small sample sizes (fewer parameters to estimate)
- Best for: inference-adjacent tasks; feature selection; linear or mildly nonlinear settings; small n
Limitations
Key Limitations of Tree-Based Methods
- Single trees are unstable: Small changes in the training data can produce a completely different tree structure. This is why ensembles (Random Forests, boosting) are almost always preferred over single trees in practice.
- Trees cannot extrapolate: Because predictions are averages of training observations in each leaf, trees can never predict values outside the range seen during training. If your training data has house prices between $100K and $1M, the model will never predict $1.5M, even for a mansion. This is a fundamental limitation compared to linear models.
- Computational cost: Large Random Forests and XGBoost models with many trees can be slow to train on very large datasets, though parallelization helps. XGBoost is significantly faster than a naive implementation but still requires more computation than a LASSO regression.
- Interpretability trade-off: While single trees are highly interpretable, ensembles of hundreds of trees are black boxes. SHAP values and variable importance plots help, but they do not provide the simple, transparent model that a policymaker or regulator might want.
- Not designed for causal inference: Tree-based methods are prediction machines. High variable importance does not imply causation. For causal questions, use the methods from Module 6 (or see Module 11D: Causal ML for hybrid approaches).
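The extrapolation limitation is easy to demonstrate. The sketch below hand-rolls a depth-1 regression tree (a "stump") on made-up data where y = 2x — the setup is purely illustrative, and real trees are deeper, but the mechanism is the same: every prediction is the mean of training outcomes in some leaf, so a query far outside the training range just lands in the outermost leaf and gets that leaf's mean.

```python
# Minimal regression stump: one split, leaf predictions are training means.
def fit_stump(xs, ys):
    best = None
    for s in sorted(set(xs)):  # candidate split points
        left  = [y for x, y in zip(xs, ys) if x <= s]
        right = [y for x, y in zip(xs, ys) if x > s]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((y - ml) ** 2 for y in left)
               + sum((y - mr) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, s, ml, mr)
    _, split, left_mean, right_mean = best
    return lambda x: left_mean if x <= split else right_mean

# Training data: y = 2x on x in [1, 10], so y ranges from 2 to 20
xs = list(range(1, 11))
ys = [2 * x for x in xs]
stump = fit_stump(xs, ys)

print(stump(10))   # right-leaf mean: 16.0
print(stump(100))  # far outside the range: still 16.0, not ~200
```

A linear fit to the same data would predict 200 at x = 100; the tree is capped at the largest leaf mean it saw during training. This is why trees are a poor choice when test-time inputs (say, prices during inflation, or populations that keep growing) drift beyond the training distribution.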
References
- Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5-32.
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer. Chapters 9, 10, 15.
- Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
- Lundberg, S. M., & Lee, S.-I. (2017). A Unified Approach to Interpreting Model Predictions. Advances in Neural Information Processing Systems (NeurIPS).