11C Neural Networks
Learning Objectives
- Understand how neural networks extend linear models with nonlinear activation functions
- Build a feedforward neural network from scratch in Python (PyTorch)
- Train networks using backpropagation and gradient descent
- Apply regularization techniques (dropout, early stopping) to prevent overfitting
- Know when neural networks are (and aren't) the right choice for economics research
From Linear Models to Neural Networks
You already know linear regression: the predicted outcome is a weighted sum of the features, y = Xβ. This is powerful and interpretable, but it can only learn linear relationships. If the true relationship between your features and the outcome is nonlinear (for example, the effect of education on earnings is different at different levels of education), a linear model will miss it unless you manually engineer the right transformations -- adding squared terms, interactions, polynomials, and so on.
The Factory Assembly Line Analogy
A neural network is like a factory assembly line. Raw materials (your input data -- income, education, age) enter at one end. They pass through multiple processing stages (layers), each one transforming the data a little bit -- combining features, filtering out noise, detecting patterns. At the end of the line, a finished product (a prediction) comes out. Just as a car factory turns raw steel and rubber into a car through a sequence of specialized stations, a neural network turns raw numbers into a prediction through a sequence of learned transformations. The key difference from a linear model is that each station can do nonlinear things -- bending, shaping, and combining -- rather than just adding things up.
A neural network automates the process of feature engineering. The key idea is to stack multiple linear transformations with nonlinear activation functions between them. Take your features X, multiply by a weight matrix W1, add a bias b1, then apply a nonlinear function (like ReLU, which replaces negative values with zero). The result is a new set of "features" -- called a hidden layer. Then take those hidden features, multiply by another weight matrix W2, add another bias b2, and you get the output. By chaining together several of these layers, the network can learn arbitrarily complex nonlinear functions of the input features.
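To make the mechanics concrete, here is a minimal sketch of that forward pass in plain NumPy, with arbitrary made-up weights (W1, b1, W2, b2 follow the notation above; nothing is trained yet):
import numpy as np

def relu(z):
    return np.maximum(0, z)  # replace negative values with zero

# Toy input: 4 observations, 3 features (e.g., income, education, age)
X = np.array([[1.0, 2.0, 3.0],
              [0.5, 1.5, 2.5],
              [2.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)  # 3 features -> 5 hidden units
W2, b2 = rng.normal(size=(5, 1)), np.zeros(1)  # 5 hidden units -> 1 output

hidden = relu(X @ W1 + b1)  # hidden layer: linear transformation + nonlinearity
output = hidden @ W2 + b2   # output layer: plain linear (regression)
print(output.shape)         # (4, 1): one prediction per observation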
This leads to one of the most remarkable results in mathematics: the Universal Approximation Theorem. It states that a neural network with just one hidden layer and enough neurons can approximate any continuous function to any desired degree of accuracy. In other words, neural networks are flexible enough to learn virtually any pattern in the data -- if you have enough data and computational resources. The practical challenge is finding the right weights, which is where training algorithms come in.
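As an illustration of the theorem (not a proof), the sketch below fits a single-hidden-layer network to a sine curve; the 64 hidden units, tanh activation, learning rate, and step count are arbitrary choices that happen to work well:
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.linspace(-3, 3, 200).unsqueeze(1)  # 200 points on [-3, 3]
y = torch.sin(x)                             # a smooth nonlinear target

# One hidden layer, as in the Universal Approximation Theorem
net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=0.01)

for step in range(2000):
    loss = nn.functional.mse_loss(net(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"Final MSE: {loss.item():.6f}")  # approaches zero as capacity and steps grow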
Think of it this way: a linear model draws a single straight line (or plane) through your data. A neural network draws a wiggly, curved surface that can follow the contours of the data as closely as needed. Each hidden layer adds another level of abstraction -- the first layer might learn simple features (like "is this value high or low?"), the second layer might combine those into more complex features (like "is this a pattern of high income AND low education?"), and so on. This hierarchical feature learning is what makes neural networks so powerful for complex data.
A Single Neuron (Perceptron)
Output = σ(w₁x₁ + w₂x₂ + w₃x₃ + b), where σ is a nonlinear activation function
Network Architecture
A feedforward neural network is organized into layers. The input layer receives the features (one neuron per feature). One or more hidden layers perform the transformations. The output layer produces the final prediction. Each neuron in a layer receives inputs from every neuron in the previous layer (this is why these are also called "fully connected" or "dense" layers), applies a linear transformation (weighted sum + bias), and passes the result through an activation function.
Feedforward Neural Network Architecture
Every neuron in one layer connects to every neuron in the next layer ("fully connected"). Each connection has a learnable weight.
The choice of activation function is crucial. Without it, stacking layers would be pointless -- a composition of linear functions is just another linear function. An activation function is like a filter -- it decides which signals to pass through and which to dampen. This nonlinearity is what gives neural networks their power.
| Function | Formula | Range | When to Use |
|---|---|---|---|
| ReLU | max(0, z) | [0, +inf) | Default for hidden layers. Fast, simple, works well in practice. |
| Sigmoid | 1 / (1 + e^(-z)) | (0, 1) | Output layer for binary classification (probability output). |
| Tanh | (e^z - e^(-z)) / (e^z + e^(-z)) | (-1, 1) | Hidden layers when centered output is desired. Often works better than sigmoid. |
| Linear (none) | z | (-inf, +inf) | Output layer for regression (continuous prediction). |
Activation Functions Visualized
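A minimal sketch evaluating the activations from the table above on a few sample values, using torch's built-in implementations:
import torch

z = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
print("ReLU:   ", torch.relu(z))     # negatives clipped to zero
print("Sigmoid:", torch.sigmoid(z))  # squashed into (0, 1)
print("Tanh:   ", torch.tanh(z))     # squashed into (-1, 1)
print("Linear: ", z)                 # identity: passed through unchanged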
Two key design choices are the width (number of neurons per layer) and the depth (number of layers). Wider layers can learn more complex patterns within a single transformation, while deeper networks can learn hierarchical representations -- simple patterns in early layers composed into complex patterns in later layers. In practice, modern networks tend to favor depth over width. A common starting point for tabular economic data is 2-3 hidden layers with 64-256 neurons each.
A subtle but important point: adding more layers and neurons always increases the model's capacity (the set of functions it can represent), but it also increases the risk of overfitting. This is why regularization (Section 4) is essential when training neural networks.
Common Layer Types and What They Do
| Layer Type | What It Does |
|---|---|
| Linear / Dense | The basic building block. Every input connects to every output (nn.Linear in PyTorch). Used for tabular data. |
| Conv2d (Convolutional) | Slides a small filter across an image to detect local patterns (edges, textures). Used for image data -- for example, classifying satellite imagery to estimate regional GDP. |
| LSTM / GRU (Recurrent) | Processes sequential data one step at a time while maintaining a "memory" of previous steps. Used for time series -- for example, forecasting inflation from monthly economic indicators. |
| Transformer / Attention | Processes all positions in a sequence simultaneously, learning which parts to "attend to." The architecture behind GPT, BERT, and modern NLP -- for example, analyzing central bank communications. |
import torch
import torch.nn as nn
# Define a feedforward neural network with PyTorch
# Architecture: 8 inputs -> 128 -> 64 -> 32 -> 1 output
model = nn.Sequential(
# First hidden layer
nn.Linear(8, 128),
nn.ReLU(), # Activation: max(0, z)
# Second hidden layer
nn.Linear(128, 64),
nn.ReLU(),
# Third hidden layer
nn.Linear(64, 32),
nn.ReLU(),
# Output layer (linear activation for regression)
nn.Linear(32, 1)
)
# Print the architecture summary
print(model)
# Count total trainable parameters
total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"\nTotal trainable parameters: {total_params:,}")
# Breakdown: each Linear(in, out) has in*out weights + out biases
for name, param in model.named_parameters():
print(f" {name:15s} shape: {str(list(param.shape)):15s} params: {param.numel():,}")
* Stata: Neural network definition via Python block
* Stata does not have native neural network support.
* Use the python: block to define and train models.
python:
import torch
import torch.nn as nn
# Define a simple feedforward network
# For Stata's auto dataset: 3 features -> predict price
model = nn.Sequential(
nn.Linear(3, 64), # 3 features (mpg, weight, length)
nn.ReLU(),
nn.Linear(64, 32),
nn.ReLU(),
nn.Linear(32, 1) # 1 continuous output (price)
)
print(model)
total = sum(p.numel() for p in model.parameters())
print(f"\nTotal parameters: {total:,}")
end
# R: Neural network definition with torch (R interface to LibTorch)
library(torch)
# Define network architecture using nn_sequential
model <- nn_sequential(
# First hidden layer: 13 inputs (Boston Housing features)
nn_linear(13, 128),
nn_relu(),
# Second hidden layer
nn_linear(128, 64),
nn_relu(),
# Third hidden layer
nn_linear(64, 32),
nn_relu(),
# Output layer (regression)
nn_linear(32, 1)
)
# Print architecture
print(model)
# Count parameters
n_params <- sum(sapply(model$parameters, function(p) p$numel()))
cat("Total trainable parameters:", format(n_params, big.mark = ","), "\n")
Sequential(
  (0): Linear(in_features=8, out_features=128, bias=True)
  (1): ReLU()
  (2): Linear(in_features=128, out_features=64, bias=True)
  (3): ReLU()
  (4): Linear(in_features=64, out_features=32, bias=True)
  (5): ReLU()
  (6): Linear(in_features=32, out_features=1, bias=True)
)

Total trainable parameters: 11,521
  0.weight        shape: [128, 8]        params: 1,024
  0.bias          shape: [128]           params: 128
  2.weight        shape: [64, 128]       params: 8,192
  2.bias          shape: [64]            params: 64
  4.weight        shape: [32, 64]        params: 2,048
  4.bias          shape: [32]            params: 32
  6.weight        shape: [1, 32]         params: 32
  6.bias          shape: [1]             params: 1
Sequential(
  (0): Linear(in_features=3, out_features=64, bias=True)
  (1): ReLU()
  (2): Linear(in_features=64, out_features=32, bias=True)
  (3): ReLU()
  (4): Linear(in_features=32, out_features=1, bias=True)
)

Total parameters: 2,369
An `nn_sequential` module with 7 children:
  (0): nn_linear (13 -> 128)
  (1): nn_relu
  (2): nn_linear (128 -> 64)
  (3): nn_relu
  (4): nn_linear (64 -> 32)
  (5): nn_relu
  (6): nn_linear (32 -> 1)
Total trainable parameters: 12,161
Training Neural Networks
Training a neural network means finding the values of all the weights and biases that minimize a loss function -- a measure of how wrong the predictions are.
Regression Losses
- MSE (nn.MSELoss): Mean Squared Error. Average of squared differences between predicted and actual values. Penalizes large errors heavily. The default for continuous outcomes.
- MAE (nn.L1Loss): Mean Absolute Error. Average of absolute differences. More robust to outliers than MSE.
Classification Losses
- Cross-Entropy (nn.CrossEntropyLoss): For multi-class classification (e.g., predicting industry sector). Expects raw logits, not probabilities.
- BCE (nn.BCEWithLogitsLoss): For binary classification (e.g., predicting default vs. no default). Combines sigmoid + binary cross-entropy for numerical stability. See the sketch below for how each loss is called.
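Here is a minimal sketch showing how each of the four losses is called, with toy predictions and targets invented for illustration:
import torch
import torch.nn as nn

# Regression losses: continuous predictions vs. continuous targets
pred = torch.tensor([[2.5], [0.8]])
target = torch.tensor([[3.0], [1.0]])
print(nn.MSELoss()(pred, target))  # mean of squared errors
print(nn.L1Loss()(pred, target))   # mean of absolute errors

# Cross-entropy: raw logits of shape (N, C) and integer class labels of shape (N,)
logits = torch.tensor([[1.2, -0.3, 0.4], [0.1, 2.0, -1.0]])
labels = torch.tensor([0, 1])
print(nn.CrossEntropyLoss()(logits, labels))

# Binary cross-entropy with logits: raw score and 0/1 target, sigmoid applied internally
score = torch.tensor([[0.7]])
default = torch.tensor([[1.0]])
print(nn.BCEWithLogitsLoss()(score, default))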
The workhorse algorithm for training is backpropagation, which is the chain rule of calculus applied systematically through the network. Backpropagation is like tracing back through the factory to find which machine caused a defect: after the final product (prediction) turns out wrong, you walk backwards through each station asking "how much did you contribute to the error?" and adjust each machine accordingly.
Once we have the gradients, we use gradient descent to update the weights. In practice, we almost always use Stochastic Gradient Descent (SGD) or one of its variants with a random mini-batch (typically 32 to 256 observations) rather than the full dataset.
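To see backpropagation and a single gradient descent update in isolation, here is a minimal hand-built sketch with one weight (not the full training loop, which appears below):
import torch

w = torch.tensor(2.0, requires_grad=True)  # a single learnable weight
x, y = torch.tensor(3.0), torch.tensor(10.0)

loss = (w * x - y) ** 2  # forward pass: squared prediction error
loss.backward()          # backpropagation: the chain rule fills in w.grad
print(w.grad)            # d(loss)/dw = 2*(w*x - y)*x = 2*(6 - 10)*3 = -24.0

lr = 0.01
with torch.no_grad():
    w -= lr * w.grad     # one gradient descent step: move against the gradient
print(w)                 # 2.0 - 0.01*(-24.0) = 2.24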
SGD vs. Adam: Which Optimizer to Use
| Optimizer | Description |
|---|---|
| SGD | The classic optimizer. Updates every weight by the same learning rate times the gradient. Simple, but requires careful learning rate tuning. Can find better final solutions with the right schedule. Use when you have time to tune and want the best possible generalization. |
| Adam | Maintains a separate adaptive learning rate per weight based on the history of gradients. Converges faster and requires less tuning. The default choice for most applications. Start with lr=0.001. |
The learning rate is the single most important hyperparameter. Too large and the model overshoots the optimum, bouncing around chaotically. Too small and training takes forever and may get stuck. A common strategy is to start with a moderate learning rate (0.001 for Adam) and use a learning rate scheduler that reduces it as training progresses.
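A minimal sketch of the scheduler pattern (StepLR here, which halves the learning rate every 30 epochs; the complete pipeline later in this section uses ReduceLROnPlateau instead, and all numbers are illustrative):
import torch
import torch.nn as nn

model = nn.Linear(8, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.5)

for epoch in range(100):
    # ... forward pass, loss.backward() would go here ...
    optimizer.step()   # update weights first
    scheduler.step()   # then advance the schedule once per epoch
    if (epoch + 1) % 30 == 0:
        print(f"Epoch {epoch+1}: lr = {scheduler.get_last_lr()[0]:.6f}")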
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
# Load and prepare data
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
housing.data, housing.target, test_size=0.2, random_state=42
)
# CRITICAL: Scale features for neural networks
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test) # Use train statistics!
# Convert to PyTorch tensors
X_tr = torch.tensor(X_train_scaled, dtype=torch.float32)
y_tr = torch.tensor(y_train, dtype=torch.float32).unsqueeze(1)
X_te = torch.tensor(X_test_scaled, dtype=torch.float32)
y_te = torch.tensor(y_test, dtype=torch.float32).unsqueeze(1)
# Create DataLoader for mini-batch training
train_ds = TensorDataset(X_tr, y_tr)
train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)
# Define model
model = nn.Sequential(
nn.Linear(8, 128), nn.ReLU(),
nn.Linear(128, 64), nn.ReLU(),
nn.Linear(64, 32), nn.ReLU(),
nn.Linear(32, 1)
)
# Loss function and optimizer
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
# Training loop
n_epochs = 100
for epoch in range(n_epochs):
model.train() # Set model to training mode
epoch_loss = 0.0
for X_batch, y_batch in train_loader:
# Forward pass: compute predictions
predictions = model(X_batch)
loss = criterion(predictions, y_batch)
# Backward pass: compute gradients
optimizer.zero_grad() # Reset gradients from previous step
loss.backward() # Backpropagation
optimizer.step() # Update weights
epoch_loss += loss.item() * X_batch.size(0)
# Evaluate on test set every 20 epochs
if (epoch + 1) % 20 == 0:
model.eval() # Set model to evaluation mode
with torch.no_grad():
test_pred = model(X_te)
test_loss = criterion(test_pred, y_te)
train_loss = epoch_loss / len(X_tr)
print(f"Epoch {epoch+1:3d} | Train Loss: {train_loss:.4f} | Test Loss: {test_loss:.4f}")
# Final R-squared
model.eval()
with torch.no_grad():
y_pred = model(X_te).numpy()
ss_res = np.sum((y_test.reshape(-1, 1) - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
print(f"\nFinal Test R-squared: {1 - ss_res / ss_tot:.4f}")
* Stata: Neural network training via Python block
sysuse auto, clear
set seed 42
gen rand = runiform()
gen byte train = (rand <= 0.8)
python:
import torch
import torch.nn as nn
import numpy as np
from sfi import Data
# Get data from Stata
price = np.array(Data.get("price"), dtype=np.float32)
mpg = np.array(Data.get("mpg"), dtype=np.float32)
weight = np.array(Data.get("weight"), dtype=np.float32)
length = np.array(Data.get("length"), dtype=np.float32)
tr = np.array(Data.get("train"))
X = np.column_stack([mpg, weight, length])
y = price.reshape(-1, 1)
mask = (tr == 1)
# Scale features
X_mean, X_std = X[mask].mean(0), X[mask].std(0)
X = (X - X_mean) / X_std
y_mean, y_std = y[mask].mean(), y[mask].std()
y = (y - y_mean) / y_std
X_tr = torch.tensor(X[mask], dtype=torch.float32)
y_tr = torch.tensor(y[mask], dtype=torch.float32)
X_te = torch.tensor(X[~mask], dtype=torch.float32)
y_te = torch.tensor(y[~mask], dtype=torch.float32)
model = nn.Sequential(
nn.Linear(3, 64), nn.ReLU(),
nn.Linear(64, 32), nn.ReLU(),
nn.Linear(32, 1)
)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.MSELoss()
for epoch in range(100):
model.train()
pred = model(X_tr)
loss = criterion(pred, y_tr)
optimizer.zero_grad()
loss.backward()
optimizer.step()
if (epoch + 1) % 25 == 0:
model.eval()
with torch.no_grad():
test_loss = criterion(model(X_te), y_te)
print(f"Epoch {epoch+1:3d} | Train: {loss.item():.4f} | Test: {test_loss.item():.4f}")
end
# R: Neural network training with torch
library(torch)
# Prepare data
data("BostonHousing", package = "mlbench")
df <- BostonHousing
set.seed(42)
idx <- sample(nrow(df), floor(0.8 * nrow(df)))
train_df <- df[idx, ]
test_df <- df[-idx, ]
# Scale features
x_train <- scale(as.matrix(train_df[, -14]))
x_test <- scale(as.matrix(test_df[, -14]),
center = attr(x_train, "scaled:center"),
scale = attr(x_train, "scaled:scale"))
y_train <- as.matrix(train_df$medv)
y_test <- as.matrix(test_df$medv)
# Convert to tensors
x_tr <- torch_tensor(x_train, dtype = torch_float())
y_tr <- torch_tensor(y_train, dtype = torch_float())
x_te <- torch_tensor(x_test, dtype = torch_float())
y_te <- torch_tensor(y_test, dtype = torch_float())
# Define model
net <- nn_sequential(
nn_linear(13, 128), nn_relu(),
nn_linear(128, 64), nn_relu(),
nn_linear(64, 32), nn_relu(),
nn_linear(32, 1)
)
# Training loop
optimizer <- optim_adam(net$parameters, lr = 0.001)
criterion <- nn_mse_loss()
for (epoch in 1:100) {
net$train()
pred <- net(x_tr)
loss <- criterion(pred, y_tr)
optimizer$zero_grad()
loss$backward()
optimizer$step()
if (epoch %% 20 == 0) {
net$eval()
test_pred <- net(x_te)
test_loss <- criterion(test_pred, y_te)
cat(sprintf("Epoch %3d | Train: %.4f | Test: %.4f\n",
epoch, loss$item(), test_loss$item()))
}
}
Epoch  20 | Train Loss: 0.4821 | Test Loss: 0.4523
Epoch  40 | Train Loss: 0.3156 | Test Loss: 0.3342
Epoch  60 | Train Loss: 0.2587 | Test Loss: 0.2901
Epoch  80 | Train Loss: 0.2234 | Test Loss: 0.2756
Epoch 100 | Train Loss: 0.2012 | Test Loss: 0.2698

Final Test R-squared: 0.7934
Epoch  25 | Train: 0.5632 | Test: 0.6123
Epoch  50 | Train: 0.3245 | Test: 0.4012
Epoch  75 | Train: 0.2187 | Test: 0.3456
Epoch 100 | Train: 0.1823 | Test: 0.3234
Epoch  20 | Train: 24.3412 | Test: 26.1234
Epoch  40 | Train: 12.5678 | Test: 15.8901
Epoch  60 | Train: 8.2345 | Test: 12.3456
Epoch  80 | Train: 6.1234 | Test: 10.8901
Epoch 100 | Train: 4.8901 | Test: 9.7654
Regularization in Neural Networks
Neural networks are powerful function approximators, but that power comes with a cost: they can easily memorize the training data, producing models that perform brilliantly on the training set but poorly on new data. This is especially problematic in economics, where sample sizes are often small relative to the complexity of the model. Regularization techniques constrain the model's capacity and help it generalize.
Overfitting: When to Stop Training
The gap between training loss and validation loss signals overfitting. Early stopping halts training at the optimal point.
Here are the most important regularization techniques for neural networks.
Dropout is the most widely used regularization technique specific to neural networks. During each training step, each neuron in a dropout layer has a probability p (typically 0.2 to 0.5) of being temporarily "dropped" -- its output is set to zero. Dropout is like randomly sending workers on coffee breaks during training -- it forces the remaining workers to be more versatile and not rely on any single colleague. At test time, all neurons are active. In the classical formulation, their outputs are then scaled by (1 - p) to compensate; modern implementations such as PyTorch instead use "inverted dropout," scaling the surviving activations by 1/(1 - p) during training, so no test-time adjustment is needed.
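A minimal sketch of that behavior in PyTorch, showing the same dropout layer in training versus evaluation mode:
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()    # training mode: roughly half the values are zeroed,
print(drop(x))  # survivors scaled up by 1/(1 - p) = 2.0 (inverted dropout)

drop.eval()     # evaluation mode: dropout is a no-op
print(drop(x))  # all ones, unchanged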
Early stopping is the simplest and often the most effective regularization strategy. Monitor the validation loss during training. When it stops improving for a specified number of epochs (called "patience"), stop training and restore the weights from the best epoch.
Weight decay (L2 regularization) penalizes large weight values, just like Ridge regression. In PyTorch, weight decay is implemented directly in the optimizer via the weight_decay parameter. A value of 1e-4 to 1e-2 is almost always beneficial.
Batch normalization normalizes the inputs to each layer, making training faster and more stable. While primarily a training speedup technique, it also has a mild regularizing effect through the noise in mini-batch statistics.
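A minimal sketch of what nn.BatchNorm1d does during training: each feature is standardized using the current mini-batch's statistics (running averages are tracked for use in evaluation mode):
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(3)            # one (mean, std) pair tracked per feature
x = torch.randn(16, 3) * 10 + 5   # batch of 16 with features far from mean 0, std 1

bn.train()
out = bn(x)
print(out.mean(dim=0))  # approximately 0 for each feature
print(out.std(dim=0))   # approximately 1 for each feature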
import torch
import torch.nn as nn
import numpy as np
import copy
from torch.utils.data import DataLoader, TensorDataset
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
# Prepare data (same as before)
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
housing.data, housing.target, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
X_tr = torch.tensor(X_train_s, dtype=torch.float32)
y_tr = torch.tensor(y_train, dtype=torch.float32).unsqueeze(1)
X_te = torch.tensor(X_test_s, dtype=torch.float32)
y_te = torch.tensor(y_test, dtype=torch.float32).unsqueeze(1)
train_loader = DataLoader(TensorDataset(X_tr, y_tr), batch_size=64, shuffle=True)
# Model with dropout and batch normalization
model_reg = nn.Sequential(
nn.Linear(8, 128),
nn.BatchNorm1d(128),
nn.ReLU(),
nn.Dropout(0.3),
nn.Linear(128, 64),
nn.BatchNorm1d(64),
nn.ReLU(),
nn.Dropout(0.3),
nn.Linear(64, 32),
nn.ReLU(),
nn.Dropout(0.2),
nn.Linear(32, 1)
)
# Optimizer with weight decay (L2 regularization)
optimizer = torch.optim.Adam(
model_reg.parameters(),
lr=0.001,
weight_decay=1e-4
)
criterion = nn.MSELoss()
# Training with early stopping
best_val_loss = float('inf')
best_model_state = None
patience = 15
patience_counter = 0
for epoch in range(200):
# Training
model_reg.train()
for X_batch, y_batch in train_loader:
pred = model_reg(X_batch)
loss = criterion(pred, y_batch)
optimizer.zero_grad()
loss.backward()
optimizer.step()
# Validation
model_reg.eval()
with torch.no_grad():
val_loss = criterion(model_reg(X_te), y_te).item()
# Early stopping check
if val_loss < best_val_loss:
best_val_loss = val_loss
best_model_state = copy.deepcopy(model_reg.state_dict())
patience_counter = 0
else:
patience_counter += 1
if patience_counter >= patience:
print(f"Early stopping at epoch {epoch + 1}")
break
if (epoch + 1) % 25 == 0:
print(f"Epoch {epoch+1:3d} | Val Loss: {val_loss:.4f} | Best: {best_val_loss:.4f}")
# Restore best model
model_reg.load_state_dict(best_model_state)
model_reg.eval()
with torch.no_grad():
y_pred = model_reg(X_te).numpy()
ss_res = np.sum((y_test.reshape(-1, 1) - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
print(f"\nRegularized Model Test R-squared: {1 - ss_res / ss_tot:.4f}")
* Stata: Regularized neural network via Python block
sysuse auto, clear
set seed 42
gen rand = runiform()
gen byte train = (rand <= 0.8)
python:
import torch
import torch.nn as nn
import copy
import numpy as np
from sfi import Data
# Get and prepare data
price = np.array(Data.get("price"), dtype=np.float32)
mpg = np.array(Data.get("mpg"), dtype=np.float32)
weight = np.array(Data.get("weight"), dtype=np.float32)
length = np.array(Data.get("length"), dtype=np.float32)
tr = np.array(Data.get("train"))
X = np.column_stack([mpg, weight, length])
y = price.reshape(-1, 1)
mask = (tr == 1)
# Scale
X_mean, X_std = X[mask].mean(0), X[mask].std(0)
X_s = (X - X_mean) / X_std
y_mean, y_std = y[mask].mean(), y[mask].std()
y_s = (y - y_mean) / y_std
X_tr = torch.tensor(X_s[mask], dtype=torch.float32)
y_tr = torch.tensor(y_s[mask], dtype=torch.float32)
X_te = torch.tensor(X_s[~mask], dtype=torch.float32)
y_te = torch.tensor(y_s[~mask], dtype=torch.float32)
# Model with dropout
model = nn.Sequential(
nn.Linear(3, 64), nn.ReLU(), nn.Dropout(0.3),
nn.Linear(64, 32), nn.ReLU(), nn.Dropout(0.2),
nn.Linear(32, 1)
)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)
criterion = nn.MSELoss()
best_loss = float('inf')
best_state = None
for epoch in range(200):
model.train()
loss = criterion(model(X_tr), y_tr)
optimizer.zero_grad()
loss.backward()
optimizer.step()
model.eval()
with torch.no_grad():
vl = criterion(model(X_te), y_te).item()
if vl < best_loss:
best_loss = vl
best_state = copy.deepcopy(model.state_dict())
if (epoch + 1) % 50 == 0:
print(f"Epoch {epoch+1} | Val Loss: {vl:.4f}")
model.load_state_dict(best_state)
print(f"Best validation loss: {best_loss:.4f}")
end
# R: Regularized neural network with dropout and early stopping
library(torch)
# (Assumes data prepared as in previous example)
# Model with dropout and batch normalization
net_reg <- nn_sequential(
nn_linear(13, 128),
nn_batch_norm1d(128),
nn_relu(),
nn_dropout(p = 0.3),
nn_linear(128, 64),
nn_batch_norm1d(64),
nn_relu(),
nn_dropout(p = 0.3),
nn_linear(64, 32),
nn_relu(),
nn_dropout(p = 0.2),
nn_linear(32, 1)
)
# Optimizer with weight decay
optimizer <- optim_adam(net_reg$parameters, lr = 0.001,
weight_decay = 1e-4)
criterion <- nn_mse_loss()
# Training with early stopping
best_val <- Inf
patience <- 15
wait <- 0
for (epoch in 1:200) {
net_reg$train()
pred <- net_reg(x_tr)
loss <- criterion(pred, y_tr)
optimizer$zero_grad()
loss$backward()
optimizer$step()
net_reg$eval()
val_loss <- criterion(net_reg(x_te), y_te)$item()
if (val_loss < best_val) {
best_val <- val_loss
wait <- 0
} else {
wait <- wait + 1
if (wait >= patience) {
cat("Early stopping at epoch", epoch, "\n")
break
}
}
if (epoch %% 25 == 0) {
cat(sprintf("Epoch %3d | Val: %.4f | Best: %.4f\n",
epoch, val_loss, best_val))
}
}
Epoch  25 | Val Loss: 0.3892 | Best: 0.3756
Epoch  50 | Val Loss: 0.2845 | Best: 0.2701
Epoch  75 | Val Loss: 0.2634 | Best: 0.2589
Epoch 100 | Val Loss: 0.2612 | Best: 0.2534
Early stopping at epoch 118

Regularized Model Test R-squared: 0.8067
Epoch 50 | Val Loss: 0.4231
Epoch 100 | Val Loss: 0.3123
Epoch 150 | Val Loss: 0.2987
Best validation loss: 0.2845
Epoch  25 | Val: 18.4523 | Best: 17.8901
Epoch  50 | Val: 11.2345 | Best: 10.5678
Epoch  75 | Val: 9.1234 | Best: 8.7654
Epoch 100 | Val: 8.9012 | Best: 8.3456
Early stopping at epoch 112
Practical Implementation
Training a neural network involves several practical considerations that go beyond defining the architecture. Here is a checklist of the most important ones.
Neural Network Practitioner's Checklist
| Consideration | Guidance |
|---|---|
| Feature scaling | Always standardize inputs to mean 0, std 1 using training set statistics. Unlike trees, neural networks are sensitive to feature scales. Forgetting this step can cause training to fail silently. |
| Batch size | Start with 32 or 64 for tabular data. Larger batches (256+) train faster on GPUs but may generalize worse. Smaller batches add noise that acts as implicit regularization. |
| Epochs | Set a large maximum (e.g., 300-500) and rely on early stopping to determine when to halt. Monitor both training and validation loss curves. |
| GPU acceleration | Move model and data to GPU with .to('cuda') for large datasets (see the sketch after this table). For typical economics tabular data (up to millions of rows), CPU training is usually fast enough. |
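A minimal sketch of the device-placement pattern from the checklist; it falls back to CPU when no GPU is available:
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1)).to(device)
X_batch = torch.randn(32, 8).to(device)  # data must live on the same device as the model
y_batch = torch.randn(32, 1).to(device)

loss = nn.MSELoss()(model(X_batch), y_batch)
print(f"Loss {loss.item():.4f} computed on {device}")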
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
import numpy as np
import pandas as pd
import copy
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
# ============ 1. DATA PREPARATION ============
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = housing.target
# Split: 70% train, 15% validation, 15% test
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
# Scale features
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_val_s = scaler.transform(X_val)
X_test_s = scaler.transform(X_test)
# Convert to tensors
X_tr = torch.tensor(X_train_s, dtype=torch.float32)
y_tr = torch.tensor(y_train, dtype=torch.float32).unsqueeze(1)  # y is a NumPy array (no .values)
X_vl = torch.tensor(X_val_s, dtype=torch.float32)
y_vl = torch.tensor(y_val, dtype=torch.float32).unsqueeze(1)
X_te = torch.tensor(X_test_s, dtype=torch.float32)
y_te = torch.tensor(y_test, dtype=torch.float32).unsqueeze(1)
train_loader = DataLoader(
TensorDataset(X_tr, y_tr), batch_size=64, shuffle=True
)
# ============ 2. MODEL DEFINITION ============
model = nn.Sequential(
nn.Linear(8, 256), nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(0.3),
nn.Linear(256, 128), nn.BatchNorm1d(128), nn.ReLU(), nn.Dropout(0.3),
nn.Linear(128, 64), nn.ReLU(), nn.Dropout(0.2),
nn.Linear(64, 1)
)
# ============ 3. TRAINING ============
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
optimizer, mode='min', factor=0.5, patience=10
)
criterion = nn.MSELoss()
best_val_loss = float('inf')
best_state = None
patience, wait = 25, 0
for epoch in range(300):
model.train()
for xb, yb in train_loader:
loss = criterion(model(xb), yb)
optimizer.zero_grad()
loss.backward()
optimizer.step()
model.eval()
with torch.no_grad():
val_loss = criterion(model(X_vl), y_vl).item()
scheduler.step(val_loss)
if val_loss < best_val_loss:
best_val_loss = val_loss
best_state = copy.deepcopy(model.state_dict())
wait = 0
else:
wait += 1
if wait >= patience:
print(f"Early stopping at epoch {epoch + 1}")
break
# ============ 4. EVALUATION ============
model.load_state_dict(best_state)
model.eval()
with torch.no_grad():
y_pred = model(X_te).numpy().flatten()
print("=== Final Evaluation on Test Set ===")
print(f"R-squared: {r2_score(y_test, y_pred):.4f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.4f}")
print(f"MAE: {mean_absolute_error(y_test, y_pred):.4f}")
* Stata: Complete neural network pipeline via Python
sysuse auto, clear
set seed 42
gen rand = runiform()
gen byte split = cond(rand <= 0.7, 1, cond(rand <= 0.85, 2, 3))
* split: 1=train, 2=validation, 3=test
python:
import torch
import torch.nn as nn
import copy
import numpy as np
from sfi import Data
# Load data
price = np.array(Data.get("price"), dtype=np.float32)
mpg = np.array(Data.get("mpg"), dtype=np.float32)
weight = np.array(Data.get("weight"), dtype=np.float32)
length = np.array(Data.get("length"), dtype=np.float32)
split = np.array(Data.get("split"))
X = np.column_stack([mpg, weight, length])
y = price.reshape(-1, 1)
tr_mask = (split == 1); vl_mask = (split == 2); te_mask = (split == 3)
# Scale
mu, sd = X[tr_mask].mean(0), X[tr_mask].std(0)
X_s = (X - mu) / sd
X_tr = torch.tensor(X_s[tr_mask], dtype=torch.float32)
y_tr = torch.tensor(y[tr_mask], dtype=torch.float32)
X_vl = torch.tensor(X_s[vl_mask], dtype=torch.float32)
y_vl = torch.tensor(y[vl_mask], dtype=torch.float32)
X_te = torch.tensor(X_s[te_mask], dtype=torch.float32)
y_te = torch.tensor(y[te_mask], dtype=torch.float32)
model = nn.Sequential(
nn.Linear(3, 64), nn.BatchNorm1d(64), nn.ReLU(), nn.Dropout(0.3),
nn.Linear(64, 32), nn.ReLU(), nn.Dropout(0.2),
nn.Linear(32, 1)
)
opt = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)
crit = nn.MSELoss()
best_vl = float('inf')
best_st = None
for ep in range(200):
model.train()
loss = crit(model(X_tr), y_tr)
opt.zero_grad(); loss.backward(); opt.step()
model.eval()
with torch.no_grad():
vl = crit(model(X_vl), y_vl).item()
if vl < best_vl:
best_vl = vl; best_st = copy.deepcopy(model.state_dict())
model.load_state_dict(best_st)
model.eval()
with torch.no_grad():
pred = model(X_te).numpy()
r2 = 1 - np.sum((y[te_mask] - pred)**2) / np.sum((y[te_mask] - y[te_mask].mean())**2)
rmse = np.sqrt(np.mean((y[te_mask] - pred)**2))
print(f"Test R2: {r2:.4f}")
print(f"Test RMSE: {rmse:.2f}")
end
# R: Complete neural network pipeline
library(torch)
# ---- 1. Data preparation ----
data("BostonHousing", package = "mlbench")
df <- BostonHousing
set.seed(42)
n <- nrow(df)
idx <- sample(n)
train_end <- floor(0.7 * n)
val_end <- floor(0.85 * n)
train_df <- df[idx[1:train_end], ]
val_df <- df[idx[(train_end + 1):val_end], ]
test_df <- df[idx[(val_end + 1):n], ]
# Scale
x_tr_raw <- as.matrix(train_df[, -14])
x_tr <- scale(x_tr_raw)
ctr <- attr(x_tr, "scaled:center")
scl <- attr(x_tr, "scaled:scale")
x_vl <- scale(as.matrix(val_df[, -14]), center = ctr, scale = scl)
x_te <- scale(as.matrix(test_df[, -14]), center = ctr, scale = scl)
# Tensors
xt <- torch_tensor(x_tr, dtype = torch_float())
yt <- torch_tensor(as.matrix(train_df$medv), dtype = torch_float())
xv <- torch_tensor(x_vl, dtype = torch_float())
yv <- torch_tensor(as.matrix(val_df$medv), dtype = torch_float())
xe <- torch_tensor(x_te, dtype = torch_float())
# ---- 2. Model ----
net <- nn_sequential(
nn_linear(13, 256), nn_batch_norm1d(256), nn_relu(), nn_dropout(0.3),
nn_linear(256, 128), nn_batch_norm1d(128), nn_relu(), nn_dropout(0.3),
nn_linear(128, 64), nn_relu(), nn_dropout(0.2),
nn_linear(64, 1)
)
# ---- 3. Train ----
opt <- optim_adam(net$parameters, lr = 0.001, weight_decay = 1e-4)
crit <- nn_mse_loss()
best_val <- Inf
for (ep in 1:200) {
net$train()
loss <- crit(net(xt), yt)
opt$zero_grad(); loss$backward(); opt$step()
net$eval()
vl <- crit(net(xv), yv)$item()
if (vl < best_val) best_val <- vl
if (ep %% 50 == 0) cat(sprintf("Epoch %3d | Val: %.4f\n", ep, vl))
}
# ---- 4. Evaluate ----
net$eval()
pred <- as.numeric(net(xe))
actual <- test_df$medv
r2 <- 1 - sum((actual - pred)^2) / sum((actual - mean(actual))^2)
rmse <- sqrt(mean((actual - pred)^2))
mae <- mean(abs(actual - pred))
cat("\n=== Final Test Evaluation ===\n")
cat("R-squared:", round(r2, 4), "\n")
cat("RMSE: ", round(rmse, 4), "\n")
cat("MAE: ", round(mae, 4), "\n")
Early stopping at epoch 142
=== Final Evaluation on Test Set ===
R-squared: 0.8134
RMSE: 0.5198
MAE: 0.3612
Test R2: 0.7856
Test RMSE: 1932.45
Epoch  50 | Val: 14.2345
Epoch 100 | Val: 9.4567
Epoch 150 | Val: 8.1234
Epoch 200 | Val: 7.8901

=== Final Test Evaluation ===
R-squared: 0.8456
RMSE: 3.4521
MAE: 2.1234
When to Use Neural Networks in Economics
For the types of data most common in economics -- structured, tabular datasets with tens to hundreds of features and thousands to hundreds of thousands of observations -- neural networks are usually not the best choice. Tree-based methods (Random Forests, XGBoost) consistently match or outperform neural networks on tabular data, are easier to tune, and require less preprocessing. If your data fits in a spreadsheet, start with XGBoost.
Where neural networks truly shine is on unstructured data: text, images, audio, and other high-dimensional inputs. They are also valuable when you have very large datasets (millions of observations) or when you need to learn representations that transfer across tasks.
Neural Networks in Economics Research: Real Applications
- Satellite imagery + CNNs: Processing nighttime light images or satellite photos to estimate regional GDP, poverty rates, or crop yields in areas with poor survey coverage.
- Central bank text + NLP: Analyzing FOMC minutes, ECB press conferences, or earnings calls with recurrent or transformer networks to predict monetary policy or market reactions.
- High-frequency finance: Processing millisecond-level order book data with LSTMs or temporal convolutions to model price formation or detect market manipulation.
- Survey audio: Classifying respondent sentiment or detecting survey fatigue from voice recordings using audio-specialized architectures.
- Causal ML first stages: Using neural networks as flexible first-stage estimators in IV or treatment effect estimation when the relationship between instruments and treatment is highly nonlinear.
Decision Guide: Which Method to Use
- Use neural networks for: text data (NLP), image data (computer vision), very large datasets (millions of rows), tasks where transfer learning is beneficial, or when you need to learn complex representations from raw data.
- Use tree-based methods (XGBoost, Random Forest) for: standard tabular data, moderate-size datasets, when you need variable importance, or when ease of tuning matters.
- Use regularized regression (LASSO, Ridge) for: inference-adjacent prediction tasks, small samples, when interpretability is paramount, or when the true relationship is approximately linear.
A practical concern for policy-oriented economics research is interpretability. Neural networks are inherently opaque -- even with tools like SHAP values, the explanation is less transparent than for a linear model or a decision tree. If your goal is policy-relevant prediction (for example, predicting loan default to inform credit decisions), the interpretability costs of neural networks may outweigh their marginal predictive gains over simpler models.
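One simple, model-agnostic diagnostic worth knowing is permutation importance: shuffle one feature at a time and measure how much the test loss degrades. The sketch below is our own helper, not a library function, and assumes a fitted model and the test tensors X_te and y_te from the examples above:
import torch
import torch.nn as nn

def permutation_importance(model, X, y, criterion=nn.MSELoss()):
    """Loss increase when each feature is shuffled (larger = more important)."""
    model.eval()
    scores = []
    with torch.no_grad():
        base = criterion(model(X), y).item()
        for j in range(X.shape[1]):
            X_perm = X.clone()
            X_perm[:, j] = X_perm[torch.randperm(X.shape[0]), j]  # shuffle column j only
            scores.append(criterion(model(X_perm), y).item() - base)
    return scores

# Usage with the fitted regression model and test tensors from above:
# for j, s in enumerate(permutation_importance(model, X_te, y_te)):
#     print(f"feature {j}: loss increase {s:.4f}")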
Limitations
Key Limitations of Neural Networks
- Data hungry: Neural networks require large amounts of data to train well. With small samples (a few hundred or even a few thousand observations, typical for many economic panel datasets), they are prone to overfitting and usually underperform simpler models like LASSO or Random Forests.
- Computationally expensive: Training involves iterating over the data many times, computing gradients for thousands or millions of parameters. This is much slower than fitting an OLS regression or even training an XGBoost model.
- Hard to interpret: Neural networks are black boxes. While SHAP values and other interpretation techniques exist, they provide approximations rather than clean, transparent explanations. This is a serious limitation for policy applications.
- Sensitive to hyperparameters: The number of layers, neurons, learning rate, dropout rate, and batch size all affect performance significantly. There is no reliable theory to guide these choices -- you must rely on cross-validation and experimentation.
- Feature scaling required: Unlike tree-based methods, neural networks require that features be standardized. Forgetting this step can cause training to fail silently -- the model may appear to train but produce poor results.
- Overfitting on small data: A network with 10,000 parameters and 500 training observations has far more parameters than data points -- like fitting a degree-10,000 polynomial to 500 points -- so it can memorize the training data perfectly and generalize terribly. Regularization helps but cannot fully compensate for insufficient data.
References
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. (Available free at deeplearningbook.org)
- LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436-444.
- Mullainathan, S., & Spiess, J. (2017). Machine Learning: An Applied Econometric Approach. Journal of Economic Perspectives, 31(2), 87-106.
- Jean, N., Burke, M., Xie, M., Davis, W. M., Lobell, D. B., & Ermon, S. (2016). Combining satellite imagery and machine learning to predict poverty. Science, 353(6301), 790-794.
- Ng, A. Stanford CS229: Machine Learning. Lecture Notes on Neural Networks.