8 Replicability & Reproducibility
Learning Objectives
- Understand the difference between replicability and reproducibility
- Organize research projects with clear folder structures
- Write documentation that enables others to run your code
- Create publication-ready replication packages
- Successfully replicate published economics papers
8.1 Definitions
| Concept | Definition | Question |
|---|---|---|
| Reproducibility | Same data + same code = same results | Can I re-run this and get identical output? |
| Replicability | New data + same methods = consistent findings | Does this hold in a different sample? |
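The distinction in the table can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the "population" is simulated, the "method" is just a sample mean, and the names `draw_sample` and `estimate_mean` are hypothetical.

```python
import random
import statistics

def estimate_mean(sample):
    """The 'method': here simply the sample mean."""
    return statistics.fmean(sample)

def draw_sample(seed, n=1000):
    """Draw n observations from a hypothetical population (mean 10, sd 2)."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return [rng.gauss(10, 2) for _ in range(n)]

# Reproducibility: same data + same code -> identical result
original = estimate_mean(draw_sample(seed=1))
rerun = estimate_mean(draw_sample(seed=1))
print("Identical on re-run:", original == rerun)

# Replicability: new data + same method -> consistent (not identical) result
new_sample = estimate_mean(draw_sample(seed=2))
print("Close in a new sample:", abs(original - new_sample) < 0.5)
```

A re-run that does not match exactly usually points to an unset seed, a changed input file, or a different software version.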
8.2 Project Organization
A well-organized project folder is the foundation of reproducible research. Here's what a complete research project should look like:
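One way to set up such a structure is sketched below in Python. The folder names are taken from the master scripts later in this module (data/raw, data/processed, code, output/tables, output/figures, output/logs) plus a paper folder for the LaTeX source; the helper name `create_project_skeleton` is hypothetical, and your own project may need more or fewer folders.

```python
import tempfile
from pathlib import Path

# Folder names follow the master scripts in this module; "paper" holds the LaTeX source
PROJECT_FOLDERS = [
    "data/raw",
    "data/processed",
    "code",
    "output/tables",
    "output/figures",
    "output/logs",
    "paper",
]

def create_project_skeleton(root):
    """Create the standard folders under root and return their paths."""
    root = Path(root)
    created = []
    for folder in PROJECT_FOLDERS:
        path = root / folder
        path.mkdir(parents=True, exist_ok=True)  # safe to call repeatedly
        created.append(path)
    return created

# Demo in a throwaway directory so nothing is written to a real project
with tempfile.TemporaryDirectory() as tmp:
    created = create_project_skeleton(Path(tmp) / "minimum-wage-study")
    all_exist = all(p.is_dir() for p in created)
print("All folders created:", all_exist)
```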
Key Principles
- Raw data is sacred: Never modify original files—all cleaning happens in code
- One master script: Runs entire analysis end-to-end with a single click
- Relative paths only: Never use absolute paths like C:\Users\...
- Number your scripts: Makes execution order obvious (00, 01, 02...)
- Version everything: Use Git (see Module 9)
- Separate outputs: Tables, figures, and logs in their own folders
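The "raw data is sacred" principle can be checked mechanically. The sketch below is one possible approach, not a required part of the workflow: record a SHA-256 checksum for each raw file once, then compare later to detect accidental edits. The helper names `sha256_of` and `changed_files` are hypothetical, and the demo uses a throwaway file in place of a real data/raw folder.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def changed_files(raw_dir, recorded):
    """Return names of files whose current hash differs from the recorded one."""
    raw_dir = Path(raw_dir)
    return sorted(name for name, old in recorded.items()
                  if sha256_of(raw_dir / name) != old)

# Demo with a throwaway file standing in for data/raw/survey.csv
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp)
    (raw / "survey.csv").write_text("id,wage\n1,12.5\n")
    recorded = {"survey.csv": sha256_of(raw / "survey.csv")}
    untouched = changed_files(raw, recorded)            # nothing changed yet
    (raw / "survey.csv").write_text("id,wage\n1,99\n")  # simulate an accidental edit
    modified = changed_files(raw, recorded)
print("Before edit:", untouched, "| after edit:", modified)
```

Storing the recorded checksums in a small text file next to the README lets a replicator confirm they start from the same raw files.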
8.3 Understanding File Paths
File paths tell your computer where to find files. Understanding them is essential for writing code that works on any machine.
Absolute vs. Relative Paths
| Type | Example | Verdict |
|---|---|---|
| Absolute Path | C:\Users\Maria\Projects\thesis\data\raw\survey.dta | ❌ Never use in code |
| Relative Path | data/raw/survey.dta | ✓ Always use this |
A relative path starts from your current location (the "working directory"). Think of it like giving directions: "go to the data folder, then the raw subfolder, then find survey.dta."
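To make the "directions from your current location" idea concrete, here is a small Python sketch. It joins the same relative path onto two different working directories; the second directory is a hypothetical co-author's machine, and `resolve_from` is an illustrative helper rather than a library function. `PurePosixPath` is used so the printed result is the same on any operating system.

```python
from pathlib import PurePosixPath

relative = PurePosixPath("data/raw/survey.dta")

def resolve_from(working_dir, relative_path):
    """Join a relative path onto a working directory."""
    return PurePosixPath(working_dir) / relative_path

on_giulia = resolve_from("/Users/giulia/Projects/thesis", relative)
on_server = resolve_from("/home/ra/thesis", relative)  # hypothetical second machine

print(relative.is_absolute())  # False: no leading slash or drive letter
print(on_giulia)               # /Users/giulia/Projects/thesis/data/raw/survey.dta
print(on_server)               # /home/ra/thesis/data/raw/survey.dta
```

The code that opens the file never changes; only the working directory does, which is why the same script can run on both machines.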
Path Syntax: Mac/Linux vs. Windows
# Mac/Linux use forward slashes /
/Users/giulia/Projects/thesis/ # Absolute path
data/raw/survey.csv # Relative path
# Navigate up one folder with ..
../other_project/data.csv # Go up, then into other_project
# Home directory shortcut
~/Projects/thesis/ # ~ means your home folder
# Windows traditionally uses backslashes \
C:\Users\Giulia\Projects\thesis\ # Absolute path
data\raw\survey.csv # Relative path
# But forward slashes work in most modern software!
data/raw/survey.csv # This works in Stata, Python, R
# Navigate up one folder with ..
..\other_project\data.csv # Go up, then into other_project
Use forward slashes (/) everywhere, even on Windows. Stata, Python, and R all understand them, which keeps your code portable across operating systems.
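The asymmetry between the two separators can be checked with Python's pathlib, which models Windows and Mac/Linux path rules separately. This is only a sketch of the path semantics; it does not touch the file system.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Under Windows rules, both separators describe the same path
forward = PureWindowsPath("data/raw/survey.csv")
backward = PureWindowsPath(r"data\raw\survey.csv")
same_on_windows = forward == backward
print(same_on_windows)     # True
print(forward.as_posix())  # data/raw/survey.csv

# Under Mac/Linux rules, a backslash is an ordinary character, not a separator,
# so the whole string is treated as a single file name
posix_parts = PurePosixPath(r"data\raw\survey.csv").parts
print(len(posix_parts))    # 1
```

Forward slashes work under both sets of rules, while backslashes only work under one.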
Setting the Working Directory
The working directory is your "home base"—all relative paths start from here. Set it once at the beginning of your master script.
* Check current working directory
pwd
* Set working directory (do this ONCE in master script)
cd "/Users/giulia/Projects/minimum-wage-study"
* Now all paths are relative to this location
use "data/raw/cps_2019.dta", clear
export delimited "data/processed/clean_data.csv", replace
# Check current working directory
getwd()
# Set working directory
setwd("/Users/giulia/Projects/minimum-wage-study")
# Better: use here package (finds project root automatically)
library(here)
library(readr)  # provides read_csv()
data <- read_csv(here("data", "raw", "cps_2019.csv"))
# Or use RStudio Projects (.Rproj files)
import os
from pathlib import Path
# Check current working directory
print(os.getcwd())
# Set working directory
os.chdir("/Users/giulia/Projects/minimum-wage-study")
# Better: use pathlib for cross-platform paths
project_root = Path(__file__).resolve().parent.parent # Go up from code/ folder
data_path = project_root / "data" / "raw" / "cps_2019.csv"
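R's here package locates the project root by searching upward for marker files. A rough Python analogue is sketched below; it is not part of pathlib, the function name `find_project_root` is hypothetical, and the choice of markers (.git, README.md) is an assumption you would adapt to your own project.

```python
import tempfile
from pathlib import Path

def find_project_root(start, markers=(".git", "README.md")):
    """Walk upward from start until a folder contains one of the marker names."""
    start = Path(start).resolve()
    for folder in (start, *start.parents):
        if any((folder / marker).exists() for marker in markers):
            return folder
    raise FileNotFoundError(f"No project root found above {start}")

# Demo: a throwaway project with a README at the top and a nested code folder
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp).resolve() / "minimum-wage-study"
    (project / "code").mkdir(parents=True)
    (project / "README.md").write_text("# Minimum Wage Study\n")
    found = find_project_root(project / "code")
    root_found = found == project
print("Root located:", root_found)
```

In a real script you would call `find_project_root(Path(__file__).parent)` so the result does not depend on where the script was launched from.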
8.4 The Master Script
The master script is the single entry point that reproduces your entire analysis. A reviewer or co-author should be able to run this one file and regenerate all results. It should include:
- Clear header with project info and instructions
- One location to set the root path (the only absolute path in the entire project)
- Global macros for all folder paths
- Sequential execution of all sub-scripts
- Timestamps and logging
/*==============================================================================
MASTER DO-FILE: Minimum Wage and Employment
Paper: "The Employment Effects of Minimum Wage Increases"
Authors: Smith, J. and Jones, M.
Version: 1.0
Date: January 2025
INSTRUCTIONS:
1. Set the root path below to your project folder location
2. Ensure all required packages are installed (see below)
3. Run this entire file to reproduce all results
REQUIRED PACKAGES: estout, reghdfe, ftools, coefplot
Install each package with ssc install, e.g.: ssc install estout, replace
================================================================================*/
clear all
set more off
cap log close
*------------------------------------------------------------------------------*
* SET ROOT PATH - MODIFY THIS LINE ONLY
*------------------------------------------------------------------------------*
* Users should change this path to their own project folder location
if "`c(username)'" == "giulia" {
global root "/Users/giulia/Projects/minimum-wage-study"
}
else if "`c(username)'" == "john" {
global root "C:/Users/john/Research/minimum-wage-study"
}
else {
* Default: assume current directory is project root
global root "`c(pwd)'"
}
*------------------------------------------------------------------------------*
* DEFINE FOLDER PATHS (do not modify)
*------------------------------------------------------------------------------*
global data "$root/data"
global raw "$data/raw"
global processed "$data/processed"
global code "$root/code"
global output "$root/output"
global tables "$output/tables"
global figures "$output/figures"
global logs "$output/logs"
*------------------------------------------------------------------------------*
* CREATE FOLDERS IF THEY DON'T EXIST
*------------------------------------------------------------------------------*
cap mkdir "$processed"
cap mkdir "$output"
cap mkdir "$tables"
cap mkdir "$figures"
cap mkdir "$logs"
*------------------------------------------------------------------------------*
* START LOG FILE
*------------------------------------------------------------------------------*
local datetime: di %tcCCYY-NN-DD-HH-MM-SS clock("`c(current_date)' `c(current_time)'","DMYhms")
log using "$logs/master_log_`datetime'.txt", text replace
di "=============================================="
di " MASTER SCRIPT STARTED"
di " Date/Time: `c(current_date)' `c(current_time)'"
di " Stata version: `c(stata_version)'"
di " Root directory: $root"
di "=============================================="
*------------------------------------------------------------------------------*
* RUN ANALYSIS SCRIPTS IN ORDER
*------------------------------------------------------------------------------*
di _n ">>> Running 01_import_clean.do..."
do "$code/01_import_clean.do"
di _n ">>> Running 02_merge_construct.do..."
do "$code/02_merge_construct.do"
di _n ">>> Running 03_descriptive_stats.do..."
do "$code/03_descriptive_stats.do"
di _n ">>> Running 04_main_analysis.do..."
do "$code/04_main_analysis.do"
di _n ">>> Running 05_robustness.do..."
do "$code/05_robustness.do"
*------------------------------------------------------------------------------*
* COMPLETION
*------------------------------------------------------------------------------*
di _n "=============================================="
di " MASTER SCRIPT COMPLETED SUCCESSFULLY"
di " End time: `c(current_date)' `c(current_time)'"
di " All outputs saved to: $output"
di "=============================================="
log close
#===============================================================================
# MASTER SCRIPT: Minimum Wage and Employment
#
# Paper: "The Employment Effects of Minimum Wage Increases"
# Authors: Smith, J. and Jones, M.
# Version: 1.0
# Date: January 2025
#
# INSTRUCTIONS:
# 1. Open the .Rproj file in RStudio (sets working directory automatically)
# 2. Run this entire script to reproduce all results
#
# REQUIRED PACKAGES: tidyverse, fixest, modelsummary, here, haven
#===============================================================================
# Clear environment
rm(list = ls())
# Load packages (install if needed)
required_packages <- c("tidyverse", "fixest", "modelsummary", "here", "haven")
new_packages <- required_packages[!(required_packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)
library(tidyverse)
library(fixest)
library(modelsummary)
library(here)
library(haven)
#------------------------------------------------------------------------------
# DEFINE PATHS (using here package - no manual path setting needed!)
#------------------------------------------------------------------------------
paths <- list(
raw = here("data", "raw"),
processed = here("data", "processed"),
tables = here("output", "tables"),
figures = here("output", "figures"),
logs = here("output", "logs")
)
# Create directories if they don't exist
walk(paths, ~dir.create(.x, recursive = TRUE, showWarnings = FALSE))
#------------------------------------------------------------------------------
# START LOG
#------------------------------------------------------------------------------
sink(file.path(paths$logs, paste0("master_log_", Sys.Date(), ".txt")))
cat("==============================================\n")
cat(" MASTER SCRIPT STARTED\n")
cat(" Date/Time:", format(Sys.time()), "\n")
cat(" R version:", R.version.string, "\n")
cat(" Project root:", here(), "\n")
cat("==============================================\n\n")
#------------------------------------------------------------------------------
# RUN ANALYSIS SCRIPTS IN ORDER
#------------------------------------------------------------------------------
cat("\n>>> Running 01_import_clean.R...\n")
source(here("code", "01_import_clean.R"))
cat("\n>>> Running 02_merge_construct.R...\n")
source(here("code", "02_merge_construct.R"))
cat("\n>>> Running 03_descriptive_stats.R...\n")
source(here("code", "03_descriptive_stats.R"))
cat("\n>>> Running 04_main_analysis.R...\n")
source(here("code", "04_main_analysis.R"))
cat("\n>>> Running 05_robustness.R...\n")
source(here("code", "05_robustness.R"))
#------------------------------------------------------------------------------
# COMPLETION
#------------------------------------------------------------------------------
cat("\n==============================================\n")
cat(" MASTER SCRIPT COMPLETED SUCCESSFULLY\n")
cat(" End time:", format(Sys.time()), "\n")
cat(" All outputs saved to:", here("output"), "\n")
cat("==============================================\n")
sink()
"""
================================================================================
MASTER SCRIPT: Minimum Wage and Employment
Paper: "The Employment Effects of Minimum Wage Increases"
Authors: Smith, J. and Jones, M.
Version: 1.0
Date: January 2025
INSTRUCTIONS:
1. Navigate to project folder in terminal
2. Run: python code/00_master.py
REQUIRED PACKAGES: pandas, numpy, statsmodels, matplotlib, linearmodels
Install with: pip install -r requirements.txt
================================================================================
"""
import sys
from pathlib import Path
from datetime import datetime
#------------------------------------------------------------------------------
# DEFINE PATHS
#------------------------------------------------------------------------------
# Get project root (parent of code folder)
ROOT = Path(__file__).resolve().parent.parent
PATHS = {
    'raw': ROOT / 'data' / 'raw',
    'processed': ROOT / 'data' / 'processed',
    'tables': ROOT / 'output' / 'tables',
    'figures': ROOT / 'output' / 'figures',
    'logs': ROOT / 'output' / 'logs',
    'code': ROOT / 'code'
}
# Create directories if they don't exist
for path in PATHS.values():
    path.mkdir(parents=True, exist_ok=True)
#------------------------------------------------------------------------------
# LOGGING SETUP
#------------------------------------------------------------------------------
log_file = PATHS['logs'] / f"master_log_{datetime.now().strftime('%Y-%m-%d_%H-%M-%S')}.txt"
class Logger:
    def __init__(self, filename):
        self.terminal = sys.stdout
        self.log = open(filename, 'w')

    def write(self, message):
        self.terminal.write(message)
        self.log.write(message)

    def flush(self):
        self.terminal.flush()
        self.log.flush()

sys.stdout = Logger(log_file)
#------------------------------------------------------------------------------
# START
#------------------------------------------------------------------------------
print("=" * 50)
print(" MASTER SCRIPT STARTED")
print(f" Date/Time: {datetime.now()}")
print(f" Python version: {sys.version}")
print(f" Project root: {ROOT}")
print("=" * 50)
#------------------------------------------------------------------------------
# RUN ANALYSIS SCRIPTS IN ORDER
#------------------------------------------------------------------------------
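scripts = [
    '01_import_clean.py',
    '02_merge_construct.py',
    '03_descriptive_stats.py',
    '04_main_analysis.py',
    '05_robustness.py'
]
for script in scripts:
    print(f"\n>>> Running {script}...")
    script_path = PATHS['code'] / script
    # Compile with the real filename so tracebacks point at the right script,
    # and run in a fresh namespace so scripts do not share variables by accident
    code = compile(script_path.read_text(), str(script_path), 'exec')
    exec(code, {'__name__': '__main__', '__file__': str(script_path)})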
#------------------------------------------------------------------------------
# COMPLETION
#------------------------------------------------------------------------------
print("\n" + "=" * 50)
print(" MASTER SCRIPT COMPLETED SUCCESSFULLY")
print(f" End time: {datetime.now()}")
print(f" All outputs saved to: {ROOT / 'output'}")
print("=" * 50)
8.5 Documentation: The README
Every project needs a README answering: What is this? What data? What software? How to run?
# Project Title
## Overview
Brief description of the research question and findings.
## Data
- **Source**: Where data comes from (e.g., "CPS via IPUMS")
- **Access**: How to obtain it (include URLs or instructions)
- **Files**: List of data files and what they contain
## Software Requirements
- Stata 17 or later
- Required packages: estout, reghdfe, ftools
## Instructions
1. Clone or download this repository
2. Open `code/00_master.do`
3. Set the root path on line 25 to your project folder
4. Run the master file
## File Structure
[Include folder tree here]
## Output
Running the master script will generate:
- Tables 1-5 in `output/tables/`
- Figures 1-3 in `output/figures/`
## Contact
Your Name (email@university.edu)
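The File Structure section of the README template asks for a folder tree. Rather than typing it by hand, you could generate it; the sketch below is one simple way, using only the standard library. The function name `folder_tree` and the four-space indentation are arbitrary choices, and the demo builds a tiny throwaway project instead of reading a real one.

```python
import tempfile
from pathlib import Path

def folder_tree(root, prefix=""):
    """Return an indented listing of root's contents as lines, folders first."""
    entries = sorted(Path(root).iterdir(), key=lambda p: (p.is_file(), p.name))
    lines = []
    for entry in entries:
        lines.append(f"{prefix}{entry.name}{'/' if entry.is_dir() else ''}")
        if entry.is_dir():
            lines.extend(folder_tree(entry, prefix + "    "))
    return lines

# Demo on a tiny throwaway project
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "code").mkdir()
    (root / "code" / "00_master.do").write_text("")
    (root / "data" / "raw").mkdir(parents=True)
    (root / "README.md").write_text("")
    tree = folder_tree(root)
print("\n".join(tree))
```

Pasting the printed lines into the README keeps the documented structure in step with the actual folders.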
8.6 Automated LaTeX Tables
Never copy-paste results into tables manually. Export publication-ready LaTeX tables directly from your statistical software.
Manual workflow (avoid):
- Run regression in Stata
- Copy numbers from output window
- Paste into Excel
- Format in Excel
- Copy to Word or manually type into LaTeX
Problems: Error-prone, time-consuming, not reproducible, and painful to redo whenever results change
Automated workflow (recommended):
- Run regression in Stata
- Export to .tex file with one command
- LaTeX imports table automatically
- Results change? Just re-run the code!
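The core of the automated workflow is that table rows are generated from the estimates rather than typed. The Python sketch below shows that idea with the standard library only; the packages introduced in this section (esttab, modelsummary, stargazer) do the same job with far more polish. The estimates are hypothetical placeholder numbers, and `stars` and `latex_rows` are illustrative helper names.

```python
def stars(p):
    """Significance stars using the 0.10 / 0.05 / 0.01 thresholds from this section."""
    return "***" if p < 0.01 else "**" if p < 0.05 else "*" if p < 0.10 else ""

def latex_rows(results):
    """Turn {label: (coef, se, p)} into LaTeX rows: coefficient, then SE in parentheses."""
    rows = []
    for label, (coef, se, p) in results.items():
        rows.append(f"{label} & {coef:.3f}{stars(p)} \\\\")
        rows.append(f" & ({se:.3f}) \\\\")
    return rows

# Hypothetical estimates; in practice these come from the fitted model object
results = {
    "Education (years)": (0.0891, 0.0031, 0.0001),
    "Age": (0.0121, 0.0060, 0.044),
}
table = "\n".join(
    ["\\begin{tabular}{lc}", "\\hline", *latex_rows(results), "\\hline", "\\end{tabular}"]
)
print(table)
```

Writing `table` to a file under output/tables (for example with `Path.write_text`) means the paper picks up new numbers the next time it is compiled.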
Stata: esttab / estout
The estout package is the gold standard for exporting Stata results to LaTeX.
* Install the package (once)
ssc install estout, replace
* Run your regressions and store estimates
reg earnings education age, robust
estimates store m1
reg earnings education age experience, robust
estimates store m2
reg earnings education age experience i.industry, robust
estimates store m3
* Export to LaTeX
esttab m1 m2 m3 using "$tables/main_results.tex", ///
replace ///
label /// Use variable labels
b(3) se(3) /// 3 decimal places
star(* 0.10 ** 0.05 *** 0.01) /// Significance stars
title("Effect of Education on Earnings") ///
mtitles("Basic" "Controls" "Industry FE") /// Column titles
scalars("r2 R-squared" "N Observations") ///
nonotes ///
addnotes("Standard errors in parentheses." ///
"* p<0.10, ** p<0.05, *** p<0.01")
This produces a complete LaTeX table that looks like:
| | (1) Basic | (2) Controls | (3) Industry FE |
|---|---|---|---|
| Education (years) | 0.089*** | 0.074*** | 0.068*** |
| | (0.003) | (0.003) | (0.004) |
| Age | 0.012*** | 0.008*** | 0.007*** |
| | (0.001) | (0.001) | (0.001) |
| R-squared | 0.142 | 0.187 | 0.234 |
| Observations | 45,231 | 45,231 | 45,231 |

Standard errors in parentheses. * p<0.10, ** p<0.05, *** p<0.01
R: modelsummary
library(modelsummary)
library(fixest)
# Run regressions
m1 <- feols(earnings ~ education + age, data = df)
m2 <- feols(earnings ~ education + age + experience, data = df)
m3 <- feols(earnings ~ education + age + experience | industry, data = df)
# Export to LaTeX
modelsummary(
list("Basic" = m1, "Controls" = m2, "Industry FE" = m3),
output = here("output", "tables", "main_results.tex"),
stars = c('*' = 0.10, '**' = 0.05, '***' = 0.01),
coef_rename = c("education" = "Education (years)", "age" = "Age"),
gof_omit = "AIC|BIC|Log",
title = "Effect of Education on Earnings",
notes = "Standard errors in parentheses."
)
Python: stargazer
from stargazer.stargazer import Stargazer
import statsmodels.api as sm
# Run regressions
X1 = sm.add_constant(df[['education', 'age']])
m1 = sm.OLS(df['earnings'], X1).fit(cov_type='HC1')
X2 = sm.add_constant(df[['education', 'age', 'experience']])
m2 = sm.OLS(df['earnings'], X2).fit(cov_type='HC1')
# Export to LaTeX
stargazer = Stargazer([m1, m2])
stargazer.title("Effect of Education on Earnings")
stargazer.custom_columns(["Basic", "Controls"], [1, 1])  # one model per column label
with open(PATHS['tables'] / 'main_results.tex', 'w') as f:
    f.write(stargazer.render_latex())
Including Tables in LaTeX
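In your paper's .tex file, input the generated table. The wrapper below assumes the exported file contains only the tabular itself; if the export already includes its own table environment (esttab typically adds one when title() is specified), input the file directly, because LaTeX does not allow nested floats: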
\begin{table}[htbp]
\centering
\input{../output/tables/main_results.tex}
\end{table}
8.7 One-Click Pipeline: Full Integration
The ultimate goal is a workflow where one command reproduces your entire paper—data cleaning, analysis, tables, figures, and the compiled PDF.
The Dream Workflow
One command. Fully reproducible. No manual intervention.
Option 1: Makefile (Most Powerful)
A Makefile specifies dependencies and only re-runs what's needed:
# Makefile for research project
# Run with: make all
.PHONY: all clean paper
# Default target
all: paper
# Note: recipe lines (the commands under each rule) must start with a tab, not spaces
# Data processing (depends on raw data)
data/processed/clean_data.dta: data/raw/cps_2019.dta code/01_import_clean.do
	stata-mp -b do code/01_import_clean.do
# Analysis (depends on processed data)
output/tables/main_results.tex: data/processed/clean_data.dta code/04_main_analysis.do
	stata-mp -b do code/04_main_analysis.do
output/figures/figure1.pdf: data/processed/clean_data.dta code/04_main_analysis.do
	stata-mp -b do code/04_main_analysis.do
# Paper (depends on tables and figures)
paper/main.pdf: paper/main.tex output/tables/main_results.tex output/figures/figure1.pdf
	cd paper && pdflatex main.tex && bibtex main && pdflatex main.tex && pdflatex main.tex
paper: paper/main.pdf
# Clean all generated files
clean:
	rm -f data/processed/*.dta
	rm -f output/tables/*.tex
	rm -f output/figures/*.pdf
	rm -f paper/*.pdf paper/*.aux paper/*.log paper/*.bbl paper/*.blg
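The rule make applies to decide what "needs" re-running is simple: a target is rebuilt if it does not exist or if any of its dependencies was modified more recently. The Python sketch below mimics that timestamp check for a single target; it is an illustration of the rule, not how make is implemented, and `needs_rebuild` is a hypothetical helper. The demo sets file timestamps explicitly so the outcome does not depend on timing.

```python
import os
import tempfile
from pathlib import Path

def needs_rebuild(target, dependencies):
    """True if target is missing or older than any dependency."""
    target = Path(target)
    if not target.exists():
        return True
    target_time = target.stat().st_mtime
    return any(Path(dep).stat().st_mtime > target_time for dep in dependencies)

with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "04_main_analysis.do"
    table = Path(tmp) / "main_results.tex"
    script.write_text("* analysis code\n")
    missing = needs_rebuild(table, [script])   # table not built yet
    table.write_text("% table\n")
    os.utime(script, (1_000, 1_000))           # script last edited long ago
    os.utime(table, (2_000, 2_000))            # table built afterwards
    up_to_date = not needs_rebuild(table, [script])
    os.utime(script, (3_000, 3_000))           # script edited after the build
    stale = needs_rebuild(table, [script])
print(missing, up_to_date, stale)
```

This is why each rule in the Makefile lists both the input data and the do-file: editing either one makes the output stale.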
Option 2: Shell Script (Simpler)
#!/bin/bash
# run_all.sh - Reproduce entire paper
set -e # Stop at the first command that fails
echo "Starting full pipeline..."
echo "========================="
# Run Stata analysis
echo "Step 1: Running Stata analysis..."
stata-mp -b do code/00_master.do
# Run Python visualizations (if any)
echo "Step 2: Running Python scripts..."
python code/06_visualizations.py
# Compile LaTeX paper
echo "Step 3: Compiling paper..."
cd paper
pdflatex main.tex
bibtex main
pdflatex main.tex
pdflatex main.tex
echo "========================="
echo "Done! Paper saved to paper/main.pdf"
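The same sequence can be driven from Python if you prefer one language for orchestration. The sketch below runs steps in order and stops at the first one that fails. It relies on each program returning a non-zero exit code on failure, which is worth verifying for your own tools. The function name `run_pipeline` is hypothetical, and the demo uses stand-in commands so it runs on any machine; the real commands from the shell script are shown in a comment.

```python
import subprocess
import sys

def run_pipeline(steps):
    """Run (label, command) pairs in order; stop at the first non-zero exit code."""
    completed = []
    for label, command in steps:
        print(f">>> {label}")
        result = subprocess.run(command)
        if result.returncode != 0:
            print(f"FAILED: {label} (exit code {result.returncode})")
            return completed, label
        completed.append(label)
    return completed, None

# Stand-in commands so the sketch runs anywhere. In a real project these would be
# e.g. ["stata-mp", "-b", "do", "code/00_master.do"] and ["latexmk", "-pdf", "main.tex"]
demo_steps = [
    ("analysis", [sys.executable, "-c", "print('analysis ok')"]),
    ("figures", [sys.executable, "-c", "raise SystemExit(1)"]),  # simulated failure
    ("paper", [sys.executable, "-c", "print('never reached')"]),
]
done, failed = run_pipeline(demo_steps)
print("Completed:", done, "| failed at:", failed)
```

Stopping early matters because a paper compiled after a failed analysis step would silently contain stale tables.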
Option 3: Overleaf + Dropbox/GitHub Sync
For teams using Overleaf, you can sync your output folder:
- Link Overleaf to GitHub: Overleaf can pull from a GitHub repository
- Structure: Put your output/ folder in the repo
- Workflow:
  - Run analysis locally → tables/figures saved to output/
  - Push to GitHub
  - Pull in Overleaf → paper recompiles with new results
# After running your analysis
git add output/tables/*.tex output/figures/*.pdf
git commit -m "Update results"
git push
# Then pull the new commit in Overleaf (its GitHub sync is triggered manually)
8.8 LaTeX Paper Structure
Large papers should be split into sections for easier editing and collaboration. Use a master file that inputs each section.
Project Structure
Master File (main.tex)
% main.tex - Master file for paper
\documentclass[12pt]{article}
% Load preamble (packages, settings, custom commands)
\input{preamble}
\title{The Employment Effects of Minimum Wage Increases}
\author{Jane Smith\thanks{University of Example. Email: jsmith@example.edu}}
\date{\today}
\begin{document}
\maketitle
\begin{abstract}
Your abstract here...
\end{abstract}
% Main content - each section in separate file
\input{sections/01_introduction}
\input{sections/02_literature}
\input{sections/03_data}
\input{sections/04_methodology}
\input{sections/05_results}
\input{sections/06_conclusion}
% Bibliography
\bibliographystyle{aer}
\bibliography{bibliography}
% Appendix
\appendix
\input{sections/appendix}
\end{document}
Preamble File (preamble.tex)
% preamble.tex - Packages and settings
% Essential packages
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{amsmath,amssymb}
\usepackage{graphicx}
\usepackage{booktabs} % Professional tables
\usepackage{natbib} % Citations
\usepackage{hyperref} % Clickable links
\usepackage{setspace} % Line spacing
\usepackage[margin=1in]{geometry}
% For including Stata/R output tables
\usepackage{threeparttable} % Table notes
\usepackage{tabularx} % Flexible columns
\usepackage{siunitx} % Number alignment
% Path to figures and tables (relative to main.tex)
\graphicspath{{../output/figures/}{figures/}}
% Custom commands for convenience
\newcommand{\tabpath}{../output/tables/}
% Double spacing for submission
\doublespacing
Including Generated Tables
% In sections/05_results.tex
\section{Results}
Table \ref{tab:main} presents our main results...
\begin{table}[htbp]
\centering
\caption{Effect of Minimum Wage on Employment}
\label{tab:main}
\input{\tabpath main_results.tex}
\end{table}
The coefficient on minimum wage in column (3) suggests that...
Compiling the Paper
# From the paper/ directory
cd paper
# Full compilation with bibliography
pdflatex main.tex # First pass
bibtex main # Process citations
pdflatex main.tex # Resolve references
pdflatex main.tex # Final pass
# Or use latexmk (handles everything automatically)
latexmk -pdf main.tex
A LaTeX-aware editor can speed up this cycle with features such as:
- Auto-compilation on save
- PDF preview side-by-side
- Syntax highlighting and snippets
- Forward/inverse search (click in PDF → go to code)
8.9 Replicating Published Papers
Where to find replication materials:
- AEA Data Repository: openicpsr.org
- Harvard Dataverse: dataverse.harvard.edu
- Author websites and GitHub
Steps to replicate:
- Find and download replication package
- Read the paper and README carefully
- Install required software/packages
- Run the master script
- Compare output to published tables
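The final comparison step can be partly automated. The sketch below flags estimates that differ from the published values by more than a rounding tolerance. All numbers are hypothetical, `compare_estimates` is an illustrative helper, and the tolerance of 0.0005 assumes the paper reports three decimal places.

```python
import math

def compare_estimates(published, replicated, tolerance=0.0005):
    """Return {name: (published, replicated)} for estimates that differ by more than tolerance."""
    mismatches = {}
    for name, pub in published.items():
        rep = replicated.get(name)
        if rep is None or not math.isclose(pub, rep, abs_tol=tolerance):
            mismatches[name] = (pub, rep)
    return mismatches

# Hypothetical numbers: published values are rounded to three decimals
published = {"education": 0.089, "age": 0.012}
replicated = {"education": 0.08914, "age": 0.01302}

mismatches = compare_estimates(published, replicated)
print(mismatches)  # {'age': (0.012, 0.01302)}
```

A mismatch is not necessarily an error in the paper; it is a prompt to check software versions, sample restrictions, and the README before drawing conclusions.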
Gentzkow, M. & Shapiro, J. (2014). "Code and Data for the Social Sciences"