8  Replicability & Reproducibility

~6 hours | Research Best Practices | Intermediate

Learning Objectives

  • Understand the difference between replicability and reproducibility
  • Organize research projects with clear folder structures
  • Write documentation that enables others to run your code
  • Create publication-ready replication packages
  • Successfully replicate published economics papers

8.1 Definitions

Concept | Definition | Question
Reproducibility | Same data + same code = same results | Can I re-run this and get identical output?
Replicability | New data + same methods = consistent findings | Does this hold in a different sample?

Why This Matters: Major economics journals (AER, QJE, Econometrica) now require replication packages. A 2019 study found that only 37% of published papers could be successfully reproduced. Good practices from day one save enormous time later and increase your credibility as a researcher.
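
The reproducibility question ("can I re-run this and get identical output?") can be checked mechanically. The following is a minimal sketch, not part of any journal's tooling: the helper names `hash_outputs` and `changed_files` are hypothetical, and a temporary folder stands in for `output/`.

```python
import hashlib
import tempfile
from pathlib import Path


def hash_outputs(folder):
    """Map each file under `folder` (by relative path) to its SHA-256 digest."""
    folder = Path(folder)
    return {
        path.relative_to(folder).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(folder.rglob("*"))
        if path.is_file()
    }


def changed_files(before, after):
    """Return files whose content differs, or that exist in only one run."""
    return sorted(name for name in before.keys() | after.keys()
                  if before.get(name) != after.get(name))


# Demo on a throwaway folder standing in for output/
with tempfile.TemporaryDirectory() as tmp:
    table = Path(tmp) / "tables" / "main_results.tex"
    table.parent.mkdir()
    table.write_text("0.089***")
    first_run = hash_outputs(tmp)
    same = changed_files(first_run, hash_outputs(tmp))       # identical re-run
    table.write_text("0.090***")
    different = changed_files(first_run, hash_outputs(tmp))  # a result changed

print(same)
print(different)
```

An empty list means the two runs are byte-identical. Files that embed timestamps, such as logs, will differ between runs, so in practice this kind of check is most useful on tables.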

8.2 Project Organization

A well-organized project folder is the foundation of reproducible research. Here's what a complete research project should look like:

📁 minimum-wage-study/
├── 📄 README.md ← Start here! Instructions for everything
├── 📄 LICENSE ← How others can use your code
├── 📁 data/
│   ├── 📁 raw/ ← NEVER modify these files!
│   │   ├── cps_2019.dta
│   │   └── state_minimum_wages.csv
│   ├── 📁 processed/ ← Generated by code
│   └── 📄 data_dictionary.md ← Variable definitions
├── 📁 code/
│   ├── 00_master.do ← Run THIS to reproduce everything
│   ├── 01_import_clean.do
│   ├── 02_merge_construct.do
│   ├── 03_descriptive_stats.do
│   ├── 04_main_analysis.do
│   └── 05_robustness.do
├── 📁 output/
│   ├── 📁 tables/ ← .tex files for LaTeX
│   ├── 📁 figures/ ← .pdf or .png graphs
│   └── 📁 logs/ ← Stata log files
└── 📁 paper/
    ├── main.tex ← Master LaTeX file
    ├── sections/
    └── bibliography.bib
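
A layout like this can be created in one step at the start of a project. The sketch below is one possible way to do it; the function name `create_project` and the placeholder README text are assumptions for illustration, while the folder names come from the tree above.

```python
import tempfile
from pathlib import Path

# Folder names taken from the project tree above
PROJECT_FOLDERS = [
    "data/raw", "data/processed", "code",
    "output/tables", "output/figures", "output/logs",
    "paper/sections",
]


def create_project(root):
    """Create the standard folders and a placeholder README under `root`."""
    root = Path(root)
    for folder in PROJECT_FOLDERS:
        (root / folder).mkdir(parents=True, exist_ok=True)
    readme = root / "README.md"
    if not readme.exists():          # never overwrite an existing README
        readme.write_text("# Project Title\n")
    return root


# Demo in a temporary location so nothing is left behind
with tempfile.TemporaryDirectory() as tmp:
    project = create_project(Path(tmp) / "minimum-wage-study")
    created = sorted(p.relative_to(project).as_posix()
                     for p in project.rglob("*") if p.is_dir())

print(created)
```

Because `mkdir` is called with `exist_ok=True`, running the function again on an existing project is harmless.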

Key Principles

  • Raw data is sacred: Never modify original files—all cleaning happens in code
  • One master script: Runs entire analysis end-to-end with a single click
  • Relative paths only: Never use absolute paths like C:\Users\...
  • Number your scripts: Makes execution order obvious (00, 01, 02...)
  • Version everything: Use Git (see Module 9)
  • Separate outputs: Tables, figures, and logs in their own folders
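
The "relative paths only" principle is easy to violate by accident, so it can help to scan scripts for quoted absolute paths before sharing them. This is a rough heuristic sketch; the pattern and the function name `find_absolute_paths` are illustrative assumptions and will not catch every case.

```python
import re

# Heuristic: a quoted string that starts with a Windows drive letter,
# with /Users/ or /home/, or with ~/ is probably an absolute path
ABSOLUTE_PATH = re.compile(r"""["'](?:[A-Za-z]:[\\/]|/(?:Users|home)/|~/)""")


def find_absolute_paths(source):
    """Return (line_number, line) pairs for lines that quote an absolute path."""
    return [(number, line.strip())
            for number, line in enumerate(source.splitlines(), start=1)
            if ABSOLUTE_PATH.search(line)]


# Example script text (Stata syntax, but the check is language-agnostic)
example = '''use "data/raw/cps_2019.dta", clear
use "C:/Users/john/Research/data.dta", clear
cd "/Users/giulia/Projects/minimum-wage-study"
'''

hits = find_absolute_paths(example)
for number, line in hits:
    print(f"line {number}: {line}")
```

The one place where a hit is expected is the root-path setting in the master script; anywhere else it signals code that will break on another machine.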

8.3 Understanding File Paths

File paths tell your computer where to find files. Understanding them is essential for writing code that works on any machine.

Absolute vs. Relative Paths

Type | Example | Verdict
Absolute Path | C:\Users\Maria\Projects\thesis\data\raw\survey.dta | ❌ Never use in code
Relative Path | data/raw/survey.dta | ✓ Always use this

A relative path starts from your current location (the "working directory"). Think of it like giving directions: "go to the data folder, then the raw subfolder, then find survey.dta."

Path Syntax: Mac/Linux vs. Windows

# Mac/Linux use forward slashes /
/Users/giulia/Projects/thesis/          # Absolute path
data/raw/survey.csv                      # Relative path

# Navigate up one folder with ..
../other_project/data.csv               # Go up, then into other_project

# Home directory shortcut
~/Projects/thesis/                       # ~ means your home folder

# Windows traditionally uses backslashes \
C:\Users\Giulia\Projects\thesis\        # Absolute path
data\raw\survey.csv                      # Relative path

# But forward slashes work in most modern software!
data/raw/survey.csv                      # This works in Stata, Python, R

# Navigate up one folder with ..
..\other_project\data.csv               # Go up, then into other_project

Pro Tip: Use forward slashes (/) everywhere, even on Windows. Stata, Python, and R all understand them, making your code cross-platform compatible.
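
The difference between the two separators can be seen directly with Python's standard `pathlib` module, which can model Windows and Mac/Linux path rules on any machine. A short illustration using the example paths from this section:

```python
from pathlib import PurePosixPath, PureWindowsPath

# Windows accepts both separators: these two spellings are the same path
backslash = PureWindowsPath(r"data\raw\survey.csv")
forward = PureWindowsPath("data/raw/survey.csv")
same_on_windows = backslash == forward
as_forward_slashes = backslash.as_posix()

# Mac/Linux treat a backslash as an ordinary character, not a separator,
# so the backslash spelling is read as ONE long file name
posix_backslash_parts = PurePosixPath(r"data\raw\survey.csv").parts
posix_forward_parts = PurePosixPath("data/raw/survey.csv").parts

# Absolute vs. relative
windows_absolute = PureWindowsPath("C:/Users/Giulia/Projects/thesis").is_absolute()
relative_up = PurePosixPath("../other_project/data.csv").is_absolute()

print(same_on_windows, as_forward_slashes)
print(posix_backslash_parts, posix_forward_parts)
print(windows_absolute, relative_up)
```

This is why forward slashes are the safe choice: they are understood under both sets of rules, while backslashes only work on Windows.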

Setting the Working Directory

The working directory is your "home base"—all relative paths start from here. Set it once at the beginning of your master script.

* Check current working directory
pwd

* Set working directory (do this ONCE in master script)
cd "/Users/giulia/Projects/minimum-wage-study"

* Now all paths are relative to this location
use "data/raw/cps_2019.dta", clear
export delimited "data/processed/clean_data.csv", replace

# R
# Check current working directory
getwd()

# Set working directory
setwd("/Users/giulia/Projects/minimum-wage-study")

# Better: use the here package (finds the project root automatically)
library(here)
library(readr)   # provides read_csv()
data <- read_csv(here("data", "raw", "cps_2019.csv"))

# Or use RStudio Projects (.Rproj files), which set the working directory when opened

# Python
import os
from pathlib import Path

# Check current working directory
print(os.getcwd())

# Set working directory
os.chdir("/Users/giulia/Projects/minimum-wage-study")

# Better: build paths with pathlib, starting from this script's own location
project_root = Path(__file__).resolve().parent.parent  # Go up from code/ folder
data_path = project_root / "data" / "raw" / "cps_2019.csv"
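
R's here package locates the project root by searching upward for a marker file. A similar idea can be sketched in Python as shown below. This is an illustrative approximation rather than how here is implemented; the function name `find_project_root` and the choice of `README.md` as the marker are assumptions.

```python
import tempfile
from pathlib import Path


def find_project_root(start, marker="README.md"):
    """Walk upward from `start` until a folder containing `marker` is found."""
    start = Path(start).resolve()
    for folder in [start, *start.parents]:
        if (folder / marker).exists():
            return folder
    raise FileNotFoundError(f"No {marker} found in {start} or any parent folder")


# Demo: a temporary project with a README at the root and a code/ subfolder
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve() / "minimum-wage-study"
    (root / "code").mkdir(parents=True)
    (root / "README.md").write_text("# Minimum wage study\n")

    found = find_project_root(root / "code")   # start the search inside code/
    found_name = found.name
    data_path = (found / "data" / "raw" / "cps_2019.csv").relative_to(root).as_posix()

print(found_name)
print(data_path)
```

With a helper like this, scripts work no matter which subfolder they are launched from, and no call to `os.chdir` is needed.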

8.4 The Master Script

The master script is the single entry point that reproduces your entire analysis. A reviewer or co-author should be able to run this one file and regenerate all results.

What to look for in the code below:
  • Clear header with project info and instructions
  • One location to set the root path (the only absolute path in the entire project)
  • Global macros for all folder paths
  • Sequential execution of all sub-scripts
  • Timestamps and logging

/*==============================================================================
                    MASTER DO-FILE: Minimum Wage and Employment

    Paper:      "The Employment Effects of Minimum Wage Increases"
    Authors:    Smith, J. and Jones, M.
    Version:    1.0
    Date:       January 2025

    INSTRUCTIONS:
    1. Set the root path below to your project folder location
    2. Ensure all required packages are installed (see below)
    3. Run this entire file to reproduce all results

    REQUIRED PACKAGES: estout, reghdfe, ftools, coefplot
    Install each one from SSC, e.g.: ssc install estout, replace
================================================================================*/

clear all
set more off
cap log close

*------------------------------------------------------------------------------*
*  SET ROOT PATH - MODIFY THIS LINE ONLY
*------------------------------------------------------------------------------*
* Users should change this path to their own project folder location
if "`c(username)'" == "giulia" {
    global root "/Users/giulia/Projects/minimum-wage-study"
}
else if "`c(username)'" == "john" {
    global root "C:/Users/john/Research/minimum-wage-study"
}
else {
    * Default: assume current directory is project root
    global root "`c(pwd)'"
}

*------------------------------------------------------------------------------*
*  DEFINE FOLDER PATHS (do not modify)
*------------------------------------------------------------------------------*
global data     "$root/data"
global raw      "$data/raw"
global processed "$data/processed"
global code     "$root/code"
global output   "$root/output"
global tables   "$output/tables"
global figures  "$output/figures"
global logs     "$output/logs"

*------------------------------------------------------------------------------*
*  CREATE FOLDERS IF THEY DON'T EXIST
*------------------------------------------------------------------------------*
cap mkdir "$processed"
cap mkdir "$output"
cap mkdir "$tables"
cap mkdir "$figures"
cap mkdir "$logs"

*------------------------------------------------------------------------------*
*  START LOG FILE
*------------------------------------------------------------------------------*
local datetime: di %tcCCYY-NN-DD-HH-MM-SS clock("`c(current_date)' `c(current_time)'","DMYhms")
log using "$logs/master_log_`datetime'.txt", text replace

di "=============================================="
di "  MASTER SCRIPT STARTED"
di "  Date/Time: `c(current_date)' `c(current_time)'"
di "  Stata version: `c(stata_version)'"
di "  Root directory: $root"
di "=============================================="

*------------------------------------------------------------------------------*
*  RUN ANALYSIS SCRIPTS IN ORDER
*------------------------------------------------------------------------------*
di _n ">>> Running 01_import_clean.do..."
do "$code/01_import_clean.do"

di _n ">>> Running 02_merge_construct.do..."
do "$code/02_merge_construct.do"

di _n ">>> Running 03_descriptive_stats.do..."
do "$code/03_descriptive_stats.do"

di _n ">>> Running 04_main_analysis.do..."
do "$code/04_main_analysis.do"

di _n ">>> Running 05_robustness.do..."
do "$code/05_robustness.do"

*------------------------------------------------------------------------------*
*  COMPLETION
*------------------------------------------------------------------------------*
di _n "=============================================="
di "  MASTER SCRIPT COMPLETED SUCCESSFULLY"
di "  End time: `c(current_date)' `c(current_time)'"
di "  All outputs saved to: $output"
di "=============================================="

log close

#===============================================================================
#                   MASTER SCRIPT: Minimum Wage and Employment
#
#   Paper:      "The Employment Effects of Minimum Wage Increases"
#   Authors:    Smith, J. and Jones, M.
#   Version:    1.0
#   Date:       January 2025
#
#   INSTRUCTIONS:
#   1. Open the .Rproj file in RStudio (sets working directory automatically)
#   2. Run this entire script to reproduce all results
#
#   REQUIRED PACKAGES: tidyverse, fixest, modelsummary, here, haven
#===============================================================================

# Clear environment
rm(list = ls())

# Load packages (install if needed)
required_packages <- c("tidyverse", "fixest", "modelsummary", "here", "haven")
new_packages <- required_packages[!(required_packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)

library(tidyverse)
library(fixest)
library(modelsummary)
library(here)
library(haven)

#------------------------------------------------------------------------------
#  DEFINE PATHS (using here package - no manual path setting needed!)
#------------------------------------------------------------------------------
paths <- list(
  raw       = here("data", "raw"),
  processed = here("data", "processed"),
  tables    = here("output", "tables"),
  figures   = here("output", "figures"),
  logs      = here("output", "logs")
)

# Create directories if they don't exist
walk(paths, ~dir.create(.x, recursive = TRUE, showWarnings = FALSE))

#------------------------------------------------------------------------------
#  START LOG
#------------------------------------------------------------------------------
sink(file.path(paths$logs, paste0("master_log_", Sys.Date(), ".txt")))
cat("==============================================\n")
cat("  MASTER SCRIPT STARTED\n")
cat("  Date/Time:", format(Sys.time()), "\n")
cat("  R version:", R.version.string, "\n")
cat("  Project root:", here(), "\n")
cat("==============================================\n\n")

#------------------------------------------------------------------------------
#  RUN ANALYSIS SCRIPTS IN ORDER
#------------------------------------------------------------------------------
cat("\n>>> Running 01_import_clean.R...\n")
source(here("code", "01_import_clean.R"))

cat("\n>>> Running 02_merge_construct.R...\n")
source(here("code", "02_merge_construct.R"))

cat("\n>>> Running 03_descriptive_stats.R...\n")
source(here("code", "03_descriptive_stats.R"))

cat("\n>>> Running 04_main_analysis.R...\n")
source(here("code", "04_main_analysis.R"))

cat("\n>>> Running 05_robustness.R...\n")
source(here("code", "05_robustness.R"))

#------------------------------------------------------------------------------
#  COMPLETION
#------------------------------------------------------------------------------
cat("\n==============================================\n")
cat("  MASTER SCRIPT COMPLETED SUCCESSFULLY\n")
cat("  End time:", format(Sys.time()), "\n")
cat("  All outputs saved to:", here("output"), "\n")
cat("==============================================\n")
sink()

"""
================================================================================
                    MASTER SCRIPT: Minimum Wage and Employment

    Paper:      "The Employment Effects of Minimum Wage Increases"
    Authors:    Smith, J. and Jones, M.
    Version:    1.0
    Date:       January 2025

    INSTRUCTIONS:
    1. Navigate to project folder in terminal
    2. Run: python code/00_master.py

    REQUIRED PACKAGES: pandas, numpy, statsmodels, matplotlib, linearmodels
    Install with: pip install -r requirements.txt
================================================================================
"""

import os
import sys
from pathlib import Path
from datetime import datetime

#------------------------------------------------------------------------------
#  DEFINE PATHS
#------------------------------------------------------------------------------
# Get project root (parent of code folder)
ROOT = Path(__file__).resolve().parent.parent

PATHS = {
    'raw': ROOT / 'data' / 'raw',
    'processed': ROOT / 'data' / 'processed',
    'tables': ROOT / 'output' / 'tables',
    'figures': ROOT / 'output' / 'figures',
    'logs': ROOT / 'output' / 'logs',
    'code': ROOT / 'code'
}

# Create directories if they don't exist
for path in PATHS.values():
    path.mkdir(parents=True, exist_ok=True)

#------------------------------------------------------------------------------
#  LOGGING SETUP
#------------------------------------------------------------------------------
log_file = PATHS['logs'] / f"master_log_{datetime.now().strftime('%Y-%m-%d_%H-%M-%S')}.txt"

class Logger:
    def __init__(self, filename):
        self.terminal = sys.stdout
        self.log = open(filename, 'w')

    def write(self, message):
        self.terminal.write(message)
        self.log.write(message)

    def flush(self):
        self.terminal.flush()
        self.log.flush()

sys.stdout = Logger(log_file)

#------------------------------------------------------------------------------
#  START
#------------------------------------------------------------------------------
print("=" * 50)
print("  MASTER SCRIPT STARTED")
print(f"  Date/Time: {datetime.now()}")
print(f"  Python version: {sys.version}")
print(f"  Project root: {ROOT}")
print("=" * 50)

#------------------------------------------------------------------------------
#  RUN ANALYSIS SCRIPTS IN ORDER
#------------------------------------------------------------------------------
scripts = [
    '01_import_clean.py',
    '02_merge_construct.py',
    '03_descriptive_stats.py',
    '04_main_analysis.py',
    '05_robustness.py'
]

for script in scripts:
    print(f"\n>>> Running {script}...")
    script_path = PATHS['code'] / script
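    # Scripts run in this namespace, so PATHS and ROOT are available to them;
    # compile() attaches the file name so tracebacks point at the right script
    exec(compile(script_path.read_text(), str(script_path), "exec"))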

#------------------------------------------------------------------------------
#  COMPLETION
#------------------------------------------------------------------------------
print("\n" + "=" * 50)
print("  MASTER SCRIPT COMPLETED SUCCESSFULLY")
print(f"  End time: {datetime.now()}")
print(f"  All outputs saved to: {ROOT / 'output'}")
print("=" * 50)
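
The master scripts above log the software version at start-up. For Python, it can also be useful to record the versions of the required packages, since results may change between releases. A minimal sketch using only the standard library follows; the helper names are hypothetical, and the package list is the one from the script header.

```python
import platform
from importlib import metadata


def package_versions(packages):
    """Look up the installed version of each package, if it is installed."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


def environment_report(packages):
    """Build a short text block suitable for the top of a log file."""
    lines = [f"Python {platform.python_version()} on {platform.system()}"]
    lines += [f"{name}: {version}"
              for name, version in package_versions(packages).items()]
    return "\n".join(lines)


required = ["pandas", "numpy", "statsmodels", "matplotlib", "linearmodels"]
print(environment_report(required))
```

Printing this report right after the start banner puts it in the same log file as the results it describes.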

8.5 Documentation: The README

Every project needs a README that answers four questions: What is this project? What data does it use? What software does it require? How do I run it?

README Template:
# Project Title

## Overview
Brief description of the research question and findings.

## Data
- **Source**: Where data comes from (e.g., "CPS via IPUMS")
- **Access**: How to obtain it (include URLs or instructions)
- **Files**: List of data files and what they contain

## Software Requirements
- Stata 17 or later
- Required packages: estout, reghdfe, ftools, coefplot

## Instructions
1. Clone or download this repository
2. Open `code/00_master.do`
3. Set the root path in the "SET ROOT PATH" block near the top of the file
4. Run the master file

## File Structure
[Include folder tree here]

## Output
Running the master script will generate:
- Tables 1-5 in `output/tables/`
- Figures 1-3 in `output/figures/`

## Contact
Your Name (email@university.edu)
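
The "File Structure" section of the README can be generated rather than typed by hand, which keeps it in sync with the actual folders. Below is one possible sketch; `render_tree` is a hypothetical helper, and the demo builds a small temporary project instead of reading a real one.

```python
import tempfile
from pathlib import Path


def render_tree(folder, prefix=""):
    """Return the lines of a folder tree: folders first, then files, by name."""
    entries = sorted(Path(folder).iterdir(), key=lambda p: (p.is_file(), p.name))
    lines = []
    for index, entry in enumerate(entries):
        last = index == len(entries) - 1
        branch = "└── " if last else "├── "
        suffix = "/" if entry.is_dir() else ""
        lines.append(f"{prefix}{branch}{entry.name}{suffix}")
        if entry.is_dir():
            # Children are indented; a vertical bar continues open branches
            lines.extend(render_tree(entry, prefix + ("    " if last else "│   ")))
    return lines


# Demo on a small temporary project
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "code").mkdir()
    (root / "code" / "00_master.do").write_text("")
    (root / "data" / "raw").mkdir(parents=True)
    (root / "README.md").write_text("")
    tree = render_tree(root)

print("\n".join(tree))
```

The printed lines can be pasted under the "File Structure" heading of the template.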

8.6 Automated LaTeX Tables

Never copy-paste results into tables manually. Export publication-ready LaTeX tables directly from your statistical software.

❌ The Wrong Way (Copy-Paste):
  1. Run regression in Stata
  2. Copy numbers from output window
  3. Paste into Excel
  4. Format in Excel
  5. Copy to Word or manually type into LaTeX

Problems: error-prone, time-consuming, not reproducible, and painful to redo every time results change.

✓ The Right Way (Automated Export):
  1. Run regression in Stata
  2. Export to .tex file with one command
  3. LaTeX imports table automatically
  4. Results change? Just re-run the code!

Stata: esttab / estout

The estout package, which provides the esttab command, is a widely used tool for exporting Stata results to LaTeX.

* Install the package (once)
ssc install estout, replace

* Run your regressions and store estimates
reg earnings education age, robust
estimates store m1

reg earnings education age experience, robust
estimates store m2

reg earnings education age experience i.industry, robust
estimates store m3

* Export to LaTeX
esttab m1 m2 m3 using "$tables/main_results.tex", ///
    replace                                        ///
    label                                          /// Use variable labels
    b(3) se(3)                                     /// 3 decimal places
    star(* 0.10 ** 0.05 *** 0.01)                 /// Significance stars
    title("Effect of Education on Earnings")       ///
    mtitles("Basic" "Controls" "Industry FE")      /// Column titles
    stats(r2 N, fmt(3 %9.0fc)                      ///
          labels("R-squared" "Observations"))      /// Footer rows, each shown once
    nonotes                                        ///
    addnotes("Standard errors in parentheses."    ///
             "* p<0.10, ** p<0.05, *** p<0.01")

This produces a complete LaTeX table that looks like:

Table 1: Effect of Education on Earnings

                    (1) Basic   (2) Controls   (3) Industry FE
Education (years)   0.089***    0.074***       0.068***
                    (0.003)     (0.003)        (0.004)
Age                 0.012***    0.008***       0.007***
                    (0.001)     (0.001)        (0.001)
R-squared           0.142       0.187          0.234
Observations        45,231      45,231         45,231

Standard errors in parentheses. * p<0.10, ** p<0.05, *** p<0.01
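
The star convention in the table note is a simple threshold rule on the p-value, which export tools apply automatically. The sketch below spells that rule out; the function names are hypothetical and the p-values in the examples are made up for illustration.

```python
def significance_stars(p_value):
    """Stars for the thresholds used in the note: 0.10, 0.05, 0.01."""
    if p_value < 0.01:
        return "***"
    if p_value < 0.05:
        return "**"
    if p_value < 0.10:
        return "*"
    return ""


def format_cell(estimate, std_error, p_value, decimals=3):
    """Return the two strings printed for one coefficient: value and (SE)."""
    coefficient = f"{estimate:.{decimals}f}{significance_stars(p_value)}"
    standard_error = f"({std_error:.{decimals}f})"
    return coefficient, standard_error


# Hypothetical p-values, chosen only to show each case
print(format_cell(0.089, 0.003, 0.0004))
print(format_cell(0.02, 0.012, 0.07))
print(format_cell(0.02, 0.03, 0.50))
```

Because the stars are computed from the estimates rather than typed, they update by themselves whenever the regression is re-run.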

R: modelsummary

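library(modelsummary)
library(fixest)
library(here)

# Run regressions (df is assumed to be loaded already)
# vcov = "hetero" requests robust standard errors, matching Stata's robust option
m1 <- feols(earnings ~ education + age, data = df, vcov = "hetero")
m2 <- feols(earnings ~ education + age + experience, data = df, vcov = "hetero")
m3 <- feols(earnings ~ education + age + experience | industry, data = df, vcov = "hetero")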

# Export to LaTeX
modelsummary(
  list("Basic" = m1, "Controls" = m2, "Industry FE" = m3),
  output = here("output", "tables", "main_results.tex"),
  stars = c('*' = 0.10, '**' = 0.05, '***' = 0.01),
  coef_rename = c("education" = "Education (years)", "age" = "Age"),
  gof_omit = "AIC|BIC|Log",
  title = "Effect of Education on Earnings",
  notes = "Standard errors in parentheses."
)

Python: stargazer

from stargazer.stargazer import Stargazer
import statsmodels.api as sm

# Run regressions
X1 = sm.add_constant(df[['education', 'age']])
m1 = sm.OLS(df['earnings'], X1).fit(cov_type='HC1')

X2 = sm.add_constant(df[['education', 'age', 'experience']])
m2 = sm.OLS(df['earnings'], X2).fit(cov_type='HC1')

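# Export to LaTeX
table = Stargazer([m1, m2])   # named "table" to avoid shadowing the stargazer module
table.title("Effect of Education on Earnings")
table.custom_columns(["Basic", "Controls"], [1, 1])  # each label spans one model

# PATHS is the dictionary defined in the master script
with open(PATHS['tables'] / 'main_results.tex', 'w') as f:
    f.write(table.render_latex())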

Including Tables in LaTeX

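In your paper's .tex file, input the generated table. The wrapper below assumes the generated file contains only the tabular body; if the export already includes its own table environment (esttab adds one when title() is specified), \input the file directly without the wrapper: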

\begin{table}[htbp]
\centering
\input{../output/tables/main_results.tex}
\end{table}

8.7 One-Click Pipeline: Full Integration

The ultimate goal is a workflow where one command reproduces your entire paper—data cleaning, analysis, tables, figures, and the compiled PDF.

The Dream Workflow

📊 Raw Data → ⚙️ Master Script → 📈 Tables & Figures → 📄 Compiled PDF

One command. Fully reproducible. No manual intervention.

Option 1: Makefile (Most Powerful)

A Makefile specifies dependencies and only re-runs what's needed:

# Makefile for research project
# Run with: make all

.PHONY: all clean paper

# Default target
all: paper

# Data processing (depends on raw data)
data/processed/clean_data.dta: data/raw/cps_2019.dta code/01_import_clean.do
	stata-mp -b do code/01_import_clean.do

# Analysis (depends on processed data)
output/tables/main_results.tex: data/processed/clean_data.dta code/04_main_analysis.do
	stata-mp -b do code/04_main_analysis.do

output/figures/figure1.pdf: data/processed/clean_data.dta code/04_main_analysis.do
	stata-mp -b do code/04_main_analysis.do

# Paper (depends on tables and figures)
paper/main.pdf: paper/main.tex output/tables/main_results.tex output/figures/figure1.pdf
	cd paper && pdflatex main.tex && bibtex main && pdflatex main.tex && pdflatex main.tex

paper: paper/main.pdf

# Clean all generated files
clean:
	rm -f data/processed/*.dta
	rm -f output/tables/*.tex
	rm -f output/figures/*.pdf
	rm -f paper/*.pdf paper/*.aux paper/*.log paper/*.bbl paper/*.blg
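
The way a Makefile decides what to re-run comes down to comparing file modification times: a target is rebuilt if it is missing or older than any of its dependencies. The sketch below illustrates that rule in Python. It is a simplification of what make does, and `needs_rebuild` is a hypothetical name.

```python
import os
import tempfile
from pathlib import Path


def needs_rebuild(target, dependencies):
    """Rebuild if the target is missing or older than any dependency."""
    target = Path(target)
    if not target.exists():
        return True
    target_time = target.stat().st_mtime
    return any(Path(dep).stat().st_mtime > target_time for dep in dependencies)


# Demo with two temporary files standing in for a do-file and its table
with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "04_main_analysis.do"
    table = Path(tmp) / "main_results.tex"
    script.write_text("* analysis code")

    missing = needs_rebuild(table, [script])     # table has not been built yet

    table.write_text("% table")
    os.utime(table, (2000, 2000))                # pretend table was built at t=2000
    os.utime(script, (1000, 1000))               # script last edited at t=1000
    up_to_date = needs_rebuild(table, [script])

    os.utime(script, (3000, 3000))               # script edited after the build
    stale = needs_rebuild(table, [script])

print(missing, up_to_date, stale)
```

This is why editing `code/04_main_analysis.do` causes only the analysis and the paper to be rebuilt, while the data-cleaning step is skipped.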

Option 2: Shell Script (Simpler)

#!/bin/bash
# run_all.sh - Reproduce entire paper
set -e  # Stop immediately if any step fails

echo "Starting full pipeline..."
echo "========================="

# Run Stata analysis
echo "Step 1: Running Stata analysis..."
stata-mp -b do code/00_master.do

# Run Python visualizations (if any)
echo "Step 2: Running Python scripts..."
python code/06_visualizations.py

# Compile LaTeX paper
echo "Step 3: Compiling paper..."
cd paper
pdflatex main.tex
bibtex main
pdflatex main.tex
pdflatex main.tex

echo "========================="
echo "Done! Paper saved to paper/main.pdf"
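
A pipeline is only trustworthy if a failed step halts it; a plain sequence of shell commands continues past failures unless told otherwise. The same pipeline idea can be sketched in Python with explicit exit-code checks. The function `run_pipeline` is hypothetical, and the commands are stand-ins so that the example runs on any machine.

```python
import subprocess
import sys


def run_pipeline(steps):
    """Run (label, command) steps in order; stop at the first non-zero exit code."""
    completed = []
    for label, command in steps:
        print(f"Running: {label}")
        result = subprocess.run(command)
        if result.returncode != 0:
            print(f"Stopped: '{label}' exited with code {result.returncode}")
            break
        completed.append(label)
    return completed


# Stand-in commands; a real pipeline would list the Stata, Python, and
# LaTeX commands from run_all.sh instead
steps = [
    ("analysis", [sys.executable, "-c", "print('analysis done')"]),
    ("broken step", [sys.executable, "-c", "raise SystemExit(1)"]),
    ("paper", [sys.executable, "-c", "print('never reached')"]),
]

completed = run_pipeline(steps)
print(completed)
```

Stopping early matters because a paper compiled after a failed analysis step would silently contain stale tables.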

Option 3: Overleaf + Dropbox/GitHub Sync

For teams using Overleaf, you can sync your output folder:

  1. Link Overleaf to GitHub: Overleaf can pull from a GitHub repository
  2. Structure: Put your output/ folder in the repo
  3. Workflow:
    • Run analysis locally → tables/figures saved to output/
    • Push to GitHub
    • Pull in Overleaf → paper auto-compiles with new results

# After running your analysis
git add output/tables/*.tex output/figures/*.pdf
git commit -m "Update results"
git push

# Then, in Overleaf, pull the new commit through the GitHub sync menu and recompile

8.8 LaTeX Paper Structure

Large papers should be split into sections for easier editing and collaboration. Use a master file that inputs each section.

Project Structure

📁 paper/
├── main.tex ← Master file (compile this)
├── preamble.tex ← Packages and settings
├── sections/
│   ├── 01_introduction.tex
│   ├── 02_literature.tex
│   ├── 03_data.tex
│   ├── 04_methodology.tex
│   ├── 05_results.tex
│   ├── 06_conclusion.tex
│   └── appendix.tex
├── bibliography.bib
└── figures/ ← Or symlink to ../output/figures

Master File (main.tex)

% main.tex - Master file for paper
\documentclass[12pt]{article}

% Load preamble (packages, settings, custom commands)
\input{preamble}

\title{The Employment Effects of Minimum Wage Increases}
\author{Jane Smith\thanks{University of Example. Email: jsmith@example.edu}}
\date{\today}

\begin{document}

\maketitle

\begin{abstract}
Your abstract here...
\end{abstract}

% Main content - each section in separate file
\input{sections/01_introduction}
\input{sections/02_literature}
\input{sections/03_data}
\input{sections/04_methodology}
\input{sections/05_results}
\input{sections/06_conclusion}

% Bibliography
\bibliographystyle{aer}
\bibliography{bibliography}

% Appendix
\appendix
\input{sections/appendix}

\end{document}

Preamble File (preamble.tex)

% preamble.tex - Packages and settings

% Essential packages
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{amsmath,amssymb}
\usepackage{graphicx}
\usepackage{booktabs}        % Professional tables
\usepackage{natbib}          % Citations
\usepackage{hyperref}        % Clickable links
\usepackage{setspace}        % Line spacing
\usepackage[margin=1in]{geometry}

% For including Stata/R output tables
\usepackage{threeparttable}  % Table notes
\usepackage{tabularx}        % Flexible columns
\usepackage{siunitx}         % Number alignment

% Path to figures and tables (relative to main.tex)
\graphicspath{{../output/figures/}{figures/}}

% Custom commands for convenience
\newcommand{\tabpath}{../output/tables/}

% Double spacing for submission
\doublespacing

Including Generated Tables

% In sections/05_results.tex

\section{Results}

Table \ref{tab:main} presents our main results...

\begin{table}[htbp]
\centering
\caption{Effect of Minimum Wage on Employment}
\label{tab:main}
\input{\tabpath main_results.tex}
\end{table}

The coefficient on minimum wage in column (3) suggests that...

Compiling the Paper

# From the paper/ directory
cd paper

# Full compilation with bibliography
pdflatex main.tex       # First pass
bibtex main             # Process citations
pdflatex main.tex       # Resolve references
pdflatex main.tex       # Final pass

# Or use latexmk (handles everything automatically)
latexmk -pdf main.tex
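
The repeated pdflatex passes exist to resolve cross-references and citations, and LaTeX reports any that remain unresolved in the log file. A small script can scan the log for these warnings after compilation. The sketch below assumes the standard warning wording; the sample log text and the citation key are made up for illustration.

```python
import re

# Matches LaTeX warnings about references or citations that stayed undefined
UNRESOLVED = re.compile(
    r"Warning: (Reference|Citation) `([^']+)' on page \d+ undefined"
)


def unresolved_keys(log_text):
    """Return sorted (kind, key) pairs for unresolved references and citations."""
    return sorted({(kind.lower(), key) for kind, key in UNRESOLVED.findall(log_text)})


# Hypothetical excerpt from paper/main.log
sample_log = (
    "LaTeX Warning: Reference `tab:main' on page 3 undefined on input line 12.\n"
    "LaTeX Warning: Citation `card1994' on page 1 undefined on input line 40.\n"
    "Output written on main.pdf (24 pages).\n"
)

problems = unresolved_keys(sample_log)
print(problems)
```

Real logs wrap long lines, so a warning split across two lines could be missed by this pattern; treat an empty result as a good sign rather than a guarantee.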

VS Code + LaTeX Workshop: Install the LaTeX Workshop extension in VS Code for:
  • Auto-compilation on save
  • PDF preview side-by-side
  • Syntax highlighting and snippets
  • Forward/inverse search (click in PDF → go to code)

8.9 Replicating Published Papers

Where to find replication materials:

Steps to replicate:

  1. Find and download replication package
  2. Read the paper and README carefully
  3. Install required software/packages
  4. Run the master script
  5. Compare output to published tables
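
Step 5 can be made systematic: a published coefficient is rounded, so a replicated value matches if it rounds to the same printed number. The following sketch checks this within half a unit of the last printed digit. The function name and the replicated values are hypothetical; the published values are the illustrative ones from Table 1 above.

```python
def compare_to_published(replicated, published, decimals=3):
    """Return entries whose replicated value does not round to the printed one."""
    tolerance = 0.5 * 10 ** -decimals      # half a unit in the last printed digit
    mismatches = {}
    for name, printed in published.items():
        value = replicated.get(name)
        # The small constant guards against floating-point noise at the boundary
        if value is None or abs(value - printed) > tolerance + 1e-12:
            mismatches[name] = (value, printed)
    return mismatches


published = {"education": 0.089, "age": 0.012}          # as printed in the table
replicated = {"education": 0.08913, "age": 0.01361}     # hypothetical re-run output

mismatches = compare_to_published(replicated, published)
print(mismatches)
```

Here the education coefficient matches after rounding, while the age coefficient is flagged for a closer look. Small discrepancies can come from software versions or sample restrictions, so a mismatch is a prompt to investigate rather than proof of an error.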

Essential Reading

Gentzkow, M. & Shapiro, J. (2014). "Code and Data for the Social Sciences: A Practitioner's Guide."