8  Replicability & Reproducibility

~6 hours · Research Best Practices · Intermediate

Learning Objectives

  • Understand the difference between replicability and reproducibility
  • Organize research projects with clear folder structures
  • Write documentation that enables others to run your code
  • Create publication-ready replication packages
  • Successfully replicate published economics papers

8.1 Definitions

Concept          Definition                                      Question
Reproducibility  Same data + same code = same results            Can I re-run this and get identical output?
Replicability    New data + same methods = consistent findings   Does this hold in a different sample?
Why This Matters: Major economics journals (AER, QJE, Econometrica) now require replication packages. A 2019 study found that only 37% of published papers could be successfully reproduced. Good practices from day one save enormous time later and increase your credibility as a researcher.
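Even with the same data and the same code, results can differ across runs if any step involves randomness (bootstraps, simulations, random train/test splits). A minimal Python illustration of why fixing a random seed matters for exact reproducibility (the seed value and function name are illustrative):

```python
import random

def bootstrap_mean(data, n_draws, seed):
    """Resample with replacement and return the mean of the resample."""
    rng = random.Random(seed)  # dedicated generator: no hidden global state
    draws = [rng.choice(data) for _ in range(n_draws)]
    return sum(draws) / n_draws

data = [2.0, 4.0, 6.0, 8.0]

# Same seed -> bit-for-bit identical result on every run
assert bootstrap_mean(data, 100, seed=42) == bootstrap_mean(data, 100, seed=42)
```

Without the fixed seed, two runs of the same script would produce slightly different bootstrap estimates, and a reviewer could never match your reported numbers exactly.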

8.2 Project Organization

A well-organized project folder is the foundation of reproducible research. Here's what a complete research project should look like:

📁 minimum-wage-study/
├── 📄 README.md              ← Start here! Instructions for everything
├── 📄 LICENSE                ← How others can use your code
├── 📁 data/
│   ├── 📁 raw/               ← NEVER modify these files!
│   │   ├── cps_2019.dta
│   │   └── state_minimum_wages.csv
│   ├── 📁 processed/         ← Generated by code
│   └── 📄 data_dictionary.md ← Variable definitions
├── 📁 code/
│   ├── 00_master.do          ← Run THIS to reproduce everything
│   ├── 01_import_clean.do
│   ├── 02_merge_construct.do
│   ├── 03_descriptive_stats.do
│   ├── 04_main_analysis.do
│   └── 05_robustness.do
├── 📁 output/
│   ├── 📁 tables/            ← .tex files for LaTeX
│   ├── 📁 figures/           ← .pdf or .png graphs
│   └── 📁 logs/              ← Stata log files
└── 📁 paper/
    ├── main.tex              ← Master LaTeX file
    ├── sections/
    └── bibliography.bib

Key Principles

  • Raw data is sacred: Never modify original files—all cleaning happens in code
  • One master script: Runs entire analysis end-to-end with a single click
  • Relative paths only: Never use absolute paths like C:\Users\...
  • Number your scripts: Makes execution order obvious (00, 01, 02...)
  • Version everything: Use Git (see Module 9)
  • Separate outputs: Tables, figures, and logs in their own folders
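These conventions can even be enforced in code. A small Python sketch (`check_structure` is a hypothetical helper, not a standard tool; the folder names follow the tree above) that a master script could run before anything else:

```python
from pathlib import Path
import tempfile

# Subfolders every project is expected to contain (matching the tree above)
EXPECTED = ["data/raw", "data/processed", "code",
            "output/tables", "output/figures", "output/logs", "paper"]

def check_structure(root):
    """Return the list of expected subfolders missing under root."""
    root = Path(root)
    return [sub for sub in EXPECTED if not (root / sub).is_dir()]

# Demo: build the structure in a scratch folder, then verify nothing is missing
with tempfile.TemporaryDirectory() as tmp:
    for sub in EXPECTED:
        (Path(tmp) / sub).mkdir(parents=True)
    assert check_structure(tmp) == []
```

Failing fast with a clear "missing folder" message is friendlier to a replicator than a cryptic "file not found" error halfway through the pipeline.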

8.3 Understanding File Paths

File paths tell your computer where to find files. Understanding them is essential for writing code that works on any machine.

Absolute vs. Relative Paths

Type           Example                                              Verdict
Absolute path  C:\Users\Maria\Projects\thesis\data\raw\survey.dta   ❌ Never use in code
Relative path  data/raw/survey.dta                                  ✓ Always use this

An absolute path specifies the complete location from the root of the file system — it starts with / on Mac/Linux (e.g., /Users/giulia/Projects/thesis/) or a drive letter on Windows (e.g., C:\Users\...). A relative path starts from your current location (the "working directory") and contains no root prefix — think of it like giving directions from where you already are: "go to the data folder, then the raw subfolder, then find survey.dta."

Common misconception: The type of slash (/ vs. \) does not determine whether a path is absolute or relative. Both /Users/giulia/data.csv (forward slash) and C:\Users\Giulia\data.csv (backslash) are absolute paths; both data/raw/survey.csv and data\raw\survey.csv are relative paths. What matters is whether the path starts from the root of the file system or from the current working directory — not which slash separates the folders.
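You can verify this distinction directly with Python's pathlib, which can classify paths under either convention regardless of the machine you run it on:

```python
from pathlib import PurePosixPath, PureWindowsPath

# Absolute: starts from the root of the file system
assert PurePosixPath("/Users/giulia/data.csv").is_absolute()       # forward slashes
assert PureWindowsPath(r"C:\Users\Giulia\data.csv").is_absolute()  # backslashes

# Relative: starts from the current working directory
assert not PurePosixPath("data/raw/survey.csv").is_absolute()
assert not PureWindowsPath(r"data\raw\survey.csv").is_absolute()
```

Both slash styles appear on both sides of the absolute/relative divide — only the presence of a root (or drive letter) matters.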

Path Syntax: Mac/Linux vs. Windows

# Mac/Linux use forward slashes /
/Users/giulia/Projects/thesis/          # Absolute path
data/raw/survey.csv                      # Relative path

# Navigate up one folder with ..
../other_project/data.csv               # Go up, then into other_project

# Home directory shortcut
~/Projects/thesis/                       # ~ means your home folder

# Windows traditionally uses backslashes \
C:\Users\Giulia\Projects\thesis\        # Absolute path
data\raw\survey.csv                      # Relative path

# But forward slashes work in most modern software!
data/raw/survey.csv                      # This works in Stata, Python, R

# Navigate up one folder with ..
..\other_project\data.csv               # Go up, then into other_project

Pro Tip: Use forward slashes (/) everywhere, even on Windows. Stata, Python, and R all understand them, making your code cross-platform compatible.

Setting the Working Directory

The working directory is your "home base"—all relative paths start from here. Set it once at the beginning of your master script.

Stata:

* Check current working directory
pwd

* Set working directory (do this ONCE in master script)
cd "/Users/giulia/Projects/minimum-wage-study"

* Now all paths are relative to this location
use "data/raw/cps_2019.dta", clear
export delimited "data/processed/clean_data.csv"
R:

# Check current working directory
getwd()

# Set working directory
setwd("/Users/giulia/Projects/minimum-wage-study")

# Better: use here package (finds project root automatically)
library(here)
data <- read_csv(here("data", "raw", "cps_2019.csv"))

# Or use RStudio Projects (.Rproj files)
Python:

import os
from pathlib import Path

# Check current working directory
print(os.getcwd())

# Set working directory
os.chdir("/Users/giulia/Projects/minimum-wage-study")

# Better: use pathlib for cross-platform paths
project_root = Path(__file__).parent.parent  # Go up from code/ folder
data_path = project_root / "data" / "raw" / "cps_2019.csv"
Stata Output

. pwd
/Users/giulia/Projects/minimum-wage-study

. cd "/Users/giulia/Projects/minimum-wage-study"
/Users/giulia/Projects/minimum-wage-study

. use "data/raw/cps_2019.dta", clear
(45,231 observations read)

. export delimited "data/processed/clean_data.csv"
file data/processed/clean_data.csv saved
R Output

[1] "/Users/giulia/Projects/minimum-wage-study"
here() starts at /Users/giulia/Projects/minimum-wage-study
Rows: 45231 Columns: 12
── Column specification ─────────────────────────────────
Delimiter: ","
chr (3): state, occupation, industry
dbl (9): year, age, education, experience, income, ...
ℹ Use `spec()` to retrieve the full column specification
Python Output

/Users/giulia/Projects/minimum-wage-study
# project_root: PosixPath('/Users/giulia/Projects/minimum-wage-study')
# data_path: PosixPath('/Users/giulia/Projects/minimum-wage-study/data/raw/cps_2019.csv')

8.4 The Master Script

The master script is the single entry point that reproduces your entire analysis. A reviewer or co-author should be able to run this one file and regenerate all results.

What to look for in the code below:
  • Clear header with project info and instructions
  • One location to set the root path (the only absolute path in the entire project)
  • Global macros for all folder paths
  • Sequential execution of all sub-scripts
  • Timestamps and logging
Stata version (00_master.do):

/*==============================================================================
                    MASTER DO-FILE: Minimum Wage and Employment

    Paper:      "The Employment Effects of Minimum Wage Increases"
    Authors:    Smith, J. and Jones, M.
    Version:    1.0
    Date:       January 2025

    INSTRUCTIONS:
    1. Set the root path below to your project folder location
    2. Ensure all required packages are installed (see below)
    3. Run this entire file to reproduce all results

    REQUIRED PACKAGES: estout, reghdfe, ftools, coefplot
    Install with: ssc install estout, replace
================================================================================*/

clear all
set more off
cap log close

*------------------------------------------------------------------------------*
*  SET ROOT PATH - MODIFY THIS LINE ONLY
*------------------------------------------------------------------------------*
* Users should change this path to their own project folder location
if "`c(username)'" == "giulia" {
    global root "/Users/giulia/Projects/minimum-wage-study"
}
else if "`c(username)'" == "john" {
    global root "C:/Users/john/Research/minimum-wage-study"
}
else {
    * Default: assume current directory is project root
    global root "`c(pwd)'"
}

*------------------------------------------------------------------------------*
*  DEFINE FOLDER PATHS (do not modify)
*------------------------------------------------------------------------------*
global data     "$root/data"
global raw      "$data/raw"
global processed "$data/processed"
global code     "$root/code"
global output   "$root/output"
global tables   "$output/tables"
global figures  "$output/figures"
global logs     "$output/logs"

*------------------------------------------------------------------------------*
*  CREATE FOLDERS IF THEY DON'T EXIST
*------------------------------------------------------------------------------*
cap mkdir "$processed"
cap mkdir "$output"
cap mkdir "$tables"
cap mkdir "$figures"
cap mkdir "$logs"

*------------------------------------------------------------------------------*
*  START LOG FILE
*------------------------------------------------------------------------------*
local datetime: di %tcCCYY-NN-DD-HH-MM-SS Clock("`c(current_date)' `c(current_time)'","DMYhms")
log using "$logs/master_log_`datetime'.txt", text replace

di "=============================================="
di "  MASTER SCRIPT STARTED"
di "  Date/Time: `c(current_date)' `c(current_time)'"
di "  Stata version: `c(stata_version)'"
di "  Root directory: $root"
di "=============================================="

*------------------------------------------------------------------------------*
*  RUN ANALYSIS SCRIPTS IN ORDER
*------------------------------------------------------------------------------*
di _n ">>> Running 01_import_clean.do..."
do "$code/01_import_clean.do"

di _n ">>> Running 02_merge_construct.do..."
do "$code/02_merge_construct.do"

di _n ">>> Running 03_descriptive_stats.do..."
do "$code/03_descriptive_stats.do"

di _n ">>> Running 04_main_analysis.do..."
do "$code/04_main_analysis.do"

di _n ">>> Running 05_robustness.do..."
do "$code/05_robustness.do"

*------------------------------------------------------------------------------*
*  COMPLETION
*------------------------------------------------------------------------------*
di _n "=============================================="
di "  MASTER SCRIPT COMPLETED SUCCESSFULLY"
di "  End time: `c(current_date)' `c(current_time)'"
di "  All outputs saved to: $output"
di "=============================================="

log close
R version (00_master.R):

#===============================================================================
#                   MASTER SCRIPT: Minimum Wage and Employment
#
#   Paper:      "The Employment Effects of Minimum Wage Increases"
#   Authors:    Smith, J. and Jones, M.
#   Version:    1.0
#   Date:       January 2025
#
#   INSTRUCTIONS:
#   1. Open the .Rproj file in RStudio (sets working directory automatically)
#   2. Run this entire script to reproduce all results
#
#   REQUIRED PACKAGES: tidyverse, fixest, modelsummary, here
#===============================================================================

# Clear environment
rm(list = ls())

# Load packages (install if needed)
required_packages <- c("tidyverse", "fixest", "modelsummary", "here", "haven")
new_packages <- required_packages[!(required_packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)

library(tidyverse)
library(fixest)
library(modelsummary)
library(here)
library(haven)

#------------------------------------------------------------------------------
#  DEFINE PATHS (using here package - no manual path setting needed!)
#------------------------------------------------------------------------------
paths <- list(
  raw       = here("data", "raw"),
  processed = here("data", "processed"),
  tables    = here("output", "tables"),
  figures   = here("output", "figures"),
  logs      = here("output", "logs")
)

# Create directories if they don't exist
walk(paths, ~dir.create(.x, recursive = TRUE, showWarnings = FALSE))

#------------------------------------------------------------------------------
#  START LOG
#------------------------------------------------------------------------------
sink(file.path(paths$logs, paste0("master_log_", Sys.Date(), ".txt")))
cat("==============================================\n")
cat("  MASTER SCRIPT STARTED\n")
cat("  Date/Time:", format(Sys.time()), "\n")
cat("  R version:", R.version.string, "\n")
cat("  Project root:", here(), "\n")
cat("==============================================\n\n")

#------------------------------------------------------------------------------
#  RUN ANALYSIS SCRIPTS IN ORDER
#------------------------------------------------------------------------------
cat("\n>>> Running 01_import_clean.R...\n")
source(here("code", "01_import_clean.R"))

cat("\n>>> Running 02_merge_construct.R...\n")
source(here("code", "02_merge_construct.R"))

cat("\n>>> Running 03_descriptive_stats.R...\n")
source(here("code", "03_descriptive_stats.R"))

cat("\n>>> Running 04_main_analysis.R...\n")
source(here("code", "04_main_analysis.R"))

cat("\n>>> Running 05_robustness.R...\n")
source(here("code", "05_robustness.R"))

#------------------------------------------------------------------------------
#  COMPLETION
#------------------------------------------------------------------------------
cat("\n==============================================\n")
cat("  MASTER SCRIPT COMPLETED SUCCESSFULLY\n")
cat("  End time:", format(Sys.time()), "\n")
cat("  All outputs saved to:", here("output"), "\n")
cat("==============================================\n")
sink()
"""
================================================================================
                    MASTER SCRIPT: Minimum Wage and Employment

    Paper:      "The Employment Effects of Minimum Wage Increases"
    Authors:    Smith, J. and Jones, M.
    Version:    1.0
    Date:       January 2025

    INSTRUCTIONS:
    1. Navigate to project folder in terminal
    2. Run: python code/00_master.py

    REQUIRED PACKAGES: pandas, numpy, statsmodels, matplotlib, linearmodels
    Install with: pip install -r requirements.txt
================================================================================
"""

import os
import sys
from pathlib import Path
from datetime import datetime
import subprocess

#------------------------------------------------------------------------------
#  DEFINE PATHS
#------------------------------------------------------------------------------
# Get project root (parent of code folder)
ROOT = Path(__file__).resolve().parent.parent

PATHS = {
    'raw': ROOT / 'data' / 'raw',
    'processed': ROOT / 'data' / 'processed',
    'tables': ROOT / 'output' / 'tables',
    'figures': ROOT / 'output' / 'figures',
    'logs': ROOT / 'output' / 'logs',
    'code': ROOT / 'code'
}

# Create directories if they don't exist
for path in PATHS.values():
    path.mkdir(parents=True, exist_ok=True)

#------------------------------------------------------------------------------
#  LOGGING SETUP
#------------------------------------------------------------------------------
log_file = PATHS['logs'] / f"master_log_{datetime.now().strftime('%Y-%m-%d_%H-%M-%S')}.txt"

class Logger:
    def __init__(self, filename):
        self.terminal = sys.stdout
        self.log = open(filename, 'w')

    def write(self, message):
        self.terminal.write(message)
        self.log.write(message)

    def flush(self):
        self.terminal.flush()
        self.log.flush()

sys.stdout = Logger(log_file)

#------------------------------------------------------------------------------
#  START
#------------------------------------------------------------------------------
print("=" * 50)
print("  MASTER SCRIPT STARTED")
print(f"  Date/Time: {datetime.now()}")
print(f"  Python version: {sys.version}")
print(f"  Project root: {ROOT}")
print("=" * 50)

#------------------------------------------------------------------------------
#  RUN ANALYSIS SCRIPTS IN ORDER
#------------------------------------------------------------------------------
scripts = [
    '01_import_clean.py',
    '02_merge_construct.py',
    '03_descriptive_stats.py',
    '04_main_analysis.py',
    '05_robustness.py'
]

for script in scripts:
    print(f"\n>>> Running {script}...")
    script_path = PATHS['code'] / script
    # Run each script in a fresh interpreter so scripts cannot clobber each
    # other's state; check=True aborts the pipeline on the first error
    result = subprocess.run([sys.executable, str(script_path)],
                            check=True, capture_output=True, text=True)
    print(result.stdout)  # forward the script's output through our Logger

#------------------------------------------------------------------------------
#  COMPLETION
#------------------------------------------------------------------------------
print("\n" + "=" * 50)
print("  MASTER SCRIPT COMPLETED SUCCESSFULLY")
print(f"  End time: {datetime.now()}")
print(f"  All outputs saved to: {ROOT / 'output'}")
print("=" * 50)
Stata Output

. log using "$output/logs/master_log_2025-01-27.txt", replace text
      name:  <unnamed>
       log:  /Users/giulia/Projects/minimum-wage-study/output/logs/master_log_2025-01-27.txt
  log type:  text

==============================================
  MASTER SCRIPT STARTED
  Date/Time: 27 Jan 2025 14:32:15
  Stata version: 17.0
  Root directory: /Users/giulia/Projects/minimum-wage-study
==============================================

>>> Running 01_import_clean.do...
(45,231 observations read)
Variable education recoded to years
(23 missing values generated)

>>> Running 02_merge_construct.do...
(file merged using state_fips)
Result: 45,231 matched, 0 unmatched

>>> Running 03_descriptive_stats.do...
Table 1 saved to output/tables/summary_stats.tex
Figure 1 saved to output/figures/wage_distribution.pdf

>>> Running 04_main_analysis.do...
Table 2 saved to output/tables/main_results.tex
Table 3 saved to output/tables/heterogeneity.tex

>>> Running 05_robustness.do...
Table A1 saved to output/tables/robustness.tex

==============================================
  MASTER SCRIPT COMPLETED SUCCESSFULLY
  End time: 27 Jan 2025 14:35:42
  All outputs saved to: /Users/giulia/Projects/minimum-wage-study/output
==============================================
(log closed)
R Output

==============================================
  MASTER SCRIPT STARTED
  Date/Time: 2025-01-27 14:32:15
  R version: R version 4.3.2 (2023-10-31)
  Project root: /Users/giulia/Projects/minimum-wage-study
==============================================

>>> Running 01_import_clean.R...
── Importing raw data ──
Rows: 45231 Columns: 12
Cleaning complete: 45,208 observations retained

>>> Running 02_merge_construct.R...
Merging state-level data...
Join: 45,208 x 45,208 (all matched)

>>> Running 03_descriptive_stats.R...
Summary statistics saved to output/tables/summary_stats.tex
Wage distribution plot saved to output/figures/wage_distribution.pdf

>>> Running 04_main_analysis.R...
Main regression results saved to output/tables/main_results.tex
Heterogeneity analysis saved to output/tables/heterogeneity.tex

>>> Running 05_robustness.R...
Robustness checks saved to output/tables/robustness.tex

==============================================
  MASTER SCRIPT COMPLETED SUCCESSFULLY
  End time: 2025-01-27 14:35:42
  All outputs saved to: /Users/giulia/Projects/minimum-wage-study/output
==============================================
Python Output

==================================================
  MASTER SCRIPT STARTED
  Date/Time: 2025-01-27 14:32:15.234567
  Python version: 3.11.5 (main, Aug 24 2023)
  Project root: /Users/giulia/Projects/minimum-wage-study
==================================================

>>> Running 01_import_clean.py...
Loading raw data... done (45,231 rows)
Cleaning data... done (45,208 rows retained)

>>> Running 02_merge_construct.py...
Merging state data... done
Final dataset: 45,208 observations

>>> Running 03_descriptive_stats.py...
Summary statistics exported to output/tables/summary_stats.tex
Figure saved to output/figures/wage_distribution.pdf

>>> Running 04_main_analysis.py...
Running main regressions...
Results exported to output/tables/main_results.tex

>>> Running 05_robustness.py...
Running robustness checks...
Results exported to output/tables/robustness.tex

==================================================
  MASTER SCRIPT COMPLETED SUCCESSFULLY
  End time: 2025-01-27 14:35:42.891234
  All outputs saved to: /Users/giulia/Projects/minimum-wage-study/output
==================================================

8.5 Documentation: The README

Every project needs a README answering: What is this? What data? What software? How to run?

README Template:
# Project Title

## Overview
Brief description of the research question and findings.

## Data
- **Source**: Where data comes from (e.g., "CPS via IPUMS")
- **Access**: How to obtain it (include URLs or instructions)
- **Files**: List of data files and what they contain

## Software Requirements
- Stata 17 or later
- Required packages: estout, reghdfe, ftools

## Instructions
1. Clone or download this repository
2. Open `code/00_master.do`
3. Set the root path on line 25 to your project folder
4. Run the master file

## File Structure
[Include folder tree here]

## Output
Running the master script will generate:
- Tables 1-5 in `output/tables/`
- Figures 1-3 in `output/figures/`

## Contact
Your Name (email@university.edu)

8.6 Automated LaTeX Tables

Never copy-paste results into tables manually. Export publication-ready LaTeX tables directly from your statistical software.

❌ The Wrong Way (Copy-Paste):
  1. Run regression in Stata
  2. Copy numbers from output window
  3. Paste into Excel
  4. Format in Excel
  5. Copy to Word or manually type into LaTeX

Problems: Error-prone, time-consuming, not reproducible, nightmare when results change

✓ The Right Way (Automated Export):
  1. Run regression in Stata
  2. Export to .tex file with one command
  3. LaTeX imports table automatically
  4. Results change? Just re-run the code!

Stata: esttab / estout

The estout package is the gold standard for exporting Stata results to LaTeX.

* Install the package (once)
ssc install estout, replace

* Run your regressions and store estimates
reg earnings education age, robust
estimates store m1

reg earnings education age experience, robust
estimates store m2

reg earnings education age experience i.industry, robust
estimates store m3

* Export to LaTeX
esttab m1 m2 m3 using "$tables/main_results.tex", ///
    replace                                        ///
    label                                          /// Use variable labels
    b(3) se(3)                                     /// 3 decimal places
    star(* 0.10 ** 0.05 *** 0.01)                  /// Significance stars
    title("Effect of Education on Earnings")       ///
    mtitles("Basic" "Controls" "Industry FE")      /// Column titles
    scalars("r2 R-squared" "N Observations")       ///
    nonotes                                        ///
    addnotes("Standard errors in parentheses."     ///
             "* p<0.10, ** p<0.05, *** p<0.01")

This produces a complete LaTeX table that looks like:

Table 1: Effect of Education on Earnings

                    (1) Basic   (2) Controls   (3) Industry FE
Education (years)    0.089***     0.074***       0.068***
                    (0.003)      (0.003)        (0.004)
Age                  0.012***     0.008***       0.007***
                    (0.001)      (0.001)        (0.001)
R-squared            0.142        0.187          0.234
Observations         45,231       45,231         45,231

Standard errors in parentheses. * p<0.10, ** p<0.05, *** p<0.01

R: modelsummary

library(modelsummary)
library(fixest)

# Run regressions
m1 <- feols(earnings ~ education + age, data = df)
m2 <- feols(earnings ~ education + age + experience, data = df)
m3 <- feols(earnings ~ education + age + experience | industry, data = df)

# Export to LaTeX
modelsummary(
  list("Basic" = m1, "Controls" = m2, "Industry FE" = m3),
  output = here("output", "tables", "main_results.tex"),
  stars = c('*' = 0.10, '**' = 0.05, '***' = 0.01),
  coef_rename = c("education" = "Education (years)", "age" = "Age"),
  gof_omit = "AIC|BIC|Log",
  title = "Effect of Education on Earnings",
  notes = "Standard errors in parentheses."
)

modelsummary produces the same table structure as the Stata example above. Here is what the rendered output looks like:

Table 1: Effect of Education on Earnings

                    (1) Basic   (2) Controls   (3) Industry FE
Education (years)    0.089***     0.074***       0.068***
                    (0.003)      (0.003)        (0.004)
Age                  0.012***     0.008***       0.007***
                    (0.001)      (0.001)        (0.001)
R-squared            0.142        0.187          0.234
Observations         45,231       45,231         45,231

Standard errors in parentheses. * p<0.10, ** p<0.05, *** p<0.01

Python: stargazer

from stargazer.stargazer import Stargazer
import statsmodels.api as sm
import pandas as pd

# Run regressions
X1 = sm.add_constant(df[['education', 'age']])
m1 = sm.OLS(df['earnings'], X1).fit(cov_type='HC1')

X2 = sm.add_constant(df[['education', 'age', 'experience']])
m2 = sm.OLS(df['earnings'], X2).fit(cov_type='HC1')

# Third model: add industry fixed effects via dummies
industry_dummies = pd.get_dummies(df['industry'], drop_first=True, dtype=int)
X3 = sm.add_constant(pd.concat([df[['education', 'age', 'experience']], industry_dummies], axis=1))
m3 = sm.OLS(df['earnings'], X3).fit(cov_type='HC1')

# Export to LaTeX
stargazer = Stargazer([m1, m2, m3])
stargazer.title("Effect of Education on Earnings")
stargazer.custom_columns(["Basic", "Controls", "Industry FE"])

with open(PATHS['tables'] / 'main_results.tex', 'w') as f:
    f.write(stargazer.render_latex())

Python's stargazer produces a similar regression table. Here is what the rendered output looks like:

Table 1: Effect of Education on Earnings

                    (1) Basic   (2) Controls   (3) Industry FE
Education (years)    0.089***     0.074***       0.068***
                    (0.003)      (0.003)        (0.004)
Age                  0.012***     0.008***       0.007***
                    (0.001)      (0.001)        (0.001)
R-squared            0.142        0.187          0.234
Observations         45,231       45,231         45,231

Standard errors in parentheses. * p<0.10, ** p<0.05, *** p<0.01

Including Tables in LaTeX

In your paper's .tex file, simply input the generated table:

\begin{table}[htbp]
\centering
\input{../output/tables/main_results.tex}
\end{table}

What If You Use Word Instead of LaTeX?

Many students write their theses in Word rather than LaTeX, and that is perfectly fine. The important principle is not which word processor you use, but that you never copy-paste numbers manually from your statistical output into your document. All three languages can export tables directly to Word-friendly formats.

R:

# modelsummary can export directly to a Word document
modelsummary(
  list("Basic" = m1, "Controls" = m2, "Industry FE" = m3),
  output = here("output", "tables", "main_results.docx"),
  stars = c('*' = 0.10, '**' = 0.05, '***' = 0.01)
)
Stata:

* esttab can export to RTF format, which Word opens natively
esttab m1 m2 m3 using "$tables/main_results.rtf", ///
    replace label b(3) se(3)                       ///
    star(* 0.10 ** 0.05 *** 0.01)                 ///
    mtitles("Basic" "Controls" "Industry FE")
Python:

# Option 1: Save as HTML, then open in Word
with open(PATHS['tables'] / 'main_results.html', 'w') as f:
    f.write(stargazer.render_html())

# Option 2: For simpler tables, export via pandas
summary_df.to_excel(PATHS['tables'] / 'summary_stats.xlsx', index=False)

Tip: Word tables will not auto-update the way LaTeX \input{} does. However, the workflow is still far better than copy-pasting: whenever your results change, simply re-run your script to regenerate the Word/RTF/HTML file, then replace the old table in your document. This takes seconds and eliminates transcription errors.

8.7 One-Click Pipeline: Full Integration

The ultimate goal is a workflow where one command reproduces your entire paper—data cleaning, analysis, tables, figures, and the compiled PDF.

The Dream Workflow

📊 Raw Data  →  ⚙️ Master Script  →  📈 Tables & Figures  →  📄 Compiled PDF

One command. Fully reproducible. No manual intervention.

Option 1: Makefile (Most Powerful)

A Makefile is a special file that tells your computer how to build your project, step by step. Think of a Makefile as a recipe book that knows which ingredients are already prepared. If you have already cleaned your data and nothing has changed since then, make will skip that step and jump straight to the analysis. This saves enormous amounts of time on large projects where a full pipeline might take hours to run.

Researchers use Makefiles because they capture the dependency structure of a project. Each "rule" in the Makefile says: "To produce this output file, I need these input files, and here is the command to run." When you type make all in your terminal, the tool inspects which output files are missing or older than their inputs and re-runs only those steps. Everything else is left untouched.

What is a Makefile, exactly?

Makefiles are not written in any programming language like Python or Stata -- they use their own simple declarative syntax understood by the make utility. Each rule declares an output, its inputs, and a shell command to run. The commands inside each rule (like stata-mp -b do ...) are regular shell/bash commands -- the Makefile just orchestrates when they execute. You type make all in your terminal, from the project's root directory (wherever the Makefile is saved).

On Mac/Linux, make is usually pre-installed. Windows users can get it through Git Bash, WSL (Windows Subsystem for Linux), or by installing GNU Make via Chocolatey.

As you read the code below, pay attention to the pattern on each rule: the output file appears before the colon, the input files appear after the colon, and the command is indented on the next line. Notice how paper/main.pdf depends on both the LaTeX source and the generated tables and figures -- so the paper will only recompile when something it uses has actually changed.

# Makefile for research project
# Run with: make all

.PHONY: all clean paper

# Default target
all: paper

# Data processing (depends on raw data)
data/processed/clean_data.dta: data/raw/cps_2019.dta code/01_import_clean.do
	stata-mp -b do code/01_import_clean.do

# Analysis (depends on processed data)
output/tables/main_results.tex: data/processed/clean_data.dta code/04_main_analysis.do
	stata-mp -b do code/04_main_analysis.do

output/figures/figure1.pdf: data/processed/clean_data.dta code/04_main_analysis.do
	stata-mp -b do code/04_main_analysis.do

# Paper (depends on tables and figures)
paper/main.pdf: paper/main.tex output/tables/main_results.tex output/figures/figure1.pdf
	cd paper && pdflatex main.tex && bibtex main && pdflatex main.tex && pdflatex main.tex

paper: paper/main.pdf

# Clean all generated files
clean:
	rm -f data/processed/*.dta
	rm -f output/tables/*.tex
	rm -f output/figures/*.pdf
	rm -f paper/*.pdf paper/*.aux paper/*.log paper/*.bbl paper/*.blg
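make's "skip what hasn't changed" behaviour boils down to comparing file modification times. A minimal Python sketch of the rule make applies to each target (an illustration of the idea, not make's actual implementation):

```python
import os
import pathlib
import tempfile

def needs_rebuild(target, inputs):
    """A target must be rebuilt if it is missing, or older than any input."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(src) > target_mtime for src in inputs)

# Scratch-file demo: set modification times explicitly with os.utime
with tempfile.TemporaryDirectory() as tmp:
    src = pathlib.Path(tmp) / "clean_data.dta"
    out = pathlib.Path(tmp) / "main_results.tex"
    src.write_text("data")
    out.write_text("table")
    os.utime(src, (0, 100))              # input last modified at t=100
    os.utime(out, (0, 200))              # target built later, at t=200
    assert not needs_rebuild(out, [src])  # up to date: make would skip it
    os.utime(src, (0, 300))              # input changed after the build
    assert needs_rebuild(out, [src])      # stale: make would re-run the rule
```

This is why the Makefile lists inputs after each colon: those are exactly the files whose timestamps get compared against the target's.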

Option 2: Shell Script (Simpler)

If a Makefile is a smart recipe book, a shell script is a simple checklist: it runs every step from top to bottom, every time, regardless of what has changed. This makes shell scripts easier to write and understand, which is why they are a perfectly good choice for small projects or when your full pipeline runs in a few minutes anyway. You create a file (typically called run_all.sh), list your commands in order, and execute it with bash run_all.sh in your terminal, from the project's root directory (where the script is saved). The downside is that it will redo work even when nothing has changed, but for many student projects that trade-off is fine.

#!/bin/bash
# run_all.sh - Reproduce entire paper

echo "Starting full pipeline..."
echo "========================="

# Run Stata analysis
echo "Step 1: Running Stata analysis..."
stata-mp -b do code/00_master.do

# Run Python visualizations (if any)
echo "Step 2: Running Python scripts..."
python code/06_visualizations.py

# Compile LaTeX paper
echo "Step 3: Compiling paper..."
cd paper
pdflatex main.tex
bibtex main
pdflatex main.tex
pdflatex main.tex

echo "========================="
echo "Done! Paper saved to paper/main.pdf"

Option 3: Overleaf + GitHub Sync

Options 1 and 2 handle the analysis pipeline -- running your code and generating outputs (tables, figures). But if your team writes the paper on Overleaf (a cloud-based LaTeX editor), you also need a way to get those generated tables and figures into your Overleaf project. That is what this option addresses: connecting your local pipeline's outputs to Overleaf via GitHub.

How to connect Overleaf and GitHub

  1. Create a GitHub repository for your project (or use an existing one). Make sure your output/ folder with generated tables and figures is committed to this repo.
  2. In Overleaf, open your project and go to Menu (top-left corner) → GitHub → Link to GitHub Repository.
  3. Paste your GitHub repo URL and authorize Overleaf to access it.
  4. Push and pull between Overleaf and GitHub: in Overleaf, use Menu → GitHub → Push/Pull to sync changes in either direction.
  5. Workflow: run your analysis locally (via make all or bash run_all.sh) → outputs saved to output/ folder → git push to GitHub → in Overleaf, pull from GitHub → paper recompiles with new tables and figures.

After running your pipeline, push the updated outputs to GitHub so Overleaf can pull them in. For a complete guide to Git commands like git add, git commit, and git push, see Module 9.

# After running your analysis pipeline
git add output/tables/*.tex output/figures/*.pdf
git commit -m "Update results"
git push

# Then in Overleaf: Menu → GitHub → Pull from GitHub
Key Takeaways from Section 8.7
  • One command, full paper. The goal of a reproducible pipeline is that a single command (like make all or bash run_all.sh) rebuilds everything from raw data to compiled PDF.
  • Makefiles are smart. They track dependencies between files and only re-run steps whose inputs have changed. Best for large or slow-running projects.
  • Shell scripts are simple. They run every step from top to bottom every time. A good starting point for smaller projects where the full pipeline finishes quickly.
  • Version control ties it together. Committing your outputs to Git (and optionally syncing with Overleaf) ensures that every version of your paper corresponds to a traceable set of results.
  • No manual steps. If your workflow requires you to "remember" to copy a file, rename an output, or run things in a particular order by hand, it is not yet fully reproducible. Automate it.

8.8 LaTeX Paper Structure

Large papers should be split into sections for easier editing and collaboration. Use a master file that inputs each section.

Project Structure

📁 paper/
├── main.tex ← Master file (compile this)
├── preamble.tex ← Packages and settings
├── sections/
│   ├── 01_introduction.tex
│   ├── 02_literature.tex
│   ├── 03_data.tex
│   ├── 04_methodology.tex
│   ├── 05_results.tex
│   ├── 06_conclusion.tex
│   └── appendix.tex
├── bibliography.bib
└── figures/ ← Or symlink to ../output/figures

Master File (main.tex)

Why split a LaTeX paper into multiple files instead of writing everything in one long document? For the same reason you split code into separate scripts: it keeps things manageable. Each section lives in its own .tex file, and the master file (main.tex) pulls them together using \input{} commands. The \input{} command simply pastes the contents of another file at that point -- think of it as an #include or import statement. This means you can work on the methodology section without scrolling past 20 pages of literature review. It also makes collaboration easier: two co-authors can edit different section files without creating merge conflicts.

Notice that the preamble (all the package imports and settings) is kept in a separate file called preamble.tex. This is good practice because preamble code rarely changes and would otherwise clutter the top of your main file. Keeping it separate lets you focus on the document structure when you open main.tex.

% main.tex - Master file for paper
\documentclass[12pt]{article}

% Load preamble (packages, settings, custom commands)
\input{preamble}

\title{The Employment Effects of Minimum Wage Increases}
\author{Jane Smith\thanks{University of Example. Email: jsmith@example.edu}}
\date{\today}

\begin{document}

\maketitle

\begin{abstract}
Your abstract here...
\end{abstract}

% Main content - each section in separate file
\input{sections/01_introduction}
\input{sections/02_literature}
\input{sections/03_data}
\input{sections/04_methodology}
\input{sections/05_results}
\input{sections/06_conclusion}

% Bibliography
\bibliographystyle{aer}
\bibliography{bibliography}

% Appendix
\appendix
\input{sections/appendix}

\end{document}
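To make the division of labor concrete, here is a sketch of what one section file might contain. Note there is no preamble and no \begin{document} -- those live only in main.tex (the text and citation key below are placeholders):

```latex
% sections/01_introduction.tex -- content only; no preamble, no \begin{document}
\section{Introduction}
\label{sec:intro}

The employment effects of minimum wage increases remain contested
\citep{card1994}. This paper revisits the question using ...
```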

Preamble File (preamble.tex)

% preamble.tex - Packages and settings

% Essential packages
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{amsmath,amssymb}
\usepackage{graphicx}
\usepackage{booktabs}        % Professional tables
\usepackage{natbib}          % Citations
\usepackage{hyperref}        % Clickable links
\usepackage{setspace}        % Line spacing
\usepackage[margin=1in]{geometry}

% For including Stata/R output tables
\usepackage{threeparttable}  % Table notes
\usepackage{tabularx}        % Flexible columns
\usepackage{siunitx}         % Number alignment

% Path to figures and tables (relative to main.tex)
\graphicspath{{../output/figures/}{figures/}}

% Custom commands for convenience
\newcommand{\tabpath}{../output/tables/}

% Double spacing for submission
\doublespacing

Including Generated Tables

% In sections/05_results.tex

\section{Results}

Table \ref{tab:main} presents our main results...

\begin{table}[htbp]
\centering
\caption{Effect of Minimum Wage on Employment}
\label{tab:main}
\input{\tabpath main_results.tex}
\end{table}

The coefficient on minimum wage in column (3) suggests that...
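The preamble loads threeparttable for exactly this situation: attaching notes to a table. A sketch of a results table with notes (the file name robustness.tex and the note text are illustrative):

```latex
% A table with notes via threeparttable
\begin{table}[htbp]
\centering
\begin{threeparttable}
\caption{Robustness Checks}
\label{tab:robust}
\input{\tabpath robustness.tex}
\begin{tablenotes}[flushleft]
\footnotesize
\item Notes: Standard errors clustered by state in parentheses.
\item *** $p<0.01$, ** $p<0.05$, * $p<0.1$.
\end{tablenotes}
\end{threeparttable}
\end{table}
```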

Compiling the Paper

A common surprise for beginners: compiling a LaTeX paper with a bibliography requires running pdflatex multiple times, not just once. Here is why. On the first pass, pdflatex reads your document and notes where you cite references and cross-reference tables or figures, but it does not know the final page numbers or citation details yet -- it writes those questions to auxiliary files. Then bibtex reads those auxiliary files, looks up your .bib bibliography database, and generates the formatted reference list. The second pdflatex pass incorporates the bibliography and resolves most cross-references. A third pass is needed to finalize any remaining references (like "see Table 3 on page 12") that shifted when the bibliography was inserted. If you only run pdflatex once, you will see question marks (??) wherever a citation or cross-reference should appear. The shortcut latexmk -pdf handles all of this automatically by running as many passes as needed.

# From the paper/ directory
cd paper

# Full compilation with bibliography
pdflatex main.tex       # First pass
bibtex main             # Process citations
pdflatex main.tex       # Resolve references
pdflatex main.tex       # Final pass

# Or use latexmk (handles everything automatically)
latexmk -pdf main.tex
VS Code + LaTeX Workshop: Install the LaTeX Workshop extension in VS Code for:
  • Auto-compilation on save
  • PDF preview side-by-side
  • Syntax highlighting and snippets
  • Forward/inverse search (click in PDF → go to code)

8.9 Replicating Published Papers

Where to find replication materials:

  • AEA Data and Code Repository on openICPSR -- packages for the AER and other AEA journals
  • Harvard Dataverse -- replication data for many economics journals
  • Journal websites -- supplementary materials linked from the article page
  • Authors' personal or institutional websites

Steps to replicate:

  1. Find and download replication package
  2. Read the paper and README carefully
  3. Install required software/packages
  4. Run the master script
  5. Compare output to published tables
Essential Reading

Gentzkow, M. & Shapiro, J. (2014). "Code and Data for the Social Sciences: A Practitioner's Guide"