ProTools ER1
Programming Tools for Empirical Research, Part 1 — A comprehensive course designed to bring PhD students in economics and quantitative researchers up to speed with both the classical toolset for estimation and empirical data analysis, and the newest platforms and tools transforming research today.
I am Giulia Caprini, Assistant Professor of Economics at Sciences Po. I designed and created this course to address a gap I observed in graduate training: students often learn statistical methods in theory but lack hands-on experience with the programming tools used in actual research.
This course was originally developed for Master's students at Sciences Po for the Spring Semester 2026, but the materials are designed to be useful for anyone seeking to strengthen their empirical research toolkit.
Course Philosophy
I take a trilingual approach: every concept is demonstrated in Python, Stata, and R simultaneously. My goal is not to make you an expert in all three languages, but to give you enough fluency to:
- Choose the right tool for each task
- Read and replicate code in any of these languages
- Translate between languages when collaborating
- Follow research norms for version control, documentation, and replicability
How to Use This Course
This course is designed to be interactive. Rather than passively reading, you'll actively engage with the material through several built-in features. Try them out below!
Code Explanations on Hover
Throughout the code examples, certain parts have a subtle dotted underline. Hover over them to see explanations of what the code does and why.
```python
# Calculate average income by education level
import pandas as pd

df = pd.read_csv("survey_data.csv")
avg_income = df.groupby("education")["income"].mean()
print(avg_income)
```
Language Tabs
Most code examples include tabs for Python, Stata, and R. Click any tab to see the same operation in a different language—useful for comparing syntax or finding the language that feels most natural to you.
```python
# Load data and run regression
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("data.csv")
X = sm.add_constant(df[["education", "experience"]])
model = sm.OLS(df["wage"], X).fit()
print(model.summary())
```
```stata
* Load data and run regression
use "data.dta", clear
regress wage education experience
```
```r
# Load data and run regression
df <- read.csv("data.csv")
model <- lm(wage ~ education + experience, data = df)
summary(model)
```
Copy Code Button
Every code block has a copy button in the top-right corner. Click it to copy the code to your clipboard, then paste it directly into your Python script, Stata do-file, or R console.
```python
# Quick summary statistics
import pandas as pd

df = pd.read_csv("your_data.csv")
print(df.describe())
```
Quizzes & Exercises
Many modules include interactive quizzes with instant feedback, coding exercises where you write and check your own solutions, and hidden solution dropdowns when you get stuck.
AI Teaching Assistant
Look for the chat icon in the bottom-right corner of every page. This AI assistant is trained on the course material and can help you understand concepts, debug code, or work through exercises.
How It Works
Your question, together with context from the course material, is sent to a large language model (LLM), which generates a response.
This assistant is powered by an LLM, which means its responses may be inaccurate, outdated, or incomplete. Always verify important information.
Your questions and the chatbot's answers are logged to help improve the course materials and identify common points of confusion. However, the chatbot is a shared resource with usage limits — please don't abuse it! Use it for genuine questions about the course content, not for unrelated queries or excessive testing.
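The context-plus-question pattern described above can be sketched in Python. The prompt format, function name, and example strings below are illustrative assumptions, not the assistant's actual implementation:

```python
# Hypothetical sketch of how a context-aware assistant assembles its request.
# The prompt wording and variable names are made up for illustration.

def build_prompt(course_context: str, question: str) -> str:
    """Combine course material with the student's question into one prompt."""
    return (
        "You are a teaching assistant for ProTools ER1.\n"
        "Course material:\n"
        f"{course_context}\n\n"
        f"Student question: {question}\n"
        "Answer using only the material above when possible."
    )

prompt = build_prompt(
    course_context="Module 6 covers matching, DiD, RDD, IV, and synthetic control.",
    question="What is difference-in-differences?",
)
print(prompt)
```

In a real pipeline, this assembled prompt would then be sent to an LLM provider's API; the point here is only the structure of the request.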
Don't just read the code — run it yourself. Copy the examples into your own environment, modify them, break them, and fix them. Active experimentation is the fastest path to fluency.
Two-Part Course Structure
This first part focuses on the foundational skills every empirical researcher needs:
- Programming foundations: Python, Stata, and R taught in parallel — data import, cleaning, exploration, and visualization
- Causal inference methods: Matching, Difference-in-Differences, Regression Discontinuity, Instrumental Variables, Synthetic Control
- Estimation techniques: Standard errors, panel data methods, MLE/GMM
- Research best practices: Version control with Git & GitHub, replicability standards, project organization
- Machine learning & NLP: Introduction to ML for economists, history of NLP, understanding Large Language Models
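As a small taste of the causal inference material listed above, here is a minimal sketch of the 2x2 difference-in-differences logic on simulated data. The group labels, effect sizes, and sample sizes are invented for illustration and use only the Python standard library:

```python
# Toy 2x2 difference-in-differences on simulated data (stdlib only).
# True parameters: baseline 10, group effect +2, time trend +1, treatment effect +3.
import random

random.seed(42)

def simulate(group: str, period: str, n: int = 1000) -> list:
    """Draw n outcomes for one group-period cell with Gaussian noise."""
    effect = 3 if (group == "treated" and period == "post") else 0
    base = 10 + (2 if group == "treated" else 0) + (1 if period == "post" else 0)
    return [base + effect + random.gauss(0, 1) for _ in range(n)]

def mean(xs: list) -> float:
    return sum(xs) / len(xs)

# Average outcome in each of the four cells
cells = {(g, p): mean(simulate(g, p))
         for g in ("treated", "control") for p in ("pre", "post")}

# DiD: change for the treated minus change for the control group
did = (cells[("treated", "post")] - cells[("treated", "pre")]) \
    - (cells[("control", "post")] - cells[("control", "pre")])
print(round(did, 2))  # close to the true treatment effect of 3
```

The course develops this same idea with real data (Card & Krueger's minimum wage study) and regression-based estimation in all three languages.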
The second part of the course focuses on AI tools and cutting-edge skills for modern research:
- Image & text analysis with LLMs: Extracting structured data from documents, PDFs, and images using AI
- Local LLM deployment: Setting up LM Studio, running open-source models locally on your machine
- AI coding assistants: Using Claude Code from Claude Desktop and VS Code for research workflows
- Hugging Face ecosystem: Accessing models, datasets, and Spaces for research applications
- API & inference services: Buying credits, managing API keys, choosing between providers (OpenAI, Anthropic, etc.)
- Building AI-powered tools: Research websites, data dashboards, and automated pipelines
- Prompt engineering: Advanced techniques for getting reliable outputs from LLMs
- RAG & fine-tuning: Retrieval-augmented generation and model customization for domain-specific tasks
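The retrieval-augmented generation idea in the last bullet can be previewed with a toy example. The keyword-overlap retriever and the tiny document set below are stand-ins for a real embedding-based retriever and an actual LLM call:

```python
# Toy retrieval-augmented generation (RAG) loop. Real systems retrieve by
# embedding similarity and pass the result to an LLM; here we only show the
# retrieve-then-assemble pattern with invented example documents.
documents = {
    "did": "Difference-in-Differences compares changes over time across groups.",
    "rdd": "Regression Discontinuity exploits a cutoff in a running variable.",
    "iv": "Instrumental Variables use exogenous variation to identify effects.",
}

def retrieve(question: str, docs: dict) -> str:
    """Return the document sharing the most words with the question."""
    q_words = set(question.lower().split())
    return max(docs.values(),
               key=lambda d: len(q_words & set(d.lower().split())))

def answer(question: str) -> str:
    context = retrieve(question, documents)
    # In a real pipeline, this assembled text would be sent to an LLM API.
    return f"Context: {context}\nQuestion: {question}"

print(answer("What does Regression Discontinuity exploit?"))
```

Part 2 replaces each of these stand-ins with production tools: vector embeddings for retrieval and hosted or local models for generation.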
A link to Part 2 will be provided when it becomes available.
Course Overview: Part 1
I. Foundations
| Module | Title | Topics |
|---|---|---|
| Module 0 | Languages & Platforms | Python vs. Stata vs. R, RStudio, VS Code, Google Colab, AI tools |
| Module 1 | Getting Started | Installation, basic syntax, packages |
| Module 2 | Data Harnessing | File import, APIs, web scraping |
| Module 3 | Data Exploration | Descriptive stats, visualization, EDA |
| Module 4 | Data Cleaning | Missing values, outliers, strings, dates |
| Module 5 | Data Analysis | Merging, reshaping, aggregation, data simulation |
II. Causal Inference & Estimation
| Module | Title | Topics |
|---|---|---|
| Module 6 | Causal Inference | Matching, DiD, RDD, IV, synthetic control, experiments |
| Module 7 | Estimation Methods | Standard errors, panel data, nonlinear models, MLE/GMM |
III. Research Best Practices
| Module | Title | Topics |
|---|---|---|
| Module 8 | Replicability | Project organization, documentation, replication packages |
| Module 9 | Git & GitHub | Version control, collaboration, branching, pull requests |
IV. Modern AI & ML
| Module | Title | Topics |
|---|---|---|
| Module 10 | History of NLP | From ELIZA to Transformers |
| Module 11 | Machine Learning | Prediction, regularization, trees, neural nets |
| Module 12 | Large Language Models | How LLMs work, prompting, APIs, limitations |
Who Is This For?
I designed this course primarily for:
- PhD students in economics learning empirical methods for their research
- Master's students preparing for research careers or industry positions
- Quantitative social scientists (political science, sociology, public policy)
- Research assistants working with faculty on empirical projects
- Anyone with a quantitative background who wants to learn modern research tools
Prerequisites
This course assumes:
- Basic familiarity with statistics (mean, variance, regression concepts)
- Some exposure to econometrics (helpful but not strictly required)
- No prior programming experience (I start from the basics)
- Willingness to learn and experiment!
Datasets Used
Throughout the course, I use freely available, well-documented datasets:
- Gapminder — Life expectancy, GDP, and population by country
- World Development Indicators — World Bank development data via API
- Current Population Survey (CPS) — Labor force statistics
- Card & Krueger (1994) — Minimum wage study for DiD
- Lee (2008) — Close elections data for RDD
- NSW Job Training — LaLonde (1986) data for matching
Acknowledgments
This course draws inspiration from many excellent resources. I am particularly indebted to:
- Scott Cunningham — Causal Inference: The Mixtape (Yale University Press, 2021)
- Nick Huntington-Klein — The Effect (Chapman & Hall, 2022)
- Angrist & Pischke — Mostly Harmless Econometrics (Princeton University Press, 2009)
- Hadley Wickham — R for Data Science (O'Reilly)
- Wes McKinney — Python for Data Analysis (O'Reilly)
Course Assistant
I have integrated an AI assistant into this course to help you as you work through the materials. Click the chat icon in the bottom right corner to ask questions about the content, get help with code, or clarify concepts. Your conversations are logged to help me improve the course materials.
ProTools ER1 by Giulia Caprini (Sciences Po) | Spring 2026
All rights reserved. Course materials may not be copied, distributed, or reproduced without permission.