ProTools ER1
Programming Tools for Empirical Research, Part 1 — A comprehensive course designed to bring PhD students in economics and quantitative researchers up to speed with both the classical toolset for estimation and empirical data analysis, and the newest platforms and tools transforming research today.
I am Giulia Caprini, Assistant Professor of Economics at Sciences Po. I designed and created this course to address a gap I observed in graduate training: students often learn statistical methods in theory but lack hands-on experience with the programming tools used in actual research.
This course was originally developed for Master's students at Sciences Po for the Spring Semester 2026, but the materials are designed to be useful for anyone seeking to strengthen their empirical research toolkit.
Course Philosophy
I take a trilingual approach: every concept is demonstrated in Python, Stata, and R simultaneously. My goal is not to make you an expert in all three languages, but to give you enough fluency to:
- Choose the right tool for each task
- Read and replicate code in any of these languages
- Translate between languages when collaborating
- Follow research norms for version control, documentation, and replicability
How to Use This Course
This course is designed to be interactive. Rather than passively reading, you'll actively engage with the material through several built-in features. Try them out below!
Code Explanations on Hover
Throughout the code examples, certain parts have a subtle dotted underline. Hover over them to see explanations of what the code does and why.
```python
# Calculate average income by education level
import pandas as pd

df = pd.read_csv("survey_data.csv")
avg_income = df.groupby("education")["income"].mean()
print(avg_income)
```
Language Tabs
Most code examples include tabs for Python, Stata, and R. Click any tab to see the same operation in a different language—useful for comparing syntax or finding the language that feels most natural to you.
```python
# Load data and run regression
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("data.csv")
X = sm.add_constant(df[["education", "experience"]])
model = sm.OLS(df["wage"], X).fit()
print(model.summary())
```
```stata
* Load data and run regression
use "data.dta", clear
regress wage education experience
```
```r
# Load data and run regression
df <- read.csv("data.csv")
model <- lm(wage ~ education + experience, data = df)
summary(model)
```
Copy Code Button
Every code block has a copy button in the top-right corner. Click it to copy the code to your clipboard, then paste it directly into your Python script, Stata do-file, or R console.
```python
# Quick summary statistics
import pandas as pd

df = pd.read_csv("your_data.csv")
print(df.describe())
```
Quizzes & Exercises
Many modules include interactive quizzes with instant feedback, coding exercises where you write and check your own solutions, and hidden solution dropdowns when you get stuck.
AI Teaching Assistant
Look for the chat icon in the bottom-right corner of every page. This AI assistant is trained on the course material and can help you understand concepts, debug code, or work through exercises.
How It Works
Your question, together with context from the course material, is sent to a large language model (LLM), which generates a response.
This assistant is powered by an LLM, which means its responses may be inaccurate, outdated, or incomplete. Always verify important information.
Your questions and the chatbot's answers are logged to help improve the course materials and identify common points of confusion. However, the chatbot is a shared resource with usage limits — please don't abuse it! Use it for genuine questions about the course content, not for unrelated queries or excessive testing.
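The context-plus-question pattern described above can be sketched in Python. The prompt format, function name, and example strings below are illustrative assumptions, not the assistant's actual implementation:

```python
# Hypothetical sketch of how a context-aware assistant assembles its request.
# The prompt wording and variable names are made up for illustration.

def build_prompt(course_context: str, question: str) -> str:
    """Combine course material with the student's question into one prompt."""
    return (
        "You are a teaching assistant for ProTools ER1.\n"
        "Course material:\n"
        f"{course_context}\n\n"
        f"Student question: {question}\n"
        "Answer using only the material above when possible."
    )

prompt = build_prompt(
    course_context="Module 6 covers matching, DiD, RDD, IV, and synthetic control.",
    question="What is difference-in-differences?",
)
print(prompt)
```

In a real pipeline, this assembled prompt would then be sent to an LLM provider's API; the point here is only the structure of the request.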
Don't just read the code — run it yourself. Copy the examples into your own environment, modify them, break them, and fix them. Active experimentation is the fastest path to fluency.
Two-Part Course Structure
This first part focuses on the foundational skills every empirical researcher needs:
- Programming foundations: Python, Stata, and R taught in parallel — data import, cleaning, exploration, and visualization
- Causal inference methods: Matching, Difference-in-Differences, Regression Discontinuity, Instrumental Variables, Synthetic Control
- Estimation techniques: Standard errors, panel data methods, MLE/GMM
- Research best practices: Version control with Git & GitHub, replicability standards, project organization
- Machine learning & NLP: Introduction to ML for economists, history of NLP, understanding Large Language Models
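As a small taste of the causal inference material listed above, here is a minimal sketch of the 2x2 difference-in-differences logic on simulated data. The group labels, effect sizes, and sample sizes are invented for illustration and use only the Python standard library:

```python
# Toy 2x2 difference-in-differences on simulated data (stdlib only).
# True parameters: baseline 10, group effect +2, time trend +1, treatment effect +3.
import random

random.seed(42)

def simulate(group: str, period: str, n: int = 1000) -> list:
    """Draw n outcomes for one group-period cell with Gaussian noise."""
    effect = 3 if (group == "treated" and period == "post") else 0
    base = 10 + (2 if group == "treated" else 0) + (1 if period == "post" else 0)
    return [base + effect + random.gauss(0, 1) for _ in range(n)]

def mean(xs: list) -> float:
    return sum(xs) / len(xs)

# Average outcome in each of the four cells
cells = {(g, p): mean(simulate(g, p))
         for g in ("treated", "control") for p in ("pre", "post")}

# DiD: change for the treated minus change for the control group
did = (cells[("treated", "post")] - cells[("treated", "pre")]) \
    - (cells[("control", "post")] - cells[("control", "pre")])
print(round(did, 2))  # close to the true treatment effect of 3
```

The course develops this same idea with real data (Card & Krueger's minimum wage study) and regression-based estimation in all three languages.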
The second part of the course focuses on AI tools and cutting-edge skills for modern research:
- Image & text analysis with LLMs: Extracting structured data from documents, PDFs, and images using AI
- Local LLM deployment: Setting up LM Studio, running open-source models locally on your machine
- AI coding assistants: Using Claude Code from Claude Desktop and VS Code for research workflows
- Hugging Face ecosystem: Accessing models, datasets, and Spaces for research applications
- API & inference services: Buying credits, managing API keys, choosing between providers (OpenAI, Anthropic, etc.)
- Building AI-powered tools: Research websites, data dashboards, and automated pipelines
- Prompt engineering: Advanced techniques for getting reliable outputs from LLMs
- RAG & fine-tuning: Retrieval-augmented generation and model customization for domain-specific tasks
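The retrieval-augmented generation idea in the last bullet can be previewed with a toy example. The keyword-overlap retriever and the tiny document set below are stand-ins for a real embedding-based retriever and an actual LLM call:

```python
# Toy retrieval-augmented generation (RAG) loop. Real systems retrieve by
# embedding similarity and pass the result to an LLM; here we only show the
# retrieve-then-assemble pattern with invented example documents.
documents = {
    "did": "Difference-in-Differences compares changes over time across groups.",
    "rdd": "Regression Discontinuity exploits a cutoff in a running variable.",
    "iv": "Instrumental Variables use exogenous variation to identify effects.",
}

def retrieve(question: str, docs: dict) -> str:
    """Return the document sharing the most words with the question."""
    q_words = set(question.lower().split())
    return max(docs.values(),
               key=lambda d: len(q_words & set(d.lower().split())))

def answer(question: str) -> str:
    context = retrieve(question, documents)
    # In a real pipeline, this assembled text would be sent to an LLM API.
    return f"Context: {context}\nQuestion: {question}"

print(answer("What does Regression Discontinuity exploit?"))
```

Part 2 replaces each of these stand-ins with production tools: vector embeddings for retrieval and hosted or local models for generation.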
A link to Part 2 will be provided when it becomes available.
Course Overview: Part 1
I. Foundations
| Module | Title | Topics |
|---|---|---|
| Module 0 | Languages & Platforms | Python vs. Stata vs. R, RStudio, VS Code, Google Colab, AI tools |
| Module 1 | Getting Started | Installation, basic syntax, packages |
| Module 2 | Data Harnessing | File import, APIs, web scraping |
| Module 3 | Data Exploration | Descriptive stats, visualization, EDA |
| Module 4 | Data Cleaning | Missing values, outliers, strings, dates |
| Module 5 | Data Analysis | Merging, reshaping, aggregation, data simulation |
II. Causal Inference & Estimation
| Module | Title | Topics |
|---|---|---|
| Module 6 | Causal Inference | Matching, DiD, RDD, IV, synthetic control, experiments |
| Module 7 | Estimation Methods | Standard errors, panel data, nonlinear models, MLE/GMM |
III. Research Best Practices
| Module | Title | Topics |
|---|---|---|
| Module 8 | Replicability | Project organization, documentation, replication packages |
| Module 9 | Git & GitHub | Version control, collaboration, branching, pull requests |
IV. Modern AI & ML
| Module | Title | Topics |
|---|---|---|
| Module 10 | History of NLP | From ELIZA to Transformers |
| Module 11 | Machine Learning | Prediction, regularization, trees, neural nets |
| Module 12 | Large Language Models | How LLMs work, prompting, APIs, limitations |
Who Is This For?
I designed this course primarily for:
- PhD students in economics learning empirical methods for their research
- Master's students preparing for research careers or industry positions
- Quantitative social scientists (political science, sociology, public policy)
- Research assistants working with faculty on empirical projects
- Anyone with a quantitative background who wants to learn modern research tools
Prerequisites
This course assumes:
- Basic familiarity with statistics (mean, variance, regression concepts)
- Some exposure to econometrics (helpful but not strictly required)
- No prior programming experience (I start from the basics)
- Willingness to learn and experiment!
Datasets Used
Throughout the course, I use freely available, well-documented datasets:
- Gapminder — Life expectancy, GDP, and population by country
- World Development Indicators — World Bank development data via API
- Current Population Survey (CPS) — Labor force statistics
- Card & Krueger (1994) — Minimum wage study for DiD
- Lee (2008) — Close elections data for RDD
- NSW Job Training — LaLonde (1986) data for matching
Acknowledgments
This course draws inspiration from many excellent resources. I am particularly indebted to:
- Scott Cunningham — Causal Inference: The Mixtape (Yale University Press, 2021)
- Nick Huntington-Klein — The Effect (Chapman & Hall, 2022)
- Angrist & Pischke — Mostly Harmless Econometrics (Princeton University Press, 2009)
- Hadley Wickham — R for Data Science (O'Reilly)
- Wes McKinney — Python for Data Analysis (O'Reilly)
Course Assistant
I have integrated an AI assistant into this course to help you as you work through the materials. Click the chat icon in the bottom right corner to ask questions about the content, get help with code, or clarify concepts. Your conversations are logged to help me improve the course materials.
ProTools ER1 by Giulia Caprini (Sciences Po) | Spring 2026
All rights reserved. Course materials may not be copied, distributed, or reproduced without permission.