1 Getting Started
Learning Objectives
- Install and configure Python, Stata, and R environments
- Understand the differences between the three languages
- Master basic syntax: variables, data types, and operations
- Install and load packages/libraries
- Write and execute your first data analysis script
1.1 Why Learn Three Languages?
You might wonder why I teach Python, Stata, and R simultaneously rather than focusing on just one. The answer is practical: different fields and organizations have standardized on different tools, and being fluent in multiple languages makes you a more versatile researcher.
Learning all three gives you flexibility in your career and allows you to collaborate with researchers across disciplines. You'll also find that concepts learned in one language transfer easily to others.
If you haven't already set up your programming environment, please refer to Module 0: Languages & Platforms for detailed installation guides for each language and IDE. That module covers:
- VS Code for Python development
- Stata interface and setup
- RStudio for R programming
- Jupyter Notebooks for interactive coding
1.2 Basic Syntax Comparison
Let's start by comparing basic syntax across all three languages. Understanding these fundamentals will make everything else easier.
Assigning Variables
Variables store values that you can reference and manipulate throughout your code. Here's how to create variables in each language:
# Python: Variables and basic types
# Numeric types
age = 25 # Integer
income = 55000.50 # Float
# String
name = "Alice" # String (text)
# Boolean
is_employed = True # Boolean (True/False)
# Lists (arrays)
scores = [85, 92, 78, 95] # List of numbers
# Dictionary (key-value pairs)
person = {
    "name": "Alice",
    "age": 25,
    "city": "Boston"
}
# Print values
print(f"Name: {name}, Age: {age}")
print(f"Average score: {sum(scores)/len(scores)}")
* Stata: Variables and basic types
* Stata works with datasets, not individual variables
* First, let's create some local macros (temporary variables)
local age = 25
local income = 55000.50
local name "Alice"
* Display values
display "Name: `name', Age: `age'"
* Global macros persist across programs
global project_name "My Analysis"
display "$project_name"
* Create a small dataset
clear
input score
85
92
78
95
end
* Calculate mean
summarize score
display "Average score: " r(mean)
# R: Variables and basic types
# Numeric types
age <- 25L # Integer (the L suffix makes it an integer; plain 25 is stored as a double)
income <- 55000.50 # Numeric
# Character (string)
name <- "Alice" # Character
# Logical (boolean)
is_employed <- TRUE # Logical
# Vectors (R's primary data structure)
scores <- c(85, 92, 78, 95) # Numeric vector
# List (like Python dictionary)
person <- list(
name = "Alice",
age = 25,
city = "Boston"
)
# Print values
cat("Name:", name, ", Age:", age, "\n")
cat("Average score:", mean(scores), "\n")
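If you are ever unsure which type a value has, Python's built-in type() will tell you. The following is a minimal sketch that reuses the variable names from the Python example above; the loop over (label, value) pairs is just one convenient way to print them all.

```python
# Recreate the variables from the Python example above
age = 25
income = 55000.50
name = "Alice"
is_employed = True
scores = [85, 92, 78, 95]
person = {"name": "Alice", "age": 25, "city": "Boston"}

# Print the type of each variable
for label, value in [("age", age), ("income", income), ("name", name),
                     ("is_employed", is_employed), ("scores", scores),
                     ("person", person)]:
    print(f"{label}: {type(value).__name__}")

# Lists are accessed by position, dictionaries by key
first_score = scores[0]
city = person["city"]
average_score = sum(scores) / len(scores)
```

In R, class() plays a similar role to type(); Stata's describe command reports the storage type of each variable in the dataset.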
Basic Arithmetic Operations
All three languages support standard mathematical operations. Let's see how they compare:
# Python: Arithmetic operations
a = 10
b = 3
print(f"Addition: {a + b}") # 13
print(f"Subtraction: {a - b}") # 7
print(f"Multiplication: {a * b}") # 30
print(f"Division: {a / b}") # 3.333...
print(f"Integer division: {a // b}") # 3
print(f"Modulo: {a % b}") # 1
print(f"Power: {a ** b}") # 1000
# Comparison operators
print(f"a > b: {a > b}") # True
print(f"a == b: {a == b}") # False
print(f"a != b: {a != b}") # True
* Stata: Arithmetic operations
local a = 10
local b = 3
display "Addition: " `a' + `b' // 13
display "Subtraction: " `a' - `b' // 7
display "Multiplication: " `a' * `b' // 30
display "Division: " `a' / `b' // 3.333...
display "Integer division: " floor(`a'/`b') // 3
display "Modulo: " mod(`a', `b') // 1
display "Power: " `a'^`b' // 1000
* Comparison operators (return 1 for true, 0 for false)
display "a > b: " (`a' > `b') // 1
display "a == b: " (`a' == `b') // 0
display "a != b: " (`a' != `b') // 1
# R: Arithmetic operations
a <- 10
b <- 3
cat("Addition:", a + b, "\n") # 13
cat("Subtraction:", a - b, "\n") # 7
cat("Multiplication:", a * b, "\n") # 30
cat("Division:", a / b, "\n") # 3.333...
cat("Integer division:", a %/% b, "\n") # 3
cat("Modulo:", a %% b, "\n") # 1
cat("Power:", a ^ b, "\n") # 1000
# Comparison operators
cat("a > b:", a > b, "\n") # TRUE
cat("a == b:", a == b, "\n") # FALSE
cat("a != b:", a != b, "\n") # TRUE
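Integer division and modulo are two halves of the same operation: the quotient and the remainder. As a small sketch using the same a and b as above, Python's built-in divmod() returns both at once, and the two pieces always recombine to the original number.

```python
a = 10
b = 3

# divmod(a, b) returns the pair (a // b, a % b)
quotient, remainder = divmod(a, b)
print(f"{a} = {quotient} * {b} + {remainder}")

# The quotient and remainder recombine to the original value
reconstructed = quotient * b + remainder
```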
1.3 Package Management
One of the most powerful features of all three languages is their extensibility through packages (also called libraries or modules). Packages provide pre-written functions that save you from reinventing the wheel.
Essential Packages for ProTools ER1
| Purpose | Python | Stata | R |
|---|---|---|---|
| Data manipulation | pandas | Built-in | dplyr, tidyr |
| Visualization | matplotlib, seaborn | Built-in | ggplot2 |
| Statistics | scipy, statsmodels | Built-in | Built-in, broom |
| Causal inference | causalinference, econml | rdrobust, did | fixest, rdrobust |
Installing and Loading Packages
# Python: Installing and importing packages
# Install from command line/terminal:
# pip install pandas numpy matplotlib seaborn
# Import packages in your script
import pandas as pd # Data manipulation
import numpy as np # Numerical computing
import matplotlib.pyplot as plt # Plotting
import seaborn as sns # Statistical visualization
# Import specific functions
from scipy import stats
from statsmodels.formula.api import ols
# Check package version
print(f"Pandas version: {pd.__version__}")
* Stata: Installing and using packages
* Most Stata functionality is built-in
* User-written packages come from SSC (Statistical Software Components)
* Install a package from SSC
ssc install rdrobust // RD estimation
ssc install outreg2 // Export regression tables
ssc install estout // More regression output options
* Install from GitHub (using net install)
net install did, from("https://raw.githubusercontent.com/bcallaway11/did/master/stata/") replace
* Update all installed packages
adoupdate, update
* Search for packages
search difference-in-differences
* Check installed packages
ado dir
# R: Installing and loading packages
# Install packages (only need to do once)
install.packages("tidyverse") # Includes dplyr, ggplot2, tidyr, etc.
install.packages("fixest") # Fast fixed effects estimation
install.packages("rdrobust") # Regression discontinuity
# Load packages (do every session)
library(tidyverse) # Data manipulation & viz
library(fixest) # Econometrics
library(haven) # Read Stata/SPSS/SAS files
# Install from GitHub
# install.packages("devtools")
# devtools::install_github("author/package")
# Check package version
packageVersion("tidyverse")
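Before running a script, it can help to confirm that the packages it needs are actually installed. The sketch below uses only Python's standard library (importlib); package_status is a hypothetical helper name, and the check assumes the import name matches the name used with pip, which holds for the five packages listed here but not for every package.

```python
from importlib import metadata, util

def package_status(name):
    """Hypothetical helper: installed version of `name`, or None if missing."""
    if util.find_spec(name) is None:
        return None  # Python cannot find the package at all
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "unknown"  # importable, but no version metadata available

for pkg in ["pandas", "numpy", "matplotlib", "seaborn", "statsmodels"]:
    version = package_status(pkg)
    if version is None:
        print(f"{pkg}: not installed (try: pip install {pkg})")
    else:
        print(f"{pkg}: {version}")
```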
1.4 Essential Syntax Patterns
Before writing full scripts, let's cover several syntax patterns that appear constantly in research code. Understanding these patterns will help you read and write code across all three languages.
Keyword Arguments
Functions often accept keyword arguments (also called named arguments): you pass values by name using the name=value syntax. This makes code more readable and lets you skip optional parameters you don't need.
# Python: Keyword arguments
import pandas as pd
# Positional argument: 'data.csv' is the first (required) argument
# Keyword arguments: encoding and sep are passed by name
df = pd.read_csv('data.csv', encoding='utf-8', sep=';')
# Keyword arguments make the intent clear:
df = pd.read_csv('data.csv',
                 sep=';',              # semicolon delimiter (European data)
                 encoding='latin-1',   # handle accented characters
                 na_values=['.', ''])  # treat these as missing
* Stata: Options after the comma act like keyword arguments
* Everything after the comma is an option (named parameter)
import delimited "data.csv", encoding("utf-8") delimiter(";") clear
* summarize with the 'detail' option
summarize income, detail
* regress with the 'robust' option
reg income treatment age, robust
# R: Named arguments work the same way
library(readr)
# Named arguments: delim, locale, na are passed by name
df <- read_delim("data.csv",
delim = ";",
locale = locale(encoding = "latin1"),
na = c("", "NA", "."))
# na.rm = TRUE is a named argument you'll see everywhere
mean(df$income, na.rm = TRUE)
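Keyword arguments also work for functions you write yourself: give a parameter a default value and callers can skip it or override it by name. A minimal sketch, where summarize_income and its parameters are hypothetical names chosen for illustration and the income values are made up:

```python
def summarize_income(values, decimals=2, drop_missing=True):
    """Hypothetical helper: mean of `values`, optionally skipping None entries."""
    if drop_missing:
        values = [v for v in values if v is not None]
    return round(sum(values) / len(values), decimals)

incomes = [30000, None, 52000, 48000]  # made-up values with one missing entry

# Only the required positional argument; the defaults fill in the rest
mean_default = summarize_income(incomes)

# Keyword arguments can be given in any order
mean_rounded = summarize_income(incomes, drop_missing=True, decimals=0)

print(mean_default, mean_rounded)
```

The drop_missing parameter here plays the same role as na.rm = TRUE in R.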
The Dot Accessor and Method Chaining
In Python, the dot accessor pattern (object.method()) calls a function that belongs to an object. When you chain multiple dots, each method operates on the result of the previous one; this is called method chaining.
# Python: The dot accessor pattern
import pandas as pd
# Single dot: call a method on an object
df.head() # DataFrame.head() shows first 5 rows
df.describe() # DataFrame.describe() shows summary stats
# Method chaining: each dot operates on the previous result
result = df.groupby('region')['income'].mean()
# Step by step:
# df.groupby('region') -> GroupBy object
# ...['income'] -> select the income column
# .mean() -> compute the mean per group
# IMPORTANT: .mean refers to the method object; .mean() calls it
# Forgetting parentheses is a common bug!
wrong = df['income'].mean # Bug: returns method object, not a number
right = df['income'].mean() # Correct: calls the method, returns the mean
* Stata: Commands are separate (no method chaining)
* Each command is its own line
use "data.dta", clear
describe
summarize income
* Stata achieves "grouping" with the bysort prefix
bysort region: summarize income
# R: The pipe operator %>% chains operations (like method chaining)
library(dplyr)
# The pipe passes the result of each step to the next function
result <- df %>%
group_by(region) %>%
summarise(avg_income = mean(income, na.rm = TRUE))
# R 4.1+ also has a native pipe: |>
result <- df |>
group_by(region) |>
summarise(avg_income = mean(income, na.rm = TRUE))
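Method chaining is not specific to pandas. The sketch below shows the same pattern with Python's built-in string methods, so it runs without any data file; the example string is made up for illustration.

```python
raw = "  united STATES  "

# Chained: each method works on the string returned by the previous one
clean = raw.strip().lower().title()

# The same thing step by step
step1 = raw.strip()    # remove surrounding whitespace
step2 = step1.lower()  # convert to lowercase
step3 = step2.title()  # capitalize each word

# Without parentheses you get the method itself, not its result
method_object = raw.strip
print(clean, callable(method_object))
```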
For Loops
Loops let you repeat operations. Each language has its own syntax for iterating over sequences of values.
# Python: for loops
# Loop over a range of numbers
for i in range(10):
    print(i) # prints 0, 1, 2, ..., 9
# Loop over a list of items
for country in ['US', 'UK', 'FR']:
    print(country)
# Loop over DataFrame columns
for col in df.columns:
    print(col, df[col].dtype)
* Stata: foreach and forvalues loops
* Loop over a range of numbers
forvalues i = 1/10 {
    display `i'
}
* Loop over a list of items
foreach c in US UK FR {
    display "`c'"
}
* Loop over variables
foreach v of varlist income education age {
    summarize `v'
}
# R: for loops
# Loop over a sequence of numbers
for (i in 1:10) {
  print(i)
}
# Loop over a vector of items
for (country in c("US", "UK", "FR")) {
  print(country)
}
# Loop over column names
for (col in names(df)) {
  print(paste(col, class(df[[col]])))
}
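Note that the three numeric loops above do not cover the same values: Python's range(10) runs from 0 to 9, while Stata's 1/10 and R's 1:10 run from 1 to 10. The following sketch shows how to get the 1-to-10 behavior in Python, and how enumerate() gives you a position alongside each item.

```python
# range(start, stop) excludes stop, so 1..10 needs range(1, 11)
one_to_ten = list(range(1, 11))

# Accumulate a running total inside a loop
total = 0
for i in range(1, 11):
    total += i

# enumerate yields (position, item); start=1 gives 1-based positions
countries = ['US', 'UK', 'FR']
labelled = [f"{i}: {c}" for i, c in enumerate(countries, start=1)]

print(one_to_ten, total, labelled)
```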
Nested Function Calls
You can pass the output of one function directly as the input to another. Nested calls evaluate inside-out: the innermost function runs first, and its result becomes the argument to the next function.
# Python: Nested function calls
import numpy as np
incomes = [30000, 45000, 52000, 48000, 120000]
# Nested: evaluate inside-out
result = np.round(np.mean(np.log(incomes)), 2)
# Step 1: np.log(incomes) -> log-transform each value
# Step 2: np.mean(...) -> average the log values
# Step 3: np.round(..., 2) -> round to 2 decimal places
# Equivalent step-by-step (easier to debug):
log_incomes = np.log(incomes)
mean_log = np.mean(log_incomes)
result = np.round(mean_log, 2)
* Stata: Nested functions work the same way
display round(ln(50000), 0.01)
* Step 1: ln(50000) -> natural log = 10.819...
* Step 2: round(..., 0.01) -> round to 2 decimals = 10.82
# R: Nested function calls
incomes <- c(30000, 45000, 52000, 48000, 120000)
# Nested: innermost runs first
result <- round(mean(log(incomes)), 2)
# Or use the pipe to read left-to-right (%>% requires library(dplyr) or library(magrittr)):
result <- incomes %>% log() %>% mean() %>% round(2)
Defining Functions and Variable Scope
You can define your own reusable functions. An important concept is scope: variables defined inside a function are local to it, while variables defined outside (global) can be accessed from within, but relying on globals is fragile.
# Python: Function definition and scope
def compute_tax(dataframe, rate):
    """Add a tax column to the dataframe."""
    dataframe['tax'] = dataframe['income'] * rate
    return dataframe
# Call the function with explicit arguments
df = compute_tax(df, rate=0.25)
# Scope pitfall: this works but is fragile
rate = 0.25 # global variable
def bad_tax(dataframe):
    # 'rate' is found in global scope: works, but fragile
    dataframe['tax'] = dataframe['income'] * rate
    return dataframe
# Better: pass rate as a parameter (explicit is better than implicit)
* Stata: Programs are Stata's equivalent of functions
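capture program drop compute_tax // lets you re-run this do-file without an "already defined" error
program define compute_tax
    args varname rate
    gen tax = `varname' * `rate'
end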
* Call it
compute_tax income 0.25
* Scope: local macros are local to the program
* global macros are visible everywhere (use sparingly)
# R: Function definition and scope
compute_tax <- function(dataframe, rate) {
  dataframe$tax <- dataframe$income * rate
  return(dataframe)
}
# Call the function
df <- compute_tax(df, rate = 0.25)
# R scope: functions look in parent environments for undefined variables
# Same pitfall as Python: always pass values as parameters
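To see scope in action without a DataFrame, here is a minimal sketch on plain numbers. The function fragile_tax and the income value are made up for illustration; the point is that a local variable disappears after the function returns, while a function that reads a global changes its behavior whenever that global changes.

```python
def compute_tax(income, rate):
    tax = income * rate  # 'tax' exists only inside this function
    return tax

result = compute_tax(40000, rate=0.25)

# The local variable 'tax' is not visible out here
try:
    tax
except NameError:
    tax_visible_outside = False
else:
    tax_visible_outside = True

# Fragile version: reads 'rate' from the global scope at call time
rate = 0.25
def fragile_tax(income):
    return income * rate

before = fragile_tax(40000)
rate = 0.30                 # the global is changed somewhere else...
after = fragile_tax(40000)  # ...and the same call now returns a different value

print(result, tax_visible_outside, before, after)
```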
Indexing: 0-Based vs 1-Based
A critical difference between languages: Python uses 0-based indexing (the first element is at position 0), while R and Stata use 1-based indexing (the first element is at position 1). This is a frequent source of off-by-one bugs when translating code.
# Python: 0-based indexing
gdp = [100, 200, 300]
print(gdp[0]) # 100 (first element)
print(gdp[1]) # 200 (second element)
print(gdp[2]) # 300 (third element)
# gdp[3] would raise an IndexError: there are only 3 elements (positions 0, 1, 2)
* Stata: 1-based indexing
clear
input gdp
100
200
300
end
display gdp[1] // 100 (first observation)
display gdp[2] // 200 (second observation)
display gdp[3] // 300 (third observation)
# R: 1-based indexing
gdp <- c(100, 200, 300)
print(gdp[1]) # 100 (first element)
print(gdp[2]) # 200 (second element)
print(gdp[3]) # 300 (third element)
# gdp[0] returns numeric(0) in R: it doesn't error, just returns an empty vector
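When translating R or Stata code to Python, every position shifts by one. The sketch below makes that shift explicit with a hypothetical helper, element_at, that accepts a 1-based position; it also shows negative indices and slices, which behave differently from R (where a negative index drops an element instead of counting from the end).

```python
gdp = [100, 200, 300]

def element_at(values, position):
    """Hypothetical helper: fetch by 1-based position, as in R or Stata."""
    if position < 1 or position > len(values):
        raise IndexError(f"position {position} is outside 1..{len(values)}")
    return values[position - 1]  # shift to Python's 0-based index

first = element_at(gdp, 1)  # same element as gdp[0]
last = gdp[-1]              # negative indices count from the end
first_two = gdp[0:2]        # slices include the start and exclude the stop

print(first, last, first_two)
```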
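Division operators also differ across the three languages. Python has two: / (true division: 7/2 = 3.5) and // (floor division: 7//2 = 3). R uses / for true division and %/% for integer division. Stata has only /, so integer division is written floor(a/b), and in Stata do-files // starts a comment. When translating code into Python, writing // where you meant / silently rounds your results down.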
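A short sketch of the difference in Python follows. Note that // rounds toward negative infinity, so for negative numbers it does not give the same answer as simply dropping the decimals.

```python
import math

true_div = 7 / 2           # true division keeps the fractional part
floor_div = 7 // 2         # floor division rounds down
neg_floor = -7 // 2        # rounds toward negative infinity
neg_trunc = int(-7 / 2)    # int() drops the decimals (rounds toward zero)

# // agrees with math.floor applied to the true quotient
matches_floor = (-7 // 2) == math.floor(-7 / 2)

print(true_div, floor_div, neg_floor, neg_trunc, matches_floor)
```

R's %/% and Stata's floor() follow the same round-down convention: (-7) %/% 2 in R and floor(-7/2) in Stata both give -4.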
1.5 Your First Script
Let's put everything together by writing a simple script that loads data, performs basic calculations, and creates a visualization. We'll use the famous Gapminder dataset, which contains life expectancy, GDP per capita, and population data for countries over time.
The Gapminder dataset was popularized by Hans Rosling's famous TED talks. It's an excellent dataset for learning because it's clean, intuitive, and contains both numeric and categorical variables. Access it via the Gapminder website or through packages in each language.
# Python: First script with Gapminder data
# Script: 01_gapminder_analysis.py
# Import packages
import pandas as pd
import matplotlib.pyplot as plt
# Load data (using the gapminder package from PyPI: pip install gapminder)
from gapminder import gapminder
# Alternatively, load from URL:
# url = "https://raw.githubusercontent.com/jennybc/gapminder/main/data-raw/08_gap-every-five-years.tsv"
# gapminder = pd.read_csv(url, sep='\t')
# Explore the data
print("First 5 rows:")
print(gapminder.head())
print("\nDataset shape:", gapminder.shape)
print("\nColumn types:")
print(gapminder.dtypes)
# Summary statistics
print("\nSummary statistics:")
print(gapminder.describe())
# Filter to year 2007
gap_2007 = gapminder[gapminder['year'] == 2007]
# Calculate average life expectancy by continent
life_exp_by_continent = gap_2007.groupby('continent')['lifeExp'].mean()
print("\nLife expectancy by continent (2007):")
print(life_exp_by_continent)
# Create a scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(gap_2007['gdpPercap'], gap_2007['lifeExp'],
s=gap_2007['pop']/1e6, alpha=0.6)
plt.xlabel('GDP per capita')
plt.ylabel('Life expectancy')
plt.title('Life Expectancy vs GDP per Capita (2007)')
plt.xscale('log')
plt.savefig('gapminder_plot.png', dpi=150)
plt.show()
* Stata: First script with Gapminder data
* Script: 01_gapminder_analysis.do
* Set working directory (change to your path)
cd "~/Documents/protools_er1"
* Load Gapminder data from URL
import delimited "https://raw.githubusercontent.com/jennybc/gapminder/main/data-raw/08_gap-every-five-years.tsv", clear
* Explore the data
describe
list in 1/5
* Summary statistics
summarize
summarize lifeexp gdppercap pop, detail
* Filter to year 2007
keep if year == 2007
* Calculate average life expectancy by continent
table continent, statistic(mean lifeexp) nformat(%5.1f) // this table syntax requires Stata 17 or later
* Alternative: collapse to create summary dataset
preserve
collapse (mean) lifeexp, by(continent)
list
restore
* Create a scatter plot
* -----------------------------------------------------------------------
* twoway: Stata's main graphing command for XY plots
* scatter: plot type - creates a scatter plot of lifeexp (Y) vs gdppercap (X)
* [w=pop]: analytical weights - larger bubbles for countries with larger populations
* msymbol(Oh): marker symbol - hollow circles ("O" = circle, "h" = hollow)
* xscale(log): display X axis on logarithmic scale (useful for GDP data)
* xtitle/ytitle: axis labels
* title: main plot title
* -----------------------------------------------------------------------
twoway (scatter lifeexp gdppercap [w=pop], msymbol(Oh)), xscale(log) xtitle("GDP per capita (log scale)") ytitle("Life expectancy") title("Life Expectancy vs GDP per Capita (2007)")
graph export "gapminder_plot.png", replace
# R: First script with Gapminder data
# Script: 01_gapminder_analysis.R
# Install packages (only need to run once, then comment out)
# install.packages("tidyverse") # data manipulation and visualization
# install.packages("gapminder") # example dataset
# Load packages
library(tidyverse)
library(gapminder)
# Explore the data
head(gapminder)
glimpse(gapminder)
# Summary statistics
summary(gapminder)
# Filter to year 2007
gap_2007 <- gapminder %>%
filter(year == 2007)
# Calculate average life expectancy by continent
life_exp_by_continent <- gap_2007 %>%
group_by(continent) %>%
summarize(
mean_life_exp = mean(lifeExp),
n_countries = n()
)
print(life_exp_by_continent)
# Create a scatter plot with ggplot2
ggplot(gap_2007, aes(x = gdpPercap, y = lifeExp,
size = pop, color = continent)) +
geom_point(alpha = 0.7) +
scale_x_log10(labels = scales::comma) +
scale_size(range = c(2, 12), guide = "none") +
labs(
x = "GDP per capita (log scale)",
y = "Life expectancy",
title = "Life Expectancy vs GDP per Capita (2007)",
color = "Continent"
) +
theme_minimal()
ggsave("gapminder_plot.png", width = 10, height = 6, dpi = 150)
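The central step in all three scripts is the grouped mean (groupby in pandas, collapse in Stata, group_by plus summarize in R). To show what that step does under the hood, here is a rough sketch in plain Python. The rows are toy values invented for illustration, not real Gapminder figures, and the continent labels "A" and "B" are placeholders.

```python
from collections import defaultdict

# Toy rows (made-up values, NOT real Gapminder data)
rows = [
    {"continent": "A", "lifeExp": 60.0},
    {"continent": "A", "lifeExp": 70.0},
    {"continent": "B", "lifeExp": 80.0},
]

# Split: collect the lifeExp values belonging to each continent
groups = defaultdict(list)
for row in rows:
    groups[row["continent"]].append(row["lifeExp"])

# Apply and combine: one mean per continent
means = {continent: sum(values) / len(values)
         for continent, values in groups.items()}

print(means)
```

In practice you would let pandas, Stata, or dplyr do this for you; the sketch is only meant to make the split, apply, and combine steps visible.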
1.6 Exercises
Exercise 1.1: Setup Verification
Write code to accomplish these tasks:
- Print "Hello, ProTools ER1!" to the console
- Calculate 2^10 (2 to the power of 10)
- Create a list/vector of numbers 1-5 and calculate their sum
Exercise 1.2: Gapminder Exploration
Using the Gapminder dataset, answer these questions in your preferred language:
- How many unique countries are in the dataset?
- What was the average GDP per capita in 1952 vs 2007?
- Which country had the highest life expectancy in 2007?
- Create a line plot showing life expectancy over time for a country of your choice
Load the data from: https://raw.githubusercontent.com/jennybc/gapminder/main/data-raw/08_gap-every-five-years.tsv
Exercise 1.3: Package Installation
Install the following packages, which I use throughout the course:
- Python: pandas, numpy, matplotlib, seaborn, statsmodels
- Stata: outreg2, rdrobust, estout
- R: tidyverse, fixest, modelsummary, haven
Further Reading
- Python for Data Analysis by Wes McKinney, Chapters 1-3
- Stata Programming Tutorial by German Rodriguez, Introduction
- R for Data Science by Hadley Wickham, Chapters 1-2