1  Getting Started

~4 hours | Setup & Basics | Beginner

Learning Objectives

  • Install and configure Python, Stata, and R environments
  • Understand the differences between the three languages
  • Master basic syntax: variables, data types, and operations
  • Install and load packages/libraries
  • Write and execute your first data analysis script

1.1 Why Learn Three Languages?

You might wonder why I teach Python, Stata, and R simultaneously rather than focusing on just one. The answer is practical: different fields and organizations have standardized on different tools, and being fluent in multiple languages makes you a more versatile researcher.

Learning all three gives you flexibility in your career and allows you to collaborate with researchers across disciplines. You'll also find that concepts learned in one language transfer easily to others.

Installation & Setup

If you haven't already set up your programming environment, please refer to Module 0: Languages & Platforms for detailed installation guides for each language and IDE.

1.2 Basic Syntax Comparison

Let's start by comparing basic syntax across all three languages. Understanding these fundamentals will make everything else easier.

Assigning Variables

Variables store values that you can reference and manipulate throughout your code. Here's how to create variables in each language:

# Python: Variables and basic types

# Numeric types
age = 25                    # Integer
income = 55000.50           # Float

# String
name = "Alice"              # String (text)

# Boolean
is_employed = True          # Boolean (True/False)

# Lists (arrays)
scores = [85, 92, 78, 95]   # List of numbers

# Dictionary (key-value pairs)
person = {
    "name": "Alice",
    "age": 25,
    "city": "Boston"
}

# Print values
print(f"Name: {name}, Age: {age}")
print(f"Average score: {sum(scores)/len(scores)}")
* Stata: Variables and basic types

* Stata works with datasets, not individual variables
* First, let's create some local macros (temporary variables)

local age = 25
local income = 55000.50
local name "Alice"

* Display values
display "Name: `name', Age: `age'"

* Global macros persist across programs
global project_name "My Analysis"
display "$project_name"

* Create a small dataset
clear
input score
    85
    92
    78
    95
end

* Calculate mean
summarize score
display "Average score: " r(mean)
# R: Variables and basic types

# Numeric types
age <- 25                    # Numeric (double); write 25L for a true integer
income <- 55000.50           # Numeric

# Character (string)
name <- "Alice"              # Character

# Logical (boolean)
is_employed <- TRUE          # Logical

# Vectors (R's primary data structure)
scores <- c(85, 92, 78, 95)   # Numeric vector

# List (like Python dictionary)
person <- list(
  name = "Alice",
  age = 25,
  city = "Boston"
)

# Print values
cat("Name:", name, ", Age:", age, "\n")
cat("Average score:", mean(scores), "\n")
Python Output
Name: Alice, Age: 25
Average score: 87.5
Stata Output
. display "Name: `name', Age: `age'"
Name: Alice, Age: 25

. display "$project_name"
My Analysis

. summarize score

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       score |          4        87.5    7.593857         78         95

. display "Average score: " r(mean)
Average score: 87.5
R Output
Name: Alice , Age: 25
Average score: 87.5
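
To confirm which type Python inferred for each value, you can ask with the built-in type() function. The sketch below reuses the variable names from the Python example above and is only an illustration, not part of the course scripts:

```python
# Inspect the types Python inferred for the values defined earlier
age = 25
income = 55000.50
name = "Alice"
is_employed = True
scores = [85, 92, 78, 95]
person = {"name": "Alice", "age": 25, "city": "Boston"}

values = [("age", age), ("income", income), ("name", name),
          ("is_employed", is_employed), ("scores", scores),
          ("person", person)]

for label, value in values:
    # type(value).__name__ gives the short type name, e.g. "int"
    print(f"{label}: {type(value).__name__}")

# Lists are accessed by position, dictionaries by key
first_score = scores[0]
city = person["city"]
print(first_score, city)
```

Checking types this way is often the fastest route to diagnosing errors such as trying to add a string to a number.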

Basic Arithmetic Operations

All three languages support standard mathematical operations. Let's see how they compare:

# Python: Arithmetic operations

a = 10
b = 3

print(f"Addition: {a + b}")        # 13
print(f"Subtraction: {a - b}")     # 7
print(f"Multiplication: {a * b}")  # 30
print(f"Division: {a / b}")        # 3.333...
print(f"Integer division: {a // b}") # 3
print(f"Modulo: {a % b}")          # 1
print(f"Power: {a ** b}")          # 1000

# Comparison operators
print(f"a > b: {a > b}")          # True
print(f"a == b: {a == b}")        # False
print(f"a != b: {a != b}")        # True
* Stata: Arithmetic operations

local a = 10
local b = 3

display "Addition: " `a' + `b'           // 13
display "Subtraction: " `a' - `b'        // 7
display "Multiplication: " `a' * `b'     // 30
display "Division: " `a' / `b'           // 3.333...
display "Integer division: " floor(`a'/`b') // 3
display "Modulo: " mod(`a', `b')        // 1
display "Power: "  `a'^`b'               // 1000

* Comparison operators (return 1 for true, 0 for false)
display "a > b: " (`a' > `b')           // 1
display "a == b: " (`a' == `b')         // 0
display "a != b: " (`a' != `b')         // 1
# R: Arithmetic operations

a <- 10
b <- 3

cat("Addition:", a + b, "\n")          # 13
cat("Subtraction:", a - b, "\n")       # 7
cat("Multiplication:", a * b, "\n")    # 30
cat("Division:", a / b, "\n")          # 3.333...
cat("Integer division:", a %/% b, "\n") # 3
cat("Modulo:", a %% b, "\n")           # 1
cat("Power:", a ^ b, "\n")             # 1000

# Comparison operators
cat("a > b:", a > b, "\n")            # TRUE
cat("a == b:", a == b, "\n")          # FALSE
cat("a != b:", a != b, "\n")          # TRUE
Python Output
Addition: 13
Subtraction: 7
Multiplication: 30
Division: 3.3333333333333335
Integer division: 3
Modulo: 1
Power: 1000
a > b: True
a == b: False
a != b: True
Stata Output
. display "Addition: " `a' + `b'
Addition: 13

. display "Subtraction: " `a' - `b'
Subtraction: 7

. display "Multiplication: " `a' * `b'
Multiplication: 30

. display "Division: " `a' / `b'
Division: 3.3333333

. display "Integer division: " floor(`a'/`b')
Integer division: 3

. display "Modulo: " mod(`a', `b')
Modulo: 1

. display "Power: " `a'^`b'
Power: 1000

. display "a > b: " (`a' > `b')
a > b: 1

. display "a == b: " (`a' == `b')
a == b: 0

. display "a != b: " (`a' != `b')
a != b: 1
R Output
Addition: 13
Subtraction: 7
Multiplication: 30
Division: 3.333333
Integer division: 3
Modulo: 1
Power: 1000
a > b: TRUE
a == b: FALSE
a != b: TRUE
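
The trailing digits in Python's division result (3.3333333333333335) are a reminder that decimal numbers are stored approximately. One consequence is that == can give surprising answers for floats. A minimal sketch of the issue and a common workaround, using only the standard library:

```python
import math

total = 0.1 + 0.2

# Exact equality fails because 0.1 and 0.2 cannot be stored exactly
exact_match = (total == 0.3)
print(exact_match)            # False

# math.isclose compares within a small tolerance instead
close_match = math.isclose(total, 0.3)
print(close_match)            # True

# Rounding is for display, not for comparison
rounded = round(10 / 3, 2)
print(rounded)                # 3.33
```

The same caution applies in R and Stata, since all three store decimals in binary floating point.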

1.3 Package Management

One of the most powerful features of all three languages is their extensibility through packages (also called libraries or modules). Packages provide pre-written functions that save you from reinventing the wheel.

Essential Packages for ProTools ER1

Purpose           | Python                  | Stata         | R
Data manipulation | pandas                  | Built-in      | dplyr, tidyr
Visualization     | matplotlib, seaborn     | Built-in      | ggplot2
Statistics        | scipy, statsmodels      | Built-in      | Built-in, broom
Causal inference  | causalinference, econml | rdrobust, did | fixest, rdrobust

Installing and Loading Packages

# Python: Installing and importing packages

# Install from command line/terminal:
# pip install pandas numpy matplotlib seaborn

# Import packages in your script
import pandas as pd        # Data manipulation
import numpy as np         # Numerical computing
import matplotlib.pyplot as plt  # Plotting
import seaborn as sns       # Statistical visualization

# Import specific functions
from scipy import stats
from statsmodels.formula.api import ols

# Check package version
print(f"Pandas version: {pd.__version__}")
* Stata: Installing and using packages

* Most Stata functionality is built-in
* User-written packages come from SSC (Statistical Software Components)

* Install a package from SSC
ssc install rdrobust      // RD estimation
ssc install outreg2       // Export regression tables
ssc install estout        // More regression output options

* Install from GitHub (using net install)
net install did, from("https://raw.githubusercontent.com/bcallaway11/did/master/stata/") replace

* Update all installed packages
adoupdate, update

* Search for packages
search difference-in-differences

* Check installed packages
ado dir
# R: Installing and loading packages

# Install packages (only need to do once)
install.packages("tidyverse")   # Includes dplyr, ggplot2, tidyr, etc.
install.packages("fixest")      # Fast fixed effects estimation
install.packages("rdrobust")    # Regression discontinuity

# Load packages (do every session)
library(tidyverse)   # Data manipulation & viz
library(fixest)      # Econometrics
library(haven)       # Read Stata/SPSS/SAS files

# Install from GitHub
# install.packages("devtools")
# devtools::install_github("author/package")

# Check package version
packageVersion("tidyverse")
Python Output
Pandas version: 2.1.4
Stata Output
. ssc install rdrobust
checking rdrobust consistency and samples
installing into c:\ado\plus\...
installation complete.

. ado dir
[1] package rdrobust from http://fmwww.bc.edu/repec/bocode/r
      'RDROBUST': module to provide robust data-driven inference
R Output
── Attaching core tidyverse packages ──────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2
── Conflicts ────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package to force all conflicts to become errors
[1] '2.0.0'
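
Before running a longer script, it can help to check which packages are actually installed. The sketch below is one possible approach using Python's standard library; the helper name check_packages is hypothetical, and it assumes the import name matches the name used with pip, which holds for the packages listed in this section but not for every package:

```python
from importlib import metadata, util


def check_packages(names):
    """Return {name: version string, or None if the package is missing}."""
    report = {}
    for name in names:
        if util.find_spec(name) is None:
            # Python cannot locate the module, so it is not installed
            report[name] = None
            continue
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Importable but without pip metadata (e.g. standard library)
            report[name] = "unknown"
    return report


course_packages = ["pandas", "numpy", "matplotlib", "seaborn", "statsmodels"]
for package, version in check_packages(course_packages).items():
    status = version if version is not None else "NOT INSTALLED"
    print(f"{package}: {status}")
```

Any package reported as NOT INSTALLED can then be added with pip install from the terminal.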

1.4 Essential Syntax Patterns

Before writing full scripts, let's cover several syntax patterns that appear constantly in research code. Understanding these patterns will help you read and write code across all three languages.

Keyword Arguments

Functions often accept keyword arguments (also called named arguments): you pass values by name using the name=value syntax. This makes code more readable and lets you skip optional parameters you don't need.

# Python: Keyword arguments
import pandas as pd

# Positional argument: 'data.csv' is the first (required) argument
# Keyword arguments: encoding and sep are passed by name
df = pd.read_csv('data.csv', encoding='utf-8', sep=';')

# Keyword arguments make the intent clear:
df = pd.read_csv('data.csv',
                  sep=';',           # semicolon delimiter (European data)
                  encoding='latin-1', # handle accented characters
                  na_values=['.', ''])  # treat these as missing
* Stata: Options after the comma act like keyword arguments

* Everything after the comma is an option (named parameter)
import delimited "data.csv", encoding("utf-8") delimiter(";") clear

* summarize with the 'detail' option
summarize income, detail

* regress with the 'robust' option
reg income treatment age, robust
# R: Named arguments work the same way
library(readr)

# Named arguments: delim, locale, na are passed by name
df <- read_delim("data.csv",
                delim = ";",
                locale = locale(encoding = "latin1"),
                na = c("", "NA", "."))

# na.rm = TRUE is a named argument you'll see everywhere
mean(df$income, na.rm = TRUE)
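
Keyword arguments are not limited to library functions; your own functions can offer them too by giving parameters default values. A minimal sketch, where the function name describe_income and its parameters are made up for illustration:

```python
def describe_income(values, digits=2, skip_missing=True):
    """Return the mean of values, rounded to `digits` decimals.

    digits and skip_missing have defaults, so callers may omit them
    or override them by name.
    """
    if skip_missing:
        # Treat None as a missing value and drop it
        values = [v for v in values if v is not None]
    return round(sum(values) / len(values), digits)


incomes = [30000, None, 52000, 48000]

default_result = describe_income(incomes)            # uses both defaults
whole_result = describe_income(incomes, digits=0)    # overrides one by name
print(default_result)   # 43333.33
print(whole_result)     # 43333.0
```

Passing digits=0 by name leaves skip_missing at its default, which is the "skip optional parameters you don't need" behaviour described above.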

The Dot Accessor and Method Chaining

In Python, the dot accessor pattern (object.method()) calls a function that belongs to an object. When you chain multiple dots, each method operates on the result of the previous one; this is called method chaining.

# Python: The dot accessor pattern
import pandas as pd

# Single dot: call a method on an object
df.head()           # DataFrame.head() shows first 5 rows
df.describe()       # DataFrame.describe() shows summary stats

# Method chaining: each dot operates on the previous result
result = df.groupby('region')['income'].mean()
# Step by step:
#   df.groupby('region')  -> GroupBy object
#   ...['income']         -> select the income column
#   .mean()               -> compute the mean per group

# IMPORTANT: .mean refers to the method object; .mean() calls it
# Forgetting parentheses is a common bug!
wrong = df['income'].mean    # Bug: returns method object, not a number
right = df['income'].mean()   # Correct: calls the method, returns the mean
* Stata: Commands are separate (no method chaining)
* Each command is its own line

use "data.dta", clear
describe
summarize income

* Stata achieves "grouping" with the bysort prefix
bysort region: summarize income
# R: The pipe operator %>% chains operations (like method chaining)
library(dplyr)

# The pipe passes the result of each step to the next function
result <- df %>%
  group_by(region) %>%
  summarise(avg_income = mean(income, na.rm = TRUE))

# R 4.1+ also has a native pipe: |>
result <- df |>
  group_by(region) |>
  summarise(avg_income = mean(income, na.rm = TRUE))
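
Method chaining is not specific to pandas. The sketch below uses plain Python strings so that it runs without any data file; the example label is made up, and pandas chains work on the same principle:

```python
raw = "  North-East Region  "

# Each method returns a new string, so the next dot operates on that result
clean = raw.strip().lower().replace("-", "_").replace(" ", "_")
print(clean)  # north_east_region

# Equivalent step-by-step version (easier to debug)
step1 = raw.strip()             # "North-East Region"
step2 = step1.lower()           # "north-east region"
step3 = step2.replace("-", "_") # "north_east region"
step4 = step3.replace(" ", "_") # "north_east_region"

# Without parentheses you get the method itself, not its result
method_object = raw.strip
print(callable(method_object))  # True: it is a function waiting to be called
```

If a chain produces an unexpected result, splitting it into named steps like this shows where things went wrong.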

For Loops

Loops let you repeat operations. Each language has its own syntax for iterating over sequences of values.

# Python: for loops

# Loop over a range of numbers
for i in range(10):
    print(i)           # prints 0, 1, 2, ..., 9

# Loop over a list of items
for country in ['US', 'UK', 'FR']:
    print(country)

# Loop over DataFrame columns
for col in df.columns:
    print(col, df[col].dtype)
* Stata: foreach and forvalues loops

* Loop over a range of numbers
forvalues i = 1/10 {
    display `i'
}

* Loop over a list of items
foreach c in US UK FR {
    display "`c'"
}

* Loop over variables
foreach v of varlist income education age {
    summarize `v'
}
# R: for loops

# Loop over a sequence of numbers
for (i in 1:10) {
  print(i)
}

# Loop over a vector of items
for (country in c("US", "UK", "FR")) {
  print(country)
}

# Loop over column names
for (col in names(df)) {
  print(paste(col, class(df[[col]])))
}
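
In practice, loops are often used to collect results rather than just print them. The following sketch, with made-up scores for two hypothetical groups, stores one mean per group in a dictionary and then uses enumerate to number the output starting from 1:

```python
# Toy data for illustration only
scores_by_group = {
    "control": [70, 75, 80],
    "treated": [78, 85, 92],
}

# Accumulate one result per group
means = {}
for group, scores in scores_by_group.items():
    means[group] = sum(scores) / len(scores)

# enumerate(..., start=1) gives a human-friendly counter
for position, group in enumerate(means, start=1):
    print(position, group, means[group])
```

Storing results in a dictionary keeps each value attached to its label, which avoids mixing up the order later.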

Nested Function Calls

You can pass the output of one function directly as the input to another. Nested calls evaluate inside-out: the innermost function runs first, and its result becomes the argument to the next function.

# Python: Nested function calls
import numpy as np

incomes = [30000, 45000, 52000, 48000, 120000]

# Nested: evaluate inside-out
result = np.round(np.mean(np.log(incomes)), 2)
# Step 1: np.log(incomes)  -> log-transform each value
# Step 2: np.mean(...)     -> average the log values
# Step 3: np.round(..., 2) -> round to 2 decimal places

# Equivalent step-by-step (easier to debug):
log_incomes = np.log(incomes)
mean_log = np.mean(log_incomes)
result = np.round(mean_log, 2)
* Stata: Nested functions work the same way

display round(ln(50000), 0.01)
* Step 1: ln(50000)         -> natural log = 10.819...
* Step 2: round(..., 0.01)  -> round to 2 decimals = 10.82
# R: Nested function calls
incomes <- c(30000, 45000, 52000, 48000, 120000)

# Nested: innermost runs first
result <- round(mean(log(incomes)), 2)

# Or use the pipe to read left-to-right
# (%>% requires dplyr or magrittr to be loaded):
result <- incomes %>% log() %>% mean() %>% round(2)

Defining Functions and Variable Scope

You can define your own reusable functions. An important concept is scope: variables defined inside a function are local to it, while variables defined outside (global) can be accessed from within, but relying on globals is fragile.

# Python: Function definition and scope

def compute_tax(dataframe, rate):
    """Add a tax column to the dataframe."""
    dataframe['tax'] = dataframe['income'] * rate
    return dataframe

# Call the function with explicit arguments
df = compute_tax(df, rate=0.25)

# Scope pitfall: this works but is fragile
rate = 0.25  # global variable
def bad_tax(dataframe):
    # 'rate' is found in global scope: works, but fragile
    dataframe['tax'] = dataframe['income'] * rate
    return dataframe
# Better: pass rate as a parameter (explicit is better than implicit)
* Stata: Programs are Stata's equivalent of functions

program define compute_tax
    args varname rate
    gen tax = `varname' * `rate'
end

* Call it
compute_tax income 0.25

* Scope: local macros are local to the program
* global macros are visible everywhere (use sparingly)
# R: Function definition and scope

compute_tax <- function(dataframe, rate) {
  dataframe$tax <- dataframe$income * rate
  return(dataframe)
}

# Call the function
df <- compute_tax(df, rate = 0.25)

# R scope: functions look in parent environments for undefined variables
# Same pitfall as Python: always pass values as parameters
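
To see why relying on a global is fragile, consider what happens when the global changes between two calls. This is a minimal sketch with hypothetical function names, using plain numbers instead of a DataFrame so it runs on its own:

```python
rate = 0.25  # module-level (global) variable


def tax_implicit(income):
    # Looks up 'rate' in the global scope each time it is called
    return income * rate


def tax_explicit(income, rate):
    # Depends only on its own parameters
    return income * rate


before = tax_implicit(1000)         # uses rate = 0.25

rate = 0.40                         # the global is changed elsewhere in the script

after = tax_implicit(1000)          # same call, different answer
stable = tax_explicit(1000, 0.25)   # unaffected by the global

print(before, after, stable)
```

The call tax_implicit(1000) looks identical both times, yet its result depends on a line somewhere else in the script. The explicit version makes that dependency visible at the call site.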

Indexing: 0-Based vs 1-Based

A critical difference between languages: Python uses 0-based indexing (the first element is at position 0), while R and Stata use 1-based indexing (the first element is at position 1). This is a frequent source of off-by-one bugs when translating code.

# Python: 0-based indexing
gdp = [100, 200, 300]

print(gdp[0])   # 100 (first element)
print(gdp[1])   # 200 (second element)
print(gdp[2])   # 300 (third element)
# gdp[3] would raise IndexError: only 3 elements (indices 0, 1, 2)
* Stata: 1-based indexing
clear
input gdp
100
200
300
end

display gdp[1]   // 100 (first observation)
display gdp[2]   // 200 (second observation)
display gdp[3]   // 300 (third observation)
# R: 1-based indexing
gdp <- c(100, 200, 300)

print(gdp[1])   # 100 (first element)
print(gdp[2])   # 200 (second element)
print(gdp[3])   # 300 (third element)
# gdp[0] returns numeric(0) in R: it doesn't error, just returns empty
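
When translating R or Stata code to Python, the safest habit is to convert positions explicitly. The sketch below shows Python's negative indices and slices, plus a small hypothetical helper (one_based) that accepts a 1-based position:

```python
gdp = [100, 200, 300]

last = gdp[-1]        # negative indices count from the end
first_two = gdp[0:2]  # slices include the start, exclude the stop
print(last)           # 300
print(first_two)      # [100, 200]


def one_based(values, position):
    """Return the element at a 1-based position, as R or Stata would."""
    if position < 1 or position > len(values):
        raise IndexError(f"position {position} is out of range")
    return values[position - 1]


second = one_based(gdp, 2)
print(second)         # 200
```

Note that gdp[0:2] returns two elements, not three: the stop position of a Python slice is excluded, whereas gdp[1:2] in R includes both ends.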
Cross-Language Gotcha: Integer Division

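Python has two division operators: / (true division: 7/2 = 3.5) and // (floor division: 7//2 = 3). R has the same pair, written / and %/% (7 %/% 2 = 3). Stata has only /, so integer division is written floor(7/2); in a Stata do-file, // starts a comment, so anything after it on the line is ignored. When translating code, mixing up these operators can silently change your results.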
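
The difference between Python's two division operators is easiest to see side by side. This short sketch also shows how floor division behaves with negative numbers, which differs from simple truncation:

```python
true_div = 7 / 2          # true division always returns a float
floor_div = 7 // 2        # floor division rounds down to a whole number
print(true_div)           # 3.5
print(floor_div)          # 3

# Floor division rounds toward negative infinity, not toward zero
neg_floor = -7 // 2
neg_trunc = int(-7 / 2)
print(neg_floor)          # -4
print(neg_trunc)          # -3

# divmod returns the quotient and remainder together
quotient, remainder = divmod(7, 2)
print(quotient, remainder)  # 3 1
```

If a translated calculation gives whole numbers where you expected decimals, a stray // is a likely cause.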

1.5 Your First Script

Let's put everything together by writing a simple script that loads data, performs basic calculations, and creates a visualization. We'll use the famous Gapminder dataset, which contains life expectancy, GDP per capita, and population data for countries over time.

Dataset: Gapminder

The Gapminder dataset was popularized by Hans Rosling's famous TED talks. It's an excellent dataset for learning because it's clean, intuitive, and contains both numeric and categorical variables. Access it via the Gapminder website or through packages in each language.

# Python: First script with Gapminder data
# Script: 01_gapminder_analysis.py

# Import packages
import pandas as pd
import matplotlib.pyplot as plt

# Load data (using the gapminder package from PyPI: pip install gapminder)
from gapminder import gapminder

# Alternatively, load from URL:
# url = "https://raw.githubusercontent.com/jennybc/gapminder/main/data-raw/08_gap-every-five-years.tsv"
# gapminder = pd.read_csv(url, sep='\t')

# Explore the data
print("First 5 rows:")
print(gapminder.head())

print("\nDataset shape:", gapminder.shape)
print("\nColumn types:")
print(gapminder.dtypes)

# Summary statistics
print("\nSummary statistics:")
print(gapminder.describe())

# Filter to year 2007
gap_2007 = gapminder[gapminder['year'] == 2007]

# Calculate average life expectancy by continent
life_exp_by_continent = gap_2007.groupby('continent')['lifeExp'].mean()
print("\nLife expectancy by continent (2007):")
print(life_exp_by_continent)

# Create a scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(gap_2007['gdpPercap'], gap_2007['lifeExp'],
            s=gap_2007['pop']/1e6, alpha=0.6)
plt.xlabel('GDP per capita')
plt.ylabel('Life expectancy')
plt.title('Life Expectancy vs GDP per Capita (2007)')
plt.xscale('log')
plt.savefig('gapminder_plot.png', dpi=150)
plt.show()
* Stata: First script with Gapminder data
* Script: 01_gapminder_analysis.do

* Set working directory (change to your path)
cd "~/Documents/protools_er1"

* Load Gapminder data from URL
import delimited "https://raw.githubusercontent.com/jennybc/gapminder/main/data-raw/08_gap-every-five-years.tsv", clear

* Explore the data
describe
list in 1/5

* Summary statistics
summarize
summarize lifeexp gdppercap pop, detail

* Filter to year 2007
keep if year == 2007

* Calculate average life expectancy by continent
table continent, statistic(mean lifeexp) nformat(%5.1f)

* Alternative: collapse to create summary dataset
preserve
collapse (mean) lifeexp, by(continent)
list
restore

* Create a scatter plot
* -----------------------------------------------------------------------
* twoway: Stata's main graphing command for XY plots
* scatter: plot type - creates a scatter plot of lifeexp (Y) vs gdppercap (X)
* [w=pop]: analytical weights - larger bubbles for countries with larger populations
* msymbol(Oh): marker symbol - hollow circles ("O" = circle, "h" = hollow)
* xscale(log): display X axis on logarithmic scale (useful for GDP data)
* xtitle/ytitle: axis labels
* title: main plot title
* -----------------------------------------------------------------------
twoway (scatter lifeexp gdppercap [w=pop], msymbol(Oh)), xscale(log) xtitle("GDP per capita (log scale)") ytitle("Life expectancy") title("Life Expectancy vs GDP per Capita (2007)")

graph export "gapminder_plot.png", replace
# R: First script with Gapminder data
# Script: 01_gapminder_analysis.R

# Install packages (only need to run once, then comment out)
# install.packages("tidyverse")  # data manipulation and visualization
# install.packages("gapminder")  # example dataset

# Load packages
library(tidyverse)
library(gapminder)

# Explore the data
head(gapminder)
glimpse(gapminder)

# Summary statistics
summary(gapminder)

# Filter to year 2007
gap_2007 <- gapminder %>%
  filter(year == 2007)

# Calculate average life expectancy by continent
life_exp_by_continent <- gap_2007 %>%
  group_by(continent) %>%
  summarize(
    mean_life_exp = mean(lifeExp),
    n_countries = n()
  )

print(life_exp_by_continent)

# Create a scatter plot with ggplot2
ggplot(gap_2007, aes(x = gdpPercap, y = lifeExp,
                      size = pop, color = continent)) +
  geom_point(alpha = 0.7) +
  scale_x_log10(labels = scales::comma) +
  scale_size(range = c(2, 12), guide = "none") +
  labs(
    x = "GDP per capita (log scale)",
    y = "Life expectancy",
    title = "Life Expectancy vs GDP per Capita (2007)",
    color = "Continent"
  ) +
  theme_minimal()

ggsave("gapminder_plot.png", width = 10, height = 6, dpi = 150)
Python Output
First 5 rows:
       country continent  year  lifeExp       pop   gdpPercap
0  Afghanistan      Asia  1952   28.801   8425333  779.445314
1  Afghanistan      Asia  1957   30.332   9240934  820.853030
2  Afghanistan      Asia  1962   31.997  10267083  853.100710
3  Afghanistan      Asia  1967   34.020  11537966  836.197138
4  Afghanistan      Asia  1972   36.088  13079460  739.981106

Dataset shape: (1704, 6)

Column types:
country       object
continent     object
year           int64
lifeExp      float64
pop            int64
gdpPercap    float64
dtype: object

Life expectancy by continent (2007):
continent
Africa      54.806038
Americas    73.608120
Asia        70.728485
Europe      77.648600
Oceania     80.719500
Name: lifeExp, dtype: float64

Generated figure: gapminder_plot.png (scatter plot titled "Life Expectancy vs GDP per Capita (2007)"; x-axis: GDP per capita, log scale; y-axis: Life expectancy)
Stata Output
. describe

Contains data
 Observations:         1,704
    Variables:             6

Variable      Storage   Display    Value
    name         type    format    label
───────────────────────────────────────────
country         str24   %24s
continent       str8    %9s
year            int     %10.0g
lifeexp         double  %10.0g
pop             double  %10.0g
gdppercap       double  %10.0g

. table continent, statistic(mean lifeexp) nformat(%5.1f)

─────────────────────────
continent   Mean(lifeexp)
─────────────────────────
Africa               54.8
Americas             73.6
Asia                 70.7
Europe               77.6
Oceania              80.7
─────────────────────────

(file gapminder_plot.png written in PNG format)
R Output
> head(gapminder)
# A tibble: 6 × 6
  country     continent  year lifeExp      pop gdpPercap
  <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
1 Afghanistan Asia       1952    28.8  8425333      779.
2 Afghanistan Asia       1957    30.3  9240934      821.
3 Afghanistan Asia       1962    32.0 10267083      853.
4 Afghanistan Asia       1967    34.0 11537966      836.
5 Afghanistan Asia       1972    36.1 13079460      740.
6 Afghanistan Asia       1977    38.4 14880372      786.

> print(life_exp_by_continent)
# A tibble: 5 × 3
  continent mean_life_exp n_countries
  <fct>             <dbl>       <int>
1 Africa             54.8          52
2 Americas           73.6          25
3 Asia               70.7          33
4 Europe             77.6          30
5 Oceania            80.7           2

Saving 10 x 6 in image

1.6 Exercises

Exercise 1.1: Setup Verification

Write code to accomplish these tasks:

  1. Print "Hello, ProTools ER1!" to the console
  2. Calculate 2^10 (2 to the power of 10)
  3. Create a list/vector of numbers 1-5 and calculate their sum

Exercise 1.2: Gapminder Exploration

Using the Gapminder dataset, answer these questions in your preferred language:

  1. How many unique countries are in the dataset?
  2. What was the average GDP per capita in 1952 vs 2007?
  3. Which country had the highest life expectancy in 2007?
  4. Create a line plot showing life expectancy over time for a country of your choice
Hint

Load the data from: https://raw.githubusercontent.com/jennybc/gapminder/main/data-raw/08_gap-every-five-years.tsv

Exercise 1.3: Package Installation

Install the following packages, which I use throughout the course:

  • Python: pandas, numpy, matplotlib, seaborn, statsmodels
  • Stata: outreg2, rdrobust, estout
  • R: tidyverse, fixest, modelsummary, haven
Further Reading