1  Getting Started

~4 hours | Setup & Basics | Beginner

Learning Objectives

  • Install and configure Python, Stata, and R environments
  • Understand the differences between the three languages
  • Master basic syntax: variables, data types, and operations
  • Install and load packages/libraries
  • Write and execute your first data analysis script

1.1 Why Learn Three Languages?

You might wonder why I teach Python, Stata, and R simultaneously rather than focusing on just one. The answer is practical: different fields and organizations have standardized on different tools, and being fluent in multiple languages makes you a more versatile researcher.

Learning all three gives you flexibility in your career and allows you to collaborate with researchers across disciplines. You'll also find that concepts learned in one language transfer easily to others.

Installation & Setup

If you haven't already set up your programming environment, please refer to Module 0: Languages & Platforms, which has detailed installation guides for each language and IDE.

1.2 Basic Syntax Comparison

Let's start by comparing basic syntax across all three languages. Understanding these fundamentals will make everything else easier.

Assigning Variables

Variables store values that you can reference and manipulate throughout your code. Here's how to create variables in each language:

# Python: Variables and basic types

# Numeric types
age = 25                    # Integer
income = 55000.50           # Float

# String
name = "Alice"              # String (text)

# Boolean
is_employed = True          # Boolean (True/False)

# Lists (arrays)
scores = [85, 92, 78, 95]   # List of numbers

# Dictionary (key-value pairs)
person = {
    "name": "Alice",
    "age": 25,
    "city": "Boston"
}

# Print values
print(f"Name: {name}, Age: {age}")
print(f"Average score: {sum(scores)/len(scores)}")
* Stata: Variables and basic types

* Stata works with datasets, not individual variables
* First, let's create some local macros (temporary variables)

local age = 25
local income = 55000.50
local name "Alice"

* Display values
display "Name: `name', Age: `age'"

* Global macros persist across programs
global project_name "My Analysis"
display "$project_name"

* Create a small dataset
clear
input score
    85
    92
    78
    95
end

* Calculate mean
summarize score
display "Average score: " r(mean)
# R: Variables and basic types

# Numeric types
age <- 25L                   # Integer (the L suffix is required; plain 25 is stored as numeric)
income <- 55000.50           # Numeric

# Character (string)
name <- "Alice"              # Character

# Logical (boolean)
is_employed <- TRUE          # Logical

# Vectors (R's primary data structure)
scores <- c(85, 92, 78, 95)   # Numeric vector

# List (like Python dictionary)
person <- list(
  name = "Alice",
  age = 25,
  city = "Boston"
)

# Print values
cat("Name:", name, ", Age:", age, "\n")
cat("Average score:", mean(scores), "\n")
Python Output
Name: Alice, Age: 25
Average score: 87.5

Stata Output
. display "Name: `name', Age: `age'"
Name: Alice, Age: 25

. display "$project_name"
My Analysis

. summarize score

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       score |          4        87.5    7.593857         78         95

. display "Average score: " r(mean)
Average score: 87.5

R Output
Name: Alice , Age: 25
Average score: 87.5
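
When it is unclear what kind of value a variable holds, Python's built-in type() reports it. The following is a minimal sketch that reuses the variables from the Python example above; describe_value is a hypothetical helper introduced only for illustration and is not part of the course materials.

```python
# Reuse the variables from the Python example above
age = 25
income = 55000.50
name = "Alice"
is_employed = True
scores = [85, 92, 78, 95]
person = {"name": "Alice", "age": 25, "city": "Boston"}


def describe_value(value):
    """Return the name of a value's type (hypothetical helper for illustration)."""
    return type(value).__name__


# Print the type of each variable
for label, value in [("age", age), ("income", income), ("name", name),
                     ("is_employed", is_employed), ("scores", scores),
                     ("person", person)]:
    print(f"{label}: {describe_value(value)}")

# Lists are indexed by position (starting at 0), dictionaries by key
first_score = scores[0]
city = person["city"]
print(f"First score: {first_score}, City: {city}")
```

Note that Python counts positions from 0, so scores[0] is the first element; R vectors start at 1.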

Basic Arithmetic Operations

All three languages support standard mathematical operations. Let's see how they compare:

# Python: Arithmetic operations

a = 10
b = 3

print(f"Addition: {a + b}")        # 13
print(f"Subtraction: {a - b}")     # 7
print(f"Multiplication: {a * b}")  # 30
print(f"Division: {a / b}")        # 3.333...
print(f"Integer division: {a // b}") # 3
print(f"Modulo: {a % b}")          # 1
print(f"Power: {a ** b}")          # 1000

# Comparison operators
print(f"a > b: {a > b}")          # True
print(f"a == b: {a == b}")        # False
print(f"a != b: {a != b}")        # True
* Stata: Arithmetic operations

local a = 10
local b = 3

display "Addition: " `a' + `b'           // 13
display "Subtraction: " `a' - `b'        // 7
display "Multiplication: " `a' * `b'     // 30
display "Division: " `a' / `b'           // 3.333...
display "Integer division: " floor(`a'/`b') // 3
display "Modulo: " mod(`a', `b')        // 1
display "Power: "  `a'^`b'               // 1000

* Comparison operators (return 1 for true, 0 for false)
display "a > b: " (`a' > `b')           // 1
display "a == b: " (`a' == `b')         // 0
display "a != b: " (`a' != `b')         // 1
# R: Arithmetic operations

a <- 10
b <- 3

cat("Addition:", a + b, "\n")          # 13
cat("Subtraction:", a - b, "\n")       # 7
cat("Multiplication:", a * b, "\n")    # 30
cat("Division:", a / b, "\n")          # 3.333...
cat("Integer division:", a %/% b, "\n") # 3
cat("Modulo:", a %% b, "\n")           # 1
cat("Power:", a ^ b, "\n")             # 1000

# Comparison operators
cat("a > b:", a > b, "\n")            # TRUE
cat("a == b:", a == b, "\n")          # FALSE
cat("a != b:", a != b, "\n")          # TRUE
Python Output
Addition: 13
Subtraction: 7
Multiplication: 30
Division: 3.3333333333333335
Integer division: 3
Modulo: 1
Power: 1000
a > b: True
a == b: False
a != b: True

Stata Output
. display "Addition: " `a' + `b'
Addition: 13

. display "Subtraction: " `a' - `b'
Subtraction: 7

. display "Multiplication: " `a' * `b'
Multiplication: 30

. display "Division: " `a' / `b'
Division: 3.3333333

. display "Integer division: " floor(`a'/`b')
Integer division: 3

. display "Modulo: " mod(`a', `b')
Modulo: 1

. display "Power: " `a'^`b'
Power: 1000

. display "a > b: " (`a' > `b')
a > b: 1

. display "a == b: " (`a' == `b')
a == b: 0

. display "a != b: " (`a' != `b')
a != b: 1

R Output
Addition: 13
Subtraction: 7
Multiplication: 30
Division: 3.333333
Integer division: 3
Modulo: 1
Power: 1000
a > b: TRUE
a == b: FALSE
a != b: TRUE
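
The three outputs above show the division result with different numbers of digits. This looks like a difference in how each language displays the number rather than in the computed value. The sketch below, in Python only, shows how to control the displayed precision and how to compare floating-point results safely; the choice of two and six decimal places is arbitrary.

```python
import math

a = 10
b = 3
quotient = a / b

# Python prints the shortest string that round-trips to the same float
print(quotient)            # 3.3333333333333335

# Format specifiers control how many digits are displayed
print(f"{quotient:.6f}")   # 3.333333
print(f"{quotient:.2f}")   # 3.33

# round() returns a new number; the stored quotient is unchanged
rounded = round(quotient, 2)

# Compare floats with a tolerance rather than with ==
same = math.isclose(quotient * b, a)
print(rounded, same)
```

Formatting only changes what is printed. Use round() when the rounded value itself is needed in later calculations.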

1.3 Package Management

One of the most powerful features of all three languages is their extensibility through packages (also called libraries or modules). Packages provide pre-written functions that save you from reinventing the wheel.

Essential Packages for ProTools ER1

Purpose           | Python                  | Stata         | R
Data manipulation | pandas                  | Built-in      | dplyr, tidyr
Visualization     | matplotlib, seaborn     | Built-in      | ggplot2
Statistics        | scipy, statsmodels      | Built-in      | Built-in, broom
Causal inference  | causalinference, econml | rdrobust, did | fixest, rdrobust

Installing and Loading Packages

# Python: Installing and importing packages

# Install from command line/terminal:
# pip install pandas numpy matplotlib seaborn scipy statsmodels

# Import packages in your script
import pandas as pd        # Data manipulation
import numpy as np         # Numerical computing
import matplotlib.pyplot as plt  # Plotting
import seaborn as sns       # Statistical visualization

# Import specific functions
from scipy import stats
from statsmodels.formula.api import ols

# Check package version
print(f"Pandas version: {pd.__version__}")
* Stata: Installing and using packages

* Most Stata functionality is built-in
* User-written packages come from SSC (Statistical Software Components)

* Install a package from SSC
ssc install rdrobust      // RD estimation
ssc install outreg2       // Export regression tables
ssc install estout        // More regression output options

* Install from GitHub (using net install)
net install did, from("https://raw.githubusercontent.com/bcallaway11/did/master/stata/") replace

* Update all installed packages
adoupdate, update

* Search for packages
search difference-in-differences

* Check installed packages
ado dir
# R: Installing and loading packages

# Install packages (only need to do once)
install.packages("tidyverse")   # Includes dplyr, ggplot2, tidyr, etc.
install.packages("fixest")      # Fast fixed effects estimation
install.packages("rdrobust")    # Regression discontinuity

# Load packages (do every session)
library(tidyverse)   # Data manipulation & viz
library(fixest)      # Econometrics
library(haven)       # Read Stata/SPSS/SAS files

# Install from GitHub
# install.packages("devtools")
# devtools::install_github("author/package")

# Check package version
packageVersion("tidyverse")
Python Output
Pandas version: 2.1.4

Stata Output
. ssc install rdrobust
checking rdrobust consistency and samples
installing into c:\ado\plus\...
installation complete.

. ado dir
[1] package rdrobust from http://fmwww.bc.edu/repec/bocode/r
    'RDROBUST': module to provide robust data-driven inference

R Output
── Attaching core tidyverse packages ──────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2
── Conflicts ────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package to force all conflicts to become errors
[1] '2.0.0'
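
Before running a longer script, it can help to confirm that every package it needs is actually installed. The following is a minimal sketch using only Python's standard library; check_packages is a hypothetical helper written for this illustration, and the list of names simply mirrors the imports in the Python example above. It assumes the import name matches the name used with pip, which holds for these six packages but not for every package (for example, scikit-learn is imported as sklearn).

```python
from importlib import metadata, util


def check_packages(names):
    """Map each import name to its installed version, or None if it is missing.

    Hypothetical helper for illustration; not part of the course materials.
    """
    report = {}
    for name in names:
        if util.find_spec(name) is None:
            report[name] = None          # package cannot be imported
            continue
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "unknown"     # importable, but no version metadata found
    return report


# Import names taken from the Python examples above
wanted = ["pandas", "numpy", "matplotlib", "seaborn", "scipy", "statsmodels"]
for name, version in check_packages(wanted).items():
    if version is None:
        print(f"{name}: NOT INSTALLED (try: pip install {name})")
    else:
        print(f"{name}: {version}")
```

The sketch only reports what is missing; it does not install anything, so the pip command still has to be run from the terminal.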

1.4 Your First Script

Let's put everything together by writing a simple script that loads data, performs basic calculations, and creates a visualization. We'll use the famous Gapminder dataset, which contains life expectancy, GDP per capita, and population data for countries over time.

Dataset: Gapminder

The Gapminder dataset was popularized by Hans Rosling's famous TED talks. It's an excellent dataset for learning because it's clean, intuitive, and contains both numeric and categorical variables. Access it via the Gapminder website or through packages in each language.

# Python: First script with Gapminder data
# Script: 01_gapminder_analysis.py

# Import packages
import pandas as pd
import matplotlib.pyplot as plt

# Load data (from the gapminder package: pip install gapminder)
from gapminder import gapminder

# Alternatively, load from URL:
# url = "https://raw.githubusercontent.com/jennybc/gapminder/main/data-raw/08_gap-every-five-years.tsv"
# gapminder = pd.read_csv(url, sep='\t')

# Explore the data
print("First 5 rows:")
print(gapminder.head())

print("\nDataset shape:", gapminder.shape)
print("\nColumn types:")
print(gapminder.dtypes)

# Summary statistics
print("\nSummary statistics:")
print(gapminder.describe())

# Filter to year 2007
gap_2007 = gapminder[gapminder['year'] == 2007]

# Calculate average life expectancy by continent
life_exp_by_continent = gap_2007.groupby('continent')['lifeExp'].mean()
print("\nLife expectancy by continent (2007):")
print(life_exp_by_continent)

# Create a scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(gap_2007['gdpPercap'], gap_2007['lifeExp'],
            s=gap_2007['pop']/1e6, alpha=0.6)
plt.xlabel('GDP per capita')
plt.ylabel('Life expectancy')
plt.title('Life Expectancy vs GDP per Capita (2007)')
plt.xscale('log')
plt.savefig('gapminder_plot.png', dpi=150)
plt.show()
* Stata: First script with Gapminder data
* Script: 01_gapminder_analysis.do

* Set working directory (change to your path)
cd "~/Documents/protools_er1"

* Load Gapminder data from URL
import delimited "https://raw.githubusercontent.com/jennybc/gapminder/main/data-raw/08_gap-every-five-years.tsv", clear

* Explore the data
describe
list in 1/5

* Summary statistics
summarize
summarize lifeexp gdppercap pop, detail

* Filter to year 2007
keep if year == 2007

* Calculate average life expectancy by continent
table continent, statistic(mean lifeexp) nformat(%5.1f)

* Alternative: collapse to create summary dataset
preserve
collapse (mean) lifeexp, by(continent)
list
restore

* Create a scatter plot
twoway (scatter lifeexp gdppercap [w=pop], msymbol(Oh)) ///
    , xscale(log) ///
    xtitle("GDP per capita (log scale)") ///
    ytitle("Life expectancy") ///
    title("Life Expectancy vs GDP per Capita (2007)")

graph export "gapminder_plot.png", replace
# R: First script with Gapminder data
# Script: 01_gapminder_analysis.R

# Load packages
library(tidyverse)
library(gapminder)  # install.packages("gapminder")

# Explore the data
head(gapminder)
glimpse(gapminder)

# Summary statistics
summary(gapminder)

# Filter to year 2007
gap_2007 <- gapminder %>%
  filter(year == 2007)

# Calculate average life expectancy by continent
life_exp_by_continent <- gap_2007 %>%
  group_by(continent) %>%
  summarize(
    mean_life_exp = mean(lifeExp),
    n_countries = n()
  )

print(life_exp_by_continent)

# Create a scatter plot with ggplot2
ggplot(gap_2007, aes(x = gdpPercap, y = lifeExp,
                      size = pop, color = continent)) +
  geom_point(alpha = 0.7) +
  scale_x_log10(labels = scales::comma) +
  scale_size(range = c(2, 12), guide = "none") +
  labs(
    x = "GDP per capita (log scale)",
    y = "Life expectancy",
    title = "Life Expectancy vs GDP per Capita (2007)",
    color = "Continent"
  ) +
  theme_minimal()

ggsave("gapminder_plot.png", width = 10, height = 6, dpi = 150)
Python Output
First 5 rows:
       country continent  year  lifeExp       pop   gdpPercap
0  Afghanistan      Asia  1952   28.801   8425333  779.445314
1  Afghanistan      Asia  1957   30.332   9240934  820.853030
2  Afghanistan      Asia  1962   31.997  10267083  853.100710
3  Afghanistan      Asia  1967   34.020  11537966  836.197138
4  Afghanistan      Asia  1972   36.088  13079460  739.981106

Dataset shape: (1704, 6)

Column types:
country       object
continent     object
year           int64
lifeExp      float64
pop            int64
gdpPercap    float64
dtype: object

Life expectancy by continent (2007):
continent
Africa      54.806038
Americas    73.608120
Asia        70.728485
Europe      77.648600
Oceania     80.719500
Name: lifeExp, dtype: float64

Stata Output
. describe

Contains data
 Observations:         1,704
    Variables:             6

Variable      Storage   Display    Value
    name         type    format    label
───────────────────────────────────────────
country         str24   %24s
continent       str8    %9s
year            int     %10.0g
lifeexp         double  %10.0g
pop             double  %10.0g
gdppercap       double  %10.0g

. table continent, statistic(mean lifeexp) nformat(%5.1f)

─────────────────────────
continent   Mean(lifeexp)
─────────────────────────
Africa               54.8
Americas             73.6
Asia                 70.7
Europe               77.6
Oceania              80.7
─────────────────────────

(file gapminder_plot.png written in PNG format)

R Output
> head(gapminder)
# A tibble: 6 × 6
  country     continent  year lifeExp      pop gdpPercap
  <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
1 Afghanistan Asia       1952    28.8  8425333      779.
2 Afghanistan Asia       1957    30.3  9240934      821.
3 Afghanistan Asia       1962    32.0 10267083      853.
4 Afghanistan Asia       1967    34.0 11537966      836.
5 Afghanistan Asia       1972    36.1 13079460      740.
6 Afghanistan Asia       1977    38.4 14880372      786.

> print(life_exp_by_continent)
# A tibble: 5 × 3
  continent mean_life_exp n_countries
  <fct>             <dbl>       <int>
1 Africa             54.8          52
2 Americas           73.6          25
3 Asia               70.7          33
4 Europe             77.6          30
5 Oceania            80.7           2

Saving 10 x 6 in image
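
All three scripts follow the same two steps: filter the rows to one year, then average a column within each group. To make that logic visible without any package, here is a minimal sketch in plain Python. The rows are made-up values for illustration only (they are not real Gapminder data), and mean_by_group is a hypothetical helper, not a pandas function.

```python
from collections import defaultdict
from statistics import mean

# Made-up rows for illustration only (NOT real Gapminder values)
rows = [
    {"country": "A", "continent": "North", "year": 2007, "lifeExp": 70.0},
    {"country": "B", "continent": "North", "year": 2007, "lifeExp": 80.0},
    {"country": "C", "continent": "South", "year": 2007, "lifeExp": 60.0},
    {"country": "C", "continent": "South", "year": 1952, "lifeExp": 40.0},
]


def mean_by_group(records, group_key, value_key):
    """Average value_key within each group_key (hypothetical helper)."""
    groups = defaultdict(list)
    for record in records:
        # Split: collect the values belonging to each group
        groups[record[group_key]].append(record[value_key])
    # Apply and combine: one mean per group
    return {group: mean(values) for group, values in groups.items()}


# Step 1: filter to one year, like gapminder[gapminder['year'] == 2007]
rows_2007 = [r for r in rows if r["year"] == 2007]

# Step 2: group and average, like .groupby('continent')['lifeExp'].mean()
result = mean_by_group(rows_2007, "continent", "lifeExp")
print(result)
```

In practice the pandas, Stata, and dplyr versions above are preferable because they handle missing values and large datasets; the sketch is only meant to show what those one-line commands are doing.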

1.5 Exercises

Exercise 1.1: Setup Verification

Write code to accomplish these tasks:

  1. Print "Hello, ProTools ER1!" to the console
  2. Calculate 2^10 (2 to the power of 10)
  3. Create a list/vector of numbers 1-5 and calculate their sum

Exercise 1.2: Gapminder Exploration

Using the Gapminder dataset, answer these questions in your preferred language:

  1. How many unique countries are in the dataset?
  2. What was the average GDP per capita in 1952 vs 2007?
  3. Which country had the highest life expectancy in 2007?
  4. Create a line plot showing life expectancy over time for a country of your choice
Hint

Load the data from: https://raw.githubusercontent.com/jennybc/gapminder/main/data-raw/08_gap-every-five-years.tsv

Exercise 1.3: Package Installation

Install the following packages, which I use throughout the course:

  • Python: pandas, numpy, matplotlib, seaborn, statsmodels
  • Stata: outreg2, rdrobust, estout
  • R: tidyverse, fixest, modelsummary, haven
Further Reading