1 Getting Started
Learning Objectives
- Install and configure Python, Stata, and R environments
- Understand the differences between the three languages
- Master basic syntax: variables, data types, and operations
- Install and load packages/libraries
- Write and execute your first data analysis script
1.1 Why Learn Three Languages?
You might wonder why I teach Python, Stata, and R simultaneously rather than focusing on just one. The answer is practical: different fields and organizations have standardized on different tools, and being fluent in multiple languages makes you a more versatile researcher.
Learning all three gives you flexibility in your career and allows you to collaborate with researchers across disciplines. You'll also find that concepts learned in one language transfer easily to others.
If you haven't already set up your programming environment, please refer to Module 0: Languages & Platforms for detailed installation guides for each language and IDE. That module covers:
- VS Code for Python development
- Stata interface and setup
- RStudio for R programming
- Jupyter Notebooks for interactive coding
1.2 Basic Syntax Comparison
Let's start by comparing basic syntax across all three languages. Understanding these fundamentals will make everything else easier.
Assigning Variables
Variables store values that you can reference and manipulate throughout your code. Here's how to create variables in each language:
# Python: Variables and basic types
# Numeric types
age = 25 # Integer
income = 55000.50 # Float
# String
name = "Alice" # String (text)
# Boolean
is_employed = True # Boolean (True/False)
# Lists (arrays)
scores = [85, 92, 78, 95] # List of numbers
# Dictionary (key-value pairs)
person = {
    "name": "Alice",
    "age": 25,
    "city": "Boston"
}
# Print values
print(f"Name: {name}, Age: {age}")
print(f"Average score: {sum(scores)/len(scores)}")
* Stata: Variables and basic types
* Stata works with datasets, not individual variables
* First, let's create some local macros (temporary variables)
local age = 25
local income = 55000.50
local name "Alice"
* Display values
display "Name: `name', Age: `age'"
* Global macros persist across programs
global project_name "My Analysis"
display "$project_name"
* Create a small dataset
clear
input score
85
92
78
95
end
* Calculate mean
summarize score
display "Average score: " r(mean)
# R: Variables and basic types
# Numeric types
age <- 25 # Numeric (stored as a double; write 25L for an integer)
income <- 55000.50 # Numeric
# Character (string)
name <- "Alice" # Character
# Logical (boolean)
is_employed <- TRUE # Logical
# Vectors (R's primary data structure)
scores <- c(85, 92, 78, 95) # Numeric vector
# List (like Python dictionary)
person <- list(
  name = "Alice",
  age = 25,
  city = "Boston"
)
# Print values
cat("Name:", name, ", Age:", age, "\n")
cat("Average score:", mean(scores), "\n")
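Because each language infers a variable's type from the value you assign, it can help to inspect types directly when something behaves unexpectedly. The following is a minimal Python sketch that reuses the variable names from the example above; `describe` is a hypothetical helper introduced only for illustration.

```python
# Inspect the type Python inferred for each value.
age = 25
income = 55000.50
name = "Alice"
is_employed = True
scores = [85, 92, 78, 95]


def describe(value):
    """Return the name of a value's type (hypothetical helper for illustration)."""
    return type(value).__name__


for label, value in [("age", age), ("income", income), ("name", name),
                     ("is_employed", is_employed), ("scores", scores)]:
    print(f"{label}: {describe(value)}")

# Mixing an int and a float in arithmetic produces a float.
total = age + income
print(f"total: {describe(total)}")
```

In R, `class()` plays a similar role; note that `class(25)` reports "numeric" unless the value is written as `25L`.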
Basic Arithmetic Operations
All three languages support standard mathematical operations. Let's see how they compare:
# Python: Arithmetic operations
a = 10
b = 3
print(f"Addition: {a + b}") # 13
print(f"Subtraction: {a - b}") # 7
print(f"Multiplication: {a * b}") # 30
print(f"Division: {a / b}") # 3.333...
print(f"Integer division: {a // b}") # 3
print(f"Modulo: {a % b}") # 1
print(f"Power: {a ** b}") # 1000
# Comparison operators
print(f"a > b: {a > b}") # True
print(f"a == b: {a == b}") # False
print(f"a != b: {a != b}") # True
* Stata: Arithmetic operations
local a = 10
local b = 3
display "Addition: " `a' + `b' // 13
display "Subtraction: " `a' - `b' // 7
display "Multiplication: " `a' * `b' // 30
display "Division: " `a' / `b' // 3.333...
display "Integer division: " floor(`a'/`b') // 3
display "Modulo: " mod(`a', `b') // 1
display "Power: " `a'^`b' // 1000
* Comparison operators (return 1 for true, 0 for false)
display "a > b: " (`a' > `b') // 1
display "a == b: " (`a' == `b') // 0
display "a != b: " (`a' != `b') // 1
# R: Arithmetic operations
a <- 10
b <- 3
cat("Addition:", a + b, "\n") # 13
cat("Subtraction:", a - b, "\n") # 7
cat("Multiplication:", a * b, "\n") # 30
cat("Division:", a / b, "\n") # 3.333...
cat("Integer division:", a %/% b, "\n") # 3
cat("Modulo:", a %% b, "\n") # 1
cat("Power:", a ^ b, "\n") # 1000
# Comparison operators
cat("a > b:", a > b, "\n") # TRUE
cat("a == b:", a == b, "\n") # FALSE
cat("a != b:", a != b, "\n") # TRUE
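Two details of these operators are easy to overlook: how integer division and modulo treat negative numbers, and why exact equality tests on decimal results can surprise you. The short Python sketch below illustrates both; the numbers are arbitrary examples chosen for illustration.

```python
import math

# Floor division rounds toward negative infinity, not toward zero.
pos_quotient = 7 // 2      # 3
neg_quotient = -7 // 2     # -4

# The modulo result takes the sign of the divisor.
neg_remainder = -7 % 2     # 1

# Floating-point results are approximate, so compare with a tolerance.
exact_equal = (0.1 + 0.2 == 0.3)             # False
close_enough = math.isclose(0.1 + 0.2, 0.3)  # True

print(pos_quotient, neg_quotient, neg_remainder)
print(exact_equal, close_enough)
```

When comparing computed decimals, a tolerance-based check such as `math.isclose` is usually the safer choice than `==`.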
1.3 Package Management
One of the most powerful features of all three languages is their extensibility through packages (also called libraries or modules). Packages provide pre-written functions that save you from reinventing the wheel.
Essential Packages for ProTools ER1
| Purpose | Python | Stata | R |
|---|---|---|---|
| Data manipulation | pandas | Built-in | dplyr, tidyr |
| Visualization | matplotlib, seaborn | Built-in | ggplot2 |
| Statistics | scipy, statsmodels | Built-in | Built-in, broom |
| Causal inference | causalinference, econml | rdrobust, did | fixest, rdrobust |
Installing and Loading Packages
# Python: Installing and importing packages
# Install from command line/terminal:
# pip install pandas numpy matplotlib seaborn
# Import packages in your script
import pandas as pd # Data manipulation
import numpy as np # Numerical computing
import matplotlib.pyplot as plt # Plotting
import seaborn as sns # Statistical visualization
# Import specific functions
from scipy import stats
from statsmodels.formula.api import ols
# Check package version
print(f"Pandas version: {pd.__version__}")
* Stata: Installing and using packages
* Most Stata functionality is built-in
* User-written packages come from SSC (Statistical Software Components)
* Install a package from SSC
ssc install rdrobust // RD estimation
ssc install outreg2 // Export regression tables
ssc install estout // More regression output options
* Install from GitHub (using net install)
net install did, from("https://raw.githubusercontent.com/bcallaway11/did/master/stata/") replace
* Update all installed packages
adoupdate, update
* Search for packages
search difference-in-differences
* Check installed packages
ado dir
# R: Installing and loading packages
# Install packages (only need to do once)
install.packages("tidyverse") # Includes dplyr, ggplot2, tidyr, etc.
install.packages("fixest") # Fast fixed effects estimation
install.packages("rdrobust") # Regression discontinuity
# Load packages (do every session)
library(tidyverse) # Data manipulation & viz
library(fixest) # Econometrics
library(haven) # Read Stata/SPSS/SAS files
# Install from GitHub
# install.packages("devtools")
# devtools::install_github("author/package")
# Check package version
packageVersion("tidyverse")
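Before running a longer script, it can be useful to confirm that every required package is actually installed. The following is one possible Python sketch using only the standard library; `package_status` and the `required` list are hypothetical names chosen for illustration, and the sketch assumes each package's import name matches its distribution name, which holds for the packages listed here but not for every package.

```python
from importlib import metadata
from importlib.util import find_spec


def package_status(names):
    """Map each import name to its installed version, or None if it is missing."""
    status = {}
    for name in names:
        if find_spec(name) is None:
            status[name] = None
            continue
        try:
            status[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Importable, but no distribution metadata under this name
            # (for example, standard-library modules).
            status[name] = "unknown"
    return status


required = ["pandas", "numpy", "matplotlib", "seaborn", "scipy", "statsmodels"]
for name, version in package_status(required).items():
    if version is None:
        print(f"{name}: NOT INSTALLED (try: pip install {name})")
    else:
        print(f"{name}: {version}")
```

Running this at the top of a project makes missing dependencies visible immediately instead of partway through an analysis.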
1.4 Your First Script
Let's put everything together by writing a simple script that loads data, performs basic calculations, and creates a visualization. We'll use the famous Gapminder dataset, which contains life expectancy, GDP per capita, and population data for countries over time.
The Gapminder dataset was popularized by Hans Rosling's famous TED talks. It's an excellent dataset for learning because it's clean, intuitive, and contains both numeric and categorical variables. Access it via the Gapminder website or through packages in each language.
# Python: First script with Gapminder data
# Script: 01_gapminder_analysis.py
# Import packages
import pandas as pd
import matplotlib.pyplot as plt
# Load data (from the gapminder package: pip install gapminder)
from gapminder import gapminder
# Alternatively, load from URL:
# url = "https://raw.githubusercontent.com/jennybc/gapminder/main/data-raw/08_gap-every-five-years.tsv"
# gapminder = pd.read_csv(url, sep='\t')
# Explore the data
print("First 5 rows:")
print(gapminder.head())
print("\nDataset shape:", gapminder.shape)
print("\nColumn types:")
print(gapminder.dtypes)
# Summary statistics
print("\nSummary statistics:")
print(gapminder.describe())
# Filter to year 2007
gap_2007 = gapminder[gapminder['year'] == 2007]
# Calculate average life expectancy by continent
life_exp_by_continent = gap_2007.groupby('continent')['lifeExp'].mean()
print("\nLife expectancy by continent (2007):")
print(life_exp_by_continent)
# Create a scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(gap_2007['gdpPercap'], gap_2007['lifeExp'],
            s=gap_2007['pop']/1e6, alpha=0.6)
plt.xlabel('GDP per capita')
plt.ylabel('Life expectancy')
plt.title('Life Expectancy vs GDP per Capita (2007)')
plt.xscale('log')
plt.savefig('gapminder_plot.png', dpi=150)
plt.show()
* Stata: First script with Gapminder data
* Script: 01_gapminder_analysis.do
* Set working directory (change to your path)
cd "~/Documents/protools_er1"
* Load Gapminder data from URL
import delimited "https://raw.githubusercontent.com/jennybc/gapminder/main/data-raw/08_gap-every-five-years.tsv", clear
* Explore the data
describe
list in 1/5
* Summary statistics
summarize
summarize lifeexp gdppercap pop, detail
* Filter to year 2007
keep if year == 2007
* Calculate average life expectancy by continent
table continent, statistic(mean lifeexp) nformat(%5.1f)
* Alternative: collapse to create summary dataset
preserve
collapse (mean) lifeexp, by(continent)
list
restore
* Create a scatter plot
twoway (scatter lifeexp gdppercap [w=pop], msymbol(Oh)) ///
, xscale(log) ///
xtitle("GDP per capita (log scale)") ///
ytitle("Life expectancy") ///
title("Life Expectancy vs GDP per Capita (2007)")
graph export "gapminder_plot.png", replace
# R: First script with Gapminder data
# Script: 01_gapminder_analysis.R
# Load packages
library(tidyverse)
library(gapminder) # install.packages("gapminder")
# Explore the data
head(gapminder)
glimpse(gapminder)
# Summary statistics
summary(gapminder)
# Filter to year 2007
gap_2007 <- gapminder %>%
  filter(year == 2007)
# Calculate average life expectancy by continent
life_exp_by_continent <- gap_2007 %>%
  group_by(continent) %>%
  summarize(
    mean_life_exp = mean(lifeExp),
    n_countries = n()
  )
print(life_exp_by_continent)
# Create a scatter plot with ggplot2
ggplot(gap_2007, aes(x = gdpPercap, y = lifeExp,
                     size = pop, color = continent)) +
  geom_point(alpha = 0.7) +
  scale_x_log10(labels = scales::comma) +
  scale_size(range = c(2, 12), guide = "none") +
  labs(
    x = "GDP per capita (log scale)",
    y = "Life expectancy",
    title = "Life Expectancy vs GDP per Capita (2007)",
    color = "Continent"
  ) +
  theme_minimal()
ggsave("gapminder_plot.png", width = 10, height = 6, dpi = 150)
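All three scripts follow the same pattern: filter to one year, group by continent, and average life expectancy within each group. The sketch below spells out that logic in plain Python, without pandas, so the individual steps are visible. The rows in `SAMPLE_TSV` are made-up placeholders that only mimic the Gapminder column layout (they are not real Gapminder values), and `mean_by_group` is a hypothetical helper introduced for illustration.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Made-up rows that only mimic the Gapminder column layout; the values are
# placeholders for illustration, not real Gapminder figures.
SAMPLE_TSV = (
    "country\tcontinent\tyear\tlifeExp\n"
    "A\tNorth\t2002\t60.0\n"
    "A\tNorth\t2007\t62.0\n"
    "B\tNorth\t2007\t70.0\n"
    "C\tSouth\t2007\t55.0\n"
)


def mean_by_group(rows, year, group_col="continent", value_col="lifeExp"):
    """Filter rows to one year, then average value_col within each group."""
    groups = defaultdict(list)
    for row in rows:
        if int(row["year"]) == year:
            groups[row[group_col]].append(float(row[value_col]))
    return {group: mean(values) for group, values in groups.items()}


rows = list(csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t"))
result = mean_by_group(rows, year=2007)
print(result)  # {'North': 66.0, 'South': 55.0}
```

In practice the pandas `groupby`, Stata `collapse`, and dplyr `group_by` plus `summarize` calls shown above do this work for you; the sketch is only meant to show what those calls compute.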
1.5 Exercises
Exercise 1.1: Setup Verification
Write code to accomplish these tasks:
- Print "Hello, ProTools ER1!" to the console
- Calculate 2^10 (2 to the power of 10)
- Create a list/vector of numbers 1-5 and calculate their sum
Exercise 1.2: Gapminder Exploration
Using the Gapminder dataset, answer these questions in your preferred language:
- How many unique countries are in the dataset?
- What was the average GDP per capita in 1952 vs 2007?
- Which country had the highest life expectancy in 2007?
- Create a line plot showing life expectancy over time for a country of your choice
Load the data from: https://raw.githubusercontent.com/jennybc/gapminder/main/data-raw/08_gap-every-five-years.tsv
Exercise 1.3: Package Installation
Install the following packages, which I use throughout the course:
- Python: pandas, numpy, matplotlib, seaborn, statsmodels
- Stata: outreg2, rdrobust, estout
- R: tidyverse, fixest, modelsummary, haven
- Python for Data Analysis by Wes McKinney, Chapters 1-3
- Stata Programming Tutorial by German Rodriguez, Introduction
- R for Data Science by Hadley Wickham, Chapters 1-2