2a Importing Data from Files
Learning Objectives
- Load data from CSV, Excel, Stata, and other common formats
- Understand different file encodings and handle them correctly
- Merge multiple datasets using key variables
- Reshape data between wide and long formats
Common Data Formats in Economics Research
Before loading data, you need to understand what format it's in. Here are the most common formats you'll encounter:
| Format | Extension | Description | Typical use |
|---|---|---|---|
| CSV | .csv | Comma-Separated Values. Plain text, universal compatibility. | Sharing data, web downloads, exports |
| Excel | .xlsx / .xls | Microsoft spreadsheet format. Can have multiple sheets. | Business data, government reports |
| Stata | .dta | Stata's native format. Preserves variable labels and types. | Academic datasets, IPUMS, replication files |
| R Data | .rds / .RData | R's native formats: .rds stores a single object, .RData multiple objects. | Saving R work, sharing with R users |
| JSON | .json | JavaScript Object Notation. Hierarchical/nested data. | API responses, web data, configuration |
| Fixed-width | .txt / .dat | Columns defined by character positions; no delimiters. | Census data, older government files |
Tip: For a comprehensive guide on all file formats (data files, scripts, notebooks), see Module 0: File Formats Reference.
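Fixed-width files deserve a quick illustration, since they have no delimiters at all: each column is defined purely by character positions. A minimal sketch with pandas' read_fwf (the column positions and names here are made up for the example):

```python
import io
import pandas as pd

# A tiny fixed-width extract (hypothetical layout):
# cols 0-1 = state code, 3-6 = year, 8-13 = value
raw = (
    "CA 2020  21000\n"
    "TX 2020  14700\n"
)

# colspecs gives (start, end) character positions for each column
df = pd.read_fwf(
    io.StringIO(raw),
    colspecs=[(0, 2), (3, 7), (8, 14)],
    names=["state", "year", "value"],
)
print(df)
```

With real census or government files, the agency's codebook documents the column positions; you transcribe them into colspecs (or the widths argument) rather than guessing.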
Loading Data from Files
What Happens When You "Load" Data?
Loading data reads the file from disk into your program's memory as a structured object (DataFrame in Python/R, dataset in Stata). The original file remains unchanged—you're working with a copy. Any changes you make only exist in memory until you explicitly save them.
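A quick way to convince yourself of this: modify the in-memory copy, then re-read the file. This sketch writes a throwaway CSV to a temporary directory so it is self-contained:

```python
import os
import tempfile
import pandas as pd

# Write a small CSV to disk so the example is self-contained
path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(path, "w") as f:
    f.write("country,gdp\nUSA,21000\nCHN,14700\n")

df = pd.read_csv(path)
df["gdp"] = df["gdp"] * 2      # changes the in-memory copy only

reloaded = pd.read_csv(path)   # the file on disk still has the original values
print(reloaded["gdp"].tolist())
```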
Loading CSV Files
CSV is the most common format. Here's how to load it in each language:
# Python: Load CSV with pandas
import pandas as pd
# Basic CSV load from URL
df = pd.read_csv("https://raw.githubusercontent.com/datasets/gdp/master/data/gdp.csv")
# Load from local file
df = pd.read_csv("data/my_data.csv")
# Handle common issues:
df = pd.read_csv(
    "data.csv",
    encoding="utf-8",                   # or "latin-1", "cp1252"
    sep=",",                            # delimiter (try ";" or "\t")
    na_values=["", "NA", ".", "-999"],  # missing value codes
    dtype={"id": str},                  # force column types
)
# Verify the load
print(f"Loaded {len(df):,} rows, {len(df.columns)} columns")
print(df.head())
* Stata: Load CSV files
* Basic CSV load from URL
import delimited "https://raw.githubusercontent.com/datasets/gdp/master/data/gdp.csv", clear
* Load from local file
import delimited "data/my_data.csv", clear
* Handle common issues:
import delimited "data.csv", ///
    encoding("utf-8") ///  or latin1, windows-1252
    delimiter(",")    ///  try ";" or tab
    varnames(1)       ///  first row = headers
    stringcols(1)     ///  force column 1 as string
    clear
* Verify the load
describe, short
list in 1/5
# R: Load CSV with tidyverse
library(tidyverse)
# Basic CSV load from URL
df <- read_csv("https://raw.githubusercontent.com/datasets/gdp/master/data/gdp.csv")
# Load from local file
df <- read_csv("data/my_data.csv")
# Handle common issues:
df <- read_csv("data.csv",
    locale = locale(encoding = "UTF-8"),
    na = c("", "NA", ".", "-999"),
    col_types = cols(id = col_character())
)
# Verify the load
cat(sprintf("Loaded %s rows, %d columns\n",
format(nrow(df), big.mark=","), ncol(df)))
head(df)
Loading Excel and Stata Files
# Python: Load Excel and Stata files
import pandas as pd
# Excel files
df = pd.read_excel("data.xlsx") # First sheet
df = pd.read_excel("data.xlsx", sheet_name="Sheet2") # Specific sheet
df = pd.read_excel("data.xlsx", sheet_name=0) # By index
# Load all sheets into a dictionary
all_sheets = pd.read_excel("data.xlsx", sheet_name=None)
# Stata files (preserves labels!)
df = pd.read_stata("data.dta")
# For large Stata files, use iterator
for chunk in pd.read_stata("big_data.dta", chunksize=10000):
    # Process each chunk here
    pass
# JSON files (common from APIs)
df = pd.read_json("data.json")
* Stata: Load different file formats
* Stata native format (.dta)
use "data.dta", clear
* Load from URL
use "https://example.com/data.dta", clear
* Load only specific variables
use country year gdp using "data.dta", clear
* Load with a condition (faster than loading everything, then dropping)
use if year >= 2000 using "data.dta", clear
* Excel files
import excel "data.xlsx", firstrow clear
import excel "data.xlsx", sheet("Sheet2") firstrow clear
* View available sheets
import excel "data.xlsx", describe
# R: Load different file formats
library(tidyverse)
library(readxl) # For Excel files
library(haven) # For Stata files
# Excel files
df <- read_excel("data.xlsx") # First sheet
df <- read_excel("data.xlsx", sheet = "Sheet2") # Specific sheet
# List available sheets
excel_sheets("data.xlsx")
# Load all sheets into a list
sheets <- excel_sheets("data.xlsx")
all_data <- map(sheets, ~read_excel("data.xlsx", sheet = .x))
names(all_data) <- sheets
# Stata files (preserves labels!)
df <- read_dta("data.dta")
# R native formats
df <- readRDS("data.rds") # Single object
load("workspace.RData") # Multiple objects
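One gap worth flagging: pd.read_json above handles flat records, but API responses are often nested. pandas' json_normalize flattens nested dictionaries into dotted column names; a sketch with made-up field names:

```python
import pandas as pd

# Hypothetical API response: one record per country with a nested block
records = [
    {"country": "USA", "indicators": {"gdp": 21000, "pop": 331}},
    {"country": "CHN", "indicators": {"gdp": 14700, "pop": 1400}},
]

# Nested keys become columns like "indicators.gdp"
df = pd.json_normalize(records)
print(sorted(df.columns.tolist()))
```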
Merging Datasets
What is Merging?
Merging (or joining) combines two datasets based on common key variables. For example, you might merge GDP data with population data using country code and year as keys. The key variables must uniquely identify observations in at least one of the datasets.
Types of Merges
Example: merging GDP and population data on the key variable country. The GDP dataset:
| country | gdp |
|---|---|
| USA | 21000 |
| CHN | 14700 |
| DEU | 3800 |
The population dataset:
| country | pop |
|---|---|
| USA | 331 |
| CHN | 1400 |
| JPN | 126 |
Inner join result:
| country | gdp | pop |
|---|---|---|
| USA | 21000 | 331 |
| CHN | 14700 | 1400 |
Inner join: Only rows with matching keys in BOTH datasets are kept (2 rows).
| Type | Keeps | Use When |
|---|---|---|
| Inner | Only matching rows | You only want complete cases |
| Left | All from left + matches from right | Keep all your main data, add info where available |
| Right | All from right + matches from left | Opposite of left join |
| Outer (Full) | All rows from both | You want to preserve everything |
Terminology Across Languages
Merging operations have different names in Python, Stata, and R. Use this table to translate:
| Operation | Python (pandas) | Stata | R (dplyr) |
|---|---|---|---|
| Inner Join | pd.merge(..., how='inner') | merge ..., keep(3) | inner_join() |
| Left Join | pd.merge(..., how='left') | merge ..., keep(1 3) | left_join() |
| Right Join | pd.merge(..., how='right') | merge ..., keep(2 3) | right_join() |
| Full/Outer Join | pd.merge(..., how='outer') | merge ... (default) | full_join() |
| Match type | validate='1:1', 'm:1', '1:m' | 1:1, m:1, 1:m, m:m | relationship='one-to-one', etc. |
Stata note: _merge values: 1 = master only, 2 = using only, 3 = matched in both.
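In pandas you can reproduce Stata's _merge bookkeeping directly: indicator=True adds a column flagging each row's provenance, and validate raises an error if the stated match type is violated. A sketch using the same GDP/population example:

```python
import pandas as pd

gdp = pd.DataFrame({"country": ["USA", "CHN", "DEU"], "gdp": [21000, 14700, 3800]})
pop = pd.DataFrame({"country": ["USA", "CHN", "JPN"], "pop": [331, 1400, 126]})

# indicator=True adds a _merge column: 'left_only', 'right_only', or 'both'
# validate='1:1' raises MergeError if country is not unique in either table
merged = pd.merge(gdp, pop, on="country", how="outer",
                  indicator=True, validate="1:1")
print(merged[["country", "_merge"]])
```

Tabulating the _merge column after every merge, as the Stata examples below insist, is just as good a habit in Python.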
# Python: Merging datasets with pandas
import pandas as pd
# Create example datasets
gdp = pd.DataFrame({
'country': ['USA', 'CHN', 'DEU'],
'year': [2020, 2020, 2020],
'gdp': [21000, 14700, 3800]
})
pop = pd.DataFrame({
'country': ['USA', 'CHN', 'JPN'],
'year': [2020, 2020, 2020],
'population': [331, 1400, 126]
})
# Inner join (only matches)
merged = pd.merge(gdp, pop, on=['country', 'year'])
# Left join (keep all from gdp)
merged_left = pd.merge(gdp, pop, on=['country', 'year'], how='left')
# Outer join (keep all)
merged_outer = pd.merge(gdp, pop, on=['country', 'year'], how='outer')
# When key columns have different names
merged = pd.merge(gdp, pop,
left_on=['country', 'year'],
right_on=['iso3', 'yr'])
# Check merge quality
print(f"GDP rows: {len(gdp)}, Pop rows: {len(pop)}, Merged: {len(merged)}")
* Stata: Merging datasets
* Load master dataset first
use "gdp.dta", clear
* Basic 1:1 merge (one observation per key in both)
merge 1:1 country year using "population.dta"
* Check merge results (ALWAYS DO THIS!)
tab _merge
* m:1 merge (many-to-one: many in master, one in using)
* Example: household data merged with region characteristics
merge m:1 region using "region_data.dta"
* 1:m merge (one-to-many)
merge 1:m country using "yearly_data.dta"
* Keep only matched observations
keep if _merge == 3
drop _merge
* Or assert all should match
assert _merge == 3
drop _merge
# R: Merging datasets with dplyr
library(tidyverse)
# Create example datasets
gdp <- tibble(
country = c("USA", "CHN", "DEU"),
year = c(2020, 2020, 2020),
gdp = c(21000, 14700, 3800)
)
pop <- tibble(
country = c("USA", "CHN", "JPN"),
year = c(2020, 2020, 2020),
population = c(331, 1400, 126)
)
# Inner join (only matches)
merged <- inner_join(gdp, pop, by = c("country", "year"))
# Left join (keep all from gdp)
merged_left <- left_join(gdp, pop, by = c("country", "year"))
# Full/outer join (keep all)
merged_full <- full_join(gdp, pop, by = c("country", "year"))
# When key columns have different names
merged <- left_join(gdp, pop, by = c("country" = "iso3", "year" = "yr"))
# Check merge quality
cat(sprintf("GDP: %d, Pop: %d, Merged: %d\n",
nrow(gdp), nrow(pop), nrow(merged)))
Reshaping Data: Wide vs Long
Wide vs Long Format
Wide format: One row per unit, with separate columns for each time period (e.g., gdp_2018, gdp_2019, gdp_2020).
Long format: Multiple rows per unit, with a single column for the variable and another for time period. Most statistical analyses require long format.
Example: GDP data in wide format, one column per year:
| country | gdp_2018 | gdp_2019 | gdp_2020 |
|---|---|---|---|
| USA | 20500 | 21400 | 21000 |
| CHN | 13900 | 14300 | 14700 |
2 rows × 4 columns: each year is a separate column. Wide format is good for presentation and side-by-side comparison.
# Python: Reshaping with pandas
import pandas as pd
# Example wide data
wide = pd.DataFrame({
'country': ['USA', 'CHN'],
'gdp_2018': [20500, 13900],
'gdp_2019': [21400, 14300],
'gdp_2020': [21000, 14700]
})
# Wide to Long using melt()
long = pd.melt(wide,
id_vars=['country'], # Columns to keep
var_name='year', # Name for the new 'variable' column
value_name='gdp' # Name for the new 'value' column
)
# Clean up the year column (remove 'gdp_' prefix)
long['year'] = long['year'].str.replace('gdp_', '').astype(int)
# Long to Wide using pivot()
wide_again = long.pivot(
index='country', # Row identifier
columns='year', # Create columns from this
values='gdp' # Values to fill cells
).reset_index()
print("Long format:")
print(long)
* Stata: Reshaping data
* Example: Start with wide data
* Variables: country, gdp2018, gdp2019, gdp2020
* Wide to Long
reshape long gdp, i(country) j(year)
* i() = identifier variables (rows)
* j() = new variable to hold the suffix (2018, 2019, 2020)
* 'gdp' = stub of the variables to reshape
* Long to Wide
reshape wide gdp, i(country) j(year)
* Verify the reshape
list in 1/6
* Multiple variables at once
* If you have gdp2018 gdp2019 pop2018 pop2019:
reshape long gdp pop, i(country) j(year)
# R: Reshaping with tidyr
library(tidyverse)
# Example wide data
wide <- tibble(
country = c("USA", "CHN"),
gdp_2018 = c(20500, 13900),
gdp_2019 = c(21400, 14300),
gdp_2020 = c(21000, 14700)
)
# Wide to Long using pivot_longer()
long <- wide %>%
pivot_longer(
cols = starts_with("gdp_"), # Columns to pivot
names_to = "year", # New column for names
values_to = "gdp", # New column for values
names_prefix = "gdp_" # Remove this prefix
) %>%
mutate(year = as.integer(year))
# Long to Wide using pivot_wider()
wide_again <- long %>%
pivot_wider(
names_from = year, # Column to spread
values_from = gdp, # Values to fill
names_prefix = "gdp_" # Add prefix to column names
)
print(long)
Exercise
Exercise 2a.1: Data Import and Merging
Practice the core data import skills. Complete these tasks:
- Load a CSV file into a dataframe
- Perform a merge/join operation on two datasets
- Reshape data from wide to long format (or vice versa)