Module 2: Data Harnessing
Learning Objectives
- Load data from CSV, Excel, Stata, and Parquet formats
- Merge and reshape datasets for analysis
- Access data programmatically through APIs
- Understand the legal and ethical framework for web scraping
- Extract data from web pages using HTML parsing
Course Research Project
Throughout this course, we build a complete research project: "Climate Vulnerability and Economic Growth." This module focuses on acquiring the data we'll later explore and analyze.
What is Data Harnessing?
Data harnessing is the process of acquiring data from various sources and preparing it for analysis. This is the critical first step in any research project—before you can explore, clean, or analyze data, you need to get it.
This module covers three methods of data acquisition, ordered from simplest to most complex:
Three Paths to Data
Path A: File Import
Data already collected and stored in files
- CSV, Excel, Stata .dta files
- Survey datasets (IPUMS, etc.)
- Published replication data
Skills: Loading, merging, reshaping
Path B: APIs
Programmatic access to online databases
- World Bank, FRED, Census
- Structured, official access
- Real-time or updated data
Skills: HTTP requests, JSON, authentication
Path C: Web Scraping
Extract data from web pages (last resort)
- When no API/download exists
- Requires legal awareness
- HTML parsing techniques
Skills: HTML, BeautifulSoup/rvest, ethics
Module Structure
This module is divided into three subpages, one for each data acquisition method:
2a: File Import. Start here. Learn to load, merge, and reshape data from common file formats (a short sketch follows this list):
- Loading CSV, Excel, Stata, and Parquet files
- Merging datasets with different join types
- Reshaping between wide and long formats
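As a taste of what subpage 2a covers, here is a minimal pandas sketch. The file names, column names, and join keys are hypothetical placeholders for whatever your own data uses:

```python
import pandas as pd

# File, column, and key names below are hypothetical placeholders.
gdp = pd.read_csv("gdp.csv")               # CSV
temps = pd.read_excel("temps.xlsx")        # Excel (needs the openpyxl package)
survey = pd.read_stata("survey.dta")       # Stata
prices = pd.read_parquet("prices.parquet") # Parquet (needs pyarrow)

# Left join: keep every row of `gdp`, attach temperatures where keys match
merged = gdp.merge(temps, on=["country", "year"], how="left")

# Wide -> long: collapse indicator columns into (indicator, value) pairs
long_df = merged.melt(id_vars=["country", "year"],
                      var_name="indicator", value_name="value")
```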
2b: APIs. Fetch data programmatically from online databases (see the example after this list):
- Understanding what APIs are and how they work
- Making HTTP requests to retrieve data
- Handling JSON responses and authentication
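For instance, the World Bank API serves JSON over plain HTTP with no key required. This sketch assumes the v2 URL pattern and the indicator code NY.GDP.PCAP.CD (GDP per capita, current US$); adapt both to your own query:

```python
import requests

# World Bank v2 API: country KEN, indicator NY.GDP.PCAP.CD (GDP per capita)
url = "https://api.worldbank.org/v2/country/KEN/indicator/NY.GDP.PCAP.CD"
response = requests.get(url, params={"format": "json", "per_page": 100})
response.raise_for_status()              # stop on HTTP errors (4xx/5xx)

# The body is a two-element JSON array: [metadata, records]
metadata, records = response.json()
for rec in records[:5]:
    print(rec["date"], rec["value"])     # value may be None for missing years
```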
2c: Web Scraping. Extract data from web pages when no other option exists (see the example after this list):
- Legal and ethical considerations (Terms of Service, robots.txt, copyright)
- HTML basics and CSS selectors
- BeautifulSoup (Python) and rvest (R) for parsing
- Handling pagination and dynamic content
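The pattern below is a minimal sketch of the polite-scraping workflow: check robots.txt, identify yourself, then parse with CSS selectors. The URL, table class, and contact address are all hypothetical:

```python
import requests
from urllib import robotparser
from bs4 import BeautifulSoup

# Hypothetical target page; swap in a real URL you are allowed to scrape.
base = "https://example.com"
page = base + "/reports"

# Check robots.txt before fetching anything
rp = robotparser.RobotFileParser(base + "/robots.txt")
rp.read()
if not rp.can_fetch("*", page):
    raise SystemExit("robots.txt disallows scraping this page")

# Identify yourself politely in the User-Agent header
html = requests.get(page, headers={"User-Agent": "course-project (you@uni.edu)"}).text

# CSS selectors: grab every cell in rows of a table with class 'data'
soup = BeautifulSoup(html, "html.parser")
for row in soup.select("table.data tr"):
    cells = [td.get_text(strip=True) for td in row.select("td")]
    if cells:
        print(cells)
```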
Which Subpage Should I Start With?
| Your Situation | Recommended Path |
|---|---|
| "I have a .csv/.xlsx/.dta file to work with" | 2a: File Import |
| "I need World Bank / FRED / Census data" | 2b: APIs |
| "The data I need is only on a website" | 2c: Web Scraping (but check for APIs first!) |
| "I'm new to programming" | 2a first, then 2b, then 2c |
Once you've acquired your data, you'll want to understand it before cleaning or analyzing it. Module 3: Data Exploration teaches you to build a "First Analysis" script that inspects, summarizes, and visualizes any dataset. This diagnostic step should happen before you start cleaning (Module 4) or formal analysis (Module 5).
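As a preview, the core of that "First Analysis" script is only a few lines of pandas; the file name here is a hypothetical stand-in for whatever dataset you assembled in this module:

```python
import pandas as pd

# Hypothetical output of this module; any DataFrame works the same way
df = pd.read_csv("climate_gdp_merged.csv")

df.info()                   # column types and non-null counts
print(df.describe())        # summary statistics for numeric columns
print(df.isna().sum())      # missing values per column
df.hist(figsize=(10, 6))    # quick distributions (requires matplotlib)
```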