2  Data Harnessing

Duration: ~8 hours · Topics: Acquisition, APIs, Web Scraping · Level: Beginner-Intermediate

Learning Objectives

  • Load data from CSV, Excel, Stata, and Parquet formats
  • Merge and reshape datasets for analysis
  • Access data programmatically through APIs
  • Understand the legal and ethical framework for web scraping
  • Extract data from web pages using HTML parsing

Course Research Project

Throughout this course, we build a complete research project: "Climate Vulnerability and Economic Growth." This module focuses on acquiring the data we'll later explore and analyze.

What is Data Harnessing?

Data harnessing is the process of acquiring data from various sources and preparing it for analysis. This is the critical first step in any research project—before you can explore, clean, or analyze data, you need to get it.

This module covers three methods of data acquisition, ordered from simplest to most complex:

Three Paths to Data

Path A: File Import

Data already collected and stored in files

  • CSV, Excel, Stata .dta, and Parquet files
  • Survey datasets (IPUMS, etc.)
  • Published replication data

Skills: Loading, merging, reshaping

Path B: APIs

Programmatic access to online databases

  • World Bank, FRED, Census
  • Structured, official access
  • Real-time or updated data

Skills: HTTP requests, JSON, authentication

Path C: Web Scraping

Extract data from web pages (last resort)

  • When no API/download exists
  • Requires legal awareness
  • HTML parsing techniques

Skills: HTML, BeautifulSoup/rvest, ethics

Module Structure

This module is divided into three subpages, one for each data acquisition method:

2a: File Import

Start here. Learn to load, merge, and reshape data from common file formats (a short pandas sketch follows the list):

  • Loading CSV, Excel, Stata, and Parquet files
  • Merging datasets with different join types
  • Reshaping between wide and long formats
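A minimal pandas sketch of all three skills. The file names and the country/year columns are placeholders for illustration, not files distributed with the course:

```python
import pandas as pd

# Loading: pandas reads all of the formats above directly.
gdp = pd.read_csv("gdp.csv")                      # placeholder file names
codebook = pd.read_excel("codebook.xlsx")
survey = pd.read_stata("survey.dta")
climate = pd.read_parquet("climate.parquet")

# Merging: a left join keeps every gdp row, matched on country-year.
merged = gdp.merge(climate, on=["country", "year"], how="left")

# Reshaping: wide -> long, one row per country-year-indicator.
tidy = merged.melt(id_vars=["country", "year"],
                   var_name="indicator", value_name="value")
```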

2b: APIs

Fetch data programmatically from online databases (a request sketch follows the list):

  • Understanding what APIs are and how they work
  • Making HTTP requests to retrieve data
  • Handling JSON responses and authentication
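To make this concrete, here is a sketch against the World Bank API, which needs no key (FRED and Census require one, typically passed as a query parameter). The indicator code and country are examples; the two-part [metadata, data] response shape is the World Bank v2 convention:

```python
import requests

# GDP (current US$) for Kenya from the World Bank v2 API.
url = "https://api.worldbank.org/v2/country/KEN/indicator/NY.GDP.MKTP.CD"
resp = requests.get(url, params={"format": "json", "per_page": 100})
resp.raise_for_status()                  # stop on HTTP errors (4xx/5xx)

meta, records = resp.json()              # v2 returns [metadata, rows]
for row in records[:5]:
    print(row["date"], row["value"])     # year and GDP value
```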

2c: Web Scraping

Extract data from web pages when no other option exists (a parsing sketch follows the list):

  • Legal and ethical considerations (Terms of Service, robots.txt, copyright)
  • HTML basics and CSS selectors
  • BeautifulSoup (Python) and rvest (R) for parsing
  • Handling pagination and dynamic content
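A sketch of the polite-scraping workflow, assuming a hypothetical page with an HTML table (the URL, table id, and user-agent string are invented for illustration); the robots.txt check uses Python's standard-library robotparser:

```python
import requests
from bs4 import BeautifulSoup
from urllib import robotparser

BASE = "https://example.com"                  # placeholder site
AGENT = "climate-research-course"             # identify yourself honestly

# 1. Check robots.txt before fetching anything.
rp = robotparser.RobotFileParser(BASE + "/robots.txt")
rp.read()
if not rp.can_fetch(AGENT, BASE + "/stats"):
    raise SystemExit("robots.txt disallows this page")

# 2. Fetch and parse the page.
resp = requests.get(BASE + "/stats", headers={"User-Agent": AGENT})
soup = BeautifulSoup(resp.text, "html.parser")

# 3. CSS selector for the rows of an (assumed) <table id="data">.
for row in soup.select("table#data tr")[1:]:  # skip the header row
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    print(cells)
```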

Which Subpage Should I Start With?

Your situation → Recommended path

  • "I have a .csv/.xlsx/.dta file to work with" → 2a: File Import
  • "I need World Bank / FRED / Census data" → 2b: APIs
  • "The data I need is only on a website" → 2c: Web Scraping (but check for an API first!)
  • "I'm new to programming" → 2a first, then 2b, then 2c

What About Exploring Data?

Once you've acquired your data, you'll want to understand it before cleaning or analyzing. Module 3: Data Exploration teaches you to build a "First Analysis" script that inspects, summarizes, and visualizes any dataset. This diagnostic step should happen before you start cleaning (Module 4) or formal analysis (Module 5).