4  Data Cleaning

~6 hours · Missing Data, Outliers, Validation · Intermediate

Learning Objectives

  • Identify and handle missing values appropriately
  • Detect and treat outliers
  • Clean and manipulate string variables
  • Work with dates and time series data
  • Validate data quality and consistency

4.1 Handling Missing Data

Missing data is ubiquitous in real-world datasets. Understanding why data is missing is crucial for choosing the appropriate handling strategy. There are three main types of missingness:

Type   Description                                    Example
MCAR   Missing Completely At Random                   Survey responses lost due to computer glitch
MAR    Missing At Random (conditional on observed)    Income missing more often for older respondents
MNAR   Missing Not At Random                          High earners refuse to report income
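
The distinction matters in practice: under MCAR, dropping incomplete rows costs only precision, while under MAR and MNAR it can bias estimates. Below is a minimal Python sketch of how the three mechanisms could be simulated on made-up data; all variable names, sample sizes, and probabilities are illustrative, not taken from any real dataset.

# Python: simulating the three missingness mechanisms (illustrative only)
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000
sim = pd.DataFrame({
    'age': rng.integers(25, 65, n),
    'income': rng.lognormal(mean=10.5, sigma=0.5, size=n)
})

# MCAR: every value has the same 10% chance of being missing
mcar = sim['income'].mask(rng.random(n) < 0.10)

# MAR: the chance of being missing depends on an observed variable (age)
mar = sim['income'].mask(rng.random(n) < 0.05 + 0.004 * (sim['age'] - 25))

# MNAR: the chance of being missing depends on the unobserved value itself
mnar = sim['income'].mask(rng.random(n) < 0.20 * sim['income'].rank(pct=True))

# Share missing under each mechanism
print(pd.DataFrame({'MCAR': mcar, 'MAR': mar, 'MNAR': mnar}).isna().mean())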

The code below shows how to analyze missingness patterns and handle missing data appropriately in Python, Stata, and R:

# Python: Handling missing data
import pandas as pd
import numpy as np

# Check for missing values
print(df.isnull().sum())           # Count by column
print(df.isnull().mean() * 100)   # Percentage missing

# Visualize missing patterns
import missingno as msno
msno.matrix(df)
msno.heatmap(df)  # Correlation of missingness

# Drop missing values
df_complete = df.dropna()                    # Drop any row with NA
df_subset = df.dropna(subset=['income'])    # Drop if income is missing

# Fill missing values (alternative strategies; pick one)
df['income'] = df['income'].fillna(df['income'].mean())    # Mean imputation
df['income'] = df['income'].fillna(df['income'].median())  # Median imputation
df['income'] = df['income'].ffill()                        # Forward fill

# Group-wise imputation
df['income'] = df.groupby('education')['income'].transform(
    lambda x: x.fillna(x.median())
)

# Create missing indicator
df['income_missing'] = df['income'].isnull().astype(int)
* Stata: Handling missing data

* Check for missing values
misstable summarize          // Summary of missing patterns
misstable patterns           // Missing patterns
mdesc                        // Requires: ssc install mdesc

* Count missing by variable
foreach var of varlist * {
    count if missing(`var')
    display "`var': " r(N) " missing"
}

* Drop observations with missing values
drop if missing(income)      // Drop if income missing
keep if !missing(income)     // Alternative

* Replace missing with mean
summarize income
replace income = r(mean) if missing(income)

* Group-wise imputation
bysort education: egen income_med = median(income)
replace income = income_med if missing(income)

* Create missing indicator
gen income_missing = missing(income)

* Multiple imputation (advanced)
mi set mlong
mi register imputed income
mi impute regress income age education, add(20)
# R: Handling missing data
library(tidyverse)
library(naniar)    # For missing data visualization

# Check for missing values
sum(is.na(df))                    # Total missing
colSums(is.na(df))                # By column
colMeans(is.na(df)) * 100        # Percentage

# Visualize missing patterns (naniar)
vis_miss(df)
gg_miss_var(df)
gg_miss_upset(df)

# Drop missing values
df_complete <- df %>% drop_na()              # All complete cases
df_subset <- df %>% drop_na(income)        # Drop if income missing

# Replace missing values
df <- df %>%
  mutate(
    income = replace_na(income, mean(income, na.rm = TRUE))
  )

# Group-wise imputation
df <- df %>%
  group_by(education) %>%
  mutate(
    income = ifelse(is.na(income), median(income, na.rm = TRUE), income)
  ) %>%
  ungroup()

# Create missing indicator
df <- df %>%
  mutate(income_missing = as.integer(is.na(income)))
Python Output
# Count missing by column
id             0
name           3
income       147
education     23
age            0
dtype: int64

# Percentage missing
id            0.00
name          0.30
income       14.70
education     2.30
age           0.00
dtype: float64
Stata Output
. misstable summarize
                                                              Obs<.
                                                +------------------------------
               |                                | Unique
      Variable |     Obs=.     Obs>.     Obs<. | values        Min         Max
  -------------+--------------------------------+------------------------------
        income |       147         0       853 |    142          .           .
     education |        23         0       977 |     15          .           .
          name |         3         0       997 |      .          .           .

. mi impute regress income age education, add(20)

Univariate imputation                         Imputations =       20
Linear regression                                   added =       20
Imputed: m=1 through m=20                         updated =        0

                   |               Observations per m
                   |----------------------------------------------
          Variable |   Complete   Incomplete   Imputed |     Total
  -----------------+-----------------------------------+----------
            income |        853          147       147 |      1000
R Output
> sum(is.na(df))
[1] 173
> colSums(is.na(df))
       id      name    income education       age
        0         3       147        23         0
> colMeans(is.na(df)) * 100
       id      name    income education       age
      0.0       0.3      14.7       2.3       0.0
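
The Stata example above ends with multiple imputation via mi impute. Python has no direct built-in equivalent, but scikit-learn's IterativeImputer is one commonly used model-based alternative. The sketch below assumes scikit-learn is installed, that df is the DataFrame used above, and that the listed numeric columns exist; note it produces a single completed dataset rather than multiple imputations.

# Python: model-based imputation with scikit-learn (sketch, not full multiple imputation)
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (enables the estimator)
from sklearn.impute import IterativeImputer

numeric_cols = ['age', 'education', 'income']          # assumed column names
imputer = IterativeImputer(max_iter=10, random_state=0)
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])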

4.2 Detecting and Treating Outliers

Outliers can significantly affect your analysis. Before removing them, always investigate why they exist—they might be data errors, or they might be genuine extreme values that contain important information. The key question is whether an observation represents a valid data point or a measurement error.

A common approach is the z-score method, which measures how many standard deviations an observation is from the mean. Values with |z| > 3 are typically flagged as potential outliers, as they fall outside 99.7% of a normal distribution (Iglewicz & Hoaglin, 1993). For non-normal data, the interquartile range (IQR) method is more robust: values below Q1 - 1.5×IQR or above Q3 + 1.5×IQR are flagged (Tukey, 1977).

# Python: Detecting and treating outliers
import pandas as pd
import numpy as np
from scipy import stats

# Method 1: Z-score (for approximately normal data)
z_scores = np.abs(stats.zscore(df['income'], nan_policy='omit'))  # ignore NAs when standardizing
df_no_outliers = df[z_scores < 3]
print(f"Removed {len(df) - len(df_no_outliers)} outliers using z-score method")

# Method 2: IQR (more robust, works for non-normal data)
Q1 = df['income'].quantile(0.25)
Q3 = df['income'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
df_iqr_clean = df[(df['income'] >= lower_bound) & (df['income'] <= upper_bound)]

# Winsorization: Cap extreme values instead of removing
df['income_winsorized'] = df['income'].clip(lower=lower_bound, upper=upper_bound)

# Create outlier indicator for analysis
df['is_outlier'] = ((df['income'] < lower_bound) | (df['income'] > upper_bound)).astype(int)
print(f"Outliers flagged: {df['is_outlier'].sum()}")
* Stata: Detecting and treating outliers

* Method 1: Z-score (for approximately normal data)
summarize income
gen z_income = (income - r(mean)) / r(sd)
gen is_outlier_z = (abs(z_income) > 3) & !missing(z_income)   // missing sorts as +infinity, so exclude it
count if is_outlier_z == 1

* Method 2: IQR (more robust, works for non-normal data)
summarize income, detail
scalar Q1 = r(p25)
scalar Q3 = r(p75)
scalar IQR = Q3 - Q1
scalar lower = Q1 - 1.5 * IQR
scalar upper = Q3 + 1.5 * IQR
gen is_outlier_iqr = (income < scalar(lower) | income > scalar(upper)) & !missing(income)

* Winsorization: Cap extreme values instead of removing
gen income_winsorized = income
replace income_winsorized = scalar(lower) if income < scalar(lower)
replace income_winsorized = scalar(upper) if income > scalar(upper) & !missing(income)

* Drop outliers (use with caution)
* drop if is_outlier_iqr == 1  // Uncomment to actually drop

* Tabulate outlier counts
tab is_outlier_iqr
# R: Detecting and treating outliers
library(tidyverse)

# Method 1: Z-score (for approximately normal data)
df <- df %>%
  mutate(z_income = as.numeric(scale(income)))   # as.numeric() drops the matrix attributes scale() returns
df_no_outliers <- df %>%
  filter(abs(z_income) < 3)
cat("Removed", nrow(df) - nrow(df_no_outliers), "outliers using z-score method\n")

# Method 2: IQR (more robust, works for non-normal data)
Q1 <- quantile(df$income, 0.25, na.rm = TRUE)
Q3 <- quantile(df$income, 0.75, na.rm = TRUE)
IQR_val <- IQR(df$income, na.rm = TRUE)
lower_bound <- Q1 - 1.5 * IQR_val
upper_bound <- Q3 + 1.5 * IQR_val
df_iqr_clean <- df %>%
  filter(between(income, lower_bound, upper_bound))

# Winsorization: Cap extreme values instead of removing
df <- df %>%
  mutate(income_winsorized = pmin(pmax(income, lower_bound), upper_bound))

# Create outlier indicator for analysis
df <- df %>%
  mutate(is_outlier = as.integer(income < lower_bound | income > upper_bound))
cat("Outliers flagged:", sum(df$is_outlier), "\n")
Python Output
Removed 12 outliers using z-score method
Outliers flagged: 47
Stata Output
. count if is_outlier_z == 1
  12

. tab is_outlier_iqr

is_outlier_ |
        iqr |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        953       95.30       95.30
          1 |         47        4.70      100.00
------------+-----------------------------------
      Total |      1,000      100.00
R Output
Removed 12 outliers using z-score method
Outliers flagged: 47
When to Keep vs. Remove Outliers

Keep outliers when they represent genuine extreme values (e.g., Bill Gates in an income study) or when your research question specifically concerns extremes. Remove or cap outliers when they are data entry errors, measurement failures, or when robust estimates are needed. Always report how you handled outliers and test whether your conclusions change.
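
One concrete way to "test whether your conclusions change" is to compute the statistics of interest on the raw variable and again after winsorizing at conventional percentiles. Below is a minimal Python sketch, assuming the same df and income column as above; the 1st/99th percentile cut-offs are a common convention, not a rule.

# Python: sensitivity check -- compare raw vs. winsorized income (illustrative)
import pandas as pd

p01, p99 = df['income'].quantile([0.01, 0.99])        # percentile-based cut-offs
income_w = df['income'].clip(lower=p01, upper=p99)    # winsorize at the 1st/99th percentiles

comparison = pd.DataFrame({
    'raw': df['income'].describe(),
    'winsorized': income_w.describe()
})
print(comparison.round(2))   # report both so readers can see how much the treatment matters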

4.3 String Manipulation

Text data often requires cleaning—standardizing case, removing whitespace, extracting patterns, and more. Economic datasets frequently contain messy string variables: inconsistent country names, occupation codes with extra characters, or addresses that need parsing.

# Python: String manipulation
import pandas as pd

# Case standardization
df['country'] = df['country'].str.lower()
df['state_code'] = df['state_code'].str.upper()
df['name'] = df['name'].str.title()

# Remove whitespace
df['state'] = df['state'].str.strip()
df['address'] = df['address'].str.replace(r'\s+', ' ', regex=True)

# Extract patterns with regex
df['zip_code'] = df['address'].str.extract(r'(\d{5})')
df[['state_abbr', 'zip']] = df['address'].str.extract(r'([A-Z]{2})\s+(\d{5})')

# Replace values for standardization
df['country'] = df['country'].replace({
    'usa': 'united states',
    'u.s.a': 'united states',
    'uk': 'united kingdom'
})

# Split strings into multiple columns
df[['first_name', 'last_name']] = df['full_name'].str.split(' ', n=1, expand=True)

# Check for patterns
df['is_corporation'] = df['company'].str.contains(r'inc|corp', case=False, regex=True)
* Stata: String manipulation

* Case standardization
replace country = lower(country)
replace state_code = upper(state_code)
replace name = proper(name)

* Remove whitespace
replace state = strtrim(state)
replace address = stritrim(address)

* Extract patterns with regex (ustrregexm supports \s, \d, and {n} quantifiers)
gen zip_code = ustrregexs(0) if ustrregexm(address, "[0-9]{5}")

* Extract with capture groups
gen state_abbr = ustrregexs(1) if ustrregexm(address, "([A-Z]{2})\s+[0-9]{5}")

* Replace values for standardization
replace country = "united states" if inlist(country, "usa", "u.s.a", "u.s.")
replace country = "united kingdom" if country == "uk"

* Split strings into multiple variables
split full_name, parse(" ") gen(name_part)
rename name_part1 first_name
rename name_part2 last_name

* Check for patterns
gen is_corporation = regexm(lower(company), "inc|corp")

* Substring extraction
gen state_abbr2 = substr(state_full, 1, 2)
# R: String manipulation
library(tidyverse)  # includes stringr

# Case standardization
df <- df %>%
  mutate(country = str_to_lower(country))
df <- df %>%
  mutate(state_code = str_to_upper(state_code))
df <- df %>%
  mutate(name = str_to_title(name))

# Remove whitespace
df <- df %>%
  mutate(state = str_trim(state))
df <- df %>%
  mutate(address = str_squish(address))

# Extract patterns with regex
df <- df %>%
  mutate(zip_code = str_extract(address, "\\d{5}"))

# Extract with named capture groups
address_parts <- str_match(df$address, "([A-Z]{2})\\s+(\\d{5})")
df$state_abbr <- address_parts[, 2]
df$zip <- address_parts[, 3]

# Replace values for standardization
df <- df %>%
  mutate(country = case_when(
    country %in% c("usa", "u.s.a", "u.s.") ~ "united states",
    country == "uk" ~ "united kingdom",
    TRUE ~ country
  ))

# Split strings into multiple columns
df <- df %>%
  separate(full_name, into = c("first_name", "last_name"), sep = " ", extra = "merge")

# Check for patterns
df <- df %>%
  mutate(is_corporation = str_detect(company, regex("inc|corp", ignore_case = TRUE)))
Python Output
# Example DataFrame after string cleaning
          country        name first_name last_name  is_corporation
0   united states    John Doe       John       Doe            True
1  united kingdom  Jane Smith       Jane     Smith           False
2   united states  Bob Wilson        Bob    Wilson            True
Stata Output
. list country name first_name last_name is_corporation in 1/3

     +----------------------------------------------------------------+
     |        country         name   first_name   last_name   is_corp |
     |----------------------------------------------------------------|
  1. |  united states     John Doe         John         Doe         1 |
  2. | united kingdom   Jane Smith         Jane       Smith         0 |
  3. |  united states   Bob Wilson          Bob      Wilson         1 |
     +----------------------------------------------------------------+
R Output
# A tibble: 3 × 5
  country        name       first_name last_name is_corporation
  <chr>          <chr>      <chr>      <chr>     <lgl>
1 united states  John Doe   John       Doe       TRUE
2 united kingdom Jane Smith Jane       Smith     FALSE
3 united states  Bob Wilson Bob        Wilson    TRUE

4.4 Working with Dates

Date handling varies across languages but follows similar principles: parsing string dates, extracting components, and performing date arithmetic. Properly formatted dates are essential for time series analysis, event studies, and panel data methods.

# Python: Working with dates
import pandas as pd

# Parse string dates to datetime
df['date'] = pd.to_datetime(df['date_string'])

# Specify format for non-standard dates
df['date'] = pd.to_datetime(df['date_string'], format='%d/%m/%Y')

# Extract date components
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['quarter'] = df['date'].dt.quarter
df['day_of_week'] = df['date'].dt.day_name()

# Date arithmetic
df['days_since_hire'] = (pd.Timestamp.today() - df['hire_date']).dt.days
df['deadline'] = df['start_date'] + pd.Timedelta(days=30)

# Filter by date range
df_recent = df[df['date'] >= '2020-01-01']
df_2023 = df[df['date'].between('2023-01-01', '2023-12-31')]

# Create period indicators
df['year_quarter'] = df['date'].dt.to_period('Q')

# Handle missing or invalid dates
df['date'] = pd.to_datetime(df['date_string'], errors='coerce')
print(f"Invalid dates: {df['date'].isna().sum()}")
* Stata: Working with dates

* Parse string dates to Stata date format
gen date = date(date_string, "DMY")
format date %td

* Different date formats
gen date_ymd = date(date_string, "YMD")
gen date_mdy = date(date_string, "MDY")

* Extract date components
gen year = year(date)
gen month = month(date)
gen quarter = quarter(date)
gen day_of_week = dow(date)

* Date arithmetic (Stata dates are just numbers)
gen days_since_hire = today() - hire_date
gen deadline = start_date + 30

* Create period indicators
gen year_quarter = yq(year, quarter)
format year_quarter %tq
gen year_month = ym(year, month)
format year_month %tm

* Filter by date range
keep if date >= mdy(1, 1, 2020)

* Handle datetime (with time component)
gen double datetime = clock(datetime_string, "YMDhms")
format datetime %tc
# R: Working with dates
library(tidyverse)
library(lubridate)  # Part of tidyverse, explicit load for clarity

# Parse string dates to Date objects
df <- df %>%
  mutate(date = ymd(date_string))

# Different date formats
df <- df %>%
  mutate(date_euro = dmy(date_string_euro))
df <- df %>%
  mutate(date_us = mdy(date_string_us))

# Extract date components
df <- df %>%
  mutate(
    year = year(date),
    month = month(date),
    quarter = quarter(date),
    day_of_week = wday(date, label = TRUE)
  )

# Date arithmetic
df <- df %>%
  mutate(days_since_hire = as.numeric(difftime(Sys.Date(), hire_date, units = "days")))
df <- df %>%
  mutate(deadline = start_date + days(30))

# Filter by date range
df_recent <- df %>%
  filter(date >= ymd("2020-01-01"))
df_2023 <- df %>%
  filter(between(date, ymd("2023-01-01"), ymd("2023-12-31")))

# Create period indicators
df <- df %>%
  mutate(quarter_start = floor_date(date, "quarter"))

# Handle parsing failures
df <- df %>%
  mutate(date = ymd(date_string, quiet = TRUE))
cat("Invalid dates:", sum(is.na(df$date)), "\n")
Python Output
# Date columns after processing
        date  year  month  quarter day_of_week  days_since_hire    deadline
0 2024-01-15  2024      1        1      Monday              365  2024-02-14
1 2024-03-22  2024      3        1      Friday              299  2024-04-21
2 2023-12-01  2023     12        4      Friday              421  2023-12-31
Invalid dates: 3
Stata Output
. list date year month quarter day_of_week in 1/3

     +--------------------------------------------+
     |      date   year   month   quarter   dow   |
     |--------------------------------------------|
  1. | 15jan2024   2024       1         1     1   |
  2. | 22mar2024   2024       3         1     5   |
  3. | 01dec2023   2023      12         4     5   |
     +--------------------------------------------+
R Output
# A tibble: 3 × 6
  date        year month quarter day_of_week days_since_hire
  <date>     <dbl> <dbl>   <int> <ord>                 <dbl>
1 2024-01-15  2024     1       1 Mon                     365
2 2024-03-22  2024     3       1 Fri                     299
3 2023-12-01  2023    12       4 Fri                     421
Invalid dates: 3
Common Date Format Codes

Python/R: %Y = 4-digit year, %m = month (01-12), %d = day (01-31), %H = hour, %M = minute, %S = second
Stata: Y = year, M = month, D = day, h = hour, m = minute, s = second
Example: "2024-01-15" = %Y-%m-%d (Python/R) or "YMD" (Stata)

4.5 Data Validation

Before analysis, always validate your data for consistency and logical errors. Validation catches problems that summary statistics might miss: impossible values, logical inconsistencies, and data entry errors. Build validation checks into your workflow and run them after every data transformation.

# Python: Data validation
import pandas as pd

# Check value ranges (impossible values)
assert (df['age'] >= 0).all() & (df['age'] <= 120).all(), "Invalid ages found"
assert (df['income'] >= 0).all(), "Negative income values found"

# Find and report violations (without stopping)
invalid_age = (df['age'] < 0) | (df['age'] > 120)
print(f"Rows with invalid age: {invalid_age.sum()}")
print(df[invalid_age])  # Show the problematic rows

# Check categorical values (unexpected categories)
valid_genders = ['M', 'F', 'Other']
invalid_gender = ~df['gender'].isin(valid_genders)
print(f"Invalid gender values: {df.loc[invalid_gender, 'gender'].unique()}")

# Check logical consistency (cross-variable rules)
inconsistent = df['retirement_year'] < df['birth_year']
print(f"Retired before birth: {inconsistent.sum()} rows")

# Check for duplicates
duplicates = df.duplicated(subset=['person_id'], keep=False)
print(f"Duplicate IDs: {duplicates.sum()}")

# Validation function for reuse
def validate_data(df):
    issues = []
    if (df['age'] < 0).any():
        issues.append(f"Negative ages: {(df['age'] < 0).sum()}")
    if (df['income'] < 0).any():
        issues.append(f"Negative incomes: {(df['income'] < 0).sum()}")
    if df.duplicated(subset=['person_id']).any():
        issues.append(f"Duplicate IDs found")
    if issues:
        print("Validation failed:\n" + "\n".join(issues))
        return False
    print("All validation checks passed")
    return True

validate_data(df)
* Stata: Data validation

* Check value ranges (impossible values)
assert age >= 0 & age <= 120
assert income >= 0

* Find and report violations (without stopping)
count if age < 0 | age > 120
list if age < 0 | age > 120

* Check categorical values
count if !inlist(gender, "M", "F", "Other")
tab gender if !inlist(gender, "M", "F", "Other")  // Show invalid values

* Check logical consistency
count if retirement_year < birth_year
list person_id birth_year retirement_year if retirement_year < birth_year

* Check for duplicates
duplicates report person_id
duplicates list person_id

* Create validation summary
capture program drop validate_data
program define validate_data
    display "Running validation checks..."
    quietly count if age < 0 | age > 120
    if r(N) > 0 display "WARNING: " r(N) " invalid age values"
    quietly count if income < 0
    if r(N) > 0 display "WARNING: " r(N) " negative incomes"
    quietly duplicates report person_id
    if r(unique_value) < r(N) display "WARNING: duplicate person_id values"
    display "Validation complete"
end

validate_data
# R: Data validation
library(tidyverse)

# Check value ranges (impossible values)
stopifnot(all(df$age >= 0 & df$age <= 120, na.rm = TRUE))
stopifnot(all(df$income >= 0, na.rm = TRUE))

# Find and report violations (without stopping)
invalid_age <- df %>%
  filter(age < 0 | age > 120)
cat("Rows with invalid age:", nrow(invalid_age), "\n")
print(invalid_age)

# Check categorical values
valid_genders <- c("M", "F", "Other")
invalid_gender <- df %>%
  filter(!gender %in% valid_genders)
cat("Invalid gender values:", unique(invalid_gender$gender), "\n")

# Check logical consistency
inconsistent <- df %>%
  filter(retirement_year < birth_year)
cat("Retired before birth:", nrow(inconsistent), "rows\n")

# Check for duplicates
duplicates <- df %>%
  group_by(person_id) %>%
  filter(n() > 1) %>%
  ungroup()
cat("Duplicate IDs:", nrow(duplicates), "\n")

# Validation function for reuse
validate_data <- function(df) {
  issues <- c()
  if (any(df$age < 0, na.rm = TRUE)) {
    issues <- c(issues, paste("Negative ages:", sum(df$age < 0, na.rm = TRUE)))
  }
  if (any(df$income < 0, na.rm = TRUE)) {
    issues <- c(issues, paste("Negative incomes:", sum(df$income < 0, na.rm = TRUE)))
  }
  if (any(duplicated(df$person_id))) {
    issues <- c(issues, "Duplicate IDs found")
  }
  if (length(issues) > 0) {
    cat("Validation failed:\n", paste(issues, collapse = "\n"), "\n")
    return(FALSE)
  }
  cat("All validation checks passed\n")
  return(TRUE)
}

validate_data(df)
Python Output
Invalid ages (outside 0-120): 2
    person_id  age
47      A0048 -999
203     B0204  150
Negative incomes: 0
Invalid gender values: 3
  gender  count
  m           1
  F.          1
  male        1
Retired before birth: 1 rows
Duplicate person_ids: 4
Validation failed:
Invalid ages found
Duplicate IDs found
Stata Output
. count if age < 0 | age > 120
  2

. count if !inlist(gender, "M", "F", "Other")
  3

. duplicates report person_id

Duplicates in terms of person_id

--------------------------------------
   Copies | Observations       Surplus
----------+---------------------------
        1 |          996             0
        2 |            4             2
--------------------------------------

. validate_data
Running validation checks...
WARNING: 2 invalid age values
WARNING: duplicate person_id values
Validation complete
R Output
Invalid ages (outside 0-120): 2
# A tibble: 2 × 2
  person_id   age
  <chr>     <dbl>
1 A0048      -999
2 B0204       150
Invalid gender values: 3
Retired before birth: 1 rows
Duplicate IDs: 4
Validation failed:
 Invalid ages found
 Duplicate IDs found
Validation Best Practices

1. Validate early and often — run checks after loading data and after each transformation.
2. Document expected ranges — write down what values are valid for each variable.
3. Don't silently fix — log all corrections for reproducibility (a minimal logging pattern is sketched after this list).
4. Check relationships — some errors only appear when comparing multiple variables.
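
One way to follow point 3 is to keep a small correction log while cleaning and save it next to the cleaned data. Below is a minimal Python sketch of that pattern, assuming the same df as earlier; the log_fix helper and the output file name are hypothetical, not part of any library.

# Python: a minimal correction log (hypothetical pattern, not a library API)
import numpy as np
import pandas as pd

correction_log = []   # one entry per fix applied during cleaning

def log_fix(variable, rule, n_affected):
    """Record what was changed, why, and how many rows it touched."""
    correction_log.append({'variable': variable, 'rule': rule, 'n_affected': n_affected})

# Example: recode impossible ages to missing and log the correction
bad_age = (df['age'] < 0) | (df['age'] > 120)
df.loc[bad_age, 'age'] = np.nan
log_fix('age', 'outside 0-120 set to missing', int(bad_age.sum()))

# Save alongside the cleaned data so every correction is documented
pd.DataFrame(correction_log).to_csv('correction_log.csv', index=False)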

4.6 Case Study: CPS Data Cleaning

The Current Population Survey (CPS) is a key data source for labor economics research. Let's walk through cleaning a typical CPS extract, applying all the techniques from this module.

About the CPS

The CPS is a monthly survey of about 60,000 households conducted by the Bureau of Labor Statistics. It provides employment, earnings, and demographic data. Access cleaned extracts from IPUMS CPS.

Data source: Sarah Flood et al. IPUMS CPS: Version 11.0 [dataset]. Minneapolis, MN: IPUMS, 2023.

The code below demonstrates a realistic cleaning workflow: handling special missing value codes, constructing derived variables like hourly wages, and validating the cleaned data.

# Python: CPS Data Cleaning Pipeline
import pandas as pd
import numpy as np

# Load CPS data (assume downloaded from IPUMS)
cps = pd.read_csv('cps_extract.csv')

# Step 1: Handle CPS missing value codes
# IPUMS uses codes like 999999 or 9999999 for NIU/missing
missing_codes = [9999999, 9999998, 999999, 999998]
cps['incwage'] = cps['incwage'].replace(missing_codes, np.nan)
cps['uhrswork'] = cps['uhrswork'].replace([999, 997], np.nan)

# Step 2: Restrict sample (prime-age workers)
cps = cps[
    (cps['age'] >= 25) &
    (cps['age'] <= 54) &
    (cps['incwage'] > 0) &
    (cps['uhrswork'] > 0)
].copy()

# Step 3: Construct hourly wage
cps['hourly_wage'] = cps['incwage'] / (cps['uhrswork'] * 52)

# Step 4: Handle wage outliers
cps = cps[(cps['hourly_wage'] >= 3) & (cps['hourly_wage'] <= 500)]

# Step 5: Create demographic indicators
cps['female'] = (cps['sex'] == 2).astype(int)
cps['educ_cat'] = pd.cut(
    cps['educ'],
    bins=[0, 72, 73, 110, 999],
    labels=['Less than HS', 'High School', 'Some College', 'College+']
)

# Step 6: Create log wage (common in labor economics)
cps['ln_wage'] = np.log(cps['hourly_wage'])

# Step 7: Final validation
print(f"Final sample size: {len(cps):,}")
print(f"Missing wages: {cps['hourly_wage'].isna().sum()}")
print(cps[['hourly_wage', 'ln_wage', 'age']].describe())
* Stata: CPS Data Cleaning Pipeline

* Load CPS data (assume downloaded from IPUMS)
import delimited "cps_extract.csv", clear

* Step 1: Handle CPS missing value codes
* IPUMS uses codes like 999999 or 9999999 for NIU/missing
replace incwage = . if inlist(incwage, 9999999, 9999998, 999999, 999998)
replace uhrswork = . if inlist(uhrswork, 999, 997)

* Step 2: Restrict sample (prime-age workers)
keep if age >= 25 & age <= 54
keep if incwage > 0 & !missing(incwage)
keep if uhrswork > 0 & !missing(uhrswork)

* Step 3: Construct hourly wage
gen hourly_wage = incwage / (uhrswork * 52)
label variable hourly_wage "Hourly wage (annual earnings / annual hours)"

* Step 4: Handle wage outliers
drop if hourly_wage < 3 | hourly_wage > 500

* Step 5: Create demographic indicators
gen female = (sex == 2)
label variable female "Female indicator"
label define female_lbl 0 "Male" 1 "Female"
label values female female_lbl

gen educ_cat = .
replace educ_cat = 1 if educ < 73
replace educ_cat = 2 if educ == 73
replace educ_cat = 3 if educ >= 80 & educ < 111
replace educ_cat = 4 if educ >= 111
label define educ_lbl 1 "Less than HS" 2 "High School" 3 "Some College" 4 "College+"
label values educ_cat educ_lbl

* Step 6: Create log wage (common in labor economics)
gen ln_wage = ln(hourly_wage)
label variable ln_wage "Log hourly wage"

* Step 7: Final validation
count
summarize hourly_wage ln_wage age
misstable summarize hourly_wage ln_wage
# R: CPS Data Cleaning Pipeline
library(tidyverse)

# Load CPS data (assume downloaded from IPUMS)
cps <- read_csv("cps_extract.csv")

# Step 1: Handle CPS missing value codes
# IPUMS uses codes like 999999 or 9999999 for NIU/missing
cps <- cps %>%
  mutate(
    incwage = if_else(incwage %in% c(9999999, 9999998, 999999, 999998), NA_real_, incwage),
    uhrswork = if_else(uhrswork %in% c(999, 997), NA_real_, uhrswork)
  )

# Step 2: Restrict sample (prime-age workers)
cps <- cps %>%
  filter(
    age >= 25 & age <= 54,
    incwage > 0,
    uhrswork > 0
  )

# Step 3: Construct hourly wage
cps <- cps %>%
  mutate(hourly_wage = incwage / (uhrswork * 52))

# Step 4: Handle wage outliers
cps <- cps %>%
  filter(hourly_wage >= 3 & hourly_wage <= 500)

# Step 5: Create demographic indicators
cps <- cps %>%
  mutate(female = as.integer(sex == 2))

cps <- cps %>%
  mutate(
    educ_cat = case_when(
      educ < 73 ~ "Less than HS",
      educ == 73 ~ "High School",
      educ >= 80 & educ < 111 ~ "Some College",
      educ >= 111 ~ "College+"
    ) %>% factor(levels = c("Less than HS", "High School", "Some College", "College+"))
  )

# Step 6: Create log wage (common in labor economics)
cps <- cps %>%
  mutate(ln_wage = log(hourly_wage))

# Step 7: Final validation
cat("Final sample size:", nrow(cps), "\n")
cat("Missing wages:", sum(is.na(cps$hourly_wage)), "\n")
cps %>%
  select(hourly_wage, ln_wage, age) %>%
  summary()
Python Output
Final sample size: 45,823
Missing wages: 0
       hourly_wage    ln_wage        age
count     45823.00   45823.00   45823.00
mean         28.45       3.14      39.72
std          22.67       0.68       8.41
min           3.00       1.10      25.00
25%          14.42       2.67      32.00
50%          22.12       3.10      40.00
75%          35.58       3.57      48.00
max         480.77       6.18      54.00
Stata Output
. count
  45,823

. summarize hourly_wage ln_wage age

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
 hourly_wage |     45,823    28.45312    22.66934          3     480.77
     ln_wage |     45,823    3.140215    .6782031   1.098612   6.175867
         age |     45,823    39.72148    8.413561         25         54

. misstable summarize hourly_wage ln_wage
                                                              Obs<.
                                                +------------------------------
               |                                | Unique
      Variable |     Obs=.     Obs>.     Obs<. | values        Min         Max
  -------------+--------------------------------+------------------------------
               |  (no missing values found)
R Output
Final sample size: 45823
Missing wages: 0
  hourly_wage        ln_wage           age
 Min.   :  3.00   Min.   :1.099   Min.   :25.00
 1st Qu.: 14.42   1st Qu.:2.669   1st Qu.:32.00
 Median : 22.12   Median :3.096   Median :40.00
 Mean   : 28.45   Mean   :3.140   Mean   :39.72
 3rd Qu.: 35.58   3rd Qu.:3.572   3rd Qu.:48.00
 Max.   :480.77   Max.   :6.176   Max.   :54.00

4.7 Exercises

Exercise 4.1: Handling Missing Data

Practice identifying and handling missing values. Complete these tasks:

  1. Check for missing values in a dataset
  2. Drop or fill missing values
  3. Create a missing indicator variable

Exercise 4.2: Detecting Outliers

Practice identifying outliers using the z-score method. Complete these tasks:

  1. Calculate z-scores for a numeric variable
  2. Flag observations where |z| > 3 as outliers
  3. Remove or winsorize the outliers

Exercise 4.3: Cleaning String Data

Practice standardizing messy text variables. Complete these tasks:

  1. Standardize case (uppercase or lowercase)
  2. Remove extra whitespace
  3. Extract a pattern using regex (e.g., zip code)

Exercise 4.4: Working with Dates

Practice parsing and manipulating date variables. Complete these tasks:

  1. Parse a date string into a date object
  2. Extract year, month, or quarter from a date
  3. Calculate time difference between dates

Exercise 4.5: Data Validation

Practice validating data for consistency and errors. Complete these tasks:

  1. Check for valid value ranges (e.g., age between 0-120)
  2. Identify duplicate records
  3. Check logical consistency between variables