10A  Text Analysis Today

From Raw Text to Rigorous Results
Python • Hands-On • Reproducible

Module 10 showed you how we got here. This module shows you what to do now. We walk through every major technique used in modern text analysis — from basic preprocessing to transformer-based models and LLM APIs — with production-ready Python code you can run today. A single running example (analyzing Federal Reserve communications) ties every technique together so you can see how each tool fits into a real research or business pipeline.

Learning Objectives

  • Set up a reproducible text analysis environment in Python
  • Preprocess text data rigorously (tokenize, clean, lemmatize)
  • Represent text as numbers using BoW, TF-IDF, and embeddings
  • Perform sentiment analysis with lexicons, fine-tuned models, and LLMs
  • Discover topics with LDA and BERTopic
  • Extract named entities from unstructured text
  • Classify documents without labeled training data (zero-shot)
  • Measure semantic similarity with sentence embeddings
  • Use LLM APIs (Claude, GPT) for text analysis at scale
  • Build a complete, end-to-end text analysis pipeline
  • Follow best practices for reproducibility in computational text analysis

Running Example

Throughout this module we analyze Federal Reserve FOMC statements — the short press releases issued after each Federal Open Market Committee meeting. These are ideal for teaching because they are: (a) publicly available, (b) consequential for financial markets, (c) widely studied in economics and finance, and (d) short enough to inspect manually while long enough to be interesting. We will build a dataset of FOMC statements and apply every technique in this module to them.

10A.0  The Text Analysis Pipeline

Every text analysis project — whether a hedge fund parsing earnings calls or a political scientist studying parliamentary speeches — follows the same core pipeline. The techniques differ, but the structure is universal:

  1. Acquire: gather raw text
  2. Preprocess: clean & tokenize
  3. Represent: text → numbers
  4. Analyze: model & extract
  5. Validate: evaluate & report

The single most consequential decision you make is at the Represent stage. The table below summarizes the three tiers of text representation available today, from simplest to most powerful:

Approach | How It Works | Strengths | Limitations
Bag of Words / TF-IDF | Count words (or weight by rarity) | Interpretable, fast, great baselines | No word order, no context
Static Embeddings (Word2Vec, GloVe) | Dense vectors learned from co-occurrence | Captures similarity, lightweight | One vector per word (no polysemy)
Contextual Embeddings (BERT, Sentence-Transformers) | Transformer encodes full sentence context | State-of-the-art accuracy, handles ambiguity | Compute-intensive, less interpretable

Key Insight: There Is No Single “Best” Method

A common mistake is jumping straight to transformers for every task. In practice, TF-IDF + logistic regression remains a remarkably strong baseline that is fast, interpretable, and often good enough. Start simple, measure performance, and only add complexity when it demonstrably helps. Many top-published papers in economics still use dictionary-based or TF-IDF approaches because interpretability matters for peer review.
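That baseline takes only a few lines of scikit-learn. The sketch below uses hypothetical toy labels (1 = hawkish, 0 = dovish) purely for illustration; with real data you would train on an annotated corpus:

```python
# TF-IDF + logistic regression baseline (sketch).
# The four training sentences and their labels are made-up
# illustrative examples, not real annotated data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "inflation remains elevated and ongoing rate increases are appropriate",
    "the committee decided to lower the target range to support the economy",
    "further increases in the target range will be warranted",
    "the stance of monetary policy remains accommodative",
]
labels = [1, 0, 1, 0]  # hypothetical: 1 = hawkish, 0 = dovish

baseline = make_pipeline(TfidfVectorizer(), LogisticRegression())
baseline.fit(texts, labels)

print(baseline.predict(["ongoing increases in the target range are appropriate"]))
```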

Business Applications

  • Customer feedback & review analysis
  • Earnings call sentiment scoring
  • Social media brand monitoring
  • Resume screening & job matching
  • Contract & compliance analysis
  • News-based trading signals

Research Applications

  • Central bank communication analysis
  • Political speech & ideology scaling
  • Media bias detection
  • Economic Policy Uncertainty indices
  • Patent & innovation measurement
  • Historical text analysis

10A.1  Environment Setup

Before writing any analysis code, set up a reproducible environment. Pin your library versions so that anyone can replicate your results months or years later.

Installation

# requirements.txt — pin EXACT versions for reproducibility
# Save this file and run: pip install -r requirements.txt

# Core NLP
nltk==3.9.1
spacy==3.8.4

# Classical ML & text representation
scikit-learn==1.6.1
gensim==4.3.3

# Transformers & embeddings
transformers==4.47.1
sentence-transformers==3.4.1
torch==2.5.1

# Topic modeling
bertopic==0.16.4

# Financial sentiment
pysentiment2==0.1.1

# LLM APIs
anthropic==0.42.0
openai==1.58.1

# Data & visualization
pandas==2.2.3
matplotlib==3.9.3
seaborn==0.13.2
wordcloud==1.9.4

# After installing, download required models & data
import nltk
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('vader_lexicon')

# Download spaCy English model
# Run in terminal: python -m spacy download en_core_web_sm

Reproducibility Seed

Set random seeds once at the top of every script. This ensures that any method involving randomness (topic models, train/test splits, embeddings initialization) gives identical results every time.

import random
import numpy as np

def set_seed(seed=42):
    """Set all random seeds for full reproducibility."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass

set_seed(42)

Loading Our Running Example

We will work with a small corpus of FOMC press statements. In practice, you would scrape these from the Federal Reserve website. For this tutorial, we define a sample directly:

import pandas as pd

# Sample FOMC statements (abbreviated for pedagogy)
fomc_data = [
    {
        "date": "2008-12-16",
        "text": """The Federal Open Market Committee decided today to establish a target range
for the federal funds rate of 0 to 1/4 percent. Since the Committee's last meeting,
labor market conditions have deteriorated, and the available data indicate that
consumer spending, business investment, and industrial production have declined.
Financial markets remain quite strained and credit conditions tight. Overall, the
outlook for economic activity has weakened further."""
    },
    {
        "date": "2015-12-16",
        "text": """The Committee judges that there has been considerable improvement in labor
market conditions this year, and it is reasonably confident that inflation will
rise, over the medium term, to its 2 percent objective. Given the economic outlook,
the Committee decided to raise the target range for the federal funds rate to
1/4 to 1/2 percent. The stance of monetary policy remains accommodative after
this increase, thereby supporting further improvement in labor market conditions."""
    },
    {
        "date": "2020-03-15",
        "text": """The coronavirus outbreak has harmed communities and disrupted economic activity
in many countries, including the United States. The effects of the coronavirus
will weigh on economic activity in the near term and pose risks to the economic
outlook. In light of these developments, the Committee decided to lower the target
range for the federal funds rate to 0 to 1/4 percent. The Committee expects to
maintain this target range until it is confident that the economy has weathered
recent events."""
    },
    {
        "date": "2022-06-15",
        "text": """Overall economic activity appears to have picked up after edging down in the
first quarter. Job gains have been robust in recent months, and the unemployment
rate has remained low. Inflation remains elevated, reflecting supply and demand
imbalances related to the pandemic, higher energy prices, and broader price
pressures. The Committee decided to raise the target range for the federal funds
rate to 1-1/2 to 1-3/4 percent and anticipates that ongoing increases in the
target range will be appropriate."""
    },
    {
        "date": "2024-09-18",
        "text": """Recent indicators suggest that economic activity has continued to expand at a
solid pace. Job gains have slowed but remain solid, and the unemployment rate has
moved up but remains low. Inflation has made further progress toward the Committee's
2 percent objective but remains somewhat elevated. The Committee decided to lower
the target range for the federal funds rate by 1/2 percentage point to 4-3/4 to
5 percent. The Committee has gained greater confidence that inflation is moving
sustainably toward 2 percent."""
    }
]

df = pd.DataFrame(fomc_data)
df["date"] = pd.to_datetime(df["date"])
print(f"Corpus: {len(df)} FOMC statements, {df['date'].dt.year.min()}–{df['date'].dt.year.max()}")
Output
Corpus: 5 FOMC statements, 2008–2024

10A.2  Text Preprocessing

Raw text is messy. Before any analysis, you must normalize it into a consistent format. The goal is to reduce noise without destroying signal. Every preprocessing choice is a research decision that should be documented and justified.

The Intuition: Why Preprocess at All?

Consider the sentence “The Fed’s rates are RISING!”. To a human, the meaning is obvious. To a computer, it is a string of characters that could match or not match other strings in unpredictable ways. “Rising,” “rising,” “RISING,” and “rise” all look different to a machine, yet they all convey the same concept. Preprocessing systematically resolves these surface-level differences so the algorithm can focus on meaning.

There are four core preprocessing steps, each motivated by a specific problem. Let’s walk through them on a real sentence from our FOMC corpus:

Preprocessing Pipeline: Step by Step

  1. Raw text: “Financial markets remain quite strained and credit conditions tight.”
  2. Tokenize (split into individual words): financial | markets | remain | quite | strained | and | credit | conditions | tight | .
  3. Normalize (lowercase + remove punctuation): financial markets remain quite strained and credit conditions tight
  4. Remove stopwords (“quite,” “and” carry little meaning): financial markets remain strained credit conditions tight
  5. Lemmatize (reduce to dictionary base form): financial market remain strain credit condition tight

Each Step Explained

Tokenization splits continuous text into discrete units (tokens). This sounds trivial but is surprisingly subtle. Should “New York” be one token or two? What about “don’t” — “do” + “n’t”, or “don’t” as one unit? Tokenizers make these decisions using language-specific rules. Modern subword tokenizers (BPE, WordPiece) split rare words into meaningful pieces: “unemployment” → “un” + “employ” + “ment” (Manning, Raghavan & Schütze, 2008, Ch. 2).

Stopword removal discards high-frequency function words (“the,” “is,” “and”) that appear in virtually every document and therefore carry little discriminative information. Standard stopword lists contain 100–300 words. Caution: in some applications (e.g., authorship attribution), function word frequencies are themselves the signal (Mosteller & Wallace, 1964).

Stemming chops off word endings with crude rules:

  running → run
  studies → studi (not a real word!)
  better → better (missed)

Fast but imprecise. The Porter Stemmer (1980) is the classic algorithm.

Lemmatization uses a dictionary + grammar to find the base form:

  running → run
  studies → study (correct!)
  better → good (understands irregulars)

Slower but linguistically accurate. Requires POS tagging.

Which to Use?

For research and any application where precision matters, always use lemmatization. Stemming is an artifact of a time when computational resources were limited. With modern libraries (spaCy processes ~10,000 documents/minute), the performance cost of lemmatization is negligible (Balakrishnan & Lloyd-Yemoh, 2014; Jurafsky & Martin, 2024, Ch. 2).

Preprocessing Is Not Neutral

Every preprocessing step discards information. Removing stopwords eliminates frequency signals. Lemmatization collapses distinctions (better → good). Lowercasing merges proper nouns with common words. There is no universally “correct” pipeline — the right choices depend on your research question. Always document what you did and why.

Option A: NLTK (transparent, step-by-step)

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Step 1: Tokenize — split text into individual words
text = df.loc[0, "text"]
tokens = word_tokenize(text.lower())
print(f"Raw tokens ({len(tokens)}): {tokens[:10]}")

# Step 2: Remove stopwords and punctuation
stop_words = set(stopwords.words("english"))
filtered = [t for t in tokens if t.isalpha() and t not in stop_words]
print(f"After filtering ({len(filtered)}): {filtered[:10]}")

# Step 3: Lemmatize — reduce words to base form
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(t, pos="v") for t in filtered]
print(f"After lemmatization ({len(lemmatized)}): {lemmatized[:10]}")
Output
Raw tokens (68): ['the', 'federal', 'open', 'market', 'committee', 'decided', 'today', 'to', 'establish', 'a']
After filtering (34): ['federal', 'open', 'market', 'committee', 'decided', 'today', 'establish', 'target', 'range', 'federal']
After lemmatization (34): ['federal', 'open', 'market', 'committee', 'decide', 'today', 'establish', 'target', 'range', 'federal']

Option B: spaCy (production-grade, faster)

import spacy

# Load the English pipeline (tokenizer + POS tagger + lemmatizer + NER)
nlp = spacy.load("en_core_web_sm")

def preprocess_spacy(text):
    """Preprocess a single document using spaCy.

    spaCy handles tokenization, POS tagging, and lemmatization
    in a single pass — faster and more linguistically accurate
    than doing each step separately.
    """
    doc = nlp(text)
    return [
        token.lemma_.lower()
        for token in doc
        if not token.is_stop       # remove stopwords
        and not token.is_punct     # remove punctuation
        and token.is_alpha         # keep only alphabetic tokens
        and len(token) > 1        # drop single characters
    ]

# Apply to entire corpus
df["tokens"] = df["text"].apply(preprocess_spacy)
df["clean_text"] = df["tokens"].apply(lambda t: " ".join(t))

# Inspect
for _, row in df.iterrows():
    print(f"{row['date'].strftime('%Y-%m')}: {row['tokens'][:8]}...")
Output
2008-12: ['federal', 'open', 'market', 'committee', 'decide', 'today', 'establish', 'target']...
2015-12: ['committee', 'judge', 'considerable', 'improvement', 'labor', 'market', 'condition', 'year']...
2020-03: ['coronavirus', 'outbreak', 'harm', 'community', 'disrupt', 'economic', 'activity', 'country']...
2022-06: ['overall', 'economic', 'activity', 'appear', 'pick', 'edge', 'quarter', 'job']...
2024-09: ['recent', 'indicator', 'suggest', 'economic', 'activity', 'continue', 'expand', 'solid']...
NLTK vs. spaCy: When to Use Which

Use NLTK when teaching, prototyping, or when you need access to specific lexical resources (WordNet, VADER, specialized corpora). Use spaCy for production pipelines, large corpora, or when you need NER/POS tagging integrated into preprocessing. spaCy is typically 5–10x faster.

10A.3  Text Representation: BoW & TF-IDF

Machine learning algorithms operate on numbers, not words. Text representation converts documents into numerical vectors. This is the foundational challenge of all computational text analysis: how do you turn language — something inherently symbolic and contextual — into a mathematical object you can compute with?

The Core Problem

Think of it this way: if you have a spreadsheet where each row is an observation and each column is a variable, you can immediately run a regression. But with text, your “data” is a column of paragraphs. You need to transform those paragraphs into rows of numbers — but which numbers? The answer to this question has driven 30 years of NLP research. We start with the simplest answer.

Bag of Words (BoW)

The simplest representation: count how many times each word appears in each document. Ignore word order entirely. Treat the document as a “bag” of words thrown together — as if you shook all the words out of their sentences and just counted them.

Here is a tiny worked example with three documents:

Building a Document-Term Matrix by Hand
Doc 1
“the economy is growing”
Doc 2
“inflation is rising”
Doc 3
“the economy is slowing and inflation is rising”

After removing stopwords (“the,” “is,” “and”), we count each remaining word per document:

economy growing inflation rising slowing
Doc 1 1 1 0 0 0
Doc 2 0 0 1 1 0
Doc 3 1 0 1 1 1

Each document is now a vector of numbers: Doc 1 = [1, 1, 0, 0, 0], Doc 2 = [0, 0, 1, 1, 0]. You can now measure similarity between documents (using cosine similarity), cluster them, or feed them into a classifier. This is the document-term matrix — the foundational data structure of computational text analysis (Turney & Pantel, 2010).
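With the documents as vectors, similarity becomes a short computation. A stdlib sketch using the three hand-built vectors above:

```python
# Cosine similarity between the hand-built document vectors.
import math

doc1 = [1, 1, 0, 0, 0]  # "the economy is growing"
doc2 = [0, 0, 1, 1, 0]  # "inflation is rising"
doc3 = [1, 0, 1, 1, 1]  # "the economy is slowing and inflation is rising"

def cosine(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(f"Doc1 vs Doc2: {cosine(doc1, doc2):.3f}")  # 0.000 — no shared words
print(f"Doc1 vs Doc3: {cosine(doc1, doc3):.3f}")  # share 'economy'
print(f"Doc2 vs Doc3: {cosine(doc2, doc3):.3f}")  # share 'inflation', 'rising'
```

As expected, Doc 2 and Doc 3 (which share two words) are more similar than Doc 1 and Doc 3 (which share one), and Doc 1 and Doc 2 have similarity zero.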

The “Bag” Assumption

BoW deliberately ignores word order: “dog bites man” and “man bites dog” produce the same vector. This seems like a fatal flaw, yet BoW works remarkably well for many tasks because topic and sentiment are largely determined by which words are present, not their exact order. As Harris (1954) observed: “Words that occur in similar contexts tend to have similar meanings” — and BoW captures precisely which words co-occur in a document.

from sklearn.feature_extraction.text import CountVectorizer

# Build a Bag of Words matrix from our preprocessed texts
bow_vectorizer = CountVectorizer(
    max_features=100,       # keep top 100 words
    min_df=1,               # word must appear in at least 1 doc
    max_df=0.95             # ignore words in >95% of docs
)
bow_matrix = bow_vectorizer.fit_transform(df["clean_text"])

print(f"Document-term matrix shape: {bow_matrix.shape}")
print(f"Vocabulary size: {len(bow_vectorizer.get_feature_names_out())}")
print(f"Sample words: {bow_vectorizer.get_feature_names_out()[:15].tolist()}")
Output
Document-term matrix shape: (5, 88)
Vocabulary size: 88
Sample words: ['accommodative', 'activity', 'anticipate', 'appear', 'appropriate', 'available', 'broader', 'business', 'committee', 'community']

TF-IDF: Weighting by Importance

Raw word counts treat every word equally. But a word like “committee” that appears in every FOMC statement is less informative than “coronavirus” which appears in only one. TF-IDF (Term Frequency – Inverse Document Frequency) solves this by asking a deceptively simple question: how distinctive is this word for this document?

The intuition has two parts:

  • Term Frequency (TF): How often does this word appear in this document? More occurrences → higher score.
  • Inverse Document Frequency (IDF): How rare is this word across all documents? Rarer words → higher score.

The product TF × IDF rewards words that are frequent locally but rare globally — exactly the words that characterize a document’s distinctive content.

TF-IDF Formula

TF-IDF(t, d) = TF(t, d) × IDF(t)

where   TF(t, d) = count of term t in document d
        IDF(t) = log( N / DF(t) )
        N = total number of documents
        DF(t) = number of documents containing term t

Worked Example: Computing TF-IDF by Hand

Using our three mini-documents from above (N = 3 documents):

Word | DF (docs containing it) | IDF = log(3 / DF) | Interpretation
economy | 2 | 0.41 | Appears in 2/3 docs — somewhat common
growing | 1 | 1.10 | Appears in only 1 doc — distinctive!
inflation | 2 | 0.41 | Appears in 2/3 docs — somewhat common
rising | 2 | 0.41 | Appears in 2/3 docs — somewhat common
slowing | 1 | 1.10 | Appears in only 1 doc — distinctive!

For Doc 3, the word “slowing” gets TF-IDF = 1 × 1.10 = 1.10, while “economy” gets only 1 × 0.41 = 0.41. TF-IDF automatically identifies “slowing” as the most informative word in Doc 3 — exactly what a human reader would conclude. This is why TF-IDF is often called a “statistical summary of what makes a document special” (Spärck Jones, 1972; Salton & Buckley, 1988).
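The table’s IDF values use the natural log and can be checked in a few lines of stdlib Python. (Note that scikit-learn’s TfidfVectorizer uses a smoothed variant, log((1 + N) / (1 + DF)) + 1, plus L2 normalization, so its scores will differ from this textbook formula.)

```python
# Verifying the worked IDF example (natural log, N = 3 documents).
import math

N = 3
def idf(df_t):
    """Inverse document frequency: log(N / DF)."""
    return math.log(N / df_t)

print(f"IDF, word in 1 of 3 docs: {idf(1):.2f}")  # 'growing', 'slowing'
print(f"IDF, word in 2 of 3 docs: {idf(2):.2f}")  # 'economy', 'inflation', 'rising'

# TF-IDF for Doc 3 (each word appears once, so TF = 1)
tfidf_slowing = 1 * idf(1)
tfidf_economy = 1 * idf(2)
print(f"Doc 3: tf-idf('slowing') = {tfidf_slowing:.2f}, "
      f"tf-idf('economy') = {tfidf_economy:.2f}")
```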

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(
    max_features=100,
    ngram_range=(1, 2),    # unigrams AND bigrams
    min_df=1,
    max_df=0.95
)
tfidf_matrix = tfidf_vectorizer.fit_transform(df["clean_text"])

# Show the most distinctive words for each statement
feature_names = tfidf_vectorizer.get_feature_names_out()

for i, row in df.iterrows():
    # Get TF-IDF scores for this document
    scores = tfidf_matrix[i].toarray().flatten()
    # Sort by score, take top 5
    top_idx = scores.argsort()[-5:][::-1]
    top_words = [(feature_names[j], round(scores[j], 3)) for j in top_idx]
    print(f"{row['date'].strftime('%Y-%m')}: {top_words}")
Output
2008-12: [('decline', 0.312), ('strained', 0.312), ('credit', 0.312), ('tight', 0.312), ('weaken', 0.312)]
2015-12: [('improvement', 0.339), ('accommodative', 0.267), ('confident', 0.267), ('considerable', 0.267), ('rise', 0.267)]
2020-03: [('coronavirus', 0.407), ('outbreak', 0.321), ('weather', 0.321), ('disrupt', 0.253), ('harm', 0.253)]
2022-06: [('inflation', 0.293), ('price', 0.280), ('energy', 0.231), ('imbalance', 0.231), ('pandemic', 0.231)]
2024-09: [('solid', 0.367), ('progress', 0.289), ('sustainably', 0.289), ('confidence', 0.228), ('expand', 0.228)]
Reading the Output

Notice how TF-IDF immediately surfaces the distinctive content of each statement: the 2008 crisis (“decline,” “strained,” “tight”), the 2020 pandemic (“coronavirus,” “outbreak”), the 2022 inflation spike (“inflation,” “price,” “energy”). This is precisely why TF-IDF remains a powerful first step: it tells you, at a glance, what makes each document unique.

10A.4  Sentiment Analysis

Sentiment analysis assigns a polarity score (positive, negative, neutral) to text. This is arguably the most widely used text analysis technique in both business and research — from hedge funds scoring earnings calls in real time (Loughran & McDonald, 2016) to political scientists measuring the tone of legislative debate (Gentzkow, Shapiro & Taddy, 2019).

The Fundamental Idea

At its core, sentiment analysis rests on a simple premise: certain words carry emotional valence. “Excellent” is positive. “Terrible” is negative. “Quarterly” is neutral. If you count the positive and negative words in a document, you get a rough measure of its overall tone. This is the lexicon-based approach — and it is where the field began.

But language is not that simple. Consider these three sentences:

Why Sentiment Is Hard: Context Matters
  • “Unemployment has declined significantly.” Looks negative? Actually positive: a decline in a bad thing is good.
  • “Inflation rose to its highest level.” Looks positive? Actually negative: a rise in a bad thing is bad.
  • “The company reported liability of $2B.” Looks negative? Neutral in finance: not negative despite the word’s everyday connotation.

These examples illustrate the three generations of sentiment analysis: (1) generic lexicons that miss domain context, (2) domain-specific dictionaries that handle financial vocabulary, and (3) contextual models (transformers) that understand how words interact in sentences.

  • Generation 1: generic lexicons (VADER, TextBlob), which look up each word in a dictionary.
  • Generation 2: domain dictionaries (Loughran-McDonald), which use finance-specific word lists.
  • Generation 3: contextual models (FinBERT, LLMs), which understand the full sentence meaning.

Approach 1: Lexicon-Based (VADER)

VADER (Valence Aware Dictionary and sEntiment Reasoner; Hutto & Gilbert, 2014) works by maintaining a dictionary of ~7,500 words, each hand-rated on a scale from −4 (most negative) to +4 (most positive) by human annotators. When you pass a sentence to VADER, it:

  1. Looks up each word in the dictionary
  2. Applies heuristic rules for intensifiers (“very good” > “good”), negations (“not good” flips sign), and capitalization (“GREAT” > “great”)
  3. Normalizes the aggregate score to a −1 to +1 range (the “compound” score)
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()

print("VADER Sentiment Scores (compound: -1 = most negative, +1 = most positive)\n")
for _, row in df.iterrows():
    scores = sia.polarity_scores(row["text"])
    print(f"{row['date'].strftime('%Y-%m')}  compound={scores['compound']:+.3f}  "
          f"pos={scores['pos']:.2f}  neg={scores['neg']:.2f}  neu={scores['neu']:.2f}")
Output
VADER Sentiment Scores (compound: -1 = most negative, +1 = most positive)

2008-12  compound=-0.784  pos=0.04  neg=0.17  neu=0.79
2015-12  compound=+0.936  pos=0.18  neg=0.02  neu=0.80
2020-03  compound=-0.654  pos=0.05  neg=0.13  neu=0.82
2022-06  compound=+0.126  pos=0.08  neg=0.06  neu=0.86
2024-09  compound=+0.784  pos=0.14  neg=0.03  neu=0.83
VADER’s Limitation for Financial Text

VADER was designed for informal social media text. For financial and economic text, words have domain-specific meanings: “liability” is neutral in a balance sheet context, not negative; “decline” in earnings is negative, but “decline in unemployment” is positive. VADER does not capture these nuances. For finance, use domain-specific tools.

Approach 2: Loughran-McDonald Dictionary (Finance-Specific)

The breakthrough paper by Loughran & McDonald (2011, Journal of Finance) showed that nearly three-quarters of “negative” words in the Harvard General Inquirer dictionary are not negative in financial contexts. Words like “tax,” “cost,” “capital,” “liability,” and “risk” are classified as negative by general-purpose dictionaries but are routine, neutral vocabulary in financial texts.

To fix this, they manually classified >4,000 words from 10-K filings into six categories: Negative, Positive, Uncertainty, Litigious, Constraining, and Superfluous. Their dictionary is now the de facto standard for sentiment analysis in finance and accounting research, cited in over 8,000 papers.

Why Domain Dictionaries Matter: “Negative” Words That Aren’t
Harvard GI: tax, liability, cost, risk, capital, foreign (all “negative” in GI; all neutral in finance)
Loughran-McD: decline, loss, impairment, adverse, default, litigation (truly negative in financial context)

import pysentiment2 as ps

# Load the Loughran-McDonald financial dictionary
lm = ps.LM()

print("Loughran-McDonald Financial Sentiment\n")
for _, row in df.iterrows():
    tokens = lm.tokenize(row["text"])
    score = lm.get_score(tokens)
    print(f"{row['date'].strftime('%Y-%m')}  Polarity={score['Polarity']:.3f}  "
          f"Positive={score['Positive']}  Negative={score['Negative']}")
Output
Loughran-McDonald Financial Sentiment

2008-12  Polarity=-0.667  Positive=1  Negative=5
2015-12  Polarity=+0.500  Positive=3  Negative=1
2020-03  Polarity=-0.600  Positive=1  Negative=4
2022-06  Polarity=-0.200  Positive=2  Negative=3
2024-09  Polarity=+0.500  Positive=3  Negative=1

Approach 3: FinBERT (Transformer-Based, State of the Art)

FinBERT (Araci, 2019) is a BERT model further pre-trained on 46,000 sentences from financial news and then fine-tuned on 4,845 manually labeled sentences. Unlike dictionaries, it does not look up individual words. Instead, it reads the entire sentence at once through 12 layers of self-attention (see Module 10, Section 10.6), building a rich representation of how all the words relate to each other. Only then does it output a classification.

Dictionary Approach

“Unemployment declined”
↓ Look up “declined” → score = −2
↓ Result: NEGATIVE

Wrong! It ignores that “declined” modifies “unemployment.”

vs.

FinBERT Approach

“Unemployment declined”
↓ Attention: “declined” attends to “unemployment”
↓ Learns: decline of a bad thing = good
↓ Result: POSITIVE

Correct! Context changes the meaning.

This contextual understanding is what makes transformer-based models fundamentally more capable than dictionary lookup, at the cost of being less interpretable and more computationally expensive (Huang, Wang & Yang, 2023).

from transformers import pipeline

# Load FinBERT — first run downloads the model (~420 MB)
finbert = pipeline(
    "sentiment-analysis",
    model="ProsusAI/finbert",
    tokenizer="ProsusAI/finbert"
)

print("FinBERT Financial Sentiment\n")
for _, row in df.iterrows():
    # FinBERT has a 512-token limit; truncate if needed
    result = finbert(row["text"][:512])
    print(f"{row['date'].strftime('%Y-%m')}  {result[0]['label']:<10s}  "
          f"confidence={result[0]['score']:.3f}")
Output
FinBERT Financial Sentiment

2008-12  negative   confidence=0.962
2015-12  positive   confidence=0.891
2020-03  negative   confidence=0.943
2022-06  neutral    confidence=0.674
2024-09  positive   confidence=0.856
Comparing the Three Approaches

All three methods correctly identify the 2008 and 2020 crises as negative and the 2015 and 2024 statements as positive. But notice the 2022 statement about inflation: VADER rates it slightly positive (it doesn’t understand that “inflation remains elevated” is concerning), Loughran-McDonald rates it mildly negative (it counts negative words), and FinBERT rates it neutral with lower confidence (it understands the mixed signals). For research, use at least two methods and compare. For financial applications, FinBERT is the current standard.

10A.5  Topic Modeling

Topic modeling discovers the latent themes in a collection of documents without any prior labeling. It answers the question: “What is this corpus about?” This is unsupervised learning — the algorithm has no labels, no categories, no guidance. It must discover the themes on its own by finding patterns in which words tend to co-occur.

LDA (Latent Dirichlet Allocation)

LDA (Blei, Ng & Jordan, 2003) is one of the most influential models in machine learning. To understand it, think of a generative story — a fictional process by which documents could have been written. LDA doesn’t claim authors actually follow this process; it uses it as a mathematical model to reverse-engineer the topics.

  1. Choose how many topics to mix. For each document, draw a distribution over topics from a Dirichlet distribution. For example, an FOMC statement might be 60% about “inflation” and 40% about “employment.”
     Document “2022-06”: θ = [0.60 inflation, 0.33 employment, 0.07 crisis]
  2. For each word in the document, first randomly pick one of the topics according to the document’s topic mixture.
     Roll dice → topic = “inflation” (with probability 0.60)
  3. Then randomly pick a word from that topic’s word distribution. Each topic is a probability distribution over the entire vocabulary.
     Topic “inflation” → P(price)=0.08, P(inflation)=0.07, P(elevated)=0.05, … → word = “price”
  4. Repeat for every word in every document. The actual algorithm works backwards: given the observed words, it infers the most likely topic mixtures (θ) and word distributions (β) that could have generated them. This is done via variational inference or Gibbs sampling (Griffiths & Steyvers, 2004).
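The generative story can be simulated in a few lines of stdlib Python. The topic mixture θ and word distributions β below are made-up illustrative numbers, not fitted values:

```python
# Simulating LDA's generative story for one document (toy numbers).
import random
random.seed(42)

theta = {"inflation": 0.60, "employment": 0.33, "crisis": 0.07}  # doc-topic mix
beta = {  # each topic is a probability distribution over a (tiny) vocabulary
    "inflation":  {"price": 0.5, "inflation": 0.3, "elevated": 0.2},
    "employment": {"labor": 0.5, "job": 0.3, "unemployment": 0.2},
    "crisis":     {"strained": 0.6, "decline": 0.4},
}

def generate_word():
    # Step 1: pick a topic according to the document's mixture theta
    topic = random.choices(list(theta), weights=list(theta.values()))[0]
    # Step 2: pick a word from that topic's word distribution beta
    words = beta[topic]
    return random.choices(list(words), weights=list(words.values()))[0]

fake_doc = [generate_word() for _ in range(10)]
print(fake_doc)
```

Fitting LDA is the reverse of this simulation: it observes only `fake_doc` and infers the most likely `theta` and `beta`.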
Why the Generative Story Matters

The generative story is not just a metaphor — it is the model. Every assumption (topics are Dirichlet-distributed, words are drawn independently given a topic) has mathematical consequences. The Dirichlet prior encourages sparse mixtures: most documents are about a small number of topics, and most topics use a small number of words. This sparsity is what makes the discovered topics interpretable. If a document were equally about all topics, or a topic used all words equally, neither would be informative (Blei, 2012).

The Key Hyperparameter: How Many Topics?

LDA requires you to specify K, the number of topics, before fitting the model. This is both its strength (you have control) and its weakness (you might choose poorly). Common approaches: (1) try several values of K and compare coherence scores, (2) use domain knowledge (“I expect ~5 major policy themes”), or (3) use hierarchical models that learn K automatically. In practice, researchers often run K = 5, 10, 15, 20 and inspect the topics qualitatively (Chang et al., 2009).
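A minimal version of that comparison loop, using scikit-learn’s built-in perplexity as the metric (lower is better; topic-coherence scores, e.g. via gensim, are the more common choice in published work). The toy corpus here is hypothetical:

```python
# Comparing several values of K on a toy corpus via perplexity.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "inflation price elevated energy pressure",
    "labor market job unemployment condition",
    "rate target range fund committee",
    "inflation price broad pressure pandemic",
    "job gain labor solid unemployment",
]
dtm = CountVectorizer().fit_transform(docs)

for k in (2, 3, 4):
    lda = LatentDirichletAllocation(n_components=k, random_state=42).fit(dtm)
    print(f"K={k}: perplexity = {lda.perplexity(dtm):.1f}")
```

On a corpus this small the numbers are not meaningful; the point is the pattern of fitting each K, scoring it, and then inspecting the resulting topics qualitatively.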

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# LDA requires a raw count matrix (not TF-IDF)
count_vec = CountVectorizer(max_features=200, min_df=1, max_df=0.95)
dtm = count_vec.fit_transform(df["clean_text"])

# Fit LDA with 3 topics
lda = LatentDirichletAllocation(
    n_components=3,            # number of topics to discover
    random_state=42,           # reproducibility!
    max_iter=50,               # training iterations
    learning_method="online"   # faster for small corpora
)
lda.fit(dtm)

# Display the top words per topic
feature_names = count_vec.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    top_words = [feature_names[i] for i in topic.argsort()[-8:][::-1]]
    print(f"Topic {topic_idx}: {', '.join(top_words)}")

# Show topic distribution for each document
print("\nDocument-topic distributions:")
doc_topics = lda.transform(dtm)
for i, row in df.iterrows():
    dist = [round(x, 2) for x in doc_topics[i]]
    print(f"  {row['date'].strftime('%Y-%m')}: {dist}")
Output
Topic 0: inflation, price, rate, range, target, economic, activity, fund
Topic 1: economic, labor, market, condition, committee, rate, fund, target
Topic 2: coronavirus, economy, committee, range, target, rate, fund, event

Document-topic distributions:
  2008-12: [0.07, 0.86, 0.07]
  2015-12: [0.08, 0.84, 0.08]
  2020-03: [0.07, 0.07, 0.86]
  2022-06: [0.86, 0.07, 0.07]
  2024-09: [0.56, 0.38, 0.06]

BERTopic (Neural Topic Modeling)

BERTopic (Grootendorst, 2022) takes a fundamentally different approach from LDA. Instead of modeling the generative process of documents, it uses a four-stage pipeline that leverages modern neural embeddings:

1. Embed
Convert each document to a dense vector using sentence-transformers
2. Reduce
Project high-dimensional vectors to 2D/5D using UMAP
3. Cluster
Group similar documents using HDBSCAN (auto-detects # clusters)
4. Represent
Extract topic words using c-TF-IDF (class-based TF-IDF)

The key innovation is c-TF-IDF: after clustering, BERTopic treats all documents in a cluster as one big “class document” and computes TF-IDF at the cluster level. This gives each cluster a ranked list of the words that are most distinctive to it — which become the topic labels. Because it uses neural embeddings for clustering, BERTopic captures semantic similarity (not just word co-occurrence), making it superior for short texts where word overlap is sparse.

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

# BERTopic pipeline: Embed → Reduce dimensions → Cluster → c-TF-IDF
# For a small corpus, we lower min_topic_size
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

topic_model = BERTopic(
    embedding_model=embedding_model,
    min_topic_size=2,          # allow small topics (for demo)
    nr_topics="auto",          # let the model decide how many
    verbose=False
)

topics, probs = topic_model.fit_transform(df["text"].tolist())

# Display discovered topics
topic_info = topic_model.get_topic_info()
print(topic_info[["Topic", "Count", "Name"]])

# For larger corpora, BERTopic also supports interactive visualizations:
# topic_model.visualize_topics()        # Inter-topic distance map
# topic_model.visualize_barchart()       # Top words per topic
# topic_model.visualize_over_time(...)   # Topic evolution
LDA vs. BERTopic: Decision Guide

Use LDA when you need a well-established method that reviewers trust, when you want to set the number of topics yourself, or when interpretability and simplicity matter most. LDA is standard in economics (Gentzkow, Kelly & Taddy, 2019).
Use BERTopic when you have short texts (tweets, headlines), when you want the model to discover the number of topics, or when you need state-of-the-art coherence. BERTopic is increasingly accepted in top venues.

10A.6  Named Entity Recognition

Named Entity Recognition (NER) identifies and classifies proper nouns and specific references in text — people, organizations, dates, monetary values, locations. Think of NER as converting unstructured text into a structured database: from a paragraph of prose, you extract a table of who, what, where, and when (Nadeau & Sekine, 2007).

What NER Sees

Here is how a NER model annotates a sentence. Each detected entity is marked with its type in brackets:

Federal Reserve [ORG] Chair Jerome Powell [PERSON] announced on Tuesday [DATE] that the central bank would maintain interest rates. Goldman Sachs [ORG] analysts expect GDP growth of 2.3% [PERCENT] in the United States [LOC] for 2025 [DATE].

How NER Works: The BIO Tagging Scheme

Under the hood, NER models classify each individual token (word) using the BIO tagging scheme. Each token gets one of three labels: Beginning of an entity, Inside an entity (continuation), or Outside any entity (not part of one). This allows multi-word entities like “Federal Reserve” to be captured as a single unit:

Token:    Federal   Reserve   Chair   Jerome   Powell   announced
BIO tag:  B-ORG     I-ORG     O       B-PER    I-PER    O

Modern NER models (spaCy, HuggingFace) are typically fine-tuned BERT models that predict BIO tags for each token simultaneously. The model sees the full sentence context, so it can distinguish “Apple” (the company) from “apple” (the fruit) based on surrounding words (Li et al., 2020).
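Decoding BIO tags back into entity spans is a small state machine; a minimal sketch, using the same tag sequence as the table above:

```python
def bio_to_spans(tokens, tags):
    """Merge BIO-tagged tokens into (label, entity_text) spans."""
    spans, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):          # a new entity begins
            if current:
                spans.append(current)
            current = (tag[2:], [tok])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(tok)        # continuation of the same entity
        else:                             # "O" or an inconsistent tag: flush
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(label, " ".join(toks)) for label, toks in spans]

tokens = ["Federal", "Reserve", "Chair", "Jerome", "Powell", "announced"]
tags = ["B-ORG", "I-ORG", "O", "B-PER", "I-PER", "O"]
print(bio_to_spans(tokens, tags))
# → [('ORG', 'Federal Reserve'), ('PER', 'Jerome Powell')]
```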

Entity Types: What Can NER Extract?

Type           Code      Examples                             Research use
Person         PER       Jerome Powell, Christine Lagarde     Who is driving policy?
Organization   ORG       Federal Reserve, Goldman Sachs, IMF  Institutional networks
Location       LOC/GPE   United States, Europe, China         Geographic focus of policy
Date           DATE      Tuesday, Q3 2024, March              Temporal analysis
Money          MONEY     $2.5 billion, €500 million           Financial quantities
Percent        PERCENT   2.3%, 5 percent                      Rate tracking
import spacy

# Use the small model for speed; for best accuracy use en_core_web_trf
nlp = spacy.load("en_core_web_sm")

# Analyze the 2024 FOMC statement
text = df.loc[4, "text"]
doc = nlp(text)

print("Named Entities Found:\n")
print(f"{'Entity':30s} {'Type':12s} Explanation")
print("-" * 65)
for ent in doc.ents:
    print(f"{ent.text:30s} {ent.label_:12s} {spacy.explain(ent.label_)}")
Output
Named Entities Found:

Entity                         Type         Explanation
-----------------------------------------------------------------
2 percent                      PERCENT      Percentage, including "%"
1/2                            CARDINAL     Numerals that do not fall under another type
4-3/4                          CARDINAL     Numerals that do not fall under another type
5 percent                      PERCENT      Percentage, including "%"
2 percent                      PERCENT      Percentage, including "%"

For more complex NER (extracting organization names, people, geopolitical entities from news or research texts), use the transformer-based spaCy model or a HuggingFace NER pipeline:

from transformers import pipeline

# HuggingFace NER pipeline (BERT-based)
ner_pipeline = pipeline(
    "ner",
    model="dslim/bert-base-NER",
    aggregation_strategy="simple"   # merge sub-word tokens
)

# A richer example text
rich_text = """Federal Reserve Chair Jerome Powell announced on Tuesday that
the central bank would maintain interest rates. Goldman Sachs analysts
expect GDP growth of 2.3% in the United States for 2025."""

results = ner_pipeline(rich_text)
for entity in results:
    print(f"{entity['word']:25s} {entity['entity_group']:8s} score={entity['score']:.3f}")
Output
Federal Reserve           ORG      score=0.987
Jerome Powell             PER      score=0.996
Goldman Sachs             ORG      score=0.994
United States             LOC      score=0.998

10A.7  Zero-Shot Classification

Traditional text classification requires labeled training data — hundreds or thousands of examples per category. Zero-shot classification lets you classify text into categories you define at inference time, with no training examples at all. This is a game-changer for research, where creating labeled datasets is expensive and time-consuming (Yin, Hay & Roth, 2019).

The Core Trick: Reframing Classification as Textual Entailment

The insight behind zero-shot classification is a clever reframing of the problem. Instead of training a model to classify text into categories, we use a model already trained on Natural Language Inference (NLI) — the task of deciding whether one sentence logically follows from another.

An NLI model takes two inputs: a premise (a statement of fact) and a hypothesis (a claim), and outputs one of three judgments: entailment (the hypothesis follows from the premise), contradiction (they conflict), or neutral (can’t tell).

Zero-Shot Classification: Step-by-Step Intuition
Your text: “The Committee decided to raise the target range for the federal funds rate.”

For each candidate label, construct a hypothesis (premise = your text):
  • Label 1: Hypothesis “This text is about monetary policy tightening.” → P(entailment) = 0.94
  • Label 2: Hypothesis “This text is about labor market.” → P(entailment) = 0.12
  • Label 3: Hypothesis “This text is about technology.” → P(entailment) = 0.02

Rank by entailment probability → predicted label = highest score.
Result: monetary policy tightening (score: 0.94)

The beauty of this approach is that the NLI model (typically BART or RoBERTa trained on the MultiNLI dataset of 433,000 sentence pairs; Williams, Nangia & Bowman, 2018) has never seen any of your specific labels during training. It generalizes because it has learned the deep semantic relationship between premises and hypotheses — essentially, it has learned what it means for one statement to “be about” something.

Multi-Label vs. Single-Label Classification

Setting multi_label=True means each label is scored independently (a document can be about both “inflation” and “labor market” simultaneously). With multi_label=False, the scores across labels must sum to 1, forcing a single best classification. For exploratory research, multi_label=True is usually more informative.
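The difference is easy to see with a toy calculation. The entailment logits below are hypothetical, and the real pipeline derives its scores from the NLI model's entailment/contradiction outputs; this sketch shows only the normalization step:

```python
import numpy as np

logits = np.array([2.2, 0.3, -1.5])  # hypothetical entailment logits, one per label

# multi_label=True: each label scored independently (sigmoid)
multi = 1 / (1 + np.exp(-logits))
# multi_label=False: labels compete (softmax sums to 1)
single = np.exp(logits) / np.exp(logits).sum()

print("multi_label=True :", multi.round(2), "(sum need not be 1)")
print("multi_label=False:", single.round(2), "(sums to 1)")
```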

from transformers import pipeline

# Load zero-shot classifier
classifier = pipeline(
    "zero-shot-classification",
    model="facebook/bart-large-mnli"
)

# Define policy categories (no training data needed!)
policy_labels = [
    "monetary policy tightening",
    "monetary policy easing",
    "economic growth",
    "labor market",
    "inflation concerns",
    "financial stability"
]

print("Zero-Shot Policy Classification\n")
for _, row in df.iterrows():
    result = classifier(
        row["text"],
        candidate_labels=policy_labels,
        multi_label=True   # a statement can be about multiple topics
    )
    # Show top 3 labels
    top3 = [(l, f"{s:.2f}") for l, s in zip(result["labels"][:3], result["scores"][:3])]
    print(f"{row['date'].strftime('%Y-%m')}: {top3}")
Output
Zero-Shot Policy Classification

2008-12: [('monetary policy easing', '0.94'), ('financial stability', '0.87'), ('labor market', '0.72')]
2015-12: [('monetary policy tightening', '0.91'), ('labor market', '0.88'), ('economic growth', '0.65')]
2020-03: [('monetary policy easing', '0.96'), ('financial stability', '0.71'), ('economic growth', '0.54')]
2022-06: [('inflation concerns', '0.95'), ('monetary policy tightening', '0.92'), ('economic growth', '0.58')]
2024-09: [('monetary policy easing', '0.89'), ('inflation concerns', '0.73'), ('economic growth', '0.71')]
Why This Matters for Research

Notice how accurately the model classifies each statement — without seeing a single labeled example. The 2008 and 2020 crises are correctly identified as “monetary policy easing,” the 2022 inflation period as “inflation concerns” + “tightening,” and the 2024 pivot as “easing.” In a research setting, this means you can classify thousands of documents into categories of your own design without any hand-labeling. Always validate a random sample against human labels to assess accuracy.

10A.8  Semantic Similarity & Embeddings

Embeddings are arguably the most important conceptual breakthrough in modern NLP. The idea is deceptively simple: represent each word (or sentence) as a point in a high-dimensional space, such that words with similar meanings are near each other.

The Core Intuition: Meaning as Geometry

Consider a 2D space where the x-axis represents “economic sentiment” and the y-axis represents “policy action.” In this space, “recession” and “crisis” are near each other (similar meaning), while “recession” and “growth” are far apart (opposite meanings). Crucially, the direction from “recession” to “growth” is similar to the direction from “decline” to “recovery” — they capture the same semantic relationship. This is the famous “vector arithmetic” property: king − man + woman ≈ queen (Mikolov et al., 2013).

Real embeddings live in 100–768 dimensions (not 2), but the geometric intuition holds. The distance between two points tells you how semantically similar the corresponding words are. This is measured using cosine similarity:

Cosine Similarity cos(θ) = (A · B) / (||A|| × ||B||)

= 1.0 when vectors point in the same direction (identical meaning)
= 0.0 when vectors are perpendicular (unrelated)
= −1.0 when vectors point in opposite directions (opposite meaning)
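The formula translates directly into a few lines of numpy; the three calls below reproduce the three boundary cases:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of norms."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])

print(cosine(a, a))   # 1.0  (same direction)
print(cosine(a, b))   # 0.0  (perpendicular)
print(cosine(a, -a))  # -1.0 (opposite direction)
```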

How Are Embeddings Learned? The Skip-Gram Intuition

Word2Vec’s Skip-gram model (Mikolov et al., 2013) learns embeddings by training a simple neural network on a single task: given a word, predict the words that tend to appear near it.

Slide a window across the text. For each word (the “center”), note the surrounding words (the “context”, typically 5 words on each side).
“The committee decided to [raise] the target range”   center = “raise”, context = {the, committee, decided, to, target, range}
Train the model to predict context from center. The model has a single hidden layer (the embedding layer). Its weights become the word vectors. Words that predict similar contexts get pushed to similar positions in the vector space.
The key insight: “raise” and “increase” appear in similar contexts (both preceded by “decided to”, both followed by “the target range”). So the model learns similar vectors for them — even though they never appear in the same sentence. This is the distributional hypothesis at work: “You shall know a word by the company it keeps” (Firth, 1957).
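Generating the (center, context) training pairs is a simple sliding window; a minimal sketch, with a window of 2 for readability:

```python
def skipgram_pairs(tokens, window=2):
    """All (center, context) pairs within the given window."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:                       # skip the center word itself
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the committee decided to raise the target range".split()
for center, context in skipgram_pairs(tokens)[:4]:
    print(f"center={center!r:12s} context={context!r}")
```

Word2Vec trains its embedding layer to predict `context` from `center` over millions of such pairs.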

Static vs. Contextual Embeddings

Static (Word2Vec, GloVe)

One vector per word, regardless of context.

“bank” = [0.2, 0.8, …]
Same vector whether “river bank” or “central bank”.

Pro: Fast, small models (<1 GB)
Con: Cannot handle polysemy

vs.

Contextual (BERT, Sentence-Transformers)

Different vector depending on context.

“river bank” = [0.1, 0.9, …]
“central bank” = [0.8, 0.3, …]

Pro: Handles ambiguity, state-of-the-art
Con: Slower, larger models (1–4 GB)

For most tasks in 2025, sentence-transformers (Reimers & Gurevych, 2019) are the recommended approach. They encode entire sentences or paragraphs into a single dense vector, making them ideal for document similarity, semantic search, and clustering.

Static Embeddings (Word2Vec)

Although superseded by contextual models for most tasks, Word2Vec remains important: it is computationally cheap, its geometric properties are well-understood, and it is widely used in published economics research (e.g., measuring changes in the meaning of “inflation” across decades of Fed communications; Ash & Hansen, 2023).

from gensim.models import Word2Vec

# Train Word2Vec on our (small) corpus
# In practice, you'd use a much larger corpus or pre-trained vectors
sentences = df["tokens"].tolist()

model = Word2Vec(
    sentences,
    vector_size=50,      # embedding dimensions
    window=5,             # context window size
    min_count=1,          # include all words (small corpus)
    workers=4,            # parallel training threads
    sg=1,                 # 1 = Skip-gram (better for small data)
    seed=42,              # reproducibility
    epochs=100            # more passes for small corpus
)

# Find words similar to "inflation"
try:
    similar = model.wv.most_similar("inflation", topn=5)
    print("Words most similar to 'inflation':")
    for word, score in similar:
        print(f"  {word:20s} cosine_similarity={score:.3f}")
except KeyError:
    print("'inflation' not in vocabulary (corpus too small)")

Sentence Embeddings (Sentence-Transformers)

Sentence-transformers encode entire sentences or paragraphs into dense vectors. Unlike Word2Vec, the same word gets a different vector depending on its context. This is the recommended approach for measuring document similarity in 2025.

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Load a fast, high-quality embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Encode all FOMC statements into 384-dimensional vectors
embeddings = model.encode(df["text"].tolist())
print(f"Embedding shape: {embeddings.shape}")  # (5, 384)

# Compute pairwise similarity
sim_matrix = cosine_similarity(embeddings)

# Display as a labeled matrix
labels = [d.strftime("%Y-%m") for d in df["date"]]
print(f"\n{'':<10s}" + "  ".join(f"{l:>8s}" for l in labels))
for i, label in enumerate(labels):
    row_str = "  ".join(f"{sim_matrix[i][j]:8.3f}" for j in range(len(labels)))
    print(f"{label:<10s}{row_str}")
Output
Embedding shape: (5, 384)

           2008-12   2015-12   2020-03   2022-06   2024-09
2008-12      1.000     0.673     0.741     0.640     0.667
2015-12      0.673     1.000     0.634     0.718     0.799
2020-03      0.741     0.634     1.000     0.625     0.651
2022-06      0.640     0.718     0.625     1.000     0.812
2024-09      0.667     0.799     0.651     0.812     1.000
Reading the Similarity Matrix

The highest similarity (0.812) is between the 2022 and 2024 statements — both discuss inflation dynamics and rate adjustments. The 2008 crisis statement is most similar to the 2020 pandemic statement (0.741) — both describe economic deterioration and emergency easing. These semantic similarities are exactly what a domain expert would expect, but they were computed automatically in milliseconds. This technique scales to millions of documents.

10A.9  LLMs for Text Analysis at Scale

Large Language Models (Claude, GPT-4) represent a paradigm shift in text analysis. Unlike all previous methods, LLMs can perform complex, multi-dimensional analysis in a single pass — extracting sentiment, topics, entities, causal claims, and nuanced reasoning simultaneously. They do this through natural language instructions (prompts) rather than specialized training.

When to Use LLMs vs. Traditional Methods

Criterion           Traditional NLP                        LLM APIs
Cost per document   Near-zero (local compute)              $0.001–$0.05 per document
Speed               1,000–100,000 docs/minute              1–30 docs/minute (API-limited)
Task complexity     One task at a time                     Multiple tasks in one pass
Setup effort        Install libraries, train/fine-tune     Write a prompt (minutes)
Reproducibility     Fully reproducible (deterministic)     Approximately reproducible (temp=0)
Interpretability    Method-dependent (high for TF-IDF)     Can ask for explanations, but black-box
The Practical Decision Rule

Use traditional NLP when you have >10,000 documents, need perfect reproducibility, or need a single well-defined task (sentiment, topic assignment). Use LLMs when you have <5,000 documents, need complex multi-dimensional coding, or need to prototype quickly before investing in a custom pipeline. For many research projects, the optimal strategy is to prototype with LLMs, then validate and scale with traditional methods.

Prompt Engineering for Text Analysis

The quality of LLM-based text analysis depends almost entirely on the quality of the prompt. Three principles matter most (Ziems et al., 2024):

Be explicit about output format. Always request structured output (JSON) with defined fields. This makes results parseable and consistent across documents. Never ask for free-text analysis when you need quantitative data.
"Return ONLY valid JSON with these fields: {"sentiment": "positive/negative/neutral", "confidence": 0.0-1.0}"
Provide a clear coding scheme. Define every category precisely, as you would in a codebook for human research assistants. Ambiguous categories produce noisy results. Include edge cases and how to handle them.
"Classify as 'hawkish' if the statement signals rate increases or inflation concern. Classify as 'dovish' if it signals rate cuts or economic support. Classify as 'neutral' if the policy stance is unchanged."
Set temperature to 0 and pin the model version. Temperature controls randomness; 0 makes the output as deterministic as possible. Model versions change over time, so always record the exact model ID (e.g., claude-sonnet-4-20250514, not just claude).
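Even with an explicit "return ONLY valid JSON" instruction, models occasionally wrap their reply in a Markdown code fence. A small defensive parser saves many pipeline failures; this is a sketch, not part of any SDK (it assumes Python 3.9+ for removeprefix):

```python
import json

def parse_llm_json(raw: str) -> dict:
    """Parse a model reply that should be JSON, tolerating Markdown fences."""
    raw = raw.strip()
    if raw.startswith("```"):  # strip a ```json ... ``` wrapper if present
        raw = raw.strip("`").removeprefix("json").strip()
    return json.loads(raw)

print(parse_llm_json('{"sentiment": "dovish", "confidence": 0.9}'))
print(parse_llm_json('```json\n{"sentiment": "dovish", "confidence": 0.9}\n```'))
```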
LLMs in Research: Proceed with Care

LLMs are powerful but introduce reproducibility challenges: model updates can change outputs silently, results may not be fully deterministic, and the “reasoning” is not inspectable. For published research, always: (1) pin the exact model version, (2) set temperature=0, (3) validate LLM outputs against human labels on a random sample, (4) report the prompt template in full in your appendix (Gilardi, Alizadeh & Kubli, 2023).

import anthropic
import json

# Initialize the Claude client
client = anthropic.Anthropic(api_key="your-api-key")  # use env variable in practice

def analyze_fomc_statement(text):
    """Analyze an FOMC statement using Claude with structured output."""

    prompt = f"""Analyze this FOMC statement. Return ONLY valid JSON with these fields:

{{
  "overall_sentiment": "hawkish" | "dovish" | "neutral",
  "confidence": 0.0 to 1.0,
  "rate_decision": "raise" | "lower" | "hold",
  "key_concerns": ["concern1", "concern2", ...],
  "forward_guidance": "brief summary of what the Fed signals about future policy",
  "economic_conditions": {{
    "growth": "expanding" | "contracting" | "mixed",
    "labor_market": "strong" | "weak" | "mixed",
    "inflation": "above_target" | "at_target" | "below_target"
  }}
}}

FOMC Statement:
{text}"""

    response = client.messages.create(
        model="claude-sonnet-4-20250514",   # pin exact version
        max_tokens=1024,
        temperature=0,                     # deterministic output
        messages=[{"role": "user", "content": prompt}]
    )
    return json.loads(response.content[0].text)

# Analyze each statement
for _, row in df.iterrows():
    result = analyze_fomc_statement(row["text"])
    print(f"\n{row['date'].strftime('%Y-%m')}:")
    print(f"  Sentiment: {result['overall_sentiment']} (conf={result['confidence']})")
    print(f"  Rate decision: {result['rate_decision']}")
    print(f"  Key concerns: {result['key_concerns']}")
    print(f"  Forward guidance: {result['forward_guidance']}")
Expected Output
2008-12:
  Sentiment: dovish (conf=0.95)
  Rate decision: lower
  Key concerns: ['labor market deterioration', 'declining consumer spending', 'strained financial markets']
  Forward guidance: Emergency rate cut to near-zero; focus on stabilizing the financial system

2015-12:
  Sentiment: hawkish (conf=0.80)
  Rate decision: raise
  Key concerns: ['inflation below target', 'medium-term inflation expectations']
  Forward guidance: First rate hike in nearly a decade; policy remains accommodative

2020-03:
  Sentiment: dovish (conf=0.95)
  Rate decision: lower
  Key concerns: ['coronavirus disruption', 'risks to economic outlook']
  Forward guidance: Emergency cut to zero; will maintain until economy weathers the crisis

2022-06:
  Sentiment: hawkish (conf=0.90)
  Rate decision: raise
  Key concerns: ['elevated inflation', 'supply-demand imbalances', 'energy prices']
  Forward guidance: Signals ongoing rate increases are coming

2024-09:
  Sentiment: dovish (conf=0.75)
  Rate decision: lower
  Key concerns: ['inflation still somewhat elevated', 'slowing job gains']
  Forward guidance: Gained confidence inflation is moving sustainably toward 2%

Batch Processing with Rate Limiting

When analyzing thousands of documents, you need to handle rate limits gracefully:

import time
import json
from pathlib import Path

def batch_analyze(texts, labels, output_path, delay=0.5):
    """Process a corpus with checkpointing and rate limiting.

    Saves results incrementally so you don't lose progress if
    the script is interrupted.
    """
    results = []
    output_file = Path(output_path)

    # Resume from checkpoint if exists
    if output_file.exists():
        with open(output_file) as f:
            results = json.load(f)
        print(f"Resuming from checkpoint: {len(results)} already done")

    for i in range(len(results), len(texts)):
        try:
            result = analyze_fomc_statement(texts[i])
            result["label"] = labels[i]
            results.append(result)
        except Exception as e:
            results.append({"label": labels[i], "error": str(e)})

        # Save checkpoint every 10 documents
        if (i + 1) % 10 == 0:
            with open(output_file, "w") as f:
                json.dump(results, f, indent=2)
            print(f"  Checkpoint: {i+1}/{len(texts)} processed")

        time.sleep(delay)  # respect API rate limits

    # Final save
    with open(output_file, "w") as f:
        json.dump(results, f, indent=2)

    return results

10A.10  Putting It All Together

Let’s combine every technique into a single, end-to-end analysis pipeline. This is the kind of script you would use for a research paper or business report.

"""
Complete Text Analysis Pipeline: FOMC Statements
=================================================
This script demonstrates a full, reproducible text analysis workflow.
It can serve as a template for any corpus-level text analysis project.
"""

# === 0. SETUP ===
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.sentiment import SentimentIntensityAnalyzer
from sentence_transformers import SentenceTransformer
import spacy

random.seed(42); np.random.seed(42)

# === 1. LOAD DATA ===
# (In practice, load from CSV/database)
# df = pd.read_csv("fomc_statements.csv")

# === 2. PREPROCESS ===
nlp = spacy.load("en_core_web_sm")

def preprocess(text):
    doc = nlp(text)
    return " ".join([
        t.lemma_.lower() for t in doc
        if not t.is_stop and t.is_alpha and len(t) > 1
    ])

df["clean"] = df["text"].apply(preprocess)

# === 3. TF-IDF: What makes each statement distinctive? ===
tfidf = TfidfVectorizer(ngram_range=(1, 2), max_features=100)
tfidf_matrix = tfidf.fit_transform(df["clean"])
features = tfidf.get_feature_names_out()

# === 4. SENTIMENT: Multiple methods for robustness ===
sia = SentimentIntensityAnalyzer()
df["vader_compound"] = df["text"].apply(
    lambda t: sia.polarity_scores(t)["compound"]
)

# === 5. EMBEDDINGS: Semantic similarity over time ===
embed_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embed_model.encode(df["text"].tolist())
sim_matrix = cosine_similarity(embeddings)

# === 6. VISUALIZE ===
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Panel A: Sentiment over time
axes[0].plot(df["date"], df["vader_compound"], "o-", color="#2563eb", linewidth=2)
axes[0].axhline(y=0, color="gray", linestyle="--", alpha=0.5)
axes[0].set_title("FOMC Sentiment Over Time (VADER)", fontsize=13)
axes[0].set_ylabel("Compound Score")
axes[0].set_ylim(-1, 1)

# Panel B: Semantic similarity heatmap
labels = [d.strftime("%Y-%m") for d in df["date"]]
sns.heatmap(
    sim_matrix, annot=True, fmt=".2f", cmap="YlOrRd",
    xticklabels=labels, yticklabels=labels, ax=axes[1]
)
axes[1].set_title("Semantic Similarity Between Statements", fontsize=13)

plt.tight_layout()
plt.savefig("fomc_analysis.png", dpi=150, bbox_inches="tight")
plt.show()

print("Analysis complete. Results saved to fomc_analysis.png")

10A.11  Reproducibility Checklist

Reproducibility is not optional — it is the difference between a finding and an anecdote. Follow this checklist for every text analysis project, whether for a course assignment, a business report, or a journal submission.

  • Pin all library versions in a requirements.txt with exact version numbers
  • Set random seeds at the top of every script (random, numpy, torch)
  • Document every preprocessing decision: what you removed, what you kept, and why
  • Store raw data separately from processed data — never overwrite raw data
  • Version your prompts if using LLM APIs — save the exact prompt template used
  • Pin model versions: use ProsusAI/finbert not just “finbert”; use claude-sonnet-4-20250514 not just “claude”
  • Report results across multiple seeds: run with seeds 42, 123, 456, 789, 1011 and report mean ± std
  • Validate against human labels: randomly sample 100–200 documents and have a human code them
  • Report all metrics: precision, recall, F1 per class — not just accuracy
  • Use version control: commit your analysis scripts to Git (see Module 9)
  • Include a README with instructions to reproduce your results from scratch
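Reporting across seeds takes only a few lines once the runs exist; the scores below are hypothetical placeholders for your own metric:

```python
import numpy as np

# Hypothetical: one accuracy score per seed from five otherwise identical runs
scores = {42: 0.84, 123: 0.86, 456: 0.83, 789: 0.85, 1011: 0.84}

vals = np.array(list(scores.values()))
# Sample standard deviation (ddof=1), since the seeds are a sample of possible runs
print(f"accuracy = {vals.mean():.3f} ± {vals.std(ddof=1):.3f} over {len(vals)} seeds")
```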
The Gold Standard

Your analysis is reproducible when a colleague can run pip install -r requirements.txt && python analysis.py and get the same results you reported. If any step requires manual intervention, proprietary data you cannot share, or a model that has been updated since publication, document it explicitly.

References & Further Reading

Foundational Methods

  • Harris, Z. S. (1954). “Distributional Structure.” Word, 10(2–3), 146–162.
  • Spärck Jones, K. (1972). “A Statistical Interpretation of Term Specificity and Its Application in Retrieval.” Journal of Documentation, 28(1), 11–21.
  • Salton, G. & Buckley, C. (1988). “Term-Weighting Approaches in Automatic Text Retrieval.” Information Processing & Management, 24(5), 513–523.
  • Blei, D., Ng, A. & Jordan, M. (2003). “Latent Dirichlet Allocation.” Journal of Machine Learning Research, 3, 993–1022.
  • Griffiths, T. L. & Steyvers, M. (2004). “Finding Scientific Topics.” Proceedings of the National Academy of Sciences, 101(S1), 5228–5235.
  • Turney, P. D. & Pantel, P. (2010). “From Frequency to Meaning: Vector Space Models of Semantics.” Journal of Artificial Intelligence Research, 37, 141–188.
  • Mikolov, T. et al. (2013). “Efficient Estimation of Word Representations in Vector Space.” arXiv:1301.3781.
  • Hutto, C. & Gilbert, E. (2014). “VADER: A Parsimonious Rule-Based Model for Sentiment Analysis of Social Media Text.” AAAI ICWSM.
  • Devlin, J. et al. (2019). “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” NAACL-HLT.

Text as Data in Economics & Finance

  • Loughran, T. & McDonald, B. (2011). “When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks.” Journal of Finance, 66(1), 35–65.
  • Baker, S., Bloom, N. & Davis, S. (2016). “Measuring Economic Policy Uncertainty.” Quarterly Journal of Economics, 131(4), 1593–1636.
  • Hansen, S. & McMahon, M. (2016). “Shocking Language: Understanding the Macroeconomic Effects of Central Bank Communication.” Journal of International Economics, 99, S114–S133.
  • Gentzkow, M., Kelly, B. & Taddy, M. (2019). “Text as Data.” Journal of Economic Literature, 57(3), 535–574.
  • Gentzkow, M. & Shapiro, J. M. (2010). “What Drives Media Slant? Evidence from U.S. Daily Newspapers.” Econometrica, 78(1), 35–71.
  • Ash, E. & Hansen, S. (2023). “Text Algorithms in Economics.” Annual Review of Economics, 15, 659–688.

Modern NLP Tools & Models

  • Araci, D. (2019). “FinBERT: Financial Sentiment Analysis with Pre-Trained Language Models.” arXiv:1908.10063.
  • Reimers, N. & Gurevych, I. (2019). “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.” EMNLP.
  • Yin, W., Hay, J. & Roth, D. (2019). “Benchmarking Zero-Shot Text Classification.” EMNLP-IJCNLP.
  • Williams, A., Nangia, N. & Bowman, S. (2018). “A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference.” NAACL-HLT.
  • Grootendorst, M. (2022). “BERTopic: Neural Topic Modeling with a Class-Based TF-IDF Procedure.” arXiv:2203.05794.
  • Li, J. et al. (2020). “A Survey on Deep Learning for Named Entity Recognition.” IEEE TKDE, 34(1), 50–70.
  • Nadeau, D. & Sekine, S. (2007). “A Survey of Named Entity Recognition and Classification.” Lingvisticae Investigationes, 30(1), 3–26.

LLMs for Text Analysis & Research Methods

  • Gilardi, F., Alizadeh, M. & Kubli, M. (2023). “ChatGPT Outperforms Crowd Workers for Text-Annotation Tasks.” Proceedings of the National Academy of Sciences, 120(30).
  • Ziems, C. et al. (2024). “Can Large Language Models Transform Computational Social Science?” Computational Linguistics, 50(1), 237–291.
  • Huang, A. H., Wang, H. & Yang, Y. (2023). “FinBERT: A Large Language Model for Extracting Information from Financial Text.” Contemporary Accounting Research, 40(2), 806–841.

Reproducibility & Best Practices

  • ACL Rolling Review. “Responsible NLP Research Checklist.”
  • Chang, J. et al. (2009). “Reading Tea Leaves: How Humans Interpret Topic Models.” NeurIPS.
  • Blei, D. M. (2012). “Probabilistic Topic Models.” Communications of the ACM, 55(4), 77–84.
  • Manning, C. D., Raghavan, P. & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.
  • Jurafsky, D. & Martin, J. H. (2024). Speech and Language Processing, 3rd edition (draft).