11  Large Language Models

~8 hours · How LLMs Work · Conceptual + Practical

Learning Objectives

  • Understand how LLMs are trained and how they generate text
  • Grasp the Transformer architecture conceptually
  • Learn about RLHF and alignment
  • Use LLMs effectively through APIs and prompting
  • Understand limitations and risks

11.1 What Are LLMs?

Large Language Models (LLMs) are neural networks trained to predict the next word in a sequence. Despite this simple objective, scaling up has led to emergent abilities: reasoning, coding, translation, and more.

The Key Insight

LLMs are "just" predicting the next word. But to predict well, they must learn grammar, facts, reasoning patterns, and even social conventions--all compressed into billions of parameters.

The LLM Landscape

Model Family       Organization   Access
GPT-4, GPT-4o      OpenAI         API, ChatGPT
Claude 3, 3.5      Anthropic      API, claude.ai
Gemini             Google         API, Gemini app
LLaMA 2, 3         Meta           Open weights
Mistral, Mixtral   Mistral AI     Open weights, API

11.2 How LLMs Work

Tokenization

Text is split into tokens--subword units. "Tokenization" might become ["Token", "ization"]. This handles rare words by breaking them into common pieces.

# Example: How text becomes tokens
text = "Hello, how are you doing today?"

# Tokenized (roughly):
tokens = ["Hello", ",", " how", " are", " you", " doing", " today", "?"]

# Each token maps to an integer ID
token_ids = [15496, 11, 703, 527, 499, 3815, 3432, 30]

# Models work with token IDs, not text
# Typical: 1 token is approximately 0.75 words
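
In practice, tokenization is handled by a library. As a quick illustration (assuming the tiktoken package, which OpenAI's models use, is installed):

# Inspecting real tokenization (assumes: pip install tiktoken)
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")          # tokenizer used by GPT-4-class models

text = "Hello, how are you doing today?"
token_ids = enc.encode(text)                        # list of integer IDs
pieces = [enc.decode([tid]) for tid in token_ids]   # the text piece behind each ID

print(pieces)                                       # subword pieces the model actually sees
print(token_ids)
print(f"{len(token_ids)} tokens for {len(text.split())} words")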

The Transformer Architecture

LLMs use the Transformer architecture (Vaswani et al., 2017). The key innovation is self-attention, which lets each token "look at" all other tokens when computing its representation.

Self-Attention Intuition

Consider: "The cat sat on the mat because it was tired."

To predict what comes after "it," the model needs to know that "it" refers to "cat." Self-attention computes a weighted average of all previous tokens, with weights learned to capture such relationships.
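
A minimal sketch of that computation--a single attention head in NumPy, with none of the multi-head or masking machinery of a real model:

# Scaled dot-product self-attention, single head
# (a real decoder LLM also applies a causal mask so tokens only attend to earlier positions)
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv             # project each token to query/key/value
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # how relevant each token is to every other token
    weights = softmax(scores, axis=-1)           # each row sums to 1
    return weights @ V                           # weighted average of value vectors

# Toy example: 4 tokens, embedding dimension 8
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = [rng.normal(size=(8, 8)) for _ in range(3)]
print(self_attention(X, Wq, Wk, Wv).shape)       # (4, 8): one updated representation per token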

Next-Token Prediction

Given tokens [t1, t2, ..., tn], the model outputs a probability distribution over the vocabulary for t(n+1). During generation, it samples from this distribution and repeats.

# Simplified generation loop
prompt = "The capital of France is"
tokens = tokenize(prompt)  # [464, 3361, 315, 9822, 374]

for _ in range(max_tokens):
    # Model outputs probability for each vocabulary word
    probs = model(tokens)  # Shape: [vocab_size]

    # Sample next token (or take argmax for greedy)
    next_token = sample(probs, temperature=0.7)

    # Append and continue
    tokens.append(next_token)

    if next_token == END_TOKEN:
        break

# Detokenize back to text
output = detokenize(tokens)
# "The capital of France is Paris."

Temperature and Sampling

  • Temperature = 0: Always pick the most likely token (deterministic)
  • Temperature = 1: Sample according to model probabilities
  • Temperature > 1: More random/creative
  • Temperature < 1: More focused/deterministic
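
Concretely, temperature divides the model's raw scores (logits) before the softmax, which sharpens or flattens the resulting distribution. A small sketch:

# Temperature scaling of logits before sampling
import numpy as np

def sample_token(logits, temperature=1.0):
    if temperature == 0:
        return int(np.argmax(logits))            # greedy: always the most likely token
    scaled = np.asarray(logits) / temperature    # < 1 sharpens, > 1 flattens
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    rng = np.random.default_rng()
    return int(rng.choice(len(probs), p=probs))

logits = [2.0, 1.0, 0.5, -1.0]                   # toy scores for a 4-token vocabulary
print(sample_token(logits, temperature=0))       # deterministic
print(sample_token(logits, temperature=0.7))     # usually the top token
print(sample_token(logits, temperature=1.5))     # noticeably more variety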

11.3 Training LLMs

Pre-training

LLMs are trained on massive text corpora--often trillions of tokens from the web, books, code, and more. The objective is simple: predict the next token. This is called self-supervised learning because no human labels are needed.

Model      Parameters         Training Tokens
GPT-2      1.5B               ~40B
GPT-3      175B               ~300B
LLaMA 2    70B                2T
GPT-4      ~1.8T (rumored)    ~13T (rumored)
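
The objective itself fits in a few lines: the targets are just the input tokens shifted one position, and the loss is cross-entropy on every next-token prediction. A minimal sketch, assuming a PyTorch-style model whose forward pass returns one logit per vocabulary entry:

# Next-token prediction loss (assumes PyTorch and a model returning logits)
import torch.nn.functional as F

def next_token_loss(model, token_ids):
    # token_ids: tensor of shape [batch, seq_len]
    inputs = token_ids[:, :-1]                # the model reads tokens 1..n-1
    targets = token_ids[:, 1:]                # ... and must predict tokens 2..n
    logits = model(inputs)                    # [batch, seq_len - 1, vocab_size]
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # one prediction per position
        targets.reshape(-1),
    )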

Scaling Laws

Kaplan et al. (2020) and Hoffmann et al. (2022) showed that performance scales predictably with compute, data, and parameters. More of each leads to better models. This motivated the race to train ever-larger models.
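
As a rough sketch of what "predictably" means, Hoffmann et al. fit validation loss with a sum of power laws in parameter count N and training tokens D. The constants below are illustrative placeholders, not the published fit, so read this only for the shape of the relationship:

# Chinchilla-style loss fit: L(N, D) = E + A / N**alpha + B / D**beta
# N = parameters, D = training tokens; constants are placeholders, not the fitted values
def predicted_loss(N, D, E=1.7, A=400.0, B=400.0, alpha=0.34, beta=0.28):
    return E + A / N ** alpha + B / D ** beta

# Diminishing returns: each increase in parameters or data helps, but less and less
print(predicted_loss(N=7e9,  D=1.4e12))   # a smaller model
print(predicted_loss(N=70e9, D=1.4e12))   # 10x the parameters
print(predicted_loss(N=70e9, D=14e12))    # ... and 10x the data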

11.4 RLHF and Alignment

Pre-trained models are good at predicting text but not at being helpful or safe. RLHF (Reinforcement Learning from Human Feedback) fine-tunes models to behave as intended.

The RLHF Process

  1. Supervised Fine-Tuning (SFT): Train on high-quality examples of helpful responses
  2. Reward Model: Train a model to predict which responses humans prefer (sketched below)
  3. RL Fine-Tuning: Optimize the LLM to maximize the reward model's score
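
Step 2 is usually trained on pairwise comparisons: for the same prompt, the reward model should score the response humans preferred above the one they rejected. A minimal sketch, assuming PyTorch and a hypothetical reward_model that returns one scalar score per response:

# Pairwise reward-model loss (assumes PyTorch; reward_model is a hypothetical
# module that maps a tokenized response to a single scalar score)
import torch.nn.functional as F

def preference_loss(reward_model, chosen_ids, rejected_ids):
    r_chosen = reward_model(chosen_ids)       # score for the preferred response
    r_rejected = reward_model(rejected_ids)   # score for the rejected response
    # Encourage a positive margin: -log sigmoid(r_chosen - r_rejected)
    return -F.logsigmoid(r_chosen - r_rejected).mean()
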
Why RLHF Matters

Without RLHF, models might:

  • Be unhelpfully verbose or terse
  • Follow harmful instructions
  • Hallucinate confidently
  • Be inconsistent in style

RLHF teaches models to be helpful, harmless, and honest--the "HHH" criteria.

Constitutional AI

Anthropic's approach uses a set of principles (a "constitution") to guide AI behavior. The model critiques and revises its own outputs according to these principles, reducing reliance on human labeling.

11.5 Prompt Engineering

Prompt engineering is the art of crafting inputs that elicit the best outputs from LLMs. Small changes in wording can dramatically affect results.

Key Techniques

Technique           Description
Few-shot            Provide examples of input-output pairs
Chain-of-thought    Ask model to show reasoning steps
Role prompting      "You are an expert economist..."
Structured output   Request JSON, markdown, or specific format

# Zero-shot
prompt = "Translate to French: Hello, how are you?"

# Few-shot
prompt = """Translate to French:
English: Hello
French: Bonjour

English: Thank you
French: Merci

English: How are you?
French:"""

# Chain-of-thought
prompt = """Solve step by step:
If a train travels 120 miles in 2 hours, then continues
for 3 more hours at the same speed, what is the total distance?

Let's think step by step:"""

# Role + structured output
prompt = """You are a data analyst. Given this data,
identify the top 3 trends. Format as JSON:
{"trends": ["trend1", "trend2", "trend3"]}"""

11.6 Using LLM APIs

# Using OpenAI API
from openai import OpenAI

client = OpenAI()  # Uses OPENAI_API_KEY env var

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain regression discontinuity."}
    ],
    temperature=0.7,
    max_tokens=500
)

print(response.choices[0].message.content)

# Using Anthropic API
from anthropic import Anthropic

client = Anthropic()

message = client.messages.create(
    model="claude-3-sonnet-20240229",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Explain difference-in-differences."}
    ]
)

print(message.content[0].text)

11.7 Limitations and Risks

Known Limitations

  • Hallucinations: LLMs confidently generate false information
  • Knowledge cutoff: Training data has a date limit
  • Context limits: Can only process limited text at once
  • Math/reasoning: Struggles with complex calculations
  • No real-time info: Can't access the internet (unless given tools)

Critical for Research

Never trust LLM outputs without verification. They can generate plausible-sounding but false citations, statistics, and claims. Always verify facts from primary sources.

Ethical Considerations

  • Bias: Models reflect biases in training data
  • Privacy: May memorize and reveal private information
  • Misuse: Can be used for spam, disinformation, cheating
  • Environmental: Training requires enormous compute/energy

11.8 LLMs in Research

LLMs are increasingly used as research tools--for coding, literature review, data analysis, and writing. Use them wisely.

Appropriate Uses

  • Debugging code
  • Explaining concepts
  • Brainstorming research questions
  • Drafting and editing prose
  • Translating between programming languages

Inappropriate Uses

  • Generating citations (they hallucinate!)
  • Performing calculations you can't verify
  • Replacing critical thinking
  • Submitting AI-generated work as your own (check your institution's policy)

Further Reading