
RLHF Explained: How We Teach AI to Be Nice

ChatGPT isn't just smart; it's polite. How? Reinforcement Learning from Human Feedback (RLHF). We...

Abstract Algorithms · 5 min read

TLDR: A raw LLM is a super-smart parrot that read the entire internet, including its worst parts. RLHF (Reinforcement Learning from Human Feedback) is the training pipeline that transforms it from a pattern-matching engine into an assistant that is helpful, harmless, and honest.


📖 The Parrot Who Read Everything

Imagine a parrot that has read every book, forum post, Reddit thread, and dark corner of the web. Ask it anything and it can produce text. But it might:

  • Answer in the style of a conspiracy forum.
  • Generate offensive content because that's statistically common in its training data.
  • Give a confident-sounding wrong answer because wrong answers also appear in training data.

RLHF is the rehabilitation process that teaches this parrot which outputs humans actually prefer.


🔢 The Three Stages of RLHF

```mermaid
flowchart LR
    SFT["Stage 1: SFT\nSupervised Fine-Tuning\nHuman writes ideal answers\n→ imitation learning"]
    RM["Stage 2: Reward Model\nHumans rank A vs B\n→ train a preference predictor"]
    PPO["Stage 3: PPO\nRL with KL penalty\n→ optimize policy to maximize reward"]

    SFT --> RM --> PPO
```

Stage 1: Supervised Fine-Tuning (SFT)

Human labelers write high-quality answers to a sample of prompts. The base LLM is fine-tuned to imitate this behavior. This creates the SFT policy (π_SFT), the "before RLHF" model.
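Under the hood, SFT is ordinary next-token imitation: minimize the negative log-likelihood of the labeler-written answer. A toy sketch in plain Python, where the probabilities are stand-in floats for a real model's softmax outputs, not an actual LLM:

```python
import math

def sft_nll(probs_of_reference_tokens):
    """Average negative log-likelihood the model assigns to the tokens of a
    labeler-written reference answer. Each entry is the probability the
    model put on the correct next token (stand-in values, not a real model)."""
    return -sum(math.log(p) for p in probs_of_reference_tokens) / len(probs_of_reference_tokens)

# Imitation learning in one line: the loss falls as the model puts more
# probability mass on the reference answer's tokens.
confident = sft_nll([0.9, 0.8, 0.95])
unsure = sft_nll([0.2, 0.1, 0.3])
```

Gradient descent on this loss is what pulls the base model toward the labelers' demonstrations.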

Stage 2: Reward Model Training

Human labelers are shown pairs of model outputs (A vs B) for the same prompt and asked "Which is better?" These preference labels train a Reward Model (RM) that predicts a numeric score for any (prompt, response) pair, with no human involvement at inference time.
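The standard way to turn "A beat B" labels into a training signal is the Bradley-Terry (logistic) loss on the score difference. A minimal sketch, with the scores as stand-in floats rather than outputs of a real reward model:

```python
import math

def pairwise_rm_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry loss for one human comparison:
    -log sigmoid(r_chosen - r_rejected).
    Low when the RM already scores the preferred response higher."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# The RM agrees with the labeler -> small loss; disagrees -> large loss.
agree = pairwise_rm_loss(2.0, -1.0)
disagree = pairwise_rm_loss(-1.0, 2.0)
```

Minimizing this loss over many comparisons is what makes the RM a scalar stand-in for human preference.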

Stage 3: RL Fine-Tuning with PPO

The SFT model is used as the starting policy. PPO (Proximal Policy Optimization) generates responses, scores them via the Reward Model, and updates the policy weights to maximize reward.
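PPO is a full RL algorithm, but its core idea fits in one function: the clipped surrogate objective, which caps how much a single update can reward moving the policy. A minimal sketch with toy numbers, not a full training loop:

```python
import math

def ppo_clipped_term(logp_new, logp_old, advantage, eps=0.2):
    """PPO's clipped surrogate objective for one token/action:
    min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A),
    where ratio = pi_new(a|s) / pi_old(a|s)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped * advantage)

# If the new policy already made a well-rewarded token much more likely
# (ratio e^1 ~ 2.72), clipping caps the incentive at ratio 1.2:
capped = ppo_clipped_term(logp_new=1.0, logp_old=0.0, advantage=1.0)
unchanged = ppo_clipped_term(logp_new=0.0, logp_old=0.0, advantage=1.0)
```

The clipping is PPO's own guardrail against overshooting on any one batch; the KL penalty described next is a separate, RLHF-specific guardrail.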


โš™๏ธ The KL Divergence Constraint: Why the Model Doesn't Collapse

Left unconstrained, PPO finds the response the Reward Model scores highest, which may be nonsensical text that superficially satisfies the reward function (Goodhart's Law).

The KL divergence penalty prevents this:

$$\text{Maximize: } \mathbb{E}\left[ R(x, y) - \beta \cdot \log \frac{\pi_{RL}(y|x)}{\pi_{SFT}(y|x)} \right]$$

Plain English:

  • $R(x, y)$: reward score; higher is better.
  • $\beta \cdot \log(\pi_{RL}/\pi_{SFT})$: how much the RL policy has drifted from the SFT baseline.
  • Together: maximize reward, but penalize large deviations from the original model.

If $\beta$ is too small: the model hacks the reward function with gibberish. If $\beta$ is too large: the model barely changes from SFT.

Typical $\beta$ values: 0.01–0.1.
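Putting the objective into code makes the tradeoff concrete. A minimal sketch with made-up reward scores and log-probabilities (not a real model):

```python
def kl_penalized_reward(reward, logp_rl, logp_sft, beta=0.1):
    """The per-response quantity PPO maximizes in RLHF:
    R(x, y) - beta * log(pi_RL(y|x) / pi_SFT(y|x))."""
    return reward - beta * (logp_rl - logp_sft)

# A high-reward response that the SFT model finds wildly improbable
# (classic reward hacking) loses to a slightly lower-reward,
# on-distribution response once the KL penalty is applied:
hacked = kl_penalized_reward(reward=4.0, logp_rl=-2.0, logp_sft=-40.0)
normal = kl_penalized_reward(reward=3.0, logp_rl=-5.0, logp_sft=-6.0)
```

Raising $\beta$ shrinks the region the policy can profitably explore; lowering it hands more trust to the Reward Model.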


🧠 What Human Preference Data Looks Like

Labelers evaluate output pairs using a rubric:

| Dimension | Question |
| --- | --- |
| Helpfulness | Does the response directly address the prompt? |
| Honesty | Does it avoid false claims and express appropriate uncertainty? |
| Harmlessness | Does it avoid toxic, dangerous, or offensive content? |
| Conciseness | Is it free of unnecessary filler and repetition? |

The preference signal is a ranking (A > B), not a rating (A = 8/10, B = 6/10). Rankings are more reliable and faster to collect than absolute scores.
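Rankings are also data-efficient: a single ranking of K responses expands into K·(K−1)/2 of the pairwise comparisons the Reward Model trains on. A sketch:

```python
from itertools import combinations

def ranking_to_pairs(ranked):
    """Expand one labeler ranking (best response first) into the
    (chosen, rejected) pairs used for reward-model training."""
    return list(combinations(ranked, 2))

# Ranking A > B > C yields 3 training comparisons.
pairs = ranking_to_pairs(["A", "B", "C"])
```

So one pass by a labeler over a handful of responses produces several labeled pairs at once.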


โš–๏ธ RLHF Limitations and Alternatives

| Limitation | Why It Matters |
| --- | --- |
| Expensive human labeling | Thousands of high-quality (prompt, comparison) pairs are needed, from skilled, well-briefed labelers |
| Reward model is imperfect | It can be gamed, which drives mode collapse during PPO |
| KL constraint is a crude fix | It prevents collapse but may limit the performance ceiling |
| Labeler disagreement | Different people rank the same output differently, especially for subjective content |

Alternatives and successors:

  • DPO (Direct Preference Optimization): Skips the RM and PPO entirely and optimizes the policy on preference pairs directly. Simpler, and often competitive with RLHF. Used in Llama 3.
  • RLAIF (RL from AI Feedback): Replaces human labelers with a stronger LLM-as-judge. Used in Claude (Constitutional AI).
  • PPO-Lite: A simplified PPO variant used when compute is constrained.
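For contrast with the three-stage pipeline above, the entire DPO training signal for one preference pair fits in a few lines. The log-probabilities below are stand-in floats for a trainable policy and a frozen reference (SFT) model:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair:
    -log sigmoid(beta * (policy_margin - reference_margin)),
    where each margin is log p(chosen) - log p(rejected)."""
    margin = (logp_chosen - logp_rejected) - (ref_logp_chosen - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Loss falls as the policy widens its preference for the chosen response
# beyond what the frozen reference model already had:
improving = dpo_loss(-1.0, -5.0, -3.0, -3.0)
regressing = dpo_loss(-5.0, -1.0, -3.0, -3.0)
```

No sampling, no reward model, no RL loop: just a classification-style loss over logged preference pairs, which is why DPO is so much simpler to run.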

📌 Summary

  • RLHF = SFT → Reward Model → PPO: three stages that transform a base LLM into an aligned assistant.
  • The Reward Model is a trained preference predictor; it replaces the human labeler at scale.
  • The KL penalty prevents PPO from collapsing the output distribution into reward-hacking gibberish.
  • DPO skips the RM and PPO entirely and is increasingly preferred for its simplicity.
  • RLAIF replaces human labelers with an AI judge for scalable feedback.

๐Ÿ“ Practice Quiz

  1. Why is a KL divergence penalty added to the RLHF objective?

    • A) To reduce training compute cost.
    • B) To prevent the RL-optimized model from drifting so far from the SFT baseline that it generates reward-hacking gibberish.
    • C) To speed up convergence by reducing exploration.
      Answer: B
  2. What type of human annotation does RLHF collect for training the Reward Model?

    • A) Absolute quality scores (1-10) for each response.
    • B) Pairwise preferences: "Response A is better than Response B" for the same prompt.
    • C) Token-level corrections on model outputs.
      Answer: B
  3. What is the main advantage of DPO over RLHF?

    • A) DPO uses more human feedback and is therefore more accurate.
    • B) DPO removes the RL training loop entirely: it directly optimizes for preferences without a separate Reward Model or PPO.
    • C) DPO is faster at inference time.
      Answer: B
