AI & Generative Media

RLHF

Also known as: Reinforcement Learning from Human Feedback, Human Feedback Training

Reinforcement Learning from Human Feedback—a technique to align AI behavior with human preferences by training on human evaluations.

RLHF trains AI models to behave according to human preferences by using human feedback as a reward signal.

Process

  1. Supervised fine-tuning: Train the base model on human-written demonstrations
  2. Reward modeling: Humans rank model outputs; train a reward model on those comparisons (see the first sketch after this list)
  3. RL optimization: Fine-tune the model to maximize the reward model's score, typically with the PPO algorithm (see the second sketch)
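
Step 2 is usually trained on pairwise comparisons: for each prompt, a labeler marks one response as preferred over another. The PyTorch sketch below shows that pairwise (Bradley-Terry style) loss; the TinyRewardModel class and the random embeddings are hypothetical stand-ins for a real fine-tuned language model with a scalar head.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical stand-in for a reward model: a real one would be a fine-tuned
# language model with a scalar head, not a single linear layer over embeddings.
class TinyRewardModel(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # scalar reward head

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)  # one scalar score per example

def pairwise_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style objective: the human-preferred ("chosen") response
    # should score higher than the rejected one.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

model = TinyRewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Fake batch of 8 preference pairs (random embeddings, for illustration only).
chosen_emb, rejected_emb = torch.randn(8, 64), torch.randn(8, 64)

loss = pairwise_loss(model(chosen_emb), model(rejected_emb))
loss.backward()
optimizer.step()
```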

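In step 3, implementations typically do not maximize the raw reward score alone: a KL penalty toward the supervised fine-tuned reference model is commonly added to keep the policy from drifting too far (and to discourage the reward hacking noted under Limitations). Below is a minimal sketch of that shaped reward, assuming summed per-token log-probabilities are available from both the policy and the reference model; shaped_reward and beta are illustrative names, not a specific library's API.

```python
import torch

def shaped_reward(reward_model_score: torch.Tensor,
                  policy_logprobs: torch.Tensor,
                  reference_logprobs: torch.Tensor,
                  beta: float = 0.1) -> torch.Tensor:
    # Per-sequence KL estimate: sum over generated tokens of
    # log pi_theta(token) - log pi_ref(token).
    kl = (policy_logprobs - reference_logprobs).sum(dim=-1)
    # Quantity handed to the RL optimizer (e.g. PPO): reward model score
    # minus the KL penalty.
    return reward_model_score - beta * kl

# Example: batch of 4 responses, each 16 generated tokens long (random values).
scores = torch.randn(4)
pi_logp, ref_logp = torch.randn(4, 16), torch.randn(4, 16)
print(shaped_reward(scores, pi_logp, ref_logp))
```
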
Why It Matters

Base language models predict text—they don’t inherently know what’s helpful, harmless, or honest. RLHF bridges this gap by teaching models what humans actually want.

Results

  • ChatGPT’s helpfulness comes largely from RLHF
  • Reduces harmful outputs
  • Makes models follow instructions better
  • Improves coherence and usefulness

Limitations

  • Expensive: Requires extensive human labeling
  • Reward hacking: The model may exploit flaws in the reward model instead of genuinely improving
  • Preference capture: Whose preferences count?

Alternatives

  • DPO: Direct Preference Optimization trains directly on preference pairs with a single classification-style loss, skipping the separate reward model and RL loop (see the sketch after this list)
  • Constitutional AI: The model critiques and revises its own outputs against a set of written principles
  • RLAIF: Reinforcement Learning from AI Feedback, where an AI judge supplies the preference labels instead of human raters
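
DPO's "simpler" claim can be made concrete: it trains on the same preference pairs used for reward modeling, but with a single logistic loss and no separate reward model or RL loop. A minimal sketch, assuming summed response log-probabilities from the policy being trained and from a frozen reference model; the random tensors and the beta value are illustrative placeholders.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Implicit "reward" of each response: how much more likely the policy
    # makes it than the reference model does, scaled by beta.
    chosen_rewards = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_rewards = beta * (policy_rejected_logp - ref_rejected_logp)
    # Logistic loss on the reward gap: prefer the human-chosen response.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Example with a batch of 8 preference pairs (random log-probabilities).
loss = dpo_loss(torch.randn(8), torch.randn(8), torch.randn(8), torch.randn(8))
print(loss.item())
```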