LLM Post-Training RL Methods

AI-assisted research synthesis. Verify critical claims with primary sources.
Status: Completed · Last updated: 2026-04-15T21:48:27+05:30 · Mode: method-survey

Summary

Overview

Post-training is the stage where a pretrained or instruction-tuned model is pushed toward useful behavior after generic next-token learning. In practice, this is where teams try to make a model more helpful, more aligned, better at reasoning, more controllable, or safer in deployment.

The confusing part is that many articles introduce these methods as if they are interchangeable upgrades on one ladder. They are not. Each process changes a different part of the training loop and helps for a different reason. The real topic is therefore not just a list of acronyms. It is an explanation of what each stage does to the model and why that stage exists.

Background

A good mental model is that post-training needs three ingredients:

  1. a reasonably competent starting policy to optimize,
  2. a source of feedback that says which outputs are better, and
  3. an optimization procedure that turns that feedback into parameter updates.

Different methods differ mainly in the second and third ingredients.

The broad pipeline before method-specific details

  1. Pretraining teaches broad language competence.
  2. Supervised fine-tuning (SFT) teaches basic instruction-following or task format.
  3. Post-training methods then use preferences, critiques, or verifiable rewards to further shape behavior.

That means many of the named methods below are not replacements for pretraining or even always for SFT. They usually sit on top of an already competent base policy.

Core Analysis

Problem framing

The central problem is: how do we reliably teach the model which outputs are better when “better” is expensive to label, partly subjective, and easy to game?

Different post-training methods answer that question in different ways, and the method families below are organized around those answers.

Method families

Step 1: Supervised Fine-Tuning (SFT)

What it does: SFT teaches the model to imitate high-quality answers on curated examples. It is usually the stage where a raw base model learns basic assistant behavior: answer the question, use the requested format, follow simple instructions, and stay on topic.

How it helps: SFT creates a usable starting policy. Without it, later preference or RL stages have to optimize a model that may still answer in the wrong format, ignore instructions, or produce unstable completions. In other words, SFT does not solve nuanced alignment by itself, but it gives the later stages something worth refining.

What it does not solve well: It mostly imitates demonstrations. That means it is weaker at teaching subtle tradeoffs like “be more helpful but not too verbose” or “prefer safer answer A over plausible but risky answer B” when those tradeoffs are not exhaustively demonstrated.
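To make the mechanics concrete, here is a minimal sketch of the SFT objective: plain next-token cross-entropy on the demonstrated response, with the prompt tokens masked out of the loss. The tiny embedding-plus-linear policy is only a stand-in for a real pretrained LM, and the token IDs are random placeholders.

```python
import torch
import torch.nn.functional as F

# Toy stand-in for a causal LM; a real SFT run would use a pretrained transformer.
vocab_size, hidden = 100, 32
policy = torch.nn.Sequential(
    torch.nn.Embedding(vocab_size, hidden),
    torch.nn.Linear(hidden, vocab_size),
)

def sft_loss(tokens, prompt_len):
    """Next-token cross-entropy, counted only on response tokens (prompt is masked)."""
    logits = policy(tokens[:-1])                   # predict token t+1 from token t
    targets = tokens[1:]
    loss = F.cross_entropy(logits, targets, reduction="none")
    response_mask = torch.arange(len(targets)) >= (prompt_len - 1)
    return (loss * response_mask).sum() / response_mask.sum()

tokens = torch.randint(0, vocab_size, (20,))       # prompt + demonstrated response
loss = sft_loss(tokens, prompt_len=8)
loss.backward()                                    # gradients push toward imitation
```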

PPO-style RLHF

What it does: This is the classic RLHF pipeline. It usually has four moving parts:

  1. collect demonstrations to get an SFT policy
  2. sample multiple model answers
  3. ask humans to rank those answers
  4. train a reward model on those rankings, then optimize the policy online with PPO

The core objective is:

\(\max_\theta \; \mathbb{E}_{x,\, y \sim \pi_\theta(\cdot|x)}\big[r_\phi(x,y)\big] - \beta\,\mathrm{KL}\big(\pi_\theta(\cdot|x)\,\|\,\pi_{ref}(\cdot|x)\big)\)

Plain English: improve outputs that the reward model scores highly, while penalizing the model if it drifts too far from a trusted reference policy.

Variables: \(x\) is the prompt, \(y\) is a sampled response, \(\pi_\theta\) is the policy being trained, \(\pi_{ref}\) is the frozen reference policy (usually the SFT model), \(r_\phi\) is the learned reward model, and \(\beta\) controls the strength of the KL penalty.

What the reward model does: The reward model turns pairwise or ranked human judgments into a reusable scoring function. Instead of asking humans to compare every future answer during optimization, you train a model that predicts which output humans would prefer.
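A minimal sketch of that compression step: the usual pairwise (Bradley-Terry-style) reward-model loss, assuming we already have scalar scores for the human-preferred and human-rejected answers. The scores below are placeholder numbers.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(score_chosen, score_rejected):
    """Pairwise loss: push the chosen answer's score above the rejected answer's."""
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# In practice these scores come from a reward-model head over (prompt, answer) pairs.
score_chosen = torch.tensor([1.2, 0.3])
score_rejected = torch.tensor([0.5, 0.9])
print(reward_model_loss(score_chosen, score_rejected))
```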

How PPO helps: PPO is the online optimization step that actually changes the policy using reward-model scores. It helps by letting the current model sample fresh outputs, get scored, and then update itself carefully rather than jumping too far in one step. The KL penalty is important because it prevents the policy from drifting so far toward reward maximization that it becomes weird, repetitive, or reward-hacky.
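One common way implementations realize the KL penalty above is to fold it into the per-token reward during rollouts, with the reward-model score added at the final response token. This is a sketch of that pattern with placeholder log-probabilities, not a full PPO update.

```python
import torch

def kl_shaped_rewards(score, logp_policy, logp_ref, beta=0.1):
    """Per-token reward = -beta * (log pi_theta - log pi_ref);
    the reward-model score is added on the last token of the response."""
    shaped = -beta * (logp_policy - logp_ref)
    shaped[-1] = shaped[-1] + score
    return shaped

logp_policy = torch.tensor([-1.1, -0.7, -2.0])  # sampled-token log-probs under pi_theta
logp_ref = torch.tensor([-1.0, -0.9, -1.5])     # same tokens under the frozen reference
print(kl_shaped_rewards(score=0.8, logp_policy=logp_policy, logp_ref=logp_ref))
```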

Why this process exists: This stack exists because pairwise human preference data is rich but sparse. The reward model compresses that signal, and PPO gives you a way to optimize against it online.

What it helps with: nuanced, partly subjective qualities such as helpfulness, tone, and harmlessness, where a learned preference signal captures tradeoffs that are hard to demonstrate exhaustively, and online shaping of the policy as its output distribution shifts during training.

What makes it hard: It is operationally expensive because every piece matters: data collection, reward-model quality, rollout quality, KL tuning, PPO stability, and reward hacking defenses.

DPO-family methods

What they do: DPO, ORPO, KTO, and RRHF all try to learn from preference information without building the full reward-model-plus-PPO stack.

The canonical DPO loss is:

\(\mathcal{L}_{DPO}(\theta)= -\log\sigma\Big(\beta\big[(\log\pi_\theta(y_w|x)-\log\pi_\theta(y_l|x))-(\log\pi_{ref}(y_w|x)-\log\pi_{ref}(y_l|x))\big]\Big)\)

Plain English: increase the probability of the preferred answer relative to the rejected one, measured against a reference-model baseline.

Variables: \(x\) is the prompt, \(y_w\) is the preferred (winning) answer, \(y_l\) is the rejected (losing) answer, \(\pi_\theta\) is the policy being trained, \(\pi_{ref}\) is the frozen reference policy, \(\sigma\) is the logistic sigmoid, and \(\beta\) scales how sharply the preference margin is enforced.

How DPO helps: DPO helps by removing the explicit reward-model training stage and the online PPO loop. Instead of first learning a separate score function and then doing RL, it directly updates the model so preferred answers become more likely than dispreferred answers.

Why that matters: This simplifies the training pipeline dramatically. You still use human preference data, but you no longer need to maintain as much RL machinery. That usually means fewer moving parts, lower engineering burden, and often more stable optimization.
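As a sketch, the loss above can be computed directly from summed response log-probabilities under the policy and the frozen reference; the numbers below are placeholders for illustration.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO: widen the preferred-vs-rejected log-prob margin relative to
    the same margin under the frozen reference model."""
    policy_margin = logp_w - logp_l
    ref_margin = ref_logp_w - ref_logp_l
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Summed log-probs of whole responses (one chosen/rejected pair per prompt).
logp_w = torch.tensor([-12.0, -9.5])      # preferred answers under pi_theta
logp_l = torch.tensor([-11.0, -10.0])     # rejected answers under pi_theta
ref_logp_w = torch.tensor([-12.5, -9.8])  # same answers under pi_ref
ref_logp_l = torch.tensor([-11.2, -9.9])
print(dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l))
```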

How the variants help: ORPO folds preference optimization into a single stage that does not need a separate reference model; KTO learns from per-example desirable/undesirable labels instead of requiring strict pairwise comparisons; RRHF uses a ranking loss over multiple sampled responses rather than a single chosen/rejected pair.

What these methods are best at: They are best when you already have preference-style data and want a simpler way to learn from it.

What they give up: They are usually less explicitly online than PPO-style RLHF. That can matter when the model distribution shifts and you want continual updates based on fresh generations.

Online preference optimization variants

What they do: These methods try to bring back some of the advantages of online RL without always paying the full PPO-style complexity cost. They sample from the current policy, score current outputs, and optimize on-policy or near-on-policy preference objectives.
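A pseudocode-level sketch of one such iteration, using hypothetical helpers `sample`, `judge_prefers`, and `dpo_update` (these names are placeholders, not any specific library's API):

```python
def online_preference_step(policy, prompts, sample, judge_prefers, dpo_update):
    """One online iteration: generate fresh pairs from the current policy,
    label them with a judge, and run a preference update on that data."""
    pairs = []
    for x in prompts:
        y_a = sample(policy, x)          # fresh on-policy generations
        y_b = sample(policy, x)
        if judge_prefers(x, y_a, y_b):   # human, reward model, or AI judge
            pairs.append((x, y_a, y_b))  # (prompt, chosen, rejected)
        else:
            pairs.append((x, y_b, y_a))
    return dpo_update(policy, pairs)     # e.g. a DPO-style step on current data
```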

How they help: They help when static preference datasets become stale. If the current model has changed enough, old offline comparisons may no longer reflect the errors or opportunities the model now produces. Online methods keep the training signal closer to the model’s current behavior.

Why they are not always the default: They are operationally heavier than purely offline preference learning, so the extra adaptability only pays off when fresh on-policy data actually matters.

RLVR and GRPO

What RLVR does: RLVR means reinforcement learning with verifiable rewards. Instead of asking humans or a learned reward model which answer is better, you use an external checker: unit tests, exact answer matching, theorem verification, parsers, or similar automatic scoring rules.

What that changes: It changes the source of truth. The training signal is no longer “what humans preferred” but “what the verifier judged correct.”

What GRPO does: GRPO is one way to stabilize optimization in this setting by comparing multiple sampled answers for the same prompt and assigning them relative advantage.

\(A_i = \frac{r_i - \mu_r}{\sigma_r + \varepsilon}\)

Plain English: each sampled answer is judged relative to the other sampled answers for the same prompt, so answers that score above the group average get positive learning signal.

Variables: \(r_i\) is the verifier reward for the \(i\)-th sampled answer, \(\mu_r\) and \(\sigma_r\) are the mean and standard deviation of rewards across the group of samples for that prompt, \(\varepsilon\) is a small constant for numerical stability, and \(A_i\) is the resulting relative advantage.

How RLVR helps: It helps by replacing expensive subjective judgment with objective feedback in domains where correctness is externally checkable. That is why it matters so much for math, code, and some structured reasoning tasks.

How GRPO helps: It helps reduce variance and makes the update signal more comparative. Instead of asking whether an absolute score is good enough in the abstract, it asks which of the sampled answers for this prompt did better and by how much.
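A small sketch combining both ideas: a toy exact-match verifier as the RLVR reward, and the group-relative advantage from the formula above. The samples and reference answer are made up for illustration.

```python
import torch

def verify(answer: str, reference: str) -> float:
    """Toy verifiable reward: exact match against a known reference answer."""
    return 1.0 if answer.strip() == reference.strip() else 0.0

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize each reward against the mean and
    std of the whole group sampled for the same prompt."""
    r = torch.tensor(rewards, dtype=torch.float32)
    return (r - r.mean()) / (r.std() + eps)

# Four sampled answers to one math prompt, scored by the verifier.
samples = ["42", "41", "42", "forty-two"]
rewards = [verify(s, "42") for s in samples]
print(group_relative_advantages(rewards))  # above-average samples get positive signal
```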

What this process is good at: math, code, and other structured reasoning tasks where an external checker can grade answers cheaply and at scale, which lets training generate far more reward signal than human labeling could.

What it does not solve well: It is much weaker for things like tone, harmlessness, style, empathy, or other qualities that do not have robust external verifiers.

Constitutional / RLAIF-style methods

What they do: These methods use a written set of principles or constitutional rules to generate critiques, revisions, or preference labels with AI assistance. Instead of relying entirely on human raters, the system uses principle-guided model judgment to scale supervision.
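A rough sketch of the critique-and-revise pattern, with `generate` as a hypothetical stand-in for an LLM call and an illustrative two-principle constitution; real pipelines use larger principle sets and more careful prompting.

```python
# `generate` is a hypothetical stand-in for an LLM call; prompts are illustrative.
CONSTITUTION = [
    "Choose the response that is more helpful and honest.",
    "Avoid responses that could enable harm.",
]

def constitutional_revision(generate, prompt, draft):
    """Critique a draft against each principle, then rewrite it; the
    (draft, revision) pair can serve as SFT data or an AI preference label."""
    critiques = [
        generate(f"Principle: {p}\nPrompt: {prompt}\nDraft: {draft}\nCritique the draft.")
        for p in CONSTITUTION
    ]
    revision = generate(
        f"Prompt: {prompt}\nDraft: {draft}\nCritiques: {critiques}\n"
        "Rewrite the draft to address the critiques."
    )
    return revision
```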

How they help: They help when the bottleneck is not raw optimization but feedback generation. If human labeling is too slow or expensive, AI-generated critique can create more training signal.

Why this matters: Safety and policy alignment often require many subtle judgments. Constitutional or RLAIF-style pipelines help scale those judgments.

What the catch is: Their quality depends heavily on the constitution, the judge model, and the critique process. If those are flawed, the system can scale biased supervision rather than good supervision.

Representative methods

A useful way to map the families is by the main bottleneck they solve:

  1. SFT: no usable starting policy; solved by imitating curated demonstrations.
  2. PPO-style RLHF: "better" is subjective and nuanced; solved with a learned reward model plus online optimization.
  3. DPO-family: the RLHF pipeline is too heavy; solved by learning directly from preference pairs.
  4. Online preference optimization: offline preference data goes stale; solved by scoring fresh on-policy samples.
  5. RLVR / GRPO: human judgment is expensive but correctness is checkable; solved with verifier rewards.
  6. Constitutional / RLAIF: feedback generation itself is the bottleneck; solved with principle-guided AI critiques.

Tradeoffs

The main tradeoff is signal quality versus pipeline complexity.

PPO-RLHF is attractive when subtle human judgment matters and online policy shaping is worth the cost. DPO-family methods are attractive when you want much of that alignment benefit with less operational complexity. RLVR is attractive when the task has a strong verifier, because the reward is cheaper and more objective. Constitutional methods are attractive when the hard part is generating enough feedback rather than optimizing against it.

Another useful distinction is what each method assumes is available: PPO-style RLHF assumes human preference labels plus the infrastructure for online rollouts; DPO-family methods assume preference pairs but not an online RL loop; RLVR assumes a trustworthy verifier for the task; constitutional methods assume a sensible set of principles and a capable judge model.

Established vs inferred:

Practical guidance

If you want a simple decision rule, choose the method by the strongest signal you actually trust.
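That rule of thumb, written out as a rough lookup (the categories are this post's framing, not a standard taxonomy):

```python
def pick_posttraining_method(has_verifier, has_preference_data,
                             can_run_online_rl, labeling_is_bottleneck):
    """Rough decision rule: follow the strongest training signal you actually trust."""
    if has_verifier:
        return "RLVR (e.g. GRPO)"              # objective, checkable correctness
    if labeling_is_bottleneck:
        return "Constitutional / RLAIF"        # scale feedback generation first
    if has_preference_data and can_run_online_rl:
        return "PPO-style RLHF"                # nuanced, online preference shaping
    if has_preference_data:
        return "DPO-family"                    # simpler offline preference learning
    return "SFT on curated demonstrations"     # build a usable starting policy first

print(pick_posttraining_method(True, True, False, False))  # -> "RLVR (e.g. GRPO)"
```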

Open questions

The field is still unresolved on several fronts: when online preference optimization is worth its extra cost over offline methods, how far verifiable-reward training generalizes beyond math and code, and how to defend reliably against reward hacking as models get better at gaming learned reward signals.

Evidence and Sources

Claim cluster 1: the post-training stack is modular because each process solves a different bottleneck

Claim cluster 2: PPO-style RLHF helps with nuanced preference shaping, but at high operational cost

Claim cluster 3: DPO-family methods help by simplifying preference learning

Claim cluster 4: RLVR/GRPO helps most when correctness is checkable

Uncertainties and Competing Views

High-confidence claims:

Medium-confidence claims:

Competing views:

What evidence would change the conclusion:

Practical Takeaways

References

  1. Ziegler et al. (2019), Fine-Tuning Language Models from Human Preferences — Primary; early modern RLHF formulation.
  2. Ouyang et al. (2022), Training language models to follow instructions with human feedback — Primary; canonical InstructGPT RLHF pipeline.
  3. Huang et al. (2024), The N Implementation Details of RLHF with PPO — Secondary technical synthesis; operational complexity and implementation pitfalls.
  4. Rafailov et al. (2023), Direct Preference Optimization: Your Language Model is Secretly a Reward Model — Primary; core DPO argument and objective.
  5. Hong et al. (2024), ORPO: Monolithic Preference Optimization without Reference Model — Primary; one-stage preference optimization variant.
  6. Ethayarajh et al. (2024), KTO: Model Alignment as Prospect Theoretic Optimization — Primary; desirability-label alternative to pairwise-only framing.
  7. Yuan et al. (2023), RRHF: Rank Responses to Align Language Models with Human Feedback — Primary; ranking-based alignment objective.
  8. Xu et al. (2024), Online Preference Optimization for Language Model Alignment — Primary; representative online preference optimization reference.
  9. Shao et al. (2024), DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models — Primary; important GRPO-context reference.
  10. Zheng et al. (2025), Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs — Primary; argues RLVR can improve reasoning quality, not just sample efficiency.
  11. Bai et al. (2022), Constitutional AI: Harmlessness from AI Feedback — Primary; key RLAIF / constitutional framing.