LLM Post-Training RL Methods

AI-generated research draft; verify critical claims with primary sources. Status: Completed. Last updated: 2026-04-04T19:36:21+05:30.

TL;DR

Post-training for LLMs began with PPO-based RLHF and has since branched into RL-free direct preference methods (DPO, ORPO, KTO, RRHF), online preference optimization, AI-feedback pipelines (Constitutional AI / RLAIF), and verifiable-reward RL (RLVR with GRPO). Which to use depends on the feedback you have (human pairs, an AI judge, or an objective verifier) and how much online-sampling infrastructure you can run.

Background & Context

LLM post-training started with RLHF pipelines that train a reward model from human preferences and then optimize the policy with PPO under KL constraints. This works but is expensive and fragile. Newer methods aim to keep alignment quality while reducing complexity.

Taxonomy + Math + Algorithms (integrated)

A) Classical RLHF with PPO (explicit online RL)

Pipeline:
1. SFT on demonstrations.
2. Train a reward model on preference pairs.
3. PPO optimization against the reward model with KL regularization.

Core objective (LaTeX):

\max_\theta \; \mathbb{E}_{x\sim\mathcal{D},\;y\sim\pi_\theta(\cdot|x)}\left[r_\phi(x,y)\right]
-\beta\,\mathrm{KL}\!\left(\pi_\theta(\cdot|x)\,\|\,\pi_{ref}(\cdot|x)\right)

Plain English: Train the model to get higher reward-model scores, while penalizing it if it drifts too far from a reference policy. Variables: $x$ is a prompt drawn from dataset $\mathcal{D}$; $y$ is a sampled response; $\pi_\theta$ is the policy being trained; $\pi_{ref}$ is the frozen reference policy (typically the SFT model); $r_\phi$ is the learned reward model; $\beta$ controls the strength of the KL penalty.

Token-level KL shaping, common in practice (LaTeX):

r_t^{total} = r_t^{score} - \beta\left(\log\pi_\theta(a_t|s_t)-\log\pi_{ref}(a_t|s_t)\right)

Plain English: At each token step, the total reward is the task/reward-model signal minus a penalty for deviating from the reference model's token probability. Variables: $r_t^{score}$ is the reward-model/task score at step $t$ (often nonzero only at the terminal token); $a_t$ is the token emitted in context $s_t$; $\beta$ scales the per-token KL penalty.

PPO clipped surrogate (LaTeX):

L^{PPO}(\theta)=\mathbb{E}_t\left[\min\left(\rho_tA_t,\;\mathrm{clip}(\rho_t,1-\epsilon,1+\epsilon)A_t\right)\right],
\quad
\rho_t=\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}

Plain English: Update the policy using advantage-weighted importance ratios, but clip the ratio so a single step cannot change the policy too aggressively. Variables: $\rho_t$ is the ratio of new to old policy probability for token $a_t$; $A_t$ is the advantage estimate; $\epsilon$ is the clip range (e.g. 0.2).

Why this matters: PPO-RLHF established the template (reward model + KL-regularized online RL) that nearly every later method simplifies or modifies.

Main pain points: four networks in play (policy, reference, reward model, value/critic), sensitive PPO hyperparameters, and reward-model overoptimization ("reward hacking") if the KL leash is too loose.

Intuition + why this method emerged: humans find it far easier to compare two responses than to score one in isolation; a reward model turns those comparisons into an optimizable signal, and the KL term keeps the policy from drifting into regions where that learned signal is unreliable.

Code example (major repo): OpenAI lm-human-preferences (PPO RLHF)

# Repo: https://github.com/openai/lm-human-preferences
# File: lm_human_preferences/train_policy.py

# Step 1: KL-shaped reward (reward-model score + KL control)
kl = logprobs - ref_logprobs                           # compare policy vs reference
non_score_reward = -self.kl_ctl.value * kl             # adaptive KL penalty
rewards[:, -1] += scores                               # add task/reward-model score on terminal token

# Step 2: PPO policy-ratio objective with clipping
ratio = tf.exp(logprob - old_logprob)                  # importance ratio
pg_losses = -advantages * ratio                        # unclipped PG objective
pg_losses2 = -advantages * tf.clip_by_value(ratio, 1.0 - self.hparams.ppo.cliprange, 1.0 + self.hparams.ppo.cliprange)
pg_loss = tf.reduce_mean(tf.maximum(pg_losses, pg_losses2))

# Step 3: trainer step loop
rollouts = self.policy.respond(queries, length=self.hparams.task.response_length)
train_stats = self.train(rollouts=rollouts)            # multiple minibatch PPO epochs
self.kl_ctl.update(stats['objective/kl'], self.hparams.ppo.batch_size)  # adaptive KL controller

B) DPO-family (RL-free / RL-light preference optimization)

DPO

For each pair $(x,y_w,y_l)$ (LaTeX):

\mathcal{L}_{DPO}(\theta)= -\log\sigma\Big(\beta[(\log\pi_\theta(y_w|x)-\log\pi_\theta(y_l|x))-(\log\pi_{ref}(y_w|x)-\log\pi_{ref}(y_l|x))]\Big)

Plain English: Make preferred responses more likely than rejected ones, relative to the reference model's margin, using a logistic loss. Variables: $(x, y_w, y_l)$ is a prompt with its preferred ($y_w$) and rejected ($y_l$) responses; $\sigma$ is the logistic sigmoid; $\beta$ scales the implicit reward; $\pi_{ref}$ is the reference policy.

Interpretation: directly increases the preferred-vs-rejected log-prob margin relative to the reference model, avoiding an explicit reward model and PPO loop; see the sketch below.
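
To make the loss concrete, here is a minimal PyTorch sketch of the pairwise DPO objective (an illustration, not TRL's implementation; the `*_logps` arguments are assumed to be summed token log-probabilities of each full response):

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # margin of chosen over rejected, measured against the reference margin
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    logits = beta * (policy_margin - ref_margin)
    return -F.logsigmoid(logits).mean()            # logistic loss on the margin gap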

ORPO

Monolithic training: an NLL/SFT term plus an odds-ratio preference penalty in a single stage (reference-model-free formulation); see the sketch below.
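
A hedged sketch of the odds-ratio term, following the ORPO paper's formulation rather than TRL's code; `chosen_logps` / `rejected_logps` are assumed to be length-normalized log-likelihoods, so their exponentials are probabilities in (0, 1):

import torch
import torch.nn.functional as F

def orpo_loss(chosen_logps, rejected_logps, lam=0.1):
    # log odds(y) = log p - log(1 - p), with p = exp(length-normalized logprob)
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))
    or_term = -F.logsigmoid(log_odds_chosen - log_odds_rejected).mean()
    nll = -chosen_logps.mean()                     # SFT anchor on preferred responses
    return nll + lam * or_term                     # single stage, no reference model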

KTO

Uses binary desirability labels (good/bad per example) with a prospect-theoretic, human-aware loss design, rather than a pairwise-only preference likelihood; see the sketch below.
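
A simplified sketch of the KTO value function as described in the paper (TRL's actual implementation differs; `z0`, `lam_d`, `lam_u` are illustrative names for the KL reference point and the desirable/undesirable weights):

import torch

def kto_loss(logratio, desirable, z0, beta=0.1, lam_d=1.0, lam_u=1.0):
    # logratio = log pi_theta(y|x) - log pi_ref(y|x); z0 = batch KL reference point
    v_good = lam_d * (1 - torch.sigmoid(beta * (logratio - z0)))   # desirable examples
    v_bad = lam_u * (1 - torch.sigmoid(beta * (z0 - logratio)))    # undesirable examples
    return torch.where(desirable, v_good, v_bad).mean()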

RRHF

Aligns model likelihood ordering to human ranking via ranking loss over sampled responses.

Why this family took off: no reward model and no online rollouts; offline preference pairs plus a simple classification-style loss make training far cheaper and more stable than PPO-RLHF.

Intuition + why this method family emerged: the KL-constrained RLHF objective has a closed-form optimal policy, so (as DPO showed) the reward model can be folded into a direct loss on the policy's own log-probabilities; the siblings relax what kind of feedback is required (monolithic training for ORPO, binary labels for KTO, rankings for RRHF).

Code examples (major repos):

DPO — Hugging Face TRL (trl/scripts/dpo.py)

# Repo: https://github.com/huggingface/trl
# File: trl/scripts/dpo.py

# Step 1: load the preference dataset
dataset = load_dataset(script_args.dataset_name, name=script_args.dataset_config)

# Step 2: initialize the trainer with the DPO objective
trainer = DPOTrainer(
    model,
    args=training_args,
    train_dataset=dataset[script_args.dataset_train_split],
    eval_dataset=dataset[script_args.dataset_test_split] if training_args.eval_strategy != "no" else None,
    peft_config=peft_config,
)

# Step 3: optimize
trainer.train()

ORPO — Hugging Face TRL experimental ORPO (examples/scripts/orpo.py)

# Repo: https://github.com/huggingface/trl
# File: examples/scripts/orpo.py

# Step 1: ORPO trainer setup (single-stage SFT + odds-ratio preference pressure)
trainer = ORPOTrainer(
    model,
    args=training_args,
    train_dataset=dataset[script_args.dataset_train_split],
    eval_dataset=dataset[script_args.dataset_test_split] if training_args.eval_strategy != "no" else None,
    processing_class=tokenizer,
    peft_config=get_peft_config(model_args),
)

# Step 2: train
trainer.train()

KTO — Hugging Face TRL experimental KTO (examples/scripts/kto.py)

# Repo: https://github.com/huggingface/trl
# File: examples/scripts/kto.py

# Step 1: policy + reference models for KTO-style preference optimization
model = AutoModelForCausalLM.from_pretrained(model_args.model_name_or_path)
ref_model = AutoModelForCausalLM.from_pretrained(model_args.model_name_or_path)

# Step 2: initialize the KTO trainer
trainer = KTOTrainer(
    model,
    ref_model,
    args=training_args,
    train_dataset=dataset[script_args.dataset_train_split],
    processing_class=tokenizer,
)

# Step 3: train
trainer.train()

RRHF — GanjinZero/RRHF (train.py)

# Repo: https://github.com/GanjinZero/RRHF
# File: train.py

# Step 1: ranking-aware RRHF loss (penalize orderings that contradict human scores)
def rrhf_loss(self, scores, idxs, rw_scores):
    diff = scores.unsqueeze(0) - scores.unsqueeze(-1)           # pairwise gaps in model log-prob scores
    rw_diff = rw_scores.unsqueeze(0) - rw_scores.unsqueeze(-1)  # pairwise gaps in human/reward scores
    aval = torch.bitwise_and(rw_diff > 0, diff < 0)[0]          # pairs the model orders against the human ranking
    return -diff[aval].sum()                                    # push misordered pairs apart

# Step 2: combine the RRHF ranking loss with an SFT anchor
loss = self.args.rrhf_weight * rrhf_loss + sft_loss

C) Online preference optimization variants (middle ground)

Goal: retain online sampling advantages with lighter objectives than classical PPO-RLHF.

REINFORCE-style objective, generic form (LaTeX):

\nabla_\theta J(\theta)=\mathbb{E}_{y\sim\pi_\theta(\cdot|x)}\left[(R(x,y)-b(x))\,\nabla_\theta\log\pi_\theta(y|x)\right]

Plain English: Increase the probability of outputs with above-baseline reward and decrease it for below-baseline outputs. Variables: $R(x,y)$ is the scalar reward for response $y$ to prompt $x$; $b(x)$ is a baseline that reduces gradient variance without adding bias; $\pi_\theta$ is the current policy.

Typical algorithm pattern (see the sketch below):
1. Sample fresh outputs from the current policy.
2. Score them with a preference model or judge.
3. Update with the preference objective online.
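
A minimal sketch of the update step using a leave-one-out baseline, the idea behind RLOO (an illustration, not TRL's implementation; `logps` and `rewards` are assumed to hold one prompt's $k$ sampled completions):

import torch

def rloo_loss(logps, rewards):
    # logps, rewards: shape (k,) for k completions of a single prompt
    k = rewards.shape[0]
    baseline = (rewards.sum() - rewards) / (k - 1)   # leave-one-out mean of the other samples
    advantages = rewards - baseline
    return -(advantages.detach() * logps).mean()     # REINFORCE with a variance-reducing baseline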

Tradeoff: fresh on-policy samples avoid the distribution shift that hurts offline methods, but generation and a judge/reward signal must run inside the training loop, which is non-trivial infrastructure.

Intuition + why this method class emerged: offline preference data goes stale as the policy moves away from the data-collecting distribution; scoring the model's own current samples keeps the training signal on-distribution without the full PPO machinery.

Code example (major repo): Hugging Face TRL RLOO (trl/scripts/rloo.py)

# Repo: https://github.com/huggingface/trl
# File: trl/scripts/rloo.py

# Step 1: define reward functions used online during rollout/update
reward_funcs = []
if script_args.reward_model_name_or_path:
    reward_funcs.append(script_args.reward_model_name_or_path)

# Step 2: initialize the online trainer
trainer = RLOOTrainer(
    model=model_args.model_name_or_path,
    reward_funcs=reward_funcs,
    args=training_args,
    train_dataset=dataset[script_args.dataset_train_split],
    peft_config=get_peft_config(model_args),
)

# Step 3: online optimization loop (inside the trainer)
trainer.train()

D) Constitutional AI / RLAIF

Use constitutional principles and AI critiques/preferences to reduce dependence on human pair labels.

Conceptual flow:
1. Supervised phase of self-critique + revision.
2. Preference data generated using AI constitutional judgments (see the sketch below).
3. RL or preference optimization on this AI feedback signal.
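
A hypothetical sketch of step 2, labeling preference pairs with an AI judge (everything here, including `judge.generate` and the constitution text, is illustrative rather than taken from a specific repo):

CONSTITUTION = "Choose the response that is more helpful, honest, and harmless."

def label_pair(judge, prompt, response_a, response_b):
    query = (f"{CONSTITUTION}\n\nPrompt: {prompt}\n\n"
             f"A: {response_a}\n\nB: {response_b}\n\nAnswer with A or B:")
    verdict = judge.generate(query).strip()        # assumed judge API, returns "A" or "B"
    if verdict.startswith("A"):
        chosen, rejected = response_a, response_b
    else:
        chosen, rejected = response_b, response_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}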

Strength: feedback scales with compute rather than annotator hours, and the guiding principles are explicit, auditable, and editable.

Risk: the AI judge inherits, and can amplify, the biases and blind spots of the model doing the judging.

Intuition + why this method emerged: human pairwise labeling is the bottleneck of RLHF; if a strong model can critique outputs against written principles, most of the preference signal can be generated automatically.

Code example (practical open-source pattern in major repo): RLAIF-style data consumed via DPO in TRL

# Repo: https://github.com/huggingface/trl
# File: trl/scripts/dpo.py
# Note: in practice, `dataset_name` can point to AI-judged/constitutional preference data.

# Step 1: load preference pairs (human- or AI-judged)
dataset = load_dataset(script_args.dataset_name, name=script_args.dataset_config)

# Step 2: run preference optimization over that data
trainer = DPOTrainer(
    model,
    args=training_args,
    train_dataset=dataset[script_args.dataset_train_split],
)

# Step 3: optimize
trainer.train()

Practical note: the trainer code path is identical whether the preference pairs came from humans or an AI judge; only the dataset changes, so RLAIF slots directly into any DPO-family pipeline.

E) GRPO + RLVR (verifiable reward RL)

RLVR setup

Use an externally verifiable reward function (LaTeX):

r(x,y)\in\{0,1\}\;\text{or}\;\mathbb{R}

Plain English: Reward comes from objective checkers (such as unit tests or exact answer match), not a learned reward model. Variables: $r(x,y)$ is the verifier's output for response $y$ to prompt $x$, either a binary pass/fail or a real-valued score.
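
For instance, a minimal exact-match reward (an illustrative sketch; real RLVR setups usually add answer extraction and normalization first):

def exact_match_reward(completion: str, reference: str) -> float:
    # 1.0 if the model's final answer string matches the reference exactly, else 0.0
    return float(completion.strip() == reference.strip())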

GRPO-style relative advantage

For a prompt, sample $G$ candidates with rewards $r_i$, then normalize within the group (LaTeX):

A_i=\frac{r_i-\mu_r}{\sigma_r+\varepsilon},
\quad
\mu_r=\frac{1}{G}\sum_i r_i,
\quad
\sigma_r^2=\frac{1}{G}\sum_i(r_i-\mu_r)^2

Plain English: Each candidate is judged relative to the other candidates for the same prompt; higher-than-group-average rewards get positive advantage. Variables: $G$ is the group size; $r_i$ is the reward of candidate $i$; $\mu_r$ and $\sigma_r$ are the group mean and standard deviation; $\varepsilon$ is a small constant for numerical stability; $A_i$ is the resulting relative advantage.

Then apply a PPO-like clipped update using these relative advantages; no learned value/critic network is needed. A worked sketch of the normalization follows.
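
A worked sketch of the group normalization for one prompt (illustrative numbers; `unbiased=False` matches the population variance $\frac{1}{G}\sum_i(r_i-\mu_r)^2$ in the formula above):

import torch

rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])     # verifier outcomes for G = 4 samples
mu = rewards.mean()                              # 0.5
sigma = rewards.std(unbiased=False)              # 0.5, population std as in the formula
advantages = (rewards - mu) / (sigma + 1e-4)     # approx. [ 1., -1., -1.,  1.]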

Why this is effective: a verifier cannot be reward-hacked the way a learned reward model can, and group normalization supplies low-variance advantages without training a value/critic network.

Limit: it only applies where correctness is mechanically checkable (math answers, code against tests) and gives little signal on subjective qualities like helpfulness or style.

Intuition + why this method emerged: when a ground-truth checker exists, a learned reward model is unnecessary overhead; comparing candidates within a sampled group turns sparse pass/fail rewards into a usable advantage signal.

Code example (major repo): Hugging Face TRL GRPO (trl/scripts/grpo.py)

# Repo: https://github.com/huggingface/trl
# File: trl/scripts/grpo.py

# Step 1: map string names to verifier-like reward functions
reward_funcs_registry = {
    "accuracy_reward": accuracy_reward,
    "reasoning_accuracy_reward": reasoning_accuracy_reward,
    "think_format_reward": think_format_reward,
}

# Step 2: initialize GRPO with the chosen reward functions
trainer = GRPOTrainer(
    model=model_args.model_name_or_path,
    reward_funcs=reward_funcs,
    args=training_args,
    train_dataset=dataset[script_args.dataset_train_split],
    peft_config=get_peft_config(model_args),
)

# Step 3: optimize the group-relative objective
trainer.train()

Comparative Snapshot

| Method family | Signal type | Online sampling | Extra reward model | Strength | Weakness |
| --- | --- | --- | --- | --- | --- |
| PPO-RLHF | Human pairwise via RM | Yes | Yes | Strong control | Complex, brittle |
| DPO/ORPO/KTO/RRHF | Preferences/desirability | Usually no | Usually no | Simple, stable | May lag online RL in some settings |
| Online IPO/Online DPO | Preferences | Yes | Optional | Adaptive | Non-trivial infra |
| Constitutional/RLAIF | AI-evaluated principles | Varies | Often yes | Scalable safety tuning | Judge-bias risk |
| RLVR + GRPO | Verifiable correctness | Yes | No (if rules suffice) | Excellent for math/code | Weak for subjective quality |

Practical Guidance

- Offline human preference pairs and limited infrastructure: start with DPO (ORPO for single-stage training, KTO if you only have binary good/bad labels).
- Budget for online sampling plus a judge or reward model: online DPO/IPO, or PPO-RLHF when you need the tightest control.
- Mechanically checkable tasks (math, code): RLVR with GRPO.
- Scaling feedback without human labels: Constitutional AI / RLAIF, while auditing for judge bias.

Competing Views / Uncertainty

The main live debate is whether offline DPO-family methods can match online RL at scale; reported results differ by task, data regime, and evaluation, which is why the snapshot above hedges with "may lag online RL in some settings."

Open Questions

- How far can verifiable-reward RL generalize beyond domains with mechanical checkers?
- Can AI judges be made reliable enough to replace, rather than merely supplement, human preference data?

References

- OpenAI lm-human-preferences: https://github.com/openai/lm-human-preferences
- Hugging Face TRL: https://github.com/huggingface/trl
- GanjinZero RRHF: https://github.com/GanjinZero/RRHF
