Authors: Yuqian Fu†, Haohuan Huang, Kaiwen Jiang, Jiacai Liu, Zhuo Jiang, Yuanheng Zhu†, Dongbin Zhao

*Co-first Authors | †Project Lead | Published on Mar. 27, 2026 (Work in Progress)

Code: https://github.com/hhh675597/revisiting_opd

Paper: https://arxiv.org/abs/2603.25562


<aside> 🆕

Update

[Apr 04] Added a deep dive into OPD’s policy gradient, bias, and variance. Special thanks to Jiacai Liu and Zhuo Jiang for their valuable insights and contributions! 🥳

The Policy Gradient, Bias, and Variance of OPD

</aside>

<aside>

TL;DR


On-policy distillation (OPD) has become an increasingly common component in post-training pipelines for reasoning and agentic language models. Recent public reports from Thinking Machines Lab, Qwen3, MiMo-V2-Flash, and GLM-5 suggest a shared shift toward supervision on model-generated trajectories, or closely related on-policy distillation variants, as a complement to both off-policy distillation and reinforcement learning [3][4][5][6]. This trend is easy to understand from a systems perspective: once the student is expected to reason or act on its own rollouts, the training signal has to remain informative under the prefix distribution induced by the student, not only under teacher trajectories. This raises a basic implementation question: what objective is OPD actually optimizing, and what changes when the sequence-level reverse KL is replaced by a token-level approximation?

1. Token-level vs sequence-level OPD

We first recall the objective behind OPD. For a prompt $x$, the sequence-level reverse-KL objective is

$$ \begin{aligned}J_{\text{OPD}}(\theta) &= \mathbb{E}_{x\sim D}\left[ D_{\mathrm{KL}}\left(\pi_\theta(\cdot \mid x)\,\|\,q(\cdot \mid x)\right) \right] \\ &=\mathbb{E}_{x\sim D,\, y\sim \pi_\theta(\cdot \mid x)}\left[\log\frac{\pi_\theta(y\mid x)}{q(y\mid x)}\right]\\ &=-\mathbb{E}_{x\sim D,\, y\sim \pi_\theta(\cdot \mid x)}\left[\log q(y\mid x)\right]-\mathbb{E}_{x\sim D}\left[\mathcal{H}\left(\pi_\theta(\cdot \mid x)\right)\right], \end{aligned} $$

where $\pi_\theta$ and $q$ are the student and teacher models, respectively. In other words, OPD is a special entropy-regularized, finite-horizon RL problem whose reward is the teacher log-likelihood $\log q(y \mid x)$. Since $\mathbb{E}_{y\sim \pi_\theta}[\nabla_\theta \log \pi_\theta(y \mid x)] = 0$, the score-function (REINFORCE) identity gives the gradient

$$ \nabla_\theta J_{\text{OPD}}(\theta) = \mathbb{E}_{x\sim D,\, y\sim \pi_\theta(\cdot \mid x)}\left[ \big(\log \pi_\theta(y \mid x)-\log q(y \mid x)\big)\, \nabla_\theta \log \pi_\theta(y \mid x) \right]. $$
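As a quick numerical sanity check, the reverse KL above can be estimated by Monte Carlo with samples drawn from the student, which is exactly the on-policy part of OPD. A minimal sketch on a toy single-step categorical case (all numbers are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy single-step case: student and teacher are categorical distributions
# over a 3-token vocabulary (illustrative numbers, not from the paper).
pi = np.array([0.6, 0.3, 0.1])   # student pi_theta(. | x)
q  = np.array([0.4, 0.4, 0.2])   # teacher q(. | x)

# Exact reverse KL: sum_y pi(y) * log(pi(y) / q(y)).
exact_kl = float(np.sum(pi * np.log(pi / q)))

# On-policy Monte Carlo estimate: sample y ~ pi_theta (the student),
# then average the log ratio log(pi(y) / q(y)).
ys = rng.choice(3, size=200_000, p=pi)
mc_kl = float(np.mean(np.log(pi[ys] / q[ys])))

print(exact_kl, mc_kl)  # the two values should agree closely
```

Note that the samples come from $\pi_\theta$, not from $q$: this is what makes the estimator on-policy, and it is why the signal stays informative on the student's own prefix distribution.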

For each decoding step $t$, let $c_t = (x, y_{<t})$ denote the current context, $g_t = \nabla_\theta \log \pi_\theta(y_t \mid c_t)$ the score-function gradient on token $y_t$, and

$$ r_t = \log \frac{\pi_\theta(y_t \mid c_t)}{q(y_t \mid c_t)}. $$
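In an implementation, $r_t$ is typically computed from the per-step logits that the student and teacher produce on the student's rollout. A minimal sketch with hypothetical logits (the array values here are illustrative assumptions):

```python
import numpy as np

def log_softmax(z):
    # Numerically stable log-softmax along the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

# Hypothetical per-step logits for a 3-step rollout over a 4-token
# vocabulary; in practice these come from forward passes of the student
# and teacher on the student's own rollout y ~ pi_theta.
student_logits = np.array([[2.0, 0.5, 0.1, -1.0],
                           [0.0, 1.5, 0.3,  0.2],
                           [1.0, 1.0, 0.0, -0.5]])
teacher_logits = np.array([[1.5, 1.0, 0.2, -0.8],
                           [0.1, 1.2, 0.5,  0.0],
                           [0.8, 1.1, 0.1, -0.3]])
y = np.array([0, 1, 1])  # sampled tokens y_t

# Per-token log ratios r_t = log pi_theta(y_t | c_t) - log q(y_t | c_t).
t_idx = np.arange(len(y))
r = log_softmax(student_logits)[t_idx, y] - log_softmax(teacher_logits)[t_idx, y]

# Summing the per-token ratios recovers the sequence-level log ratio.
print(r, r.sum())
```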

Using the autoregressive factorization

$$ \log \pi_\theta(y \mid x) - \log q(y \mid x) = \sum_{t'=1}^{T} r_{t'}, \qquad \nabla_\theta \log \pi_\theta(y \mid x) = \sum_{t=1}^{T} g_t, $$

we obtain a sequence-level estimator

$$ \hat g_{\text{seq}} = \sum_{t=1}^{T} \left(\sum_{t'=1}^{T} r_{t'}\right) g_t. $$
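As a sanity check on this estimator, consider the one-token case $T = 1$, where $\hat g_{\text{seq}}$ reduces to $r_1 g_1$: for a tabular softmax student its expectation should match the analytic gradient of the reverse KL. A toy numerical check (illustrative numbers, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# One-token case (T = 1): hat_g_seq reduces to r_1 * g_1.
# Student is a tabular softmax policy; all numbers are illustrative.
pi = np.array([0.6, 0.3, 0.1])      # pi_theta = softmax(theta)
q  = np.array([0.4, 0.4, 0.2])      # teacher
r  = np.log(pi / q)                 # per-token log ratio
kl = float(np.sum(pi * r))

# Analytic gradient of the reverse KL w.r.t. the logits theta:
#   dKL/dtheta_j = pi_j * (r_j - KL).
grad_exact = pi * (r - kl)

# Monte Carlo estimate: for y ~ pi_theta,
#   g = grad_theta log pi(y) = onehot(y) - pi, and hat_g_seq = r(y) * g.
ys = rng.choice(3, size=400_000, p=pi)
grad_mc = np.mean(r[ys][:, None] * (np.eye(3)[ys] - pi), axis=0)

print(np.max(np.abs(grad_mc - grad_exact)))  # small: the estimator is unbiased
```

The agreement confirms that $\hat g_{\text{seq}}$ is an unbiased estimator of $\nabla_\theta J_{\text{OPD}}$; its variance, however, grows with the sequence length $T$, which is what motivates the token-level variants discussed next.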