Authors: Yuqian Fu†, Haohuan Huang, Kaiwen Jiang, Jiacai Liu, Zhuo Jiang, Yuanheng Zhu†, Dongbin Zhao

*Co-first Authors | †Project Lead | Published on Mar. 27, 2026 (Work in Progress)

Code: https://github.com/hhh675597/revisiting_opd

Paper: https://arxiv.org/abs/2603.25562


<aside> 🆕

Update

[Apr 04] Added a deep dive into OPD’s policy gradient, bias, and variance. Special thanks to Jiacai Liu and Zhuo Jiang for their valuable insights and contributions! 🥳

The Policy Gradient, Bias, and Variance of OPD

</aside>

<aside>

TL;DR


On-policy distillation (OPD) has become an increasingly common component in post-training pipelines for reasoning and agentic language models. Recent public reports from Thinking Machines Lab, Qwen3, MiMo-V2-Flash, and GLM-5 suggest a shared shift toward supervision on model-generated trajectories, or closely related on-policy distillation variants, as a complement to both off-policy distillation and reinforcement learning [3][4][5][6]. This trend is easy to understand from a systems perspective: once the student is expected to reason or act on its own rollouts, the training signal has to remain informative under the prefix distribution induced by the student, not only under teacher trajectories. This raises a basic implementation question: what objective is OPD actually optimizing, and what changes when the sequence-level reverse KL is replaced by a token-level approximation?

1. Token-level vs sequence-level OPD

We first recall the objective behind OPD. For a prompt $x$, the sequence-level reverse-KL objective is

$$ \begin{aligned}J_{\text{OPD}}(\theta) &= \mathbb{E}_{x\sim D}\left[ D_{\mathrm{KL}}\left(\pi_\theta(\cdot \mid x)\,\|\,q(\cdot \mid x)\right) \right] \\ &=\mathbb{E}_{x\sim D,\, y\sim \pi_\theta(\cdot \mid x)}\left[\log\frac{\pi_\theta(y\mid x)}{q(y\mid x)}\right]\\ &=-\mathbb{E}_{x\sim D,\, y\sim \pi_\theta(\cdot \mid x)}\left[\log q(y\mid x)\right]-\mathbb{E}_{x\sim D}\left[\mathcal{H}\left(\pi_\theta(\cdot \mid x)\right)\right], \end{aligned} $$

where $\pi_\theta$ and $q$ are the student and teacher models, respectively. In other words, OPD is a special entropy-regularized, finite-horizon RL problem whose reward is the teacher log-likelihood $\log q(y \mid x)$. Since $\mathbb{E}_{y\sim \pi_\theta}[\nabla_\theta \log \pi_\theta(y \mid x)] = 0$, the score-function (REINFORCE) identity gives the gradient

$$ \nabla_\theta J_{\text{OPD}}(\theta) = \mathbb{E}_{x\sim D,\, y\sim \pi_\theta(\cdot \mid x)}\left[ \big(\log \pi_\theta(y \mid x)-\log q(y \mid x)\big)\, \nabla_\theta \log \pi_\theta(y \mid x) \right]. $$
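As a quick numerical sanity check, the reverse KL above can be estimated by Monte Carlo with samples drawn from the student, which is exactly the on-policy part of OPD. A minimal sketch on a toy single-step categorical case (all numbers are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy single-step case: student and teacher are categorical distributions
# over a 3-token vocabulary (illustrative numbers, not from the paper).
pi = np.array([0.6, 0.3, 0.1])   # student pi_theta(. | x)
q  = np.array([0.4, 0.4, 0.2])   # teacher q(. | x)

# Exact reverse KL: sum_y pi(y) * log(pi(y) / q(y)).
exact_kl = float(np.sum(pi * np.log(pi / q)))

# On-policy Monte Carlo estimate: sample y ~ pi_theta (the student),
# then average the log ratio log(pi(y) / q(y)).
ys = rng.choice(3, size=200_000, p=pi)
mc_kl = float(np.mean(np.log(pi[ys] / q[ys])))

print(exact_kl, mc_kl)  # the two values should agree closely
```

Note that the samples come from $\pi_\theta$, not from $q$: this is what makes the estimator on-policy, and it is why the signal stays informative on the student's own prefix distribution.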

For each decoding step $t$, let $c_t = (x, y_{<t})$ denote the current context, $g_t = \nabla_\theta \log \pi_\theta(y_t \mid c_t)$ the score-function gradient on token $y_t$, and

$$ r_t = \log \frac{\pi_\theta(y_t \mid c_t)}{q(y_t \mid c_t)}. $$
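In an implementation, $r_t$ is typically computed from the per-step logits that the student and teacher produce on the student's rollout. A minimal sketch with hypothetical logits (the array values here are illustrative assumptions):

```python
import numpy as np

def log_softmax(z):
    # Numerically stable log-softmax along the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

# Hypothetical per-step logits for a 3-step rollout over a 4-token
# vocabulary; in practice these come from forward passes of the student
# and teacher on the student's own rollout y ~ pi_theta.
student_logits = np.array([[2.0, 0.5, 0.1, -1.0],
                           [0.0, 1.5, 0.3,  0.2],
                           [1.0, 1.0, 0.0, -0.5]])
teacher_logits = np.array([[1.5, 1.0, 0.2, -0.8],
                           [0.1, 1.2, 0.5,  0.0],
                           [0.8, 1.1, 0.1, -0.3]])
y = np.array([0, 1, 1])  # sampled tokens y_t

# Per-token log ratios r_t = log pi_theta(y_t | c_t) - log q(y_t | c_t).
t_idx = np.arange(len(y))
r = log_softmax(student_logits)[t_idx, y] - log_softmax(teacher_logits)[t_idx, y]

# Summing the per-token ratios recovers the sequence-level log ratio.
print(r, r.sum())
```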

Using the autoregressive factorization

$$ \log \pi_\theta(y \mid x) - \log q(y \mid x) = \sum_{t'=1}^{T} r_{t'}, \qquad \nabla_\theta \log \pi_\theta(y \mid x) = \sum_{t=1}^{T} g_t, $$

we obtain a sequence-level estimator

$$ \hat g_{\text{seq}} = \sum_{t=1}^{T} \left(\sum_{t'=1}^{T} r_{t'}\right) g_t. $$
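As a sanity check on this estimator, consider the one-token case $T = 1$, where $\hat g_{\text{seq}}$ reduces to $r_1 g_1$: for a tabular softmax student its expectation should match the analytic gradient of the reverse KL. A toy numerical check (illustrative numbers, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# One-token case (T = 1): hat_g_seq reduces to r_1 * g_1.
# Student is a tabular softmax policy; all numbers are illustrative.
pi = np.array([0.6, 0.3, 0.1])      # pi_theta = softmax(theta)
q  = np.array([0.4, 0.4, 0.2])      # teacher
r  = np.log(pi / q)                 # per-token log ratio
kl = float(np.sum(pi * r))

# Analytic gradient of the reverse KL w.r.t. the logits theta:
#   dKL/dtheta_j = pi_j * (r_j - KL).
grad_exact = pi * (r - kl)

# Monte Carlo estimate: for y ~ pi_theta,
#   g = grad_theta log pi(y) = onehot(y) - pi, and hat_g_seq = r(y) * g.
ys = rng.choice(3, size=400_000, p=pi)
grad_mc = np.mean(r[ys][:, None] * (np.eye(3)[ys] - pi), axis=0)

print(np.max(np.abs(grad_mc - grad_exact)))  # small: the estimator is unbiased
```

The agreement confirms that $\hat g_{\text{seq}}$ is an unbiased estimator of $\nabla_\theta J_{\text{OPD}}$; its variance, however, grows with the sequence length $T$, which is what motivates the token-level variants discussed next.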