Group Sequence Policy Optimization (GSPO) is a stable, efficient, and performant RL algorithm.
GSPO defines the importance ratio based on sequence likelihood (which aligns with the basic principle of importance sampling)
and performs sequence-level clipping, rewarding, and optimization (computing the normalized rewards of multiple responses to a query as their advantages).
GSPO also notably stabilizes MoE RL training.
GRPO exhibits severe stability issues when training gigantic LMs, resulting in catastrophic and irreversible model collapse.
Preliminaries
Given a query x from the query set D, an autoregressive LM (a policy) $\pi_\theta$ parameterized by $\theta$ generates a response y with likelihood
$$\pi_\theta(y \mid x) = \prod_{t=1}^{|y|} \pi_\theta(y_t \mid x, y_{<t})$$
where $|y|$ denotes the number of tokens in y. A query-response pair (x, y) can be scored by a verifier r, yielding a reward $r(x, y) \in [0, 1]$.
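As a quick illustration of the factorized likelihood above, here is a minimal sketch (the function name and toy probabilities are my own; in practice the per-token log-probabilities come from the model's softmax outputs):

```python
import math

def sequence_log_likelihood(token_logprobs):
    """log pi_theta(y|x) = sum over t of log pi_theta(y_t | x, y_<t).

    Summing log-probabilities avoids the numerical underflow of
    multiplying |y| raw probabilities for long sequences.
    """
    return sum(token_logprobs)

# Toy example: a 3-token response with conditional probabilities 0.5, 0.8, 0.9.
logps = [math.log(0.5), math.log(0.8), math.log(0.9)]
print(math.exp(sequence_log_likelihood(logps)))  # 0.5 * 0.8 * 0.9 = 0.36
```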
PPO
PPO constrains the policy update within a proximal region of the old policy through a clipping mechanism.
As model size, sparsity, and response length grow, a large rollout batch becomes necessary to maximize hardware utilization during RL. To improve sample efficiency, it is standard to partition a large rollout batch into multiple mini-batches for gradient updates. However, this creates an off-policy setting: responses y are sampled from an old policy $\pi_{\theta_{\mathrm{old}}}$ rather than the current policy $\pi_\theta$ being optimized. The clipping mechanisms in PPO and GRPO exist precisely to handle this off-policy bias, but for GRPO the problem is more fundamental: its objective is ill-posed, stemming from a misapplication of importance sampling weights, and on long-response tasks this leads to variance explosion and model collapse.
The principle of importance sampling is to estimate the expectation of a function f under a target distribution πtar by re-weighting samples drawn from a behavior distribution πbeh
$$\mathbb{E}_{z \sim \pi_{\mathrm{tar}}}[f(z)] = \mathbb{E}_{z \sim \pi_{\mathrm{beh}}}\left[\frac{\pi_{\mathrm{tar}}(z)}{\pi_{\mathrm{beh}}(z)} f(z)\right]$$
where $\frac{\pi_{\mathrm{tar}}(z)}{\pi_{\mathrm{beh}}(z)}$ is the importance weight. IS relies on averaging over multiple samples ($N \gg 1$) from the behavior distribution $\pi_{\mathrm{beh}}$ for the importance weights to effectively correct the distributional mismatch. The IS estimator is unbiased, but unbiased estimators often have very high variance, especially when the distribution of the weights is heavy-tailed (a few samples carry very large weights). Averaging over many samples reduces this variance, so the effectiveness of IS depends heavily on the number of samples averaged and on how close the behavior distribution is to the target.
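A small numerical sketch of the IS identity above (the distributions are my own toy choices): we estimate the mean of a Beta(2, 1) target using only samples from a uniform behavior distribution, reweighted by the density ratio.

```python
import random

random.seed(0)

# Behavior: uniform on [0, 1), density 1. Target: Beta(2, 1), density 2z.
def target_pdf(z):
    return 2.0 * z

N = 200_000
samples = [random.random() for _ in range(N)]  # z ~ pi_beh

# E_{z~tar}[z] estimated by reweighting: w(z) = tar(z) / beh(z) = 2z.
est = sum(target_pdf(z) * z for z in samples) / N
print(est)  # close to 2/3, the true mean of Beta(2, 1)

# With N = 1 the same estimator is still unbiased but wildly noisy:
# a single importance weight cannot correct the distribution mismatch.
```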
In contrast, GRPO applies the importance weight $\frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(y_{i,t} \mid x, y_{i,<t})}$ at each token position t, and it fails to perform the intended distribution-correction role since the weight is based on a single sample $y_{i,t}$ from each next-token distribution $\pi_{\theta_{\mathrm{old}}}(\cdot \mid x, y_{i,<t})$. This introduces high-variance noise into the training gradients, which accumulates over long sequences, is exacerbated by the clipping mechanism, and ultimately leads to irreversible model collapse.
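The variance accumulation over sequence length can be made concrete with a toy simulation (all distributions here are my own illustrative choices, not from the paper): each "token" is a coin flip, and the product of single-sample per-token ratios stays unbiased while its spread explodes with length.

```python
import random
import statistics

random.seed(0)

def product_weight(T, p_old=0.5, p_new=0.6):
    """Product of single-sample token-level IS ratios over a length-T sequence.

    Each token is sampled from the old policy (a Bernoulli here); its weight
    is pi_new(token) / pi_old(token), applied position by position as in GRPO.
    """
    w = 1.0
    for _ in range(T):
        tok = random.random() < p_old
        w *= (p_new / p_old) if tok else ((1 - p_new) / (1 - p_old))
    return w

for T in (1, 10, 100):
    ws = [product_weight(T) for _ in range(20_000)]
    print(T, round(statistics.mean(ws), 2), round(statistics.stdev(ws), 2))
# The mean stays near 1 (the estimator is unbiased), but the standard
# deviation grows rapidly with T: noise accumulates over long sequences.
```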
The failure of the token-level importance weight points to a core principle: the unit of the optimization objective should match the unit of the reward, and the reward is granted to the entire sequence. Thus the importance weight should be defined, and the optimization performed, directly at the sequence level.
Unit mismatch: the unit of optimization vs. the unit of reward
The reward is typically given to the entire sequence (or based on a sequence-level evaluation), while GRPO applies its correction at the token level. Distributing a single sequence reward across tokens, or coupling it directly with token-level corrections, yields a mis-specified objective: the unit of optimization should match the unit of the reward.
The sequence-level importance weight $\frac{\pi_\theta(y \mid x)}{\pi_{\theta_{\mathrm{old}}}(y \mid x)}$ reflects how far the response y sampled from $\pi_{\theta_{\mathrm{old}}}(\cdot \mid x)$ deviates from $\pi_\theta(\cdot \mid x)$, which aligns with the sequence-level reward and serves as a meaningful signal for the clipping mechanism.
Thus, Group Sequence Policy Optimization employs the following sequence-level optimization objective:
$$J_{\mathrm{GSPO}}(\theta) = \mathbb{E}_{x \sim D,\; \{y_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(\cdot \mid x)}\left[\frac{1}{G}\sum_{i=1}^{G} \min\left(s_i(\theta)\hat{A}_i,\; \mathrm{clip}\left(s_i(\theta),\, 1-\varepsilon,\, 1+\varepsilon\right)\hat{A}_i\right)\right]$$
where the sequence-level importance ratio $s_i(\theta)$ is length-normalized:
$$s_i(\theta) = \left(\frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\mathrm{old}}}(y_i \mid x)}\right)^{1/|y_i|} = \exp\left(\frac{1}{|y_i|}\sum_{t=1}^{|y_i|} \log \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(y_{i,t} \mid x, y_{i,<t})}\right)$$
GSPO applies clipping to entire responses to exclude overly off-policy samples from gradient estimation, which matches the sequence-level rewarding and optimization.
GSPO adopts length normalization in si(θ) to reduce the variance and to control si(θ) within a unified numerical range.
Otherwise, likelihood changes of a few tokens could cause dramatic fluctuations of the sequence-level importance ratio, and responses of different lengths would require different clipping ranges.
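A minimal sketch of the length-normalized ratio and sequence-level clipping (function names and the toy numbers are my own; real implementations operate on tensors of per-token log-probabilities):

```python
import math

def gspo_ratio(logp_new, logp_old):
    """Length-normalized sequence-level importance ratio s_i(theta).

    s_i = (pi_theta(y|x) / pi_theta_old(y|x)) ** (1/|y|), computed in log
    space as exp(mean per-token log-ratio) for numerical stability.
    """
    assert len(logp_new) == len(logp_old)
    T = len(logp_new)
    return math.exp(sum(n - o for n, o in zip(logp_new, logp_old)) / T)

def clipped_term(s, advantage, eps=0.2):
    """PPO-style clipped objective term: min(s*A, clip(s, 1-eps, 1+eps)*A)."""
    return min(s * advantage, max(1 - eps, min(s, 1 + eps)) * advantage)

# One token's log-ratio spikes to +2 in a 100-token response: the raw
# sequence ratio would jump by e^2 ~ 7.4x, but the length-normalized
# s_i moves only by e^(2/100).
new = [0.0] * 99 + [2.0]   # per-token log pi_theta (offsets, for illustration)
old = [0.0] * 100          # per-token log pi_theta_old
s = gspo_ratio(new, old)
print(round(s, 3))  # ~1.02
```

Length normalization thus keeps $s_i(\theta)$ in a unified numerical range regardless of response length, so a single clipping range works for all responses.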
Gradient Analysis
The gradient of the GSPO objective (with clipping omitted) is:
$$\nabla_\theta J_{\mathrm{GSPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} s_i(\theta)\,\hat{A}_i \cdot \frac{1}{|y_i|}\sum_{t=1}^{|y_i|} \nabla_\theta \log \pi_\theta(y_{i,t} \mid x, y_{i,<t})\right]$$
That is, all tokens within a response share the same weight $s_i(\theta)\hat{A}_i$, in contrast to GRPO, where each token's gradient is scaled by its own noisy token-level ratio.
GSPO can deliver continuous performance improvement through increasing the training compute, regularly updating the query set, and extending the generation length.
A key distinction of GSPO from GRPO is that it clips entire responses rather than individual tokens; although GSPO clips a much larger fraction of tokens, it achieves superior training efficiency.
Benefit of MoE Training
The sparse-activation nature of MoE introduces unique stability challenges: the experts activated for the same response can change significantly between gradient updates, so the token-level importance ratios fluctuate drastically.
Routing Replay: cache the experts activated in $\pi_{\theta_{\mathrm{old}}}$ and replay these routing modes in $\pi_\theta$ when computing the token-level importance ratios $w_{i,t}(\theta)$, so that for each token $y_{i,t}$, $\pi_\theta(y_{i,t} \mid x, y_{i,<t})$ and $\pi_{\theta_{\mathrm{old}}}(y_{i,t} \mid x, y_{i,<t})$ share the same activated network.
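A minimal sketch of the Routing Replay idea, assuming a hypothetical MoE gate that can either compute top-k routing from gate scores or accept a precomputed assignment (all names and the interface are illustrative, not from any real framework):

```python
def topk_route(gate_scores, k=2):
    """Pick the k highest-scoring experts for one token."""
    return sorted(range(len(gate_scores)), key=lambda e: -gate_scores[e])[:k]

def moe_route(token_gate_scores, replayed_routes=None, k=2):
    """Return per-token expert assignments.

    During rollout (pi_theta_old) the routes are computed and cached; during
    the gradient update (pi_theta) the cached routes are replayed, so both
    policies evaluate each token through the same activated experts.
    """
    if replayed_routes is not None:
        return replayed_routes  # replay the old policy's routing
    return [topk_route(s, k) for s in token_gate_scores]

# Rollout under the old policy: compute and cache the routing.
old_scores = [[0.9, 0.1, 0.5], [0.2, 0.8, 0.7]]  # 2 tokens, 3 experts
cached = moe_route(old_scores)

# Update step: the new policy's gate would route token 0 differently ...
new_scores = [[0.1, 0.9, 0.5], [0.2, 0.8, 0.7]]
# ... but with replay, both policies share the same activated experts.
assert moe_route(new_scores, replayed_routes=cached) == cached
```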
GSPO resolves the expert-activation volatility issue in MoE models without such a workaround: it depends only on the sequence likelihood $\pi_\theta(y_i \mid x)$ and is not sensitive to the individual token likelihoods $\pi_\theta(y_{i,t} \mid x, y_{i,<t})$.