The Path Not Taken: RLVR Provably Learns Off the Principals
Best AI papers explained - A podcast by Enoch H. Kang
This paper offers a mechanistic explanation for the paradox that **Reinforcement Learning with Verifiable Rewards (RLVR)** reliably improves large language model reasoning while making only minimal, sparse changes to parameters. The authors introduce the **Three-Gate Theory**, arguing that the sparse updates are a surface artifact of a **model-conditioned optimization bias**. **Gate I (KL Anchor)** constrains each update; **Gate II (Model Geometry)** steers updates away from the principal, high-curvature directions favored by **Supervised Fine-Tuning (SFT)** and into low-curvature subspaces, preserving the model's spectral structure; and **Gate III (Precision)** amplifies the appearance of sparsity, since bfloat16 storage cannot represent the small updates made in non-preferred regions. Consequently, the work demonstrates that **RLVR learns in a distinct optimization regime from SFT**, suggesting that SFT-era parameter-efficient fine-tuning (PEFT) techniques are often ill-suited for RL applications.
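
The Gate III precision effect can be illustrated with a minimal sketch (not code from the paper): if a small update is applied to weights stored in bfloat16, most stored values round back to their original representation, so the change looks sparse even though the optimizer touched every parameter. The tensor sizes and update magnitude below are illustrative assumptions.

```python
import torch

# Weights stored in bfloat16, as in typical mixed-precision training setups.
weights = torch.randn(10_000, dtype=torch.bfloat16)

# A small, dense update of the kind RLVR might apply off the principal directions
# (magnitude 1e-4 is an assumed, illustrative value).
update = torch.full((10_000,), 1e-4, dtype=torch.float32)

# Apply the update in float32, then cast back to bfloat16 for storage.
updated = (weights.to(torch.float32) + update).to(torch.bfloat16)

# Fraction of stored weights that did not change at all: with ~7 mantissa bits,
# bfloat16 cannot resolve a 1e-4 shift on most weights of magnitude ~1.
unchanged = (updated == weights).float().mean().item()
print(f"Fraction of stored weights unchanged: {unchanged:.1%}")
```

Running this typically reports that the vast majority of stored weights are bit-identical after the update, which is the sense in which limited precision can exaggerate the apparent sparsity of RLVR's parameter changes.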
