
RL: Supervised Outcomes, Unsupervised Processes


In reading the DeepSeek R1 paper, some may have overlooked a nuance: the training data is both human labeled and machine regenerated, blending supervised and unsupervised learning within reinforcement learning (RL).

How so?

From the perspective of the data's origin and gold standards, the training data is undeniably human labeled. It derives from existing math problems and from human-crafted code in GitHub's open-source community, the product of years of effort by educators, developers, and others. The problems (input) and their "gold-standard" answers (output) are human-designed or human-labeled. In this sense, reinforcement learning (RL) here amounts to typical end-to-end supervised learning:

Input: math/coding problems
Output: verified answers
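
To make the outcome supervision concrete, here is a minimal sketch of a rule-based check that rewards only the final answer against the human gold standard. It is not DeepSeek's actual verifier; the "Answer:" convention and the helper names are assumptions for illustration.

    import re

    def extract_final_answer(completion: str) -> str:
        """Pull out the text after a final 'Answer:' marker; this convention is
        assumed for illustration, not taken from the R1 paper."""
        match = re.search(r"Answer:\s*(.+?)\s*$", completion, re.IGNORECASE)
        return match.group(1).strip() if match else ""

    def outcome_reward(completion: str, gold_answer: str) -> float:
        """Binary, outcome-only reward: 1.0 if the final answer matches the
        human-provided gold standard, 0.0 otherwise; the reasoning text in
        between is never scored."""
        return 1.0 if extract_final_answer(completion) == gold_answer.strip() else 0.0

    # The reasoning between the question and "Answer:" is unsupervised; only the
    # end result is checked against the labeled output.
    print(outcome_reward("2x + 3 = 7, so 2x = 4 and x = 2. Answer: 2", "2"))  # prints 1.0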

However, unlike ordinary supervised learning, RL requires the model to learn the reasoning process that leads to the answers, and, critically, the intermediate steps carry no human annotations or feedback. Instead, the system autonomously generates this reasoning data and iteratively appends it to the training set, which makes the process itself unsupervised. The brilliance of RL lies here: self-guided exploration, path discovery, and data regeneration.
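
A minimal sketch of that regeneration loop, in the spirit of the rejection-sampling style data construction described in the R1 report: sample several traces per problem, keep the ones whose final answer verifies, and fold them back into the training data. The generate_fn interface is hypothetical, and outcome_reward is the helper sketched above.

    def regenerate_reasoning_data(generate_fn, problems, samples_per_problem=8):
        """Self-generate reasoning traces. `generate_fn(problem)` is any sampler
        returning a trace that ends in 'Answer: ...' (a hypothetical interface).
        Only the final answer is checked against the human gold standard; the
        intermediate steps are the model's own, hence process-unsupervised."""
        kept = []
        for problem, gold in problems:                        # human-labeled (input, output) pairs
            for _ in range(samples_per_problem):
                trace = generate_fn(problem)                  # CoT + final answer, machine-generated
                if outcome_reward(trace, gold) == 1.0:        # keep only outcome-verified traces
                    kept.append((problem, trace))
        return kept

    # Toy usage with a stand-in sampler:
    data = regenerate_reasoning_data(
        lambda q: "2x = 4, so x = 2. Answer: 2",
        [("Solve 2x + 3 = 7 for x.", "2")],
        samples_per_problem=1,
    )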

Cold Start and Human Data

DeepSeek R1's initial training did use a small set of human-annotated reasoning data. But those couple of thousand examples pale against millions of regenerated examples and are effectively negligible. In fact, research like DeepSeek Zero demonstrates that such process-labeled human data is not a must-have.

Inspired by AlphaZero (which showed that human data might even hinder optimal path discovery in Go), DeepSeek Zero confirms that human process annotations are not necessary. The small amount of human data in R1's pipeline primarily enhances readability for developers, rather than enabling the reasoning capability itself. After all, humans (including developers during debugging) prefer interpretable thought processes.

A New Paradigm: Process-Unsupervised, Outcome-Supervised Learning

This self-play/self-study style of RL represents a novel approach: unsupervised in process but supervised in outcome. DeepSeek's breakthrough reveals that "slow thinking" in RL, meticulously generating intermediate steps as a chain of thought (CoT), boosts performance in logical reasoning as well as in non-logical tasks like creative writing.

As my old buddy Cheng insightfully noted:

Deep reasoning inserts extensive text between questions and answers, reducing the perplexity of generating correct answers. Directly jumping from problem to answer has high perplexity, but adding a "reasoning bridge" lowers it. This follows the language model framework: the key is to search for the optimal path in text generation.
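
Cheng's point can be written down directly in language-model terms. Below, q is the question, a = a_1 ... a_|a| the answer tokens, and c a generated reasoning chain; the notation is standard, and the claim is simply that a well-chosen c makes the second perplexity lower than the first.

    \[
    \mathrm{PPL}(a \mid q) \;=\; \exp\Big(-\tfrac{1}{|a|}\sum_{t=1}^{|a|}\log P_\theta(a_t \mid q,\ a_{<t})\Big)
    \]
    \[
    \mathrm{PPL}(a \mid q, c) \;=\; \exp\Big(-\tfrac{1}{|a|}\sum_{t=1}^{|a|}\log P_\theta(a_t \mid q,\ c,\ a_{<t})\Big)
    \]

The reasoning bridge c is itself generated token by token under the same model, so searching for a good c is exactly the "optimal path in text generation" that Cheng describes.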

Can Unsupervised Regenerated Process Data Lead Astray?

One might worry: if the model autonomously generates flawed reasoning steps in its process data, could errors compound? The answer lies in the clear supervision signal from the gold standard. Like a kite held by a string in human hands, the learning is anchored by the final reward. As long as the model truly scales up, outcome-oriented RL ensures that deviations self-correct probabilistically.
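
The R1 report's RL algorithm, GRPO, makes this anchoring quite literal: for each problem it samples a group of traces, scores only their outcomes, and normalizes the rewards within the group, so verified traces are reinforced and the rest suppressed, with no step-level labels. Below is a minimal sketch of just that group-relative advantage; it omits the policy-gradient update, clipping, and KL penalty, and is not DeepSeek's implementation.

    from statistics import mean, pstdev

    def group_relative_advantages(rewards, eps=1e-6):
        """Normalize outcome-only rewards within one group of sampled traces
        for the same problem; traces with a verified final answer get a positive
        advantage, the rest negative, with no per-step supervision."""
        mu, sigma = mean(rewards), pstdev(rewards)
        return [(r - mu) / (sigma + eps) for r in rewards]

    # Example: 8 sampled traces for one problem, 3 of which ended in the correct answer.
    print(group_relative_advantages([1, 0, 0, 1, 0, 0, 0, 1]))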

Mathematically, minor process imperfections or illogical steps do not statistically compromise final accuracy. For non-logical tasks (beyond math/coding), reasoning paths may even contain contradictions or heavy redundancy. Yet as long as the "slow thinking" mechanism guides the learning, results remain robust, and often superhuman, as many users of R1 have repeatedly demonstrated of late.

Why Regenerated Data Works

Regenerated reasoning data is not random data from nowhere. It is generated by a solid large foundation model trained on vast human knowledge, following autoregressive generation (i.e. next-token prediction). While each step might drift slightly, the context grows incrementally, allowing continuous stepwise self-correction. This dynamic of probabilistic fluctuation balanced by stepwise adjustment enhances semantic coherence and knowledge fluency in generation, lowering overall perplexity and steering toward correct outcomes. Thus, process data rarely derails; instead, it converges toward reliability.
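
A toy autoregressive loop illustrates the mechanism: each sampled token is appended to the context before the next token is drawn, so later steps are conditioned on, and can compensate for, earlier ones. The next_token_distribution callable stands in for a real foundation model and is purely hypothetical.

    import random

    def generate(next_token_distribution, prompt_tokens, max_new_tokens=64, stop="<eos>"):
        """Sample tokens one at a time; the growing context is what allows
        stepwise self-correction during generation."""
        context = list(prompt_tokens)
        for _ in range(max_new_tokens):
            tokens, probs = next_token_distribution(context)     # conditioned on everything generated so far
            tok = random.choices(tokens, weights=probs, k=1)[0]  # probabilistic fluctuation at each step
            context.append(tok)                                  # the new token constrains all later steps
            if tok == stop:
                break
        return context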

A Final Note on Cheng's Observation

Cheng highlights a pivotal finding of DeepSeek: OpenAI's "Let's Verify Step by Step" argues for rewarding each reasoning step, yet DeepSeek's RL model achieves remarkable results using only final-outcome rewards, with no human-annotated chain-of-thought (CoT) data needed. Whether OpenAI's process supervision is essential or simply a red herring, DeepSeek Zero's breakthroughs redefine the field, proving that outcome-oriented RL can master reasoning autonomously.
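
To make the contrast concrete, here is a schematic sketch of the two reward styles, with hypothetical helpers rather than OpenAI's or DeepSeek's implementations: the process-supervised reward needs a verdict on every intermediate step (hence step-level labels or a trained process reward model), while the outcome-only reward needs nothing beyond the gold answer used earlier.

    def process_supervised_reward(steps, step_verifier):
        """'Let's Verify Step by Step' style: every intermediate step is scored,
        which presupposes step-level labels or a learned process reward model."""
        return sum(step_verifier(step) for step in steps) / max(len(steps), 1)

    def outcome_only_reward(completion, gold_answer):
        """DeepSeek-Zero style: score the end result only; the reasoning in
        between is left entirely unsupervised (reuses the earlier sketch)."""
        return outcome_reward(completion, gold_answer)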

In essence, when guided by scalable outcome supervision, machines learn to self-correct, turning imperfect processes into near-perfect results.

 

 



https://blog.sciencenet.cn/blog-362400-1471242.html
