Reinforcement Learning
DeepSeek-R1 Makes the Cover of Nature: A Welcome Step Toward Transparency in AI
36Kr · 2025-09-18 02:02
Core Insights
- The value of open-source artificial intelligence (AI) is gaining broader recognition, highlighted by the publication of the DeepSeek-R1 paper in the prestigious journal Nature, with founder Liang Wenfeng as the corresponding author [1][5].

Research Findings
- The research team hypothesized that human-defined reasoning patterns might limit model exploration, and that unrestricted reinforcement learning (RL) training could better elicit the emergence of new reasoning capabilities in large language models (LLMs) [3][8].
- Experiments demonstrated that the reasoning ability of LLMs can be enhanced through pure RL, reducing the need for human input and outperforming traditionally trained LLMs on tasks such as mathematics, programming competitions, and graduate-level STEM problems [3][9].

Model Evaluation
- Following its launch, DeepSeek-R1 received widespread acclaim from global developers, reaching 91.1k stars on GitHub [4].
- Nature's editorial recognized DeepSeek-R1 as the first mainstream LLM published after peer review, marking a significant step toward transparency in AI [5][17].
- The editorial emphasized the importance of peer-reviewed publication in clarifying how LLMs work and in assessing whether their claimed capabilities hold up [6][17].

Methodology
- The research introduced a new paradigm within the RL framework, minimizing reliance on human-annotated reasoning processes and exploring the potential for LLMs to develop reasoning capabilities through self-evolution [9][10].
- The team proposed an RL algorithm called Group Relative Policy Optimization (GRPO) and trained a series of models, including DeepSeek-R1-Zero and DeepSeek-R1, on top of the foundation model DeepSeek-V3 Base [10][12]; a minimal sketch of the group-relative advantage computation at the heart of GRPO is given after this summary.

Training Phases
- The training process involved multiple stages, with each subsequent model improving upon the previous one in reasoning and instruction-following capability [14].
- DeepSeek-R1 demonstrated strong reasoning ability aligned with human preferences, achieving superior performance across 21 mainstream benchmarks and validating the effectiveness of the RL framework [15][16].

Industry Implications
- The editorial raised concerns about the lack of independent peer review for many widely used LLMs, highlighting the need for transparency and accountability in the AI industry [17][18].
- Nature called for more AI companies to submit their models for publication review, emphasizing that peer review can enhance trust and credibility in AI research [18][19].
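The following is a minimal, hedged sketch of the idea behind GRPO as described above: for each prompt, a group of responses is sampled and each response's reward is normalized against the group's mean and standard deviation, so no separate value (critic) model is needed. The function names, the clipped surrogate form, and the toy numbers are illustrative assumptions, not the authors' code.

```python
# Sketch of GRPO's group-relative advantage plus a clipped policy-gradient surrogate.
# Assumptions: per-response log-probabilities are already summed over tokens; the
# KL-regularization term used in practice is omitted for brevity.
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """rewards: shape (G,), one scalar reward per sampled response for a single prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate averaged over the G responses of one group (to be maximized)."""
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return np.minimum(unclipped, clipped).mean()

# Toy usage: G = 4 responses to one math prompt, reward 1.0 if the final answer is correct.
rewards = np.array([1.0, 0.0, 0.0, 1.0])
adv = group_relative_advantages(rewards)          # correct answers get positive advantage
logp_old = np.log(np.array([0.20, 0.30, 0.25, 0.25]))
logp_new = np.log(np.array([0.30, 0.20, 0.20, 0.30]))
print(adv, grpo_objective(logp_new, logp_old, adv))
```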
DeepSeek Makes the Cover of Nature: Liang Wenfeng's Team Responds to Doubts; R1 Really Cost $294,000 to Train
36Kr · 2025-09-18 01:32
Core Insights
- The paper "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" has gained significant recognition, being featured on the cover of a leading global journal, Nature [2][4].
- DeepSeek-R1 is noted as the first mainstream large language model (LLM) to undergo a peer-review process, setting a precedent for transparency in AI development [7].

Model Performance and Popularity
- After its open-source release, DeepSeek-R1 became the most downloaded model on Hugging Face, surpassing 10.9 million downloads [4].
- The model demonstrated a remarkable improvement in reasoning capability, achieving an average problem-solving accuracy (pass@1) of 77.9%, rising to 86.7% with self-consistency decoding [10].

Training Costs and Efficiency
- The training cost of DeepSeek-R1 was reported at $294,000, significantly lower than the costs incurred by companies like OpenAI and Google [5][6].
- The training process consumed 147,000 GPU hours, with a breakdown of costs across the different training phases [6].

Innovative Training Approach
- DeepSeek-R1-Zero was developed by completely discarding human-written reasoning patterns and using a simplified reinforcement learning framework [8][10].
- Training focused on two main components: the task format and reward signals based on the correctness of final answers [10].

Self-Evolution and Advanced Reasoning
- During training, the model exhibited self-evolution behaviors, increasing the length of the text generated inside the "think" tag and developing advanced reasoning strategies [12][15].
- A notable "Aha Moment" was observed when the model began using the word "wait" more frequently, indicating a shift in its reasoning process [16][18].

Multi-Stage Training Process
- The training pipeline consists of multiple stages: cold start, reinforcement learning, large-scale supervised fine-tuning, and a second round of reinforcement learning [19][20].
- Each stage is designed to enhance a different aspect of the model's capabilities, from initial fine-tuning to improving language consistency and general knowledge [20][35].

Reward System Design
- DeepSeek implemented a dual-track reward system, combining rule-based rewards for reasoning tasks with model-based rewards for general tasks [27][30]; a minimal sketch of such a rule-based reward follows this summary.
- The rule-based rewards cover accuracy and format compliance, while the model-based rewards assess the usefulness and safety of outputs [28][31].

Challenges and Future Directions
- Despite its advanced reasoning capabilities, DeepSeek-R1 has limitations in structured output and tool use, and it is sensitive to prompt variations [43].
- The reliance on reliable reward signals poses challenges, particularly for subjective tasks, which may lead to reward hacking [44].
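Below is a hedged sketch of a rule-based reward of the kind described above: one term checks format compliance (reasoning wrapped in a think tag, answer in an answer tag) and one term checks whether the final answer matches a reference. The tag names, the string-match criterion, and the equal weighting are illustrative assumptions, not DeepSeek's exact rules.

```python
# Rule-based reward sketch: format compliance + final-answer correctness.
import re

def format_reward(completion: str) -> float:
    """1.0 if the completion is a <think>...</think> block followed by an <answer>...</answer> block."""
    pattern = r"^<think>.+?</think>\s*<answer>.+?</answer>\s*$"
    return 1.0 if re.match(pattern, completion, flags=re.DOTALL) else 0.0

def accuracy_reward(completion: str, reference: str) -> float:
    """1.0 if the text inside the <answer> tags equals the reference (plain string match for brevity)."""
    m = re.search(r"<answer>(.+?)</answer>", completion, flags=re.DOTALL)
    return 1.0 if m and m.group(1).strip() == reference.strip() else 0.0

def rule_based_reward(completion: str, reference: str) -> float:
    return format_reward(completion) + accuracy_reward(completion, reference)

demo = "<think>3 * 4 = 12, plus 5 gives 17.</think> <answer>17</answer>"
print(rule_based_reward(demo, "17"))  # 2.0: format and answer both pass
```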
Making Robots Do More Than Just Walk: Nav-R1 Ushers in a New Era of Navigation with Reasoning
机器之心 · 2025-09-18 01:01
Core Insights
- The article discusses the challenges of enabling robots to understand and execute complex navigation commands in real-world environments, emphasizing the need for improved reasoning, path planning, and action execution [2][4].

Group 1: Key Innovations
- The paper introduces a new foundation model, Nav-R1, which integrates perception, reasoning, and action in 3D environments, enhancing the robot's ability to think clearly before acting [5].
- A large dataset, Nav-CoT-110K, consisting of approximately 110,000 Chain-of-Thought trajectories, is constructed for cold-start training, allowing the model to learn reasoning and action decision-making before reinforcement learning optimization [8].
- During reinforcement learning, Nav-R1 employs three complementary reward mechanisms: a Format Reward, an Understanding Reward, and a Navigation Reward, which together encourage logically structured behavior aligned with human expectations [9][13]; a minimal sketch of how such terms could be combined follows this summary.

Group 2: Experimental Results
- Nav-R1 delivers significant improvements in success rate and path efficiency across various navigation tasks, roughly an 8% gain over other advanced methods [14].
- In real-world experiments on a mobile robot platform, Nav-R1 showed robust performance navigating complex indoor environments such as meeting rooms and corridors [18][23].

Group 3: Practical Applications
- Nav-R1's capabilities suggest applications in service robots and home assistants, where understanding and navigating cluttered environments is crucial for user experience [29].
- In healthcare settings, Nav-R1 can improve robot navigation in hospitals and nursing homes, supporting safe and reliable operation in complex environments [30].
- The model's reasoning and control capabilities also apply to augmented reality (AR) and virtual reality (VR) scenarios, where virtual agents need to navigate physical spaces [31].
- In industrial and hazardous environments, Nav-R1's robustness and generalization make it suitable for tasks in factories, mines, and disaster sites [32].
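The sketch below illustrates one way the three Nav-R1 reward terms named above could be combined into a single scalar for RL optimization. The weights and the per-term inputs are illustrative assumptions; the paper defines each term in detail (format compliance, instruction/scene understanding, path fidelity).

```python
# Toy composite reward combining format, understanding, and navigation terms.
from dataclasses import dataclass

@dataclass
class NavR1Reward:
    w_format: float = 0.2
    w_understanding: float = 0.3
    w_navigation: float = 0.5

    def __call__(self, format_ok: bool, understanding_score: float, nav_score: float) -> float:
        # format_ok: did the output follow the expected reasoning/action structure?
        # understanding_score in [0, 1]: agreement with the instruction and the observed scene.
        # nav_score in [0, 1]: e.g. success and closeness of the executed path to the reference.
        return (self.w_format * float(format_ok)
                + self.w_understanding * understanding_score
                + self.w_navigation * nav_score)

reward_fn = NavR1Reward()
print(reward_fn(format_ok=True, understanding_score=0.8, nav_score=0.9))  # 0.89
```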
A Chinese Large Model Makes the Cover of Nature for the First Time! DeepSeek Discloses: R1 Training Cost Only 2 Million RMB
量子位 · 2025-09-18 00:51
Core Insights
- DeepSeek has become the first Chinese large-model company to be featured on the cover of Nature, with founder Liang Wenfeng as the corresponding author [2][3].
- The R1 model has been recognized for its innovative approach, achieving significant performance improvements on reasoning tasks through a pure reinforcement learning framework [19][20].

Group 1: Achievements and Recognition
- DeepSeek's R1 model is the first large language model to undergo peer review, marking a significant milestone for the field [5].
- The model has garnered 3,596 citations on Google Scholar and has been downloaded 10.9 million times from Hugging Face, indicating its widespread acceptance and use [7].
- The training cost of R1 is approximately $294,000, far below the figures for competitors, which often exceed $10 million, challenging the notion that high investment is necessary for top-tier AI models [12][13].

Group 2: Training and Data
- R1 was trained on 512 H800 GPUs for 198 hours, at a total training cost of $294,000 [10][11].
- The R1 dataset covers five types of data: Math, Code, STEM, Logic, and General, with a total of 126,000 prompts [15][18].
- Training combined cold-start data, reinforcement learning, and supervised fine-tuning to strengthen the model's reasoning capabilities [25][26].

Group 3: Performance Metrics
- DeepSeek-R1-Zero achieved a pass@1 score of 71.0% on AIME 2024, up dramatically from 15.6% [21]; a minimal sketch of how pass@1 is computed follows this summary.
- Compared with other leading models, DeepSeek-R1 delivered competitive performance across various benchmarks, including MATH-500 and LiveCode [23][30].
- Models distilled from DeepSeek-R1 outperformed direct application of reinforcement learning to the base model, showcasing the effectiveness of the training approach [29].

Group 4: Safety and Transparency
- DeepSeek has released a detailed safety assessment of the R1 model, indicating a moderate inherent safety level comparable to GPT-4o [18][22].
- The company has embraced transparency by open-sourcing the model weights of DeepSeek-R1 and DeepSeek-R1-Zero on Hugging Face, promoting community engagement [30].
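As a reference for the pass@1 numbers quoted above, here is a hedged sketch of the metric: for each problem, sample several completions, mark each correct or incorrect, and average the per-problem correctness rate. The sample count and the toy data are illustrative assumptions; reported numbers typically average over multiple samples per problem rather than a single greedy decode.

```python
# pass@1 sketch: mean per-problem fraction of correct samples.
from typing import List

def pass_at_1(per_problem_correct: List[List[bool]]) -> float:
    """per_problem_correct[i] holds the correctness of the n samples drawn for problem i."""
    rates = [sum(samples) / len(samples) for samples in per_problem_correct]
    return sum(rates) / len(rates)

# Toy usage: 3 problems, 4 samples each.
results = [
    [True, True, False, True],    # 0.75
    [False, False, False, False], # 0.00
    [True, True, True, True],     # 1.00
]
print(f"pass@1 = {pass_at_1(results):.3f}")  # 0.583
```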
DeepSeek-R1 Paper Makes the Cover of Nature, with Liang Wenfeng as Corresponding Author
36Kr · 2025-09-18 00:45
What a surprise, and yet thoroughly deserved: the latest cover of Nature features the DeepSeek-R1 study, the paper "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" that DeepSeek posted on arXiv this January. The corresponding author of the Nature paper is Liang Wenfeng.

Paper link: https://www.nature.com/articles/s41586-025-09422-z

In its cover blurb, Nature writes: large models tend to solve problems better when they can plan the steps needed to solve them. This kind of "reasoning" resembles how humans work through more complex problems, but it poses a major challenge for AI, normally requiring human intervention to add labels and annotations. In this week's issue, DeepSeek researchers reveal how they were able to train a model to reason with minimal human input.

The DeepSeek-R1 model is trained with reinforcement learning: the model earns a high reward when it solves a math problem correctly and is penalized when it answers incorrectly. As a result, it learned to reason, working through problems step by step and revealing those steps, and became more likely to arrive at the correct ...
Just Now: Liang Wenfeng Publishes in Nature
36Kr · 2025-09-17 23:43
Last night, DeepSeek made history once again. Zhidongxi (智东西) reported on September 18 that on September 17, the DeepSeek-R1 reasoning-model research paper, completed jointly by the DeepSeek team with Liang Wenfeng as corresponding author, appeared on the cover of the authoritative international journal Nature.

The DeepSeek-R1 paper was the first to publicly show that reinforcement learning alone can elicit reasoning capability in large models, a major result that has inspired AI researchers worldwide. The model has also become the world's most popular open-source reasoning model, with more than 10.9 million downloads on Hugging Face. Recognition from Nature is well deserved.

DeepSeek-R1 is also the world's first mainstream large language model to undergo peer review. Nature's editorial praised this highly: almost no mainstream large models have undergone independent peer review, a gap "finally broken by DeepSeek."

Nature noted that unverified claims and hype have become "commonplace" in the AI industry, and that what DeepSeek has done is "a welcome step toward transparency and reproducibility."

Nature cover title: Self-help: reinforcement learning teaches large models to improve themselves.

The new version of the DeepSeek-R1 paper published in Nature differs considerably from the initial, un-peer-reviewed version released in January: it discloses more details of model training and directly responds to the distillation questions raised when the model first launched.
| https:// ...
Science Robotics Cover: DeepMind Releases RoboBallet, Redefining Multi-Robot Collaborative Planning
机器人大讲堂 · 2025-09-17 11:13
Core Viewpoint
- Multi-robot systems are becoming a key technology for improving production efficiency in modern industrial manufacturing, but coordinating multiple robots in shared environments remains a significant challenge [1][4].

Group 1: Challenges in Multi-Robot Coordination
- Effective multi-robot coordination requires solving three core sub-problems: motion planning, task scheduling, and task assignment, each of which is computationally hard [3][4].
- Motion planning requires collision-free path planning for each robot, and its complexity grows exponentially with the number of robots and obstacles [3].
- Task scheduling is akin to the classic Traveling Salesman Problem, with computational complexity that escalates with the number of tasks [3].
- Task assignment determines which robot performs which task; the cost of each assignment depends on how the other tasks are assigned, coupling the three sub-problems together [3][4].

Group 2: RoboBallet Framework
- RoboBallet is a novel framework developed by engineers from University College London and Google DeepMind that combines Graph Neural Networks (GNN) and Reinforcement Learning (RL) to automate multi-robot coordination [4][5].
- The framework represents the collaborative scene as a dynamic graph, where nodes represent individual robots and edges denote their interactions based on spatial proximity [5]; a minimal sketch of such a proximity graph follows this summary.
- The GNN efficiently processes this structured representation, allowing the model to generalize to unseen configurations of obstacles and tasks [5].

Group 3: Training and Performance
- RoboBallet uses a fine-tuned TD3 algorithm to train the policy network, generating multi-robot trajectories that jointly address task assignment, scheduling, and motion planning [7].
- The reward mechanism combines task-completion rewards with collision penalties, promoting efficient task execution while avoiding collisions [7].
- The model is trained in randomly generated environments, learning effective coordination strategies through millions of interactions [7][9].

Group 4: Computational Efficiency and Scalability
- RoboBallet demonstrates impressive computational efficiency, achieving planning steps in approximately 0.3 milliseconds even at its maximum configuration of 8 robots, 40 tasks, and 30 obstacles [8].
- The framework's inference time scales linearly with the number of robots, tasks, and obstacles, making it feasible for real-time applications [11].
- Increasing the number of robots significantly improves task-execution efficiency: average execution time drops from 7.5 seconds to 4.5 seconds (a 40% reduction) when the number of robots is increased from 4 to 8 [12].
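The sketch below illustrates the proximity-graph idea described above: scene entities become nodes, and an edge is added whenever two entities lie within a chosen radius, so a GNN only needs to attend to spatially relevant interactions. The radius, the choice of which entities appear as nodes, and the edge rule are illustrative assumptions, not the exact RoboBallet encoding.

```python
# Build undirected proximity edges over scene entities (toy version of a dynamic graph input).
import numpy as np

def build_proximity_edges(positions: np.ndarray, radius: float) -> list[tuple[int, int]]:
    """positions: (N, 3) array of entity positions; returns undirected edges (i, j) with i < j."""
    edges = []
    n = len(positions)
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(positions[i] - positions[j]) <= radius:
                edges.append((i, j))
    return edges

# Toy scene: two robot bases, one task location, one distant obstacle (extra node types are an assumption).
positions = np.array([
    [0.0, 0.0, 0.0],   # robot A
    [1.0, 0.0, 0.0],   # robot B
    [0.5, 0.4, 0.0],   # task
    [3.0, 3.0, 0.0],   # obstacle, far from everything
])
print(build_proximity_edges(positions, radius=1.2))  # [(0, 1), (0, 2), (1, 2)]: obstacle node stays isolated
```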
X @s4mmy
s4mmy · 2025-09-15 15:49
Find this useful? Give it a like & share with friends. Want more? I produce a FREE weekly newsletter on Tuesdays; link in bio. Disclaimer: I hold investments & have existing partnerships with some of the Agents/protocols mentioned above. https://t.co/MdbljS5vvu

s4mmy (@S4mmyEth): Meta just revealed a 25x faster method of training AI using Reinforcement Learning (RL). AI and Robotics will continue to gain traction as models evolve. Here's the roundup for the DeAI/DePAI segment this week 🧵 (1/9) https://t.co/nICwXpLNNB ...