Reinforcement Learning
Making robots do more than just walk: Nav-R1 ushers in a new era of navigation with reasoning
机器之心· 2025-09-18 01:01
Core Insights
- The article discusses the challenges in enabling robots to understand and execute complex navigation commands in real-world environments, emphasizing the need for improved reasoning, path planning, and action execution capabilities [2][4].
Group 1: Key Innovations
- The paper introduces a new foundational model called Nav-R1, which integrates perception, reasoning, and action in 3D environments, enhancing the robot's ability to think clearly before acting [5].
- A large dataset, Nav-CoT-110K, consisting of approximately 110,000 Chain-of-Thought trajectories, is constructed to facilitate cold-start training, allowing the model to learn reasoning and action decision-making before reinforcement learning optimization [8].
- Nav-R1 employs three complementary reward mechanisms during reinforcement learning: Format Reward, Understanding Reward, and Navigation Reward, which collectively enhance the model's logical behavior and alignment with human expectations (see the sketch after this list) [9][13].
Group 2: Experimental Results
- Nav-R1 demonstrates significant improvements in success rates and path efficiency across various navigation tasks, achieving approximately an 8% increase compared to other advanced methods [14].
- In real-world experiments, Nav-R1 was tested on a mobile robot platform, showing robust performance in navigating complex indoor environments such as meeting rooms and corridors [18][23].
Group 3: Practical Applications
- The capabilities of Nav-R1 suggest potential applications in service robots and home assistants, where understanding and navigating cluttered environments is crucial for user experience [29].
- In healthcare settings, Nav-R1 can enhance the navigation of robots in hospitals and nursing homes, ensuring safe and reliable operation in complex environments [30].
- The model's reasoning and control capabilities are also applicable in augmented reality (AR) and virtual reality (VR) scenarios, where virtual agents need to navigate physical spaces [31].
- In industrial and hazardous environments, Nav-R1's robustness and generalization abilities make it suitable for tasks in factories, mines, and disaster sites [32].
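The summary names Format, Understanding, and Navigation rewards but gives no formulas or weights. Below is a minimal, hypothetical Python sketch of how such a composite reward could be combined during RL fine-tuning; the <think>/<action> output schema, the SPL-style efficiency term, and the weights are illustrative assumptions, not details from the Nav-R1 paper.

```python
# Hypothetical sketch of a Nav-R1-style composite reward; the paper's exact
# reward definitions, output schema, and weights are not given in this summary.

def format_reward(response: str) -> float:
    """Reward well-formed output: reasoning in <think> tags followed by an
    <action> block (an assumed output schema, not the paper's)."""
    ok = "<think>" in response and "</think>" in response and "<action>" in response
    return 1.0 if ok else 0.0

def understanding_reward(pred_answer: str, gold_answer: str) -> float:
    """Reward correct grounding/answers about the 3D scene (exact match here)."""
    return 1.0 if pred_answer.strip().lower() == gold_answer.strip().lower() else 0.0

def navigation_reward(success: bool, path_len: float, shortest_len: float) -> float:
    """Reward reaching the goal, scaled by path efficiency (SPL-style ratio)."""
    if not success or path_len <= 0:
        return 0.0
    return min(1.0, shortest_len / path_len)

def total_reward(response, pred_answer, gold_answer, success, path_len, shortest_len,
                 weights=(0.2, 0.3, 0.5)):  # illustrative weights, not from the paper
    w_fmt, w_und, w_nav = weights
    return (w_fmt * format_reward(response)
            + w_und * understanding_reward(pred_answer, gold_answer)
            + w_nav * navigation_reward(success, path_len, shortest_len))
```

In this toy weighting the navigation term dominates, echoing the article's emphasis on success rate and path efficiency; the actual balance used in Nav-R1 may differ.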
A Chinese large model makes the cover of Nature for the first time! DeepSeek discloses for the first time: training R1 cost only about 2 million yuan
量子位· 2025-09-18 00:51
Core Insights
- DeepSeek has become the first Chinese large model company to be featured on the cover of Nature, with founder Liang Wenfeng as the corresponding author [2][3]
- The R1 model has been recognized for its innovative approach, achieving significant performance improvements in reasoning tasks through a pure reinforcement learning framework [19][20]
Group 1: Achievements and Recognition
- DeepSeek's R1 model is the first large language model to undergo peer review, marking a significant milestone in the field [5]
- The model has garnered 3,596 citations on Google Scholar and has been downloaded 10.9 million times from Hugging Face, indicating its widespread acceptance and use [7]
- The training cost of R1 is approximately $294,000, significantly lower than competitors that often exceed $10 million, challenging the notion that high investment is necessary for top-tier AI models [12][13]
Group 2: Training and Data
- R1 was trained using 512 H800 GPUs for 198 hours, with a total training cost of $294,000 [10][11]
- The dataset for R1 includes five types of data: Math, Code, STEM, Logic, and General, with a total of 126,000 prompts [15][18]
- The model's training involved a combination of cold-start data, reinforcement learning, and supervised fine-tuning, enhancing its reasoning capabilities [25][26]
Group 3: Performance Metrics
- DeepSeek-R1-Zero achieved a pass@1 score of 71.0% in AIME 2024, significantly improving from 15.6% (see the pass@1 sketch after this list) [21]
- In comparison to other leading models, DeepSeek-R1 demonstrated competitive performance across various benchmarks, including MATH-500 and LiveCode [23][30]
- The distilled models from DeepSeek-R1 outperformed direct applications of reinforcement learning on the base model, showcasing the effectiveness of the training approach [29]
Group 4: Safety and Transparency
- DeepSeek has released a detailed safety assessment of the R1 model, indicating a moderate inherent safety level comparable to GPT-4o [18][22]
- The company has embraced transparency by open-sourcing the model weights for DeepSeek-R1 and DeepSeek-R1-Zero on Hugging Face, promoting community engagement [30]
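Scores such as the 71.0% pass@1 on AIME 2024 are conventionally estimated by sampling several answers per problem and averaging their correctness. The sketch below illustrates that standard estimator only; DeepSeek's exact evaluation protocol (sample counts, decoding settings, graders) is not specified in this summary.

```python
from statistics import mean

def pass_at_1(per_problem_correct: list[list[bool]]) -> float:
    """Estimate pass@1 as the fraction of correct samples per problem,
    averaged across problems; sampling several answers per problem (k > 1)
    only reduces the variance of the estimate."""
    return mean(sum(samples) / len(samples) for samples in per_problem_correct)

# Toy example: 3 problems, 4 sampled answers each.
results = [
    [True, True, False, True],    # 3/4 correct
    [False, False, False, True],  # 1/4 correct
    [True, True, True, True],     # 4/4 correct
]
print(f"pass@1 = {pass_at_1(results):.2f}")  # 0.67
```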
The DeepSeek-R1 paper makes the cover of Nature, with Liang Wenfeng as corresponding author
36Kr· 2025-09-18 00:45
What a surprise, and yet so well deserved! The latest cover of Nature features the DeepSeek-R1 study, the paper DeepSeek posted on arXiv in January of this year, "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." The corresponding author of this Nature paper is Liang Wenfeng.

Paper link: https://www.nature.com/articles/s41586-025-09422-z

In its cover recommendation, Nature writes: large models that are trained to plan out the steps needed to solve a problem tend to solve it better. This kind of "reasoning" resembles how humans handle more complex problems, but it poses a major challenge for AI and requires human intervention to add labels and annotations. In this week's issue, DeepSeek researchers reveal how they were able to train a model with minimal human input and get it to reason.

The DeepSeek-R1 model is trained with reinforcement learning. In this kind of learning, the model receives a high reward when it answers a math problem correctly and is penalized when it answers incorrectly. As a result, it learned to reason, solving problems step by step and revealing those steps, and became more likely to arrive at the correct ...
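The cover blurb describes the training signal as a high reward for a correct math answer and a penalty for a wrong one. Below is a minimal rule-based grader in that spirit; the \boxed{} answer convention and the exact string match are assumptions for illustration, whereas real graders also normalize answers and check mathematical equivalence.

```python
import re

def extract_final_answer(response: str) -> str | None:
    """Pull the content of a \\boxed{...} expression, a common convention for
    marking the final answer in model outputs (assumed here, not confirmed)."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    return match.group(1).strip() if match else None

def math_reward(response: str, reference: str) -> float:
    """+1 for a correct final answer, -1 otherwise; real graders would also
    normalize formats and check symbolic equivalence rather than exact match."""
    answer = extract_final_answer(response)
    return 1.0 if answer is not None and answer == reference.strip() else -1.0

print(math_reward(r"... so the answer is \boxed{42}", "42"))  # 1.0
print(math_reward(r"... therefore \boxed{41}", "42"))         # -1.0
```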
Just now, Liang Wenfeng published in Nature
36Kr· 2025-09-17 23:43
Last night, DeepSeek made history once again!

智东西 reported on September 18 that on September 17, the DeepSeek-R1 reasoning-model research paper, completed jointly by the DeepSeek team with Liang Wenfeng as corresponding author, appeared on the cover of the authoritative international journal Nature.

The DeepSeek-R1 paper was the first to publicly show that reinforcement learning alone can elicit reasoning capabilities in large models, an important finding that has inspired AI researchers worldwide; the model has also become the world's most popular open-source reasoning model, with more than 10.9 million downloads on Hugging Face. Receiving Nature's endorsement is well deserved.

At the same time, DeepSeek-R1 is also the world's first mainstream large language model to undergo peer review. In an editorial, Nature praised it highly: almost none of the mainstream large models have been independently peer reviewed, and this gap "has finally been broken by DeepSeek."

Nature argued that unverified claims and hype have become "commonplace" in the AI industry, and that what DeepSeek has done is "a welcome step toward transparency and reproducibility."

The Nature cover title: Self-help: reinforcement learning teaches large models to improve themselves.

The new version of the DeepSeek-R1 paper published in Nature differs considerably from the un-peer-reviewed initial version from January of this year; it discloses more details of model training and directly addresses the distillation allegations raised when the model was first released. | https:// ...
Science Robotics cover: DeepMind releases RoboBallet, redefining multi-robot coordinated planning
机器人大讲堂· 2025-09-17 11:13
Core Viewpoint
- Multi-robot systems are becoming a key technology for improving production efficiency in modern industrial manufacturing, but face significant challenges in coordinating multiple robots in shared environments [1][4].
Group 1: Challenges in Multi-Robot Coordination
- Three core sub-problems must be solved for effective multi-robot coordination: motion planning, task scheduling, and task assignment, each presenting significant computational challenges [3][4].
- Motion planning requires collision-free path planning for each robot, which becomes exponentially complex as the number of robots and obstacles increases [3].
- Task scheduling is akin to the classic Traveling Salesman Problem, with computational complexity that escalates with the number of tasks [3].
- Task assignment involves determining which robot performs which task, with costs dependent on other tasks' assignments, creating a coupled relationship among the three sub-problems [3][4].
Group 2: RoboBallet Framework
- RoboBallet is a novel framework developed by engineers from University College London and Google DeepMind, combining Graph Neural Networks (GNN) and Reinforcement Learning (RL) to automate the resolution of multi-robot coordination issues [4][5].
- The framework represents the collaborative scene as a dynamic graph, where nodes represent individual robots and edges denote their interactions based on spatial proximity (see the sketch after this list) [5].
- GNN efficiently processes structured information, allowing the model to generalize well to unseen configurations of obstacles and tasks [5].
Group 3: Training and Performance
- RoboBallet employs a fine-tuned TD3 algorithm for training the policy network, enabling the generation of multi-robot trajectories while addressing task assignment, scheduling, and motion planning [7].
- The reward mechanism includes task completion rewards and collision penalties, promoting efficient task execution while avoiding collisions [7].
- The model is trained in randomly generated environments, allowing it to learn effective coordination strategies through millions of interactions [7][9].
Group 4: Computational Efficiency and Scalability
- RoboBallet demonstrates impressive computational efficiency, achieving planning steps in approximately 0.3 milliseconds even with a maximum configuration of 8 robots, 40 tasks, and 30 obstacles [8].
- The framework's inference time scales linearly with the number of robots, tasks, and obstacles, making it feasible for real-time applications [11].
- Increasing the number of robots significantly enhances task execution efficiency, with average execution time dropping from 7.5 seconds to 4.5 seconds (a 40% reduction) when the number of robots is increased from 4 to 8 [12].
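The summary describes the scene as a dynamic graph whose robot nodes are linked when they are spatially close. The sketch below builds such proximity edges from robot positions; the distance threshold, the 3D point representation, and the restriction to robot-robot edges are assumptions, since the full RoboBallet graph also carries task and obstacle nodes with learned features consumed by the GNN policy.

```python
import numpy as np

def proximity_edges(positions: np.ndarray, radius: float) -> list[tuple[int, int]]:
    """Connect robot nodes whose positions lie within `radius` of each other.
    The full RoboBallet graph also contains task and obstacle nodes with richer
    features; this sketch covers only the robot-robot proximity edges."""
    n = len(positions)
    edges: list[tuple[int, int]] = []
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(positions[i] - positions[j]) < radius:
                edges.append((i, j))
                edges.append((j, i))  # store both directions for an undirected graph
    return edges

# Toy example: 3 robots, connect any pair closer than 1.5 m.
pos = np.array([[0.0, 0.0, 0.0], [1.0, 0.5, 0.0], [3.0, 3.0, 0.0]])
print(proximity_edges(pos, radius=1.5))  # [(0, 1), (1, 0)]
```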
X @s4mmy
s4mmy· 2025-09-15 18:17
AI Development
- Meta unveiled a 25x faster method for training AI using Reinforcement Learning (RL) [1]
- AI and Robotics are expected to gain traction as models evolve [1]
Industry Trends
- The DeAI/DePAI segment is gaining momentum [1]
X @s4mmy
s4mmy· 2025-09-15 15:49
Find this useful? Give it a like & share with friends. Want more? I produce a FREE weekly newsletter on Tuesdays; link in bio.
Disclaimer: I hold investments & have existing partnerships with some of the Agents/protocols mentioned above. https://t.co/MdbljS5vvu
Quoting s4mmy (@S4mmyEth): Meta just revealed a 25x faster method of training AI using Reinforcement Learning (RL). AI and Robotics will continue to gain traction as models evolve. Here's the roundup for the DeAI/DePAI segment this week 🧵 (1/9) https://t.co/nICwXpLNNB ...
X @s4mmy
s4mmy· 2025-09-15 15:47
Meta just revealed a 25x faster method of training AI using Reinforcement Learning (RL). AI and Robotics will continue to gain traction as models evolve. Here's the roundup for the DeAI/DePAI segment this week 🧵 (1/9) https://t.co/nICwXpLNNB ...
Alphabet's Isomorphic Labs: Turning Cancer Into a Chronic, But Livable Disease
YouTube· 2025-09-14 06:00
Core Insights
- The company is developing a drug design engine that utilizes advanced AI models to create new molecule designs for various diseases and modalities, significantly improving the drug discovery process [2][3][10].
- The approach leverages generative AI and predictive capabilities to understand protein structures and interactions, aiming to enhance the efficacy and safety of drug candidates [5][6][12].
- The focus is on generalizability, allowing the models to be applied across different targets and disease areas, which is a more ambitious and challenging goal compared to traditional drug design methods [27][30][54].
Group 1
- The drug design engine incorporates multiple AI models, including those for predicting protein structures and binding affinities, to streamline the drug development process [3][4][6].
- Traditional drug design is iterative and time-consuming, often taking weeks or months for each molecule, whereas the new approach allows for virtual testing and rapid iterations [8][10].
- The company aims to reduce the drug discovery timeline significantly, potentially achieving experimental-level accuracy in predictions, which would minimize reliance on physical lab work [47][49].
Group 2
- The focus on immunology and oncology is strategic, as these areas have significant clinical impact and allow for more tractable clinical trials [33][34].
- The company is making progress in identifying novel chemical matter for previously challenging targets, demonstrating the effectiveness of their AI-driven approach [44][45].
- The ambition is to create a generalizable technology that can be reused across various drug design campaigns, which is rare in the biotech industry [54][55].
Group 3
- The company is actively working on partnerships with major pharmaceutical firms like Novartis and Eli Lilly to leverage their expertise and accelerate drug discovery [43][44].
- The models can analyze entire families of proteins, enabling a comprehensive understanding of molecular interactions that traditional methods cannot achieve [39][40].
- The long-term vision includes a future where AI tools assist in diagnosing and treating diseases, potentially transforming patient interactions with healthcare [50][51].
Meta's Super Intelligence Lab's new paper stirs controversy! Accused of ignoring a large body of prior research
量子位· 2025-09-12 00:59
Core Viewpoint
- Meta's Super Intelligence Lab (MSL) faces controversy over its second paper, titled "Language Self-Play For Data-Free Training," which has been criticized for neglecting prior research and lacking innovation [2][25].
Summary by Sections
Overview of the Paper
- The core idea of the paper is to utilize a method called Language Self-Play (LSP) to enable large language models to self-improve without additional training data [3][4].
- LSP addresses the challenge of large language models' heavy reliance on extensive, high-quality training data, which is often limited [4].
Methodology
- LSP designs the learning process as a game framework where the same language model plays two roles in opposition, allowing for data-free training [5].
- In this adversarial process, the challenger generates increasingly difficult questions or commands to lower the expected rewards of the resolver, who must understand and respond to maximize their own rewards, akin to a minimax game [7].
- Unlike traditional adversarial training, LSP allows a single language model to act as both "challenger" and "resolver," using a special "Challenger Prompt" to switch roles [8].
Implementation and Challenges
- The research introduces a reinforcement learning technique called GRPO to convert the game into a model training process [9].
- A reward mechanism is established where the challenger's questions target the resolver's weaknesses, driving continuous improvement [10].
- The method is termed Language Self-Play Zero (LSP-Zero), indicating a zero-sum nature [11].
- However, LSP-Zero can sometimes degrade, leading the model to generate meaningless content that scores high due to reward hacking [12].
Enhancements
- To mitigate this issue, the researchers incorporated a "self-quality reward" (RQ) into the LSP algorithm, guiding the game towards high-quality interactions for sustainable training (see the sketch after this list) [13].
Experimental Results
- Experiment 1 compared LSP and LSP-Zero with a traditional data-driven model, showing that LSP methods performed comparably to data-driven approaches and significantly outperformed the original model [18].
- In a dialogue and open instruction dataset, LSP's performance exceeded that of GRPO [18].
- Experiment 2 further trained a model using LSP-Zero and LSP, resulting in an increase in overall win rates from 40.9% to 43.1% [21].
- LSP demonstrated particularly notable improvements on the Vicuna dataset, indicating its effectiveness in continuing to unlock model potential after data-driven training [22][24].
Criticism and Response
- Critics argue that MSL's work overlooks significant prior research, with various researchers having conducted similar studies without proper citation [25][26].
- The paper has been described as potentially rehashing older work, raising questions about its originality [30].
- As of now, MSL and the authors have not responded to these criticisms [31].
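Based only on the description above, the sketch below shows how LSP-style rewards (a task score plus the self-quality term RQ) might feed a GRPO-style group-normalized advantage. The scores are toy numbers, and the negated challenger term, the normalization constant, and all names are assumptions rather than the paper's exact formulation.

```python
import numpy as np

def lsp_rewards(task_scores, quality_scores):
    """Resolver reward = task score + self-quality (RQ) term; in the zero-sum
    LSP-Zero view the Challenger would receive the negated task score.
    All inputs here are toy numbers, not the paper's reward-model outputs."""
    task = np.asarray(task_scores, dtype=float)
    quality = np.asarray(quality_scores, dtype=float)
    return task + quality, -task

def group_relative_advantages(rewards):
    """GRPO-style advantage: normalize each sample's reward by the mean and
    std of its own group, so no separate value network is needed."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Toy group of 4 Resolver answers to one Challenger-generated instruction.
resolver_r, challenger_r = lsp_rewards([0.2, 0.9, 0.4, 0.7], [0.5, 0.8, 0.1, 0.6])
print(group_relative_advantages(resolver_r))  # positive for above-average answers
```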