Reinforcement Learning
Making Robots Do More Than Just Walk: Nav-R1 Ushers in a New Era of Navigation with Reasoning
机器之心· 2025-09-18 01:01
Core Insights
- The article examines the challenges of enabling robots to understand and execute complex navigation commands in real-world environments, emphasizing the need for stronger reasoning, path planning, and action execution [2][4].

Group 1: Key Innovations
- The paper introduces Nav-R1, a new foundation model that integrates perception, reasoning, and action in 3D environments, enabling the robot to think clearly before acting [5].
- A large dataset, Nav-CoT-110K, consisting of roughly 110,000 Chain-of-Thought trajectories, is constructed for cold-start training, letting the model learn reasoning and action decision-making before reinforcement learning optimization [8].
- During reinforcement learning, Nav-R1 employs three complementary reward mechanisms: a Format Reward, an Understanding Reward, and a Navigation Reward, which together improve the logical structure of its behavior and its alignment with human expectations (a toy composite-reward sketch follows this entry) [9][13].

Group 2: Experimental Results
- Nav-R1 delivers significant gains in success rate and path efficiency across navigation tasks, improving by roughly 8% over other state-of-the-art methods [14].
- In real-world experiments on a mobile robot platform, Nav-R1 navigated complex indoor environments such as meeting rooms and corridors robustly [18][23].

Group 3: Practical Applications
- Nav-R1's capabilities suggest applications in service robots and home assistants, where understanding and navigating cluttered environments is crucial to user experience [29].
- In healthcare settings, Nav-R1 can improve robot navigation in hospitals and nursing homes, supporting safe and reliable operation in complex environments [30].
- Its reasoning and control capabilities also apply to augmented reality (AR) and virtual reality (VR) scenarios in which virtual agents must navigate physical spaces [31].
- In industrial and hazardous environments, Nav-R1's robustness and generalization make it suitable for tasks in factories, mines, and disaster sites [32].
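To make the three-part reward concrete, here is a minimal Python sketch of how format, understanding, and navigation signals might be combined into one scalar. The tag format, the token-overlap proxy, the success threshold, and the weights are all illustrative assumptions, not values from the paper.

```python
# Hypothetical sketch of three complementary navigation rewards.
# Scoring rules and weights are illustrative assumptions, not paper values.
import re

def format_reward(response: str) -> float:
    """Reward well-structured output: reasoning in <think> tags followed
    by an <action> block (an assumed output format)."""
    pattern = r"<think>.*?</think>\s*<action>.*?</action>"
    return 1.0 if re.fullmatch(pattern, response.strip(), re.DOTALL) else 0.0

def understanding_reward(answer: str, reference: str) -> float:
    """Crude semantic-agreement proxy: token overlap with a reference answer."""
    a, b = set(answer.lower().split()), set(reference.lower().split())
    return len(a & b) / max(len(b), 1)

def navigation_reward(path, goal, success_dist: float = 0.5) -> float:
    """Reward reaching the goal, with a shaping term for path efficiency."""
    def dist(p, q):
        return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
    success = dist(path[-1], goal) < success_dist
    straight_line = dist(path[0], goal)
    traveled = sum(dist(path[i], path[i + 1]) for i in range(len(path) - 1))
    efficiency = straight_line / max(traveled, 1e-6)  # 1.0 = perfectly direct
    return (1.0 + efficiency) if success else 0.0

def total_reward(response, answer, reference, path, goal,
                 w=(0.2, 0.3, 0.5)):  # illustrative weights
    return (w[0] * format_reward(response)
            + w[1] * understanding_reward(answer, reference)
            + w[2] * navigation_reward(path, goal))
```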
A Chinese Large Model Makes the Cover of Nature for the First Time! DeepSeek Discloses: Training R1 Cost Just ¥2 Million
量子位· 2025-09-18 00:51
Core Insights
- DeepSeek has become the first Chinese large-model company to be featured on the cover of Nature, with founder Liang Wenfeng as the corresponding author [2][3].
- The R1 model has been recognized for its innovative approach, achieving significant performance gains on reasoning tasks through a pure reinforcement learning framework [19][20].

Group 1: Achievements and Recognition
- DeepSeek's R1 is the first large language model to undergo peer review, a significant milestone for the field [5].
- The model has accumulated 3,596 citations on Google Scholar and 10.9 million downloads from Hugging Face, indicating widespread adoption [7].
- R1's training cost of roughly $294,000 is far below that of competitors, which often exceeds $10 million, challenging the notion that top-tier AI models require enormous investment [12][13].

Group 2: Training and Data
- R1 was trained on 512 H800 GPUs for 198 hours, at a total training cost of $294,000 [10][11].
- The R1 dataset spans five data types (Math, Code, STEM, Logic, and General), totaling 126,000 prompts [15][18].
- Training combined cold-start data, reinforcement learning, and supervised fine-tuning to strengthen reasoning [25][26].

Group 3: Performance Metrics
- DeepSeek-R1-Zero reached a pass@1 score of 71.0% on AIME 2024, up sharply from 15.6% (see the pass@k sketch after this entry) [21].
- Against other leading models, DeepSeek-R1 was competitive across benchmarks including MATH-500 and LiveCode [23][30].
- Models distilled from DeepSeek-R1 outperformed applying reinforcement learning directly to the base model, showcasing the effectiveness of the training approach [29].

Group 4: Safety and Transparency
- DeepSeek released a detailed safety assessment of R1, indicating a moderate inherent safety level comparable to GPT-4o [18][22].
- The company open-sourced the model weights for DeepSeek-R1 and DeepSeek-R1-Zero on Hugging Face, promoting community engagement [30].
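The pass@1 figures above are normally estimated by drawing several samples per problem rather than one greedy decode. Below is the standard unbiased pass@k estimator from the code-generation evaluation literature; applying it to AIME-style scoring here is an assumption for illustration.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples drawn per problem, c of them
    correct. Returns the probability that at least one of k draws passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example (hypothetical numbers): 16 sampled solutions per problem with 11
# correct gives pass@1 = 11/16 = 0.6875, close to the 71.0% figure cited.
print(pass_at_k(n=16, c=11, k=1))  # 0.6875
```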
DeepSeek-R1 Paper Makes the Cover of Nature, with Liang Wenfeng as Corresponding Author
36Kr· 2025-09-18 00:45
What a surprise, and yet thoroughly deserved! The cover of the latest issue of Nature features the DeepSeek-R1 research: the paper DeepSeek posted to arXiv this January, "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning". The corresponding author of the Nature paper is Liang Wenfeng.

Paper link: https://www.nature.com/articles/s41586-025-09422-z

In its cover blurb, Nature writes: large models tend to solve problems better when they can plan the steps needed to solve them. This "reasoning" resembles how humans work through harder problems, but it is a major challenge for AI, normally requiring human intervention to add labels and annotations. In this week's issue, DeepSeek's researchers reveal how they were able to train a model to reason with minimal human input.

The DeepSeek-R1 model is trained with reinforcement learning, in which it earns a high reward for solving a math problem correctly and is penalized for answering incorrectly. As a result, it learned to reason, working through problems step by step and revealing those steps, and became more likely to arrive at the correct ...
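The reward scheme Nature's blurb describes, a score for a correct final answer and a penalty otherwise, is a verifiable rule-based signal rather than a learned reward model. Here is a minimal sketch of such a verifier, assuming answers arrive in a `\boxed{...}` format (an assumption for illustration, not a detail from the paper):

```python
import re

def math_reward(model_output: str, ground_truth: str) -> float:
    """Rule-based reward: +1 if the final boxed answer matches the ground
    truth, -1 otherwise. The \\boxed{...} convention is assumed."""
    match = re.search(r"\\boxed\{([^}]*)\}", model_output)
    if match and match.group(1).strip() == ground_truth.strip():
        return 1.0
    return -1.0

print(math_reward(r"... so the answer is \boxed{42}", "42"))  # 1.0
```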
Just Now: Liang Wenfeng Publishes in Nature
36Kr· 2025-09-17 23:43
Last night, DeepSeek made history once again!

智东西 (Zhidx) reported on September 18 that on September 17, the DeepSeek-R1 reasoning-model research paper, completed jointly by the DeepSeek team with Liang Wenfeng as corresponding author, appeared on the cover of the authoritative international journal Nature.

The DeepSeek-R1 paper was the first to publicly demonstrate that reinforcement learning alone can elicit reasoning ability in large models, a finding that has inspired AI researchers worldwide. The model has also become the world's most popular open-source reasoning model, with more than 10.9 million downloads on Hugging Face. Nature's endorsement is well deserved.

DeepSeek-R1 is also the world's first mainstream large language model to undergo peer review. Nature's editorial praised it highly: almost no mainstream large models have undergone independent peer review, and this gap "has finally been broken by DeepSeek."

Nature observed that unverified claims and hype have become "commonplace" in the AI industry, and that everything DeepSeek has done is "a welcome step toward transparency and reproducibility."

Nature cover headline: Self-help: reinforcement learning teaches large models to improve themselves.

The new version of the DeepSeek-R1 paper published in Nature differs considerably from the un-peer-reviewed version from January, disclosing more details of model training and directly addressing the distillation allegations raised when the model first launched.

| https:// ...
Science Robotics Cover: DeepMind Releases RoboBallet, Redefining Multi-Robot Collaborative Planning
机器人大讲堂· 2025-09-17 11:13
Core Viewpoint
- Multi-robot systems are becoming a key technology for improving production efficiency in modern industrial manufacturing, but coordinating multiple robots in shared environments remains a significant challenge [1][4].

Group 1: Challenges in Multi-Robot Coordination
- Effective multi-robot coordination requires solving three core sub-problems: motion planning, task scheduling, and task assignment, each computationally demanding in its own right [3][4].
- Motion planning requires collision-free paths for each robot and grows exponentially harder as robots and obstacles are added [3].
- Task scheduling resembles the classic Traveling Salesman Problem, with complexity that escalates with the number of tasks [3].
- Task assignment decides which robot performs which task, and each assignment's cost depends on the other assignments, coupling all three sub-problems together [3][4].

Group 2: RoboBallet Framework
- RoboBallet, a framework developed by engineers at University College London and Google DeepMind, combines Graph Neural Networks (GNNs) and Reinforcement Learning (RL) to solve multi-robot coordination automatically [4][5].
- The framework represents the collaborative scene as a dynamic graph in which nodes represent individual robots and edges denote their interactions based on spatial proximity (a toy graph-and-policy sketch follows this entry) [5].
- GNNs process this structured information efficiently, allowing the model to generalize to unseen configurations of obstacles and tasks [5].

Group 3: Training and Performance
- RoboBallet trains its policy network with a fine-tuned TD3 algorithm, generating multi-robot trajectories that jointly address task assignment, scheduling, and motion planning [7].
- The reward combines task-completion rewards with collision penalties, promoting efficient execution while avoiding collisions [7].
- The model is trained in randomly generated environments, learning effective coordination strategies over millions of interactions [7][9].

Group 4: Computational Efficiency and Scalability
- RoboBallet completes each planning step in roughly 0.3 milliseconds even at its maximum configuration of 8 robots, 40 tasks, and 30 obstacles [8].
- Inference time scales linearly with the numbers of robots, tasks, and obstacles, making the framework feasible for real-time use [11].
- Adding robots substantially improves task-execution efficiency: increasing from 4 to 8 robots cuts average execution time from 7.5 to 4.5 seconds, a 40% reduction [12].
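To make the graph formulation concrete, here is a minimal PyTorch sketch of encoding a scene as a proximity graph and scoring it with one round of message passing feeding a TD3-style actor head. The feature layout, proximity threshold, and network sizes are illustrative assumptions, not RoboBallet's actual architecture.

```python
# Illustrative sketch of a GNN policy over a proximity graph of robots.
# Feature layouts, the radius, and the network are assumptions.
import torch
import torch.nn as nn

def build_edges(positions: torch.Tensor, radius: float = 1.5) -> torch.Tensor:
    """Connect entities whose positions lie within `radius` of each other."""
    d = torch.cdist(positions, positions)          # pairwise distances
    src, dst = torch.nonzero(d < radius, as_tuple=True)
    mask = src != dst                              # drop self-loops
    return torch.stack([src[mask], dst[mask]])     # shape: (2, num_edges)

class GNNPolicy(nn.Module):
    """One round of message passing, then a per-robot action head (actor)."""
    def __init__(self, feat_dim: int = 8, hidden: int = 64, act_dim: int = 7):
        super().__init__()
        self.msg = nn.Sequential(nn.Linear(2 * feat_dim, hidden), nn.ReLU())
        self.update = nn.Sequential(nn.Linear(feat_dim + hidden, hidden), nn.ReLU())
        self.actor = nn.Linear(hidden, act_dim)    # e.g. joint velocities

    def forward(self, x: torch.Tensor, edges: torch.Tensor) -> torch.Tensor:
        src, dst = edges
        messages = self.msg(torch.cat([x[src], x[dst]], dim=-1))
        agg = torch.zeros(x.size(0), messages.size(-1)).index_add_(0, dst, messages)
        h = self.update(torch.cat([x, agg], dim=-1))
        return torch.tanh(self.actor(h))           # bounded continuous actions

# Toy scene: 8 robots as nodes with 8-d features (first 3 dims = position).
feats = torch.randn(8, 8)
edges = build_edges(feats[:, :3])
actions = GNNPolicy()(feats, edges)                # (8, 7) action tensor
```

In a TD3 training loop, the reward fed back for each trajectory would combine task-completion bonuses and collision penalties, matching the reward structure described above.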
X @s4mmy
s4mmy· 2025-09-15 15:49
Disclaimer: I hold investments & have existing partnerships with some of the Agents/protocols mentioned above. https://t.co/MdbljS5vvu

s4mmy (@S4mmyEth): Meta just revealed a 25x faster method of training AI using Reinforcement Learning (RL). AI and Robotics will continue to gain traction as models evolve. Here's the roundup for the DeAI/DePAI segment this week 🧵(1/9) https://t.co/nICwXpLNNB ...
Alphabet's Isomorphic Labs: Turning Cancer Into a Chronic, But Livable Disease
YouTube· 2025-09-14 06:00
Core Insights
- The company is developing a drug design engine that uses advanced AI models to generate new molecule designs across diseases and modalities, substantially improving the drug discovery process [2][3][10].
- The approach pairs generative AI with predictive capabilities for understanding protein structures and interactions, aiming to improve the efficacy and safety of drug candidates [5][6][12].
- The focus is on generalizability, allowing the models to be applied across targets and disease areas, a more ambitious and challenging goal than traditional drug design methods [27][30][54].

Group 1
- The drug design engine incorporates multiple AI models, including models for predicting protein structures and binding affinities, to streamline drug development [3][4][6].
- Traditional drug design is iterative and time-consuming, often taking weeks or months per molecule; the new approach enables virtual testing and rapid iteration (an abstract loop is sketched after this entry) [8][10].
- The company aims to shorten the drug discovery timeline significantly, potentially reaching experimental-level accuracy in its predictions and minimizing reliance on physical lab work [47][49].

Group 2
- The focus on immunology and oncology is strategic: these areas carry significant clinical impact and allow for more tractable clinical trials [33][34].
- The company is making progress in identifying novel chemical matter for previously challenging targets, demonstrating the effectiveness of its AI-driven approach [44][45].
- The ambition is a generalizable technology reusable across drug design campaigns, which is rare in the biotech industry [54][55].

Group 3
- The company is partnering with major pharmaceutical firms such as Novartis and Eli Lilly to leverage their expertise and accelerate drug discovery [43][44].
- The models can analyze entire families of proteins, enabling a comprehensive understanding of molecular interactions that traditional methods cannot achieve [39][40].
- The long-term vision includes AI tools that assist in diagnosing and treating disease, potentially transforming how patients interact with healthcare [50][51].
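The virtual generate-and-test cycle described above can be sketched abstractly. The helpers and numbers below are hypothetical stand-ins, not Isomorphic's actual engine; the sketch only shows why in-silico iteration is faster than a weeks-per-molecule lab loop.

```python
# Toy illustration of a generate-then-predict design loop.
# `propose_variants` and `predict_affinity` are hypothetical stand-ins.
import random

def propose_variants(seed: str, n: int = 50) -> list[str]:
    """Stand-in generative model: derive n candidate molecules from a seed."""
    return [f"{seed}/v{random.randint(0, 9999)}" for _ in range(n)]

def predict_affinity(molecule: str) -> float:
    """Stand-in predictive model: predicted binding affinity, higher is better."""
    return random.random()

best, best_score = "seed-molecule", 0.0
for _ in range(10):                        # ten virtual iterations, no lab work
    for cand in propose_variants(best):
        score = predict_affinity(cand)
        if score > best_score:             # keep the strongest predicted binder
            best, best_score = cand, score
print("final candidate:", best, round(best_score, 3))
```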
Meta Superintelligence Labs' New Paper Mired in Controversy! Accused of Ignoring Extensive Prior Work
量子位· 2025-09-12 00:59
Meta Superintelligence Labs (MSL) has once again been thrust into controversy. This time it is not a personnel shakeup: their second paper, "Language Self-Play For Data-Free Training", has been criticized for overlooking prior research and lacking novelty.

So what is the paper about?

Letting the model learn through self-play. In short, the core idea of MSL's new paper is a Language Self-Play (LSP) method that lets a large language model improve itself without any additional training data. The method targets a real constraint: today's large language models depend heavily on large volumes of high-quality training data, and that data is finite.

To this end, LSP frames the model's learning as a game in which one and the same language model plays two adversarial roles, enabling data-free training.

Specifically, the two roles are a Challenger and a Solver. During play, the Challenger generates ever trickier questions or instructions to drive down the Solver's expected return, while the Solver must work to understand and answer those instructions to maximize its own return. This is the familiar minimax game. Through such adversarial training, the model keeps improving as the game goes on, steadily raising its capability. Moreover, unlike traditional adversarial training, LSP lets ...
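In outline, the minimax loop is easy to sketch. The following simplified Python illustration uses hypothetical `generate` and `score` stand-ins for the real LLM and reward computation; it shows only the zero-sum bookkeeping between the two roles, not the paper's actual training procedure.

```python
# Simplified illustration of a Language Self-Play (LSP) style minimax loop.
# `generate` and `score` are hypothetical stand-ins for a real LLM and reward.
import random

def generate(model: dict, role: str) -> str:
    """Stand-in for LLM sampling: one model, steered by a role prompt."""
    return f"[{model['name']} as {role}] output#{random.randint(0, 999)}"

def score(instruction: str, answer: str) -> float:
    """Stand-in for judging how well the answer solves the instruction."""
    return random.random()

model = {"name": "lm"}
for step in range(3):
    instruction = generate(model, "challenger")   # same model, Challenger role
    answer = generate(model, "solver")            # same model, Solver role
    r = score(instruction, answer)
    solver_objective = r            # the Solver maximizes the return
    challenger_objective = -r       # the Challenger minimizes it (zero-sum)
    # A real implementation would apply policy-gradient updates here.
    print(step, round(solver_objective, 3), round(challenger_objective, 3))
```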