Major news! Liang Wenfeng's DeepSeek paper lands on the cover of Nature, directly addressing the distillation allegations
程序员的那些事· 2025-09-20 01:10
On September 18, the DeepSeek-R1 reasoning-model research paper, completed jointly by the DeepSeek team with Liang Wenfeng as corresponding author, appeared on the cover of the authoritative international journal Nature. Compared with the initial DeepSeek-R1 paper released in January this year, this version discloses more details of the model's training and directly addresses the distillation allegations raised when the model was first released. DeepSeek-R1 is the world's first mainstream large language model to undergo peer review; almost all mainstream large models have yet to go through independent peer review, a gap that "has finally been broken by DeepSeek."

Nature's cover introduction reads: "Large models tend to solve problems better if they are trained to plan the steps needed to solve them. This kind of 'reasoning' resembles how humans work through harder problems, but it poses great challenges for artificial intelligence and has required human intervention to add labels and annotations. In this week's issue, DeepSeek researchers reveal how they were able to train a model to reason with minimal human input.

The DeepSeek-R1 model is trained with reinforcement learning: it receives a high reward when it solves a math problem correctly and is penalized when it answers incorrectly. As a result, it learned to reason, working through problems step by step and revealing those steps, and became more likely to arrive at correct answers. This makes DeepSeek ...
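The reward scheme Nature's introduction describes (a high reward for a verifiably correct final answer, a penalty otherwise) can be sketched as a rule-based verifier. This is a minimal illustration, not DeepSeek's actual implementation; the `\boxed{...}` answer convention and the function names are assumptions.

```python
import re

def extract_final_answer(completion: str):
    # Assumption: the model is prompted to put its final answer inside
    # \boxed{...}, a common convention on math benchmarks.
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    return match.group(1).strip() if match else None

def rule_based_reward(completion: str, reference_answer: str) -> float:
    """Return +1.0 for a verifiably correct answer, -1.0 otherwise."""
    answer = extract_final_answer(completion)
    if answer is None:
        return -1.0  # no parseable answer: scored as wrong
    return 1.0 if answer == reference_answer else -1.0

# Usage: score a sampled completion against the ground-truth answer.
reward = rule_based_reward(r"... step by step ... \boxed{42}", "42")
```

Because the reward depends only on the final answer, the model is free to discover its own intermediate reasoning steps, which is the point the cover introduction makes about minimal human input.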
攻克大模型训推差异难题,蚂蚁开源新一代推理模型Ring-flash-2.0
机器之心· 2025-09-19 10:43
Core Viewpoint
- The article discusses the release of Ring-flash-2.0 by Ant Group's Bailing team, highlighting its potential to reshape the competitive landscape of large models by achieving high performance with fewer activated parameters and improved training stability [1][4][26].

Performance Overview
- Ring-flash-2.0 has 100 billion total parameters with only 6.1 billion activated, scoring 86.98 on the AIME math benchmark and 90.23 Elo on CodeForces, with throughput above 200 tokens per second [1][21].
- The model's performance is comparable to state-of-the-art (SOTA) 40-billion-parameter dense models, demonstrating significant advances on reasoning tasks [1][21].

Technical Innovations
- The icepop algorithm enables stable long-horizon reinforcement learning (RL) training by freezing tokens with large training-inference probability discrepancies, preventing gradient backpropagation through them [6][10][13].
- The two-stage RL approach combines supervised fine-tuning (SFT) with reinforcement learning from verifiable rewards (RLVR) and from human feedback (RLHF), optimizing the training pipeline [14][16].

Cost Efficiency
- Ring-flash-2.0 matches the performance of a 40-billion-parameter dense model while activating only 6.1 billion parameters, marking a turning point in cost efficiency in the large-model competition [17][21].
- Its high-sparsity, low-activation design significantly reduces inference costs in high-concurrency scenarios [21].

Market Implications
- The large-model competition is shifting from parameter count to cost-effectiveness, with Ring-flash-2.0 positioned as a leading solution in this new era [18][25].
- The article suggests Ring-flash-2.0 may mark the beginning of a "high cost-performance era" for large models, following the advances initiated by GPT-4 [26].
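The icepop mechanism described above, freezing tokens whose training-time and inference-time probabilities diverge too far so they contribute no gradient, can be sketched as a masked policy-gradient loss. The threshold value and variable names below are illustrative assumptions, not the Bailing team's actual code.

```python
def icepop_mask(train_logprobs, infer_logprobs, threshold=2.0):
    """Return 1.0 for tokens kept in the RL update, 0.0 for frozen tokens
    whose training/inference log-prob gap exceeds the threshold."""
    return [1.0 if abs(t - i) < threshold else 0.0
            for t, i in zip(train_logprobs, infer_logprobs)]

def masked_policy_loss(train_logprobs, infer_logprobs, advantages, threshold=2.0):
    """Policy-gradient loss where frozen tokens contribute exactly zero,
    so no gradient flows through the train/infer mismatch."""
    mask = icepop_mask(train_logprobs, infer_logprobs, threshold)
    kept = sum(mask) or 1.0  # avoid dividing by zero if everything froze
    return -sum(a * lp * m
                for a, lp, m in zip(advantages, train_logprobs, mask)) / kept
```

In a real framework the mask would be applied to autograd tensors (e.g. detached in PyTorch) so the frozen tokens receive zero gradient rather than merely zero loss.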
具身的这几个方向,组成了所谓的大小脑算法
具身智能之心· 2025-09-19 00:03
Core Viewpoint
- The article discusses the evolution and current trends in embodied intelligence technology, emphasizing the integration of various models and techniques to enhance robotic capabilities in real-world environments [3][10].

Group 1: Technology Development Stages
- The development of embodied intelligence has progressed through several stages, from grasp-pose detection to behavior cloning, and now to diffusion policies and VLA models [7][10].
- The first stage focused on static object grasping with limited decision-making capability [7].
- The second stage introduced behavior cloning, allowing robots to learn from expert demonstrations, but struggled with generalization and error accumulation [7].
- The third stage, marked by the introduction of diffusion-policy methods, improved stability and generalization by modeling whole action sequences [8].
- The fourth stage, beginning in 2025, explores integrating VLA models with reinforcement learning and world models to enhance predictive capability and multi-modal perception [9][10].

Group 2: Key Technologies and Techniques
- Key technologies in embodied intelligence include VLA, diffusion policy, and reinforcement learning, which collectively enhance robots' task execution and adaptability [5][10].
- VLA models combine visual perception, language understanding, and action generation, enabling robots to interpret human commands and perform complex tasks [8].
- Integrating tactile sensing with VLA models expands robots' sensory capabilities, allowing more precise operation in unstructured environments [10].

Group 3: Industry Implications and Opportunities
- Advances in embodied intelligence are increasing demand for engineering and systems capability as the field transitions from theoretical research to practical deployment [10][14].
- There is growing interest in training and deploying models such as diffusion policies and VLA on platforms like Mujoco and IsaacGym [14].
- The industry is seeing a surge in job opportunities and research interest, prompting many professionals to shift their focus toward embodied intelligence [10].
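The diffusion-policy idea from the third stage, treating an action sequence as a noisy sample to be iteratively denoised conditioned on observations, can be sketched as a DDPM-style sampling loop. The linear noise schedule and the stand-in noise model below are placeholder assumptions, not any particular robot stack.

```python
import random

def sample_action_sequence(eps_model, obs, horizon=8, action_dim=2, steps=10):
    """DDPM-style sampling sketch: start from Gaussian noise over the whole
    action horizon and iteratively denoise it into an action sequence."""
    actions = [[random.gauss(0.0, 1.0) for _ in range(action_dim)]
               for _ in range(horizon)]
    for t in reversed(range(steps)):
        noise_scale = (t + 1) / steps          # toy linear schedule (an assumption)
        eps = eps_model(actions, obs, t)       # predicted noise, same shape as actions
        actions = [[a - noise_scale * e for a, e in zip(row, erow)]
                   for row, erow in zip(actions, eps)]
    return actions

# Usage with a trivial stand-in for the learned noise-prediction network:
zero_eps = lambda acts, obs, t: [[0.0] * len(r) for r in acts]
plan = sample_action_sequence(zero_eps, obs={"image": None})
```

Modeling the whole horizon at once, rather than one action at a time, is what gives diffusion policies the stability and multimodality the article credits them with.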
New vision-only SOTA! AdaThinkDrive: a more flexible chain-of-thought VLA for autonomous driving (Tsinghua & Xiaomi)
自动驾驶之心· 2025-09-18 23:33
Core Viewpoint
- The article discusses the limitations of existing Chain-of-Thought (CoT) reasoning in Vision-Language-Action (VLA) models for autonomous driving: in simple scenarios it does not improve decision quality and adds unnecessary computational overhead. It introduces AdaThinkDrive, a new VLA framework with a dual-mode reasoning mechanism inspired by "fast and slow thinking" theory, allowing the model to adaptively choose when to reason based on scene complexity [3][4][10].

Group 1: Introduction and Background
- The shift from traditional modular pipelines to end-to-end architectures in autonomous driving is highlighted: modular methods offer flexibility but lose information between components, leading to cumulative errors in complex scenarios, while end-to-end methods mitigate this issue yet remain limited by their reliance on supervised data [7].
- Current VLA methods fall into two paradigms: meta-action methods that provide high-level guidance, and planning-based methods that predict trajectories directly from raw inputs. CoT techniques are increasingly applied, particularly in complex scenarios, but their effectiveness in simple scenarios is questioned [14][15].

Group 2: AdaThinkDrive Framework
- AdaThinkDrive is an end-to-end VLA framework with a "fast answer / slow thinking" mechanism that switches adaptively between direct prediction and explicit reasoning according to scene complexity, realized through a three-stage adaptive reasoning strategy [11][18].
- Extensive experiments on the Navsim benchmark validate the framework: it achieves a Predictive Driver Model Score (PDMS) of 90.3, 1.7 points higher than the best vision-only baseline. The model shows strong adaptive reasoning, selectively enabling CoT in 96% of complex scenarios and defaulting to direct trajectory prediction in 84% of simple ones [4][18][50].

Group 3: Experimental Results and Analysis
- AdaThinkDrive outperforms both "always think" and "never think" baselines, with PDMS gains of 2.0 and 1.4 points respectively, while cutting reasoning time by 14% versus the "always think" baseline, striking a balance between accuracy and efficiency [4][18][58].
- The results indicate that the optimal reasoning strategy is not universal but depends on scene complexity, emphasizing the need for models that enable reasoning adaptively based on context [10][18].

Group 4: Conclusion
- Reasoning in simple scenarios often raises computational cost without improving decision quality. AdaThinkDrive addresses this by letting the agent learn when to think, guided by an adaptive thinking reward mechanism. Experimental results on the NAVSIM benchmark show state-of-the-art performance, underscoring the importance of adaptive thinking for accurate and efficient decision-making in autonomous driving systems [66].
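The adaptive thinking reward described in the conclusion, rewarding the agent for skipping CoT in simple scenes and invoking it in complex ones, could take a shape like the sketch below. The weights and the binary "scene is complex" signal are illustrative assumptions, not the paper's exact formulation.

```python
def adaptive_thinking_reward(trajectory_score: float, used_cot: bool,
                             scene_is_complex: bool, think_penalty: float = 0.1,
                             mode_bonus: float = 0.2) -> float:
    """Combine driving quality with a bonus for picking the right reasoning
    mode and a small flat cost for reasoning, to discourage needless CoT."""
    reward = trajectory_score
    if used_cot == scene_is_complex:       # slow thinking on hard scenes,
        reward += mode_bonus               # fast answers on easy ones
    if used_cot:
        reward -= think_penalty            # reasoning carries a latency cost
    return reward
```

Under this shaping, "always think" pays the latency penalty on every easy scene and "never think" forfeits the mode bonus on hard ones, which mirrors why both baselines underperform in the article's results.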
Chinese scholars publish 11 Nature papers in a single day
生物世界· 2025-09-18 10:05
Core Insights
- On September 17, 2025, a total of 24 papers were published in the prestigious journal Nature, with 10 of them authored by Chinese scholars, highlighting the significant contribution of Chinese researchers to global scientific advancement [2][5][7][9][12][14][16][18][21].

Group 1: Research Contributions
- "Toughened self-assembled monolayers for durable perovskite solar cells" was co-authored by scholars from City University of Hong Kong and the Chinese Academy of Sciences, focusing on enhancing the durability of perovskite solar cells [2].
- "A movable long-term implantable soft microfibre for dynamic bioelectronics" was published by researchers from the Chinese Academy of Sciences and Donghua University, contributing to the field of bioelectronics [5].
- "Atomic-scale imaging of frequency-dependent phonon anisotropy" was authored by researchers from the University of California, Irvine, providing insight into phonon behavior at the atomic level [7].
- "Covariation mass spectrometry uncovers a protein that controls cysteine catabolism" was led by a researcher from the Dana-Farber Cancer Institute, revealing important findings in protein metabolism [9].
- "A room temperature rechargeable all-solid-state hydride ion battery" was published by scholars from the Dalian Institute of Chemical Physics, advancing battery technology [12].
- "High-density soft bioelectronic fibres for multimodal sensing and stimulation" was authored by researchers from Stanford University, contributing to the development of bioelectronic devices [14].
- "DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning" was published by DeepSeek, exploring advances in large language models [16].
- "Structural basis for mTORC1 activation on the lysosomal membrane" was authored by researchers from the University of California, Berkeley, providing insight into cellular signaling mechanisms [17].
- "Peroxisomal metabolism of branched fatty acids regulates energy homeostasis" was published by scholars from Washington University in St. Louis, advancing the understanding of metabolic processes [18].
- "Delta-type glutamate receptors are ligand-gated ion channels" was published by Johns Hopkins University, enhancing knowledge in neurobiology [21].
DeepSeek responds for the first time to the "distilling OpenAI" allegations
第一财经· 2025-09-18 05:34
Core Viewpoint
- DeepSeek's R1 model has gained significant attention after being published in the prestigious journal Nature, showcasing its ability to enhance reasoning capabilities through reinforcement learning without relying heavily on supervised data [3][11].

Group 1: Model Development and Training
- Training DeepSeek-R1 cost approximately $294,000 in total: $202,000 for R1-Zero training, $10,000 for SFT dataset creation, and $82,000 for R1 training [10].
- DeepSeek-R1 was trained on 64×8 H800 GPUs, taking about 198 hours for R1-Zero and around 80 hours for R1 [10].
- Even including the earlier V3 model, total training cost remains far below competitors': around $6 million for V3 plus $294,000 for R1 [10].

Group 2: Model Performance and Validation
- DeepSeek's approach achieves significant gains in reasoning capability through large-scale reinforcement learning, even without supervised fine-tuning [13].
- The model's ability to self-validate and reflect on its answers enhances its performance on complex programming and scientific problems [13].
- R1 has become the most popular open-source reasoning model worldwide, with over 10.9 million downloads on Hugging Face [10].

Group 3: Industry Impact and Peer Review
- The publication of R1 in Nature sets a precedent for transparency in AI research, addressing concerns about the reliability of benchmark tests and the potential for manipulation [11].
- The research emphasizes the importance of independent peer review in validating AI capabilities, which is crucial in an industry facing scrutiny over performance claims [11].
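The reported cost breakdown can be sanity-checked with simple arithmetic (figures as stated in the article):

```python
# Cost figures (USD) as reported in the article.
costs_usd = {
    "R1-Zero training": 202_000,
    "SFT dataset creation": 10_000,
    "R1 training": 82_000,
}
total = sum(costs_usd.values())
print(total)  # 294000, matching the reported $294,000 total for R1
```

The three components sum exactly to the headline $294,000 figure, which sits on top of the roughly $6 million already spent on the V3 base model.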
DeepSeek responds for the first time to the "distilling OpenAI" allegations
Di Yi Cai Jing· 2025-09-18 04:34
Core Insights
- DeepSeek's research paper, "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning," has been published in the prestigious journal Nature, highlighting significant advances in AI reasoning capability [1][11].

Group 1: Research and Development
- The initial version of DeepSeek's paper was released on arXiv in January; the Nature publication adds more detailed model specifications and tones down anthropomorphic descriptions [5].
- DeepSeek-R1's training cost was reported as $294,000, broken down into $202,000 for DeepSeek-R1-Zero training, $10,000 for SFT dataset creation, and $82,000 for R1 training [9].
- Training used A100 GPUs for preliminary experiments on smaller models before scaling up to the 660-billion-parameter R1, demonstrating a scalable approach to model development [8][10].

Group 2: Model Performance and Validation
- DeepSeek-R1 has become the most popular open-source reasoning model worldwide, with over 10.9 million downloads on Hugging Face, and is the first mainstream large language model to undergo peer review [11].
- The research shows that significant reasoning capability can be achieved through reinforcement learning without heavy reliance on supervised fine-tuning, a departure from traditional methods [13].
- Training used a reward mechanism that encourages correct reasoning, allowing the model to self-validate and improve its performance on complex tasks [13].

Group 3: Industry Implications
- The findings could set a precedent for future AI model development, particularly for enhancing reasoning capability without extensive data requirements [11][13].
- The independent peer-review process adds credibility to the model's performance claims, addressing concerns about potential manipulation in AI benchmarking [11].
The No. 1 open-source agent model is now Alibaba's Tongyi DeepResearch
量子位· 2025-09-18 04:20
Core Viewpoint
- Alibaba has open-sourced its first deep-research agent model, Tongyi DeepResearch, which outperforms existing models such as OpenAI's Deep Research and DeepSeek-V3.1 on multiple authoritative evaluation sets [1][3].

Data Strategy
- The model's capability gains are attributed to a multi-stage data strategy designed to generate high-quality training data without relying on expensive manual annotation [4][5].
- The team introduced Agentic CPT for incremental pre-training, laying a solid foundation for the agent [6].
- A systematic, scalable data-synthesis scheme was developed to create a positive feedback loop for data generation [7].

Data Construction
- An open-world knowledge memory was built from a wide range of knowledge documents, web-crawl data, knowledge graphs, and trajectory data from post-training [8].
- Three types of action data were created from diverse question styles and historical trajectory data, enabling extensive exploration of the reasoning-action space [9].

Post-training Data
- The team developed a fully automated synthetic-data pipeline that produces datasets surpassing the quality of manual annotation [11][12].
- A new process extracts information from real website data, preserving the authenticity of data structures while increasing question complexity [14].

Reasoning Modes
- Tongyi DeepResearch offers both a native ReAct Mode and a Heavy Mode for handling complex multi-step research tasks [15][18].
- The IterResearch paradigm decomposes a task into a series of research rounds, allowing the agent to maintain cognitive focus and high-quality reasoning [20].

Training Process
- The training process was redesigned to connect Agentic CPT, Agentic SFT, and Agentic RL, establishing a new paradigm for agent-model training [25][27].
- The team emphasized that data quality and training-environment stability matter more than algorithmic factors for the success of reinforcement-learning projects [37][39].

Application Deployment
- Tongyi DeepResearch already powers multiple internal Alibaba applications, including the Gaode travel agent, which integrates complex query capabilities into its app [42][43].
- A simulated training environment was created to address the high cost and inconsistency of developing against real-time web APIs [44].

Legal AI Application
- Tongyi FaRui, a legal AI agent, aims to provide professional legal services, leveraging innovative agent architecture and iterative planning technology for complex reasoning tasks [46].
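The native ReAct mode mentioned above, alternating model "actions" (tool calls) with observations until a final answer is produced, can be sketched as a minimal loop. The dict-based action format and the scripted stand-in LLM are assumptions for illustration, not Tongyi DeepResearch's actual interface.

```python
def react_loop(llm, tools, question, max_rounds=8):
    """Minimal ReAct sketch: the model emits either a tool call or a final
    answer; each observation is appended back into the running context."""
    context = f"Question: {question}\n"
    for _ in range(max_rounds):
        step = llm(context)                 # e.g. {"action": "search", "input": "..."}
        if step["action"] == "finish":
            return step["input"]            # final answer
        observation = tools[step["action"]](step["input"])
        context += f"Action: {step['action']}({step['input']})\nObservation: {observation}\n"
    return None  # round budget exhausted without an answer

# Usage with a scripted stand-in for the LLM (illustrative only):
def scripted_llm(context):
    if "Observation" not in context:
        return {"action": "search", "input": "Tongyi DeepResearch"}
    return {"action": "finish", "input": "summary of findings"}

answer = react_loop(scripted_llm,
                    {"search": lambda q: f"results for {q}"},
                    "What is Tongyi DeepResearch?")
```

The IterResearch paradigm differs from this single growing context by restarting each round with a distilled workspace, which is how it keeps cognitive focus over long tasks.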
"This gap has finally been broken": Liang Wenfeng's paper makes the cover of Nature
Guan Cha Zhe Wang· 2025-09-18 03:27
Science and Technology Daily reported that the research, in which Liang Wenfeng participated, shows that the reasoning ability of large language models can be improved through pure reinforcement learning, reducing the human input needed to enhance performance. The resulting model performs better than conventionally trained large language models on tasks such as mathematics and graduate-level STEM problems.

DeepSeek-R1 includes an in-depth training stage under human supervision to optimize the reasoning process. Liang Wenfeng's team reports that the model used reinforcement learning rather than human examples to develop its reasoning steps, reducing training cost and complexity. After being shown high-quality problem-solving examples, DeepSeek-R1 receives a template for producing a reasoning process; the model is rewarded for solving problems, which reinforces the learning effect. Both DeepSeek-R1-Zero and DeepSeek-R1 performed very well across tests evaluating AI performance.

According to a Zhitong Finance report on September 18, the DeepSeek-R1 reasoning-model research paper, completed jointly by the DeepSeek team with Liang Wenfeng as corresponding author, appeared on the cover of the authoritative international journal Nature. Compared with the initial DeepSeek-R1 paper released in January this year, this version discloses more details of the model's training and directly addresses the distillation allegations raised at the model's release. DeepSeek-R1 is also the world's first mainstream large language model to undergo peer review. Nature commented: currently almost ...
DeepSeek paper makes the cover of Nature; R1 becomes the first large model to undergo rigorous academic review
Xin Lang Cai Jing· 2025-09-18 02:23
Core Insights
- DeepSeek's R1 model has been recognized as the first major language model to be peer-reviewed and published in the prestigious journal Nature, marking a significant milestone in AI research [1][2].
- R1 has surpassed 10.9 million downloads on Hugging Face, making it the most popular open-source reasoning model worldwide [2].
- DeepSeek's innovative approach uses pure reinforcement learning to build reasoning capability, diverging from traditional human-imitation methods [2][3].

Company Developments
- R1 was trained for only $294,000, far below the millions that training comparable models at OpenAI and Google can cost [2].
- The company released an upgraded version, DeepSeek-V3.1, featuring a hybrid reasoning architecture, improved thinking efficiency, and enhanced agent capabilities [3].
- DeepSeek was founded in 2023 in Hangzhou, backed by the quantitative fund High-Flyer, with a team drawn from top universities and international institutions [3].

Industry Context
- The publication of DeepSeek's research is seen as a critical step in addressing rampant speculation and unverified claims within the AI industry, emphasizing the importance of independent peer review [3].
- Nature's recognition of DeepSeek's work highlights China's advances in foundational large-model research, contributing to the global AI landscape [2].