Reinforcement Learning
X @Herbert Ong
Herbert Ong· 2025-09-29 12:48
FSD (Full Self-Driving) Version 14 Key Features
- The Version 14 model is 10x larger than Version 13's, leveraging more data and more reinforcement learning (RL) use cases [1]
- Version 14 operates in supervised, unsupervised, and unsupervised Robotaxi modes, all controlled in software [2][5]
- Version 14 will only run on HW4 vehicles [3]

FSD Version 14 Iterative Improvements
- Version 14.1 incorporates fine-tuning based on an expanded range of RL use cases, likely drawn from Robotaxi operations [2]
- Version 14.2 further expands RL use cases, potentially yielding a more "sentient" driving experience [3]

Robotaxi Implications
- Version 14.2 may enable removal of the Robotaxi safety observer, subject to regulatory requirements [4]
- Modes will switch dynamically based on location, with user prompts indicating the change [5]
Z Event | SF Tech Week Oct 8 Silicon Valley In-Person Meetup: Why Now? The Turning Point and Future of RL
Z Potentials· 2025-09-28 14:29
Core Insights
- Reinforcement Learning (RL) is transitioning from a niche area to a critical component in advancing reasoning, decision-making, and complex scene interactions, especially as developments in Large Language Models (LLMs) reach a bottleneck [3]

Group 1: Event Overview
- An event is scheduled for October 8th at 6:30 PM in San Francisco, featuring top experts from academia, industry, and startups to discuss the future of RL [4]
- The event is organized by Z Potentials in collaboration with HatTrick Capital and Future Builderz, focusing on connecting researchers, founders, and investors [8][9]

Group 2: Featured Speakers
- Notable speakers include Zeng Dong, an Assistant Professor at UCSB and former NVIDIA AI Researcher, who specializes in RL and intelligent decision-making [6]
- Qifei Wang, Research Lead at DeepMind, is leading explorations at the intersection of RL and multimodal integration [6]
- Bill Zhu, CEO of Pokee AI and former head of Applied RL at Meta, is working on large-scale RL applications in products [6]
- Other speakers include Mike Cheng, Andy Lyu, Daanish Khazi, and Robi Lin, influential figures in the RL space who represent a blend of research and entrepreneurial efforts [7]
From Vibe Coding to Vibe Researching: OpenAI’s Mark Chen and Jakub Pachocki
a16z· 2025-09-25 13:00
Research & Development Focus
- OpenAI is targeting the production of an automated researcher to automate the discovery of new ideas, with a focus on economically relevant advancements [1][3]
- The company is extending the reasoning horizon of its models, aiming for them to operate autonomously for longer periods, with progress measured by performance in math and programming competitions [3]
- OpenAI is working to improve models' handling of more difficult and messy real-world coding environments, focusing on style, proactivity, and latency [12][13]

Model Capabilities & Advancements
- GPT-5 aims to bring reasoning into the mainstream, improving on previous models like o3 by delivering reasoning and more agentic behavior by default [1]
- The company has observed significant progress in models' ability to solve hard science problems, including instances of discovering non-trivial new mathematics [1]
- Reinforcement Learning (RL) continues to be a versatile method for continuous improvement, especially when combined with natural language modeling [4][5]

Talent & Culture
- OpenAI emphasizes fundamental research and innovation, discouraging copying and fostering a culture where researchers are inspired to discover new things [35][36]
- The company looks for individuals who have solved hard problems in any field, with strong technical fundamentals and the intent to work on ambitious challenges [40]
- OpenAI protects fundamental research by separating researchers focused on algorithmic advances from those focused on product, preserving space for long-term research questions [46][57]

Resource Allocation & Strategy
- OpenAI prioritizes core algorithmic advances over product research in compute allocation, while remaining flexible as needs change [59]
- The company believes compute remains a critical resource for advancing AI and does not expect to be data-constrained anytime soon [62][63]
- OpenAI acts from a place of strong belief in its long-term research program, not tying it too closely to short-term product reception [70]
X @Elon Musk
Elon Musk· 2025-09-19 13:48
1.21 Gigawatts of training compute! (Actually, slightly more.) SangBin Cho (@Saaaang94): We are hiring numerics / quantization experts to scale RL (with @sehoonkim418)! There will be lots of exciting challenges coming with Jax + Sglang + the first Gigawatt cluster in the world (with many hundreds of thousands of GB200/300)! ...
DeepSeek Founder Liang Wenfeng Responds to Doubts in Nature: R1 Really Cost $294,000 to Train
Xin Lang Cai Jing· 2025-09-19 00:03
Core Insights
- DeepSeek-R1 has made a significant impact in the AI field by being featured on the cover of Nature, highlighting its innovative approach to enhancing reasoning capabilities in large language models (LLMs) through reinforcement learning (RL) [1][3][5]

Group 1: Achievements and Recognition
- The paper "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" was published in January and has now been recognized on the cover of a leading journal, Nature [3]
- DeepSeek-R1 became the most popular model on Hugging Face after its open-source release, with over 10.9 million downloads [5]
- The training cost for DeepSeek-R1 was remarkably low at $294,000, significantly less than the costs incurred by competitors like OpenAI and Google [6][7]

Group 2: Training Methodology
- DeepSeek-R1 uses a novel RL framework that constrains only the task format and reward signals based on the correctness of the final answer, allowing reasoning capabilities to develop more organically [10]
- The model's reasoning accuracy improved dramatically from 15.6% to 77.9% during training, peaking at 86.7% when combined with "self-consistent decoding" techniques (a minimal majority-vote sketch follows this summary) [10]

Group 3: Self-Evolution and Advanced Strategies
- The model exhibited self-evolution behaviors, such as increasing the length of generated text and employing advanced reasoning strategies like self-reflection and systematic exploration of alternative solutions [12][14]
- A notable "Aha Moment" was observed when the model began using the word "wait" more frequently, indicating a shift in its reasoning approach [15][17]

Group 4: Future Development Plans
- To address DeepSeek-R1's limitations, a multi-stage refinement plan has been initiated: a cold start on high-quality conversational data, followed by multiple rounds of RL and supervised fine-tuning [18][19]
- The model's performance improved by 17%-25% on various benchmarks after this multi-stage training [21]

Group 5: Algorithm and Reward System
- DeepSeek employs the GRPO (Group Relative Policy Optimization) algorithm, which scores each sampled answer relative to the rest of its group rather than against a separately trained value estimate, reducing resource consumption while maintaining stability [23][24]
- A dual reward system combines rule-based rewards for reasoning tasks with model-based rewards for general tasks, keeping the model aligned with human preferences without eroding its reasoning capabilities [25][26]

Group 6: Challenges and Limitations
- Despite its advancements, DeepSeek-R1 struggles with structured outputs and tool usage and is sensitive to prompts, limiting its effectiveness in complex scenarios [35][37]
- Reward hacking remains a risk, particularly in subjective tasks, and could undermine performance if the reward signals are not robust [37]
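The "self-consistent decoding" that lifts accuracy from 77.9% to 86.7% in the summary above is commonly read as self-consistency: sample several reasoning chains per problem, then majority-vote the final answer. A minimal sketch of that voting step, assuming the sampling and answer-extraction stages happen upstream (the article does not spell out the exact procedure):

```python
from collections import Counter

def self_consistent_answer(final_answers):
    """Majority-vote over the final answers extracted from several
    independently sampled reasoning chains for the same problem."""
    votes = Counter(final_answers)
    answer, _count = votes.most_common(1)[0]
    return answer

# Hypothetical example: five sampled chains for one math problem,
# three of which agree on "42".
print(self_consistent_answer(["42", "41", "42", "42", "17"]))  # -> 42
```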
xAI's Colossus 2 – First Gigawatt Datacenter In The World, Unique RL Methodology, Capital Raise – SemiAnalysis
2025-09-18 13:09
Summary of xAI's Colossus 2 Conference Call

Company and Industry Overview
- The conference call focuses on xAI, a company involved in the development of advanced data centers, specifically the Colossus 2 project, positioned as the world's first gigawatt-scale data center [1][2][14]

Key Points and Arguments
1. **Colossus 2 Project Launch**: The Colossus 2 project was initiated on March 7, 2025, with the acquisition of a 1 million square foot warehouse in Memphis and adjacent sites totaling 100 acres [18]
2. **Cooling Capacity**: By August 22, 2025, xAI had installed 11 air-cooled chillers, providing approximately 200 MW of cooling capacity, sufficient to support around 110,000 GB200 NVL72 systems (a rough sanity check follows this summary) [18]
3. **Speed of Construction**: xAI completed the Colossus 2 project in six months, a significant reduction compared to the 15 months taken by competitors like Oracle, Crusoe, and OpenAI [19]
4. **Power Infrastructure**: The power infrastructure for Colossus 2 is being developed in Southaven, Mississippi, where xAI acquired a former Duke Energy power plant and received temporary approval to operate gas turbines [24][31]
5. **Partnership with Solaris Energy**: xAI has partnered with Solaris Energy Infrastructure, which owns a fleet of 100 MW gas turbines, to enhance power generation capabilities [33][34]
6. **Future Capacity Plans**: xAI aims to scale its power capacity beyond 1.5 GW, with plans to deploy additional turbines and infrastructure [40]
7. **Funding Needs**: Required capital expenditures for Colossus 2 are projected in the tens of billions of dollars, raising concerns about xAI's ability to generate meaningful external revenue [51]
8. **Middle East Expansion**: xAI is considering large-scale expansion in the Middle East, leveraging existing relationships with regional investors and potential funding sources [56][58]

Additional Important Insights
- **Technological Edge**: xAI is utilizing unique reinforcement learning methodologies that may allow it to surpass competitors like OpenAI and Anthropic in AI capabilities [14]
- **Internal Revenue Generation**: A significant portion of xAI's revenue may come from inter-company transfers, raising questions about the sustainability of its revenue model [67]
- **Investor Sentiment**: Justifying xAI's valuation, now nearing $200 billion, is challenging, especially in comparison to competitors like Anthropic [58]

This summary captures the critical aspects of xAI's Colossus 2 project and its strategic positioning within the data center and AI industry, highlighting both opportunities and challenges ahead.
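A rough sanity check on the cooling figure in point 2, assuming "systems" counts individual GB200 GPUs deployed in NVL72 racks (an interpretation, not stated in the call):

$$\frac{200\ \text{MW}}{110{,}000\ \text{GPUs}} \approx 1.8\ \text{kW per GPU}$$

That is broadly consistent with the roughly 120 kW commonly cited for a 72-GPU NVL72 rack (about 1.7 kW per GPU all-in), so the 200 MW cooling figure and the ~110,000-GPU count hang together under this reading.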
Just In: Liang Wenfeng Publishes in Nature
36Kr· 2025-09-18 10:18
Core Viewpoint
- DeepSeek's R1 reasoning model has achieved significant recognition by being published in the prestigious journal Nature, marking a milestone in AI research and transparency in the industry [4][22][36]

Group 1: Model Development and Achievements
- The DeepSeek-R1 model, developed by Liang Wenfeng's team, is the first mainstream large language model to undergo peer review, closing a significant gap in the AI industry [4][11][22]
- The model has become the most popular open-source reasoning model globally, with over 10.9 million downloads on Hugging Face [4]
- DeepSeek-R1's research addresses a major issue in AI, enhancing reasoning capabilities through reinforcement learning without relying on extensive human labeling [14][16]

Group 2: Transparency and Peer Review
- Nature's editorial highlights the importance of peer-reviewed publication in clarifying how large models work and ensuring their performance matches vendor claims [24][25][34]
- The peer review of DeepSeek-R1 involved eight external experts who provided over a hundred specific comments, enhancing the paper's clarity and credibility [26][29][34]
- DeepSeek's commitment to transparency is evident in its detailed disclosures about model training and safety assessments, which are crucial for mitigating risks associated with AI technologies [11][18][36]

Group 3: Safety and Data Integrity
- DeepSeek conducted a comprehensive safety evaluation of the R1 model, demonstrating superior safety compared to contemporaneous models [11][18]
- The model's training data underwent rigorous decontamination to prevent bias and ensure that evaluation results accurately reflect its problem-solving capabilities [17][20]
- While acknowledging potential contamination issues in some benchmark tests, DeepSeek has implemented external risk-control systems to enhance safety during deployment [18][19]

Group 4: Industry Impact and Future Directions
- DeepSeek's open-source model is positioned as a representative of domestic AI technology on the global stage, potentially setting a standard for research transparency in the AI industry [36]
- The call for more AI companies to submit their models for peer review reflects a growing recognition of the need for verified claims and enhanced credibility in AI research [36]
X @外汇交易员
外汇交易员· 2025-09-18 02:30
The DeepSeek-R1 paper has made the cover of Nature. The paper in question is "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning", which DeepSeek posted on arXiv in January this year, with Liang Wenfeng as corresponding author. Nature's editors argue that the peer-review model benefits the development of AI large language models: because benchmarks can be gamed, handing a model's design, methodology, and limitations to independent external experts for scrutiny effectively "squeezes out the water" and curbs hype in the AI industry. 🗒️ DeepSeek-R1 is regarded as the first large language model to pass peer review at an authoritative academic journal. ...
DeepSeek-R1 Makes the Cover of Nature: A Welcome Step Toward AI Transparency
36Kr· 2025-09-18 02:02
Core Insights
- The value of open-source artificial intelligence (AI) is gaining broader recognition, highlighted by the publication of the DeepSeek-R1 paper in the prestigious journal Nature, with founder Liang Wenfeng as the corresponding author [1][5]

Research Findings
- The research team hypothesized that human-defined reasoning patterns might limit model exploration, and that unrestricted reinforcement learning (RL) training could better stimulate the emergence of new reasoning capabilities in large language models (LLMs) [3][8]
- Experiments demonstrated that LLM reasoning can be enhanced through pure RL, reducing the need for human input and outperforming traditionally trained LLMs on tasks such as mathematics, programming competitions, and graduate-level STEM problems [3][9]

Model Evaluation
- Following its launch, DeepSeek-R1 received widespread acclaim from global developers, reaching 91.1k stars on GitHub [4]
- Nature's editorial recognized DeepSeek-R1 as the first mainstream LLM published after peer review, marking a significant step towards transparency in AI [5][17]
- The editorial emphasized the importance of peer-reviewed publication in clarifying how LLMs operate and in assessing whether their claimed performance holds up [6][17]

Methodology
- The research introduced a new paradigm within the RL framework, minimizing reliance on human-annotated reasoning processes and exploring the potential for LLMs to develop reasoning capabilities through self-evolution [9][10]
- The team proposed an RL algorithm called "Group Relative Policy Optimization" (GRPO) and trained various models, including DeepSeek-R1-Zero and DeepSeek-R1, on top of the foundational model DeepSeek-V3 Base (a minimal sketch of GRPO's core computation follows this summary) [10][12]

Training Phases
- The training process involved multiple stages, with each subsequent model improving on the previous one in reasoning and instruction-following capabilities [14]
- DeepSeek-R1 demonstrated strong reasoning aligned with human preferences, achieving superior performance across 21 mainstream benchmarks and validating the effectiveness of the RL framework [15][16]

Industry Implications
- The editorial raised concerns about the lack of independent peer review for many widely used LLMs, highlighting the need for transparency and accountability in the AI industry [17][18]
- Nature called for more AI companies to submit their models for publication review, emphasizing that peer review can enhance trust and credibility in AI research [18][19]
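GRPO's central trick, per the summary above, is to score each sampled answer relative to the rest of its own group, which removes the need for a separately trained value (critic) network. A minimal sketch of that group-relative advantage computation; the clipped policy-gradient update and KL regularization that consume these advantages are omitted:

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Normalize each answer's reward against its own group's mean and
    standard deviation, yielding the group-relative advantage that
    GRPO feeds into its policy-gradient update."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Hypothetical group of 8 sampled answers to one prompt, rewarded 1.0
# if the final answer was correct and 0.0 otherwise.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]))
# Correct answers receive positive advantage, incorrect ones negative.
```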
DeepSeek Makes the Cover of Nature: Liang Wenfeng's Team Responds to Doubts; R1 Really Cost $294,000 to Train
36Kr· 2025-09-18 01:32
Core Insights
- The paper "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" has gained significant recognition, being featured on the cover of the leading journal Nature [2][4]
- DeepSeek-R1 is the first mainstream large language model (LLM) to undergo a peer review process, setting a precedent for transparency in AI development [7]

Model Performance and Popularity
- After its open-source release, DeepSeek-R1 became the most downloaded model on Hugging Face, surpassing 10.9 million downloads [4]
- The model demonstrated a remarkable improvement in reasoning capabilities, achieving an average problem-solving accuracy (pass@1) of 77.9%, and up to 86.7% with "self-consistent decoding" technology [10]

Training Costs and Efficiency
- The training cost for DeepSeek-R1 was reported at $294,000, significantly lower than the costs incurred by companies like OpenAI and Google [5][6]
- The training process involved 147,000 GPU hours, with costs broken down across the different training phases [6]

Innovative Training Approach
- DeepSeek-R1-Zero was developed by discarding human reasoning patterns entirely, utilizing a simplified reinforcement learning framework [8][10]
- Training constrained only two things: the task format and reward signals based on the correctness of the final answer [10]

Self-Evolution and Advanced Reasoning
- During training, the model exhibited self-evolution behaviors, increasing the length of text generated inside the "think" tag and developing advanced reasoning strategies [12][15]
- A notable "Aha Moment" was observed when the model began using the word "wait" more frequently, indicating a shift in its reasoning process [16][18]

Multi-Stage Training Process
- The training process consists of multiple stages: cold start, reinforcement learning, large-scale supervised fine-tuning, and a second round of reinforcement learning [19][20]
- Each stage is designed to enhance a different aspect of the model's capabilities, from initial fine-tuning to improving language consistency and general knowledge [20][35]

Reward System Design
- DeepSeek implemented a dual-track reward system, combining rule-based rewards for reasoning tasks with model-based rewards for general tasks (a minimal sketch of the rule-based track follows this summary) [27][30]
- The rule-based rewards focus on accuracy and format compliance, while the model-based rewards assess the usefulness and safety of the outputs [28][31]

Challenges and Future Directions
- Despite its advanced reasoning capabilities, DeepSeek-R1 faces limitations in structured outputs and tool usage, and it is sensitive to prompt variations [43]
- The reliance on reliable reward signals poses challenges, particularly for subjective tasks, which may lead to reward hacking [44]
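A minimal sketch of what the rule-based reward track might look like, assuming the "think"-tag output format the summary mentions; the exact matching rules, and the model-based usefulness/safety rewards, are not specified in the article, so the helpers below are illustrative:

```python
import re

def format_reward(completion: str) -> float:
    """1.0 if the model kept its reasoning inside <think>...</think>
    tags before stating an answer, else 0.0 (format compliance)."""
    return 1.0 if re.search(r"<think>.+?</think>", completion, re.S) else 0.0

def accuracy_reward(completion: str, reference: str) -> float:
    """1.0 if the text after the closing </think> tag contains the
    reference final answer, else 0.0 (correctness check)."""
    answer_part = completion.split("</think>")[-1]
    return 1.0 if reference in answer_part else 0.0

completion = "<think>2 + 2 = 4, and 4 * 2 = 8.</think> The answer is 8."
print(format_reward(completion), accuracy_reward(completion, "8"))  # 1.0 1.0
```

Because both checks are deterministic rules rather than learned models, they are hard to game, which is the article's stated reason for using rule-based rewards on reasoning tasks while reserving model-based rewards for open-ended ones.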