Reinforcement Learning
Ant's Ring-1T officially debuts: a trillion-parameter thinking model with math ability benchmarked against an IMO silver medal
机器之心· 2025-10-14 06:33
Core Insights
- Ant Group has launched the Ling-1T and Ring-1T models, marking significant advancements in open-source AI with capabilities comparable to closed-source giants [3][6][19]
- Ring-1T is the first open-source trillion-parameter reasoning model, showing exceptional performance across a range of benchmarks and tasks [6][9][19]

Model Launch and Performance
- Ant Group announced Ling-1T, its largest language model to date, on October 9; it passed a thousand downloads within four days of release [3][5]
- Ring-1T followed on October 14, demonstrating superior reasoning ability and notable results in international mathematics competitions [6][19]

Benchmark Testing
- Ring-1T was tested rigorously across eight critical benchmarks, including mathematics competitions, code generation, and logical reasoning [12][14]
- It significantly outperformed its preview version, achieving state-of-the-art (SOTA) results in multiple dimensions, particularly on complex reasoning tasks [9][14][16]

Competitive Analysis
- In logical reasoning tasks, Ring-1T surpassed leading closed-source models such as Gemini-2.5-Pro, showcasing its competitive edge [16]
- On the Arena-Hard-v2.0 comprehensive ability test it trailed only slightly behind GPT-5-Thinking, placing it among the industry's top tier [16]

Practical Applications
- Ring-1T generated functional code for simple games such as Flappy Bird and Snake, demonstrating practical value in software development [20][23]
- It also excelled at creative writing, producing engaging narratives and scripts that weave in historical facts and storytelling techniques [40][43]

Technical Innovations
- Development relied on advanced reinforcement learning techniques, notably the IcePop algorithm, which mitigates training inconsistencies and enhances model stability [45][46]
- Ant Group's self-developed RL framework, ASystem, supports efficient training of large-scale models, addressing hardware resource challenges and improving training consistency [50][52]
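The summary credits IcePop with damping training-inference inconsistency. Public descriptions of this family of techniques center on masking out tokens whose probabilities diverge too far between the training engine and the inference engine; the sketch below illustrates only that general idea, and the band thresholds and exact masking rule are my assumptions, not Ring-1T's published values.

```python
import math

def discrepancy_mask(train_logprobs, infer_logprobs, low=0.5, high=2.0):
    """Per-token 0/1 mask: keep a token only if the ratio of its
    training-engine probability to its inference-engine probability
    stays inside [low, high]. Thresholds here are illustrative."""
    mask = []
    for lp_train, lp_infer in zip(train_logprobs, infer_logprobs):
        ratio = math.exp(lp_train - lp_infer)  # p_train / p_infer
        mask.append(1.0 if low <= ratio <= high else 0.0)
    return mask
```

Masked tokens contribute no gradient, so occasional engine mismatches cannot dominate and destabilize the policy update.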
Gradient updates with zero human involvement! MIT's new framework lets AI generate its own fine-tuning data and autonomously upgrade its weights
量子位· 2025-10-14 04:08
Core Viewpoint
- The article discusses SEAL (Self-Adapting LLMs), a new reinforcement learning framework from MIT that enables large models to autonomously update their weights and learn new knowledge without human intervention [1][4][6]

Group 1: SEAL Framework Overview
- SEAL employs a nested learning mechanism: an external loop driven by reinforcement learning and an internal loop that performs parameter updates [4][26]
- The framework lets models generate their own fine-tuning data and self-update instructions, overcoming the limitation of relying solely on external supervised data [6][25]

Group 2: Knowledge Incorporation Experiment
- The Qwen2.5-7B model was tested on the SQuAD dataset, generating training data from new paragraphs without seeing the corresponding questions [9][10]
- Accuracy improved from 32.7% to 47.0% when fine-tuning with SEAL, outperforming both the original data and data generated by GPT-4.1 [14][15]
- SEAL reached 58.2% accuracy on longer paragraphs, indicating it generalizes to larger data-organization tasks [16]

Group 3: Few-Shot Learning Experiment
- The LLaMA-3.2-1B-Instruct model was evaluated on a subset of tasks from the ARC-AGI dataset [17][18]
- SEAL achieved a 72.5% success rate, far above the 0% of fixed few-shot prompts and the 20% of random sampling strategies [22][23]
- Although SEAL did not match the optimal strategy (Oracle TTT) at 100%, it showed strong task adaptability through self-discovered learning paths [22]

Group 4: Mechanism of SEAL
- SEAL reads new information, rewrites it in its own words, and performs gradient updates for autonomous learning [25]
- The model generates self-edit instructions describing how to update itself from the current input, including what information to extract and which training parameters to use [28][29]
- The framework uses ReSTEM, a non-traditional reinforcement learning method built on behavior cloning and filtered sampling, to optimize self-edit strategies [33][36]
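The ReSTEM procedure described above can be pictured as "sample, filter, clone": try several candidate self-edits, keep only those that improve a downstream score, and fine-tune on the survivors. A minimal sketch, with `apply_edit` and `evaluate` as hypothetical callables standing in for the fine-tune-and-test step:

```python
def restem_filter(candidates, apply_edit, evaluate):
    """One ReSTEM-style outer-loop step (sketch): keep only the
    self-edits whose resulting model beats the no-edit baseline.
    The kept edits would then be behavior-cloned into the model."""
    baseline = evaluate(apply_edit(None))  # score with no self-edit
    return [c for c in candidates if evaluate(apply_edit(c)) > baseline]

# Toy scorer: "edit_b" beats the no-edit baseline, "edit_a" does not.
scores = {None: 0.5, "edit_a": 0.4, "edit_b": 0.6}
kept = restem_filter(["edit_a", "edit_b"], lambda c: c, scores.get)
print(kept)  # ['edit_b']
```

Because the filter only compares scalar scores, it sidesteps the high-variance policy gradients of ordinary RL, which matches the article's framing of ReSTEM as behavior cloning plus filtered sampling.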
Karpathy hand-builds ChatGPT in 8,000 lines of code for just $100; after 12 hours of training its CORE score beats GPT-2, with a step-by-step tutorial included
36Kr· 2025-10-14 03:40
Core Insights
- The article covers the launch of "nanochat," a simplified version of ChatGPT created by Andrej Karpathy, former AI director at Tesla and a co-founder of OpenAI, built for educational purposes [1][57]
- The project lets users build a basic conversational AI model for roughly $100 and about 4 hours of training on a cloud GPU server [1][10]

Project Overview
- "nanochat" consists of around 8,000 lines of code, implemented mostly in Python with a Rust-based tokenizer, and includes a pre-trained Transformer model and the associated training datasets [2][3]
- The resulting model can hold basic conversations, generate stories and poems, and answer simple questions [2][4]

Performance Metrics
- After roughly 12 hours of training, the model's CORE score surpasses that of GPT-2 [4][52]
- Reported metrics include CORE, ARC-Easy, GSM8K, and HumanEval, with notable improvements across the training phases [3][52]

Training Phases
- Training proceeds through pre-training, mid-training, supervised fine-tuning (SFT), and reinforcement learning (RL) stages, each contributing to the model's capabilities [41][46]
- Mid-training adapts the model for multi-turn conversation and teaches it to handle multiple-choice questions [35][36]

Community Engagement
- The project drew over 4.8k GitHub stars shortly after release, indicating strong community interest and potential for further optimization [8][7]
- The codebase is designed to be approachable, inviting modification and extension by the community [54][55]

Educational Impact
- Karpathy aims to fold this work into a broader educational framework, potentially transforming how AI can assist learning [62]
- The project is part of a larger initiative to create a symbiotic relationship between teachers and AI, enhancing the learning experience [62]
Karpathy hand-builds ChatGPT in 8,000 lines of code for just $100; after 12 hours of training its CORE score beats GPT-2, with a step-by-step tutorial included
量子位· 2025-10-14 02:19
Core Insights
- The article covers the launch of "nanochat," a simplified version of ChatGPT created by Andrej Karpathy that can be built with minimal cost and code [1][2][4]

Project Overview
- "nanochat" is a full-stack training and inference pipeline for building a basic ChatGPT-like model in approximately 8,000 lines of code [2][4]
- The entire project runs on a cloud GPU server for about $100, taking as little as 4 hours to set up and run [3][4][16]

Technical Specifications
- The codebase is mostly Python with a Rust-based tokenizer, and includes a pre-trained Transformer architecture and the associated training datasets [5]
- It supports efficient inference with features such as KV caching and a lightweight Python interpreter for tool use [5][43]

Performance Metrics
- After about 12 hours of training, the model's CORE score surpasses that of GPT-2 [8]
- As a reference point, a model trained for 24 hours can score over 40 on MMLU and over 70 on ARC-Easy [10]

Development Goals
- Karpathy aims for a unified, simple, and modifiable codebase that can serve as a strong baseline for future development [11][13]
- The project is intended as the capstone of the upcoming LLM101n course on building large language models [12]

Community Engagement
- GitHub stars reached 4.8k shortly after release, indicating strong community interest [14]
- Users are encouraged to optimize and modify the codebase, enabling a collaborative improvement process [59]

Training Process
- Training proceeds in stages: pre-training, mid-training, supervised fine-tuning (SFT), and reinforcement learning (RL) [45][48][51]
- The full pipeline excluding RL takes approximately 3 hours 51 minutes, at a total cost of about $92.4 [57]

Final Remarks
- The article positions "nanochat" as a research tool and benchmarking framework in the spirit of earlier projects like nanoGPT [13]
- The project is still early, with many opportunities for further optimization and enhancement [13][50]
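The reported cost is easy to sanity-check. The article gives only the total, so the rental rate below is my assumption, but at roughly $24/hour for an 8-GPU cloud node, the quoted 3 h 51 min pipeline lands exactly on the quoted ~$92.4:

```python
# Back-of-envelope check of the article's cost figures.
hourly_rate = 24.0        # assumed $/hour for the cloud GPU node
hours = 3 + 51 / 60       # 3 h 51 min for the non-RL pipeline
cost = hourly_rate * hours
print(round(cost, 1))     # 92.4
```

The same arithmetic explains the headline claim: a ~4-hour speedrun at that rate is about $100, and a 12-to-24-hour run scales the budget linearly.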
"First-Principles Thinking on Large Models": transcript of Li Jianzhong's dialogue with Lukasz Kaiser, contributor to GPT-5 and co-inventor of the Transformer
36Kr· 2025-10-13 10:46
Core Insights
- The rapid development of large intelligent systems is reshaping industry dynamics, exemplified by OpenAI's recent release of Sora 2, which showcases advancing model capabilities and the complexity of AI evolution [1][2]
- The dialogue between industry leaders, CSDN's Li Jianzhong and OpenAI's Lukasz Kaiser, focuses on first-principles thinking about large models and their implications for future AI development [2][5]

Group 1: Language and Intelligence
- Language plays a crucial role in AI; some experts argue that relying solely on language models for AGI is misguided because language is a low-bandwidth representation of the physical world [6][9]
- Kaiser emphasizes the temporal dimension of language, suggesting that the ability to generate sequences over time is vital for expressing intelligence [7][9]
- While language models can form abstract concepts, these may not fully align with human concepts, particularly those grounded in physical experience [11][12]

Group 2: Multimodal Models and World Understanding
- The industry trend is toward unified models that handle multiple modalities; current models such as GPT-4 already demonstrate significant multimodal capability [12][13]
- Kaiser acknowledges that while modern language models can process multimodal tasks, integrating different modalities remains a challenge [13][15]
- The discussion raises skepticism about whether AI can fully understand the physical world through observation alone, while suggesting language models may serve as effective world models in certain contexts [14][15]

Group 3: AI Programming and Future Perspectives
- AI programming is emerging as a key application of large language models, with two main camps: one advocating natural language as the primary programming interface, the other emphasizing the continued need for traditional programming languages [17][18]
- Kaiser believes language models will cover ever more programming tasks, but a solid grasp of programming concepts will remain essential for professional developers [19][20]

Group 4: Agent Models and Generalization Challenges
- Training "agent models" struggles to generalize to new tasks, raising the question of whether this stems from training methods or inherent limitations [21][22]
- Kaiser suggests the effectiveness of agent systems relies on learning from interactions with varied tools and environments, which is currently limited [22][23]

Group 5: Scaling Laws and Computational Limits
- The belief in Scaling Laws as the key to stronger AI raises concerns about over-reliance on computational power at the expense of algorithmic and architectural advances [24][25]
- Kaiser differentiates pre-training from reinforcement learning Scaling Laws, noting that while pre-training has been effective, it may be approaching economic limits [25][26]

Group 6: Embodied Intelligence and Data Efficiency
- Slow progress in embodied intelligence, particularly humanoid robots, is attributed either to data scarcity or to fundamental differences between bits and atoms [29][30]
- Kaiser argues that gains in data efficiency and the development of multimodal models will be crucial for effective embodied intelligence [30][31]

Group 7: Reinforcement Learning and Scientific Discovery
- The shift toward reinforcement-learning-driven reasoning models offers opportunities for innovation but raises questions about their effectiveness at generating new scientific insights [32][33]
- Kaiser notes that while reinforcement learning offers high data efficiency, it has limitations compared with traditional gradient-descent training [33][34]

Group 8: Organizational Collaboration and Future Models
- Achieving large-scale collaboration among agents remains a significant challenge, requiring more parallel processing and effective feedback mechanisms in training [35][36]
- Kaiser stresses the need for next-generation reasoning models that operate more in parallel and more efficiently to enable organizational collaboration [36][37]

Group 9: Memory Mechanisms in AI
- Current models' memory is limited by context windows, resembling working memory rather than true long-term memory [37][38]
- Kaiser suggests future architectures may need more sophisticated memory mechanisms to achieve genuine long-term memory [38][39]

Group 10: Continuous Learning in AI
- The potential for continuous learning is being explored, with current models using context as a form of ongoing memory [39][40]
- Kaiser believes context learning is a step forward, but more elegant solutions for continuous learning will be needed [40][41]
Real AI competitiveness lies hidden in the "post-training" step of large models
量子位· 2025-10-13 08:47
Core Insights
- The article emphasizes Post-Training as a transformative approach in AI, moving beyond simple model optimization to building specialized intelligent engines tailored to specific business needs [1][4]
- Post-Training has evolved from Supervised Fine-Tuning (SFT) to Reinforcement Learning (RL) methodologies that better align with complex business requirements [2][4]

Post-Training Evolution
- The industry's initial approach was SFT, which let models learn specific domain knowledge and dialogue styles [2]
- SFT proved insufficient for teaching complex value judgments and strategic choices, which are critical in real business scenarios [3]
- The focus has shifted to RL, evolving from human-dependent methods (RLHF) to automated systems (RLVR) and the innovative use of Natural Language Rewards [4][5]

Implementation Pathway
- The article outlines a four-step pathway for enterprises to implement Post-Training effectively, addressing challenges such as data quality, high labeling costs, and defining reward signals [5][8]
- Case studies from Zhihu, AutoHome, and Weibo illustrate these steps in practice, showing improvements in data quality and model performance [7][8]

Step 1: Data Preparation
- High-quality data is the cornerstone of successful Post-Training; companies spend 60-70% of their time on data preparation [10]
- Zhihu improves data quality through pre-labeling, while AutoHome leverages structured data [11][13]

Step 2: Model Selection
- Choosing the right base model is crucial; many companies opt for the Tongyi Qianwen series for its performance and Post-Training support [14][16]
- Its architecture and open-source ecosystem ease the implementation of Post-Training techniques [15][18]

Step 3: Reward Mechanism Design
- Reward design is essential for aligning model outputs with business objectives, transitioning from human feedback to automated verification systems [24][25]
- Companies like Yingmi Fund are exploring ways to encode expert decision-making frameworks into their models to improve performance [26]

Step 4: Evaluation System
- A robust evaluation system is necessary to measure Post-Training's effectiveness; Yingmi Fund is developing benchmarks to assess model performance in real-world scenarios [27][28]
- Successful implementations have yielded significant gains in model accuracy and business outcomes, as seen at Baifeng Cloud and Quark [30][32]

Conclusion
- The true competitive advantage in AI lies in how companies leverage their unique data and business insight through Post-Training to create proprietary intelligent engines [32]
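The RLHF-to-RLVR transition described in the article boils down to replacing a human preference score with a programmatic check. A minimal sketch of a verifiable reward, assuming a normalized exact-match check on the final answer (real systems use richer verifiers such as unit tests or math checkers):

```python
def verifiable_reward(answer: str, reference: str) -> float:
    """RLVR-style reward (sketch): score a model output with a
    programmatic check instead of a human rater.
    Here the check is normalized exact match, an assumption."""
    def normalize(s: str) -> str:
        return s.strip().lower().rstrip(".")
    return 1.0 if normalize(answer) == normalize(reference) else 0.0

print(verifiable_reward("  Paris. ", "paris"))  # 1.0
print(verifiable_reward("London", "paris"))     # 0.0
```

Because the signal is computed, not rated, it scales to millions of rollouts without labeling cost, which is exactly the economics the article attributes to moving past RLHF.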
Changing the reinforcement learning paradigm: Meta's new work echoes Sutton's "era of experience" prediction
机器之心· 2025-10-13 06:37
Core Insights
- The article discusses the transition from the data era to the experience era in AI, emphasizing that AI agents must learn from interactions with their environment rather than relying solely on data [1][2]
- Meta's research introduces a new paradigm called "early experience," in which agents learn from their own actions and the resulting states, generating supervisory signals without external rewards [2][3]

Group 1: Early Experience Paradigm
- The "early experience" paradigm combines imitation learning and reinforcement learning, letting agents learn both from curated data and from their own experience in the environment [2][3]
- Meta's implementation improved task-completion success rates by 9.6% and out-of-distribution generalization by 9.4%, a significant advance in AI training methodology [3][25]

Group 2: Methodologies
- Two strategies were explored within the early experience framework: implicit world modeling and self-reflection [3][18]
- Implicit world modeling uses collected states to predict future states, letting agents internalize environmental dynamics without separate modules [10][12]
- Self-reflection has agents compare expert actions with their own generated actions, producing explanations that improve decision-making and learning [13][14]

Group 3: Experimental Results
- Benchmark tests show the early experience methods outperform traditional imitation learning across scenarios, with both implicit world modeling and self-reflection yielding notable improvements [21][22]
- In out-of-distribution evaluations, early experience methods significantly narrowed performance gaps, demonstrating their effectiveness in adapting to unseen environments [23]

Group 4: Conclusion
- Starting training with early experience raises the performance ceiling of subsequent reinforcement learning phases, acting as a bridge between the data and experience eras [25][26]
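Both early-experience strategies turn an agent's own rollouts into supervision. A minimal sketch of the data side for implicit world modeling, assuming rollouts arrive as (state, action, next_state) transitions: each transition becomes a "(state, action) predicts next state" training example, with no reward signal required.

```python
def world_model_examples(rollout):
    """Implicit world modeling (sketch): build supervised targets from
    the agent's own transitions so it can internalize environment
    dynamics without rewards. rollout: list of (s, a, s_next)."""
    return [{"input": (s, a), "target": s_next} for s, a, s_next in rollout]

# Hypothetical web-navigation transition.
rollout = [("page_home", "click_search", "page_results")]
print(world_model_examples(rollout))
```

Self-reflection would produce a parallel dataset in the same shape, with the target being an explanation of why the expert action beats the agent's own alternative.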
Ditch instant gratification: use small things to regain your startup rhythm
36Kr· 2025-10-13 00:20
In the eyes of many founders, downing a coffee in the office at dawn, constantly watching the ever-flashing industry group chats, and checking the analytics dashboard every 15 minutes are all necessary moves "to seize opportunities."

In daily life, the most common illusion is mistaking "running on external stimulation" for "strong stress tolerance." After revising a proposal into the small hours, a heavy, sugary late-night meal eases the fatigue; the blood-sugar spike briefly excites the brain, and one concludes "I've weathered another round of pressure." Too sleepy to open your eyes in the morning, you pour down a third iced Americano, and the caffeine-induced "alertness" gets mistaken for "being able to keep working efficiently."

At work, the illusion hides in "instant-feedback dependence": many people treat "refreshing metrics" and "monitoring messages" as "being on top of the business." Refreshing user data every 15 minutes and seeing daily active users tick up 0.5% feels like "the business is moving forward"; unwilling to miss a single message in the industry group chats, the "sense of information gain" as fingers swipe across the screen is misread as "staying sharp"; not to mention treating "working overtime into the night" as "proof of dedication."

These behaviors are, in essence, dopamine-driven wasted effort: fixating on the short-term stimulation of metric fluctuations can crowd out core issues such as user experience; fragmented information intake can leave no time to work out business strategies that can actually land, finally trapping one in a loop of "the more you scroll, the more anxious you get, and the more anxious, the more you scroll," turning "watching the business" into "draining your energy."

Sometimes a founder's dependence on dopamine wears the cloak of "being responsible for the business": behaviors interpreted as "efficient" or "dedicated" ...
How much innovation is there in AI Agents, really?
自动驾驶之心· 2025-10-12 23:33
Author | sunnyzhao  Editor | 大模型之心Tech

1. The planning stage introduces enormous latency. Once the number of tools grows, the accuracy of turbo-tier models becomes worrying, forcing the use of flagship models, which drives latency up further.
2. Planning quality is not high enough. The workflows the original task bot used were designed by humans; now the model decides them on its own, and in current testing the usability of complex workflows built by the model falls far short of human level. For simple workflows, a small discriminative model actually performs better.
3. Reflection is a strategy that trades time for accuracy, but it very easily slides into repeated self-consumption and dead loops.

These problems are indeed common ailments of current AI Agent technology. If you treat an Agent as a simple combination of "LLM + tool calling" without carefully handling the engineering details, the actual results may well be no better than workflow orchestration. Drawing on some papers and a bit of hands-on experience, here are my views on the three points the asker raised.

The root cause of slow planning

Original link: https://www.zhihu.com/question/657739588/ ...
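One common engineering guard against the reflection dead-loop problem raised here is simply to bound the loop and stop when the output stops changing. A minimal sketch, where `step` is a hypothetical callable that takes the current draft and returns a (hopefully) improved one:

```python
def reflect_bounded(step, max_rounds=3):
    """Bounded self-reflection (sketch): rerun `step` at most
    `max_rounds` times, and stop early if it repeats an earlier
    output, a cheap defense against self-consuming dead loops."""
    draft, seen = None, set()
    for _ in range(max_rounds):
        draft = step(draft)
        if draft in seen:  # no progress; bail out of the loop
            break
        seen.add(draft)
    return draft
```

The round cap converts reflection's open-ended time-for-accuracy trade into a fixed latency budget, which is usually the first fix teams reach for in production agents.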
In intelligent driving's last window of opportunity, a new AI player breaks out
远川研究所· 2025-10-12 13:04
Core Insights
- The intelligent assisted driving industry has seen a stark contrast over the past year: technological advances have driven consumer demand and cost reductions, letting L2+ systems penetrate the mid-to-low-end market [2][4][5]
- Competition is intensifying, leading players are clearly emerging, and companies must adapt to new technological paradigms to remain relevant [2][9]
- The rise of multimodal large models and end-to-end systems is reshaping the industry, with companies like Qianli Technology positioning themselves strategically to leverage these advances [12][21]

Industry Dynamics
- The shift from modular to end-to-end architectures in intelligent driving systems is becoming standard, exemplified by Tesla's FSD V9.0 and its pure-vision approach [4][5][6]
- Software value is projected to exceed 40% of total vehicle value in intelligent driving systems, indicating a significant shift toward software-driven solutions [6][18]
- The competitive landscape mixes vertically integrated companies like Tesla with third-party suppliers, highlighting the importance of collaboration and resource integration [9][18]

Company Developments
- Qianli Technology, founded by AI pioneer Yin Qi, aims to become a platform-level AI company focused on intelligent assisted driving and smart cockpit solutions [11][21]
- The company has partnered with major automotive players, including Geely, to expand its market presence and technological capability [17][25]
- Its RLM (Reinforcement Learning-Multi-modal) model is drawing attention for improving driving experience and safety through advanced perception and decision-making [21][24]

Future Trends
- Integrating multimodal large models with reinforcement learning is expected to be crucial to the future of intelligent driving systems, enhancing their adaptability and safety [20][22]
- The global market for automated and intelligent driving vehicles is projected to reach $1.2 trillion by 2040, with significant growth opportunities for companies like Qianli Technology [25]
- Robotaxi services are a key focus for Qianli Technology, which aims to establish a comprehensive operational framework within 18 months [27]