Reinforcement Learning
Tencent unveils an ultra-low-cost AI training method: a ¥120 run beats a ¥70,000 fine-tuning pipeline
量子位· 2025-10-15 06:27
Core Viewpoint
- Tencent proposes a new method for upgrading large model agents called Training-Free GRPO, which significantly reduces costs and improves performance without the need for parameter tuning [1][5][11].

Group 1: Methodology
- The Training-Free GRPO method allows for performance enhancement by learning from brief experiences embedded in prompts, eliminating the need for parameter adjustments [2][11].
- This approach keeps the model parameters frozen while dynamically updating an external knowledge base to optimize performance [14][22].
- The method leverages the core logic of traditional GRPO but transforms it into a non-parametric reasoning process [13].

Group 2: Experimental Results
- Experiments demonstrate that the DeepSeek-V3.1-Terminus model using Training-Free GRPO shows significant performance improvements in mathematical reasoning and web search tasks [4][25].
- Compared to fine-tuning a 32B model, Training-Free GRPO requires less training data and incurs far lower costs: approximately $18 versus over $10,000 for traditional methods [5][28].
- In the AIME24 and AIME25 tests, the model's performance improved from 80.0% to 82.7% and from 67.9% to 73.3%, respectively, a clear gain from minimal training samples [28].

Group 3: Performance Evaluation
- The method achieved a Pass@1 score of 67.8% on the WebWalkerQA benchmark, a significant increase from the baseline score of 63.2% [35].
- The results indicate that the learned experiences help the model avoid redundant tool calls and improve decision-making efficiency [31][30].
- The effectiveness of Training-Free GRPO is contingent on the underlying model's reasoning and tool-use capabilities, as shown by its weaker results on less capable models [40].
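The non-parametric "update" the summary describes can be sketched in a few lines. This is a hypothetical toy, not Tencent's released code: the toy model simply rewards prompts that already contain the lesson, and a fixed placeholder stands in for the LLM-written experience summary.

```python
import random

def toy_model(prompt, seed):
    """Stand-in for an LLM call; returns (answer, reward). Reward is higher
    when the prompt already carries a useful experience (a toy stand-in for
    in-context learning)."""
    rng = random.Random(seed)
    base = 0.6 if "avoid redundant tool calls" in prompt else 0.3
    return f"answer-{seed}", base + 0.1 * rng.random()

def training_free_grpo_step(question, experiences, group_size=4):
    """One 'training-free' GRPO step: sample a group of rollouts, compute
    group-relative advantages exactly as GRPO would, then replace the
    gradient step with a textual lesson appended to an external library.
    Model weights never change."""
    prompt = "Experiences:\n" + "\n".join(experiences) + "\nQ: " + question
    rollouts = [toy_model(prompt, seed=i) for i in range(group_size)]
    rewards = [r for _, r in rollouts]
    mean = sum(rewards) / len(rewards)
    advantages = [r - mean for r in rewards]      # group-relative, as in GRPO
    best = max(range(group_size), key=lambda i: advantages[i])
    # A real system would have an LLM summarize why the best rollout won;
    # a fixed placeholder lesson stands in for that summary here.
    experiences.append("Lesson: avoid redundant tool calls")
    return rollouts[best][0], rewards[best]

library = []
_, r1 = training_free_grpo_step("Solve task A", library)
_, r2 = training_free_grpo_step("Solve task A", library)
print(len(library), r2 > r1)  # library grew; the stored lesson raised reward
```

The design point is that the "optimizer state" lives in the experience library rather than in the weights, which is why the method costs API calls instead of GPU-hours.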
Just now: UCLA's Bolei Zhou also joins a robotics company
机器之心· 2025-10-15 02:54
Core Insights
- Coco Robotics has appointed Bolei Zhou, a UCLA associate professor, as Chief AI Scientist to lead the newly established Physical AI Lab, focusing on autonomous driving for sidewalks [1][2][4]
- The company aims to achieve full automation in last-mile delivery, leveraging the extensive operational data collected over the past five years [2][4][6]

Group 1: Company Overview
- Coco Robotics, founded in 2020, specializes in last-mile delivery robotics and initially relied on teleoperators to navigate obstacles [2][4]
- The company has accumulated millions of miles of data in complex urban environments, which is crucial for training reliable AI systems [4][6]

Group 2: Research and Development
- The Physical AI Lab will use the collected data to enhance automation and operational efficiency, focusing on the local models used by its robots [6]
- The lab operates independently from Coco Robotics' collaboration with OpenAI, which allows the use of OpenAI's models while sharing data for mutual benefit [5][6]

Group 3: Bolei Zhou's Background
- Bolei Zhou holds a PhD from MIT and has a strong research background in machine perception and intelligent decision-making, with over 100 publications and significant contributions to explainable AI [9][11][13]
- His notable works include Class Activation Mapping (CAM) and the creation of the Places database, which contains over 10 million labeled scene images, advancing scene recognition [11][18][20]
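Since the summary highlights Class Activation Mapping (CAM), here is a minimal NumPy sketch of what CAM computes: the target class's fully connected weights (after global average pooling) re-weight the final convolutional feature maps to localize class evidence. Shapes and values are illustrative, not from any real network.

```python
import numpy as np

def class_activation_map(feature_maps, fc_weights, class_idx):
    """feature_maps: (C, H, W) activations from the last conv layer.
    fc_weights: (num_classes, C) weights of the final linear layer.
    Returns an (H, W) heatmap normalized to [0, 1]."""
    w = fc_weights[class_idx]                    # (C,) weights for the class
    cam = np.tensordot(w, feature_maps, axes=1)  # weighted sum over channels
    cam -= cam.min()
    if cam.max() > 0:
        cam /= cam.max()                         # normalize for visualization
    return cam

rng = np.random.default_rng(0)
fmaps = rng.random((8, 7, 7))       # 8 channels of 7x7 toy activations
weights = rng.random((10, 8))       # 10-class toy classifier head
heatmap = class_activation_map(fmaps, weights, class_idx=3)
print(heatmap.shape)  # (7, 7)
```

In practice the heatmap is upsampled to the input resolution and overlaid on the image to show where the class evidence came from.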
Karpathy hand-builds ChatGPT in 8,000 lines of code for just $100; after 12 hours of training, its CORE score beats GPT-2
程序员的那些事· 2025-10-15 00:44
Core Insights
- The article discusses the launch of "nanochat," a simplified version of ChatGPT created by Andrej Karpathy, which can be built with minimal resources and code [1][2][4].
- The project aims to provide an accessible framework for training language models, emphasizing ease of use and modification [11][13].

Project Overview
- "nanochat" is a full-stack training and inference pipeline that lets users create a basic ChatGPT-like model in approximately 8,000 lines of code [2][4].
- The total cost to train this model is around $100, using a cloud GPU server for about 4 hours [4][16].
- The project includes a custom tokenizer written in Rust, with training conducted on the FineWeb dataset [5][19].

Performance Metrics
- After approximately 12 hours of training, the model's performance on the CORE metric surpasses that of GPT-2 [8].
- Specific scores include: CORE 0.2219, ARC-Easy 0.3876, GSM8K 0.0758, HumanEval 0.0854, MMLU 0.3151 [7][56].

Training Process
- The training process involves several stages: pre-training, mid-training, supervised fine-tuning (SFT), and reinforcement learning (RL) [45][50].
- The pre-training phase uses a large dataset to teach the model about the world, while mid-training adapts the model for conversational tasks [28][45].
- The SFT phase further refines the model using high-quality dialogue data [48].

Community Engagement
- The project gained over 4.8k stars on GitHub shortly after its release, indicating strong community interest [14].
- The framework is designed to be easily modifiable, allowing users to experiment with different parameters and configurations [59].

Future Potential
- Karpathy envisions "nanochat" evolving into a research tool or benchmark framework, similar to previous projects like nanoGPT [13].
- The project is still in its early stages, with potential for further optimization and enhancement [13][50].
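The $100 and 12-hour figures are easy to sanity-check with back-of-envelope arithmetic, assuming (as reported around the project) an 8xH100 node rented at roughly $24/hour; the hourly rate is an assumption, since cloud spot prices vary.

```python
# Back-of-envelope check of the "$100 ChatGPT" claim. The $24/hour figure
# for an 8xH100 node is an assumed market rate, not an official price.
node_rate_usd_per_hour = 24     # assumed rental price for one 8xH100 node
speedrun_hours = 4              # quoted training time for the ~$100 tier
gpt2_beating_hours = 12         # quoted time for the CORE-beats-GPT-2 run

print(node_rate_usd_per_hour * speedrun_hours)      # 96
print(node_rate_usd_per_hour * gpt2_beating_hours)  # 288
```

So 4 hours lands at about $96 ("around $100"), and the 12-hour run would cost roughly three times that under the same rate.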
CoreWeave: a multi-trillion-dollar feast
36Kr· 2025-10-15 00:29
Core Viewpoint
- The integration of large language models (LLM) and reinforcement learning (RL) is accelerating the development of autonomous intelligent agents, positioning CoreWeave as a key cloud service provider for the AI infrastructure needed in this new phase [1]

Group 1: Business Strategy and Expansion
- CoreWeave's acquisition of OpenPipe is a significant move to enhance its capabilities in the reinforcement learning space, allowing it to train intelligent agents and gain developer recognition [2]
- The transition from a "hardware + API" model to a comprehensive "intelligent agent support platform" represents a qualitative leap in CoreWeave's offerings [3]
- The integration of reinforcement learning services is expected to significantly enhance profit margins, creating a competitive barrier that traditional hardware rental models cannot match [4]

Group 2: Infrastructure Requirements
- Intelligent agents require high-performance infrastructure (high-throughput interconnects, fast memory, rollback architecture, data monitoring, error recovery, and modular subroutines) that traditional cloud providers cannot adequately supply [5]
- The computational demands of intelligent agents are projected to be several orders of magnitude greater than traditional static inference, with global data center spending on computing expected to rise from hundreds of billions to trillions [6][7]

Group 3: Financial Performance and Market Potential
- CoreWeave's quarterly sales surged by 200% year-over-year to approximately $1.21 billion, with a backlog of nearly $30 billion, indicating strong future demand [8]
- The shift toward intelligent agent models is expected to drive significant market growth, with conservative estimates suggesting that annual spending on computational resources could reach $1.8 trillion by 2030 [9]
- CoreWeave's ability to capture value from the entire decision-making cycle of intelligent agents positions it favorably against competitors, enhancing its long-term profitability [10]

Group 4: Valuation and Future Outlook
- CoreWeave's current valuation aligns with GPU-intensive cloud service peers, with an estimated enterprise value (EV) range of $80-100 billion, potentially increasing to $120 billion if demand for reinforcement learning training accelerates [13]
- The company's strategic shift toward becoming a comprehensive provider of reinforcement learning training solutions is expected to expand its valuation range as its revenue structure increasingly leans toward software services [14]
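The growth figure is worth a quick sanity check: "surged 200% year-over-year to about $1.21 billion" means sales roughly tripled, so the prior-year quarter was around $0.40 billion. A one-line check:

```python
# "+200% YoY" means sales are 3x the prior-year quarter, so we can back out
# the implied prior-year figure from the article's $1.21B number.
current_quarter = 1.21e9
growth = 2.00                                  # +200% year-over-year
prior_year_quarter = current_quarter / (1 + growth)
print(round(prior_year_quarter / 1e9, 2))      # 0.4  (in $ billions)
```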
CoreWeave: a multi-trillion-dollar feast
美股研究社· 2025-10-14 12:30
Core Viewpoint
- The integration trend of large language models (LLM) and reinforcement learning (RL) is accelerating the development of "autonomous agents," AI systems capable of making decisions and executing tasks. CoreWeave is positioning itself as a core cloud service provider for a reinforcement learning-driven future, making it a high-certainty target for the next phase of AI infrastructure [1].

Business Expansion
- CoreWeave's business coverage is rapidly expanding, allowing it to push its infrastructure and services to more markets and enterprises, laying the foundation for scalable services in the agent era [2].

Transition to an Agent Operation Platform
- The acquisition of OpenPipe is a key move for CoreWeave to break into the upstream of the value chain. OpenPipe's core competency is a "reinforcement learning toolkit" that enables developers to train agents and adapt models to new task requirements [4].

Technological Integration
- CoreWeave is transforming from a "hardware layer + API interface" into a "full-cycle support platform for agents," a qualitative change in its service offerings [5].

Demand and Profitability
- Agent-related workloads are growing exponentially, driving a continuous surge in computing demand. In-house reinforcement learning tools and runtime services are expected to significantly expand profit margins [6].

One-Stop Solution
- CoreWeave integrates various functionalities into its technology stack, forming a "one-stop solution" for developers that will become a core dependency for clients over time, creating a competitive barrier [7].

Infrastructure Requirements
- The infrastructure requirements for agents are significantly more complex than traditional AI inference, necessitating high-throughput interconnects, fast memory, rollback architectures, and real-time monitoring capabilities [9].

Market Growth Potential
- The computing power consumed by agent AI is expected to be several orders of magnitude greater than traditional "static inference." Global data center spending on computing is projected to rise from "hundreds of billions" to "trillions" in the coming years [11].

Competitive Advantage
- CoreWeave, as a leader among "AI-native new cloud vendors," is poised to capture a significant share of the trillion-dollar market, benefiting from its first-mover advantage in reinforcement learning training [12].

Revenue Growth
- CoreWeave's quarterly sales surged by 200% year-on-year to approximately $1.21 billion, with a backlog of nearly $30 billion, indicating strong long-term demand for its services [14].

Market Valuation
- CoreWeave's valuation is currently comparable to its GPU-intensive cloud service peers, with a forward EV/Sales ratio of about 5-6 times. If the platform business's revenue share increases to 30%, the enterprise value could approach $120 billion [20].
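The valuation logic reduces to one identity: enterprise value equals forward sales times the EV/Sales multiple. Working backwards from the quoted $80-100 billion EV range at a 5-6x multiple gives the forward sales the article implicitly assumes; this is our own arithmetic, not a figure from the article.

```python
# Implied forward sales from EV = forward_sales * (EV/Sales multiple).
ev_low, ev_high = 80e9, 100e9          # quoted enterprise-value range ($)
multiple_low, multiple_high = 5, 6     # quoted forward EV/Sales multiples

implied_sales_low = ev_low / multiple_high    # most conservative pairing
implied_sales_high = ev_high / multiple_low   # most aggressive pairing
print(round(implied_sales_low / 1e9, 1), round(implied_sales_high / 1e9, 1))
# 13.3 20.0  -> roughly $13-20B of implied forward sales
```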
Top conferences are keen on work combining RL with these directions
具身智能之心· 2025-10-14 10:00
Core Insights
- Reinforcement Learning (RL) remains a significant field with ongoing developments and applications in various domains, including robotics and product optimization [1][2][3]
- Gait control is central to embodied intelligent robots, with RL being the primary method for achieving complex movements [2][8]
- The complexity of RL poses challenges for newcomers, necessitating structured guidance to facilitate entry into the field and successful paper publication [5][9]

Group 1: Importance of Reinforcement Learning
- RL is not an outdated discipline; it continues to be relevant, with numerous applications in robotics such as humanoid and quadruped robots [1][2]
- Companies like Unitree and Zhiyuan use RL to train robots to perform challenging tasks, including climbing stairs and running [2][8]
- The integration of RL with Vision-Language-Action (VLA) models for robotic arms is gaining traction in academic research, improving the efficiency of robotic manipulation [3][8]

Group 2: Challenges in Learning and Research
- The extensive and complex nature of RL makes it difficult for beginners to navigate, often leading to frustration and abandoned studies [5][9]
- The lack of a comprehensive learning framework can result in repeated mistakes and missed research opportunities [6][9]
- A specialized 1v6 mentoring course aims to address these challenges by providing structured support for students in the RL field [6][9]

Group 3: Course Structure and Offerings
- The course spans 14 weeks of intensive online guidance followed by 8 weeks of maintenance support, focusing on producing a publishable paper [10][11]
- Weekly live sessions cover topics including RL fundamentals, simulation environments, and writing guidance, with a focus on practical applications [17][21]
- Participants can work on specific ideas in quadruped, humanoid, and robotic-arm research, with a structured approach to project development and writing [18][25]
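A toy of the "RL fundamentals" such a curriculum starts from: a two-armed bandit solved with the policy-gradient (REINFORCE) update. The policy is a single logit, and the exact expected gradient is used instead of sampled actions so the run is deterministic; rewards and step sizes are made up for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_bandit(steps=500, lr=0.5, rewards=(0.2, 1.0)):
    """Policy-gradient on a 2-armed bandit. pi(arm 1) = sigmoid(logit);
    the update follows E[r * d/dlogit log pi(a)], summed over both arms."""
    logit = 0.0
    for _ in range(steps):
        p1 = sigmoid(logit)
        grad = p1 * rewards[1] * (1 - p1) + (1 - p1) * rewards[0] * (0 - p1)
        logit += lr * grad        # ascend the expected-reward gradient
    return sigmoid(logit)

p_arm1 = train_bandit()
print(p_arm1 > 0.95)  # the policy concentrates on the higher-reward arm
```

The same score-function estimator, scaled up with neural policies and simulated physics, underlies the gait-control training the article describes.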
Gradient updates with zero human involvement: MIT's new framework lets AI auto-generate fine-tuning data and upgrade its own weights
36Kr· 2025-10-14 07:16
Core Insights
- MIT has introduced a new reinforcement learning framework called SEAL (Self-Adapting LLMs), enabling models to generate fine-tuning data and self-update instructions autonomously, allowing model weight updates without human intervention [1][3]

Group 1: SEAL Framework Overview
- SEAL employs a nested learning mechanism that computes rewards based on the updated model's task performance, optimizing the strategy that generates self-update instructions [3]
- The framework gives large models self-driven, weight-level update capabilities, overcoming the limitation of relying solely on external supervised data [3]

Group 2: Knowledge Incorporation Experiment
- In the knowledge incorporation experiment, the Qwen2.5-7B model was tested on the SQuAD dataset, generating training data from new paragraphs without seeing the corresponding answers [5]
- The Qwen base model's accuracy was 32.7%, improving to 33.5% with plain fine-tuning, 46.3% with GPT-4.1 synthetic data, and 47.0% with SEAL, demonstrating superior knowledge integration [6][10]

Group 3: Large-Scale Data Testing
- SEAL achieved 58.2% accuracy when tested with longer paragraphs, significantly outperforming the unoptimized version and indicating that it generalizes to larger data-organization tasks [8]

Group 4: Few-Shot Learning Experiment
- In the few-shot learning experiment, the LLaMA-3.2-1B-Instruct model was used with a subset of tasks from the ARC-AGI dataset, where SEAL generated a training configuration and executed LoRA fine-tuning [11][13]
- The success rate of tasks trained with SEAL reached 72.5%, far exceeding the 0% success rate of fixed few-shot prompts and the 20% of random sampling strategies, showcasing SEAL's strong task adaptation [15][16]

Group 5: SEAL's Operational Mechanism
- SEAL operates through a dual-loop system that automatically generates training instructions, letting the model read new information, rewrite it in its own words, and perform gradient updates for self-learning [17][18]
- The outer loop generates self-edit instructions from new input, while the inner loop executes fine-tuning according to those instructions, constructing synthetic training data and updating weights [18][20]
- SEAL uses a non-traditional reinforcement learning method called ReSTEM, which relies on behavior cloning and filtered sampling to optimize the generation of effective self-edit strategies [20]
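The dual-loop structure can be caricatured in a few lines. This is a hypothetical toy, not MIT's code: a model is a single scalar "weight," the inner loop's fine-tune is a scalar update, and the ReSTEM-style filter keeps only self-edits whose updated model beats the old one.

```python
import random

def inner_loop_finetune(weight, self_edit):
    """Stand-in for fine-tuning on data synthesized from a self-edit."""
    return weight + self_edit["learning_rate"] * self_edit["quality"]

def evaluate(weight):
    """Toy reward: a larger weight means better downstream accuracy."""
    return weight

def seal_outer_step(weight, num_candidates=4, seed=0):
    """One outer-loop step: sample candidate self-edits, run the inner loop
    on each, keep only improvements (ReSTEM-style filtering), and adopt the
    best surviving candidate."""
    rng = random.Random(seed)
    kept = []
    for _ in range(num_candidates):
        edit = {"learning_rate": 0.1, "quality": rng.uniform(-1, 1)}
        candidate = inner_loop_finetune(weight, edit)
        if evaluate(candidate) > evaluate(weight):   # filter: must improve
            kept.append(candidate)
    if kept:
        weight = max(kept)   # behavior-clone the best edit (toy: adopt it)
    return weight, len(kept)

w, n_kept = seal_outer_step(weight=0.0)
print(n_kept)  # 2 of the 4 sampled self-edits survived the filter
```

The real system replaces the scalar with LoRA fine-tuning on synthetic data and the reward with held-out task accuracy, but the keep-what-improves loop is the same shape.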
Ant's Ring-1T officially debuts: a trillion-parameter thinking model with math ability on par with an IMO silver medal
机器之心· 2025-10-14 06:33
Core Insights
- Ant Group has launched the Ling-1T and Ring-1T models, marking significant advancements in open-source AI, with capabilities comparable to closed-source giants [3][6][19]
- Ring-1T is the first open-source trillion-parameter reasoning model, showing exceptional performance across benchmarks and tasks [6][9][19]

Model Launch and Performance
- Ant Group announced the Ling-1T model on October 9, its largest language model to date, which logged over a thousand downloads within four days of release [3][5]
- The Ring-1T model officially launched on October 14, demonstrating superior reasoning abilities and achieving notable results on international mathematics competition problems [6][19]

Benchmark Testing
- Ring-1T underwent rigorous testing across eight key benchmarks, spanning mathematics competitions, code generation, and logical reasoning [12][14]
- Results show Ring-1T significantly outperformed its preview version, achieving state-of-the-art (SOTA) performance across multiple dimensions, particularly on complex reasoning tasks [9][14][16]

Competitive Analysis
- In logical reasoning tasks, Ring-1T surpassed leading closed-source models such as Gemini-2.5-Pro, showcasing its competitive edge [16]
- Its performance on the Arena-Hard-v2.0 comprehensive ability test was only slightly behind GPT-5-Thinking, placing it among the industry's top-tier models [16]

Practical Applications
- Ring-1T demonstrated its coding capabilities by generating functional code for simple games such as Flappy Bird and Snake, showcasing practical application in software development [20][23]
- The model also excelled at creative writing, producing engaging narratives and scripts that incorporate historical facts and storytelling techniques [40][43]

Technical Innovations
- The development of Ring-1T used advanced reinforcement learning techniques, particularly the IcePop algorithm, which mitigates training-inference inconsistencies and enhances model stability [45][46]
- Ant Group's self-developed RL framework, ASystem, supports efficient training of large-scale models, addressing hardware resource challenges and improving training consistency [50][52]
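The training-inference inconsistency IcePop targets arises because the engine that generates rollouts and the engine that computes gradients can assign slightly different probabilities to the same token. One heavily hedged reading of the fix is a double-sided mask on that probability ratio; the thresholds, names, and numbers below are invented for illustration and are not Ant Group's released algorithm.

```python
def keep_token(p_train, p_infer, low=0.5, high=2.0):
    """Keep a token in the RL gradient only if the training-engine vs
    inference-engine probability ratio stays inside [low, high]; tokens
    with a large discrepancy are dropped rather than allowed to
    destabilize the update. Thresholds are illustrative assumptions."""
    ratio = p_train / p_infer
    return low <= ratio <= high

# (p_train, p_infer) for three toy tokens: consistent, too low, too high.
tokens = [(0.30, 0.29), (0.05, 0.40), (0.20, 0.05)]
kept = [keep_token(pt, pi) for pt, pi in tokens]
print(kept)  # [True, False, False]
```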
Gradient updates with zero human involvement! MIT's new framework lets AI auto-generate fine-tuning data and upgrade its own weights
量子位· 2025-10-14 04:08
Core Viewpoint
- The article discusses a new reinforcement learning framework called SEAL (Self-Adapting LLMs) developed by MIT, which enables large models to autonomously update their weights and learn new knowledge without human intervention [1][4][6].

Group 1: SEAL Framework Overview
- SEAL employs a nested learning mechanism consisting of an outer loop driven by reinforcement learning and an inner loop for parameter updates [4][26].
- The framework allows models to generate fine-tuning data and self-update instructions, overcoming the limitation of relying solely on external supervised data [6][25].

Group 2: Knowledge Incorporation Experiment
- In the knowledge incorporation experiment, the Qwen2.5-7B model was tested on the SQuAD dataset, generating training data from new paragraphs without seeing the corresponding questions [9][10].
- The model's accuracy improved from 32.7% to 47.0% when fine-tuned with SEAL, outperforming both the original data and GPT-4.1-generated data [14][15].
- SEAL reached 58.2% accuracy when tested with longer paragraphs, indicating that it generalizes to larger data-organization tasks [16].

Group 3: Few-Shot Learning Experiment
- In the few-shot learning experiment, the LLaMA-3.2-1B-Instruct model was evaluated on a subset of tasks from the ARC-AGI dataset [17][18].
- SEAL achieved a success rate of 72.5%, far above the 0% of fixed few-shot prompts and the 20% of random sampling strategies [22][23].
- Although SEAL's performance did not reach the optimal strategy (Oracle TTT) at 100%, it showed strong task adaptability through self-discovered learning paths [22].

Group 4: Mechanism of SEAL
- SEAL's process involves reading new information, rewriting it in its own words, and performing gradient updates for autonomous learning [25].
- The model generates self-edit instructions describing how to update itself based on the current input, including information extraction and training parameters [28][29].
- The framework uses a non-traditional reinforcement learning method called ReSTEM, which relies on behavior cloning and filtered sampling to optimize self-edit strategies [33][36].
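The ReSTEM step both SEAL summaries mention is, in miniature, rejection sampling followed by behavior cloning: sample candidates, keep only those whose downstream reward clears a bar, then treat the survivors as supervised targets. The reward function and "edits" below are toy stand-ins, not the paper's implementation.

```python
def restem_filter(candidates, reward_fn, threshold=0.0):
    """Filtered sampling: keep only candidates whose reward clears the
    threshold; the survivors become behavior-cloning targets."""
    return [c for c in candidates if reward_fn(c) > threshold]

# Toy setting: a self-edit is just a number, and its reward is itself,
# so an edit "helps" exactly when it is positive.
candidates = [0.4, -0.2, 0.9, -0.7, 0.1]
survivors = restem_filter(candidates, reward_fn=lambda x: x)
print(survivors)       # [0.4, 0.9, 0.1]
print(len(survivors))  # 3 of 5 sampled edits kept for behavior cloning
```

The appeal of this scheme is that it needs no policy-gradient machinery: ordinary supervised fine-tuning on the filtered set does the optimization.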
Karpathy hand-builds ChatGPT in 8,000 lines of code for just $100; after 12 hours of training, its CORE score beats GPT-2, with a step-by-step tutorial included
36Kr· 2025-10-14 03:40
Core Insights
- The article discusses the launch of "nanochat," a simplified version of ChatGPT created by Andrej Karpathy, former AI director at Tesla and co-founder of OpenAI, aimed at educational purposes [1][57].
- The project lets users build a basic conversational AI model for approximately $100 and about 4 hours of training on a cloud GPU server [1][10].

Project Overview
- "nanochat" consists of around 8,000 lines of code and includes a custom tokenizer implemented in Rust, a pre-trained Transformer model, and various training datasets [2][3].
- The model can perform basic conversational tasks, generate stories and poems, and answer simple questions [2][4].

Performance Metrics
- After approximately 12 hours of training, the model's performance on the CORE metric surpasses that of GPT-2 [4][52].
- The reported metrics include CORE, ARC-Easy, GSM8K, and HumanEval scores, with notable improvements across the training phases [3][52].

Training Phases
- Training includes pre-training, mid-training, supervised fine-tuning (SFT), and reinforcement learning (RL) stages, each contributing to the model's capabilities [41][46].
- Mid-training focuses on adapting the model for multi-turn conversations and teaching it to handle multiple-choice questions [35][36].

Community Engagement
- The project gained over 4.8k stars on GitHub shortly after release, indicating strong community interest and potential for further optimization [8][7].
- The codebase is designed to be user-friendly, allowing modifications and enhancements by the community [54][55].

Educational Impact
- Karpathy aims to integrate this technology into a broader educational framework, potentially transforming how AI can assist in learning [62].
- The project is part of a larger initiative to create a symbiotic relationship between teachers and AI, enhancing the learning experience [62].
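The custom tokenizer the project trains (in Rust, for speed) is a byte-pair-encoding (BPE) tokenizer, and the core merge loop behind BPE fits in a few lines of Python. This toy is illustrative of the algorithm only; it is not nanochat's code.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent token pairs and return the most frequent one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge(tokens, pair, new_token):
    """Replace every occurrence of `pair` with the merged `new_token`."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = list("banana banana")          # start from individual characters
for _ in range(3):                      # three BPE merge steps
    pair = most_frequent_pair(tokens)
    tokens = merge(tokens, pair, "".join(pair))
print(tokens)  # ['banan', 'a', ' ', 'banan', 'a']
```

A real tokenizer runs this merge loop tens of thousands of times over gigabytes of text, which is exactly why nanochat implements it in Rust rather than Python.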