Reinforcement Learning
Agent Fine-Tuning Revived? NVIDIA Open-Sources a New 8B Model That Boosts GPT-5: Hitting 37 on HLE While Driving Costs Down
量子位· 2025-12-07 04:35
Core Insights
- The article introduces a new paradigm in AI model orchestration: a smaller 8B model acts as a conductor that coordinates various tools and larger models, achieving better performance at lower cost [1][13].

Group 1: Model Performance
- The Orchestrator-8B model scored 37.1% on Humanity's Last Exam, outperforming GPT-5's 35.1%, while cutting computational costs by a factor of 2.5 [1][9].
- On the FRAMES benchmark, Orchestrator-8B scored 76.3 versus GPT-5's 74.0; on τ²-Bench, it scored 80.2 against GPT-5's 77.7 [9][10].
- Orchestrator-8B's average cost was only 9.2 cents, with a latency of 8.2 minutes, both significantly lower than GPT-5's [9][10].

Group 2: ToolOrchestra Framework
- ToolOrchestra exposes heterogeneous tools through one unified JSON interface, letting the 8B conductor think, issue calls, and read feedback over multiple rounds until convergence [4].
- The framework uses GRPO reinforcement learning to jointly maximize three rewards: correctness, efficiency, and user preference [4][5].

Group 3: User Preferences and Biases
- The article highlights two biases in large models: a self-enhancing bias, where models prefer to call models similar to themselves, and blind reliance on the strongest models, which drives up costs [4][5].
- User preferences are taken into account, allowing the conductor to balance local versus cloud search, speed, and cost [5][15].

Group 4: Application Scenarios
- Orchestrator-8B applies to scenarios such as internal Q&A and report analysis, defaulting to local indexing and code execution for 80% of tasks [16].
- In research and development, it can enforce time and cost limits while honoring source preferences [16].
- The framework enables end-to-end orchestration of functions and tools, moving away from rigid hard-coded pipelines [16].

Group 5: Future Directions
- The paper releases all code, models, and datasets publicly for academic and industrial follow-up [14].
- The approach marks a shift from relying solely on the strongest models toward more efficient use of diverse tools and models, improving both cost-effectiveness and performance [15].
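The conductor-plus-tools loop and the three-part reward described above can be sketched as follows. This is a minimal illustration of the control flow only: the tool registry, the convergence check, and the reward weights are all hypothetical stand-ins, not ToolOrchestra's actual implementation.

```python
import json

# Hypothetical tool registry: in ToolOrchestra every tool sits behind one
# unified JSON interface; this toy tool stands in for real ones.
TOOLS = {
    "local_search": lambda q: f"local results for {q!r}",
}

def combined_reward(correct, cost_usd, pref_score,
                    cost_weight=1.0, pref_weight=0.5):
    """Scalar reward mixing the three signals the article names:
    correctness, efficiency (negative cost), and user preference.
    The weights here are illustrative, not from the paper."""
    return float(correct) - cost_weight * cost_usd + pref_weight * pref_score

def orchestrate(task, max_rounds=4):
    """Multi-round loop: the conductor 'thinks', emits a JSON tool call,
    reads the tool's feedback, and repeats until it decides to stop.
    (The 8B model itself is stubbed out; only the control flow is shown.)"""
    transcript = []
    for _ in range(max_rounds):
        call = {"tool": "local_search", "args": {"q": task}}  # stand-in choice
        feedback = TOOLS[call["tool"]](call["args"]["q"])
        transcript.append((json.dumps(call), feedback))
        if feedback:  # stand-in convergence check: stop once an answer arrives
            break
    return transcript
```

In GRPO training, `combined_reward` would score each sampled trajectory; the 9.2-cent average cost reported above is exactly the kind of `cost_usd` term this reward trades off against correctness.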
The Mystery of Unstable LLM Reinforcement Learning, Unraveled by the Qwen Team from a "First-Order Approximation" Perspective
机器之心· 2025-12-07 04:33
Core Insights
- Reinforcement Learning (RL) has become a key technology paradigm for enhancing the complex reasoning and problem-solving capabilities of Large Language Models (LLMs) [2].
- The main challenge in RL for LLMs is the mismatch between sequence-level rewards and token-level optimization objectives, which raises concerns about theoretical soundness and training stability [2][5].
- A new RL formulation from Alibaba's Qwen team optimizes the expected sequence-level reward through a surrogate token-level objective that serves as a first-order approximation [2][11].

Methodology
- The team models the autoregressive LLM as a policy π_θ and focuses on sequence-level rewards, assigning a scalar reward R(x, y) to the entire response y [6].
- They avoid value-function methods because constructing a general, scalable, and reliable value model is difficult [7].
- Directly optimizing the expected sequence-level reward is challenging due to numerical differences between training and inference [9].

Key Findings
- The team ran extensive experiments with a 30-billion-parameter MoE model, consuming hundreds of thousands of GPU hours [4].
- On-policy training with importance-sampling correction achieved the highest training stability [10].
- In off-policy updates, both clipping and Routing Replay are essential for stability; removing either degrades performance [23].

Experimental Results
- The MiniRL algorithm, which incorporates importance sampling, showed the best performance and stability during training [22].
- Removing the importance-sampling correction led to rapid collapse and a sharp drop in entropy, confirming its critical role in the first-order approximation [22].
- Different cold-start initialization methods yielded similar final performance, suggesting the focus should be on the RL methods themselves rather than initialization details [27].
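The token-level surrogate with importance-sampling correction can be sketched numerically. This is a generic PPO-style clipped objective illustrating the idea, not the Qwen team's exact formulation: each token carries a ratio π_new/π_old, weighted by the single scalar reward R(x, y) for the whole response, and at θ = θ_old the surrogate's gradient matches the true sequence-level policy gradient (the "first-order approximation" view).

```python
import math

def surrogate_loss(logp_new, logp_old, seq_reward, clip_eps=0.2):
    """Token-level surrogate for a sequence-level reward (a sketch).
    logp_new/logp_old: per-token log-probs under the current and
    behavior policies; seq_reward: scalar R(x, y) for the response."""
    total = 0.0
    for lp_new, lp_old in zip(logp_new, logp_old):
        ratio = math.exp(lp_new - lp_old)                # per-token IS ratio
        clipped = max(min(ratio, 1 + clip_eps), 1 - clip_eps)
        # PPO-style pessimistic bound; negated because we minimize the loss
        total -= min(ratio * seq_reward, clipped * seq_reward)
    return total / len(logp_new)
```

In the fully on-policy case the ratios are all 1 and the loss reduces to minus the (average) sequence reward; the article's ablation, where removing the correction causes collapse, corresponds to dropping the `ratio` factor above.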
A Deep Dive into PI's π*0.6 Iterative Reinforcement Learning Approach: VLA + Online RL for Self-Evolution
具身智能之心· 2025-12-07 03:03
Core Insights
- The article discusses advances in embodied intelligence, focusing on the VLA (Vision-Language-Action) model and its integration with reinforcement learning (RL) to enhance robotic capabilities [2][3][4].

Group 1: Importance of VLA and RL
- VLA models are central to embodied AI because they apply powerful vision-language models to robot control, but imitation learning alone is insufficient for robust performance in novel situations [6][9].
- Online RL lets robots discover better solutions through trial and error, overcoming the limitations of offline RL, which is constrained by the quality of demonstration data [9][10].

Group 2: Challenges in Applying RL to VLA
- Applying RL to VLA faces three main challenges: environmental differences, model instability, and computational demands [22].
- Directly applying RL to large VLA models can cause catastrophic forgetting and training collapse, making it difficult to maintain performance [22][23].

Group 3: iRe-VLA Model and Its Innovations
- The iRe-VLA model introduces a two-phase iterative learning process that alternates exploration with consolidation of learned behaviors [18][25].
- In the first phase, online RL lets the robot explore new tasks while the VLM parameters stay frozen, training only a lightweight action head [30][32].
- The second phase uses supervised learning to internalize the successful trajectories discovered during exploration, letting the model leverage its full capacity [40][43].

Group 4: Experimental Results and Effectiveness
- Experiments in both simulated environments and real-world scenarios show that iRe-VLA significantly improves task success rates over traditional methods [45][49].
- Success rates rose from 43% to 83% on benchmark tasks and from 35% to 80% on real-world object manipulation tasks [49][56].

Group 5: Conclusion and Future Directions
- The iRe-VLA framework effectively addresses the challenges of deploying large models in robotic control, paving the way for future research on efficient exploration and stable RL algorithms [61][63].
- The approach balances computational load by running lightweight tasks on local robots while reserving heavy computation for cloud servers, easing practical deployment [65].
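The two-phase iteration above can be sketched as a toy loop. Everything here is a stand-in (the policy is a dict of two "skill" knobs, the rollout is a coin flip); only the structure is the point: phase 1 explores with the backbone frozen and collects successes, phase 2 consolidates them with the full model.

```python
import random

random.seed(0)  # deterministic toy rollouts

def rollout(policy):
    """Toy rollout: success probability grows with the action-head skill."""
    return {"success": random.random() < policy["action_head_skill"]}

def ire_vla_iteration(policy, n_explore=50):
    """One iteration of an iRe-VLA-style two-phase loop (illustrative)."""
    # Phase 1: online RL with the VLM backbone frozen -- only the
    # lightweight action head improves, and successes are collected.
    successes = []
    for _ in range(n_explore):
        traj = rollout(policy)
        if traj["success"]:
            successes.append(traj)
            policy["action_head_skill"] = min(
                1.0, policy["action_head_skill"] + 0.01)
    # Phase 2: supervised learning on the successes, now updating the
    # full model (represented here by a single "backbone_skill" knob).
    policy["backbone_skill"] += 0.001 * len(successes)
    return policy

policy = {"action_head_skill": 0.4, "backbone_skill": 0.0}
policy = ire_vla_iteration(policy)
```

The key design choice the sketch mirrors is that RL only ever touches the small head, so the large backbone cannot collapse during exploration; it is updated only on filtered, successful data.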
NVIDIA Deftly Uses an 8B Model to Beat GPT-5, and Open-Sources It
量子位· 2025-12-06 05:40
Wen Le, from Aofeisi
QbitAI | WeChat official account QbitAI

NVIDIA, holding up a small 8B model, says to GPT-5: sorry, you still need more practice (just kidding).

Why say that? Because Orchestrator-8B, open-sourced by NVIDIA together with the University of Hong Kong, scores higher on Humanity's Last Exam (HLE), spends less money, and runs faster.

Oh, and it has been showered with praise on HuggingFace, climbing into the top five trending models.
Yann LeCun's First Paper After Leaving Meta? The Research Uses a Unitree Robot
机器之心· 2025-12-06 04:08
Core Insights
- The article discusses a research paper introducing GenMimic, a method that enables humanoid robots to perform actions generated by AI video models without prior examples [1][3][4].

Research Contributions
- The research presents a universal framework for humanoid robots to execute actions generated by video models [4].
- GenMimic employs a new reinforcement-learning strategy that uses symmetric regularization and selectively weighted 3D keypoint rewards for training, allowing generalization to noisy synthetic videos [4].
- The team created GenMimicBench, a synthetic human-action dataset that serves as a scalable benchmark for evaluating zero-shot generalization and policy robustness [4][8].

GenMimicBench Dataset
- GenMimicBench consists of 428 videos generated with the advanced video generation models Wan2.1 and Cosmos-Predict2 [9][11].
- The dataset spans a wide range of subjects, environments, and action types, from simple gestures to complex interactions with objects [11][13].
- It is designed to stress-test the robustness of humanoid control policies under varying visual and action distributions [13].

Methodology Overview
- The proposed method executes humanoid robot actions from generated videos in two stages [15][17].
- The first stage reconstructs the humanoid robot's 4D model from the input RGB video; the second translates this model into executable actions [17][18].
- Using 3D keypoints instead of joint angles makes the strategy robust to variation and noise in the input data [19][20].

Experimental Results
- Extensive experiments on GenMimicBench and a real-world 23-DoF humanoid robot demonstrate significant improvements over strong baselines [29][30].
- In simulation, GenMimic achieved a success rate (SR) of 29.78% and outperformed existing models on various metrics [31].
- Real-world experiments showed the strategy successfully replicated a wide range of upper-body actions, although lower-body movements remained challenging [34][35].
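The two reward ingredients named above, selectively weighted 3D keypoint tracking and symmetric regularization, can be sketched numerically. This is an illustrative reconstruction of the idea, not GenMimic's actual reward functions; the Gaussian kernel, the weights, and the mirror plane are assumptions.

```python
import math

def keypoint_reward(robot_kps, target_kps, weights, sigma=0.1):
    """Selectively weighted 3D keypoint tracking reward (a sketch):
    instead of matching joint angles, the policy is rewarded for bringing
    weighted body keypoints close to the reference keypoints reconstructed
    from the generated video. Returns 1.0 at a perfect match."""
    err = 0.0
    for (r, t), w in zip(zip(robot_kps, target_kps), weights):
        dist2 = sum((a - b) ** 2 for a, b in zip(r, t))
        err += w * dist2
    return math.exp(-err / (2 * sigma ** 2))

def symmetry_penalty(left_kps, right_kps):
    """Symmetric-regularization term (illustrative): penalize left/right
    keypoints that are not mirror images across the x = 0 plane."""
    pen = 0.0
    for l, r in zip(left_kps, right_kps):
        mirrored = (-r[0], r[1], r[2])
        pen += sum((a - b) ** 2 for a, b in zip(l, mirrored))
    return pen
```

Tracking keypoints rather than joint angles is what buys robustness to noisy synthetic videos: small pose-estimation errors shift keypoints slightly instead of producing wildly wrong joint targets.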
Crushing π0.5: Fudan Team Pioneers a Closed-Loop "World Model + Embodied Training + Reinforcement Learning" Framework
机器之心· 2025-12-04 08:18
Core Viewpoint
- The Vision-Language-Action (VLA) strategy is becoming a crucial technological pathway toward general-purpose robot manipulation, jointly processing visual perception and language instructions while generating continuous control signals [2].

Group 1: Challenges in Current VLA Approaches
- Most current VLA methods rely heavily on imitation learning, which accumulates errors and fails tasks under distribution shift or changed task forms [3][11].
- Running online reinforcement learning (RL) on real robots is costly and requires extensive human intervention and monitoring, making large-scale deployment impractical [12].
- Traditional physics engines struggle to balance realism, scene diversity, and engineering usability, complicating the use of RL in simulated environments [13].

Group 2: ProphRL Framework
- The research team proposes the ProphRL framework, which uses a large-scale pre-trained world model called Prophet as a video-level simulator to optimize VLA strategies with online RL algorithms [4].
- This sharply reduces real-world interaction costs while preserving physical credibility, making large-model VLA strategies practical to deploy [4].

Group 3: Experimental Results
- ProphRL improved success rates by 5-17% across various VLA models on public benchmarks, with real-robot experiments showing substantial gains of 24-30% [8].
- The Prophet model achieved leading visual fidelity and action consistency across multiple datasets, generalizing to new scenes and tasks with minimal fine-tuning [31].

Group 4: Innovations in RL Algorithms
- The research introduces FA-GRPO and FlowScale, RL algorithms tailored to flow-based action heads, which enhance training stability and performance by reorganizing gradient signals and balancing contributions from different steps [26][27].
- A video-language reward model evaluates task success over the entire trajectory, moving away from manually designed geometric distances [26].

Group 5: Real-World Validation
- The ProphRL framework was validated on real robots, achieving significant improvements in task success rates across various complex tasks and demonstrating the effectiveness of integrating a world model with RL in practice [38].
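FA-GRPO builds on the group-relative advantage idea of GRPO, which can be shown in a few lines. This sketches vanilla GRPO advantage normalization only; how FA-GRPO adapts it to flow-based action heads is not reconstructed here.

```python
def grpo_advantages(group_rewards):
    """Group-relative advantages, GRPO-style: several rollouts are sampled
    for the same task, and each is scored against the group mean,
    normalized by the group standard deviation. No value network needed."""
    n = len(group_rewards)
    mean = sum(group_rewards) / n
    var = sum((r - mean) ** 2 for r in group_rewards) / n
    std = var ** 0.5 or 1.0  # avoid division by zero when all rewards match
    return [(r - mean) / std for r in group_rewards]
```

A video-language reward model like the one described above would supply `group_rewards`, one scalar per Prophet-simulated trajectory, so no hand-crafted geometric distance is needed.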
Update: MEET2026 Speaker Lineup Refreshed Again; Audience Registration Open, Act Fast
量子位· 2025-12-04 05:57
Core Insights
- The MEET2026 Smart Future Conference will focus on cutting-edge technologies and industry developments that drew significant attention throughout the year [1][2].
- The theme "Symbiosis Without Boundaries, Intelligence to Ignite the Future" emphasizes how AI and smart technologies are penetrating industries, disciplines, and scenarios, becoming a core driving force of societal evolution [2].

Group 1: Conference Highlights
- The conference will cover this year's hot topics in tech, including reinforcement learning, multimodal AI, chip computing power, AI across industries, and AI going global [3].
- It will stage a collision of academic frontiers and commercial applications, showcasing leading technological achievements across infrastructure, models, and products [4].
- The event will feature the authoritative release of the annual AI rankings and the annual AI trend report [5][116].

Group 2: Notable Speakers
- Zhang Yaqin, President of Tsinghua University's Intelligent Industry Research Institute and an academician of the Chinese Academy of Engineering, will be a key speaker [11][12].
- Sun Maosong, Executive Vice President of Tsinghua University's AI Research Institute, has led numerous national projects and is a prominent figure in AI research [15].
- Wang Zhongyuan, Director of the Beijing Academy of Artificial Intelligence, has extensive experience in AI core technology development and has published over 100 papers [19].

Group 3: Industry Impact
- The annual AI rankings initiated by QbitAI have become among the most influential lists in the AI industry, evaluating companies, products, and individuals across three dimensions [117].
- The annual AI trend report will analyze ten significant AI trends based on technology maturity, implementation status, and potential value, highlighting representative organizations and best cases [118].
- The conference aims to attract thousands of tech professionals and millions of online viewers, establishing itself as an annual barometer of the smart-technology industry [122].
Update: MEET2026 Speaker Lineup Refreshed Again; Audience Registration Open, Act Fast
量子位· 2025-12-03 02:38
Core Insights
- The MEET2026 Smart Future Conference will focus on cutting-edge technologies and industry developments that drew significant attention throughout the year [1].
- The theme "Symbiosis Without Boundaries, Intelligence to Ignite the Future" emphasizes how AI and smart technologies penetrate industries, disciplines, and scenarios, becoming a core driving force of societal evolution [2].

Group 1: Conference Highlights
- The conference will cover this year's hot topics in tech, including reinforcement learning, multimodal AI, chip computing power, AI across industries, and AI going global [3].
- It will feature the latest collisions between academic frontiers and commercial applications, showcasing leading technological achievements across infrastructure, models, and products [4].
- The event will also include the authoritative release of the annual AI rankings and the annual AI trend report [5][116].

Group 2: Notable Speakers
- Zhang Yaqin, President of Tsinghua University's Intelligent Industry Research Institute and an academician of the Chinese Academy of Engineering, will be a key speaker [11].
- Sun Maosong, Executive Vice President of Tsinghua University's Artificial Intelligence Research Institute, has led numerous national projects [15].
- Wang Zhongyuan, Director of the Beijing Academy of Artificial Intelligence, has extensive experience in AI core technology research [19].
- Wang Ying, Vice President of Baidu Group, oversees several key business units including Baidu Wenku and Baidu Netdisk [24].
- Han Xu, Founder and CEO of WeRide, has led the company to become a leader in autonomous driving technology [28].

Group 3: Annual AI Rankings and Trends
- The "Artificial Intelligence Annual Rankings" initiated by QbitAI have become among the most influential rankings in the AI industry, evaluating companies, products, and individuals across three dimensions [117].
- The "2025 Annual AI Top Ten Trends Report" will analyze ten AI trends that are releasing significant potential, considering factors like technological maturity and practical application [118].

Group 4: Event Details
- The MEET2026 Smart Future Conference is scheduled for December 10, 2025, at the Beijing Jinmao Renaissance Hotel, with registration now open [119].
- The conference aims to attract thousands of tech professionals and millions of online viewers, establishing itself as an annual barometer of the smart-technology industry [122].
AI Industry Briefing: From DeepSeek V3
2025-12-03 02:12
Summary of Key Points from the Conference Call

Industry and Company Overview
- The conference call discusses advancements in the AI industry, focusing on the DeepSeek-V3.2 model developed by DeepSeek, which shows significant improvements in reinforcement learning and inference efficiency [1][3][5].

Core Insights and Arguments
- **Model Architecture and Mechanisms**: DeepSeek-V3.2 introduces the DeepSeek Sparse Attention (DSA) mechanism, replacing the previous Multi-head Latent Attention (MLA) mechanism. DSA improves computational efficiency by concentrating attention computation on the most relevant positions, particularly in complex tasks [3][5].
- **Performance Enhancements**: The C9 version of DeepSeek-V3.2 uses approximately 10% of the pre-training computational resources to significantly enhance its performance on complex tasks such as code debugging, reaching a globally leading level [1][3].
- **Context Management Strategy**: The model employs an efficient context-management strategy that intelligently handles frequent task switching, multi-turn dialogue, and ambiguous inputs, effectively reducing inference costs [1][3].
- **Synthetic Data Utilization**: The training process incorporates a substantial amount of high-difficulty synthetic data, doubled compared to previous versions. This data is crucial for the subsequent reinforcement-learning phase and requires significant computational resources [1][6].
- **Open-Source Innovations**: DeepSeek has advanced open-source capability by completing a comprehensive post-training process and supporting agent invocation, potentially leveling the playing field with closed-source models [7].

Additional Important Insights
- **Reinforcement Learning Developments**: The evolution of reinforcement-learning techniques has been marked by the introduction of rubric-based human prompts, enhancing the model's ability to think and execute simultaneously and improving overall efficiency [8][9].
- **Future of Model Pricing**: By 2026, model costs are expected to decrease significantly, potentially dropping to one-fifth of current prices, driven by technology advances and competitive pricing among vendors [2][20].
- **Impact of Sparsity Techniques**: Sparsity techniques are expected to lower training compute requirements while raising the upper limits of model training, encouraging more startups to engage in large-model development [2][19].
- **Vertical Scene Task Solutions**: In e-commerce, reinforcement learning lets the model adapt recommendations to user feedback through multi-turn dialogue mechanisms, enhancing user satisfaction [12].

Conclusion
- The advancements in DeepSeek-V3.2 highlight a significant shift in the AI landscape, emphasizing efficient computational mechanisms, the role of synthetic data, and the potential of open-source models to compete with proprietary solutions. The expected decrease in model costs and the rise of new startups indicate a dynamic, evolving market [1][2][20].
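The efficiency idea behind sparse attention can be sketched in miniature: score every key, keep only the top-k, and softmax over that subset, so per-query compute downstream scales with k rather than the full context length. This toy is illustrative only; the real DSA indexer and selection mechanism differ.

```python
import math

def sparse_attention_weights(query, keys, k=2):
    """Toy top-k sparse attention: returns a dict mapping the k selected
    key indices to their softmax weights; all other keys get zero weight
    (and their values would never be fetched)."""
    # Dot-product score for every key (the cheap "indexing" pass).
    scores = [sum(q * kk for q, kk in zip(query, key)) for key in keys]
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax restricted to the selected subset.
    exps = {i: math.exp(scores[i]) for i in top}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}
```

With k fixed, the expensive value-gathering step touches k entries per query instead of the whole context, which is what makes long-context training and inference cheaper.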
The DeepSeek-V3.2 Technical Report: Overseas Readers Still Read It Most Closely
量子位· 2025-12-03 00:11
Core Insights
- The article discusses the launch of two open-source models, DeepSeek-V3.2 and DeepSeek-V3.2-Speciale, which have drawn significant attention in Silicon Valley and signal a shift in the competitive landscape of AI models [2][6].

Group 1: Model Performance
- DeepSeek-V3.2 reaches the highest level among current open-source models, significantly narrowing the gap with top closed-source models [6].
- The standard version of DeepSeek-V3.2 matches GPT-5-level performance, while the Speciale version surpasses GPT-5 and competes closely with Gemini-3.0-Pro on mainstream reasoning tasks [7][8].
- DeepSeek-V3.2-Speciale won gold medals in various competitions, demonstrating its advanced capabilities [9].

Group 2: Technical Innovations
- The model uses DSA sparse attention to address efficiency on long contexts, laying the groundwork for subsequent long-sequence reinforcement learning [14].
- By introducing scalable reinforcement learning and allocating over 10% of pre-training compute to post-training, the model significantly enhances general reasoning and agent capabilities [15].
- The Speciale version allows extended reasoning chains, enabling deeper self-correction and exploration and unlocking stronger reasoning without increasing pre-training scale [16][17].

Group 3: Economic Implications
- In output-token cost, DeepSeek-V3.2 is approximately 24 times cheaper than GPT-5 and 29 times cheaper than Gemini 3 Pro [29][30].
- Generating extensive content with DeepSeek-V3.2 costs significantly less, making it an economically attractive option relative to its competitors [31][32].
- Deployment on domestic computing power (e.g., Huawei, Cambricon) could further reduce inference costs, posing a challenge to established players like Google and OpenAI [36].

Group 4: Market Impact
- The success of DeepSeek-V3.2 challenges the notion that open-source models lag behind closed-source ones, indicating a potential shift in market dynamics [10][26].
- The gap between DeepSeek and top models is now more an economic issue than a technical one: with sufficient resources, open-source models can compete effectively [26].
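The "24x / 29x cheaper" claims above are simple per-token price ratios. The helper below shows the arithmetic; the prices in the example call are hypothetical placeholders, not the vendors' actual rates.

```python
def cost_ratio(price_per_mtok_a, price_per_mtok_b):
    """How many times cheaper model A is than model B, given each
    model's price per million output tokens."""
    return price_per_mtok_b / price_per_mtok_a

# Hypothetical example: if model A charged $0.50 and model B $12.00
# per million output tokens, A would be 24x cheaper per output token.
ratio = cost_ratio(0.50, 12.00)
```

Because the ratio is per-token, it scales directly with workload: a job emitting a billion output tokens costs 24x less on model A under the same assumption.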