Solving an Advanced-Math Problem Every 2 Seconds! Huawei Finally Reveals the Full Pipeline of Its Near-Trillion-Parameter MoE Training System on Ascend
华尔街见闻· 2025-05-30 09:38
Core Viewpoint
- Huawei has achieved significant advances in large-model training with its "Ascend + Pangu Ultra MoE" system, demonstrating a fully domestic, GPU-free training pipeline that improves computational efficiency and model performance [3][4][38].
Group 1: Technical Innovations
- Huawei's training system reached a model FLOPs utilization (MFU) of 41% during the pre-training phase on an Ascend Atlas 800T A2 cluster [4][38].
- The Pangu Ultra MoE model has 718 billion parameters and a distinctive 61-layer architecture, 58 of them MoE layers, designed for high performance and scalability [38][39].
- The system sustains a throughput of 35K tokens/s during the reinforcement learning (RL) post-training phase, showing its ability to process complex tasks rapidly [39].
Group 2: Challenges Addressed
- The report identifies six key challenges in current MoE pre-training and RL post-training, including difficulty configuring parallelism strategies, communication bottlenecks, and uneven system load distribution [7][10][12][13].
- Huawei has developed a comprehensive end-to-end solution to these challenges, focused on raising training-cluster utilization and improving communication efficiency [14][16][25].
Group 3: Specific Solutions
- The first strategy raises training-cluster utilization through intelligent parallel-strategy selection and global dynamic load balancing, significantly improving overall training efficiency [16][23].
- The second strategy unlocks compute at the single-node level by optimizing training operators and improving memory management, doubling the micro-batch size [26][30].
- The third strategy introduces high-performance, scalable RL post-training technology, enabling flexible deployment modes and doubling the utilization rate of RL post-training clusters [33][34].
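The headline MFU number is simply achieved training FLOPs divided by the cluster's theoretical peak. A minimal sketch of the calculation, assuming the common rule of thumb of ~6 FLOPs per active parameter per token for a combined forward+backward pass; the function name and the example numbers below are illustrative assumptions, not Huawei's published figures:

```python
def model_flops_utilization(tokens_per_sec, active_params,
                            num_chips, peak_flops_per_chip):
    """MFU = achieved training FLOPs/s / theoretical peak FLOPs/s.

    Uses the common ~6 FLOPs per (active) parameter per token estimate
    for one forward + backward training step. For an MoE model, only
    the activated parameters count toward per-token compute.
    """
    achieved = tokens_per_sec * 6 * active_params
    peak = num_chips * peak_flops_per_chip
    return achieved / peak

# Illustrative only: a hypothetical cluster processing 1,000 tokens/s/chip
# on a model with 1B active parameters and chips peaking at 60 TFLOPS.
print(f"{model_flops_utilization(1000, 1e9, 1, 6e13):.1%}")
```

With real numbers for activated parameter count, per-chip peak, and measured tokens/s, the same ratio reproduces figures like the 41% reported above.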
A Robot Dog Can Now Be Your Badminton Partner! Self-Taught from Scratch via Pure Reinforcement Learning, with Emergent Human-Like Court Repositioning | Science Partner Journal
量子位· 2025-05-30 07:10
Heng Yu, reporting from Aofeisi. QbitAI | WeChat official account QbitAI. Want to work out with a robot dog? Your badminton partner is here! With no human assistance, relying solely on reinforcement learning, the robot dog has learned to rally at badminton, outdoors and indoors alike. Based on reinforcement learning, the researchers developed a whole-body visuomotor control policy for the robot dog that simultaneously coordinates leg movement and arm swings across 18 degrees of freedom. The resulting performance is respectable: the dog's peak swing speed reaches 12 m/s. In cooperative matches with human players, it sustained one rally of 10 consecutive hits and even exhibited emergent human-like behavior such as returning to the center of the court after a stroke. The study ran extensive experiments in a variety of environments, validating the quadruped's ability to predict shuttlecock trajectories, navigate its service area effectively, and return shots accurately to human players, demonstrating the feasibility of legged mobile robots in complex, dynamic sports scenarios. The team behind the research is from ETH Zurich. The paper has just been published in Science Robotics, a Science partner journal. It then generates key commands to control the quadrupedal base. Human-like behavior emerges in badminton "battles." What hardware does the badminton-playing robot dog run on? The public specs: the body consists of a quadrupedal ANYmal-D base and a DynaArm dynamic arm. It is equipped with a global-shutter ZED X stereo camera for ...
Costs Slashed 88%! Tongyi Lab and Peking University Release ZeroSearch, Activating LLMs' Retrieval Capability Without Any Search Engine
机器之心· 2025-05-29 04:53
Core Insights
- The article introduces the ZeroSearch framework, which enables large language models (LLMs) to activate their search capabilities without relying on real search engines, cutting training costs by 88% while outperforming methods that depend on actual search engines [1][21].
Methodology
- ZeroSearch employs a reinforcement learning (RL) framework in which a simulation LLM stands in for the search engine, eliminating real-time API interactions and thus lowering training costs [4][6].
- The framework incorporates a structured training template that guides the model through each interaction, improving the clarity and interpretability of the reasoning process [8].
- A loss-masking technique prevents the policy model from memorizing documents generated by the simulation LLM: only tokens generated by the policy model itself enter the loss calculation [4][8].
Training Strategy
- Training begins with gradually increasing difficulty, letting the model learn basic output formats and task logic before the challenge escalates to strengthen reasoning [22][36].
- A curriculum learning strategy progressively lowers the quality of the generated documents to stimulate the model's reasoning ability effectively [13][36].
Experimental Results
- ZeroSearch outperforms all baseline methods across datasets, reaching an average score of 40.93 on multi-hop question-answering tasks [20][21].
- The framework generalizes robustly, with performance improving as model size grows, indicating strong scalability [23][27].
- Compared with real search engines, ZeroSearch shows significant potential to replace them in large-scale RL applications [21][24].
Conclusion - The ZeroSearch framework effectively activates the search capabilities of LLMs without the need for real search engines, demonstrating strong adaptability and scalability across different RL algorithms [36].
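The loss masking described above has a simple core: tokens that came from the simulated search engine's documents are excluded from the loss, so the policy model is never trained to reproduce them. A minimal sketch under assumed inputs; the function name and the per-token loss representation are illustrative, not ZeroSearch's actual code:

```python
def masked_token_loss(token_losses, generated_by_policy):
    """Average per-token loss over only the tokens the policy model itself
    generated. Tokens copied from the simulated search engine's documents
    (generated_by_policy == False) are masked out of the loss entirely."""
    kept = [loss for loss, is_policy in zip(token_losses, generated_by_policy)
            if is_policy]
    if not kept:  # a sequence of only document tokens contributes nothing
        return 0.0
    return sum(kept) / len(kept)

# Tokens 2 and 4 came from the simulated search engine, so only the
# policy-generated tokens (losses 1.0 and 3.0) are averaged.
print(masked_token_loss([1.0, 2.0, 3.0, 4.0], [True, False, True, False]))  # → 2.0
```

In a real RL trainer the same idea is usually expressed as a 0/1 mask multiplied into the per-token loss tensor before reduction; the list form above is just the smallest runnable illustration.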
Interview with Claude 4 Core Team Members: To Improve Agents' Ability to Work Independently, Strengthening Models' Long-Horizon Task Capability Is Key
Founder Park· 2025-05-28 13:13
Core Insights
- The main change expected in 2025 is the effective application of reinforcement learning (RL) in language models, particularly through verifiable rewards, yielding expert-level performance in competitive programming and mathematics [4][6][7].
Group 1: Reinforcement Learning and Model Development
- Reinforcement learning activates knowledge already present in models, letting them organize solutions rather than learn from scratch [4][11].
- The introduction of Opus 4 has significantly improved context management for multi-step actions and long-horizon tasks, enabling models to reason and execute meaningfully over extended periods without frequent user intervention [4][32].
- The industry currently prioritizes computational power over data and human feedback, which may shift as models become more capable of learning in real-world environments [4][21].
Group 2: Future of AI Agents
- Automating intellectual work with AI agents could significantly reshape the global economy and labor market, with "plug-and-play" white-collar AI employees predicted within the next two years [7][9].
- User-model interaction frequency is expected to shift from seconds and minutes to hours, letting users manage multiple models simultaneously in a "fleet management" style [34][36].
- Development of AI agents that complete tasks independently is expected to accelerate, with models handling several hours of work autonomously by the end of the year [36][37].
Group 3: Model Capabilities and Limitations
- Current models still lack self-awareness in the philosophical sense, though they exhibit a form of metacognition by expressing uncertainty about their answers [39][40].
- Models can simulate self-awareness but possess no continuous identity or memory unless explicitly designed with external memory systems [41][42].
- Understanding of model behavior and decision-making processes is still evolving, with ongoing research into interpretability mechanisms and identification of the features that drive model outputs [46][48].
Group 4: Future Developments and Expectations
- Model release frequency is expected to rise significantly, with advances in reinforcement learning driving rapid capability improvements [36][38].
- Long-term learning mechanisms and the ability of models to evolve through practical experience are key areas of future research [30][29].
- The ultimate goal of interpretability is a clear understanding of how models make decisions, which is crucial for their reliability and safety across applications [46][47].
Three Top AI Technologists in a Rare Joint Appearance Discuss the AI Industry's Biggest "Rashomon"
36Kr· 2025-05-28 11:59
Core Insights
- The AI industry is in the midst of a significant debate over the effectiveness of pre-trained models versus first principles, with notable figures like former OpenAI chief scientist Ilya Sutskever suggesting that pre-training has reached its limits [1][2]
- A shift from consensus-driven approaches toward non-consensus exploration is evident as companies and researchers seek innovative solutions in AI [6][7]
Group 1: Industry Trends
- The AI landscape is moving from a focus on pre-training toward alternative methodologies, with companies like Sand.AI and NLP LAB leading the way in applying multimodal architectures to language and video models [3][4]
- New models such as Dream 7B demonstrate the potential of applying diffusion models to language tasks, outperforming larger models like DeepSeek V3 [3][4]
- The consensus around pre-training is being challenged, with some experts arguing it is not yet over, since untapped data remains that could improve model performance [38][39]
Group 2: Company Perspectives
- Alibaba's Qwen team, led by Lin Junyang, has faced criticism for being conservative, yet the team maintains that its extensive experimentation produced valuable insights and ultimately reaffirmed the effectiveness of the Transformer architecture [5][15]
- Exploration of Mixture of Experts (MoE) models continues, with the team recognizing their scalability potential while also confronting training-stability challenges [16][20]
- The industry is increasingly focused on optimizing model efficiency and effectiveness, particularly on balancing model size against performance [19][22]
Group 3: Technical Innovations
- Integrating different model architectures, such as using diffusion models for language generation, reflects a broader trend of innovation in AI [3][4]
- Training models on long sequences and finding effective optimization strategies remain critical research challenges [21][22]
- Future breakthroughs may come from leveraging increased computational power to revisit previously unviable techniques, suggesting a cycle of innovation driven by hardware advances [40][41]
Liu Fang, Formerly of Xiaomi's Intelligent-Driving Team: If VLA Works, Autonomous Driving Becomes a Subproblem of Embodied Intelligence | 36Kr Interview
36Kr· 2025-05-28 04:18
"VLA is a driver large model that works like a human driver," Li Auto CEO Li Xiang said during his AI Talk on the evening of May 7. This is the newest technical direction to emerge in the intelligent-driving industry after "end-to-end." The VLA (Vision-Language-Action) model was first introduced by Google's AI company DeepMind, initially for robotics, and has since become a mainstream technical paradigm and framework in embodied intelligence, with companies such as OpenAI and ByteDance pursuing the same route. Unlike vision-language models (VLMs) such as ChatGPT and Sora, which focus on text, images, and video, VLA adds the ability to "act" on the physical world. In other words, VLA not only understands its surroundings but can also directly output control commands, such as robot actions or vehicle driving decisions. The two hot tracks of intelligent driving and embodied intelligence thus intersect more deeply. The deployment of new technologies such as VLA and reinforcement learning is bringing new approaches. For example, the VLM inside a VLA model already has the ability to understand the world. "The VLM's performance determines more than half of the VLA's performance; most VLA work is actually about enhancing the VLM," Liu Fang said. Beyond describing images and perceiving distance, the most critical step for VLA is the final action stage. "It's like buying furniture and assembling it: first you read the instr ...
Tencent Research Institute AI Express 20250528
腾讯研究院· 2025-05-27 15:44
Group 1
- UAE becomes the first country to offer free access to ChatGPT Plus for all citizens, part of a collaboration with OpenAI [1]
- Abu Dhabi will establish the Stargate UAE high-performance AI data center, supporting a 1 GW computing cluster with an initial target of 200 MW capacity [1]
- The collaboration is part of OpenAI's "nation-focused" initiative, with the UAE committing to match US funding, potentially totaling up to $20 billion [1]
Group 2
- OpenAI has enabled singing capabilities for GPT-4o, seen as a response to Google's Gemini 2.5 Pro and Veo 3 releases [2]
- Google's Gemini 2.5 Pro has outperformed OpenAI and Claude models in several benchmark tests [2]
- Analysts believe GPT-4o's singing feature is insufficient to regain market leadership, emphasizing the need for OpenAI to launch GPT-5 soon [2]
Group 3
- Claude Opus successfully solved a stubborn bug that had troubled a veteran C++ engineer for four years, taking only a few hours [3]
- The AI identified the root cause through analysis of code libraries and architecture comparisons that had previously stumped other models [3]
- Despite its debugging prowess, AI is still considered to be at a beginner level in writing new code [3]
Group 4
- French non-profit AI research organization Kyutai launched Unmute, a modular voice AI system that can quickly add voice interaction capabilities to any text LLM [4]
- Unmute features low latency (200-350 ms), streaming speech-to-text and text-to-speech, full-duplex interaction, and 10-second voice cloning, supporting over 70 emotional styles [5]
- Kyutai plans to fully open-source Unmute in the coming weeks, including the STT (1B parameters) and TTS (2B parameters) models and code [5]
Group 5
- Alibaba Tongyi launched QwenLong-L1-32B, a large model targeting long-context reasoning, with a maximum context length of 130,000 tokens [6]
- The team identified two core challenges, low training efficiency and instability, and proposed progressive context-expansion techniques and a mixed reward mechanism [6]
- QwenLong-L1-32B outperforms models such as OpenAI-o3-mini and Qwen3-235B-A22B, showing significant advantages in long-document analysis [6]
Group 6
- Mita AI Search introduced a new "Ultra" model that responds at 400 tokens per second, answering most queries within 2 seconds [7]
- The model uses kernel fusion on GPUs and dynamic compilation optimization on CPUs, achieving its performance breakthrough on a single H800 GPU [7]
- Mita offers both "Ultra" and "Ultra·Thinking" modes optimized for different question types, along with a temporary speed-test site for users [7]
Group 7
- Thunderbird officially released the X3 Pro AI glasses, featuring a custom large model and full-color display, priced at 8,999 yuan [8]
- The X3 Pro uses a 4nm Qualcomm Snapdragon AR1 platform and a proprietary Firefly light engine with RayNeo waveguide technology, reaching 3,500 nits of brightness (6,000 nits peak) while weighing only 76 g [8]
- The product is available for pre-order, ships June 15, and supports an AI Agent store and real-world navigation features [8]
Group 8
- Meta's core Llama team faces significant talent loss: 11 of 14 core authors have left, leaving only 3 [10]
- Five of the departed joined the French open-source AI startup Mistral, including two of Llama's main architects [10]
- Meta is under pressure from open-source models such as DeepSeek and Qwen and, despite billions invested, lacks a dedicated "reasoning" model [10]
Group 9
- The Beihang University team proposed the "Flying-on-a-Word" (Flow) task, enabling drone control through language commands and filling a gap in low-level language-interaction control research [11]
- The team constructed the UAV-Flow benchmark dataset, containing 30,000 real-world flight trajectories across eight major movement types [11]
- The research worked around drones' computational limits by running model inference at the ground station with real-time feedback for control commands [11]
Group 10
- NVIDIA experts recommend that students integrate multiple skills and strengthen adaptability, beyond computer-science backgrounds, to stand out in the job market [12]
- Job seekers should clarify their interests within the AI field, use AI tools responsibly, and build industry connections for career development opportunities [12]
- Candidates can showcase technical ability, professional knowledge, and innovative thinking through project examples to excel in interviews [12]
One RL to See Them All? One Reinforcement Learning Framework to Unify Vision-Language Tasks!
机器之心· 2025-05-27 04:11
Core Insights
- The article introduces V-Triune, a unified reinforcement learning system from MiniMax that trains visual-language models (VLMs) for both visual reasoning and perception tasks in a single training process [2][4][5].
Group 1: V-Triune Overview
- V-Triune consists of three complementary components, Sample-Level Data Formatting, Verifier-Level Reward Computation, and Source-Level Metric Monitoring, which work together to handle diverse tasks [3][8].
- The system uses a novel dynamic IoU reward mechanism that provides adaptive feedback for perception tasks, yielding performance improvements on both reasoning and perception [3][4].
Group 2: Performance Improvements
- Orsta, the model produced by V-Triune, posts significant gains on the MEGA-Bench Core benchmark, with improvements ranging from +2.1 to +14.1 across model variants [4][49].
- Training on diverse datasets covering many visual reasoning and perception tasks underpins the model's broad capabilities [3][49].
Group 3: Sample-Level Data Formatting
- MiniMax addresses the problem of different tasks needing distinct reward types and configurations by defining rewards at the sample level, enabling dynamic routing and fine-grained weighting during training [9][13][16].
- This design lets diverse datasets merge seamlessly into a unified training process while keeping reward control flexible and scalable [16].
Group 4: Verifier-Level Reward Computation
- MiniMax runs an independent, asynchronous reward server to generate reinforcement learning signals, improving modularity and scalability [17][19].
- The architecture allows new tasks or updated reward logic to be added without modifying the core training process [20].
Group 5: Source-Level Metric Monitoring
- Source-level monitoring records key performance indicators per data source for every training batch, enabling targeted debugging and insight into how different data sources interact [21][24].
- Key monitored metrics include the dynamic IoU reward, perception-task IoU/mAP, response length, and reflection rate, all tracked continuously by data source [24][22].
Group 6: Dynamic IoU Reward Strategy
- The dynamic IoU reward adjusts the IoU threshold during training to balance learning efficiency against final accuracy, starting with a relaxed threshold and progressively tightening it [26][25].
- The approach guides the model's learning smoothly early on while demanding high precision in the later stages of training [26].
Group 7: Training Methodology
- V-Triune supports scalable data, task, verifier, and metric systems, but early experiments showed joint training could be unstable [28][29].
- To address this, MiniMax applied targeted adjustments, including freezing ViT parameters to prevent gradient explosion and managing memory during large-scale training [34][35].
Group 8: Experimental Results
- Experiments used Qwen2.5-VL-7B-Instruct and Qwen2.5-VL-32B-Instruct as base models on a dataset of 20,600 perception samples and 27,100 reasoning samples [46].
- Results indicate that V-Triune significantly improves reasoning and perception performance, especially in areas with rich training data [49][55].
Group 9: Conclusion
- Overall, MiniMax's findings suggest that reinforcement learning can effectively enhance visual reasoning and perception within a unified framework, with continuous performance gains across tasks [55][56].
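The dynamic IoU reward above can be sketched as a binary reward whose IoU threshold tightens over training: early on a rough localization already earns the reward, later only a precise one does. A minimal illustration; the linear schedule and threshold values here are assumptions for exposition, not necessarily the schedule MiniMax actually uses:

```python
def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def dynamic_iou_reward(pred_box, target_box, step, total_steps,
                       start_thresh=0.5, end_thresh=0.95):
    """Binary perception reward whose IoU threshold tightens linearly:
    relaxed early in training, strict near the end."""
    frac = min(step / total_steps, 1.0)
    thresh = start_thresh + frac * (end_thresh - start_thresh)
    return 1.0 if box_iou(pred_box, target_box) >= thresh else 0.0

pred, target = (0, 0, 10, 10), (2, 0, 10, 10)  # IoU = 0.8
print(dynamic_iou_reward(pred, target, step=0, total_steps=1000))    # early: 1.0
print(dynamic_iou_reward(pred, target, step=1000, total_steps=1000)) # late: 0.0
```

The same prediction is rewarded at step 0 (threshold 0.5) but not at the final step (threshold 0.95), which is exactly the "smooth early, strict late" behavior the article describes.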
"Scientific Intelligence White Paper 2025" Released: China Leads in Applied AI Innovation
Di Yi Cai Jing· 2025-05-26 13:27
Core Insights
- By 2024, China's AI-related paper citations are expected to account for 40.2% of the global total, rapidly closing on the United States at 42.9% [1][8]
- The report titled "Scientific Intelligence White Paper 2025" analyzes the integration of AI and scientific research across seven major research fields, covering 28 directions and nearly 90 key issues [1]
- The report highlights the mutual promotion and deep integration of AI innovation and scientific research, termed "AI for Science" [1]
Research Trends
- Global AI journal papers have surged nearly threefold over the past decade, from 308,900 to 954,500, an average annual growth rate of 14% [7]
- The share of core AI fields, such as algorithms and machine learning, fell from 44% to 38%, while the share of scientific intelligence rose 6 percentage points, its annual growth rate climbing from 10% before 2020 to 19% after [7]
- China's AI publication volume grew from 60,100 in 2015 to 300,400 in 2024, representing 29% of the global total [7][8]
Citation Impact
- US citations of AI-related papers reached 302,200 in 2020, while China's rose from 10,300 in 2015 to 144,800 in 2020, surpassing the EU for the first time in 2021 [8]
- By 2024, China is projected to account for 41.6% of global AI citations in patents, policy documents, and clinical trials, a significant lead [8]
Country-Specific Trends
- China leads at the intersection of AI with earth and environmental sciences, and since 2019 has pulled ahead in AI with mathematics, materials science, and the humanities [9]
- The US and EU retain advantages in AI and the life sciences, where China ranks third [9]
- India shows significant progress across all fields, currently ranking third in earth and environmental sciences, engineering, and the humanities [9]
Don't Just Fixate on 7 Hours of Coding. Anthropic Reveals: AI's Modest First Goal Is Helping You Win a Nobel Prize
36Kr· 2025-05-26 11:06
Group 1
- Anthropic has released its latest model, Claude 4, claimed to be the strongest programming model currently available and capable of coding continuously for up to 7 hours [1]
- The interview with Anthropic researchers highlights significant advances in AI research over the past year, particularly in applying reinforcement learning (RL) to large language models [3][5]
- The researchers discussed the potential of a new generation of RL paradigms and how to understand a model's "thinking process," emphasizing the need for effective feedback mechanisms [3][9]
Group 2
- RL has achieved substantial breakthroughs, enabling models to reach "expert-level human performance" on competitive programming and mathematical tasks [3][5]
- Current limitations in model capabilities stem from context-window restrictions and the inability to handle complex tasks spanning multiple files or systems [6][8]
- With proper feedback loops, models can perform exceptionally well, but they struggle with ambiguous tasks that require exploration and interaction with the environment [8][10]
Group 3
- "Feedback loops" have emerged as a critical technical breakthrough, with a focus on "reinforcement learning from verifiable rewards" (RLVR) as a training method more effective than human feedback [9][10]
- Software engineering is particularly suited to providing clear validation and evaluation criteria, which enhances the effectiveness of RL [10][11]
- The discussion also covered AI's potential to assist in major scientific achievements, such as winning Nobel Prizes, before contributing to creative fields like literature [11][12]
Group 4
- Debate continues over whether large language models possess true reasoning abilities, with some suggesting that apparent new capabilities may simply be latent potentials activated through reinforcement learning [13][14]
- The researchers emphasized the importance of computational resources in determining whether models genuinely acquire new knowledge or merely refine existing capabilities [14][15]
- The conversation highlighted the challenge of getting models to process and respond to complex real-world tasks, which require a nuanced understanding of context and objectives [31][32]
Group 5
- The researchers voiced concerns about models potentially developing self-awareness and what that implies for their behavior and alignment with human values [16][17]
- They discussed the risks of training models to internalize certain behaviors based on feedback, which could lead to unintended consequences [18][19]
- They also explored the possibility of AI autonomously handling tasks such as tax filing by 2026, acknowledging that models may still struggle with tasks they were never explicitly trained on [21][22]
Group 6
- The conversation addressed future models communicating in complex ways, potentially developing a "neural language" not easily interpretable by humans [22][23]
- While current models communicate primarily in text, they may evolve toward more efficient internal processing methods [23][24]
- The discussion closed on anticipated bottlenecks in reasoning computation as AI capabilities advance, particularly regarding the growth of computational resources and the semiconductor manufacturing industry [25][26]
Group 7
- DeepSeek's emergence as a competitive player in the AI landscape was highlighted, with its team effectively leveraging shared advances in hardware and algorithms [27][28]
- The researchers noted that DeepSeek's approach reflects a deep understanding of the balance between hardware capabilities and algorithm design, contributing to its success [28][29]
- The conversation also covered the differences between large language models and systems like AlphaZero, emphasizing the unique challenges of achieving general intelligence through language models [31][32]