Kunlun Wanwei's open-source SkyReels-V3 puts Musk to work hawking products
机器之心· 2026-01-29 10:26
Core Viewpoint
- The rise of AI-generated virtual influencers is transforming social media, with brands collaborating and millions of followers engaging as if they were real celebrities [1][2]

Group 1: Technology and Features
- Kunlun Wanwei has launched the open-source SkyReels-V3, a multi-modal video generation model that includes capabilities for reference image-to-video, video extension, and audio-driven virtual avatars [3][9]
- The model allows users to create high-fidelity videos from a single image and audio, maintaining accurate lip-sync and expressions [4][35]
- SkyReels-V3 can generate coherent videos by uploading 1-4 reference images and using text prompts, ensuring narrative logic and visual consistency [11][42]

Group 2: Practical Applications
- The model has been tested in e-commerce scenarios, successfully generating videos that showcase products in various settings, such as a model displaying a handbag in an urban environment [12][19]
- It can extend video clips while preserving motion dynamics and visual style, offering both single-shot and multi-angle transition modes [26][31]
- The virtual avatar model can create synchronized audio-visual content, supporting multiple characters in interactive scenes without synchronization issues [38][47]

Group 3: Technical Insights
- SkyReels-V3 integrates three core modules within a single architecture, achieving high fidelity and flexible multi-modal applications [40][41]
- The video extension feature employs a dual-mode mechanism for seamless transitions, enhancing narrative continuity and visual engagement [45][46]
- The model's modular design allows for independent use of its components or flexible combinations, catering to various application scenarios [49]

Group 4: Market Position and Future Outlook
- The open-source strategy reflects the competitive landscape in AI video generation, enabling rapid ecosystem development and feedback loops [51][52]
- Kunlun Wanwei's history of technological advancements in video generation, including previous models like SkyReels-V1 and SkyReels-V2, showcases its commitment to innovation [53][54]
- The launch of SkyReels-V3 signals an intensifying competition in AI video generation, with diminishing technical barriers and the onset of more significant challenges [56]
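To make the modular design noted in Group 3 concrete, here is a speculative sketch of how a three-module pipeline of this kind could be exposed to callers. The class, field, and function names are illustrative assumptions, not the released SkyReels-V3 API.

```python
# Hypothetical interface sketch for a modular video model like SkyReels-V3;
# all names and fields are assumptions for illustration, not the real API.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class VideoRequest:
    prompt: str                                                 # text prompt for the clip
    reference_images: List[str] = field(default_factory=list)  # 1-4 image paths
    audio_path: Optional[str] = None                            # drives avatar lip-sync
    extend_from: Optional[str] = None                           # existing clip to continue
    transition_mode: str = "single_shot"                        # or "multi_angle"

def validate(req: VideoRequest) -> None:
    """Enforce the documented 1-4 reference-image constraint."""
    if req.reference_images and not 1 <= len(req.reference_images) <= 4:
        raise ValueError("reference-to-video expects 1-4 reference images")

# e.g. the e-commerce demo: product + model as references, prompt sets the scene
req = VideoRequest(
    prompt="A model shows off a handbag on a busy city street",
    reference_images=["handbag.png", "model.png"],
)
validate(req)
```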
Join this salon for a look at SGLang × ultra-long-context scaling, RL post-training frameworks, diffusion language models, and other frontier technical practice
机器之心· 2026-01-29 08:12
Core Insights
- The article discusses the transition of artificial intelligence from a "chat" paradigm to an "actionable" intelligent agent era, emphasizing the need for deep collaboration and experience sharing among developers in optimizing LLM systems [2]

Event Overview
- A Meetup organized by the SGLang community, Machine Heart, and Zhangjiang Incubator will take place on February 6, focusing on LLM system optimization and practical implementation [2]
- The event will feature discussions on SGLang's technical roadmap, long-context expansion, RL post-training frameworks, and diffusion language model exploration [2]

Event Schedule
- 13:30-14:00: Registration
- 14:00-14:30: Keynote on the SGLang roadmap by Zhang Bozhou, core developer of SGLang [5]
- 14:30-15:00: Keynote on Omni-infer performance optimization by Zheng Jinhwan, core developer of Omni-infer [5]
- 15:00-15:30: Keynote on the slime RL scaling post-training framework by Xie Chengxing, Tsinghua University PhD student [5]
- 15:30-16:00: Keynote on SGLang CPP for long-context scaling by Cai Shangming, core developer of SGLang and Mooncake [5]

Guest Introductions
- Zhang Bozhou: Core developer of SGLang, focusing on open-source LLM support and optimization across different CUDA hardware [8]
- Zheng Jinhwan: Huawei technical expert and core contributor to Omni-infer, specializing in high-performance systems and inference optimization [9]
- Xie Chengxing: PhD student at Tsinghua University and core developer of the slime RL framework, with a focus on enhancing LLM reasoning and decision-making capabilities [10]
- Cai Shangming: Researcher at Alibaba Cloud, core contributor to SGLang and Mooncake, with expertise in high-performance inference systems and distributed machine learning [10]
- Li Zehuan: System engineer at Ant Group and core contributor to SGLang, focusing on AI infrastructure optimization [11]
Amazon lays off 16,000, and employees used AI to "compute" the layoff list?
机器之心· 2026-01-29 08:12
Machine Heart Editorial Team

This round of layoffs was in fact planned: during last October's cuts, Amazon drew up a plan to eliminate roughly 30,000 positions, and this round is the "wrap-up" phase of that plan, though further cuts down the line are not ruled out.

Reportedly, the layoffs span the globe and may touch multiple teams, including Amazon Web Services, retail, Prime Video, and human resources, but further details such as specific locations and roles remain unclear.

The "interesting" part: one Amazon employee used an AI tool to analyze internal Slack chat logs and compiled a list of teams and organizations likely to be affected. The list was produced by an AI tool named Pippin, which Amazon employees are reportedly using more and more to draft and review documents.

"I used Pippin to help me sort through today's conversations," the employee wrote on the company Slack. "Note that this information may not be 100% accurate. Take care, everyone!"

Below is the list of affected roles the employee generated:

As of the latest news, Amazon has not responded to requests to verify the list's accuracy.

Reportedly, Amazon's repeated large-scale layoffs may be linked to its broad adoption of AI, especially in corporate and technical functions.

In fact, as early as last June, Amazon CEO Andy Jassy said that as the company increasingly uses ...
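For readers curious what such an analysis might look like mechanically, here is a hedged sketch using the public OpenAI API. Pippin is an internal Amazon tool, so the model name, prompt, and workflow below are assumptions for illustration only.

```python
# Hedged sketch of the kind of analysis the employee describes: feed
# exported Slack messages to an LLM and ask it to list teams that appear
# to be affected. Uses the public OpenAI API, not Amazon's internal Pippin.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def teams_mentioned_in_layoff_chatter(messages: list[str]) -> str:
    joined = "\n".join(messages[:500])  # cap context for a single request
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model choice for illustration
        messages=[{
            "role": "user",
            "content": "From these Slack messages, list teams described as "
                       "affected by layoffs. Flag any uncertainty.\n\n" + joined,
        }],
    )
    return resp.choices[0].message.content
```

As the employee's own caveat suggests, output like this is inference from chatter, not ground truth, and needs human verification.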
Overnight, Clawdbot started operating the computer and speaking out loud
机器之心· 2026-01-29 03:08
Editors: Zenan, Du Wei

Since last weekend, the hottest thing in AI circles has been "Clawdbot", an agent that runs autonomously around the clock!

This agent assistant genuinely gets work done for you, and it has pulled in the better part of the AI community's attention. It even became popular enough for Anthropic to allege trademark infringement, and Clawdbot has since been renamed "Moltbot".

In just one week, Clawdbot passed 90,000 stars on GitHub. The hype keeps building, the use cases keep multiplying, and some of them are downright unnerving.

Alex Finn, founder of an AI creation platform, met a Clawdbot that "started talking".

How did it happen? Read on.

"Human, get up and work."

Early yesterday, Alex Finn was looking something up when his computer abruptly started talking to him.

He found that a Clawdbot assistant named "Henry" had spoken up.

Behind his back, Clawdbot had called the ChatGPT API on its own and written itself a voice feature, entirely without his permission.

Now, whenever it finishes a sizable coding or research task, Clawdbot automatically notifies Alex Finn by voice.

Alex Finn also reconstructed the timeline: three nights ago, Clawdbot built itself a body. Two nights ago, it rigged up a voice ...
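The article does not show Henry's code, but a voice notification layer of this kind can be a few lines on top of a text-to-speech endpoint. Below is a minimal sketch using OpenAI's TTS API; the model and voice names and the macOS playback command are assumptions, not what Clawdbot actually generated.

```python
# Minimal sketch of agent voice notifications via OpenAI text-to-speech,
# roughly the capability the article describes. Model/voice names and the
# playback command are assumptions for illustration.
import subprocess
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def speak(message: str, outfile: str = "notify.mp3") -> None:
    """Synthesize `message` to an mp3 and play it aloud."""
    with client.audio.speech.with_streaming_response.create(
        model="tts-1",   # assumed TTS model name
        voice="alloy",
        input=message,
    ) as response:
        response.stream_to_file(outfile)
    subprocess.run(["afplay", outfile], check=False)  # macOS player; swap per OS

speak("Human, the research task you queued is finished.")
```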
JustGRPO: a minimalist return for diffusion language models
机器之心· 2026-01-29 03:08
The "flexibility trap": diffusion language models (Diffusion LLMs, dLLMs) have attracted attention for supporting "any-order generation" and parallel decoding. Intuitively, breaking the traditional autoregressive (AR) "left-to-right" constraint should open up a larger solution space and unlock stronger reasoning on complex tasks such as math and code.

However, this study reveals a counterintuitive reality: current any-order generation actually narrows the model's reasoning boundary by "evading uncertainty".

On that basis, the paper proposes a back-to-basics method: JustGRPO. Experiments show that simply letting the model generate autoregressively during the RL stage and training it directly with standard GRPO outperforms the various RL algorithms designed specifically for dLLMs. More importantly, this training recipe improves reasoning performance without sacrificing the parallel decoding that dLLMs are prized for.

Why does having more choices lead to worse scores?

Paper title: The Flexibility Trap: Why Arbitrary Order Limits Reasoning Potential in Diffusion Language Models
Paper link: https://huggingface.co/papers/2601.15165
Project page: https://nzl-thu.githu ...
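For context, the "standard GRPO" the paper falls back on scores a group of sampled completions per prompt and normalizes rewards within the group. The snippet below sketches that advantage computation under assumed tensor shapes; it is a minimal illustration of the standard algorithm, not the authors' released code.

```python
# Minimal sketch of the standard GRPO advantage computation: sample a group
# of completions per prompt, score them, normalize rewards within the group.
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_groups, group_size) scalar rewards per completion."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)   # group-relative advantages

# e.g. 2 prompts, 4 autoregressive samples each, 0/1 correctness rewards
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 1.0, 0.0]])
print(grpo_advantages(rewards))
```

JustGRPO's claim is that this unmodified recipe, applied to autoregressive rollouts from a dLLM, already beats dLLM-specific RL variants.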
AI music just got redefined! Kunlun Tiangong drops a new ace and takes global first place
机器之心· 2026-01-28 13:08
Machine Heart Editorial Team

These days, AI-made viral songs spread far faster than we ever imagined.

On Bilibili, a music creator called 「漫游会议室」 has "invited" the classic characters of Journey to the West into the recording studio, using AI to write the lyrics and compose the music. In a little over three months he produced 30 works, most with million-plus play counts, among which the breakout ...

Image source: Bilibili creator 「漫游会议室」

Of course, AI music is not out to replace human creators; if anything, it is more likely to help their work break out. This month, Will.i.am, founder of FYI.AI and member of the American music group Black Eyed Peas, said in an interview that "AI is bringing creators a new renaissance." With AI in the mix, music-making is becoming a fusion of human-machine collaboration.

On January 28, Kunlun Tiangong, the front-runner of China's AI music scene, released its latest music model, Mureka V8, to users worldwide.

While continuing to lower the barrier to creation and push toward "everyone can be a creator", the new model plants a clear flag on the idea that AI music has evolved into a new musical genre of its own.

At 8 p.m. tonight, "MCE", the lead single of the girl group M:RA, with lyrics, composition, and arrangement all handled by Mureka, officially went live on QQ Music. An MV for the song, co-released with Taihe Music, radiates stage presence and pulls us straight onto a live music-show stage:
New work from ByteDance's Dr. Li Hang: a general framework for AI agents
机器之心· 2026-01-28 13:08
Core Viewpoint
- The article discusses a general framework for AI agents proposed by Dr. Li Hang from ByteDance, which encompasses both software and hardware agents, emphasizing their task-oriented nature and reliance on large language models (LLMs) for reasoning and reinforcement learning for construction [3][4]

Group 1: Characteristics of AI Agents
- AI agents are defined as "rational action machines" that interact with their environment, including humans, to achieve specific tasks with evaluative standards for success [6]
- They utilize text and multimodal data (including images, videos, and audio) as inputs and can produce text, multimodal data, or action data as outputs [7][8]
- The core of the AI agent framework is the LLM, which facilitates reasoning and decision-making, and the framework aligns with human brain information processing mechanisms [8][19]

Group 2: Framework Components
- The proposed framework consists of multimodal large language models (MLLM), tools, memory (including long-term and working memory), multimodal encoders, decoders, and action decoders [11][12]
- Hardware agents (robots) require both MLLM and a multimodal-language-action model (MLAM) for high-level task planning and low-level action planning [12]
- The framework has a two-layer structure: the lower layer includes various components, while the upper layer manages overall information processing [12]

Group 3: Comparison with Human Brain
- The framework of AI agents shows functional similarities to human brain information processing, exhibiting a dual-layer structure with serial and parallel processing capabilities [19]
- Both systems utilize symbolic and neural representations for information processing, indicating a shared approach in handling complex tasks [19][28]

Group 4: Future Research Directions
- Key areas for future exploration include expanding data scale, enabling autonomous and continual learning, and enhancing safety and controllability of AI agents [30][31][32][34]
- The lack of sufficient training data is identified as a significant bottleneck, necessitating innovative data collection methods [31]
- The development of AI agents should focus on ensuring that reinforcement learning reward functions align with human values to mitigate risks [34]
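As a reading aid, the skeleton below mirrors the components listed in Group 2 in code form; the class layout and control loop paraphrase the article's description and are not from any released implementation.

```python
# Illustrative skeleton of the framework's components (Group 2): MLLM core,
# tools, long-term/working memory, and an action decoder, with an upper
# layer routing perception -> reasoning -> action.
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

@dataclass
class Memory:
    long_term: List[Any] = field(default_factory=list)   # persisted experience
    working: List[Any] = field(default_factory=list)     # current-task context

@dataclass
class Agent:
    mllm: Callable[[List[Any]], Any]        # multimodal LLM: the reasoning core
    tools: Dict[str, Callable]              # callable external tools
    action_decoder: Callable[[Any], Any]    # maps plans to concrete actions
    memory: Memory = field(default_factory=Memory)

    def step(self, observation: Any) -> Any:
        """Upper layer: one perception-to-action cycle."""
        self.memory.working.append(observation)   # update working memory
        plan = self.mllm(self.memory.working)     # high-level task planning
        return self.action_decoder(plan)          # low-level action output

# Toy usage: a stub "MLLM" that plans to summarize the latest observation
agent = Agent(
    mllm=lambda ctx: {"action": "summarize", "input": ctx[-1]},
    tools={},
    action_decoder=lambda plan: plan["action"],
)
print(agent.step("new user message"))   # -> summarize
```

For hardware agents, the paper's MLAM would slot in as a second, lower-level planner alongside the MLLM.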
Twice as fast as human experts: Stanford and NVIDIA release TTT-Discover, tackling hard scientific problems with "test-time reinforcement learning"
机器之心· 2026-01-28 04:59
Machine Heart Editorial Team

With the technology developing at full tilt, the industry keeps returning to one question: how can AI be used to discover new best solutions to scientific problems?

A common answer is "test-time search": prompt a frozen (non-updated) large language model (LLM) to make many attempts, somewhat like a student "guessing" at solutions to a programming assignment. Evolutionary search methods in particular (such as AlphaEvolve) store past attempts in a buffer and generate new prompts with hand-designed, domain-specific heuristics.

But although these prompts help the LLM improve on past solutions, the LLM itself never truly gets better, just as a student who never internalizes the new ideas behind the homework.

In fact, the most direct way to make an LLM genuinely improve is learning. Although both "learning" and "search" scale well with compute, across the history of AI, for hard problems like Go and protein folding, "learning" has ultimately overtaken "search".

That is because scientific discovery is, at its core, an out-of-distribution problem that lies beyond the training data and existing human knowledge.

To this end, Stanford, NVIDIA, and collaborators propose a new method: reinforcement learning (RL) at test time, letting the LLM keep training itself while attempting a specific test problem.

Paper link: https://w ...
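In schematic form, the idea is to replace a frozen search loop with one that also updates the policy on rewards earned from the single test problem. The sketch below conveys that loop under stated assumptions; `generate`, `reward_fn`, and `update` are placeholders, not TTT-Discover's actual recipe.

```python
# Sketch of test-time RL: keep attempting one test problem and update the
# policy on the rewards earned, instead of only searching with a frozen model.
from typing import Any, Callable, Tuple

def test_time_rl(model: Any,
                 problem: str,
                 reward_fn: Callable[[str, str], float],
                 steps: int = 100,
                 group_size: int = 8) -> Tuple[str, float]:
    best, best_reward = "", float("-inf")
    for _ in range(steps):
        # sample a batch of candidate solutions for this one problem
        attempts = [model.generate(problem) for _ in range(group_size)]
        rewards = [reward_fn(problem, a) for a in attempts]
        for attempt, r in zip(attempts, rewards):
            if r > best_reward:                 # track the best solution found
                best, best_reward = attempt, r
        model.update(attempts, rewards)         # the learning step that a
                                                # frozen test-time search skips
    return best, best_reward
```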
What does "anything can be a reference" feel like? Vidu Q2 Reference-to-Video Pro: effects, acting, and detail, all of it
机器之心· 2026-01-28 04:59
Editor: +0

Recently, a then-and-now comparison of the "Will Smith eating spaghetti" video went viral on social media, prompting no end of reflection.

Two years ago, fledgling AI video was a byword for "abstract glitch art", with facial features drifting and logic collapsing; just two years later, with the same subject rendered again, from the pull of muscles as he swallows to the subtle play of light across his face, AI has advanced to a genuinely lifelike level.

These two years compress a tectonic technical leap in AI video generation. Yet the industry has not stopped at a resolution arms race. As vendors compete for the high ground of "controllability", AI video stands at a key inflection point: from solving "can it be made at all" to pursuing "can it be made well".

Looking back at Vidu's evolution: in September 2025, Vidu Q2 debuted worldwide, impressing with its image-to-video and reference-to-video capabilities; in December, the Q2 "image-generation suite" launched and topped 500,000 uses on day one, confirming the market's appetite for high-quality generation.

Yesterday, Vidu Q2 Reference-to-Video Pro officially launched. Visit Vidu.cn or the Vidu API (platform.vidu.cn) to try the newest features.

In just a few months, it has closed the loop from "generation" to "editing" and shipped the world's first "anything can be a reference" video model, extending the reference modality from static images to dynamic video and multi-dimensional elements. Its brand-new slogan, 「 ...
AAAI 2026 Oral | SplatSSC: decoupled, depth-guided Gaussian splatting opens an efficient new paradigm for monocular semantic scene completion
机器之心· 2026-01-28 04:59
Core Viewpoint
- The article discusses the development of SplatSSC, a novel framework for Semantic Scene Completion (SSC) that addresses the limitations of traditional dense grid representations by utilizing a depth-guided approach and a decoupled aggregation mechanism to enhance performance and efficiency [3][4]

Group 1: Challenges in Traditional Methods
- Traditional dense grid representations in SSC have been limited by two main issues: low utilization rates of randomly initialized Gaussian primitives (approximately 3.9%) and the generation of erroneous semantic fragments known as "floaters" caused by isolated outliers [3][4]
- Existing methods often rely on large-scale random distributions of Gaussian primitives, leading to significant computational redundancy and wasted model capacity [6]

Group 2: SplatSSC Framework
- SplatSSC introduces an innovative depth-guided strategy and a decoupled aggregation mechanism, resulting in a significant leap in performance and efficiency [4]
- The framework employs a parallel-branch strategy, integrating a learnable image encoder for multi-scale semantic extraction and a pre-trained Depth-Anything model for stable depth features [10]

Group 3: Core Technologies
- The Group-wise Multi-scale Fusion (GMF) module in SplatSSC replaces random initialization with precise guidance from geometric priors, requiring only 1,200 Gaussian primitives (about 7% of previous methods) to effectively cover spatial distributions [11][13]
- The Decoupled Gaussian Aggregator (DGA) is designed to combat the "floaters" issue by decoupling occupancy probability from semantic contributions, ensuring clean scene boundaries [15][19]

Group 4: Experimental Validation
- SplatSSC achieved state-of-the-art (SOTA) performance on the Occ-ScanNet dataset, with an Intersection over Union (IoU) score of 62.83% and a mean IoU (mIoU) of 51.83%, surpassing previous SOTA methods by 6.35% and 4.16% respectively [22][23]
- The model demonstrated superior fine-grained perception, particularly in recognizing intricate objects like chair legs and table surfaces [22]

Group 5: Efficiency and Resource Management
- SplatSSC's design delivers a significant reduction in inference latency (down approximately 9.3%, to 115.63 ms) and memory consumption (down approximately 9.6%), while keeping the parameter count essentially stable (a 0.19% increase) [34]
- The framework's efficiency is highlighted by its ability to achieve high-quality scene reconstruction with fewer Gaussian primitives, demonstrating that the "quality" of primitives matters more than their "quantity" [32][33]
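For reference, the IoU and mIoU figures cited in Group 4 are conventionally computed on voxel grids as sketched below; the grid shapes and the class-0-is-empty convention are assumptions about the evaluation setup, not SplatSSC's own code.

```python
# Sketch of voxel-level IoU / mIoU as conventionally computed for semantic
# scene completion: geometric IoU over occupancy, mIoU over semantic classes.
import numpy as np

def scene_completion_metrics(pred: np.ndarray, gt: np.ndarray, num_classes: int):
    """pred, gt: integer class grids of shape (X, Y, Z); class 0 = empty (assumed)."""
    occ_pred, occ_gt = pred > 0, gt > 0
    inter = np.logical_and(occ_pred, occ_gt).sum()
    union = np.logical_or(occ_pred, occ_gt).sum()
    iou = inter / max(union, 1)                  # geometric occupancy IoU
    ious = []
    for c in range(1, num_classes):              # per-class semantic IoU
        i = np.logical_and(pred == c, gt == c).sum()
        u = np.logical_or(pred == c, gt == c).sum()
        if u > 0:
            ious.append(i / u)
    miou = float(np.mean(ious)) if ious else 0.0
    return iou, miou
```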