Let the model find key frames and visual cues on its own: Xiaohongshu's Video-Thinker cracks the video reasoning impasse
机器之心· 2026-01-02 03:12
Core Insights
- The article introduces the "Thinking with Videos" paradigm and the Video-Thinker model, which enhances a model's ability to autonomously navigate and understand temporal sequences in videos [2][6][10].

Group 1: Model Development and Methodology
- Video-Thinker integrates "temporal grounding" and "visual captioning" into the model's cognitive chain, eliminating reliance on external tools and enabling the model to autonomously identify key frames and extract visual cues [2][10].
- The research team constructed the Video-Thinker-10K dataset of 10,000 high-quality samples and employed a two-phase "supervised fine-tuning + reinforcement learning" training strategy to strengthen the model's self-exploration and self-correction capabilities [3][10].
- With only 7 billion parameters, the model achieved state-of-the-art (SOTA) performance on several challenging video reasoning benchmarks, significantly surpassing existing baselines [3][22].

Group 2: Data Quality and Training Process
- High-quality training data is crucial for developing complex reasoning capabilities; six major datasets were integrated into Video-Thinker-10K, combining precise temporal annotations with detailed visual descriptions [12][13].
- Training follows a structured thinking paradigm in which the model learns to output specific labels such as <time> and <caption>, enforcing a rigorous "locate - perceive - reason" sequence [16][18].
- The reinforcement learning phase, using Group Relative Policy Optimization (GRPO), let the model explore and optimize its reasoning strategies, producing emergent cognitive behaviors akin to human metacognition [19][22].

Group 3: Performance Evaluation
- Video-Thinker-7B showed significant advantages across video reasoning benchmarks, establishing a new SOTA among 7-billion-parameter models [25][29].
- Performance was evaluated both in-domain and out-of-domain, demonstrating effective generalization to unseen scenarios [24][29].
- The model reached 43.22% accuracy on the Video-Holmes benchmark and 80.69% on VRBench, outperforming previous models by notable margins [29][30].

Group 4: Key Findings and Implications
- The model's success is attributed to its internal grounding and captioning capabilities, which were quantitatively assessed and found superior to those of baseline models [32][36].
- Experiments showed that simple plug-and-play external tools did not enhance, but rather degraded, the model's reasoning, indicating that reliance on external tools can hinder performance [34][35].
- The article concludes that integrating core internal capabilities, rather than depending on ever-larger parameters and datasets, represents a new paradigm in video reasoning with potential applications across industries [39].
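The GRPO phase described above scores a group of sampled rollouts for the same question and normalizes each reward against its group. A minimal sketch of that group-relative advantage; the reward values (accuracy plus a small format bonus for well-formed <time>/<caption> tags) are illustrative assumptions, not the paper's stated reward design:

```python
# Minimal sketch of the group-relative advantage used in GRPO
# (hypothetical rewards; not Video-Thinker's actual training code).

def grpo_advantages(rewards, eps=1e-8):
    """Normalize each sampled response's reward against its group:
    A_i = (r_i - mean(r)) / (std(r) + eps)."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    return [(r - mean) / (var ** 0.5 + eps) for r in rewards]

# A group of 4 rollouts for one video question, scored 0/1 for accuracy
# plus a hypothetical 0.1 format bonus for valid <time>/<caption> tags.
rewards = [1.1, 0.1, 1.0, 0.0]
advs = grpo_advantages(rewards)
print([round(a, 3) for a in advs])
```

Rollouts above the group mean get positive advantage and are reinforced; those below are suppressed, so no learned value model is needed.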
Meta's heavyweight: SSR-tier research that frees agents from the bottleneck of human knowledge, on the road to autonomous AI
机器之心· 2026-01-02 03:12
Core Viewpoint
- Meta is pursuing the ambitious goal of "superintelligent" AI: autonomous AI systems that surpass human expert levels. The initiative has drawn skepticism from experts such as Yann LeCun, who considers the current path to superintelligence impractical [1].

Group 1: SSR Methodology
- Self-play SWE-RL (SSR) is introduced as a new approach to training superintelligent software agents that can learn and improve without relying on existing problem descriptions or human supervision [2][4].
- SSR leverages self-play, similar to AlphaGo, letting software agents interact with real code repositories to autonomously generate learning experiences [2][4].
- The SSR framework assumes only access to sandboxed code repositories with source code and dependencies, eliminating the need for manually annotated issues or test cases [4].

Group 2: Bug Injection and Repair Process
- The framework involves two roles: a bug-injection agent that introduces bugs into a codebase, and a bug-solving agent that generates patches to fix them [8][9].
- The bug-injection agent creates artifacts that intentionally introduce bugs, which are then verified for consistency to ensure they are reproducible [9][11].
- The bug-solving agent generates patches for the defined bugs, with success determined by the results of the tests associated with those bugs [11][12].

Group 3: Performance Evaluation
- Experiments show that SSR achieves stable, continuous self-improvement even without task-related training data, indicating that large language models can enhance their software engineering capabilities through interaction with raw code repositories [17].
- SSR outperforms traditional reinforcement learning baselines on two benchmarks, with gains of +10.4% and +7.8% respectively, highlighting the effectiveness of self-generated learning tasks over manually constructed data [17].
- Ablation studies show the self-play mechanism is crucial: it continuously generates a dynamic task distribution that enriches the training signal [19][20].

Group 4: Implications for AI Development
- SSR is a significant step toward autonomous AI systems that learn and improve without direct human supervision, addressing fundamental scalability limits in current AI development [21][22].
- The ability of large language models to generate meaningful learning experiences from real-world software repositories opens training possibilities beyond human-curated datasets, potentially yielding more diverse and challenging scenarios [22].
- As AI systems become more capable, learning autonomously from real-world environments becomes essential for building agents that solve complex problems effectively [25].
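The two-role loop above can be illustrated with a deliberately tiny toy: the "repository" is one arithmetic function, the injector swaps an operator and verifies the bug is reproducible, and the solver is rewarded when the bug's tests pass again. All names and the reward scheme are illustrative stand-ins, not the SSR implementation:

```python
# Toy sketch of SSR's inject-then-solve self-play loop. The repo is a
# single function and both agents are trivial stand-ins; the structure
# (inject -> verify reproducibility -> solve -> score by tests) follows
# the summary, everything else is an illustrative assumption.

def original_add(a, b):
    return a + b

def run_tests(fn):
    """Tests tied to the injected bug: pass iff correct behavior is restored."""
    return fn(2, 3) == 5 and fn(-1, 1) == 0

def inject_bug(fn):
    """Bug-injection agent: produce a buggy variant of the repo.
    Here it swaps + for -, then checks the bug is actually reproducible."""
    def buggy(a, b):
        return a - b
    assert not run_tests(buggy)  # consistency check: tests must fail on the bug
    return buggy

def solve_bug(buggy_fn, candidates):
    """Bug-solving agent: try candidate patches; reward = tests pass."""
    for patch in candidates:
        if run_tests(patch):
            return patch, 1.0  # solved: positive reward for the solver
    return buggy_fn, 0.0       # unsolved: the injector "wins" this round

buggy = inject_bug(original_add)
patched, reward = solve_bug(buggy, candidates=[lambda a, b: a * b,
                                               lambda a, b: a + b])
print(reward)  # -> 1.0
```

The adversarial pairing is what keeps the task distribution dynamic: as the solver improves, only harder injected bugs remain rewarding for the injector.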
The "drop out to start up" wind sweeps Silicon Valley again, but the real variable was never the degree
机器之心· 2026-01-02 03:12
Core Viewpoint
- The trend of "dropping out to start a business" is regaining traction in Silicon Valley, with many founders presenting their dropout status as a positive credential in the venture capital community [3][4].

Group 1: The Trend of Dropping Out
- More founders at Y Combinator's Demo Day are highlighting their dropout experience, which has become a badge of honor signaling commitment to entrepreneurship [4].
- The urgency to capitalize on the AI startup boom is driving some students to abandon their studies, believing that staying for a degree may hurt their chances of securing funding [5].
- Some investors are skeptical of the extreme dropout trend, arguing that the value of a college network and brand remains significant even for those who do not graduate [7].

Group 2: Perspectives on Age and Experience
- While many young founders drop out, some investors prefer older founders whose wisdom comes from experience and setbacks, viewing this as the more valuable trait [8].
- Despite the trend, many leading AI entrepreneurs chose to complete their education, indicating that a degree still holds value in the industry [9].

Group 3: The Nature of Dropping Out
- The concept of "dropping out" has evolved: many who leave school continue their entrepreneurial pursuits inside resource-rich environments [10].
- Ultimately, success depends on a founder's ability to leverage the right resources and networks at the right time, not on whether they hold a degree [12].
The father of LSTM leads a team to build PoPE: ending RoPE's generalization problem with a polar-coordinate evolution of the Transformer
机器之心· 2026-01-02 01:55
Core Viewpoint
- The article presents Polar Coordinate Position Embedding (PoPE), a new approach that addresses the limitations of Rotary Position Embedding (RoPE) in Transformer architectures, particularly by decoupling content and positional information for better model performance [1][2].

Group 1: RoPE Issues
- RoPE entangles content and position information, which can degrade model performance, especially on tasks that require matching these factors independently [1][4].
- RoPE is the preferred way to inject positional information in many advanced models, yet it struggles on tasks that require a clean separation of content and position [5][19].

Group 2: PoPE Solution
- PoPE removes the conflation of content and position, yielding significantly better performance on diagnostic tasks that index by content alone or by position alone [2][10].
- PoPE defines the attention score differently, in a way that decouples content and position and improves learning efficiency [12][13].

Group 3: Performance Comparison
- On indirect indexing tasks, PoPE reached an average accuracy of 94.82% versus RoPE's 11.16%, demonstrating its superior ability to separate content and positional information [18][19].
- On music and genomic sequence modeling, PoPE outperformed RoPE with lower negative log-likelihood (NLL) across datasets [20][22].
- On language modeling over OpenWebText, PoPE consistently achieved lower perplexity than RoPE at every model size [25][26].

Group 4: Generalization and Stability
- PoPE extrapolates strongly without fine-tuning or interpolation, and its performance remains stable as model size increases, unlike RoPE [31][32].
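The summary does not reproduce PoPE's polar-coordinate formula, so as background, here is a minimal pure-Python sketch of standard RoPE that exhibits the entanglement PoPE is said to remove: the rotated dot product depends only on the relative offset, but even a pure content match (query equals key) still shifts with position. The vectors and base `theta` are illustrative:

```python
# Background sketch: how RoPE couples content and position in the attention
# score. This is standard RoPE, not PoPE's formula (which the summary omits).
import math

def rope_rotate(x, pos, theta=10000.0):
    """Rotate feature pairs (x[i], x[i+d/2]) by angles pos / theta^(2i/d)."""
    d = len(x)
    half = d // 2
    out = [0.0] * d
    for i in range(half):
        ang = pos / theta ** (2.0 * i / d)
        c, s = math.cos(ang), math.sin(ang)
        out[i] = x[i] * c - x[i + half] * s
        out[i + half] = x[i] * s + x[i + half] * c
    return out

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5, -0.4, 1.1, 0.2, -0.8]
k = [1.0, 0.4, -0.6, 0.9, 0.1, -1.3, 0.5, 0.7]

# The score depends only on the relative offset m - n ...
s1 = dot(rope_rotate(q, 5), rope_rotate(k, 3))
s2 = dot(rope_rotate(q, 12), rope_rotate(k, 10))
assert abs(s1 - s2) < 1e-9  # same offset (2) -> same score

# ... but content and position enter through the same product, so a pure
# content match (q against q) still changes with offset: the entanglement.
s_same = dot(rope_rotate(q, 5), rope_rotate(q, 5))
s_shift = dot(rope_rotate(q, 5), rope_rotate(q, 9))
assert abs(s_same - s_shift) > 1e-3
```

A decoupled scheme would let a content-only match score identically at every offset, which is what the 94.82% vs 11.16% indirect-indexing gap above reflects.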
Farewell to the shackles of the KV cache: with long context compressed into weights, is there hope for continually learning large models?
机器之心· 2026-01-02 01:55
Core Viewpoint
- The article discusses progress toward AGI (Artificial General Intelligence) and emphasizes continuous learning, in which AI learns new knowledge and skills through interaction with its environment [1].

Group 1: TTT-E2E Development
- A collaborative team from Astera, NVIDIA, Stanford University, UC Berkeley, and UC San Diego proposed TTT-E2E (End-to-End Test-Time Training), a significant step toward AGI that recasts long-context modeling from an architectural design into a learning problem [2].
- TTT-E2E aims to overcome the limitation of traditional models that remain static during inference, allowing dynamic learning at test time [9][10].

Group 2: Challenges in Long Context Modeling
- Long-context modeling faces a dilemma: the full attention mechanism of Transformers performs well on long texts, but its inference cost grows steeply with length [5].
- Alternatives such as RNNs and state-space models (SSMs) have constant per-token compute but often degrade on very long texts [5][6].

Group 3: TTT-E2E Mechanism
- TTT-E2E defines test-time behavior as an online optimization process: before predicting the next token, the model performs self-supervised learning on the tokens it has already read [11].
- Meta-learning is used to optimize the model's initialization parameters, so the model learns how to learn effectively [13].
- A hybrid architecture pairs sliding-window attention for short-term memory with a dynamically updated MLP layer for long-term memory, mimicking biological memory systems [13][14].

Group 4: Experimental Results
- TTT-E2E scales comparably to full-attention Transformers, maintaining consistent loss as context length grows from 8K to 128K [21].
- In inference efficiency, TTT-E2E has a clear advantage: at 128K context it processes tokens 2.7 times faster than full-attention Transformers [22].

Group 5: Future Implications
- TTT-E2E marks a shift from static models to dynamic individuals: processing a long document becomes a micro self-evolution [27].
- This "compute-for-storage" approach envisions models that continuously adjust themselves while processing vast amounts of information, potentially encapsulating the history of human civilization within their parameters without hardware limitations [29].
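The online-optimization idea in Group 3 can be caricatured with a toy stand-in: a bigram count table plays the role of the dynamically updated long-term memory, and a count update stands in for the self-supervised gradient step taken on already-read tokens. Everything here (the stream, the update rule, the "learning rate") is an illustrative assumption, not TTT-E2E's actual mechanism:

```python
# Toy caricature of test-time training: before each prediction, "train"
# on the tokens read so far, compressing context into mutable weights
# instead of retaining it in a growing KV cache.

def update_step(weights, context, lr=0.5):
    """Stand-in for one self-supervised step: reinforce observed bigrams."""
    for prev, nxt in zip(context, context[1:]):
        row = weights.setdefault(prev, {})
        row[nxt] = row.get(nxt, 0.0) + lr
    return weights

def predict(weights, token):
    """Predict the most reinforced successor of `token`."""
    row = weights.get(token)
    return max(row, key=row.get) if row else None

weights = {}  # mutable "long-term memory", updated during inference
stream = ["a", "b", "a", "b", "a"]
for t in range(2, len(stream) + 1):
    weights = update_step(weights, stream[:t])  # learn on what was read
print(predict(weights, "a"))  # -> 'b'
```

The point of the caricature: after the stream is consumed, the context lives entirely in `weights`, so per-token prediction cost no longer depends on context length, mirroring the 2.7x speedup claim at 128K.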
Redefining temporal grounding for video large models: Nanjing University and Tencent jointly propose TimeLens, a full upgrade of data and algorithms
机器之心· 2026-01-02 01:55
Core Insights
- The rapid development of multimodal large language models (MLLMs) has improved video understanding, but a significant limitation remains in determining "when" events occur in videos, the task known as Video Temporal Grounding (VTG) [2].
- The research team from Nanjing University, Tencent ARC Lab, and Shanghai AI Lab introduced TimeLens, which addresses shortcomings in existing evaluation benchmarks and contributes a reliable assessment framework plus high-quality training data [2][29].

Data Quality Issues
- Existing VTG benchmarks such as Charades-STA, ActivityNet Captions, and QVHighlights contain numerous annotation errors, including vague descriptions and incorrect time-boundary markings [7].
- The high error rate in these benchmarks leads to unreliable evaluations that overestimate the capabilities of open-source models [11].

TimeLens-Bench
- To rectify these issues, the team created TimeLens-Bench, a high-quality evaluation benchmark that accurately reflects models' temporal grounding capabilities [11].
- Comparing TimeLens-Bench with the original benchmarks revealed that previous evaluations significantly overestimated open-source models while obscuring the true performance of proprietary models [11].

High-Quality Training Data: TimeLens-100K
- The team developed TimeLens-100K, a large-scale, high-quality training dataset built through an automated cleaning and re-labeling pipeline, which has been shown to significantly enhance model performance [13].

Algorithm Design Best Practices
- Extensive ablation studies yielded effective design practices for VTG tasks, focusing on timestamp encoding and training paradigms [15].
- The optimal timestamp encoding identified is the Interleaved Textual Encoding strategy, which is simple to implement yet achieves superior results [17].
- The Thinking-free RLVR training paradigm proved the most efficient, letting models output localization results directly without a complex reasoning process [19][21].

Key Training Techniques
- Early stopping is crucial in RL training: continuing past a plateau in reward metrics can degrade model performance [23].
- Difficulty-based sampling, which selects challenging training samples, is essential for maximizing performance during RLVR training [23].

Performance Validation
- The TimeLens-8B model delivered exceptional results, surpassing open-source models such as Qwen3-VL and outperforming proprietary models including GPT-5 and Gemini-2.5-Flash on multiple core metrics [27][28].
- This underscores the potential of smaller open-source models to compete with larger proprietary models through systematic improvements in data quality and algorithm design [28].

Contributions and Future Directions
- TimeLens establishes a new SOTA open-source model and provides methodologies and design blueprints for future research in video temporal grounding [29].
- The code, models, training data, and evaluation benchmarks have been open-sourced to facilitate further VTG advances [30].
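The summary does not state TimeLens's exact scoring protocol; the standard metric for video temporal grounding is temporal IoU between the predicted and annotated (start, end) segments, commonly reported at thresholds such as R@0.5. A minimal sketch (the 0.5 threshold convention is an assumption here, not TimeLens's stated protocol):

```python
# Temporal IoU, the standard VTG metric: overlap of predicted and
# ground-truth time intervals divided by their union.

def temporal_iou(pred, gt):
    """IoU of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# A predicted segment [12s, 20s] against an annotated [10s, 18s] segment:
iou = temporal_iou((12.0, 20.0), (10.0, 18.0))
print(round(iou, 3))  # -> 0.6
hit_at_05 = iou >= 0.5  # counts as correct under an R@0.5 threshold
```

This metric also makes the data-quality point above concrete: if the annotated boundaries themselves are wrong, IoU penalizes correct predictions and rewards predictions matching the faulty labels.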
Just in: a DeepSeek New Year's Day paper, with Liang Wenfeng among the authors, set to open a new chapter in architecture
机器之心· 2026-01-01 08:22
Core Viewpoint
- DeepSeek has introduced Manifold-Constrained Hyper-Connections (mHC), a new architecture that addresses the instability of traditional hyper-connections during large-scale model training while retaining their significant performance gains [1][3][4].

Group 1: Introduction of mHC
- mHC extends the Transformer's single residual stream into a multi-stream parallel architecture, using the Sinkhorn-Knopp algorithm to constrain the connection matrix to the manifold of doubly stochastic matrices [1][4].
- Its core objective is to retain the performance improvement from widening the residual stream while fixing training instability and excessive memory consumption [4][6].

Group 2: Challenges with Traditional Hyper-Connections
- Traditional residual connections ensure stable signal transmission through identity mapping, but the restricted width of the information channel limits them [3][6].
- Recent methods such as Hyper-Connections (HC) improve performance but introduce significant training instability and extra memory-access overhead [3][6].

Group 3: Methodology of mHC
- mHC projects the residual connection space onto a specific manifold to restore the identity-mapping property, while the infrastructure is optimized for efficiency [4][9].
- The Sinkhorn-Knopp algorithm projects the connection matrix onto the Birkhoff polytope, ensuring stable signal propagation [4][10].

Group 4: Experimental Validation
- Empirical results show that mHC resolves the stability issues and scales well: on a 27-billion-parameter model it increases training time by only 6.7% while delivering significant performance improvements [4][29].
- In benchmark tests, mHC consistently outperformed both baseline models and HC across downstream tasks, confirming its effectiveness in large-scale pre-training [30][31].

Group 5: Infrastructure Design
- DeepSeek built tailored infrastructure for mHC, including kernel fusion, selective recomputation, and enhanced communication strategies that minimize memory overhead and improve computational efficiency [17][21][23].
- Design choices such as reordering operations and mixed-precision strategies further contribute to mHC's overall efficiency [17][18].
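The Sinkhorn-Knopp projection mentioned above can be sketched in a few lines: alternately normalize the rows and columns of a positive matrix until it is approximately doubly stochastic, i.e. a point on the Birkhoff polytope. The toy matrix and iteration count are illustrative; this is not DeepSeek's fused kernel:

```python
# Sketch of the Sinkhorn-Knopp iteration: row/column normalization of a
# positive square matrix converges to a doubly stochastic matrix.

def sinkhorn_knopp(mat, iters=50):
    """Project a positive square matrix toward a doubly stochastic one."""
    m = [row[:] for row in mat]
    n = len(m)
    for _ in range(iters):
        for i in range(n):                       # normalize each row
            s = sum(m[i])
            m[i] = [v / s for v in m[i]]
        for j in range(n):                       # normalize each column
            s = sum(m[i][j] for i in range(n))
            for i in range(n):
                m[i][j] /= s
    return m

ds = sinkhorn_knopp([[2.0, 1.0], [1.0, 3.0]])
row_sums = [sum(row) for row in ds]
col_sums = [sum(ds[i][j] for i in range(2)) for j in range(2)]
# Rows and columns both sum to 1, so the cross-stream mixing weights
# neither amplify nor attenuate the residual signal as depth grows.
```

Because every row and column of a doubly stochastic matrix sums to 1, the identity matrix is a fixed point, which is why the constraint restores the identity-mapping property that plain hyper-connections lose.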
OpenDataArena's fully upgraded version officially launches: four core modules reshape the landscape of data value evaluation
机器之心· 2026-01-01 08:22
To address the long-standing difficulty, in both academia and industry, of quantifying the value of data, the OpenDataLab team at the Shanghai Artificial Intelligence Laboratory (Shanghai AI Lab) open-sourced OpenDataArena (ODA) this August: the first comprehensive, impartial platform for evaluating the value of post-training data. The project aims to turn data selection from the blind trial-and-error of "alchemy" into a rigorous science that is reproducible, analyzable, and cumulative.

In the months since the initial release, the project has undergone intensive technical validation and feature polishing through deep use by the team and a small circle of community users. With the continued expansion of its evaluation scale, toolchain, and analysis capabilities, ODA has now received a comprehensive upgrade: an official release with more systematic conclusions, more complete features, and more diverse perspectives, now open to all developers.

Project homepage: https://opendataarena.github.io/
Open-source tools: https://github.com/OpenDataArena/OpenDataArena-Tool
Datasets: https://huggingface.co/OpenDataArena/datasets
Report: https://arxiv.org/pdf/2512.14051

ODA's core philosophy is clear: data value must ...
Google's three-year comeback: faint clues planted far in advance
机器之心· 2026-01-01 04:33
Core Viewpoint
- OpenAI has declared a "red alert" and shifted its focus to improving ChatGPT, signaling a significant change in the competitive landscape of AI technology [2][10].

Group 1: Google's Response to Competition
- In late 2022, Google faced a major challenge as ChatGPT rapidly gained users, prompting internal concern that the company was missing opportunities despite having advanced technology [13][16].
- Google has since launched several advanced models, including Gemini 3, and regained its technological edge in AI [10][11].
- The company shifted from a conservative approach to an agile one, prioritizing rapid product development and iteration under competitive pressure [23][25].

Group 2: Organizational Changes
- Google undertook significant restructuring, cutting management layers by roughly 35% to improve communication and decision-making efficiency [26].
- It adopted a startup-like rapid-iteration model for product development, enabling quicker responses to user feedback [27][28].
- Founders Larry Page and Sergey Brin returned to active roles in AI projects, emphasizing urgency and direct involvement in technical development [41][46].

Group 3: Talent Acquisition and Retention
- Google runs a "boomerang program" to rehire former employees; about 20% of its new AI engineers are returning staff [58].
- The company has invested heavily to attract top talent, including a reported $2.7 billion payment to bring back Noam Shazeer, a key figure in AI development [62].
- Compensation reforms tie rewards to product performance metrics, aiming to retain high-performing AI talent [67].

Group 4: Ongoing Competition
- Despite Google's resurgence, competition remains fierce, with OpenAI and companies like Anthropic and Meta continuing to innovate and attract talent [71][72].
- The focus of competition is shifting from technology development to application integration and building sustainable AI ecosystems [72].
- The AI race is ongoing, with no guaranteed long-term leader, underscoring the dynamic nature of the industry [73].
For a systematic understanding of Deep Research, this one survey is enough
机器之心· 2026-01-01 04:33
Core Insights
- The article surveys the evolution of Deep Research (DR) as a new direction in AI, moving beyond simple dialogue and creative-writing applications to complex research-oriented tasks. It highlights the limitations of traditional retrieval-augmented generation (RAG) and positions DR as a solution for multi-step reasoning and long-horizon research [2][30].

Summary by Sections

Definition of Deep Research
- DR is not a specific model or technology but a progressive capability pathway for research-oriented agents, evolving from information retrieval to complete research workflows [5].

Stages of Capability Development
- **Stage 1: Agentic Search** - Models gain the ability to search actively and retrieve information dynamically based on intermediate results, focusing on efficient information acquisition [5].
- **Stage 2: Integrated Research** - Models learn to understand, filter, and integrate multi-source evidence, producing coherent reports [6].
- **Stage 3: Full-stack AI Scientist** - Models can propose research hypotheses, design and execute experiments, and reflect on results, emphasizing depth of reasoning and autonomy [6].

Core Components of Deep Research
- **Query Planning** - Deciding what information to query next, with dynamic adjustment across multiple research rounds [10].
- **Information Retrieval** - Deciding when to retrieve, what to retrieve, and how to filter results to avoid redundancy and ensure relevance [12][13][14].
- **Memory Management** - Essential for long-horizon reasoning, covering memory consolidation, indexing, updating, and forgetting [15].
- **Answer Generation** - Enforcing logical consistency between conclusions and evidence, requiring integration of multi-source evidence [17].

Training and Optimization Methods
- **Prompt Engineering** - Multi-step prompts guide the model through research processes, though effectiveness depends heavily on prompt design [20].
- **Supervised Fine-tuning** - High-quality reasoning trajectories are used for training, though annotated data is costly to acquire [21].
- **Reinforcement Learning for Agents** - Directly optimizes decision-making strategies in multi-step processes without complex annotation [22].

Challenges in Deep Research
- **Coordination of Internal and External Knowledge** - Balancing reliance on internal reasoning against external retrieval is crucial [24].
- **Stability of Training Algorithms** - Long-horizon training often suffers policy degradation, limiting exploration of diverse reasoning paths [24].
- **Evaluation Methodology** - Reliable evaluation of research-oriented agents remains an open question, and existing benchmarks need further development [25][27].
- **Memory Module Construction** - Balancing memory capacity, retrieval efficiency, and information reliability is a significant challenge [28].

Conclusion
- Deep Research marks a shift from single-turn answer generation to in-depth research on open-ended questions. The field is still in its early stages, and ongoing exploration is needed to build autonomous, trustworthy DR agents [30].
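The Stage 1 "agentic search" loop (plan a query, retrieve, inspect intermediate results, re-plan) can be sketched with a toy corpus. The retrieval scoring, stopping rule, and re-planning heuristic below are all illustrative assumptions, not components from the survey:

```python
# Toy agentic-search loop: retrieve, check sufficiency, re-plan the query
# based on intermediate results, and deduplicate evidence across rounds.

CORPUS = {
    "rag limits": "RAG struggles with multi-step reasoning over sources.",
    "dr stages": "Deep Research evolves from agentic search to AI scientist.",
    "memory": "Long-horizon agents need consolidation and forgetting.",
}

def retrieve(query):
    """Toy retriever: return documents whose key overlaps the query terms."""
    terms = set(query.lower().split())
    return [doc for key, doc in CORPUS.items() if terms & set(key.split())]

def research(question, max_rounds=3):
    evidence, query = [], question
    for _ in range(max_rounds):
        hits = [d for d in retrieve(query) if d not in evidence]  # filter dupes
        evidence.extend(hits)
        if len(evidence) >= 2:               # toy stopping rule: enough sources
            break
        query = query + " stages memory"     # re-plan: broaden the query
    return evidence

report = research("rag limits")
print(len(report))  # -> 3
```

Even this toy shows the three decisions the components above formalize: what to query next (the re-plan step), how to filter retrieved content (the dedupe), and when to stop retrieving and generate an answer (the sufficiency check).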