机器之心
Industrial-Grade Self-Evolution for LLMs: BUPT and Tencent AI Lab Propose the MoE-CL Architecture to Tackle the Core Pain Point of Continual Learning in Large Models
机器之心· 2025-09-30 00:27
In industrial LLM applications, the need for "self-evolution", that is, dynamically adapting to new tasks while retaining existing capabilities, is becoming increasingly pressing. In real-world settings, language patterns differ sharply across domains, so an LLM must learn the compliance rules of a new scenario without losing its judgment on old ones. This is the core demand of large-model self-evolution: autonomously optimizing cross-task knowledge integration and adapting to dynamic environments without heavy external intervention.

To address this, the BUPT BaiJia AI team and Tencent AI Lab propose MoE-CL, a parameter-efficient adversarial mixture-of-experts architecture for self-evolving continual instruction tuning of LLMs. Its core design couples "decoupled LoRA experts" with "GAN-based adversarial denoising": each task gets a dedicated LoRA expert that preserves task-specific knowledge and prevents parameter updates from interfering with one another, while a shared LoRA expert, regulated by a task-aware discriminator from a generative adversarial network (GAN), suppresses task-irrelevant noise so that cross-task knowledge transfers efficiently and precisely. The design balances knowledge retention with cross-task generalization, which is the core logic of LLM self-evolution (a rough structural sketch appears below).

In terms of results, MoE-CL's self-evolution capability has been validated in both real deployments and public benchmarks. In A/B tests on a real Tencent business scenario, it reduced human-intervention cost by 15.3%; on the public MTL5 cross-domain benchmark and the industrial Tencent3 benchmark, its average accuracy ...
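As a rough illustration of the architecture described above, the sketch below wires per-task LoRA experts, a shared LoRA expert, and a task-aware discriminator around a frozen base projection. All class names, dimensions, and the loss wiring are hypothetical simplifications for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class LoRAExpert(nn.Module):
    """Low-rank adapter (B @ A) added on top of a frozen base projection."""
    def __init__(self, d_model: int, rank: int = 8):
        super().__init__()
        self.A = nn.Linear(d_model, rank, bias=False)
        self.B = nn.Linear(rank, d_model, bias=False)
        nn.init.zeros_(self.B.weight)  # start as a no-op adapter

    def forward(self, x):
        return self.B(self.A(x))

class MoECLLayer(nn.Module):
    """One adapted layer: frozen base + task-specific LoRA expert + shared LoRA expert.

    A task-aware discriminator scores the shared expert's output; trained
    adversarially, the shared path is pushed to carry task-agnostic knowledge
    (the "denoising" role described in the article)."""
    def __init__(self, d_model: int, num_tasks: int, rank: int = 8):
        super().__init__()
        self.base = nn.Linear(d_model, d_model)
        self.base.requires_grad_(False)                      # frozen backbone weight
        self.task_experts = nn.ModuleList(
            [LoRAExpert(d_model, rank) for _ in range(num_tasks)]
        )
        self.shared_expert = LoRAExpert(d_model, rank)
        self.discriminator = nn.Linear(d_model, num_tasks)   # task-aware discriminator

    def forward(self, x, task_id: int):
        task_out = self.task_experts[task_id](x)             # task-specific knowledge
        shared_out = self.shared_expert(x)                   # cross-task knowledge
        task_logits = self.discriminator(shared_out.mean(dim=1))
        return self.base(x) + task_out + shared_out, task_logits

# Sketch of the adversarial objective: the discriminator tries to identify the
# task from the shared expert's output, while the shared expert is trained to
# fool it, so task-specific signal concentrates in the dedicated experts.
layer = MoECLLayer(d_model=768, num_tasks=3)
x = torch.randn(2, 16, 768)
y, task_logits = layer(x, task_id=1)
adv_loss = nn.functional.cross_entropy(task_logits, torch.tensor([1, 1]))
```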
Claude Sonnet 4.5 Is Here: It Can Code Continuously for 30+ Hours and Write 11,000 Lines of Code
机器之心· 2025-09-30 00:27
Core Insights
- The article discusses recent advancements in AI models, particularly Anthropic's release of Claude Sonnet 4.5, which is positioned as a leading model across benchmarks and applications [1][4][5].

Model Performance
- Claude Sonnet 4.5 achieved significant performance improvements across benchmarks, including 77.2% in agentic coding [2], 82.0% on SWE-bench Verified [2], and 61.4% on OSWorld for computer use, up from 42.2% in the previous version [11].
- The model shows enhanced capabilities in reasoning and mathematics, with a perfect score of 100% on high-school math competitions [12][13].

Developer Tools and Features
- Anthropic introduced the Claude Agent SDK, allowing developers to create their own intelligent agents [4][35].
- New features include checkpoint functionality for saving progress, a revamped terminal interface, and native VS Code extensions [8][4].

Safety and Alignment
- Claude Sonnet 4.5 is described as the model most aligned with human values, with improvements in reducing undesirable behaviors such as flattery and deception [27][5].
- The model is released under AI Safety Level 3 (ASL-3), incorporating classifiers to detect potentially dangerous inputs and outputs [32].

User Experience and Applications
- Early user experiences indicate that Claude Sonnet 4.5 performs exceptionally well in specialized fields such as finance, law, and STEM [13][21].
- The "Imagine with Claude" feature allows real-time software generation without pre-defined functions, showcasing the model's adaptability [36][38].
A Powerhouse Alliance: DeepSeek and Cambricon Simultaneously Release the DeepSeek-V3.2 Model Architecture and vLLM-Based Model Adaptation Source Code
机器之心· 2025-09-29 11:05
Core Viewpoint
- The release of DeepSeek-V3.2 by DeepSeek and its adaptation by Cambricon signifies a strong collaboration among leading tech firms in China's AI industry, aiming to enhance efficiency in long-text training and inference [2][3][4].

Group 1: Model Release and Features
- DeepSeek launched the experimental version DeepSeek-V3.2-Exp, which introduces a sparse attention mechanism for optimizing long-text training and inference [2].
- The new model has a substantial size of 671 GB, requiring approximately 8-10 hours to download under ideal bandwidth conditions [3].

Group 2: Collaboration and Industry Impact
- Cambricon's quick adaptation of DeepSeek-V3.2-Exp indicates prior collaboration and communication between the two companies, reflecting a trend of low-profile yet effective partnerships in the tech industry [3].
- Collaboration between leading companies in the AI model and chip sectors is expected to significantly reduce training and inference costs for users, facilitating the emergence of AI applications [4].
Just Released: DeepSeek Open-Sources V3.2-Exp and Unveils Its New Sparse Attention Mechanism, DSA
机器之心· 2025-09-29 10:29
Core Viewpoint
- DeepSeek has released the experimental version DeepSeek-V3.2-Exp, which introduces a new sparse attention mechanism aimed at optimizing training and inference efficiency in long-context scenarios [3][5][10].

Summary by Sections

Model Release
- DeepSeek-V3.2-Exp has been open-sourced with a parameter count of 685 billion [3].
- The release includes a paper detailing the new sparse attention mechanism [5].

Sparse Attention Mechanism
- DeepSeek Sparse Attention (DSA) is the only architectural improvement in version 3.2, focusing on enhancing computational efficiency when processing extended text sequences [5][6][10].
- DSA achieves fine-grained sparse attention while maintaining nearly the same output quality as its predecessor, DeepSeek-V3.1-Terminus [9] (see the illustrative sketch after this summary).

Performance Comparison
- Benchmark comparisons show that DeepSeek-V3.2-Exp performs comparably to DeepSeek-V3.1-Terminus across various tasks [11]:

| Benchmark | DeepSeek-V3.1-Terminus | DeepSeek-V3.2-Exp |
| --- | --- | --- |
| MMLU-Pro | 85.0 | 85.0 |
| AIME 2025 | 88.4 | 89.3 |
| Codeforces | 2046 | 2121 |

Future Developments
- The upcoming release of Z.ai's GLM-4.6 model is noted, with GLM-4.5 being the previous flagship model [12].
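The summary above does not spell out how DSA selects tokens. As a generic illustration of fine-grained sparse attention, the sketch below keeps only the top-k scoring keys for each query before the softmax; the scoring rule, the top-k budget, and the function name are assumptions for illustration only, not DeepSeek's released implementation.

```python
import torch
import torch.nn.functional as F

def sparse_attention(q, k, v, top_k: int = 64):
    """Fine-grained sparse attention sketch: each query attends only to the
    top_k keys chosen by a relevance score (here, the scaled dot product).

    q: (batch, n_q, d); k, v: (batch, n_kv, d). In a real system the selector
    would be much cheaper than full QK scoring; full scores are used here only
    for clarity. This is an illustration, not DeepSeek's DSA.
    """
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5            # (batch, n_q, n_kv)
    top_k = min(top_k, k.size(1))
    idx = scores.topk(top_k, dim=-1).indices                # keep top_k keys per query
    mask = torch.full_like(scores, float("-inf"))
    mask.scatter_(-1, idx, 0.0)                              # unmask only selected keys
    weights = F.softmax(scores + mask, dim=-1)               # zero weight off the top_k
    return weights @ v                                       # (batch, n_q, d)

# Toy usage: a 4096-token context where each query attends to 64 selected tokens.
q = torch.randn(1, 8, 128)
k = torch.randn(1, 4096, 128)
v = torch.randn(1, 4096, 128)
out = sparse_attention(q, k, v, top_k=64)
```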
The SALMONN Family of Audio-Visual Understanding Models Returns to the Top of the Leaderboards: Breakthroughs in Reasoning Enhancement, High Frame Rate, and Zero Text Leakage
机器之心· 2025-09-29 08:28
Core Insights
- The SALMONN family has expanded significantly with the introduction of new models, including video-SALMONN 2/2+, video-SALMONN-o1, and F-16, solidifying its leadership among open-source audio-visual understanding models [1][6][36].
- The video-SALMONN 2+ model focuses on high-quality, complete video descriptions, achieving state-of-the-art results in caption integrity and accuracy [4][6].
- The F-16 model is designed for high-frame-rate video understanding, addressing the limitations of existing models that operate at low frame rates [25][31].

Model Performance
- video-SALMONN 2+ outperforms competitors such as GPT-4o and Google Gemini 1.5 Pro on various audio-visual understanding benchmarks, demonstrating superior performance on tasks such as Video-MME and WorldSense [6][7].
- The model's ability to generate high-quality descriptions enhances its performance on question-answering tasks, indicating a robust understanding of audio-visual content [6][9].
- The introduction of the AVUT benchmark aims to create a fair evaluation standard for audio-visual understanding, addressing the issue of text shortcuts in existing benchmarks [32][35].

Technical Innovations
- The process DPO (pDPO) training method enhances the model's ability to perform step-level optimization in audio-visual contexts, improving its self-checking capabilities [24].
- F-16 employs multi-frame joint alignment compression to maintain semantic integrity while reducing computational costs, achieving significant advances on high-frame-rate video tasks [25][29] (a rough sketch follows this summary).
- video-SALMONN-o1 introduces reasoning enhancement, allowing evidence-based multi-step reasoning in audio-visual scenarios, a significant advance over existing systems [21][22].

Future Directions
- The SALMONN series is expected to continue evolving, with ongoing iterations aimed at improving model capabilities and establishing a comprehensive ecosystem for audio-visual understanding [36][38].
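The "multi-frame joint alignment compression" idea can be illustrated with a minimal sketch: features from a small window of consecutive frames are concatenated and projected down to a single frame's token budget before being fed to the language model. The window size, projection, and tensor shapes below are illustrative assumptions, not F-16's actual design.

```python
import torch
import torch.nn as nn

class MultiFrameCompressor(nn.Module):
    """Compress a window of consecutive frame features into one frame's worth
    of tokens, so a high-frame-rate clip fits the LLM's context budget."""
    def __init__(self, d_model: int = 1024, window: int = 4):
        super().__init__()
        self.window = window
        # Joint projection over the concatenated window of frames.
        self.proj = nn.Linear(window * d_model, d_model)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, tokens_per_frame, d_model)
        b, t, n, d = frame_feats.shape
        assert t % self.window == 0, "pad the clip to a multiple of the window"
        x = frame_feats.view(b, t // self.window, self.window, n, d)
        x = x.permute(0, 1, 3, 2, 4).reshape(b, t // self.window, n, self.window * d)
        return self.proj(x)  # (batch, num_frames // window, tokens_per_frame, d_model)

# Toy usage: 64 high-FPS frames compressed into 16 frame slots of visual tokens.
feats = torch.randn(1, 64, 32, 1024)
compressed = MultiFrameCompressor(d_model=1024, window=4)(feats)
print(compressed.shape)  # torch.Size([1, 16, 32, 1024])
```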
Tencent Hunyuan 3D-Omni: A 3D ControlNet That Achieves Multimodal Control for High-Precision 3D Asset Generation
机器之心· 2025-09-29 06:55
Core Viewpoint
- The article discusses Tencent's launch of Hunyuan 3D-Omni, a unified multimodal controllable 3D generation framework that addresses the limitations of existing methods reliant on image inputs, enhancing the precision and versatility of 3D asset creation across industries [2][5][31].

Background and Challenges
- The increasing scale of 3D data has led to the rise of generative models based on native 3D representations such as point clouds and voxels; Hunyuan3D 2.1 combines 3D variational autoencoders (VAE) with latent diffusion models (LDM) for efficient 3D model generation [5].
- Existing methods face challenges such as geometric inaccuracies due to single-view image inputs, difficulty in fine control over object proportions and details, and limitations in adapting to multimodal inputs [6][7].

Core Innovations of Hunyuan 3D-Omni
- Hunyuan 3D-Omni introduces two key innovations: a lightweight unified control encoder for handling multiple control conditions, and a progressive difficulty-aware training strategy to enhance robustness in multimodal integration [9][10] (see the sketch after this summary).
- The framework supports up to four types of control signals, significantly improving the controllability and quality of generated results [9].

Key Implementation Methods
- The system utilizes several control signals: (1) skeletons for character motion control, (2) bounding boxes for adjusting object proportions, (3) point clouds for providing geometric structure priors, and (4) voxels for sparse geometric hints [11][14].

Experimental Results
- With skeleton control, the model generates high-quality character geometry aligned with target poses, maintaining geometric detail across various input styles [18][19].
- Bounding-box control effectively adjusts object proportions and enables intelligent geometric reconstruction, as evidenced by successful generation of complex structures [23][25].
- Point-cloud inputs significantly mitigate the geometric ambiguities inherent in single-view images, ensuring accurate alignment with real-world structures [25][27].
- Voxel conditions enhance the model's ability to reconstruct detailed geometric features, improving overall generation quality [27][28].

Conclusion
- Hunyuan 3D-Omni represents a lightweight, multimodal, controllable 3D generation framework that integrates various geometric and control signals without compromising the foundational model's capabilities, paving the way for future advances in multimodal 3D generation [31].
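To make the "lightweight unified control encoder" idea concrete, here is a minimal sketch of how heterogeneous control signals might be mapped into one conditioning sequence: each signal type gets its own small embedder, and whichever signals are provided are tagged with a type token and concatenated. All module names, dimensions, and the fusion scheme are hypothetical illustrations, not Tencent's implementation.

```python
import torch
import torch.nn as nn

class UnifiedControlEncoder(nn.Module):
    """Sketch of a unified encoder for heterogeneous 3D control signals.

    Each control type (skeleton joints, bounding box, point cloud, voxel grid)
    is embedded into the same token space; missing signals are simply skipped,
    and the concatenated tokens condition the 3D generator."""
    def __init__(self, d_model: int = 512):
        super().__init__()
        self.skeleton_embed = nn.Linear(3, d_model)     # per-joint xyz
        self.bbox_embed = nn.Linear(6, d_model)         # (min_xyz, max_xyz)
        self.point_embed = nn.Linear(3, d_model)        # per-point xyz
        self.voxel_embed = nn.Linear(4, d_model)        # (xyz, occupancy)
        self.type_tokens = nn.Embedding(4, d_model)     # marks which signal a token came from

    def forward(self, skeleton=None, bbox=None, points=None, voxels=None):
        tokens = []
        for i, (feat, embed) in enumerate([
            (skeleton, self.skeleton_embed),
            (bbox, self.bbox_embed),
            (points, self.point_embed),
            (voxels, self.voxel_embed),
        ]):
            if feat is not None:                         # only encode provided signals
                tokens.append(embed(feat) + self.type_tokens.weight[i])
        return torch.cat(tokens, dim=1)                  # (batch, total_tokens, d_model)

# Toy usage: condition on a 22-joint skeleton plus a 1024-point cloud.
enc = UnifiedControlEncoder()
cond = enc(skeleton=torch.randn(1, 22, 3), points=torch.randn(1, 1024, 3))
print(cond.shape)  # torch.Size([1, 1046, 512])
```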
The First Open-Source Embodied Foundation Model with Zero-Shot Cross-Embodiment Generalization: A Full Breakdown of the Technical Details of BAAI's RoboBrain-X0
机器之心· 2025-09-29 06:55
Published by 机器之心 | 机器之心 editorial team

RoboBrain-X0 gives the embodied-intelligence industry a reusable, extensible general-purpose foundation, and its training dataset is open-sourced alongside it.

Today, the Beijing Academy of Artificial Intelligence (BAAI) officially open-sourced RoboBrain-X0, an embodied-intelligence foundation model that can drive a variety of different real robots through complex tasks under zero-shot generalization or lightweight fine-tuning. Its core breakthrough is that a unified action space plus hierarchical task decomposition realizes "one foundation model, N bodies", offering a practical path toward general embodied intelligence.

RoboBrain-X0 builds on the multimodal foundation capabilities of RoboBrain and, on top of the RoboBrain 2.0 data, further incorporates real robot action data. By jointly modeling vision, language, and action, it achieves cross-embodiment generalization and adaptation, with integrated capability from perception to execution.

According to evaluations published by the BAAI team, real-robot experiments with RoboBrain-X0 on several mainstream robot embodiments show:

These results mean that RoboBrain-X0 is not only a "general foundation" in theory, but has also taken a key step in engineering practice from single-point breakthroughs to large-scale deployment.

As a new-generation cross-embodiment foundation model, RoboBrain-X0 removes the dependence on a single robot system, achieves unified modeling of heterogeneous embodiments, and delivers practical-grade zero-sho ...
At the Crossroads of Embodied Intelligence, This Forum Talked Through Data, Models, and Infra
机器之心· 2025-09-29 02:52
Core Viewpoint
- The field of embodied intelligence is receiving unprecedented attention, yet key issues remain unresolved, including data scarcity and divergent technical approaches [1][2][3].

Group 1: Data and Technical Approaches
- The industry is divided into two camps: the "real machine" camp, which relies on real-world data collection, and the "synthetic" camp, which believes synthetic data is feasible for model training [5][12].
- Galaxy General, representing the synthetic camp, argues that achieving generalization in embodied intelligence models requires trillions of data points, which is unsustainable through real-world data alone [8][9].
- The "real machine" camp challenges the notion that real-world data is prohibitively expensive, suggesting that with sufficient investment, data collection can be scaled effectively [12][14].

Group 2: Model Architecture
- Discussions of embodied-intelligence model architecture highlight a divide between end-to-end and layered approaches, with some experts advocating a unified model while others support a hierarchical structure [15][19].
- The layered architecture is seen as more aligned with biological evolution, while the end-to-end approach is criticized for potential error amplification [19][20].
- The debate extends to the relevance of VLA (vision-language-action) models versus world models, with some experts arguing that VLA is currently more promising due to its data efficiency [21][22].

Group 3: Industry Trends and Infrastructure
- A scaling law for embodied intelligence is beginning to emerge, indicating that expanding model and data scale could be effective [24].
- The industry is witnessing accelerated deployment of embodied-intelligence technologies, with companies sharing their experiences in human-robot interaction and industrial applications [24][29].
- Cloud service providers, particularly Alibaba Cloud, are emphasized as crucial players in supporting the infrastructure needs of embodied-intelligence companies, especially as they transition to mass production [29][31].

Group 4: Alibaba Cloud's Role
- Alibaba Cloud has been preparing for the exponential growth in data and compute associated with embodied intelligence, having developed capabilities for large-scale data processing and model training [33][35].
- The company offers a comprehensive suite of cloud-based solutions to support both real and synthetic data production, enhancing efficiency and reducing costs [35][36].
- Alibaba Cloud's position as a model provider and its engineering capabilities are seen as significant advantages in the rapidly evolving embodied-intelligence landscape [37][41].
Latest from the Gao Yang Team at 千寻智能: A Pure-Vision VLA Approach Learns Strong Spatial Generalization from Limited Data
机器之心· 2025-09-29 02:52
Imagine having just learned to drive: on the practice course, we rehearse specific actions over and over, braking at a certain spot and turning the wheel at a certain point. Over time these actions become "conditioned memories", and the moment the environment changes we get flustered. Recently, researchers at 千寻智能 noticed a similar phenomenon in imitation-learning-based visuomotor policies and examined it in depth in the paper "Do You Need Proprioceptive States in Visuomotor Policies?".

Paper: https://arxiv.org/abs/2509.18644
Project page: https://statefreepolicy.github.io

In the paper, the researchers propose a State-free Policy. Compared with a State-based Policy, even when the table height, robot position, and target objects are strictly fixed in the training data, the robot still exhibits strong spatial generalization (a minimal sketch of the contrast follows the examples below). For instance:

- In a pen-grasping task, it generalizes across table heights (the standard table height is 80 cm).
- In a clothes-folding task, the robot still completes the task even when the arm is displaced far from its standard position.
- When a whole-body robot takes a drink out of a refrigerator, it adapts even if the refrigerator has been moved.

In fact, ...
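A minimal sketch of the contrast the paper draws: a conventional visuomotor policy concatenates proprioceptive state (such as joint angles) with visual features, while a state-free policy predicts actions from visual observations alone. The network shapes, feature dimensions, and class names below are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class StateBasedPolicy(nn.Module):
    """Conventional visuomotor policy: visual features + proprioceptive state."""
    def __init__(self, vis_dim=512, state_dim=14, act_dim=7):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(vis_dim + state_dim, 256), nn.ReLU(), nn.Linear(256, act_dim)
        )

    def forward(self, vis_feat, proprio_state):
        return self.head(torch.cat([vis_feat, proprio_state], dim=-1))

class StateFreePolicy(nn.Module):
    """State-free policy: actions are predicted from visual observations only,
    removing the shortcut of memorizing fixed joint configurations, which is
    the mechanism the paper credits for better spatial generalization."""
    def __init__(self, vis_dim=512, act_dim=7):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(vis_dim, 256), nn.ReLU(), nn.Linear(256, act_dim)
        )

    def forward(self, vis_feat):
        return self.head(vis_feat)

# Toy usage with a precomputed visual feature vector.
vis = torch.randn(1, 512)
state = torch.randn(1, 14)
a_state_based = StateBasedPolicy()(vis, state)  # conditioned on proprioception
a_state_free = StateFreePolicy()(vis)           # vision only
```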
A Developer Spent a Month of Hard Grinding to Replicate DeepMind's World Model: 3 Million Parameters Is Enough to Play a Real-Time Interactive Pixel Game
机器之心· 2025-09-28 10:29
Core Insights
- The article discusses the development of TinyWorlds, a minimal world model inspired by DeepMind's Genie 3, capable of generating playable pixel-style environments with only 3 million parameters [1][9][32].

Group 1: Understanding World Models
- World models are a type of neural network that simulates the physical world by generating video, showing emergent capabilities when trained on large-scale video data [5][7].
- The challenge lies in the need for frame-by-frame action labels during training, which limits the use of unannotated video data from the internet [5][6].
- Genie 1's solution involved training an action tokenizer to infer action labels, enabling the use of vast amounts of unannotated video for training [5][6].

Group 2: Dataset Construction
- TinyWorlds' dataset consists of processed YouTube gaming videos, which determine the range of environments the model can generate [11][12].

Group 3: Architecture and Tokenization Strategy
- TinyWorlds employs a space-time transformer to handle three-dimensional video data, capturing video information through a three-layer mechanism [15][17].
- The model's architecture combines spatial attention, temporal attention, and a feed-forward network to extract higher-level features [21][22] (see the sketch after this summary).
- The video tokenizer compresses videos into tokens, while the action tokenizer predicts actions between frames, allowing training on unannotated data [24][26].

Group 4: Training the World Generator
- The dynamics model serves as the system's "brain", predicting future frames from video and actions, with performance improving significantly when the model is scaled up [30][32].
- Despite its 3 million parameters, TinyWorlds can generate interactive pixel-style worlds, though the output remains somewhat blurry and incoherent [32].
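To make the "spatial attention, temporal attention, feed-forward" structure concrete, here is a minimal space-time transformer block: attention is applied across patches within each frame, then across time at each patch position, and finally a feed-forward layer refines each token. The dimensions and layer composition are illustrative assumptions, not TinyWorlds' exact code.

```python
import torch
import torch.nn as nn

class SpaceTimeBlock(nn.Module):
    """One space-time transformer block: spatial attention within each frame,
    temporal attention across frames at each patch position, then an FFN."""
    def __init__(self, d_model=128, n_heads=4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x):
        # x: (batch, time, patches, d_model)
        b, t, p, d = x.shape
        # Spatial attention: patches within each frame attend to one another.
        s = self.norm1(x).reshape(b * t, p, d)
        s = self.spatial_attn(s, s, s)[0]
        x = x + s.reshape(b, t, p, d)
        # Temporal attention: the same patch position attends across frames.
        m = self.norm2(x).permute(0, 2, 1, 3).reshape(b * p, t, d)
        m = self.temporal_attn(m, m, m)[0]
        x = x + m.reshape(b, p, t, d).permute(0, 2, 1, 3)
        # Feed-forward network extracts higher-level features per token.
        return x + self.ffn(self.norm3(x))

# Toy usage: 8 frames, 16 patches per frame.
video_tokens = torch.randn(2, 8, 16, 128)
out = SpaceTimeBlock()(video_tokens)
print(out.shape)  # torch.Size([2, 8, 16, 128])
```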