机器之心
Embodied VLA Post-Training: TeleAI Proposes a Latent-Space-Guided Cross-Embodiment Generalization Method for VLA Models
机器之心· 2025-09-08 06:22
Built on top of multimodal foundation models, Vision-Language-Action (VLA) models are pre-trained on large volumes of robot manipulation data and hold the promise of general-purpose embodied manipulation. However, today's VLA foundation models still fall well short of that goal: applying them to a target scenario requires collecting tens or even hundreds of hours of target-embodiment data for post-training. In particular, when the target embodiment differs from the pre-training embodiments, the action distributions of the pre-training and post-training stages become severely mismatched, giving rise to the cross-embodiment adaptation challenge for VLA models. Fighting this mismatch by simply stacking more target-embodiment data during post-training yields rapidly diminishing marginal returns and still struggles to fit the target action distribution. To address this, the embodied intelligence team at the China Telecom Institute of Artificial Intelligence (TeleAI) proposed "Align then Steer" (ATE), a cross-embodiment generalization framework that tackles the VLA post-training problem. Its core idea is to align cross-embodiment action distributions in a latent space, and then, during post-training, use gradients in this unified latent space to steer the update direction of the VLA policy. Without modifying the existing VLA backbone, this shifts VLA post-training from a paradigm of tuning architectures to one of aligning distributions. ...
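The "align then steer" idea can be caricatured in a few lines of NumPy: a shared linear encoder (here a fixed random projection) stands in for the learned cross-embodiment alignment model, and a latent-space guidance term is added to the task loss so that its gradient steers the update direction. Everything below — shapes, weights, the encoder itself — is an illustrative assumption, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shared latent space: actions from different embodiments are
# mapped by one encoder into a common space. In ATE this would be a learned
# cross-embodiment alignment model; here it is a fixed random linear map.
ACTION_DIM, LATENT_DIM = 7, 4
W = rng.normal(size=(LATENT_DIM, ACTION_DIM)) / np.sqrt(ACTION_DIM)

def steered_grad(pred, target, beta=0.5):
    """Gradient of a two-term post-training loss: a task term in action
    space plus a guidance term computed in the shared latent space, which
    'steers' the update toward the aligned distribution."""
    diff = pred - target
    return 2 * diff + 2 * beta * W.T @ (W @ diff)

# Toy update loop standing in for policy fine-tuning on one action sample.
target = rng.normal(size=ACTION_DIM)
pred = rng.normal(size=ACTION_DIM)
for _ in range(300):
    pred -= 0.05 * steered_grad(pred, target)

print(np.allclose(pred, target, atol=1e-3))  # → True
```

Because the guidance term lives in the shared latent space, the same penalty applies regardless of which embodiment produced the target action — which is the point of aligning before steering.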
A New Height for Spatial Intelligence: HKUST Professor Ping Tan's Team Unveils SAIL-Recon, a Transformer for Large-Scale 3D Scene Reconstruction from Tens of Thousands of Images
机器之心· 2025-09-08 01:21
机器之心 report, 机器之心 editorial team. Professor Ping Tan's group at the Hong Kong University of Science and Technology (HKUST), together with Horizon Robotics, has released SAIL-Recon, a new method for 3D scene representation and large-scale reconstruction. By building a global implicit scene representation from anchor images, it breaks through the capacity bottleneck that limits the existing VGGT foundation model in large-scale visual localization and 3D reconstruction, scaling scene-representation extraction, localization, and reconstruction to the ten-thousand-frame level and pushing the frontier of spatial intelligence in "3D representation and modeling" to a new height. As a foundation model for 3D scene representation and reconstruction, the technique can serve not only large-scale 3D reconstruction and spatial walkthroughs of arbitrary scenes, but also as basic technical support for robots' 3D spatial perception, autonomous localization, and navigation. Ping Tan is a full professor in the Department of Electronic and Computer Engineering at HKUST, deputy director of the Von Neumann Institute for Artificial Intelligence, and director of the HKUST–BYD Joint Laboratory on Embodied Intelligence; his long-term research focuses on the technical frontiers of 3D spatial intelligence and embodied intelligence. The AI startup he founded, 光影焕像, develops core technologies and products for 3D and spatial intelligence, building a 3D spatial-intelligence "brain" and commercializing the technology in games, film and television, and embodied-intelligence scenarios. Author bio: Junyuan Deng received degrees from Shanghai Jiao Tong University in 2021 and 2024 ...
SceneSplat: Scene Understanding and Vision-Language Pre-Training on 3DGS — the Leap That Lets 3D Gaussians "Understand Human Language"
机器之心· 2025-09-07 08:21
Core Insights
- The article introduces SceneSplat, the first end-to-end large-scale 3D indoor scene understanding method that operates natively on 3D Gaussian Splatting (3DGS) scenes [2][6]
- A self-supervised learning scheme is proposed to unlock rich 3D feature learning from unlabelled scenes, addressing the lack of models that can independently handle 3D data for semantic learning [2][6]
- The SceneSplat-7K dataset is created, consisting of 7,916 scenes sourced from seven existing datasets, enabling effective training and testing of the SceneSplat model [2][6]

Dataset Construction
- SceneSplat-7K includes 7,916 processed 3DGS scenes and a total of 11.27 billion Gaussian points, an average of approximately 1.42 million points per scene [6][7]
- Constructing the dataset took compute equivalent to 150 days on L4 GPUs, yielding high reconstruction quality: a PSNR of 29.64 dB and an average Depth-L1 of 0.035 m [6][7]

Semantic Annotation
- A stable, fast pipeline annotates semantic information in 3DGS, employing SAMv2 for object-level segmentation and SigLIP2 for extracting vision-language features [8][10]
- The pre-trained encoder learns rich semantic representations from 3DGS parameters and neighborhood information alone, eliminating the need for 2D fusion during inference [8][10]

Training Methodology
- Two training routes are provided: vision-language pre-training for labelled data and self-supervised training for unlabelled data, maximizing the learning potential of unlabelled scenes [12][14]
- The model employs a hierarchical Transformer architecture, using Gaussian tokens and neighborhood attention for effective semantic vector regression [15]

Experimental Results
- SceneSplat achieves state-of-the-art (SOTA) results in zero-shot semantic segmentation on datasets such as ScanNet200, ScanNet++, and Matterport3D [21][22]
- Quantitative experiments demonstrate significant improvements in mean Intersection over Union (mIoU) and mean Accuracy (mAcc) across various datasets, showcasing the model's robustness [22][23]

Future Work
- The SceneSplat-7K dataset is being expanded to SceneSplat-49K, with ongoing benchmarking of 3DGS and semantic integration across multiple datasets [31]
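The zero-shot route summarized above reduces, at inference time, to matching each Gaussian's language-aligned feature against text-prompt embeddings. A minimal sketch with synthetic vectors standing in for the 3DGS encoder's outputs and the SigLIP2 text embeddings (all shapes and numbers are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins: per-Gaussian features (from the 3DGS encoder in SceneSplat)
# and one text embedding per class prompt (from SigLIP2). Here both are
# synthetic so the matching step itself can run end to end.
NUM_GAUSSIANS, FEAT_DIM, NUM_CLASSES = 1000, 64, 5
text_emb = rng.normal(size=(NUM_CLASSES, FEAT_DIM))
labels = rng.integers(0, NUM_CLASSES, size=NUM_GAUSSIANS)
gauss_feat = text_emb[labels] + 0.5 * rng.normal(size=(NUM_GAUSSIANS, FEAT_DIM))

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Zero-shot segmentation: each Gaussian takes the class of the most
# cosine-similar text embedding.
sim = normalize(gauss_feat) @ normalize(text_emb).T   # (N, C)
pred = sim.argmax(axis=1)

# mIoU over classes, the metric reported in the experiments above.
ious = [((pred == c) & (labels == c)).sum() / ((pred == c) | (labels == c)).sum()
        for c in range(NUM_CLASSES)]
print(float(np.mean(ious)) > 0.9)  # → True
```

No 2D views appear anywhere in this loop — which is the practical payoff of regressing language-aligned features directly on the Gaussians.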
Claude Won't Let Us Use It! Can Domestic Alternatives Fill the Gap?
机器之心· 2025-09-07 08:21
Core Viewpoint
- The global AI code-generation competition is undergoing a significant shift: OpenAI's GPT-5 series models are gaining strength while Anthropic's position weakens due to internal issues and external competition [1][4]

Group 1: Competitive Landscape
- Anthropic's models, including Claude Opus 4.1 and Opus 4, have been acknowledged to have reduced capabilities, leading to a decline in their competitive edge [1]
- OpenAI's GPT-5 Pro is being promoted for its superior coding capabilities, indicating a strong market presence [1]
- Domestic AI model vendors are launching new models targeting code generation, such as Kimi-K2-0905 and Qwen3-Max-Preview, which emphasize performance improvements on programming tasks [2][6]

Group 2: Technical Advancements
- Kimi-K2-0905 features a 256k context length and improved correctness, stability, and logical consistency in long code-generation tasks [2][6]
- The model uses a Mixture-of-Experts (MoE) architecture with 1 trillion total parameters, 32 billion of which are activated during inference, showcasing significant technical capability [7][6]
- Kimi-K2-0905 has passed 390,000 downloads on Hugging Face in the past 30 days, indicating strong user interest and adoption [3]

Group 3: Pricing Strategy
- Kimi-K2-0905's API is competitively priced at ¥1.00 per million tokens for cache hits and ¥4.00 for cache misses, making it an attractive alternative to Anthropic's pricing [17][18]
- This pricing positions Kimi-K2-0905 as a "Chinese alternative" to Claude while maintaining compatibility with Anthropic's API [18][19]

Group 4: Market Integration
- Domestic AI vendors are increasingly integrating their models into mainstream development tools and applications, expanding their presence in the market [23]
- Ongoing improvements in performance and user experience are expected to create a positive feedback loop, fostering a more robust application ecosystem and expanding market opportunities [23]
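At those rates, input-token cost is simple arithmetic over cached versus uncached tokens. A hedged sketch using only the per-million prices quoted above (output-token pricing is not covered by those figures, and the function name is ours):

```python
def kimi_input_cost(hit_tokens: int, miss_tokens: int) -> float:
    """Input-token cost in RMB at the quoted Kimi-K2-0905 rates:
    ¥1.00 per million tokens on cache hits, ¥4.00 on cache misses.
    (Output-token pricing is not included in the quoted figures.)"""
    return hit_tokens / 1e6 * 1.00 + miss_tokens / 1e6 * 4.00

# Example: a session that reuses a large cached prefix.
print(round(kimi_input_cost(hit_tokens=900_000, miss_tokens=100_000), 2))  # → 1.3
```

The 4x hit/miss spread is why prompt-caching-friendly workloads (long shared system prompts, agent loops) see the lowest effective price.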
ByteDance Seed Unveils Robix, a "Robot Brain" That Teaches Robots to Think, Plan, and Interact Flexibly
机器之心· 2025-09-07 05:12
ByteDance's Seed team recently released its latest robotics research result: Robix, a "robot brain" designed to improve robots' ability to think, plan, and interact flexibly. According to the report and demo videos, robots equipped with Robix already exhibit a range of complex interactive capabilities that were previously hard to achieve: ……

Title: Robix: A Unified Model for Robot Interaction, Reasoning and Planning
arXiv: https://arxiv.org/abs/2509.01106
Project page: https://robix-seed.github.io/robix/

While cooking, it can not only prepare ingredients from a dish name (e.g., yuxiang shredded pork) but also proactively notice missing ingredients and ask whether to fetch them;
When the user changes their mind mid-task, it can immediately stop the current action and flexibly carry out the new instruction;
When you doodle at random, it can recognize the objects in the drawing and respond naturally with comments and praise.

The demo videos below show how Robix works in real interactive scenarios.

Core idea: general-purpose robots have long been rigid on complex, long-horizon tasks because they rely on stitched-together "modular" designs. Robix's core highlight is its unified architecture: it seamlessly integrates reasoning, task planning, and human-robot interaction into a single end-to-end multi… ...
AI Giants at Home and Abroad Bet Big, Startups Go All In: Who Can Ride "Memory" to Become the Next "DeepSeek"?
机器之心· 2025-09-07 05:12
Core Viewpoint
- The article discusses the emerging importance of "memory" in AI models, suggesting that human-like memory will be a key factor in the next wave of AI advancements [2][6][35]

Group 1: Importance of Memory in AI
- The concept of "memory" is evolving from short-term to long-term or lifelong memory, allowing AI to learn continuously and adapt to new tasks without forgetting previous knowledge [3][7]
- Recent developments in AI memory capabilities have been highlighted by major players like Anthropic, Google, ByteDance, and OpenAI, all of which have introduced memory features in their AI systems [4][6][35]
- The demand for memory capabilities is driven by both technical and application needs, as AI models are increasingly expected to function as long-term partners rather than just tools [20][21][23]

Group 2: Current Trends and Developments
- Various AI companies are exploring different approaches to implementing memory, including parameterized memory, context memory, and external databases [26][28][30]
- The industry is witnessing a surge of interest and investment in memory-related research, with many companies racing to develop and integrate these capabilities into their products [6][35]
- Competition among AI firms is intensifying, and breakthroughs in memory capabilities could redefine the market landscape, much like past pivotal moments in AI development [35][36]

Group 3: Future Outlook
- Basic, widely available memory capabilities are estimated to be one to two years away, while governance and privacy issues may take three to five years to resolve [36][37]
- The future of AI memory remains uncertain, with many players vying for dominance; any company could emerge as a leader in this space [38]
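Of the three approaches listed, external-database memory is the easiest to illustrate: store snippets outside the model and retrieve the relevant ones at query time. A toy sketch (class and method names are ours; a word-overlap score stands in for the embedding index a real system would use):

```python
from collections import Counter

class ExternalMemory:
    """Toy long-term memory: store text snippets, retrieve by word overlap.
    A production system would use an embedding index and a vector database
    instead of this bag-of-words score."""
    def __init__(self):
        self.entries = []

    def remember(self, text: str):
        self.entries.append(text)

    def recall(self, query: str, k: int = 1):
        q = Counter(query.lower().split())
        def score(entry):
            # Number of query words also present in the stored snippet.
            return sum((q & Counter(entry.lower().split())).values())
        return sorted(self.entries, key=score, reverse=True)[:k]

mem = ExternalMemory()
mem.remember("user prefers concise answers")
mem.remember("user is allergic to peanuts")
print(mem.recall("is the user allergic to peanuts"))  # → ['user is allergic to peanuts']
```

The retrieved snippets would then be prepended to the model's context — which is why this route composes naturally with the "context memory" approach rather than replacing it.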
Stanford: A "War of the Gods" Among Optimizers? AdamW Wins on "Stability"
机器之心· 2025-09-07 05:12
Core Insights
- The article discusses the dominance of Adam and its improved version AdamW in the pre-training of open-weight language models since 2014, emphasizing their stability and rapid convergence on large datasets [1]
- It highlights the significance of optimizer design for convergence speed and computational cost as model sizes increase, with matrix-based optimizers showing a 30-40% iteration-level speedup over well-tuned AdamW [1][15]
- The research identifies two methodological flaws that can lead to underestimating baseline optimizers like AdamW: unfair hyperparameter tuning and insufficient testing scale [3][7]

Optimizer Performance
- Matrix-based optimizers (e.g., Muon, Soap, Kron) outperform scalar-based optimizers (e.g., AdamW, Nesterov AdamW, Mars) in delivering consistent acceleration across various data-to-model ratios [9][15]
- The advantage diminishes as model size grows, with some optimizers showing only a 1.1x speedup over AdamW at 12 billion parameters [9][25]

Hyperparameter Tuning
- Proper hyperparameter tuning is crucial: adjusting even a single parameter (such as the learning rate) can yield large gains, e.g., a 2x speedup on a 130-million-parameter model [6][18]
- Fixed, shared hyperparameters do not ensure fair comparisons between optimizers, as preferred values (e.g., for weight decay) can vary significantly [4][15]

Testing Methodology
- The research emphasizes rigorous, independent hyperparameter tuning for each optimizer to ensure fair comparisons; blindly transferring hyperparameters can produce misleading results [15][18]
- Short-horizon evaluations can be misleading, as performance rankings may reverse during training due to learning-rate decay [15][20]

Case Studies and Findings
- Case studies on larger models confirm that the predicted optimal configurations align closely with actual performance, validating the effectiveness of the proposed scaling laws [23]
- At extreme data-to-model ratios (e.g., 16x Chinchilla), optimizers such as Soap and Kron outperform Muon, indicating their effectiveness in data-rich regimes [26]
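The baseline everyone is measured against is AdamW's update rule: Adam's bias-corrected moment estimates, plus weight decay applied directly to the parameters rather than folded into the gradient. A minimal single-tensor sketch (hyperparameter values are illustrative defaults, not the paper's tuned settings):

```python
import numpy as np

def adamw_step(p, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    """One AdamW step: Adam's bias-corrected first/second moments, plus
    weight decay applied directly to the parameters (decoupled from the
    gradient), following Loshchilov & Hutter's formulation."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    p = p - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * p)
    return p, m, v

# Decoupling made visible: with zero gradients the Adam term vanishes and
# parameters shrink geometrically by (1 - lr * wd) per step.
p = np.ones(3); m = np.zeros(3); v = np.zeros(3)
for t in range(1, 101):
    p, m, v = adamw_step(p, np.zeros(3), m, v, t, lr=0.1, wd=0.1)
print(np.allclose(p, (1 - 0.1 * 0.1) ** 100))  # → True
```

That `wd` sitting outside the adaptive term is exactly the kind of hyperparameter the article says cannot be shared blindly across optimizers: its effective strength depends on the update rule around it.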
From SEAL Adaptive Learning to DFT Reward Correction: How Much Has LLM Generalization Really Improved?
机器之心· 2025-09-07 01:30
Core Insights
- The article discusses the challenges and advancements in the generalization capabilities of large language models (LLMs), highlighting strategies to improve them such as adaptive fine-tuning and dynamic gradient adjustment [7][11]

Group 1: Generalization in LLMs
- Generalization in AI refers to a model's ability to apply learned knowledge to new, unseen scenarios, as distinct from mere memorization of training data [8]
- Recent studies indicate that as model complexity and scale increase, the understanding of "generalization" is being questioned, with some suggesting it may be a form of data memorization rather than true abstraction [9][10]
- Research shows that while increasing model size can enhance performance on reasoning tasks, it may also strengthen memorization of factual knowledge, raising concerns about the true nature of generalization [9][10]

Group 2: CoT Reasoning and Its Limitations
- Chain-of-Thought (CoT) reasoning has been criticized as fragile: performance drops significantly when tested outside the training distribution, suggesting reliance on memory rather than genuine logical reasoning [10]
- Some experts argue that what is perceived as generalization may simply reflect training data that sufficiently covers the test scenarios, challenging the notion of true generalization [10]

Group 3: Research Trends and Focus Areas
- The volume of LLM-related research has surged, with a nearly sixfold increase in relevant studies from 2022 to 2025, particularly on reasoning, generalization, and model safety [11]
- Recent research has shifted from merely examining data distribution and model size to exploring training strategies, model-update mechanisms, and data design to enhance generalization capabilities [11]
To Pull Ahead in This "Version" of the Game, What Kind of "Environment" Do Agents Need?
机器之心· 2025-09-06 07:00
Core Viewpoint
- The article discusses AI startup you.com's recent transformation from a search engine into an AI infrastructure company following a $100 million Series C funding round. The shift aligns with a "product-driven infrastructure" strategy and reflects a broader trend of commercializing Agentic AI out of laboratory settings [1]

Group 1: Agent Environment and Its Evolution
- The focus of artificial intelligence is shifting from content creation to goal-driven, autonomous Agentic AI, driven by rapid advancements in the field [4]
- AI agents are expected to become the new interface for human-computer interaction, allowing users to issue commands in natural language without needing to write code [5]
- Companies like Cursor, Bolt, and Mercor have achieved significant revenue growth by leveraging distinctive intelligent-agent products [6]

Group 2: Development of the Agent Environment
- A suitable "Agent Environment" is crucial for modern agentic applications, balancing the need for freedom in code execution with security and isolation [7]
- Companies like E2B and Modal Labs provide secure, isolated cloud environments (sandboxes) for running AI-generated code [7]
- The concept of the Agent Environment traces back to reinforcement learning, where the environment is the simulated space in which agents are trained through trial and error [8]

Group 3: Real-World Application and Safety
- As LLM-based agents advance, the requirements on their environments are evolving from training spaces into operational zones, necessitating safe access to real-world tools [9]
- Different types of agents require distinct environments: physical environments for robots, digital environments for virtual assistants [10]
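The sandbox idea behind providers like E2B can be caricatured in a few lines: run untrusted, model-generated code in a separate process with a hard timeout and capture only its output. A toy sketch using the standard library (real sandboxes add filesystem, network, and resource isolation that a bare subprocess does not provide):

```python
import subprocess
import sys

def run_untrusted(code: str, timeout_s: float = 2.0) -> str:
    """Execute model-generated Python in a child process with a wall-clock
    timeout. Toy isolation only: cloud sandboxes like those mentioned above
    also restrict filesystem and network access."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout_s,
        )
        return result.stdout.strip() or result.stderr.strip()
    except subprocess.TimeoutExpired:
        return "error: timed out"

print(run_untrusted("print(sum(range(10)))"))  # → 45
print(run_untrusted("while True: pass"))       # → error: timed out
```

The timeout is the minimum viable safety property for an agent loop: without it, a single runaway completion stalls the whole agent.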
Nvidia's Play: Spending $1.5 Billion to Lease GPU Servers, Powered by Its Own AI Chips, from Lambda
机器之心· 2025-09-06 06:00
Core Viewpoint
- Nvidia has secured a significant partnership with Lambda, a smaller cloud service provider, in a deal worth up to $1.5 billion overall, under which Nvidia leases GPU servers equipped with its own AI chips [1][3]

Summary by Sections
- **Partnership Details**: The partnership consists of two transactions: one worth $1.3 billion for leasing 10,000 GPU servers over four years, and another worth $200 million for leasing 8,000 servers without a specified timeframe [1][3]
- **Lambda's Business Model**: Founded in 2012, Lambda primarily rents out data-center space and deploys servers equipped with Nvidia GPUs [2]
- **Impact on Lambda**: The deal is expected to boost Lambda's revenue, which may improve its chances of going public (IPO) [3]
- **Nvidia's Strategy**: Nvidia invests in smaller cloud providers like Lambda, enabling them to purchase Nvidia's AI chips, and subsequently rents servers back from them. This creates a revenue cycle that benefits both parties and strengthens Nvidia's market position [4][8]
- **Previous Success**: Nvidia has previously executed a similar strategy with CoreWeave, which completed a $1.5 billion IPO in March 2025, one of the largest venture-backed tech IPOs in recent years [7]
- **Competitive Landscape**: The strategy responds to increasing competition from major AI firms like Microsoft, Google, and Amazon, who are also significant Nvidia customers. By supporting smaller cloud providers, Nvidia aims to maintain its dominant market position [8]
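As a sanity check on the reported figures, the implied average lease rate in the first transaction is straightforward arithmetic (illustrative only; the actual contract terms, payment schedule, and server configurations are not public):

```python
def implied_annual_rate(total_usd: float, servers: int, years: float) -> float:
    """Implied average lease cost per server per year for a fixed-term deal,
    assuming the total is spread evenly (an assumption, not a contract term)."""
    return total_usd / servers / years

# First transaction as reported: $1.3B for 10,000 GPU servers over four years.
print(implied_annual_rate(1.3e9, 10_000, 4))  # → 32500.0
```

Roughly $32,500 per server per year, which gives a feel for the scale of the recurring revenue flowing back to Lambda under the deal.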