Exclusive Analysis | Five Major AI Trends for 2025 and the Underlying Data Revolution
机器之心· 2026-01-06 09:38
A 机器之心 release

In 2025, the center of gravity of AI development is shifting fundamentally: from pursuing model scale to building the ability to understand and solve complex real-world problems. In this transition, high-quality data is becoming the new cornerstone that defines AI capability. As a frontier explorer in AI data services, 数据堂 (Datatang) is deeply involved in, and supports, every key link of this transformation. This article offers an in-depth reading of five major AI technology trends for 2025 and the shifts in data demand behind them.

The "Human Touch" and "Real-Time" Revolution

Trend decoded: pursuing subtler emotion and more natural real-time interaction

Speech synthesis has moved beyond the basic stage of pursuing "clarity and accuracy" and is now evolving along two deeper dimensions of intelligence. The first is injecting emotion, personality, and cultural fit into synthesized speech, making virtual assistants, digital humans, and audio content more engaging and approachable. The second is upgrading from one-way responses to full-duplex natural interaction that supports real-time interruption, overlapping dialogue, and contextual coherence, which has become a hard requirement in frontier scenarios such as premium smart cockpits, real-time translation, and lifelike customer service. The core technical challenge is to let AI not merely "read out" text but "understand" context and emotion, and to listen, think, and respond in real time the way a person does, sustaining continuous dialogue that carries both emotion and logic.

Data demand leap: from "clean samples" to "expressive corpora" and "interaction streams"

The focus of training data is undergoing a dual leap. On one hand, "expressive corpora" must be built to support fine-grained control over timbre, prosody, emotion, and style, ...
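To make the "expressive corpus" idea concrete, here is a minimal sketch of what a single annotated utterance record might look like. The field names and label vocabularies are illustrative assumptions, not Datatang's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class ExpressiveUtterance:
    """One record in a hypothetical expressive TTS corpus.

    Field names and label sets are illustrative assumptions; the article
    does not specify Datatang's actual annotation schema.
    """
    audio_path: str                  # path to the raw waveform
    transcript: str                  # verbatim text of the utterance
    speaker_id: str                  # anonymized speaker identifier
    emotion: str                     # e.g. "joy", "calm", "urgency"
    style: str                       # e.g. "storytelling", "customer-service"
    prosody_tags: list[str] = field(default_factory=list)  # e.g. ["rising-intonation"]
    locale: str = "zh-CN"            # cultural/linguistic adaptation target

record = ExpressiveUtterance(
    audio_path="clips/0001.wav",
    transcript="您好,很高兴为您服务。",
    speaker_id="spk_042",
    emotion="warmth",
    style="customer-service",
    prosody_tags=["soft-attack", "slow-tempo"],
)
print(record.emotion)  # -> "warmth"
```

The point of such a schema is that every control axis the article names (timbre via speaker, prosody, emotion, style) becomes an explicit, queryable label rather than an implicit property of the audio.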
Just In: 智元 (AgiBot) Proposes SOP, Enabling Scalable Online Evolution of VLA Models in the Real World
机器之心· 2026-01-06 09:38
With consumer electronics we have grown used to the "peaks at the factory" setting: the moment of unboxing is usually the high point of a device's performance, and every day after that is depreciation.

For general-purpose robots, that setting must be overturned.

Imagine an AI robot trained in the lab whose brain crashes the moment it enters a home and faces a dimly lit room or a coffee table piled with clutter: it will forever remain an expensive experiment. This is the awkward truth of today's embodied intelligence. We have trained erudite pretrained models on internet knowledge, yet once they step into a physical world full of unknowns, these "theoretical giants" are often left helpless by environmental change: they "know" many principles but still cannot do the housework.

The way out for general-purpose robots should not be a "static, standardized product" trapped in its factory settings, but a living system that keeps getting stronger in real deployment, through every failure and correction.

To make this leap, 智元 (AgiBot)'s Embodied Research Center proposes the SOP (Scalable Online Post-training) framework.

Over the past few years, VLA (vision-language-action) models pretrained on massive internet data have given robots a measure of general-purpose generalization, yet they still face a gulf that is hard to cross: "knowing" does not equal "doing". A pretrained model may "know" what folding clothes is, but when it actually faces a real garment with soft material under complex lighting, it is often left helpless by distribution shift. ...
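The article does not detail SOP's internals, but the core idea of online post-training from deployment-time failures and corrections can be sketched as a simple loop. Everything below (the function names, the replay buffer, the update rule) is an illustrative assumption, not AgiBot's implementation:

```python
import random
from collections import deque

def online_post_training_loop(policy, env, corrector, buffer_size=10_000, batch_size=64):
    """A minimal sketch of online post-training for a deployed policy.

    `policy`, `env`, and `corrector` are hypothetical stand-ins: the policy
    proposes actions, the environment executes them, and the corrector
    (a human or an automatic verifier) supplies a fixed-up action when a
    rollout fails. SOP's actual mechanism is not described at this level
    of detail in the article.
    """
    buffer = deque(maxlen=buffer_size)  # replay buffer of corrected experience
    obs = env.reset()
    while True:
        action = policy.act(obs)
        next_obs, success = env.step(action)  # assumed (observation, success) return
        if not success:
            # A real-world failure becomes a training signal: store the
            # observation together with the corrected action.
            buffer.append((obs, corrector.fix(obs, action)))
        if len(buffer) >= batch_size:
            batch = random.sample(buffer, batch_size)
            policy.update(batch)  # e.g. supervised fine-tuning on corrections
        obs = next_obs
```

The design point this illustrates is the one the article argues for: failure data gathered after deployment feeds back into the policy, so the robot improves past its "factory settings" instead of depreciating.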
Don't Be Fooled by High Indoor-Benchmark Scores: Are Large Models Reasoning About Space, or Just Memorizing Answers?
机器之心· 2026-01-06 09:38
Core Insights
- The article highlights the emergence of "Spatial Intelligence" as a new frontier in AI, particularly in large models, driven by advancements from scholars like Fei-Fei Li [2]
- It raises concerns about the validity of recent performance improvements in models, questioning whether they genuinely understand spatial reasoning or are merely overfitting to similar indoor data distributions [2][16]

Group 1: Limitations of Indoor Scene Data
- Research in spatial intelligence has predominantly focused on indoor scenes due to a lack of diverse outdoor datasets, which are often based on autonomous-driving perspectives that differ fundamentally from first-person pedestrian views [5]
- Over-reliance on indoor data leads to high homogeneity between training and testing datasets, making it difficult to fairly assess models' spatial perception and reasoning capabilities [6]

Group 2: OSI-Bench Introduction
- OSI-Bench, developed by the University of Chinese Academy of Sciences in collaboration with Microsoft Research Asia and ETH Zurich, aims to provide a more accurate assessment of spatial intelligence by using original video data with precise 3D annotations from open-world environments [2][11]
- The benchmark evaluates models' true spatial capabilities by decoupling semantic priors from visual spatial intelligence, particularly in complex outdoor settings [9]

Group 3: Evaluation Results
- Evaluation results on OSI-Bench indicate that current state-of-the-art (SOTA) multimodal large language models generally fail to perform well on spatial reasoning tasks [13]
- Despite significant improvements on indoor benchmarks such as VSI-Bench, models consistently underperform on OSI-Bench, suggesting overfitting to specific scene distributions rather than genuine spatial intelligence [16]

Group 4: Language Priors and Model Performance
- When faced with spatial tasks, models tend to rely on language priors rather than engaging in visual geometric reasoning, leading to minimal performance differences with or without visual input (a diagnostic sketch follows this summary) [19][22]
- Experiments reveal that models struggle significantly in atypical scenarios where language priors fail, indicating a lack of robust spatial reasoning capabilities [23]

Group 5: Future Directions
- The article calls for a new paradigm in spatial intelligence that empowers models to perceive and think in spatial contexts, moving beyond mere data-driven distribution fitting [27]
- OSI-Bench's benchmark and evaluation code are open-sourced, with plans to continue releasing high-precision 3D information datasets to advance spatial intelligence from indoor to complex open-world scenarios [28]
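Group 4's observation, that accuracy barely changes with or without visual input, suggests a simple diagnostic probe. The sketch below is an illustrative harness under an assumed `model.answer(question, video=...)` interface, not OSI-Bench's released evaluation code:

```python
def language_prior_gap(model, benchmark_items):
    """Estimate how much a model leans on language priors.

    `model.answer(question, video=...)` and the item fields are assumed
    interfaces. If accuracy with video is close to accuracy without it,
    the model is likely answering from language priors rather than visual
    geometric reasoning, the failure mode OSI-Bench reports for SOTA models.
    """
    with_vision = without_vision = 0
    for item in benchmark_items:
        if model.answer(item.question, video=item.video) == item.gold:
            with_vision += 1
        if model.answer(item.question, video=None) == item.gold:
            without_vision += 1
    n = len(benchmark_items)
    return with_vision / n, without_vision / n  # a small gap => prior-driven
```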
Open-Sourcing 10,000 Hours of Embodied-Intelligence Data: What Is This Company After?
机器之心· 2026-01-06 09:38
Imagine you are training a future household robot. You want it to fold a shirt, tidy a cluttered desk, even tie a pair of shoelaces, as effortlessly as a person would. What is the biggest bottleneck? Not algorithms, not hardware, but data: massive, real-world, bimanual, long-horizon, multimodal, high-quality data.

A 机器之心 release

To accelerate exploration across embodied intelligence as a whole, open-source collections have become the community's shared choice. From Google's Open X-Embodiment and 智元's AgiBot Digital World, to 智源 (BAAI)'s RoboCOIN and 它石智航's World In Your Hands, all are trying to build larger and more complete data collections and open them to the entire industry.

But on January 6, one company took this to a new level, releasing an embodied dataset of more than 10,000 hours and nearly a million clips, the largest and most diverse open-source data collection in the industry: 简智机器人's "10Kh RealOmni-Open DataSet". (Download: https://huggingface.co/datasets/genrobot2025/10Kh-RealOmin-OpenData , with the remaining data being uploaded in batches. Within China, partnerships with Alibaba's 魔搭 (ModelScope) and Baidu's 百舸 (Baige) make it convenient for domestic users ...
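For readers who want to pull the release, a minimal sketch using the Hugging Face `datasets` library follows. The repo ID comes verbatim from the article's download link, but the split name and per-clip record layout are assumptions until the dataset card is checked:

```python
# pip install datasets
from datasets import load_dataset

# Repo ID taken from the article's download link. Streaming avoids pulling
# the full multi-thousand-hour release up front. The "train" split name and
# record layout are assumptions: consult the dataset card for the real schema.
ds = load_dataset(
    "genrobot2025/10Kh-RealOmin-OpenData",
    split="train",
    streaming=True,
)

for example in ds.take(3):
    print(example.keys())  # inspect the actual per-clip fields
```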
Jensen Huang Unveils a Blockbuster at CES: Next-Generation Rubin Architecture Cuts Inference Costs 10x
机器之心· 2026-01-06 00:31
Core Insights
- The article discusses the transformative impact of AI across industries, highlighting NVIDIA's announcements at CES 2026, particularly the new Rubin platform and the Alpamayo open-source model for autonomous driving [1][3][5]

Group 1: NVIDIA Rubin Platform
- The NVIDIA Rubin platform introduces six new chips aimed at building a leading AI supercomputer that excels in cost, performance, and security, significantly reducing training time and per-token inference costs [8][10]
- The platform features the latest NVIDIA NVLink interconnect technology, a Transformer engine, and advanced security measures, which together enhance AI capabilities and cut the GPU count needed to train models to a quarter of the previous generation's [13][17]
- Rubin is designed for surging AI-compute demand, with a total bandwidth of 260 TB/s, claimed to surpass the bandwidth of the entire internet, and is expected to be commercially available in the second half of 2026 [19][20]

Group 2: Alpamayo Open-Source Model
- The Alpamayo series introduces a vision-language-action (VLA) model that improves autonomous driving by letting vehicles reason through rare scenarios, enhancing safety and interpretability [27][28]
- The model is part of a cohesive open ecosystem of open-source models, simulation tools, and datasets that developers can build on for autonomous-driving technology [29][30]
- With 10 billion parameters, Alpamayo generates driving trajectories and reasoning traces from video inputs, giving developers a foundation for tailored autonomous-driving solutions (see the interface sketch after this list) [30][31]

Group 3: Robotics and Physical AI
- NVIDIA has launched new open-source models and frameworks for physical AI, aimed at accelerating the development of versatile robots that can learn multiple tasks [35][36]
- The company emphasizes simulation and evaluation frameworks, such as Isaac Lab-Arena, to streamline development and ensure robust performance before deployment [43][45]
- Collaborations with robotics industry leaders showcase the integration of NVIDIA's technology into next-generation robots expected to transform multiple sectors [36][50]
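To make the Alpamayo description concrete (video in, trajectory plus reasoning trace out), here is a purely illustrative interface sketch; NVIDIA's actual API, class names, and output format are not given in the article:

```python
from dataclasses import dataclass

@dataclass
class DrivingOutput:
    """Hypothetical output bundle for a VLA driving model like Alpamayo:
    the article says it generates trajectories plus reasoning traces."""
    waypoints: list[tuple[float, float]]  # future (x, y) positions, ego frame
    reasoning_trace: str                  # natural-language justification

def plan(model, video_frames) -> DrivingOutput:
    """Assumed call pattern: frames in, trajectory + rationale out.
    `model.generate` is a hypothetical method, not NVIDIA's API."""
    trajectory, trace = model.generate(video_frames)
    return DrivingOutput(waypoints=trajectory, reasoning_trace=trace)
```

The reasoning trace is what the article credits with interpretability: the model does not just emit a path, it emits a justification a developer can audit.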
Scale Up Retrieval, Keep Generation Light: CMU Team Systematically Evaluates RAG's Corpus-vs-Model Trade-off
机器之心· 2026-01-06 00:31
Core Insights
- The core argument of the research is that expanding the retrieval corpus can significantly enhance Retrieval-Augmented Generation (RAG) performance, often providing benefits that partially substitute for increasing model parameters, although diminishing returns set in at larger corpus sizes [4][22]

Group 1: Research Findings
- RAG performance is determined jointly by the retrieval module, which provides evidence, and the generation model, which interprets the question and integrates evidence into an answer [7]
- Smaller models can reach performance levels comparable to larger models by increasing the retrieval corpus size, a pattern observed consistently across multiple datasets [11][12]
- The largest gains occur when moving from no retrieval to having retrieval, with diminishing returns as corpus size grows further [13]

Group 2: Experimental Design
- The study used a full factorial design, varying only corpus size and model size while holding other variables constant, over a corpus of approximately 264 million real web documents [9]
- Evaluation covered three open-domain question-answering benchmarks, Natural Questions, TriviaQA, and WebQuestions, using common metrics such as F1 and ExactMatch [9]

Group 3: Mechanisms of Improvement
- A larger corpus raises the probability of retrieving answer-containing segments, giving the generation model more reliable evidence [16]
- The study defines the Gold Answer Coverage Rate, the probability that at least one of the top chunks given to the generation model contains the correct answer string, which increases monotonically with corpus size (see the sketch after this summary) [16]

Group 4: Practical Implications
- When resources are constrained, prioritizing retrieval-corpus expansion and coverage improvements can let medium-sized generation models perform close to larger ones [20]
- Tracking answer coverage and utilization rates as diagnostic metrics helps identify whether the bottleneck is in retrieval or generation [20]
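The Gold Answer Coverage Rate defined in Group 3 is easy to state in code. The sketch below implements the metric as described (does any of the top-k chunks contain a gold answer string); the lowercasing normalization is my own assumption, since the summary does not spell out the exact matching rules:

```python
def gold_answer_coverage(retrieved_chunks_per_query, gold_answers_per_query, k=10):
    """Fraction of queries where at least one top-k chunk contains a gold answer.

    String-level containment as described in the paper summary; lowercasing
    is an assumed normalization step.
    """
    covered = 0
    for chunks, answers in zip(retrieved_chunks_per_query, gold_answers_per_query):
        top_k = [chunk.lower() for chunk in chunks[:k]]
        if any(ans.lower() in chunk for ans in answers for chunk in top_k):
            covered += 1
    return covered / len(gold_answers_per_query)

# Example: one query, gold answer "Paris", found in the single retrieved chunk.
print(gold_answer_coverage([["Paris is the capital of France."]], [["Paris"]]))  # 1.0
```

Tracked alongside end-task accuracy, this metric does exactly the diagnostic job Group 4 recommends: if coverage is high but accuracy is low, the bottleneck is generation, not retrieval.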
MMSI-Video-Bench, the Ultimate Spatial-Intelligence Challenge, Arrives; Top Large Models Are Wiped Out
机器之心· 2026-01-05 08:54
Core Insights
- The article discusses the importance of spatial understanding in multimodal large language models (MLLMs) for their transition into real-world applications as "general intelligent assistants" [2]
- It highlights the limitations of existing spatial-intelligence benchmarks, which either lean heavily on template generation or focus on narrow spatial tasks, making it hard to comprehensively assess models' spatial understanding and reasoning in real-world scenarios [2]

Group 1: Introduction of MMSI-Video-Bench
- The Shanghai Artificial Intelligence Laboratory's InternRobotics team has launched MMSI-Video-Bench, a comprehensive and rigorous spatial-intelligence video benchmark designed to challenge current mainstream multimodal models [2][6]
- The benchmark evaluates models' spatial perception, reasoning, and decision-making in complex, dynamic real-world environments [2][7]

Group 2: Benchmark Characteristics
- MMSI-Video-Bench features a systematic design of question types that assess basic spatial perception grounded in spatiotemporal information [6]
- It includes high-level decision-making evaluations and extends task categories to cover complex real-world scenarios, testing cross-video reasoning, memory updating, and multi-view integration [6][8]
- The benchmark comprises five major task types and 13 subcategories, ensuring comprehensive coverage of spatial intelligence (a per-category scoring sketch follows this summary) [10]

Group 3: Challenge and Performance
- The questions are highly challenging: every model tested, including the best performer Gemini 3 Pro, reached only 38% accuracy, roughly 60 percentage points below human level [10][14]
- The evaluation shows models struggling with spatial construction, motion understanding, planning, prediction, and cross-video reasoning, exposing critical capability bottlenecks [14][15]

Group 4: Error Analysis
- The research team identified five main error types affecting performance: detailed grounding errors, ID mapping errors, latent logical inference errors, prompt alignment errors, and geometric reasoning errors [17][21]
- Geometric reasoning errors were the most prevalent and had the largest impact, particularly on spatial construction tasks [19][21]

Group 5: Future Directions
- Introducing 3D spatial cues could help models understand spatial relationships better, a potential direction for future research [22][24]
- Effective design of spatial cues that models can genuinely understand and use is needed, since current failures stem from underlying reasoning capabilities rather than a lack of explicit reasoning steps [27]
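Benchmarks structured like this are typically scored per task type and per error category. A minimal aggregation sketch follows, with the result-record shape assumed rather than taken from MMSI-Video-Bench's released evaluation code:

```python
from collections import defaultdict

def per_category_accuracy(results):
    """Aggregate benchmark results into per-category accuracy.

    `results` is assumed to be an iterable of (task_category, is_correct)
    pairs; MMSI-Video-Bench's actual result format may differ. Grouping by
    its 5 task types or 13 subcategories would use the same pattern.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, is_correct in results:
        totals[category] += 1
        correct[category] += int(is_correct)
    return {cat: correct[cat] / totals[cat] for cat in totals}

print(per_category_accuracy([
    ("spatial_construction", False),
    ("spatial_construction", True),
    ("motion_understanding", False),
]))  # {'spatial_construction': 0.5, 'motion_understanding': 0.0}
```

Per-category breakdowns like this are what let the team localize failures to spatial construction and geometric reasoning rather than reporting a single opaque accuracy number.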
Claude Code "Replicates" a Year of Google's Work in One Hour; Could a 5.5-Year PhD Now Be Finished in One?
机器之心· 2026-01-05 08:54
机器之心 Editorial Team

Recently, the well-known X poster and Hyperbolic co-founder & CEO Yuchen Jin wrote that if AI tools such as Claude Code, Gemini, and ChatGPT had existed during his PhD, he might have graduated in one year instead of 5.5.

What prompted the remark was a string of recent posts by engineers at large Silicon Valley AI companies saying that AI tools have drastically compressed project timelines...

First, Google principal engineer and Gemini API lead Jaana Dogan posted on X: "I'm not joking and it's not funny. Since last year we have been trying to build a distributed agent orchestrator inside Google. There were multiple options and people never fully agreed... I just described the problem to Claude Code, and within an hour it produced something that is pretty much what we built over the whole of last year."

She later added that the prompt was not detailed and contained no specifics, just a three-paragraph description. Since she cannot share anything, she could not show it concretely; in short, the task was to build a toy version on top of some existing ideas in order to evaluate Claude Code.

The post went on to draw hundreds of views, and one user then posted an earnest self-introduction: it turns out Rohan Anil ...
Just In: MiroMind, Holding the #1 Spot on the Future X Global Leaderboard Again, Releases the World's Strongest Search-Agent Model
机器之心· 2026-01-05 06:09
Core Viewpoint
- The MiroMind team has launched its flagship search-agent model MiroThinker 1.5, emphasizing "discovery intelligence" as a path to true general artificial intelligence and focusing on external information interaction rather than merely increasing internal parameters [1][10]

Group 1: Model Performance and Comparison
- MiroThinker 1.5-30B achieves performance comparable to many 1-trillion-parameter models while using only about 1/30 of the parameter scale [4]
- In key benchmark tests, MiroThinker 1.5-235B ranks among the top models globally, demonstrating its effectiveness despite a smaller parameter count [4]
- MiroThinker 1.5-30B has a markedly lower inference cost of $0.07 per call, about 1/20 the cost of Kimi-K2-Thinking, while also running faster [9]

Group 2: Interactive Scaling and Training Mechanism
- The MiroMind team shifts from traditional scaling laws focused on internal parameter expansion to "Interactive Scaling," which improves model performance through external information interaction [10][12]
- Training encourages evidence-seeking behavior: the model breaks key judgments into verifiable sub-hypotheses and actively queries external data [19]
- The model is trained under strict temporal visibility constraints, ensuring it learns to judge based only on past information and avoiding future leakage (see the sketch after this summary) [17][20]

Group 3: Unique Training Approaches
- MiroThinker 1.5 adopts a "scientist mode" rather than a "test-taker mode," focusing on verification and correction instead of memorization [11]
- The training paradigm includes a time-sensitive training sandbox that forces the model to operate under real-world conditions of incomplete information and noise [18]
- Training emphasizes iterative verification and self-correction, letting the model revise its hypotheses when evidence conflicts [19]

Group 4: Market Predictions and Applications
- MiroMind demonstrates predictive capability in stock-market scenarios, identifying stocks with high upside potential amid market noise [22][25][30]
- The model is also applied to predicting significant events that may affect major companies, offering insight into potential market reactions and volatility [31]
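The "strict temporal visibility constraints" in Group 2 amount to filtering what the model may see by a cutoff date. The sketch below shows that idea in isolation; the record format is assumed, and this is not MiroMind's sandbox code:

```python
from datetime import datetime

def temporally_visible(documents, cutoff: datetime):
    """Drop any document published after the cutoff.

    A minimal sketch of the anti-future-leakage constraint described for
    MiroThinker's training sandbox: when answering a question "as of" some
    date, the retriever may only surface documents published before it.
    The record shape (dict with a 'published_at' datetime) is an assumption.
    """
    return [doc for doc in documents if doc["published_at"] <= cutoff]

docs = [
    {"title": "Q3 earnings report", "published_at": datetime(2025, 10, 15)},
    {"title": "Q4 earnings report", "published_at": datetime(2026, 1, 20)},
]
visible = temporally_visible(docs, cutoff=datetime(2025, 12, 31))
print([d["title"] for d in visible])  # ['Q3 earnings report']
```

Enforcing this at the retrieval layer is what makes backtested predictions (like the stock scenarios in Group 4) meaningful: the model cannot quietly read tomorrow's news while "predicting" it.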
AAAI 2026 Oral | InfiGUI-G1 Arrives, Setting a New GUI Grounding SOTA
机器之心· 2026-01-05 06:09
With the rapid development of multimodal large language models (MLLMs), agents that can operate graphical user interfaces (GUIs) through visual input, the way humans do, are becoming reality. Yet on the road to general computer control, one problem remains hard: getting a model to map natural-language instructions precisely onto concrete on-screen elements, the GUI grounding task.

Existing methods, in particular reinforcement learning with verifiable rewards (RLVR), do well at improving "pointing precisely" (spatial alignment) but often hit a bottleneck at "pointing correctly" (semantic alignment). Models frequently fall into a "confidence trap": in semantically complex scenes they cannot find the functionally correct icon through effective exploration.

From "spatial alignment" to "semantic alignment": the overlooked exploration bottleneck

The core of GUI grounding is mapping a natural-language instruction (such as "open the camera") to the coordinates of a specific on-screen element. The research team points out that the task decomposes into two orthogonal dimensions:

1. Spatial alignment: can the model localize the element precisely ("pointing precisely")?
2. Semantic alignment: can the model identify the functionally correct element ("pointing correctly")?

To address this pain point, a research team from Zhejiang University, The Hong Kong Polytechnic University, and InfiX.ai proposes a new adaptive exploration ...
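The excerpt cuts off before describing the team's adaptive-exploration method, but the RLVR setup it builds on implies a verifiable reward for grounding: did the predicted click land inside the target element's bounding box? A minimal sketch of that generic signal follows; it illustrates the spatial-alignment reward RLVR optimizes well, not InfiGUI-G1's specific objective:

```python
def grounding_reward(pred_xy, target_bbox):
    """Binary verifiable reward for GUI grounding under RLVR.

    Returns 1.0 if the predicted click point falls inside the ground-truth
    element's bounding box, else 0.0. This is the generic spatial-alignment
    signal the article says RLVR handles well; InfiGUI-G1's adaptive
    exploration adds more on top, which is not reproduced here.
    """
    x, y = pred_xy
    x0, y0, x1, y1 = target_bbox  # (left, top, right, bottom) in pixels
    return 1.0 if (x0 <= x <= x1 and y0 <= y <= y1) else 0.0

print(grounding_reward((120, 45), (100, 30, 180, 60)))  # 1.0: inside the box
print(grounding_reward((10, 10), (100, 30, 180, 60)))   # 0.0: missed
```

The "confidence trap" the article describes follows directly from this reward shape: a policy that confidently clicks the wrong-but-plausible icon gets 0.0 everywhere, with no gradient toward exploring semantically different candidates, which is the gap adaptive exploration aims to close.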