Multimodal Large Language Models
Zhuoyu Founder Shen Shaojie: In 2026, Intelligent Driving Must Go from "End-to-End" to "End-to-Everywhere"
Xin Lang Cai Jing· 2026-01-11 05:53
Source: TMTPost | Zhuoyu founder Shen Shaojie. A transition like this means growing pains for any team. Shen Shaojie admitted that, coming from the traditional robotics school, the Zhuoyu team had long been committed to rule-driven methods, firmly believing that "the model of the physical world is something I build myself." However, as the industry shifted toward end-to-end approaches, Zhuoyu had to face the fact that it "couldn't win," and so, on October 14, 2024, it made a difficult decision: delete the entire existing codebase and formally declare that "from now on there is only end-to-end, no rules." In that moment, 3,000 lines of code were wiped out, and with them the team's path dependence on rule-driven development. At the end of 2025, news of Haomo.AI's sudden collapse shook the entire autonomous driving industry: a company backed by a giant and amply funded still failed to make it through the cycle. Almost at the same time, another closely watched intelligent-driving company, Zhuoyu Technology, announced a strategic investment of more than 3.6 billion yuan from China FAW. Such swings are not isolated cases. Earlier, in May, Dazhuo Intelligence announced its dissolution and folded its business into Chery's intelligence center; soon after, Zhongzhixing was ruled bankrupt and ordered into liquidation by a court for failing to pay a small labor-arbitration award. Consolidation and shakeout are proceeding faster than ever before. Now that the whole industry has swung from rule-driven systems to data-driven end-to-end models, and every leading player stands on the same new starting line, the core of competition is no longer "who sets off first" but "whose iteration system is faster and more efficient." The gap between leading and chasing often comes down to the time window of a single successful model update. "The domestic leaders ...
The Ultimate Spatial-Intelligence Challenge, MMSI-Video-Bench, Is Here
具身智能之心· 2026-01-06 00:32
Editor: 机器之心. Spatial understanding is a key foundation for multimodal large language models (MLLMs) to move into the real physical world and become "general-purpose intelligent assistants." Existing spatial-intelligence benchmarks, however, tend to suffer from one of two problems: one class relies heavily on template-based generation, which limits question diversity; the other focuses on a single spatial task in restricted scenes, which makes it hard to comprehensively test a model's spatial understanding and reasoning in the real world. To truly operate in the real world, a model must not only see but also understand space: it needs to grasp spatial layout, perceive motion and change, perform spatio-temporal reasoning in complex and varied real scenes, make reasonable decisions based on that information, and interact effectively with its environment. To this end, the InternRobotics team at the Shanghai AI Laboratory recently released MMSI-Video-Bench, a comprehensive and demanding spatial-intelligence video benchmark, a carefully constructed, high-difficulty "spatial intelligence exam" for today's mainstream multimodal large models. The work was jointly completed by researchers from the Shanghai AI Laboratory, Shanghai Jiao Tong University, The Chinese University of Hong Kong, Zhejiang University, The University of Hong Kong, Beihang University, Xi'an Jiaotong University, ...
A Survey Covering Nearly 300 Works! The Development of Manipulation Tasks Through the Lens of "High-Level Planning and Low-Level Control"
具身智能之心· 2026-01-06 00:32
Editor: 具身智能之心. This article is shared for academic purposes only. In embodied intelligence, robot manipulation is a core challenge, and it is undergoing a transformation driven by rapid advances in vision, language, and multimodal learning. The emergence of large foundation models has greatly improved robots' perception and semantic representation capabilities, enabling them to complete tasks from natural-language instructions in unstructured environments. A survey jointly written by Xi'an Jiaotong University, The Hong Kong University of Science and Technology (Guangzhou), and several other universities systematically reviews learning-based robot manipulation methods under a unified "high-level planning + low-level control" framework, identifies current technical bottlenecks and future directions, and offers a comprehensive, structured reference for research in this field.
Paper: Embodied Robot Manipulation in the Era of Foundation Models: Planning and Learning Perspectives
Paper link: https://arxiv.org/pdf/2512.22983
Project link: https://github.com/BaiShuangha ...
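To make the survey's organizing split concrete: a high-level planner (typically an LLM/VLM) maps an instruction and scene context to subgoals, and a low-level controller (typically a learned visuomotor policy) turns each subgoal into motor actions. The minimal sketch below illustrates only this interface; the class names, subgoal fields, and toy planning rule are illustrative assumptions, not the API of any system covered by the survey.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical illustration of the "high-level planning + low-level control"
# decomposition the survey uses to organize manipulation methods.

@dataclass
class Subgoal:
    skill: str    # e.g. "pick", "place", "open"
    target: str   # object or location the skill acts on

class HighLevelPlanner:
    """Maps a natural-language instruction plus scene context to subgoals.
    In real systems this role is played by an LLM/VLM; here it is a stub."""
    def plan(self, instruction: str, scene_objects: List[str]) -> List[Subgoal]:
        # Toy rule standing in for a foundation-model planner.
        if "cup" in instruction and "cup" in scene_objects:
            return [Subgoal("pick", "cup"), Subgoal("place", "table")]
        return []

class LowLevelController:
    """Turns one subgoal into a short sequence of motor commands.
    In real systems this is a learned visuomotor policy."""
    def execute(self, subgoal: Subgoal) -> List[str]:
        return [f"move_to({subgoal.target})", f"{subgoal.skill}({subgoal.target})"]

if __name__ == "__main__":
    planner, controller = HighLevelPlanner(), LowLevelController()
    for sg in planner.plan("put the cup on the table", ["cup", "bowl"]):
        print(controller.execute(sg))
```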
The Ultimate Spatial-Intelligence Challenge, MMSI-Video-Bench, Is Here, and Top Large Models All Fail
机器之心· 2026-01-05 08:54
Spatial understanding is a key foundation for multimodal large language models (MLLMs) to move into the real physical world and become "general-purpose intelligent assistants." Existing spatial-intelligence benchmarks, however, tend to suffer from one of two problems: one class relies heavily on template-based generation, which limits question diversity; the other focuses on a single spatial task in restricted scenes, which makes it hard to comprehensively test a model's spatial understanding and reasoning in the real world. To truly operate in the real world, a model must not only see but also understand space: it needs to grasp spatial layout, perceive motion and change, perform spatio-temporal reasoning in complex and varied real scenes, make reasonable decisions based on that information, and interact effectively with its environment. To this end, the InternRobotics team at the Shanghai AI Laboratory recently released MMSI-Video-Bench, a comprehensive and demanding spatial-intelligence video benchmark, a carefully constructed, high-difficulty "spatial intelligence exam" for today's mainstream multimodal large models. The work was jointly completed by researchers from the Shanghai AI Laboratory, Shanghai Jiao Tong University, The Chinese University of Hong Kong, Zhejiang University, The University of Hong Kong, Beihang University, Xi'an Jiaotong University, Fudan University, and the University of California, Los Angeles.
Hugging Face dataset: https://huggingface.co/datasets/rbler/MMSI-Video-Bench
GitHub code ...
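For readers who want to inspect the benchmark directly, a minimal sketch of loading it from the Hugging Face Hub follows. The dataset ID comes from the link above, but nothing is assumed about split names or fields beyond what the `datasets` library reports at runtime.

```python
# Minimal sketch of inspecting MMSI-Video-Bench from the Hugging Face Hub.
# No particular split or field names are assumed; we let load_dataset report them.
from datasets import load_dataset

ds = load_dataset("rbler/MMSI-Video-Bench")  # DatasetDict keyed by split
print(ds)  # shows which splits and columns the release actually contains

first_split = next(iter(ds.values()))
example = first_split[0]
for key, value in example.items():
    print(key, str(value)[:80])  # short preview of every field in one example
```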
Letting the Model Find Key Frames and Visual Cues on Its Own: Xiaohongshu's Video-Thinker Tackles the Video-Reasoning Bottleneck
机器之心· 2026-01-02 03:12
With the rapid development of multimodal large language models (MLLMs), the "Thinking with Images" paradigm has achieved revolutionary breakthroughs in image understanding and reasoning: models no longer passively receive visual information but have learned to actively localize and think. When it comes to video reasoning tasks involving complex temporal dependencies and dynamic narratives, however, this capability has not yet been extended effectively. Existing video-reasoning methods are often limited by their reliance on external tools or preset prompting strategies, and struggle to give models an intrinsic ability to autonomously navigate and deeply understand temporal sequences, which leaves them stretched thin on long videos or complex logic. To tackle this problem, a research team from Xiaohongshu proposed Video-Thinker, a new "Thinking with Videos" paradigm that uses reinforcement learning to elicit the MLLM's intrinsic intelligence for video reasoning. Unlike conventional approaches, Video-Thinker does not rely on building and calling external tools; instead it internalizes two core capabilities, temporal grounding and visual captioning, within the model's chain of thought (CoT), so that the model can autonomously locate key frames and extract visual cues during reasoning. The team built the Video-Thinker-10K dataset of 10K high-quality samples and adopted a "supervised fine-tuning + reinforcement learning" ...
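To make "internalized grounding and captioning inside the CoT" concrete, the sketch below parses a chain of thought in which the model emits key-frame time spans and frame descriptions inline before answering. The tag names (<grounding>, <caption>, <answer>) and the sample output are illustrative assumptions, not Video-Thinker's actual output format.

```python
import re

# Hypothetical CoT in which the model interleaves temporal grounding (key-frame
# time spans) and captioning (what it sees there) before giving an answer.
cot = (
    "<grounding>12.0s-15.5s</grounding>"
    "<caption>a person places a red mug on the counter</caption>"
    "<grounding>40.2s-43.0s</grounding>"
    "<caption>the same spot on the counter is now empty</caption>"
    "<answer>Someone removed the mug between the two moments.</answer>"
)

def parse_cot(text: str):
    spans = re.findall(r"<grounding>(.*?)</grounding>", text)
    captions = re.findall(r"<caption>(.*?)</caption>", text)
    answer = re.search(r"<answer>(.*?)</answer>", text)
    return spans, captions, answer.group(1) if answer else None

spans, captions, answer = parse_cot(cot)
print(spans)     # key-frame intervals the model chose to inspect
print(captions)  # visual cues it extracted from those frames
print(answer)    # final answer grounded in those cues
```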
NeurIPS 2025 | Goodbye to Full-Data Scans! Zhejiang University Proposes COIDO to Crack the High-Cost Problem of Multimodal Data Selection
机器之心· 2025-12-13 08:31
Core Insights
- The article introduces COIDO (Coupled Importance-Diversity Optimization), a framework designed to optimize data selection for visual instruction tuning in multi-modal large language models (MLLMs) [4][9][23]
- COIDO aims to reduce the computational costs associated with data selection while ensuring high-quality data is retained, addressing the challenges of existing methods that often require full data traversal [12][23]
Group 1: Motivation and Background
- The rapid growth of datasets, such as LLaVA-665K, has led to significant computational overhead and redundancy when fine-tuning MLLMs on full datasets [8]
- Existing data selection methods face two main issues: high selection costs and the decoupling of importance and diversity in data selection [12][9]
Group 2: Methodology
- COIDO introduces a lightweight scoring mechanism that allows for training on a small sample (e.g., 20%) of the full dataset, enabling generalization without the need for full data traversal [14]
- The core innovation of COIDO is the coupled optimization of importance and diversity within a unified training framework, rather than treating them as separate phases [14]
- The importance loss is based on a reweighted cross-entropy loss, while the diversity loss utilizes spectral clustering to minimize variance among clusters, ensuring a diverse data selection [14][15]
Group 3: Experimental Results
- COIDO achieves state-of-the-art performance using only 20% of the data, reaching 98.2% of the performance of full data fine-tuning across various benchmarks [20][21]
- The framework demonstrates strong generalization and transferability, outperforming models trained from scratch on new datasets [21]
Group 4: Conclusion
- COIDO presents a novel paradigm for multi-modal data selection, challenging the notion that data selection must be costly and providing a pathway for efficient fine-tuning of MLLMs [23][24]
- The framework's low computational cost and high-quality data selection make it a valuable tool for researchers with limited resources [23]
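The coupling of importance and diversity in a single training objective can be sketched roughly as follows: a small scorer assigns each example a selection score, an importance term pushes scores toward examples with high (reweighted) cross-entropy loss, and a diversity term penalizes high pairwise feature similarity among high-scoring examples. This is only an illustration of the idea; the pairwise-similarity penalty stands in for COIDO's spectral-clustering-based diversity loss, and every weight, shape, and sign convention below is an assumption rather than the published formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sketch of a jointly trained data scorer, NOT COIDO's exact objective.

class Scorer(nn.Module):
    def __init__(self, feat_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.net(feats)).squeeze(-1)  # selection scores in (0, 1)

def coupled_loss(scores, per_example_ce, feats, div_weight=0.1):
    # Importance: place selection mass on informative (high-loss) examples.
    # Minimizing the negative weighted CE raises scores where the CE is large.
    weights = scores / (scores.sum() + 1e-8)
    importance = -(weights * per_example_ce).sum()
    # Diversity surrogate: among high-score examples, penalize pairwise feature
    # similarity so the selection does not collapse onto a single cluster.
    normed = F.normalize(feats, dim=-1)
    sim = normed @ normed.t()
    sim = sim - torch.diag(torch.diag(sim))  # zero out self-similarity
    diversity = (scores.unsqueeze(0) * scores.unsqueeze(1) * sim).mean()
    return importance + div_weight * diversity

if __name__ == "__main__":
    torch.manual_seed(0)
    feats = torch.randn(32, 512)           # frozen features for a batch of examples
    per_example_ce = torch.rand(32) * 3.0  # per-example fine-tuning losses
    scorer = Scorer(512)
    opt = torch.optim.Adam(scorer.parameters(), lr=1e-3)
    for _ in range(5):
        loss = coupled_loss(scorer(feats), per_example_ce, feats)
        opt.zero_grad(); loss.backward(); opt.step()
    print(scorer(feats)[:5])  # learned selection scores for the first few examples
```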
Large Models Diagnosed with "Visual Illiteracy"! Multiple Universities Jointly Propose MILO to Implant Spatial Imagination
量子位· 2025-12-04 09:55
Core Insights
- The article discusses the limitations of multi-modal large language models (MLLMs) in spatial reasoning, highlighting their inability to effectively understand and visualize spatial concepts, leading to a phenomenon termed "visual illiteracy" [2][3].
Group 1: Challenges in Spatial Reasoning
- Spatial reasoning is identified as a core cognitive ability for humans to understand three-dimensional structures, which poses a significant challenge for MLLMs in practical applications [2].
- Current methods primarily rely on "language description tuning," which fails to provide models with a true visual understanding of spatial concepts [2][3].
Group 2: Introduction of MILO
- A research team has proposed MILO (Implicit Spatial World Modeling) to address the spatial reasoning challenges faced by MLLMs by integrating visual generative feedback with symbolic reasoning [4].
- MILO employs a two-phase training process: the first phase involves visual generative tuning where the model learns spatial transformations through visual outputs, and the second phase involves language tuning using spatial instruction data [5].
Group 3: Enhancements in Geometric Perception
- To further enhance geometric perception, the team introduced RePE (Relative Positional Encoding), which captures relative transformations between adjacent frames instead of relying on a global coordinate system, improving generalization and adaptability across datasets [8][9].
Group 4: GeoGen Dataset
- The research team constructed the GeoGen dataset, comprising approximately 2,241 videos and 267,000 "observation-action-result" triplets, aimed at enhancing geometric perception generation [10].
- The dataset includes diverse sources such as scanned 3D scenes and internet videos, ensuring a wide range of realistic scenarios [11].
Group 5: Validation of MILO
- The effectiveness of MILO was validated across multiple baseline models and five categories of spatial understanding tasks, achieving optimal performance in 3D scene understanding tasks and spatial reasoning tasks [12][16].
- Notably, MILO improved accuracy by 3.2% in the ScanRefer task and achieved an average accuracy of 61.7% in the VSI-Bench spatial reasoning task, surpassing the baseline VG-LLM by 2.2% [16].
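As a rough picture of what "relative transformations between adjacent frames" means, the snippet below computes the rigid transformation that maps one camera frame onto the next from two absolute camera-to-world poses, then flattens it into a small vector. The 4x4 homogeneous representation, the pose convention, and the flattening are illustrative assumptions, not MILO's actual RePE design.

```python
import numpy as np

# Sketch of the relative-pose idea: encode the transform between adjacent
# frames instead of absolute poses in a global coordinate system.
# Poses below are camera-to-world 4x4 homogeneous matrices (an assumption).

def make_pose(rotation: np.ndarray, translation: np.ndarray) -> np.ndarray:
    pose = np.eye(4)
    pose[:3, :3] = rotation
    pose[:3, 3] = translation
    return pose

def relative_transform(pose_t: np.ndarray, pose_t1: np.ndarray) -> np.ndarray:
    # Maps points expressed in frame t's camera coordinates into frame t+1's.
    return np.linalg.inv(pose_t1) @ pose_t

# Two adjacent camera poses: a small sideways step with a slight yaw rotation.
yaw = np.deg2rad(5.0)
rot = np.array([[np.cos(yaw), -np.sin(yaw), 0.0],
                [np.sin(yaw),  np.cos(yaw), 0.0],
                [0.0,          0.0,         1.0]])
pose_a = make_pose(np.eye(3), np.zeros(3))
pose_b = make_pose(rot, np.array([0.1, 0.0, 0.0]))

rel = relative_transform(pose_a, pose_b)
encoding = rel[:3, :].flatten()  # 12-dim vector a model could consume per frame pair
print(np.round(encoding, 3))
```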
Tencent Advertising Algorithm Competition Concludes, with Several Contestants Receiving Tencent Offer Letters of Intent On Site
Sou Hu Cai Jing· 2025-11-28 04:16
Core Insights
- The 2025 Tencent Algorithm Competition successfully held its finals in Shenzhen, with over 2800 teams participating globally, focusing on "multi-modal generative recommendation" [1][5]
- The champion team "Echoch," consisting of members from Huazhong University of Science and Technology, Peking University, and University of Science and Technology of China, was awarded Tencent's offer and cash prizes [1]
- The competition attracted over 8400 participants from nearly 30 countries, marking a historical high for overseas registrations [5]
Competition Overview
- The finals featured 20 teams that excelled in a rigorous selection process, showcasing innovative generative recommendation algorithms [1]
- A special technical innovation award of 200,000 yuan was granted to the team "料峭春风吹酒醒" from the Institute of Computing Technology, Chinese Academy of Sciences [1]
Technological Insights
- The competition emphasized the application of advanced technologies such as LLM (Large Language Models) and MLLM (Multi-modal Large Language Models), leading to significant innovations in model performance [3]
- The generative recommendation technology is seen as crucial for enhancing advertising precision and user experience, allowing for personalized ad recommendations [5]
Industry Implications
- Tencent's Vice President, Jiang Jie, highlighted the competition's role in attracting young talent to AI, reinforcing Tencent's commitment to technological innovation and collaboration between academia and industry [3]
- The competition's dataset will be open-sourced post-event to foster further academic and industrial technological exchanges [5]
Business Development
- Tencent's Q3 financial report introduced the "Tencent Advertising AIM+" smart advertising product matrix, which optimizes marketing returns for advertisers [6]
- The ongoing exploration of generative recommendation technologies within Tencent's advertising business aims to enhance user experience and drive commercial growth [6]
Fei-Fei Li's Long Essay Goes Viral in Silicon Valley
投资界· 2025-11-14 08:01
Core Insights
- The article emphasizes that spatial intelligence is the next frontier for AI, which can revolutionize creativity, robotics, scientific discovery, and more [6][10][14]
- It outlines the three core capabilities that a world model must possess: generative, multimodal, and interactive [4][18][19]
Group 1: Importance of Spatial Intelligence
- Spatial intelligence is foundational to human cognition and influences how individuals interact with the physical world [11][14]
- Historical examples illustrate how spatial intelligence has driven significant advancements in civilization, such as Eratosthenes' calculation of the Earth's circumference and Watson and Crick's discovery of DNA structure [12][13]
Group 2: Current Limitations of AI
- Current AI models, particularly large language models (LLMs), lack the spatial reasoning capabilities that humans possess, limiting their effectiveness in understanding and interacting with the physical world [15][16]
- Despite advancements, AI struggles with tasks like estimating distances and navigating environments, indicating a fundamental gap in spatial understanding [15][16]
Group 3: Future Directions for AI Development
- The development of world models is essential for creating AI that can understand and interact with the world in a human-like manner [18][24]
- World models should be capable of generating consistent virtual worlds, processing multimodal inputs, and predicting future states based on actions [18][19][20]
Group 4: Applications of Spatial Intelligence
- The potential applications of spatial intelligence span various fields, including creativity, robotics, science, medicine, and education [34][35]
- In creative industries, tools like World Labs' Marble platform enable creators to build immersive experiences without traditional design constraints [28][29]
- In robotics, spatial intelligence can enhance machine learning and human-robot collaboration, making robots more effective in various environments [30][31]
Group 5: Vision for the Future
- The article envisions a future where AI enhances human capabilities rather than replacing them, emphasizing the importance of aligning AI development with human needs [26][36]
- The ultimate goal is to create machines that can understand and interact with the physical world, thereby improving human welfare and addressing significant challenges [38]
Cracking Multimodal Large Models' "Choice Paralysis"! Their Internal Decision Mechanism Revealed for the First Time: Frantic "Oscillation" Between Conflicting Information
量子位· 2025-11-14 05:38
Core Argument
- The article argues that modality following in multi-modal large language models (MLLMs) is a dynamic process influenced by relative reasoning uncertainty and inherent modality preference, rather than a static attribute [1][4][37].
Group 1: Research Contributions
- A new toy dataset was constructed to systematically and independently vary the reasoning difficulty of visual and textual inputs, enabling different difficulty combinations for multi-modal inputs [4].
- The study decomposes the explicit behavior of modality following into two core components: case-specific relative reasoning uncertainty and the model's stable inherent modality preference [4][5].
- An empirical finding indicates that the probability of a model following a certain modality decreases monotonically as the relative reasoning uncertainty of that modality increases [5].
Group 2: Framework Design
- A controlled dataset was created to validate hypotheses, allowing independent control of visual and textual reasoning complexity [9][10].
- Uncertainty was measured using output entropy, which reflects the model's perceived uncertainty, with lower entropy indicating confident predictions and higher entropy indicating consideration of alternative options [11].
- Relative uncertainty was quantified to measure the confidence gap between text and visual modalities, providing a core metric for subsequent analysis [12].
Group 3: Limitations of Traditional Metrics
- Traditional macro metrics like Text Following Rate (TFR) and Visual Following Rate (VFR) were tested on the constructed dataset, revealing confusing patterns that highlight their limitations [14].
- The study identifies a common trend where models perceive text as easier on average, yet exhibit opposite macro preferences, raising questions about the underlying reasons for these discrepancies [15][16].
Group 4: Experimental Paradigm
- A new experimental paradigm was designed to decouple model capability from preference, allowing for a clearer understanding of the model's decision-making process [18].
- The researchers grouped data points based on relative uncertainty to create a complete preference curve, reflecting how model preferences change dynamically with relative difficulty [18].
Group 5: Key Experimental Findings
- All tested models exhibited a consistent trend where the probability of following text decreases smoothly as text becomes relatively more difficult [19][21].
- The "balance point" was defined as the point where the curve crosses the 50% probability line, serving as a quantifiable measure of inherent modality preference [22].
- The framework successfully explained previous puzzles regarding model behavior by revealing differences in inherent preferences that were not visible in macro metrics [23][24].
Group 6: Internal Mechanisms
- The study explored the internal decision-making mechanisms of models, particularly their oscillation behavior when faced with conflicting information near the balance point [29][30].
- The findings indicate that models exhibit higher oscillation counts in ambiguous regions, providing a mechanistic explanation for observed indecision in external behavior [34][36].
Conclusion
- The research presents a new framework for understanding modality following in MLLMs, emphasizing the importance of separating model capability from inherent preference, and revealing a robust rule that the likelihood of following a modality decreases with increasing relative uncertainty [37].
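As a rough illustration of the entropy-based uncertainty measurement and the relative-uncertainty gap described above, the snippet below computes the entropy of a model's answer distribution for each modality presented alone and takes their difference. The option probabilities and the sign convention of the gap are invented for illustration, not the study's actual protocol.

```python
import math

# Entropy as an uncertainty measure, and a "relative uncertainty" gap between
# modalities. All numbers below are made up for the example.

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Answer distributions over options A-D when the model sees only the text,
# and only the image, for the same question.
text_only_probs  = [0.80, 0.10, 0.05, 0.05]   # confident -> low entropy
image_only_probs = [0.40, 0.30, 0.20, 0.10]   # less sure -> higher entropy

h_text, h_image = entropy(text_only_probs), entropy(image_only_probs)
relative_uncertainty = h_text - h_image  # > 0 would mean text is the harder modality

print(f"text entropy  = {h_text:.3f}")
print(f"image entropy = {h_image:.3f}")
print(f"relative uncertainty (text - image) = {relative_uncertainty:.3f}")
```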