Multimodal Large Language Models
[In-Depth / WeRide (文远知行)] Rooted in China, Pushing Overseas: the Leader in RoboX Commercialization
Soochow Securities Auto Team (Huang Xili) · 2026-01-13 13:41
Robotaxi companies, leveraging end-to-end architectures together with multimodal large language models and world models, have broken through the limits of traditional driver assistance. On safety, multi-sensor fusion and vehicle-road-cloud coordination substantially improve system reliability, and accident rates can be significantly lower than with human drivers. On cost, mass production of solid-state lidar is pushing hardware prices down: BOM cost has fallen from the million-yuan level to under RMB 300,000, unit economics keep improving, and the path to profitability is clear. On the market side, China's B-side shared-mobility market is expanding steadily, and Robotaxi is expected to replace part of traditional and private travel, with an optimistic 2030 forecast of a RMB 200 billion market. By our estimates, the theoretical addressable market for Robotaxi in developed / less-developed overseas regions in 2024 is 4.4x / 3.4x that of China; factoring in the gap in the number of shared-mobility vehicles to be replaced, market quality ranks developed regions > China > less-developed regions, so securing an early position in developed-region Robotaxi may be the core competitive battleground. On policy, 51 Chinese cities have opened pilot programs, with fully driverless operations running in many of them; Middle Eastern countries are strategy-driven with grand visions; Singapore is actively courting the technology while opening up prudently, encouraging ... The company was founded in 2017 by experts in computer vision and machine learning with prior experience at Baidu and Microsoft; its business has expanded from Robotaxi to Robobus, Robovan, Robosweeper, and other L4 scenarios, while also building out L2+ ...
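As a reading aid, a rough sketch of the per-vehicle unit-economics logic the note gestures at. Only the sub-RMB-300,000 BOM figure comes from the text; the fares, utilization, lifetime, and operating costs below are illustrative assumptions.

```python
# Hedged sketch: per-vehicle Robotaxi unit economics. Only the BOM
# ceiling (< RMB 300k) is from the report; every other number is an
# illustrative assumption.

def annual_profit(bom_cost=300_000,      # hardware BOM, RMB (report: now < 300k)
                  vehicle_life_years=5,  # assumption
                  rides_per_day=20,      # assumption
                  fare_per_ride=25,      # RMB per ride, assumption
                  opex_per_day=180):     # energy/cleaning/remote ops, assumption
    revenue = rides_per_day * fare_per_ride * 365
    depreciation = bom_cost / vehicle_life_years
    opex = opex_per_day * 365
    return revenue - depreciation - opex

if __name__ == "__main__":
    # How profit responds as solid-state lidar pushes the BOM down.
    for bom in (1_000_000, 500_000, 300_000):
        print(f"BOM {bom:>9,} RMB -> annual profit {annual_profit(bom):>9,.0f} RMB")
```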
Zhuoyu (卓驭) founder Shen Shaojie: In 2026, intelligent driving must go from "end-to-end" to "end-to-everywhere"
Xin Lang Cai Jing · 2026-01-11 05:53
Source: TMTPost (钛媒体). Pictured: Zhuoyu founder Shen Shaojie.

A transition like this is painful for any team. Shen Shaojie admits that, coming from the traditional robotics school, the Zhuoyu team long clung to rule-driven methods, convinced that "the model of the physical world is mine to build." But as the industry pivoted to end-to-end, Zhuoyu had to face the fact that it "could not win," and so on October 14, 2024 it made a hard decision: delete the entire legacy codebase and declare that "from now on there is only end-to-end, no rules." In that moment 3,000 lines of code were wiped out, and with them the team's path dependence on rule-driven driving.

At the end of 2025, news of the sudden collapse of Haomo.ai (毫末智行) shook the entire autonomous-driving industry: a company backed by a giant and amply funded still failed to survive the cycle. Almost at the same time, another closely watched intelligent-driving company, Zhuoyu Technology, announced a strategic investment of more than RMB 3.6 billion from China FAW.

Such swings are not isolated. Earlier, in May, Dazhuo Intelligence (大卓智能) announced its dissolution, with its business folded into Chery's intelligence center; soon after, Zhongzhixing (中智行) was ruled into bankruptcy liquidation by a court for failing to pay a small labor-arbitration award. Consolidation and shakeout are proceeding faster than ever before.

Now that the whole industry has swung from rule-driven to data-driven end-to-end models and all the leading players stand on the same new starting line, the core of competition is no longer "who sets off first" but "whose iteration system is faster and more efficient." The gap between leader and chaser often comes down to the time window of a single successful model update. "The domestic leaders ...
The Ultimate Spatial-Intelligence Challenge, MMSI-Video-Bench, Is Here
具身智能之心 · 2026-01-06 00:32
Editor | 机器之心

Spatial understanding is a key foundation for multimodal large language models (MLLMs) to move into the real physical world and become "general-purpose intelligent assistants." But existing spatial-intelligence benchmarks tend to have one of two problems: some rely heavily on template generation, which limits question diversity; others focus on a single spatial task in restricted scenes, making it hard to comprehensively test a model's spatial understanding and reasoning in the real world.

To truly enter the real world, a model must not only see but also understand space: it needs to grasp spatial layouts in complex, changing real scenes, perceive motion, reason over space and time, and, based on that information, make sensible decisions and interact effectively with its environment.

To this end, the InternRobotics team at Shanghai AI Laboratory recently released a comprehensive and demanding spatial-intelligence video benchmark, MMSI-Video-Bench, a carefully built, high-difficulty "spatial-intelligence exam" for today's mainstream multimodal large models. The work was jointly completed by researchers from Shanghai AI Laboratory, Shanghai Jiao Tong University, The Chinese University of Hong Kong, Zhejiang University, The University of Hong Kong, Beihang University, Xi'an Jiaotong University, ...
A Survey of Nearly 300 Works! The Evolution of Manipulation Through the Lens of "High-Level Planning and Low-Level Control"
具身智能之心 · 2026-01-06 00:32
Editor | 具身智能之心

In embodied intelligence, robotic manipulation is a core challenge that is being transformed by rapid advances in vision, language, and multimodal learning. The emergence of large foundation models has greatly improved robots' perception and semantic-representation abilities, enabling them to carry out tasks in unstructured environments from natural-language instructions. A survey jointly written by Xi'an Jiaotong University, HKUST (Guangzhou), and several other universities systematically organizes learning-based robotic manipulation methods under a unified "high-level planning + low-level control" framework, identifies current technical bottlenecks and future directions, and provides a comprehensive, structured reference for the field; a minimal sketch of this two-layer split follows below.

Paper: Embodied Robot Manipulation in the Era of Foundation Models: Planning and Learning Perspectives
Paper link: https://arxiv.org/pdf/2512.22983
Project link: https://github.com/BaiShuangha ...
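A minimal sketch of the "high-level planning + low-level control" decomposition the survey uses as its organizing frame: a language-conditioned planner emits subgoals, and a low-level controller turns each subgoal into motor commands. Every class and method name here is illustrative, not taken from the survey.

```python
# Hedged sketch of the "high-level planning + low-level control" split.
# All names are illustrative; the survey organizes existing methods
# under this frame rather than prescribing one implementation.
from dataclasses import dataclass

@dataclass
class Subgoal:
    skill: str   # e.g. "grasp", "place"
    target: str  # object referred to in the instruction

class HighLevelPlanner:
    """Maps a natural-language instruction to a subgoal sequence
    (in foundation-model systems, typically an LLM/VLM)."""
    def plan(self, instruction: str) -> list[Subgoal]:
        # Stand-in for a foundation-model call; a real planner would
        # ground the instruction against the observed scene.
        return [Subgoal("grasp", "red block"), Subgoal("place", "tray")]

class LowLevelController:
    """Turns one subgoal into motor commands, e.g. via a learned
    policy or a classical motion planner."""
    def execute(self, subgoal: Subgoal) -> None:
        print(f"executing {subgoal.skill} -> {subgoal.target}")

if __name__ == "__main__":
    planner, controller = HighLevelPlanner(), LowLevelController()
    for sg in planner.plan("put the red block on the tray"):
        controller.execute(sg)
```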
The Ultimate Spatial-Intelligence Challenge, MMSI-Video-Bench, Is Here; Top Large Models Wiped Out Across the Board
机器之心 · 2026-01-05 08:54
Spatial understanding is a key foundation for multimodal large language models (MLLMs) to move into the real physical world and become "general-purpose intelligent assistants." But existing spatial-intelligence benchmarks tend to have one of two problems: some rely heavily on template generation, which limits question diversity; others focus on a single spatial task in restricted scenes, making it hard to comprehensively test a model's spatial understanding and reasoning in the real world.

To truly enter the real world, a model must not only see but also understand space: it needs to grasp spatial layouts in complex, changing real scenes, perceive motion, reason over space and time, and, based on that information, make sensible decisions and interact effectively with its environment.

To this end, the InternRobotics team at Shanghai AI Laboratory recently released a comprehensive and demanding spatial-intelligence video benchmark, MMSI-Video-Bench, a carefully built, high-difficulty "spatial-intelligence exam" for today's mainstream multimodal large models. The work was jointly completed by researchers from Shanghai AI Laboratory, Shanghai Jiao Tong University, The Chinese University of Hong Kong, Zhejiang University, The University of Hong Kong, Beihang University, Xi'an Jiaotong University, Fudan University, and the University of California, Los Angeles.

Hugging Face dataset: https://huggingface.co/datasets/rbler/MMSI-Video-Bench
GitHub code ...
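A minimal sketch of pulling the benchmark from the Hugging Face Hub with the `datasets` library. The repo ID comes from the link above; the split name and field names are assumptions about the schema, so inspect the loaded dataset before relying on them.

```python
# Hedged sketch: load MMSI-Video-Bench from the Hugging Face Hub.
# The repo ID is from the article; the split and field names are
# assumptions about the schema, not documented facts.
from datasets import load_dataset

ds = load_dataset("rbler/MMSI-Video-Bench", split="test")  # split name assumed
print(ds)                       # inspect the actual columns first
sample = ds[0]
print(sample.get("question"))   # field name assumed
```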
Letting the Model Find Key Frames and Visual Cues on Its Own: Xiaohongshu's Video-Thinker Cracks the Video-Reasoning Impasse
机器之心 · 2026-01-02 03:12
With the rapid progress of multimodal large language models (MLLMs), the "Thinking with Images" paradigm has delivered revolutionary breakthroughs on image understanding and reasoning: models no longer passively receive visual information but have learned to actively localize and think.

When faced with video reasoning tasks full of complex temporal dependencies and dynamic narratives, however, this ability has not carried over effectively. Existing video-reasoning methods tend to depend on external tools or preset prompting strategies and fail to give models an innate capacity to autonomously navigate and deeply understand temporal sequences, leaving them stretched thin on long videos or complex logic.

To tackle this, a research team from Xiaohongshu proposed Video-Thinker, a new "Thinking with Videos" paradigm that uses reinforcement learning to elicit an MLLM's innate intelligence in video reasoning. Unlike traditional methods, Video-Thinker neither builds nor calls external tools; instead it internalizes two core abilities, temporal grounding and visual captioning, into the model's chain of thought (CoT), letting it autonomously find key frames and extract visual cues as it reasons.

The team carefully constructed the Video-Thinker-10K dataset of 10K high-quality samples and adopted a "supervised fine-tuning + reinforcement learning" ...
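To make the "grounding and captioning inside the CoT" idea concrete, here is a sketch of what one training sample might look like. The tag names and JSON layout are invented for illustration; the article does not show Video-Thinker's actual format.

```python
# Hedged sketch: one "Thinking with Videos" training sample in which
# temporal grounding and captioning live inside the chain of thought.
# The <ground>/<caption> tags and layout are assumptions, not the
# actual Video-Thinker-10K format.
import json

sample = {
    "video": "cooking_demo.mp4",  # hypothetical file
    "question": "What does the chef add after the onions?",
    "cot": (
        "<ground>00:42-00:51</ground> "   # model localizes the key segment itself
        "<caption>The chef pours diced tomatoes into the pan.</caption> "
        "So the ingredient added after the onions is tomatoes."
    ),
    "answer": "tomatoes",
}
print(json.dumps(sample, indent=2))
```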
NeurIPS 2025 | Farewell to Full-Dataset Scans! Zhejiang University's COIDO Cracks the "High-Cost" Problem of Multimodal Data Selection
机器之心 · 2025-12-13 08:31
Core Insights - The article introduces COIDO (Coupled Importance-Diversity Optimization), a framework designed to optimize data selection for visual instruction tuning in multi-modal large language models (MLLMs) [4][9][23] - COIDO aims to reduce the computational costs associated with data selection while ensuring high-quality data is retained, addressing the challenges of existing methods that often require full data traversal [12][23] Group 1: Motivation and Background - The rapid growth of datasets, such as LLaVA-665K, has led to significant computational overhead and redundancy when fine-tuning MLLMs on full datasets [8] - Existing data selection methods face two main issues: high selection costs and the decoupling of importance and diversity in data selection [12][9] Group 2: Methodology - COIDO introduces a lightweight scoring mechanism that allows for training on a small sample (e.g., 20%) of the full dataset, enabling generalization without the need for full data traversal [14] - The core innovation of COIDO is the coupled optimization of importance and diversity within a unified training framework, rather than treating them as separate phases [14] - The importance loss is based on a reweighted cross-entropy loss, while the diversity loss utilizes spectral clustering to minimize variance among clusters, ensuring a diverse data selection [14][15] Group 3: Experimental Results - COIDO achieves state-of-the-art performance using only 20% of the data, reaching 98.2% of the performance of full data fine-tuning across various benchmarks [20][21] - The framework demonstrates strong generalization and transferability, outperforming models trained from scratch on new datasets [21] Group 4: Conclusion - COIDO presents a novel paradigm for multi-modal data selection, challenging the notion that data selection must be costly and providing a pathway for efficient fine-tuning of MLLMs [23][24] - The framework's low computational cost and high-quality data selection make it a valuable tool for researchers with limited resources [23]
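A hedged sketch of the coupled objective as the summary describes it: an importance term built on reweighted cross-entropy plus a diversity term that discourages the learned scores from concentrating in a few clusters. The exact reweighting, clustering, and hyperparameters are assumptions; the summary does not give COIDO's formulation.

```python
# Hedged sketch of a coupled importance-diversity objective in PyTorch.
# This mirrors the structure described in the summary, not COIDO's
# exact losses.
import torch
import torch.nn.functional as F

def coupled_loss(logits, labels, scores, cluster_ids, lam=0.5):
    """logits: (N, C) model outputs; labels: (N,) targets;
    scores: (N,) learned per-sample selection scores in [0, 1];
    cluster_ids: (N,) cluster labels (e.g. from spectral clustering);
    lam: trade-off weight (assumed)."""
    # Importance: cross-entropy reweighted by the learned scores, so
    # the scorer is trained jointly with the selection objective.
    ce = F.cross_entropy(logits, labels, reduction="none")
    importance = (scores * ce).mean()

    # Diversity: minimize the variance of mean scores across clusters,
    # so selection does not collapse onto a handful of clusters.
    cluster_means = torch.stack(
        [scores[cluster_ids == c].mean() for c in cluster_ids.unique()]
    )
    diversity = cluster_means.var()

    # Coupled: one objective, optimized jointly rather than in two phases.
    return importance + lam * diversity
```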
Large Models Diagnosed with "Visual Illiteracy"! Multiple Universities Jointly Propose MILO to Implant Spatial Imagination
量子位 · 2025-12-04 09:55
Core Insights - The article discusses the limitations of multi-modal large language models (MLLMs) in spatial reasoning, highlighting their inability to effectively understand and visualize spatial concepts, a phenomenon the authors term "visual illiteracy" [2][3]. Group 1: Challenges in Spatial Reasoning - Spatial reasoning is identified as a core cognitive ability that humans use to understand three-dimensional structure, and it remains a significant challenge for MLLMs in practical applications [2]. - Current methods rely primarily on "language description tuning," which fails to give models a true visual understanding of spatial concepts [2][3]. Group 2: Introduction of MILO - A research team has proposed MILO (Implicit Spatial World Modeling) to address the spatial reasoning challenges faced by MLLMs by integrating visual generative feedback with symbolic reasoning [4]. - MILO employs a two-phase training process: the first phase is visual generative tuning, in which the model learns spatial transformations through visual outputs; the second phase is language tuning on spatial instruction data [5]. Group 3: Enhancements in Geometric Perception - To further enhance geometric perception, the team introduced RePE (Relative Positional Encoding), which captures relative transformations between adjacent frames instead of relying on a global coordinate system, improving generalization and adaptability across datasets [8][9]. Group 4: GeoGen Dataset - The research team constructed the GeoGen dataset, comprising 2,241 videos and approximately 267,000 "observation-action-result" triplets, aimed at strengthening geometric-perception generation [10]. - The dataset draws on diverse sources such as scanned 3D scenes and internet videos, covering a wide range of realistic scenarios [11]. Group 5: Validation of MILO - The effectiveness of MILO was validated across multiple baseline models and five categories of spatial understanding tasks, achieving the best performance on 3D scene understanding and spatial reasoning tasks [12][16]. - Notably, MILO improved accuracy by 3.2% on the ScanRefer task and achieved an average accuracy of 61.7% on the VSI-Bench spatial reasoning task, surpassing the baseline VG-LLM by 2.2% [16].
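A hedged sketch of the idea behind RePE as summarized above: encode the relative transformation between adjacent frames rather than absolute poses in a global coordinate system. The 4x4 homogeneous-pose representation and function names are illustrative assumptions, not MILO's actual formulation.

```python
# Hedged sketch of relative positional encoding between adjacent frames.
# Illustrates "relative transform instead of global coordinates"; the
# pose representation is an assumption.
import numpy as np

def relative_transform(pose_prev: np.ndarray, pose_curr: np.ndarray) -> np.ndarray:
    """Both poses are 4x4 camera-to-world homogeneous matrices (assumed).
    Returns the transform from frame t-1 to frame t, which does not
    depend on the choice of world frame."""
    return np.linalg.inv(pose_prev) @ pose_curr

def translation(t):
    T = np.eye(4)
    T[:3, 3] = t
    return T

# Toy check: shifting the whole trajectory by a global offset leaves
# the relative encoding unchanged.
p0, p1 = translation([0, 0, 0]), translation([1, 0, 0])
offset = translation([5, 5, 5])
assert np.allclose(relative_transform(p0, p1),
                   relative_transform(offset @ p0, offset @ p1))
print(relative_transform(p0, p1))
```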
Tencent Advertising Algorithm Competition Concludes; Multiple Contestants Receive Tencent Offer Letters On Site
Sou Hu Cai Jing · 2025-11-28 04:16
Core Insights - The 2025 Tencent Algorithm Competition successfully held its finals in Shenzhen, with over 2,800 teams participating globally, focusing on "multi-modal generative recommendation" [1][5] - The champion team "Echoch," with members from Huazhong University of Science and Technology, Peking University, and the University of Science and Technology of China, received Tencent offers and cash prizes [1] - The competition attracted over 8,400 participants from nearly 30 countries, a record high for overseas registrations [5] Competition Overview - The finals featured 20 teams that excelled in a rigorous selection process, showcasing innovative generative recommendation algorithms [1] - A special technical innovation award of 200,000 yuan went to the team "料峭春风吹酒醒" from the Institute of Computing Technology, Chinese Academy of Sciences [1] Technological Insights - The competition emphasized the application of advanced technologies such as LLMs (large language models) and MLLMs (multimodal large language models), leading to significant innovations in model performance [3] - Generative recommendation is seen as crucial for improving advertising precision and user experience by enabling personalized ad recommendations [5] Industry Implications - Tencent's Vice President, Jiang Jie, highlighted the competition's role in attracting young talent to AI, reinforcing Tencent's commitment to technological innovation and to collaboration between academia and industry [3] - The competition's dataset will be open-sourced after the event to foster further academic and industrial exchange [5] Business Development - Tencent's Q3 financial report introduced the "Tencent Advertising AIM+" smart advertising product matrix, which optimizes marketing returns for advertisers [6] - Tencent's ongoing exploration of generative recommendation within its advertising business aims to enhance user experience and drive commercial growth [6]
Fei-Fei Li's Long Essay Takes Silicon Valley by Storm
投资界 · 2025-11-14 08:01
Core Insights - The article emphasizes that spatial intelligence is the next frontier for AI, which can revolutionize creativity, robotics, scientific discovery, and more [6][10][14] - It outlines the three core capabilities that a world model must possess: generative, multimodal, and interactive [4][18][19] Group 1: Importance of Spatial Intelligence - Spatial intelligence is foundational to human cognition and influences how individuals interact with the physical world [11][14] - Historical examples illustrate how spatial intelligence has driven significant advancements in civilization, such as Eratosthenes' calculation of the Earth's circumference and Watson and Crick's discovery of DNA structure [12][13] Group 2: Current Limitations of AI - Current AI models, particularly large language models (LLMs), lack the spatial reasoning capabilities that humans possess, limiting their effectiveness in understanding and interacting with the physical world [15][16] - Despite advancements, AI struggles with tasks like estimating distances and navigating environments, indicating a fundamental gap in spatial understanding [15][16] Group 3: Future Directions for AI Development - The development of world models is essential for creating AI that can understand and interact with the world in a human-like manner [18][24] - World models should be capable of generating consistent virtual worlds, processing multimodal inputs, and predicting future states based on actions [18][19][20] Group 4: Applications of Spatial Intelligence - The potential applications of spatial intelligence span various fields, including creativity, robotics, science, medicine, and education [34][35] - In creative industries, tools like World Labs' Marble platform enable creators to build immersive experiences without traditional design constraints [28][29] - In robotics, spatial intelligence can enhance machine learning and human-robot collaboration, making robots more effective in various environments [30][31] Group 5: Vision for the Future - The article envisions a future where AI enhances human capabilities rather than replacing them, emphasizing the importance of aligning AI development with human needs [26][36] - The ultimate goal is to create machines that can understand and interact with the physical world, thereby improving human welfare and addressing significant challenges [38]
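As a reading aid, a minimal interface sketch of the three capabilities the essay ascribes to a world model (generative, multimodal, interactive). The class, method names, and types are invented for illustration and do not correspond to any World Labs or Marble API.

```python
# Hedged sketch: the three world-model capabilities named in the essay,
# written as an abstract interface. All names are illustrative; this is
# not an actual World Labs / Marble API.
from abc import ABC, abstractmethod
from typing import Any

class WorldModel(ABC):
    @abstractmethod
    def generate(self, prompt: str) -> Any:
        """Generative: produce a consistent virtual world from a prompt."""

    @abstractmethod
    def observe(self, inputs: dict[str, Any]) -> Any:
        """Multimodal: ingest images, video, text, or actions as input."""

    @abstractmethod
    def step(self, state: Any, action: Any) -> Any:
        """Interactive: predict the next world state given an action."""
```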