Multimodal Large Language Models
Call for Papers Opens for the 2nd CVPR 2026 CV4CHL Workshop: Safeguarding Children's Future with Large AI Models
机器之心· 2026-01-22 03:13
Core Insights
- The article discusses the rapid development of multimodal large language models and embodied AI, highlighting that AI and computer vision technologies focused on children's development, health, and education are still in their infancy [2]
- The CV4CHL workshop aims to bridge interdisciplinary perspectives on pediatric AI and computer vision solutions, addressing critical gaps in the field [2]

Event Details
- The CV4CHL workshop is organized by PediaMed AI in collaboration with several prestigious institutions, including the University of Illinois Urbana-Champaign, Hong Kong University of Science and Technology (Guangzhou), ETH Zurich, and Shenzhen Children's Hospital [2]
- The workshop will take place during CVPR 2026, scheduled for June 3-7, 2026, in Denver, Colorado, USA [7][6]

Key Topics
- The workshop will cover themes including:
  - Foundation models inspired by children's learning and cognitive abilities, and cutting-edge research on multimodal large language models [6]
  - Brain-computer interface technologies for children [6]
  - Frontiers in human-computer interaction with augmented reality glasses and smart glasses for children [6]
  - Applications of embodied AI in pediatrics [6]
  - Computer vision and foundation models related to children's cognitive development, such as gaze and gesture analysis [6]
  - Pediatric smart healthcare, including early disease screening and medical imaging and video analysis [6]
  - AI-enabled education, including smart educational tools and assistive technologies for children with special needs [6]
  - AI support for children's and adolescents' mental health [6]
  - Ethical and social implications of AI technologies for children, including privacy protection and human-robot interaction [6]

Submission Information
- The submission deadline for the workshop is March 31, 2026, with notification of review results by April 8, 2026 [6]
- The workshop will feature both proceedings and non-proceedings submission tracks, each with its own page limit [8]
[In-Depth Report / WeRide] Rooted in China and Expanding Overseas: A Leader in RoboX Commercialization
Investment Highlights
- WeRide, established in 2017, is a leading L4 autonomous driving company with a diverse product line including Robotaxi, Robobus, Robovan, and Robosweeper, alongside L2+ driver-assistance services. As of Q3 2025, total revenue reached 171 million yuan, a year-on-year increase of 144%, with the Robotaxi business as the core growth driver, contributing approximately 35.3 million yuan, up 761.0% year-on-year and accounting for 20.7% of total revenue. Gross margin stood at 32.9%, with a net loss of 307.3 million yuan. As of September 30, 2025, cash and capital reserves amounted to 5.4 billion yuan, supporting R&D investment and scaled expansion for long-term competitiveness [3][4]

L4 Industry Overview
- The company is the only one globally to have obtained autonomous driving licenses in eight countries. In China, it has achieved fully unmanned commercial operations in Beijing and Guangzhou, with each Robotaxi completing up to 25 rides per day during operating hours. The company has also received qualification for unmanned demonstration applications in Shanghai [4][5]
- The Robotaxi business is accelerating toward a commercial inflection point, with a clear path to profitability. The adoption of an end-to-end architecture and other advanced technologies has significantly improved safety and reduced accident rates compared to human drivers. The BOM cost has fallen from over 1 million yuan to below 300,000 yuan, with the unit economic model still being optimized [5][30]
- The Robotaxi market in China is expected to reach 200 billion yuan by 2030, with the potential to replace parts of the traditional and private transportation markets. The theoretically addressable Robotaxi market in developed and less-developed overseas regions is estimated at 4.4 and 3.4 times that of the Chinese market, respectively [5][41][45]

Company Analysis
- WeRide is positioned as a technology leader in the Robotaxi sector, benefiting from gradual policy opening, continuous breakthroughs in autonomous driving technology, and supply-chain cost reductions. Revenue projections for 2025-2027 are 555 million, 945 million, and 1.987 billion yuan, respectively, with corresponding price-to-sales ratios of 43.0, 25.2, and 12.0 times (a back-of-envelope check of these multiples follows this summary) [7]
- The company has a solid equity structure with significant investments from industry players, including Nvidia and major automotive companies, totaling over 1.1 billion USD in funding prior to its IPO [56][57]
- The governance structure is stable, with the founder holding significant voting power, ensuring effective decision-making [60]

Financial Analysis
- The company is currently in an investment phase, with both L2+ and L4 businesses contributing to revenue. Service revenue has grown rapidly due to partnerships, while product revenue is derived from its various autonomous vehicle models [66]
- Cash reserves are robust, with 5.4 billion yuan available as of Q3 2025, bolstered by successful financing rounds, ensuring liquidity and supporting ongoing operations [71]

Technological Core
- WeRide's competitive edge is built on its self-developed technology stack, including the WeRideOne platform, which integrates advanced driving algorithms and a comprehensive sensor suite for enhanced safety and operational efficiency [75]
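As a quick sanity check (not part of the report), a price-to-sales ratio is market value divided by revenue, so multiplying each year's revenue forecast by its quoted multiple should recover roughly the same implied valuation. The sketch below assumes exactly that; the figures are the ones quoted above, and the ~24 billion yuan result is only what those multiples imply, not a number stated in the article.

```python
# Back-of-envelope check on the quoted P/S multiples (illustrative only).
revenue_forecast_yuan = {2025: 555e6, 2026: 945e6, 2027: 1.987e9}
ps_multiple = {2025: 43.0, 2026: 25.2, 2027: 12.0}

for year, revenue in revenue_forecast_yuan.items():
    implied_value = revenue * ps_multiple[year]  # P/S = market value / revenue
    print(f"{year}: implied valuation ~ {implied_value / 1e9:.1f}B yuan")
# All three years land near ~24B yuan, i.e. the multiples are consistent with
# a single market value divided by successively larger revenue forecasts.
```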
Zhuoyu Founder Shen Shaojie: In 2026, Intelligent Driving Must Go from "End-to-End" to "End-to-Everywhere"
Xin Lang Cai Jing· 2026-01-11 05:53
Core Insights
- The autonomous driving industry is experiencing significant turbulence, with companies like Maomao Zhixing facing collapse despite strong backing and funding, while others like Zhuoyu Technology secure substantial investments [2]
- The competitive landscape has shifted from rule-driven to data-driven models, emphasizing the importance of rapid iteration and efficiency in development cycles [3][4]

Company Developments
- Zhuoyu Technology announced a strategic investment exceeding 3.6 billion yuan from China FAW, highlighting its growth amidst industry challenges [2]
- The founder of Zhuoyu, Shen Shaojie, noted that the company's model iteration cycle has been reduced to weekly updates, significantly improving project delivery times from six months to just over one month [3]

Industry Trends
- Companies that fail to transition to a data-driven development paradigm are at risk of being eliminated from the market [4][5]
- The core competitive factor in the intelligent driving sector is the ability to integrate data-driven approaches with traditional manufacturing processes [5]

Transformation Challenges
- Transitioning to a data-driven model has been challenging for teams traditionally focused on rule-based systems, as exemplified by Zhuoyu's decision to delete its original codebase [6]
- The company has shifted its safety protocols from relying on numerous rules to a comprehensive evaluation system, emphasizing data quality over quantity [6]

Engineering and Operational Changes
- The integration of data-driven methodologies into all aspects of operations is crucial for the success of intelligent driving solutions [7]
- Zhuoyu's engineering processes have evolved, with a focus on maintaining a disciplined approach to problem-solving without adding rules that could complicate models [10]

Future Outlook
- The competition in the intelligent driving industry is expected to intensify, with significant breakthroughs anticipated in 2026 [10][11]
- Zhuoyu aims to expand its technology across various vehicle models and scenarios, leveraging a "base model" strategy that allows for customization by automotive manufacturers [13]
MMSI-Video-Bench, the Ultimate Spatial Intelligence Challenge, Has Arrived
具身智能之心· 2026-01-06 00:32
Core Insights
- The article discusses the launch of MMSI-Video-Bench, a comprehensive benchmark for evaluating spatial intelligence in multimodal large language models (MLLMs), emphasizing the need for models to understand and interact with complex real-world environments [1][5][25]

Group 1: Benchmark Features
- MMSI-Video-Bench is designed with a systematic approach to assess models' spatial perception capabilities, focusing on spatial construction and motion understanding [5][6]
- The benchmark evaluates high-level decision-making abilities based on spatiotemporal information, including memory update and multi-view integration [6][7]
- It consists of five main task types and 13 subcategories, covering planning and prediction capabilities [9]

Group 2: Model Performance
- The benchmark revealed that even the best-performing model, Gemini 3 Pro, achieved only 38% accuracy, a performance gap of nearly 60 percentage points compared to human levels [10][14]
- The evaluation highlighted deficiencies in models' spatial construction, motion understanding, planning, and prediction capabilities [14][16]
- Detailed error analysis identified five main types of errors affecting model performance, including detailed grounding errors and geometric reasoning errors [16][20]

Group 3: Data Sources and Evaluation
- The video data for MMSI-Video-Bench is sourced from 25 public datasets and one self-built dataset, encompassing various real-world scenarios [11]
- The benchmark allows for targeted assessments of specific capabilities in indoor scene perception, robotics, and grounding [11]

Group 4: Future Directions
- The article suggests that introducing 3D spatial cues could enhance model understanding and reasoning capabilities [21][26]
- It emphasizes the ongoing challenge of designing models that can effectively utilize spatial cues and highlights that current failures are rooted in fundamental reasoning limitations rather than a lack of explicit reasoning steps [26]
A Survey Covering Nearly 300 Works: The Development of Manipulation Tasks Through the Lens of High-Level Planning and Low-Level Control
具身智能之心· 2026-01-06 00:32
Core Insights
- The article discusses the transformative advancements in robotic manipulation driven by the rapid development of visual, language, and multimodal learning, emphasizing the role of large foundation models in enhancing robots' perception and semantic representation capabilities [1][2]

Group 1: High-Level Planning
- High-level planning is responsible for clarifying action intentions, organizing sequences, and allocating environmental attention, providing structured guidance for low-level execution [4]
- The core components of high-level planning include task decomposition and decision guidance, integrating multimodal information to address "what to do" and "in what order" [4]
- Task planning based on large language models (LLMs) maps natural language to task steps, with methods like SayCan and Grounded Decoding enhancing execution skill selection and planning capabilities [5]
- Multimodal large language models (MLLMs) break the limitations of pure text input by integrating visual and language reasoning, with models like PaLM-E and VILA demonstrating superior performance in embodied tasks [8]
- Code generation techniques convert planning into executable programs, improving the precision of language-based plans through methods like Code as Policies and Demo2Code (an illustrative sketch of this pattern follows this summary) [9]
- Motion planning utilizes LLMs and VLMs to generate continuous motion targets, linking high-level reasoning with low-level trajectory optimization [10]
- Affordance learning focuses on establishing intrinsic associations between perception and action across geometric, visual, semantic, and multimodal dimensions [11]
- 3D scene representation transforms environmental perception into structured action proposals, bridging perception and action through techniques like Gaussian splatting [12]

Group 2: Low-Level Learning Control
- Low-level control translates high-level planning into precise physical actions, addressing the "how to do" aspect of robotic manipulation [14]
- Learning strategies for skill acquisition are categorized into three main types, including pre-training and model-free reinforcement learning [16]
- Input modeling defines how robots perceive the world, emphasizing the integration of multimodal signals through reinforcement learning and imitation learning [18]
- Visual-action models utilize both 2D and 3D visual inputs to enhance action generation, while visual-language-action models integrate semantic, spatial, and temporal information [19]
- Additional modalities like tactile and auditory signals improve robustness in contact-rich manipulation scenarios [20]

Group 3: Challenges and Future Directions
- Despite significant technological advancements, robotic manipulation faces four core challenges: the lack of universal architectures, data and simulation bottlenecks, insufficient multimodal physical interaction, and safety and collaboration issues [23][27][28][29]
- Future research directions include developing a "robotic brain" with flexible modal interfaces, establishing autonomous data collection mechanisms, enhancing multimodal physical interaction, and ensuring safety in human-robot collaboration [30]
- The review emphasizes the need for a unified framework that integrates high-level planning and low-level control, with a focus on overcoming data efficiency, physical interaction, and safety collaboration bottlenecks to facilitate the transition of robotic manipulation from laboratory settings to real-world applications [31]
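To make the code-generation idea above concrete, here is a minimal illustrative sketch in the spirit of Code as Policies: a language model prompted with a small set of robot primitives emits a short policy program rather than free-form text. The primitives used here (detect_objects, move_to, grasp, place) are hypothetical placeholders, not the API of any specific system.

```python
# Illustrative sketch only: the robot primitives below are hypothetical.
# The point is that an LLM planner can emit a program like this, turning a
# natural-language instruction into executable, inspectable steps.

def put_red_block_on_tray(robot):
    """Policy generated for the instruction: 'put the red block on the tray'."""
    objects = robot.detect_objects()                     # perception primitive
    block = next(o for o in objects if o.name == "red block")
    tray = next(o for o in objects if o.name == "tray")
    robot.move_to(block.position)                        # approach the object
    robot.grasp(block)                                   # low-level grasp skill
    robot.move_to(tray.position)
    robot.place(block, on=tray)                          # release onto the tray
```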
MMSI-Video-Bench, the Ultimate Spatial Intelligence Challenge, Has Arrived, and Top Large Models Are Wiped Out Across the Board
机器之心· 2026-01-05 08:54
Core Insights
- The article discusses the importance of spatial understanding capabilities in multimodal large language models (MLLMs) for their transition into real-world applications as "general intelligent assistants" [2]
- It highlights the limitations of existing spatial intelligence evaluation benchmarks, which either rely heavily on template generation or focus on specific spatial tasks, making it difficult to comprehensively assess models' spatial understanding and reasoning abilities in real-world scenarios [2]

Group 1: Introduction of MMSI-Video-Bench
- The Shanghai Artificial Intelligence Laboratory's InternRobotics team has launched a comprehensive and rigorous spatial intelligence video benchmark called MMSI-Video-Bench, designed to challenge current mainstream multimodal models [2][6]
- The benchmark aims to evaluate models' spatial perception, reasoning, and decision-making capabilities in complex and dynamic real-world environments [2][7]

Group 2: Benchmark Characteristics
- MMSI-Video-Bench features a systematic design of question types that assess models' basic spatial perception abilities based on spatiotemporal information [6]
- It includes high-level decision-making evaluations and extends task categories to cover complex real-world scenarios, testing models' cross-video reasoning capabilities, memory update abilities, and multi-view integration [6][8]
- The benchmark consists of five major task types and 13 subcategories, ensuring a comprehensive evaluation of spatial intelligence [10]

Group 3: Challenge and Performance
- The benchmark's questions are designed to be highly challenging: all models tested, including the best-performing Gemini 3 Pro, achieved at most a 38% accuracy rate, roughly 60 percentage points below human levels (a minimal accuracy-aggregation sketch follows this summary) [10][14]
- The evaluation reveals that models struggle with spatial construction, motion understanding, planning, prediction, and cross-video reasoning, highlighting critical bottlenecks in their capabilities [14][15]

Group 4: Error Analysis
- The research team identified five main types of errors affecting model performance: detailed grounding errors, ID mapping errors, latent logical inference errors, prompt alignment errors, and geometric reasoning errors [17][21]
- Geometric reasoning errors were found to be the most prevalent, significantly impacting performance, particularly in spatial construction tasks [19][21]

Group 5: Future Directions
- The article suggests that introducing 3D spatial cues could help models understand spatial relationships better, indicating a potential direction for future research [22][24]
- It emphasizes the need for effective design of spatial cues that models can truly understand and utilize, as current failures are attributed to underlying reasoning capabilities rather than a lack of explicit reasoning steps [27]
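For readers who want to reproduce headline numbers of this kind, the sketch below shows one plausible way to aggregate multiple-choice results per task type and overall; the record layout is an assumption, not the official MMSI-Video-Bench evaluation code.

```python
# Minimal accuracy-aggregation sketch (assumed data layout, not the benchmark's
# official scorer): each record carries a task type, the model's chosen option,
# and the ground-truth option.
from collections import defaultdict

def aggregate_accuracy(records):
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r["task_type"]] += 1
        correct[r["task_type"]] += int(r["pred"] == r["answer"])
    per_type = {t: correct[t] / total[t] for t in total}
    overall = sum(correct.values()) / sum(total.values())
    return per_type, overall

# An overall score of 0.38 against a near-ceiling human reference is what
# produces the roughly 60-point gap reported for the best model.
```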
Letting the Model Find Key Frames and Visual Cues on Its Own: Xiaohongshu's Video-Thinker Cracks the Video Reasoning Dilemma
机器之心· 2026-01-02 03:12
Core Insights
- The article discusses the revolutionary advancements in video reasoning through the introduction of the "Thinking with Videos" paradigm, specifically the Video-Thinker model, which enhances the model's ability to autonomously navigate and understand temporal sequences in videos [2][6][10]

Group 1: Model Development and Methodology
- Video-Thinker integrates "temporal grounding" and "visual captioning" into the model's cognitive chain, eliminating reliance on external tools and enabling the model to autonomously identify key frames and extract visual cues [2][10]
- The research team constructed the Video-Thinker-10K dataset, consisting of 10,000 high-quality samples, and employed a two-phase training strategy of "supervised fine-tuning + reinforcement learning" to enhance the model's self-exploration and self-correction capabilities [3][10]
- The model achieved state-of-the-art (SOTA) performance in various challenging video reasoning benchmarks, significantly surpassing existing baselines with its 7 billion parameters [3][22]

Group 2: Data Quality and Training Process
- The construction of high-quality training data is crucial for developing complex reasoning capabilities, leading to the integration of six major datasets into Video-Thinker-10K, which combines precise temporal annotations with detailed visual descriptions [12][13]
- The training process involved a structured thinking paradigm where the model learns to output specific labels such as <time> and <caption>, ensuring a rigorous "locate - perceive - reason" sequence (a minimal format-reward sketch follows this summary) [16][18]
- The reinforcement learning phase, utilizing Group Relative Policy Optimization (GRPO), allowed the model to explore and optimize its reasoning strategies, leading to emergent cognitive behaviors akin to human metacognition [19][22]

Group 3: Performance Evaluation
- Video-Thinker-7B demonstrated significant advantages across various video reasoning benchmarks, establishing a new SOTA for models with 7 billion parameters [25][29]
- The model's performance was evaluated through both in-domain and out-of-domain assessments, showcasing its ability to generalize effectively to unseen scenarios [24][29]
- The model achieved an accuracy of 43.22% on the Video-Holmes benchmark and 80.69% on the VRBench, outperforming previous models by notable margins [29][30]

Group 4: Key Findings and Implications
- The model's success is attributed to its internal capabilities of grounding and captioning, which were quantitatively assessed and found to be superior to those of baseline models [32][36]
- The findings indicate that relying on external tools can hinder performance, as demonstrated by experiments showing that simple plug-and-play tools did not enhance, but rather degraded, the model's reasoning capabilities [34][35]
- The article concludes that Video-Thinker's approach of integrating core internal capabilities rather than depending on large parameters and datasets represents a new paradigm in video reasoning, with potential applications across various industries [39]
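A rough sketch of how the structured <time>/<caption> outputs might be checked during training is shown below; the <answer> tag, the regular expressions, and the binary reward are illustrative assumptions in the style of a typical GRPO format reward, not Video-Thinker's actual implementation.

```python
# Hedged sketch of a format reward for the "locate -> perceive -> reason"
# output structure. The <time> and <caption> tags come from the article;
# the <answer> tag and the scoring rule are assumptions.
import re

def format_reward(response: str) -> float:
    time_m = re.search(r"<time>(.*?)</time>", response, re.S)
    cap_m = re.search(r"<caption>(.*?)</caption>", response, re.S)
    ans_pos = response.rfind("<answer>")
    if not (time_m and cap_m) or ans_pos == -1:
        return 0.0
    # grounding must precede captioning, which must precede the final answer
    return 1.0 if time_m.start() < cap_m.start() < ans_pos else 0.0

print(format_reward(
    "<time>00:12-00:18</time><caption>a hand opens the drawer</caption>"
    "<answer>B</answer>"
))  # -> 1.0
```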
NeurIPS 2025 | Farewell to Full-Dataset Scanning: Zhejiang University Proposes COIDO to Crack the "High-Cost" Problem of Multimodal Data Selection
机器之心· 2025-12-13 08:31
Core Insights
- The article introduces COIDO (Coupled Importance-Diversity Optimization), a framework designed to optimize data selection for visual instruction tuning in multi-modal large language models (MLLMs) [4][9][23]
- COIDO aims to reduce the computational costs associated with data selection while ensuring high-quality data is retained, addressing the challenges of existing methods that often require full data traversal [12][23]

Group 1: Motivation and Background
- The rapid growth of datasets, such as LLaVA-665K, has led to significant computational overhead and redundancy when fine-tuning MLLMs on full datasets [8]
- Existing data selection methods face two main issues: high selection costs and the decoupling of importance and diversity in data selection [12][9]

Group 2: Methodology
- COIDO introduces a lightweight scoring mechanism that allows for training on a small sample (e.g., 20%) of the full dataset, enabling generalization without the need for full data traversal [14]
- The core innovation of COIDO is the coupled optimization of importance and diversity within a unified training framework, rather than treating them as separate phases (a hedged sketch of such a coupled objective follows this summary) [14]
- The importance loss is based on a reweighted cross-entropy loss, while the diversity loss utilizes spectral clustering to minimize variance among clusters, ensuring a diverse data selection [14][15]

Group 3: Experimental Results
- COIDO achieves state-of-the-art performance using only 20% of the data, reaching 98.2% of the performance of full data fine-tuning across various benchmarks [20][21]
- The framework demonstrates strong generalization and transferability, outperforming models trained from scratch on new datasets [21]

Group 4: Conclusion
- COIDO presents a novel paradigm for multi-modal data selection, challenging the notion that data selection must be costly and providing a pathway for efficient fine-tuning of MLLMs [23][24]
- The framework's low computational cost and high-quality data selection make it a valuable tool for researchers with limited resources [23]
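The coupling of importance and diversity can be pictured as a single objective over a lightweight scorer, as in the sketch below. The softmax weighting, the sign convention on the reweighted cross-entropy term, the cluster-mass variance penalty, and the lambda trade-off are all assumptions for illustration; COIDO's released formulation may differ.

```python
# Hedged sketch of a coupled importance-diversity objective (illustrative only).
import torch

def coupled_loss(scores, per_sample_ce, cluster_ids, lam=1.0):
    """scores: (N,) raw scorer outputs; per_sample_ce: (N,) cross-entropy of the
    target model on each sample; cluster_ids: (N,) long tensor of
    spectral-cluster assignments."""
    weights = torch.softmax(scores, dim=0)            # normalized importance weights
    # importance: concentrate weight on informative (high-loss) samples
    importance_loss = -(weights * per_sample_ce).sum()
    # diversity: keep the selected weight mass balanced across clusters
    num_clusters = int(cluster_ids.max().item()) + 1
    cluster_mass = torch.zeros(num_clusters).index_add_(0, cluster_ids, weights)
    diversity_loss = cluster_mass.var()
    return importance_loss + lam * diversity_loss     # optimized jointly, not in phases
```

The key design point the sketch tries to convey is that both terms act on the same set of weights, so importance and diversity trade off inside one optimization rather than in separate scoring and clustering stages.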
Large Models Diagnosed as "Visually Illiterate": Multiple Universities Jointly Propose MILO to Implant Spatial Imagination
量子位· 2025-12-04 09:55
Core Insights
- The article discusses the limitations of multi-modal large language models (MLLMs) in spatial reasoning, highlighting their inability to effectively understand and visualize spatial concepts, leading to a phenomenon termed "visual illiteracy" [2][3]

Group 1: Challenges in Spatial Reasoning
- Spatial reasoning is identified as a core cognitive ability for humans to understand three-dimensional structures, which poses a significant challenge for MLLMs in practical applications [2]
- Current methods primarily rely on "language description tuning," which fails to provide models with a true visual understanding of spatial concepts [2][3]

Group 2: Introduction of MILO
- A research team has proposed MILO (Implicit Spatial World Modeling) to address the spatial reasoning challenges faced by MLLMs by integrating visual generative feedback with symbolic reasoning [4]
- MILO employs a two-phase training process: the first phase involves visual generative tuning, where the model learns spatial transformations through visual outputs, and the second phase involves language tuning using spatial instruction data [5]

Group 3: Enhancements in Geometric Perception
- To further enhance geometric perception, the team introduced RePE (Relative Positional Encoding), which captures relative transformations between adjacent frames instead of relying on a global coordinate system, improving generalization and adaptability across datasets (a minimal sketch of the frame-to-frame idea follows this summary) [8][9]

Group 4: GeoGen Dataset
- The research team constructed the GeoGen dataset, comprising approximately 2,241 videos and 267,000 "observation-action-result" triplets, aimed at enhancing geometric perception generation [10]
- The dataset includes diverse sources such as scanned 3D scenes and internet videos, ensuring a wide range of realistic scenarios [11]

Group 5: Validation of MILO
- The effectiveness of MILO was validated across multiple baseline models and five categories of spatial understanding tasks, achieving optimal performance in 3D scene understanding tasks and spatial reasoning tasks [12][16]
- Notably, MILO improved accuracy by 3.2% in the ScanRefer task and achieved an average accuracy of 61.7% in the VSI-Bench spatial reasoning task, surpassing the baseline VG-LLM by 2.2% [16]
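The frame-to-frame idea behind RePE can be illustrated with relative camera transforms, as in the minimal sketch below; the 4x4 pose-matrix representation and the numpy implementation are assumptions for illustration rather than MILO's actual encoding.

```python
# Minimal sketch: encode the transform between adjacent frames instead of
# absolute poses in a global coordinate system (illustrative, not MILO code).
import numpy as np

def relative_transforms(poses):
    """poses: list of 4x4 camera-to-world matrices, one per frame.
    Returns T_rel[i] such that poses[i+1] = poses[i] @ T_rel[i]."""
    return [np.linalg.inv(prev) @ curr          # motion expressed in the previous frame
            for prev, curr in zip(poses[:-1], poses[1:])]

# Two sequences that differ only by a rigid shift of the whole trajectory yield
# identical relative transforms, which is the intuition behind the improved
# cross-dataset generalization described above.
```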
Tencent Advertising Algorithm Competition Concludes Successfully, with Several Contestants Receiving Tencent Offer Letters On Site
Sou Hu Cai Jing· 2025-11-28 04:16
Core Insights
- The 2025 Tencent Algorithm Competition successfully held its finals in Shenzhen, with over 2,800 teams participating globally, focusing on "multi-modal generative recommendation" [1][5]
- The champion team "Echoch," consisting of members from Huazhong University of Science and Technology, Peking University, and the University of Science and Technology of China, was awarded Tencent offers and cash prizes [1]
- The competition attracted over 8,400 participants from nearly 30 countries, marking a historical high for overseas registrations [5]

Competition Overview
- The finals featured 20 teams that excelled in a rigorous selection process, showcasing innovative generative recommendation algorithms [1]
- A special technical innovation award of 200,000 yuan was granted to the team "料峭春风吹酒醒" from the Institute of Computing Technology, Chinese Academy of Sciences [1]

Technological Insights
- The competition emphasized the application of advanced technologies such as LLMs (Large Language Models) and MLLMs (Multi-modal Large Language Models), leading to significant innovations in model performance [3]
- Generative recommendation technology is seen as crucial for enhancing advertising precision and user experience, allowing for personalized ad recommendations [5]

Industry Implications
- Tencent's Vice President, Jiang Jie, highlighted the competition's role in attracting young talent to AI, reinforcing Tencent's commitment to technological innovation and collaboration between academia and industry [3]
- The competition's dataset will be open-sourced post-event to foster further academic and industrial technological exchange [5]

Business Development
- Tencent's Q3 financial report introduced the "Tencent Advertising AIM+" smart advertising product matrix, which optimizes marketing returns for advertisers [6]
- The ongoing exploration of generative recommendation technologies within Tencent's advertising business aims to enhance user experience and drive commercial growth [6]