Multimodal Large Language Models
Eastern Institute of Technology team proposes HiDrop: restructuring the MLLM computation path, compressing 90% of visual tokens for a 2.2× speedup
机器之心· 2026-03-23 11:56
Core Insights
- The article discusses the challenges and solutions related to the efficiency bottleneck in multimodal large language models (MLLMs) caused by the large number of visual tokens relative to text [2][3]
- It introduces HiDrop, a novel framework designed to compress visual tokens while maintaining model performance and improving computational efficiency [25]

Group 1: MLLM Functionality and Challenges
- Existing research typically employs fixed strategies for visual token pruning, neglecting the functional differences across the layers of MLLMs [3]
- Analysis reveals that different layers in MLLMs serve distinct roles: shallow layers primarily transmit visual features, middle layers perform cross-modal fusion, and deep layers focus on semantic integration and reasoning [3][9]

Group 2: HiDrop Framework
- HiDrop employs a three-stage hierarchical alignment compression strategy, aligning visual token processing with the model's layer structure to significantly reduce computational cost while preserving performance [15][16]
- The three stages are:
  1. Shallow layers: delayed injection of visual tokens to minimize computational load without affecting performance [19]
  2. Middle layers: concave pyramid pruning to aggressively reduce visual tokens, retaining the key tokens that most influence the text representation [20]
  3. Deep layers: early exit of visual tokens to streamline processing, allowing subsequent layers to focus on the fused semantic representation [21]

Group 3: Experimental Results
- HiDrop achieves roughly 90% compression of visual tokens while retaining 98.3% of the original model's performance, demonstrating a superior compression-performance trade-off [4][22]
- The method also delivers a 1.72× training speedup and a 2.2× pre-filling acceleration, indicating significant gains in computational efficiency [24][25]
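The three-stage schedule described above can be sketched as a per-layer visual-token budget. This is a minimal sketch under stated assumptions: the layer boundaries (`inject_at`, `exit_at`), the token count of 576, and the quadratic decay standing in for "concave pyramid pruning" are all illustrative, not values from the paper.

```python
# Hedged sketch of HiDrop's three-stage visual-token schedule. Layer
# boundaries, token counts, and the concave decay formula are illustrative
# assumptions, not the paper's actual configuration.

def visual_tokens_at_layer(layer, inject_at=2, exit_at=24, n_visual=576):
    """How many visual tokens a given transformer layer processes."""
    if layer < inject_at:
        return 0                 # shallow: delayed injection, no visual tokens yet
    if layer >= exit_at:
        return 0                 # deep: early exit, visual tokens already dropped
    # middle: concave pyramid pruning -- keep ratio decays faster than linearly
    progress = (layer - inject_at) / (exit_at - inject_at)
    keep_ratio = (1.0 - progress) ** 2
    return max(1, round(n_visual * keep_ratio))

schedule = [visual_tokens_at_layer(l) for l in range(32)]
baseline = 576 * 32              # every layer sees all visual tokens
saved = 1 - sum(schedule) / baseline
```

The point of the shape: shallow and deep layers pay no visual-token cost at all, and the middle layers' budget shrinks concavely, so most of the token-layer compute disappears even though early fusion layers still see the full token set.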
2026 LOG China Supply Chain Logistics Innovation Technology Report
Sou Hu Cai Jing· 2026-03-19 15:36
Core Insights
- The report focuses on the AI-driven transformation of the supply chain logistics industry, emphasizing smart, digital, and automated solutions [1][4]
- Key global trends include generative AI, supply chain cybersecurity, ESG, and sustainable development, with a focus on eight technology trends for 2024 [1][7]
- The report outlines a development framework for China that integrates AI technology, digitalization, and automation, highlighting the deployment of large models and intelligent decision-making products by companies such as JD.com and SF Express [1][4]

Global Supply Chain Trends
- The supply chain logistics industry is undergoing unprecedented change under the dual forces of global supply chain restructuring and the digital technology revolution [4]
- AI is reshaping value creation in supply chains through intelligent decision-making, digital connectivity, and automated execution [4][11]
- The 2024 supply chain technology trends emphasize AI-driven technologies, cybersecurity, and sustainability [7][11]

AI Technology Integration
- The report aims to provide a comprehensive view of AI implementation in the supply chain logistics sector, offering strategic insights for decision-makers and innovators [5]
- AI technologies are expected to enhance efficiency and create value through applications in demand forecasting, inventory optimization, and risk management [4][5]
- The integration of AI agents and multi-agent collaboration is anticipated to improve task execution and resource allocation in complex environments [31][33]

Industry Developments
- Major companies are actively exploring AI applications, with significant advances in autonomous vehicles, drones, and warehouse automation [1][36]
- The establishment of industry alliances focused on large-model applications indicates a collaborative approach to innovation in logistics [36][38]
- The report highlights the emergence of multimodal large language models (MLLMs) and embodied intelligence as key trends in AI development for logistics [26][33]
Unifying discrete and continuous diffusion! Renmin University & Ant Group propose LLaDA-o, efficiently achieving multimodal understanding and generation
机器之心· 2026-03-14 04:03
Core Insights
- The article discusses LLaDA-o, an efficient, length-adaptive omni diffusion model that addresses the challenge of integrating discrete text diffusion and continuous image diffusion into a unified framework [3][19]

Group 1: Model Performance
- LLaDA-o achieves state-of-the-art (SOTA) performance in both multimodal understanding and text-to-image generation, marking a significant advance for multimodal diffusion models [3][19]
- On multimodal understanding benchmarks, LLaDA-o outperforms existing diffusion models, scoring 66.1 on MathVista and 87.9 on ChartQA, solidifying its position as the leading model in this category [7][9]
- The model also excels at fine-grained generation, scoring 87.04 on DPG-Bench and surpassing previously strong models such as SD3-Medium and Lumina-DiMOO [9][11]

Group 2: Technical Innovations
- LLaDA-o employs a Mixture of Diffusion (MoD) framework with two specialized diffusion experts: an Understanding Expert for discrete masked diffusion and a Generation Expert for continuous diffusion, allowing effective optimization across modalities [12][14]
- The model uses intra-modality bidirectional attention to cut redundant computation during inference, improving efficiency [15]
- An adaptive length augmentation strategy lets the model dynamically adjust output length based on context, handling variable-length text generation without changing the underlying architecture [17]

Group 3: Future Implications
- The successful integration of discrete language understanding and continuous visual generation within the MoD framework positions LLaDA-o as a strong contender against autoregressive models, paving the way for future non-autoregressive architectures [19][20]
- The ongoing evolution of large language diffusion models suggests that unified models built on diffusion architectures will play a crucial role in the landscape of general artificial intelligence [20]
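The MoD idea described above, with one expert per modality, reduces at its core to a routing step. This is a toy sketch: the expert bodies are stand-in functions invented for illustration, and only the modality-based dispatch pattern follows the article's description.

```python
# Toy sketch of Mixture-of-Diffusion routing: text tokens go to a discrete
# (masked) diffusion expert, image tokens to a continuous diffusion expert.
# Expert internals are stand-ins; only the routing pattern is the point.

def understanding_expert(token):
    # stand-in for discrete masked diffusion over text tokens
    return ("discrete", token)

def generation_expert(token):
    # stand-in for continuous diffusion over image latents
    return ("continuous", token)

def mod_forward(tokens):
    """Route each (modality, value) token to its specialized expert."""
    outputs = []
    for modality, value in tokens:
        expert = understanding_expert if modality == "text" else generation_expert
        outputs.append(expert(value))
    return outputs

mixed = [("text", "The"), ("text", "cat"), ("image", 0.13), ("image", -0.42)]
routed = mod_forward(mixed)
```

Because each expert only ever sees tokens of its own modality, the two diffusion objectives (masked discrete vs. continuous) can be optimized independently while sharing one sequence, which is what lets a single model cover both understanding and generation.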
ICLR 2026 | LongHorizonUI: keeping GUI agents from "giving up halfway", a unified robust automation framework for long-horizon tasks
机器之心· 2026-03-12 08:19
Core Viewpoint
- The article discusses LongHorizonUI, a unified framework designed to improve the automation of long-horizon tasks for GUI agents, addressing the sharp drop in success rates once tasks exceed 10-15 steps [2][5]

Group 1: Research Background
- Traditional GUI automation methods struggle with long operation sequences: success rates fall from over 90% at 5 steps to below 75% beyond 10 steps, and to around 60% beyond 15 steps [5]
- The research team identified the need for a solution that maintains contextual consistency and decision accuracy across long step sequences [5]

Group 2: Benchmark Development
- A new benchmark, LongGUIBench, was created to evaluate long-horizon tasks; every task has at least 15 steps, with an average of 22.1 steps [7]
- The dataset covers two categories: general application scenarios with 147 end-to-end task chains averaging 19.5 steps, and gaming scenarios with 207 high-complexity chains averaging 23.7 steps, the longest reaching 37 steps [7]

Group 3: Core Methodology
- LongHorizonUI consists of three core modules: Multi-modal Enhanced Perception (MEP), Deep Reflective Decision (DRD), and the Compensatory Executor (CAE), forming a complete loop from perception to execution [9][19]
- MEP strengthens perception by assigning unique spatial index IDs to UI elements and resolving ambiguities in composite controls through a semantic binding mechanism [12]
- DRD enforces a three-level reasoning process to ensure decision accuracy: historical validation, goal checking, and action interpretability [12]
- CAE maps decision outputs to physical screen coordinates, employing multiple strategies to ensure successful execution [13]

Group 4: Experimental Results
- LongHorizonUI showed significant advantages on long-horizon tasks, achieving success rates of 85.3% on low-level instructions and 52.3% on high-level instructions in general scenarios, outperforming previous methods [15]
- In gaming scenarios, success rates were 83.9% (low-level) and 52.1% (high-level), with an overall average of 77.3% [15]
- The framework also reached 90.4% average accuracy on the ScreenSpot cross-platform UI element localization benchmark, demonstrating robustness across platforms [15]
- In a 50-step long-chain setting, LongHorizonUI reached a 29.4% success rate, surpassing previous benchmarks [16]

Group 5: Conclusion
- LongHorizonUI provides a comprehensive solution for long-horizon GUI automation, effectively mitigating error accumulation through its structured design, while the LongGUIBench benchmark offers a standardized evaluation platform for future research [19]
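The MEP → DRD → CAE loop described above can be sketched as three composable functions. Everything below is invented for illustration: the screen model, element labels, and the simple label-matching heuristic standing in for DRD's reasoning; only the module boundaries and the three-level check order follow the article.

```python
# Hedged sketch of LongHorizonUI's perception-decision-execution loop.
# The interfaces and screen model are hypothetical; only the
# MEP -> DRD -> CAE structure follows the article's description.

def mep(screen):
    """Multi-modal Enhanced Perception: give each UI element a spatial index ID."""
    return {idx: elem for idx, elem in enumerate(screen)}

def drd(elements, goal, history):
    """Deep Reflective Decision: three-level check before choosing an action."""
    # level 1, historical validation: skip actions already executed
    candidates = [i for i in elements if ("click", i) not in history]
    # level 2, goal checking: prefer elements whose label matches the goal
    for i in candidates:
        if goal in elements[i]["label"]:
            # level 3, interpretability: return the action with a stated reason
            return ("click", i, f"label matches goal '{goal}'")
    return ("stop", None, "no matching element")

def cae(action, elements):
    """Compensatory Executor: map the chosen element ID to screen coordinates."""
    _, idx, _ = action
    return None if idx is None else elements[idx]["xy"]

screen = [{"label": "Cancel", "xy": (40, 300)},
          {"label": "Save file", "xy": (200, 300)}]
elements = mep(screen)
action = drd(elements, "Save", history=set())
coords = cae(action, elements)
```

Keeping the decision step ignorant of pixel coordinates (it only sees index IDs) is what makes the final CAE mapping a separable, compensable step, which matches the article's claim that the executor can apply multiple strategies without touching the decision logic.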
Call for papers opens for the 2nd CVPR 2026 CV4CHL Workshop: safeguarding children's futures with large AI models
机器之心· 2026-01-22 03:13
Core Insights
- The article discusses the rapid development of multimodal large language models and embodied AI, noting that AI and computer vision technologies focused on children's development, health, and education are still in their infancy [2]
- The CV4CHL workshop aims to bridge interdisciplinary perspectives on pediatric AI and computer vision solutions, addressing critical gaps in the field [2]

Event Details
- The CV4CHL workshop is organized by PediaMed AI in collaboration with several prestigious institutions, including the University of Illinois Urbana-Champaign, Hong Kong University of Science and Technology (Guangzhou), ETH Zurich, and Shenzhen Children's Hospital [2]
- The workshop will take place during CVPR 2026, scheduled for June 3-7, 2026, in Denver, Colorado, USA [6][7]

Key Topics
- The workshop will cover themes including:
  - Foundation models inspired by human children's learning and cognitive abilities, and cutting-edge research on multimodal large language models [6]
  - Brain-computer interface technologies for children [6]
  - Frontiers in human-computer interaction with augmented reality glasses and smart glasses for children [6]
  - Applications of embodied AI in pediatrics [6]
  - Computer vision and foundation models for children's cognitive development, such as gaze and gesture analysis [6]
  - Pediatric smart healthcare, including early disease screening and medical image and video analysis [6]
  - AI-enabled education, including smart educational tools and assistive technologies for children with special needs [6]
  - AI support for children's and adolescents' mental health [6]
  - Ethical and social implications of children's AI technologies, including privacy protection and human-robot interaction [6]

Submission Information
- The submission deadline is March 31, 2026, with notification of review results by April 8, 2026 [6]
- The workshop will feature both proceeding and non-proceeding submission tracks, with specific page limits for each [8]
[In-Depth Report / WeRide] Rooted in China, pushing overseas: a leader in RoboX commercialization
东吴汽车黄细里团队· 2026-01-13 13:41
Investment Highlights
- WeRide, established in 2017, is a leading L4 autonomous driving company with a diverse product line including Robotaxi, Robobus, Robovan, and Robosweeper, alongside L2+ driver-assistance services. As of Q3 2025, total revenue reached 171 million yuan, up 144% year-on-year, with the Robotaxi business as the core growth driver, contributing approximately 35.3 million yuan (up 761.0% year-on-year) and accounting for 20.7% of total revenue. Gross margin stood at 32.9%, with a net loss of 307.3 million yuan. As of September 30, 2025, cash and capital reserves amounted to 5.4 billion yuan, supporting R&D investment and scaling expansion for long-term competitiveness [3][4]

L4 Industry Overview
- The company is the only one globally to have obtained autonomous driving licenses in eight countries. In China, it has achieved fully unmanned commercial operations in Beijing and Guangzhou, with each Robotaxi completing up to 25 rides per day during operating hours. The company has also received qualification for unmanned demonstration applications in Shanghai [4][5]
- The Robotaxi business is accelerating toward a commercial turning point, with a clear path to profitability. The integration of end-to-end architecture and advanced technologies has significantly improved safety and reduced accident rates compared to human drivers. The BOM cost has fallen from over 1 million yuan to below 300,000 yuan, with ongoing optimization of the unit economics [5][30]
- The Robotaxi market in China is expected to reach 200 billion yuan by 2030, with the potential to replace parts of the traditional and private transportation markets. The theoretical reach of Robotaxi in developed and underdeveloped regions is estimated at 4.4 and 3.4 times that of the Chinese market, respectively [5][41][45]

Company Analysis
- WeRide is positioned as a technology leader in the Robotaxi sector, benefiting from gradual policy opening, continuous breakthroughs in autonomous driving technology, and supply-chain cost reductions. Revenue projections for 2025-2027 are 555 million, 945 million, and 1.987 billion yuan respectively, with corresponding price-to-sales ratios of 43.0, 25.2, and 12.0 [7]
- The company has a solid equity structure with significant investment from industry players, including Nvidia and major automotive companies, totaling over 1.1 billion USD in funding prior to its IPO [56][57]
- The governance structure is stable, with the founder holding significant voting power, ensuring effective decision-making [60]

Financial Analysis
- The company is currently in an investment phase, with both L2+ and L4 businesses contributing revenue. Service revenue has grown rapidly through partnerships, while product revenue comes from its various autonomous vehicle models [66]
- Cash reserves are robust, with 5.4 billion yuan available as of Q3 2025, bolstered by successful financing rounds, ensuring liquidity and supporting ongoing operations [71]

Technological Core
- WeRide's competitive edge rests on its self-developed technology stack, including the WeRideOne platform, which integrates advanced driving algorithms and a comprehensive sensor suite for enhanced safety and operational efficiency [75]
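The price-to-sales ratios quoted above can be sanity-checked against the revenue projections: since P/S = market capitalization / revenue, each forecast year's revenue times its P/S ratio should imply roughly the same market capitalization. The figures below come from the report itself; the check logic is ours.

```python
# Sanity-checking the report's price-to-sales ratios against its revenue
# projections: revenue x P/S should imply roughly the same market cap for
# every forecast year (P/S = market cap / revenue).

revenue_yuan = {2025: 555e6, 2026: 945e6, 2027: 1.987e9}   # from the report
ps_ratio = {2025: 43.0, 2026: 25.2, 2027: 12.0}            # from the report

implied_caps = {y: revenue_yuan[y] * ps_ratio[y] for y in revenue_yuan}
# each year comes out near 23.8-23.9 billion yuan: one fixed market
# capitalization divided by rapidly growing revenue, as expected
```

The three implied values agree to within about 0.2%, confirming the ratios were computed from a single valuation against the three revenue forecasts.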
Zhuoyu founder Shen Shaojie: in 2026, intelligent driving must go from "end-to-end" to "end-to-everywhere"
Xin Lang Cai Jing· 2026-01-11 05:53
Core Insights
- The autonomous driving industry is experiencing significant turbulence: companies like Maomao Zhixing are collapsing despite strong backing and funding, while others like Zhuoyu Technology secure substantial investment [2]
- The competitive landscape has shifted from rule-driven to data-driven models, emphasizing rapid iteration and efficient development cycles [3][4]

Company Developments
- Zhuoyu Technology announced a strategic investment exceeding 3.6 billion yuan from China FAW, highlighting its growth amid industry challenges [2]
- Zhuoyu founder Shen Shaojie noted that the company's model iteration cycle has been reduced to weekly updates, cutting project delivery times from six months to just over one month [3]

Industry Trends
- Companies that fail to transition to a data-driven development paradigm risk being eliminated from the market [4][5]
- The core competitive factor in intelligent driving is the ability to combine data-driven approaches with traditional manufacturing processes [5]

Transformation Challenges
- Transitioning to a data-driven model has been difficult for teams accustomed to rule-based systems, as exemplified by Zhuoyu's decision to delete its original codebase [6]
- The company has shifted its safety approach from relying on numerous rules to a comprehensive evaluation system, emphasizing data quality over quantity [6]

Engineering and Operational Changes
- Integrating data-driven methodology into all aspects of operations is crucial to the success of intelligent driving solutions [7]
- Zhuoyu's engineering processes have evolved toward a disciplined approach to problem-solving that avoids adding rules that could complicate the models [10]

Future Outlook
- Competition in the intelligent driving industry is expected to intensify, with significant breakthroughs anticipated in 2026 [10][11]
- Zhuoyu aims to expand its technology across vehicle models and scenarios, leveraging a "base model" strategy that allows customization by automotive manufacturers [13]
The ultimate spatial intelligence challenge: MMSI-Video-Bench is here
具身智能之心· 2026-01-06 00:32
Core Insights
- The article discusses the launch of MMSI-Video-Bench, a comprehensive benchmark for evaluating spatial intelligence in multimodal large language models (MLLMs), emphasizing that models must understand and interact with complex real-world environments [1][5][25]

Group 1: Benchmark Features
- MMSI-Video-Bench takes a systematic approach to assessing models' spatial perception, focusing on spatial construction and motion understanding [5][6]
- The benchmark evaluates high-level decision-making based on spatiotemporal information, including memory updating and multi-view integration [6][7]
- It consists of five main task types and 13 subcategories, covering planning and prediction capabilities [9]

Group 2: Model Performance
- Even the best-performing model, Gemini 3 Pro, achieved only 38% accuracy, nearly 60 percentage points below human level [10][14]
- The evaluation highlighted deficiencies in models' spatial construction, motion understanding, planning, and prediction [14][16]
- Detailed error analysis identified five main error types affecting performance, including detailed grounding errors and geometric reasoning errors [16][20]

Group 3: Data Sources and Evaluation
- The video data for MMSI-Video-Bench is sourced from 25 public datasets and one self-built dataset, spanning a variety of real-world scenarios [11]
- The benchmark allows targeted assessment of specific capabilities in indoor scene perception, robotics, and grounding [11]

Group 4: Future Directions
- The article suggests that introducing 3D spatial cues could enhance models' understanding and reasoning [21][26]
- It emphasizes the ongoing challenge of designing spatial cues that models can effectively use, noting that current failures stem from fundamental reasoning limitations rather than a lack of explicit reasoning steps [26]
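Scores like the 38% overall accuracy above are aggregates over many task types, and the per-type breakdown is what exposes the deficiency pattern. A minimal sketch of that aggregation, using fabricated records purely for illustration (the real benchmark's answers and task labels are not reproduced here):

```python
# Toy sketch of per-task-type accuracy aggregation, the kind of breakdown
# behind an overall benchmark score. Records are fabricated for illustration.
from collections import defaultdict

def accuracy_by_task(records):
    """records: iterable of (task_type, is_correct) -> (per-task, overall)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for task, correct in records:
        totals[task] += 1
        hits[task] += int(correct)
    per_task = {t: hits[t] / totals[t] for t in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return per_task, overall

records = [("spatial_construction", False), ("spatial_construction", True),
           ("motion_understanding", False), ("planning", True),
           ("prediction", False)]
per_task, overall = accuracy_by_task(records)
```

Reporting both numbers matters: an overall figure can hide a task type where the model is near chance, which is exactly the kind of bottleneck the benchmark's error analysis surfaces.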
A survey of nearly 300 works! The development of manipulation tasks through the lens of "high-level planning and low-level control"
具身智能之心· 2026-01-06 00:32
Core Insights
- The article discusses transformative advances in robotic manipulation driven by rapid progress in visual, language, and multimodal learning, emphasizing the role of large foundation models in enhancing robots' perception and semantic representation capabilities [1][2]

Group 1: High-Level Planning
- High-level planning clarifies action intentions, organizes sequences, and allocates environmental attention, providing structured guidance for low-level execution [4]
- Its core components are task decomposition and decision guidance, integrating multimodal information to answer "what to do" and "in what order" [4]
- Task planning based on large language models (LLMs) maps natural language to task steps, with methods like SayCan and Grounded Decoding improving skill selection and planning [5]
- Multimodal large language models (MLLMs) break the limitations of pure text input by integrating visual and language reasoning, with models like PaLM-E and VILA performing strongly on embodied tasks [8]
- Code generation techniques convert plans into executable programs, improving the precision of language-based plans through methods like Code as Policies and Demo2Code [9]
- Motion planning uses LLMs and VLMs to generate continuous motion targets, linking high-level reasoning with low-level trajectory optimization [10]
- Affordance learning focuses on establishing intrinsic associations between perception and action across geometric, visual, semantic, and multimodal dimensions [11]
- 3D scene representation turns environmental perception into structured action proposals, bridging perception and action through techniques like Gaussian splatting [12]

Group 2: Low-Level Learning Control
- Low-level control translates high-level planning into precise physical actions, addressing the "how to do it" aspect of robotic manipulation [14]
- Learning strategies for skill acquisition fall into three main types, including pre-training and model-free reinforcement learning [16]
- Input modeling defines how robots perceive the world, emphasizing the integration of multimodal signals through reinforcement learning and imitation learning [18]
- Visual-action models use both 2D and 3D visual inputs to improve action generation, while vision-language-action models integrate semantic, spatial, and temporal information [19]
- Additional modalities such as tactile and auditory signals improve robustness in contact-rich manipulation [20]

Group 3: Challenges and Future Directions
- Despite significant progress, robotic manipulation faces four core challenges: the lack of universal architectures, data and simulation bottlenecks, insufficient multimodal physical interaction, and safety and collaboration issues [23][27][28][29]
- Future research directions include developing a "robotic brain" with flexible modal interfaces, establishing autonomous data-collection mechanisms, enhancing multimodal physical interaction, and ensuring safety in human-robot collaboration [30]
- The review calls for a unified framework that integrates high-level planning and low-level control, focusing on overcoming data efficiency, physical interaction, and safety collaboration bottlenecks to move robotic manipulation from the laboratory to real-world applications [31]
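The SayCan-style skill selection mentioned under high-level planning combines two signals: how relevant the language model thinks a skill is to the instruction, and how feasible ("affordable") the skill is in the robot's current state. A minimal sketch of that selection rule; the skill names and scores below are made up, and a real system would query an LLM and a learned value function rather than use hard-coded dictionaries.

```python
# Hedged sketch of SayCan-style skill selection: the chosen skill maximizes
# (language-model relevance) x (affordance feasibility). All numbers are
# invented for illustration.

def select_skill(llm_scores, affordances):
    """Pick the skill with the highest combined relevance-times-feasibility."""
    combined = {s: llm_scores[s] * affordances.get(s, 0.0) for s in llm_scores}
    return max(combined, key=combined.get)

# Instruction "bring me the apple": the LLM prefers pick_apple, but the robot
# is far from the apple, so the feasible first step is to navigate.
llm_scores  = {"pick_apple": 0.6, "go_to_kitchen": 0.3, "open_fridge": 0.1}
affordances = {"pick_apple": 0.1, "go_to_kitchen": 0.9, "open_fridge": 0.4}
first_skill = select_skill(llm_scores, affordances)
```

The product form is the key design choice: a skill that is highly relevant but currently infeasible (or feasible but irrelevant) scores low, so the plan stays grounded in what the robot can actually do right now.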
The ultimate spatial intelligence challenge MMSI-Video-Bench is here, and every top large model fails
机器之心· 2026-01-05 08:54
Core Insights
- The article discusses the importance of spatial understanding in multimodal large language models (MLLMs) for their transition into real-world applications as "general intelligent assistants" [2]
- It highlights the limitations of existing spatial-intelligence benchmarks, which either rely heavily on template generation or focus on narrow spatial tasks, making it difficult to comprehensively assess models' spatial understanding and reasoning in real-world scenarios [2]

Group 1: Introduction of MMSI-Video-Bench
- The Shanghai Artificial Intelligence Laboratory's InternRobotics team has launched a comprehensive and rigorous spatial-intelligence video benchmark, MMSI-Video-Bench, designed to challenge today's mainstream multimodal models [2][6]
- The benchmark evaluates models' spatial perception, reasoning, and decision-making in complex, dynamic real-world environments [2][7]

Group 2: Benchmark Characteristics
- MMSI-Video-Bench features a systematic design of question types that assess models' basic spatial perception based on spatiotemporal information [6]
- It includes high-level decision-making evaluations and extends task categories to cover complex real-world scenarios, testing cross-video reasoning, memory updating, and multi-view integration [6][8]
- The benchmark consists of five major task types and 13 subcategories, ensuring a comprehensive evaluation of spatial intelligence [10]

Group 3: Challenge and Performance
- The questions are highly challenging: all models tested, including the best-performing Gemini 3 Pro, achieved at most a 38% accuracy rate, roughly 60 percentage points below human level [10][14]
- The evaluation reveals that models struggle with spatial construction, motion understanding, planning, prediction, and cross-video reasoning, highlighting critical bottlenecks in their capabilities [14][15]

Group 4: Error Analysis
- The research team identified five main error types affecting model performance: detailed grounding errors, ID mapping errors, latent logical inference errors, prompt alignment errors, and geometric reasoning errors [17][21]
- Geometric reasoning errors were the most prevalent, significantly hurting performance, particularly on spatial construction tasks [19][21]

Group 5: Future Directions
- The article suggests that introducing 3D spatial cues could help models understand spatial relationships, a potential direction for future research [22][24]
- It emphasizes the need to design spatial cues that models can truly understand and use, since current failures are attributed to underlying reasoning capabilities rather than a lack of explicit reasoning steps [27]