Multimodal Large Models
Meituan Releases Native Multimodal Model LongCat-Next
Bei Jing Shang Bao· 2026-03-27 15:19
Core Viewpoint
- Meituan has released and fully open-sourced its native multimodal model LongCat-Next, along with its core component, the discrete native resolution visual tokenizer (dNaViT), which shifts the focus from a language-centric architecture to a unified mapping of images, speech, and text into discrete tokens [1]

Group 1
- The LongCat-Next model breaks the traditional piecemeal architecture of large models by integrating visual and auditory data as the "native language" of AI [1]
- The model employs a pure "next token prediction" paradigm, enhancing the interaction between visual and auditory inputs [1]
Meituan Open-Sources Native Multimodal Model LongCat-Next
Xin Lang Cai Jing· 2026-03-27 04:01
Core Viewpoint
- Meituan has launched and fully open-sourced its native multimodal model LongCat-Next, which breaks the traditional language-centric architecture of large models by unifying images, speech, and text into a common discrete token format [1]

Group 1
- The LongCat-Next model utilizes a pure "next token prediction" paradigm, allowing visual and auditory inputs to become the "native language" of AI [1]
- This development represents a significant step by the Meituan LongCat team towards achieving AI that interacts with the physical world [1]
Meituan Releases Native Multimodal Model LongCat-Next
Xin Lang Cai Jing· 2026-03-27 03:44
Core Insights
- Meituan has released and fully open-sourced its native multimodal large model LongCat-Next along with its core component, the discrete native resolution visual tokenizer (dNaViT) [1][2]
- The model breaks the traditional language-centric architecture of current large models by unifying images, speech, and text into a common discrete token representation [1][2]
- LongCat-Next employs a pure "Next Token Prediction" (NTP) paradigm, allowing vision and speech to become the "native language" of AI [1][2]

Technical Breakthroughs
- The discrete native autoregressive architecture (DiNA) completely breaks the modality barrier [1][2]
- The discrete native resolution visual tokenizer (dNaViT) constructs a "dictionary" for the visual world [1][2]
- The semantic alignment complete encoder addresses the industry challenge of "information loss due to discretization" [1][2]
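The unified-token idea above can be made concrete with a toy sketch: map visual codebook indices into the same id space as text tokens, so a single next-token predictor sees one stream. The vocabulary sizes and offset scheme here are illustrative stand-ins, not LongCat-Next's or dNaViT's actual configuration.

```python
# Toy sketch of a shared discrete-token stream for pure NTP training.
# TEXT_VOCAB and IMAGE_CODEBOOK are hypothetical sizes, not real values.
TEXT_VOCAB = 1000       # text token ids occupy 0..999
IMAGE_CODEBOOK = 512    # dNaViT-style visual codes, offset past text ids

def image_code_to_token(code: int) -> int:
    """Place a visual codebook index into the shared vocabulary by
    offsetting past the text range, so both modalities share one id space."""
    assert 0 <= code < IMAGE_CODEBOOK
    return TEXT_VOCAB + code

def build_sequence(text_ids, image_codes):
    """Concatenate modalities into one autoregressive target sequence."""
    return list(text_ids) + [image_code_to_token(c) for c in image_codes]

seq = build_sequence([5, 42, 7], [0, 511])
# Next-token-prediction training pairs are (prefix, next id),
# identical in form for text positions and image positions.
pairs = [(seq[:i], seq[i]) for i in range(1, len(seq))]
```

With one id space, the model needs no modality-specific heads: predicting the next image code is the same task as predicting the next word.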
Making Large Models Speak from "Image Facts": Factual Text + Adaptive Editing Leaves Language Bias Nowhere to Hide | ICLR'26
量子位· 2026-03-26 07:34
Core Insights
- The article discusses the challenges of object hallucination in large visual language models (LVLM), where models may generate incorrect or non-existent objects based on language bias rather than visual evidence [4][6]
- A new framework called AFTER (Adaptive Factual-guided Visual-Textual Editing for hallucination mitigation) is introduced, which aims to reduce hallucinations while maintaining low inference costs [6][19]

Group 1: AFTER Framework
- AFTER consists of two main modules: Factual-Augmented Activation Steering (FAS) and Query-Adaptive Offset Optimization (QAO) [9][10]
- FAS extracts factual information from ground-truth annotations to create a reliable textual description that guides the model's activation editing [9][10]
- QAO adapts the editing process based on the specific question asked, allowing for more precise adjustments to the model's output [10][11]

Group 2: Experimental Results
- The AFTER framework significantly outperforms existing methods in reducing hallucinations while incurring minimal additional inference costs [12][15]
- In various evaluations, AFTER achieved an average increase of +130.7 in overall performance metrics across three LVLMs, indicating enhanced visual alignment and reliability [15][19]
- The model operates efficiently at a speed of 29.7 tokens/s with moderate memory usage of approximately 16.3GB [17][19]

Group 3: Implications and Future Directions
- AFTER provides a practical approach to mitigating hallucinations without the need for retraining or fine-tuning the main model, making deployment more manageable [19][20]
- The framework explicitly addresses language bias through factual semantics, offering a more direct solution compared to traditional visual perturbation methods [19]
- Future developments may focus on enhancing domain-specific visual perception and bias mitigation, particularly in specialized fields like healthcare [19]
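Activation steering of the kind FAS and QAO describe can be sketched in a few lines: derive a direction from "factual" versus "neutral" activations, then offset a hidden state along that direction with a query-dependent strength. The vectors and the scalar strength here are illustrative stand-ins, not AFTER's actual extraction or optimization procedure.

```python
# Minimal activation-steering sketch (pure Python, no framework).
# All activations below are made-up example values.
def mean_vec(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def steering_vector(factual_acts, neutral_acts):
    """FAS-style direction: mean factual activation minus mean neutral one."""
    f, g = mean_vec(factual_acts), mean_vec(neutral_acts)
    return [a - b for a, b in zip(f, g)]

def edit_hidden(h, v, alpha):
    """QAO-style edit: shift the hidden state h along v; alpha stands in
    for the per-query strength that QAO would optimize."""
    return [x + alpha * y for x, y in zip(h, v)]

v = steering_vector([[1.0, 0.0], [3.0, 0.0]], [[0.0, 0.0], [0.0, 2.0]])
h = edit_hidden([0.5, 0.5], v, 0.5)
```

The appeal of this family of methods, as the article notes, is that the edit happens at inference time: no retraining of the base model is required.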
A-Share Headlines: COSCO Shipping Resumes Bookings to Gulf Countries, Vessels to Avoid the Strait of Hormuz for Now; Latest Tesla Robot Video Revealed, Musk Says Mass Production Possible by 2027
Sou Hu Cai Jing· 2026-03-25 23:54
Group 1
- The National Supercomputing Internet has launched a new token giveaway, offering users a maximum of 30 million tokens to lower the entry barrier for AI applications [1]
- The Ministry of Commerce of China has initiated an investigation into Mexico's increased import tariffs on products from non-free trade partners, identifying these measures as trade investment barriers [1]

Group 2
- COSCO Shipping has resumed booking services to Gulf countries, although vessels will not pass through the Strait of Hormuz, instead using land transport for deliveries [2]
- Mercedes-Benz has announced the integration of a multimodal large model developed in collaboration with Tsinghua University into its new generation S-Class vehicles, marking a significant step in automotive intelligence [3]

Group 3
- Tesla has revealed new details about its Optimus robot, with plans for mass production by 2027, which could significantly impact labor and manufacturing economics [5]
- The U.S. stock market has shown positive movement, with major indices rising, driven by technology stocks and a general optimism regarding Middle East peace negotiations [6][7]

Group 4
- Shenzhen has launched an action plan to accelerate the high-quality development of the AI server industry, aiming for significant growth in production capacity and market share by 2028 [12][13]
- The State-owned Assets Supervision and Administration Commission is promoting the digital transformation of financial management in central enterprises, leveraging new technologies like big data and AI [14]

Group 5
- Several companies have reported significant profit growth for 2025, including Huagong Technology with a 20% increase and Yuloka with a 92.73% increase, indicating strong performance in their respective sectors [16]
Zhejiang University Team Cracks Multimodal Models' "Blind Confidence": Calibrate Confidence First, Then Allocate Compute | CVPR'26
量子位· 2026-03-22 04:18
Core Insights
- The article discusses the issue of "blind confidence" in multimodal large models, where models maintain high confidence levels even when visual input quality deteriorates significantly, leading to hallucinations and misjudgments [2][4][6]
- A new framework called CA-TTS (Confidence-Aware Test-Time Scaling) is proposed to address this issue by calibrating the model's self-assessment capabilities through confidence-driven reinforcement learning [4][15]

Group 1: Problem Identification
- A study conducted by a research team from Zhejiang University, Alibaba, City University of Hong Kong, and the University of Michigan revealed that as image quality degrades, model accuracy drops sharply while confidence remains unchanged [2][4]
- This phenomenon is termed "perceptual bluntness," indicating a lack of sensitivity to changes in visual information quality [7][9]

Group 2: Proposed Solutions
- The training phase employs a method called CDRL (Confidence-Driven Reinforcement Learning) to align visual perception with confidence levels, encouraging models to differentiate between clear and unclear visual inputs [9][10]
- CDRL utilizes a dual reward mechanism: one for encouraging sensitivity to visual degradation and another for maintaining honesty in self-assessment [11][12]

Group 3: Performance Improvements
- The implementation of CA-TTS resulted in significant performance improvements across four mainstream visual reasoning benchmarks, with an average increase of 8.8% over existing methods [4][19]
- In the Math-Vision benchmark, accuracy improved from 23.0% to 42.4%, nearly doubling the baseline performance [19]

Group 4: Methodology and Results
- The CA-TTS framework consists of three modules: Self-Consistency, Self-Reflection, and Self-Check, which work together to enhance decision-making during inference [15][17]
- Experimental results indicate that CA-TTS outperforms traditional methods like Majority Voting and DeepConf in terms of accuracy and efficiency, with a scaling efficiency that is 2.2 times and 3.1 times higher, respectively [27][28]

Group 5: Theoretical Implications
- The research shifts the paradigm from "reasoning first, perception second" to "perception first, reasoning second," emphasizing the importance of reliable perception in complex reasoning tasks [29][30]
- This approach aims to ensure that multimodal large models can accurately assess their confidence levels, particularly in high-risk scenarios [29][30]
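The contrast with plain majority voting can be illustrated with a small sketch: given repeated inference samples, weight each candidate answer by the model's calibrated confidence rather than counting votes equally. This is a generic confidence-weighted aggregation, not CA-TTS's actual Self-Consistency/Self-Reflection/Self-Check pipeline.

```python
# Confidence-weighted answer aggregation over repeated inference samples.
# A generic sketch of the idea, not the CA-TTS implementation.
from collections import defaultdict

def confidence_weighted_vote(samples):
    """samples: list of (answer, confidence) pairs from repeated passes.
    Returns the answer with the highest summed confidence."""
    scores = defaultdict(float)
    for answer, conf in samples:
        scores[answer] += conf
    return max(scores, key=scores.get)

# One high-confidence vote can outweigh two low-confidence ones,
# which plain majority voting would get wrong.
winner = confidence_weighted_vote([("A", 0.9), ("B", 0.4), ("B", 0.4)])
```

This only helps if confidence is actually calibrated, which is precisely the gap the CDRL training stage described above is meant to close.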
A SOTA Visual Agent in 500 Lines of Code! UniPat AI's Latest Open-Source Release
量子位· 2026-03-16 07:14
Core Insights
- The article discusses the impressive advancements in multimodal large models' coding capabilities, while highlighting their frequent errors in basic visual tasks [1][2]
- UniPat AI's SWE-Vision framework allows models to write and execute Python code to enhance their visual judgment accuracy, achieving state-of-the-art results across five major visual benchmarks [1][5]

Group 1: Model Performance and Limitations
- Multimodal large models have shown remarkable progress in coding, comparable to experienced engineers, but struggle with understanding the visual world accurately [2][3]
- The BabyVision benchmark revealed that models often provide seemingly reasonable reasoning but fail in basic measurements, counting, and spatial relationship judgments [2][3]

Group 2: SWE-Vision Framework
- SWE-Vision is a minimalist visual intelligence framework that enables models to utilize coding as a tool to compensate for visual processing inaccuracies [3][6]
- The framework includes a simple tool layer with only two functions: execute_code for running Python in a persistent Jupyter environment and finish for outputting final answers [7][8]

Group 3: Execution and Iteration
- SWE-Vision operates through a standard agentic loop, allowing the model to organize user queries and images, execute code, and iterate based on results until a final answer is reached [9][15]
- The persistent Jupyter kernel allows for state retention across multiple calls, enabling step-by-step analysis similar to human analysts [11][18]

Group 4: Results and Implications
- SWE-Vision achieved significant improvements over leading visual language models, with notable scores across benchmarks: 64.4 on BabyVision, 94.0 on MathVision, 50.1 on Zero-Bench-Sub, 69.0 on OmniSpatial, and 82.5 on CharXiv-RQ [5][22]
- The framework demonstrates that introducing coding capabilities can systematically elevate the visual performance of advanced models, particularly in basic perception and precise processing tasks [20][28]

Group 5: Future Directions
- Future developments aim to integrate coding as an inherent capability of visual intelligence agents, enhancing their ability to perceive, act, and reflect [30][31]
- Key areas for improvement include recognizing when visual reasoning requires code assistance, validating intermediate results, and seamlessly merging observation with computation [32]
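The two-tool agentic loop described above can be sketched compactly. The tool names execute_code and finish come from the article; the model stub, the class name, and the use of a shared namespace to mimic a persistent Jupyter kernel are illustrative assumptions, not UniPat AI's code.

```python
# Minimal sketch of a two-tool visual agent loop with persistent state.
import contextlib
import io

class MiniVisualAgent:
    def __init__(self):
        self.ns = {}  # shared namespace: state persists across calls,
                      # standing in for a persistent Jupyter kernel

    def execute_code(self, code: str) -> str:
        """Run a code snippet in the persistent namespace, capture stdout."""
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            exec(code, self.ns)
        return buf.getvalue()

    def run(self, steps):
        """steps: model decisions, each ("execute_code", source) or
        ("finish", answer). Loop until the model calls finish."""
        for action, payload in steps:
            if action == "finish":
                return payload
            self.execute_code(payload)  # observation fed back to the model

agent = MiniVisualAgent()
first = agent.execute_code("x = 6\nprint(x)")
second = agent.execute_code("print(x * 7)")  # x survives from the first call
```

The persistence is the interesting part: a variable bound in one execute_code call (say, a cropped image region) remains available in the next, which is what enables the step-by-step, analyst-style workflow the article describes.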
UniPat AI Open-Sources SWE-Vision: A SOTA Visual Agent in 500 Lines of Code!
机器之心· 2026-03-16 01:31
Core Insights
- The article discusses the impressive advancements in multimodal large models' coding capabilities, while highlighting their frequent errors in basic visual tasks. UniPat AI has developed a minimalist visual intelligence framework called SWE-Vision, which allows models to write and execute Python code to process and validate their visual judgments. SWE-Vision has achieved state-of-the-art results across five mainstream visual benchmark tests [1][3][9].

Group 1: Model Limitations and Observations
- Multimodal large models have made significant progress in coding, comparable to experienced engineers, but struggle with understanding the visual world, often making errors in basic measurements, counting, and spatial relationships [3][4].
- The BabyVision benchmark revealed that models often provide seemingly reasonable reasoning but fail in fundamental visual processing tasks, indicating a gap in their capabilities [3][4].
- A key observation is that while models can "see," they often cannot process visual information accurately, prompting the idea of using coding as a tool to enhance visual processing precision [5][7].

Group 2: SWE-Vision Framework
- SWE-Vision is designed as a minimalist visual intelligence agent, focusing on two main tools: execute_code and finish, allowing models to utilize familiar programming actions without overwhelming them with specialized visual APIs [10][11][12].
- The framework includes a standard agentic loop that enables the model to organize user queries and images, execute code, and return results for further decision-making [13][16].
- SWE-Vision operates in a persistent Jupyter environment, allowing for state retention across multiple code executions, which facilitates a more human-like iterative analysis process [14][21].

Group 3: Performance and Results
- SWE-Vision has shown remarkable improvements in five different visual benchmark tests, enhancing the performance of leading large language models (LLMs) such as GPT-5.2-xhigh and Seed-2.0-Pro [9][30].
- The results indicate that the introduction of code execution capabilities systematically elevates the visual performance limits of advanced models, particularly in basic perception and precise processing tasks [28][34].
- The framework's design allows for multi-step analysis and verification, contrasting with traditional models that rely on intuitive observation [24][25].

Group 4: Future Directions
- The article suggests that future developments should focus on integrating "code-enhanced vision" as a native capability of visual intelligence agents, requiring a shift towards interactive environments that support reinforcement learning and tool usage [36][37].
- Key future directions include learning to identify when visual reasoning requires code assistance, actively verifying intermediate results, and seamlessly integrating observation with computation [39][40].
ICLR 2026 | Do Multimodal Large Models Really Understand Emotion? MME-Emotion Gives a Systematic Answer
机器之心· 2026-03-15 01:20
Core Viewpoint
- Multimodal Large Language Models (MLLMs) are rapidly transforming AI capabilities, particularly in understanding human emotions through various modalities [2][3].

Group 1: MME-Emotion Benchmark
- MME-Emotion is a comprehensive evaluation benchmark for emotional intelligence in MLLMs, developed by a team from The Chinese University of Hong Kong and Alibaba's Tongyi Laboratory, and accepted at ICLR 2026 [3].
- It is one of the largest multimodal emotional intelligence evaluation benchmarks, containing approximately 6,500 video segments and corresponding Q&A data, covering 27 real-world scenarios and designed with 8 different emotional tasks [5].
- The benchmark emphasizes the integration of multimodal information in real environments, requiring models to understand visual, auditory, and linguistic information simultaneously [5].

Group 2: Evaluation Tasks and Metrics
- The tasks include laboratory emotion recognition, real-world emotion recognition, noise condition emotion recognition, fine-grained emotion recognition, multi-label emotion recognition, sentiment analysis, fine-grained sentiment analysis, and intent recognition [8].
- MME-Emotion evaluates both emotion recognition and reasoning capabilities, distinguishing between merely guessing the correct emotion label and genuinely understanding the underlying emotional cues [8].
- A unified evaluation metric system is proposed, including Recognition Score, Reasoning Score, and Chain-of-Thought Score, to assess the accuracy of emotion predictions, the rationality of reasoning processes, and the overall performance [10].

Group 3: Model Performance and Challenges
- The evaluation of 20 mainstream multimodal models revealed that even the best-performing models scored below 40% in emotion recognition and around 56% in Chain-of-Thought Score, indicating significant shortcomings in emotional intelligence [13].
- Key issues identified include insufficient fine-grained visual understanding, limited multimodal information fusion capabilities, and a correlation between reasoning ability and emotion recognition performance [14][15][16].
- The findings suggest that enhancing models' reasoning processes may be a crucial pathway to improving emotional intelligence [16].

Group 4: Future Directions
- Future advancements in multimodal emotional intelligence may rely on higher precision in visual detail modeling, more effective methods for fusing auditory and visual information, and reasoning mechanisms that can explain the causes of emotions [16].
- The release of MME-Emotion provides a unified evaluation standard and a clear reference baseline for subsequent model improvements [17].
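A metric layout like the one described, a recognition score for label accuracy alongside a reasoning score, combined into an overall figure, can be sketched as follows. The helper names, the simple label-match rule, and the convex-combination weight are illustrative assumptions; MME-Emotion's actual scoring formulas (including how the Chain-of-Thought Score is judged) may differ.

```python
# Sketch of a two-part emotion-evaluation metric: label accuracy plus a
# separately judged reasoning score, combined into one overall number.
# The weighting scheme is a stand-in, not MME-Emotion's definition.
def recognition_score(preds, golds):
    """Fraction of samples where the predicted emotion label matches."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def overall_score(rec, rea, w_rec=0.5):
    """Convex combination of recognition and reasoning scores."""
    return w_rec * rec + (1 - w_rec) * rea

rec = recognition_score(["joy", "anger"], ["joy", "sadness"])
total = overall_score(rec, 0.7)  # 0.7: stand-in reasoning score
```

Separating the two scores is what lets a benchmark distinguish a model that guesses the right label from one whose reasoning actually supports it.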
Fudan and Others Launch an "Egocentric Audio-Visual Benchmark," Completing Multimodal Models' Missing "Auditory Piece"
量子位· 2026-03-12 02:59
Core Viewpoint
- The article discusses the limitations of current multimodal models in understanding sound in egocentric videos, emphasizing the need for models to not only "see" but also "hear" and comprehend the context of sounds in real-world scenarios [1][2][3].

Group 1: Introduction to EgoSound
- EgoSound is introduced as the first systematic benchmark for evaluating sound understanding in egocentric videos, developed by a research team from multiple universities [5][6].
- The goal of EgoSound is to enable models to hear, understand, reason, and explain events occurring in the real world [6][7].

Group 2: Benchmark Contributions
- EgoSound integrates two complementary datasets: Ego4D, which covers a wide range of daily first-person activities, and EgoBlind, which focuses on scenarios that heavily rely on auditory understanding [9].
- The benchmark consists of seven task categories that cover the complete chain from perception to reasoning, addressing the limitations of previous models that primarily focused on visual information [10].
- A high-quality, large-scale OpenQA dataset was created, comprising 900 carefully selected videos and 7,315 validated open-ended questions, emphasizing the importance of auditory clues in the questions [11][12].

Group 3: Model Evaluation and Findings
- The research team evaluated several state-of-the-art (SOTA) multimodal large language models (MLLMs) and provided a systematic analysis to guide future research [13].
- The evaluation revealed a significant gap between human performance (83.9% accuracy) and the best-performing model (56.7% accuracy), indicating that current models struggle to reliably convert sound into meaningful cognition [17][18].
- Key findings highlighted that spatial, temporal, and causal reasoning are the most challenging aspects for models, which often fail to answer questions about the source, timing, and reasoning behind sounds [20].

Group 4: Challenges in Sound Reasoning
- Cross-modal alignment remains a bottleneck, as sound clues frequently exist outside the visual frame, necessitating a chain of reasoning that connects hearing, seeing, and inferring [21].
- The complexity of real-world interactions, including occlusions, camera shake, and varying distances of sound sources, has been underestimated, making sound reasoning more challenging [22].

Group 5: Conclusion
- The article concludes that while previous multimodal models acted as "visual narrators," EgoSound aims to transform them into true first-person agents capable of both seeing and hearing, thus enhancing their ability to describe, locate, explain, and infer in a non-silent real world [23].