Multimodal Large Language Models
Study shows multimodal large models can spontaneously form human-like object concept representations
news flash· 2025-06-09 10:40
Core Insights
- The research team from the Institute of Automation at the Chinese Academy of Sciences has confirmed that multimodal large language models (MLLMs) can spontaneously form object concept representation systems that are highly similar to those of humans [1]
- This study opens new pathways for cognitive science in artificial intelligence and provides a theoretical framework for constructing human-like cognitive structures in AI systems [1]
- The research findings were published in the international academic journal "Nature Machine Intelligence" on June 9 [1]
Ditching autoregression! A Chinese team builds LLaDA-V, a pure-diffusion multimodal large model, setting a new SOTA on understanding tasks
机器之心· 2025-05-27 03:23
Core Viewpoint
- The article discusses the development of LLaDA-V, a pure diffusion multimodal large language model (MLLM) that integrates visual instruction tuning, marking a significant breakthrough in multimodal understanding compared to traditional autoregressive methods [1][16].

Group 1: Model Development
- The research team expanded LLaDA into the multimodal domain, introducing LLaDA-V, which utilizes a visual encoder (SigLIP 2) and an MLP connector to project visual features into the language embedding space, achieving effective multimodal alignment [2].
- LLaDA-V employs a discrete diffusion mechanism during both training and sampling phases, moving away from the autoregressive paradigm [2].

Group 2: Performance Highlights
- LLaDA-V demonstrates strong data scalability and competitive performance, outperforming the autoregressive baseline LLaMA3-V in 11 multimodal tasks, despite LLaDA-8B being slightly inferior to LLaMA3-8B in pure text tasks [5].
- The model achieves state-of-the-art (SOTA) performance in multimodal understanding tasks compared to existing mixed autoregressive-diffusion models, validating the effectiveness of the MLLM architecture based on powerful language diffusion models [8].
- LLaDA-V significantly narrows the performance gap with top autoregressive MLLMs, achieving comparable results in benchmarks like MMStar [10].

Group 3: Core Methodology
- The core of LLaDA-V lies in combining visual instruction tuning with LLaDA's masking diffusion mechanism, allowing for a robust training and inference process [13][15].
- The architecture consists of a classic "visual encoder + MLP projector + language model" setup, where the visual encoder extracts image features, and the MLP projector maps them to LLaDA's embedding space [15].
- LLaDA-V's training objective supports multi-turn multimodal dialogue by masking only the model's responses during training, optimizing the model's ability to generate coherent replies (a simplified sketch follows this summary) [15].

Group 4: Future Outlook
- The successful integration of visual instruction tuning with masking diffusion models opens a new technical pathway for MLLM development, challenging the notion that multimodal intelligence must rely on autoregressive models [16].
- The ongoing advancement of language diffusion models is expected to play a more significant role in the future, further pushing the boundaries of multimodal AI [16].
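The masked-response training objective described in Group 3 can be made concrete with a short sketch. This is a minimal illustration of a LLaDA-style masked-diffusion training step in which only response tokens are noised while prompt, image, and earlier dialogue tokens stay clean; the model interface, `mask_token_id`, and `response_mask` are illustrative assumptions rather than LLaDA-V's actual API, and the loss normalization is simplified.

```python
import torch
import torch.nn.functional as F

def masked_diffusion_step(model, input_ids, response_mask, mask_token_id):
    """One training step: input_ids (B, L) token ids; response_mask (B, L) bool,
    True only on the assistant-response tokens of the (multi-turn) dialogue."""
    B, L = input_ids.shape
    t = torch.rand(B, 1, device=input_ids.device).clamp(min=1e-3)      # per-sample noise level
    # Mask each response token independently with probability t; prompt/image tokens stay clean.
    to_mask = (torch.rand(B, L, device=input_ids.device) < t) & response_mask
    noisy_ids = torch.where(to_mask, torch.full_like(input_ids, mask_token_id), input_ids)

    logits = model(noisy_ids).logits                                    # bidirectional prediction, no causal mask
    token_loss = F.cross_entropy(
        logits.view(-1, logits.size(-1)), input_ids.view(-1), reduction="none"
    ).view(B, L)
    # Only masked response positions contribute; the 1/t reweighting follows the
    # masked-diffusion ELBO (normalization simplified for readability).
    loss = ((token_loss * to_mask) / t).sum() / to_mask.sum().clamp(min=1)
    return loss
```

At inference time the process runs in reverse: response positions start fully masked and are iteratively denoised, which is what replaces left-to-right autoregressive decoding.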
ByteDance & Tsinghua University open-source ChatTS, a multimodal time-series large model that enables dialogue and reasoning over time-series data
机器之心· 2025-05-22 10:25
Core Viewpoint
- The article discusses the development of ChatTS, a multimodal large language model (LLM) designed to support multivariate time series question answering and reasoning, addressing the limitations of existing models in handling time series data [1][6][14].

Group 1: Background and Motivation
- The rapid advancement of multimodal LLMs has led to breakthroughs in various fields, but research on time series data integration remains limited [1][6].
- Existing attempts, such as TimeLLM, primarily focus on predictive tasks, failing to meet the complex understanding and reasoning needs in applications like AIOps and finance [1][6].
- There is a growing demand for LLMs that can handle time series data natively, enabling them to understand the shapes, fluctuations, and semantic meanings of time series [6][11].

Group 2: Challenges in Time Series Modeling
- Traditional time series analysis methods often rely on statistical or AI models that require extensive task-specific training and structured input/output, lacking generalizability and interpretability [6][11].
- Current LLMs cannot directly process raw time series data, leading to limitations in existing approaches that convert time series into text or images [12][13].
- The scarcity of aligned time series and text data, along with the structural complexity of time series, poses significant challenges for model training and evaluation [11][12].

Group 3: ChatTS Development
- ChatTS employs a "purely synthetic-driven" approach to overcome the lack of labeled data, creating an end-to-end data generation and model training framework [15].
- A detailed attribute system for time series is defined, ensuring the generated time series are diverse and accurately correspond to natural language descriptions [18].
- The model architecture is based on Qwen2.5-14B-Instruct and is designed to natively perceive time series data by segmenting it into small patches and embedding it into the text context (a minimal sketch follows this summary) [22][23].

Group 4: Performance Evaluation
- ChatTS has been evaluated using three datasets covering real-world and synthetic time series data, assessing alignment and reasoning tasks across 12 subcategories [31].
- In alignment tasks, ChatTS significantly outperformed baseline models, achieving F1 score improvements of 46% to 75% and over 80% accuracy in numerical tasks [32][33].
- For reasoning tasks, ChatTS demonstrated an average improvement of 25.8% over baseline models, showcasing its enhanced understanding capabilities [34].

Group 5: Future Potential
- ChatTS represents a new paradigm in training multimodal models with synthetic data, indicating high potential for future applications in causal reasoning and root cause analysis [35].
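To make the "native perception" idea in Group 3 concrete, here is a minimal sketch of patching a raw time series and projecting each patch into the LLM's embedding space so it can be spliced into the text context. The patch size, the `<ts>` placeholder convention, and the projection layer are illustrative assumptions, not ChatTS's actual implementation.

```python
import torch
import torch.nn as nn

class TimeSeriesPatcher(nn.Module):
    """Chops a raw series into fixed-length patches and projects each patch to the
    LLM hidden size so it can sit in the text embedding sequence like ordinary tokens."""
    def __init__(self, patch_size: int = 16, hidden_size: int = 5120):
        super().__init__()
        self.patch_size = patch_size
        self.proj = nn.Sequential(
            nn.Linear(patch_size, hidden_size), nn.GELU(), nn.Linear(hidden_size, hidden_size)
        )

    def forward(self, series: torch.Tensor) -> torch.Tensor:
        """series: (B, T) raw values -> (B, T // patch_size, hidden_size) patch embeddings."""
        B, T = series.shape
        usable = T - T % self.patch_size                     # drop the ragged tail for simplicity
        patches = series[:, :usable].reshape(B, -1, self.patch_size)
        return self.proj(patches)

# Hypothetical usage: replace a <ts> placeholder position in the token-embedding
# sequence with the patch embeddings, then run the LLM on the mixed sequence.
# text_emb : (B, L, H) from the LLM embedding layer, ts_emb : (B, P, H) from the patcher
# mixed = torch.cat([text_emb[:, :ts_pos], ts_emb, text_emb[:, ts_pos:]], dim=1)
```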
ICML 2025 Spotlight | Do multimodal large models have a weak spot? The EMMA benchmark takes a deep look at multimodal reasoning ability
机器之心· 2025-05-20 04:58
"Three point charges +Q, -2Q, and +3Q are placed at equal distances. Which vector best describes the direction of the net electric force acting on the +Q charge?"

Humans can solve this easily by sketching a free-body force diagram. Yet even advanced multimodal large language models such as GPT-4o can misjudge the direction of a repulsive force while applying the basic principle that like charges repel (for example, judging the repulsion of +3Q on +Q to point toward the lower right instead of the correct upper left).

This seemingly simple physics problem exposes a critical weakness of multimodal large models: current MLLMs still cannot perform complex multimodal reasoning that requires deep fusion of vision and text. The newly released EMMA benchmark acts as a revealing mirror, showing that even top-tier MLLMs fall significantly short on this key capability.

The study has been accepted to ICML 2025 as a spotlight, and all code and data have been open-sourced. Multiple models and methods have since been evaluated on EMMA, and the findings show that even the most advanced model, Gemini-2.5-pro-exp-03-25, as well as the o3/o4-mini models capable of visual tool calling, still trail human experts by more than 20% on EMMA.

Title: Can MLLMs Reason in Multi ...
Tencent pulls out a big move with Hunyuan Image 2.0, which "draws as you speak": by the time you finish describing, the image is already generated
量子位· 2025-05-16 03:39
Core Viewpoint
- Tencent has launched the Hunyuan Image 2.0 model, which enables real-time image generation with millisecond response times, allowing users to create images seamlessly while describing them verbally or through sketches [1][6].

Group 1: Features of Hunyuan Image 2.0
- The model supports real-time drawing boards where users can sketch elements and provide text descriptions for immediate image generation [3][29].
- It offers various input methods, including text prompts, voice input in both Chinese and English, and the ability to upload reference images for enhanced image generation [18][19].
- Users can optimize generated images by adjusting parameters such as reference image strength and can also use a feature to automatically enhance composition and depth [27][35].

Group 2: Technical Highlights
- Hunyuan Image 2.0 features a significantly larger model size, increasing parameters by an order of magnitude compared to its predecessor, Hunyuan DiT, which enhances performance [37].
- The model incorporates a high-compression image codec developed by Tencent, which reduces encoding sequence length and speeds up image generation while maintaining quality [38].
- It utilizes a multimodal large language model (MLLM) as a text encoder, improving semantic understanding and matching capabilities compared to traditional models [39][40].
- The model has undergone reinforcement learning training to enhance the realism of generated images, aligning them more closely with real-world requirements [41].
- Tencent has developed a self-developed adversarial distillation scheme that allows for high-quality image generation with fewer steps [42].

Group 3: Future Developments
- Tencent's team has indicated that more details will be shared in upcoming technical reports, including information about a native multimodal image generation model [43][45].
- The new model is expected to excel in multi-round image generation and real-time interactive experiences [46].
GPT-4o loses to Qwen, and not a single model passes! A joint team from UC Berkeley, HKU, and others proposes a new multimodal benchmark testing multi-view understanding
量子位· 2025-05-14 06:07
Core Insights
- The article discusses the introduction of the All-Angles Bench, a new benchmark for evaluating multi-view understanding capabilities of multi-modal large language models (MLLMs) [2][4].

Group 1: Overview of All-Angles Bench
- All-Angles Bench aims to comprehensively assess the multi-view understanding abilities of MLLMs, featuring over 2,100 manually annotated multi-view question-answer pairs across 90 real-world scenarios [2][8].
- The benchmark includes six challenging tasks: Counting, Attribute Identification, Relative Distance, Relative Direction, Object Manipulation, and Camera Pose Estimation, which evaluate the models' understanding of 3D scenes [8][9].

Group 2: Performance Evaluation
- A total of 27 leading MLLMs were benchmarked, including Gemini-2.0-Flash, Claude-3.7-Sonnet, and GPT-4o, revealing a significant gap between their performance and human-level understanding [4][14].
- In the Camera Pose Estimation task, human annotators achieved an accuracy of 88.9%, while top models like Gemini-2.0-Flash lagged behind by over 50% [16].

Group 3: Findings and Analysis
- Certain open-source models, such as Ovis2-34B and Qwen2.5-VL-72B, outperformed closed-source models in direction-sensitive tasks, likely due to their superior video understanding and visual localization capabilities [17].
- The analysis revealed inconsistencies in MLLMs' responses, particularly in tasks involving relative direction, indicating challenges in multi-view understanding [20][23].
- MLLMs struggled with integrating fragmented information across views, often miscounting objects when visibility was partial [24][31].

Group 4: Recommendations for Improvement
- The article suggests that merely optimizing prompts is insufficient for enhancing multi-view understanding; dedicated multi-view training is necessary for substantial performance improvements [32].
Launching a financial-trading AI agent that can monitor the markets around the clock, this Singapore fintech company raises $10 million | Early-Stage Watch
36氪· 2025-05-12 23:56
Core Viewpoint
- RockFlow, a Singapore-based AI fintech company, has completed a $10 million A1 funding round to enhance its AI technology and launch its financial AI agent, Bobby [3][4].

Group 1: Company Overview
- RockFlow operates five offices globally and covers over 30 countries in nine languages, having previously received tens of millions in investments from top-tier Silicon Valley funds [4].
- The company launched TradeGPT, the world's first trading AI product, in April 2023, which utilizes multimodal LLM capabilities to analyze vast market information and price-volume data [4].

Group 2: Product Development
- RockFlow is developing an AI agent architecture tailored for financial investment scenarios, leveraging cutting-edge technologies such as multimodal large language models (LLM), Fin-Tuning, RAG, Multi-Agent, and CoT [4][5].
- The AI agent aims to enhance understanding and generation capabilities, efficiently process multi-source data, and provide precise financial analysis and investment recommendations [4][5].

Group 3: Investment Process
- In investment trading scenarios, RockFlow's AI agent simplifies traditional complex processes into four core steps: real-time information acquisition, analysis, trading strategy construction, and order execution [5].
- The AI agent monitors market dynamics and analyzes extensive data, including financial metrics and social media sentiment, to present personalized real-time trading opportunities [5][6].

Group 4: User Interaction
- Users can express their needs in natural language, allowing the AI agent to generate personalized investment configurations and trading strategies based on their profit goals and risk preferences [6].
- The AI agent can also create complex conditional orders and automate investment tasks, assisting users in managing profits and losses effectively [6].

Group 5: Future Outlook
- Bobby, the financial AI agent product, is set to launch globally soon, with a team comprising experts from AI, financial mathematics, and investment trading [6].
Li Auto's MCAF reshapes the visual-cognition paradigm for assisted driving
理想TOP2· 2025-04-25 12:43
The following article is from AcademicDaily, a technical exchange platform that tracks, recommends, and interprets AI results such as large models and is dedicated to sharing frontier technology.

Internally at Li Auto, MCAF is referred to as the "third eye" of autonomous driving. It is compatible with Li Auto's in-house Mind GPT-3o and BEV large models and requires no retraining.

MCAF is a multimodal coarse-to-fine attention focusing framework whose core target is the key bottleneck of long-video understanding. Current video-understanding approaches handle long videos (>5 minutes) poorly: mainstream methods (such as Video-MLLM) rely on global compression or uniform sampling, which loses detail and wastes computation. MCAF tackles this problem directly, using multimodal hierarchical attention and a temporal extension mechanism to balance information retention against computational efficiency, which is its core value.

On the Video-MME dataset, where videos average 60 minutes in length, MCAF surpasses other agent-based methods (such as VideoTree and DrVideo) by roughly 3-5 percentage points.

Unlike VideoTree, which needs an additional reward model to assess confidence, MCAF uses a single LLM to close the generate-evaluate-adjust loop (a schematic sketch follows this summary). This simplifies the architecture (a single LLM interface suffices in the implementation), avoids the compatibility issues of coordinating multiple models, and is better suited to real deployment. However, on NEx ...
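The single-LLM closed loop mentioned above can be sketched schematically as follows. `llm_chat` and `caption_frames` are hypothetical stand-ins for whatever LLM interface and frame-captioning step an implementation uses, and the prompts and parsing are deliberately naive; the real MCAF focusing policy is defined in the paper, not here.

```python
def mcaf_answer(video_frames, question, llm_chat, caption_frames,
                max_rounds: int = 3, conf_threshold: float = 0.8):
    """Coarse-to-fine loop: answer from sparsely sampled frames, self-evaluate, and
    densify the focused time region until the LLM is confident or rounds run out."""
    # Coarse stage: start from a sparse, uniform sample across the whole video.
    step = max(1, len(video_frames) // 32)
    focus = list(range(0, len(video_frames), step))
    answer = ""
    for _ in range(max_rounds):
        context = caption_frames([video_frames[i] for i in focus])   # text descriptions of focused frames
        answer = llm_chat(f"Frame descriptions:\n{context}\nQuestion: {question}\nAnswer:")
        # Evaluate: the same LLM scores its own answer (no separate reward model).
        score = llm_chat(
            f"Question: {question}\nAnswer: {answer}\n"
            "How well is this answer supported by the frame descriptions? Reply with a number in [0, 1]."
        )
        if float(score) >= conf_threshold:                           # parsing kept naive for brevity
            return answer
        # Adjust: ask the LLM which frame range deserves a finer look, then densify it.
        region = llm_chat(
            f"The answer was unreliable. Which frame index range (as 'start-end', max index "
            f"{len(video_frames) - 1}) should be inspected more closely for: {question}?"
        )
        lo, hi = (int(x) for x in region.strip().split("-")[:2])
        focus = sorted(set(focus) | set(range(max(0, lo), min(len(video_frames), hi + 1))))
    return answer
```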
10x throughput gain with no loss in performance: a KV cache quantization strategy for multimodal models, plug-and-play with no changes to the original model
量子位· 2025-04-03 02:12
Core Insights
- The article discusses the introduction of CalibQuant, a 1-bit KV cache quantization method for multimodal large language models (LLMs), which significantly enhances throughput while maintaining model performance [1][5][18].

Group 1: Motivation and Challenges
- Current multimodal LLMs face challenges in handling large, high-resolution image or video data, where the KV cache mechanism increases memory usage proportionally to input length, limiting throughput [6].
- Existing quantization methods for LLM KV caches do not adequately address the unique visual redundancy in multimodal contexts, making them ineffective under extreme conditions [6][7].

Group 2: Methodology
- CalibQuant employs a novel 1-bit quantization strategy that integrates post-scaling and calibration techniques to reduce memory and computational costs without altering the original model [3][5].
- The method includes channel-wise quantization, which refines the statistical range for quantization, thus preserving model performance better than global statistics (a simplified sketch follows this summary) [9][10].
- A post-scaling management strategy is introduced to optimize the computation order during dequantization, enhancing efficiency and reducing storage needs [11][12].
- A calibration method is proposed to adjust attention scores before softmax, mitigating the impact of extreme values resulting from 1-bit quantization [13][14].

Group 3: Experimental Results
- The proposed quantization method was tested on LLaVA and InternVL models across various tasks, showing superior performance compared to existing methods like KIVI and VLCache, particularly in the captioning task [15][18].
- For instance, the method achieved a CIDEr score of 1.109 at 1-bit quantization for the llava-1.5-7b model, surpassing VLCache's score of 1.053 [15].

Group 4: Runtime Analysis
- The runtime analysis demonstrated that the 1-bit quantization method consistently outperformed the 16-bit baseline in throughput across different memory budgets, achieving up to 459.016 tokens per second compared to the baseline's 40.816 tokens per second [17].
- This indicates a throughput improvement of approximately 9.88× to 11.24×, showcasing the method's effectiveness under constrained memory conditions [17].

Group 5: Conclusion
- The article concludes that the proposed CalibQuant method effectively addresses the challenges of KV cache compression in multimodal LLMs, enhancing both computational efficiency and model performance [18].
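The channel-wise 1-bit scheme and the post-scaling idea from Group 2 can be illustrated with a small sketch. The per-channel min/max statistics and the late application of scale and zero-point below are illustrative assumptions in the spirit of CalibQuant, not the paper's exact recipe (the pre-softmax calibration step is omitted).

```python
import torch

def quantize_1bit_per_channel(x: torch.Tensor):
    """x: (num_tokens, num_channels) slice of the K or V cache.
    Returns 1-bit codes plus per-channel scale/zero-point for later dequantization."""
    cmin = x.min(dim=0, keepdim=True).values           # per-channel statistics, not global ones
    cmax = x.max(dim=0, keepdim=True).values
    scale = (cmax - cmin).clamp(min=1e-6)
    bits = (x - cmin) / scale > 0.5                     # one bit per value: above/below the channel midpoint
    return bits, scale, cmin

def dequantize_1bit(bits: torch.Tensor, scale: torch.Tensor, cmin: torch.Tensor):
    # Post-scaling idea: keep the cached values as {0, 1} codes and fold the per-channel
    # scale and zero-point back in only when the values are actually consumed.
    return bits.to(scale.dtype) * scale + cmin

# Hypothetical usage on a cached key tensor of shape (tokens, channels):
# bits, scale, cmin = quantize_1bit_per_channel(k_cache)
# k_restored = dequantize_1bit(bits, scale, cmin)       # used in the attention computation
```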
A new breakthrough in long-video understanding! A hybrid Mamba architecture halves memory consumption and handles 100,000 video tokens with ease
量子位· 2025-03-27 04:16
Core Viewpoint
- The article introduces the Vamba model, a hybrid Mamba-Transformer model designed for efficient understanding of long videos, significantly improving processing efficiency without compressing video tokens [1][10].

Group 1: Model Design and Efficiency
- Vamba improves the efficiency of processing video tokens during training and inference by redesigning the model architecture rather than compressing video tokens [1][4].
- The model can process four times more video frames under the same hardware conditions compared to traditional Transformer architectures, with over 50% reduction in training memory consumption and doubled training speed [4][9].
- Vamba retains the original spatiotemporal features of videos, avoiding the information loss that occurs with traditional downsampling or pooling methods [5][10].

Group 2: Technical Innovations
- The core design of Vamba involves breaking down the costly causal self-attention operations into two more efficient components: cross-attention for text tokens and a state space model (SSM) based Mamba-2 module for video tokens (a simplified sketch follows this summary) [6][7].
- The Mamba-2 module reduces the computational complexity from quadratic to linear, allowing for effective processing of long video sequences [7][9].
- Vamba's architecture allows for efficient alignment of text and video information, enhancing the model's ability to analyze video content based on user queries [9][10].

Group 3: Performance Evaluation
- Extensive experiments show that Vamba outperforms existing efficient long video understanding models by approximately 4.3% on the LVBench benchmark [5][10].
- The model demonstrates superior performance across various video duration benchmarks, showcasing its competitive edge in long, medium, and short video understanding tasks [10].
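The decomposition described in Group 2 can be sketched as a hybrid block: video tokens are mixed by a linear-time SSM-style module, and text tokens cross-attend to the mixed video tokens instead of running full self-attention over the concatenated sequence. The `SSMVideoMixer` below is a placeholder where a real Mamba-2 block would go; module names and sizes are illustrative, not Vamba's actual code.

```python
import torch
import torch.nn as nn

class SSMVideoMixer(nn.Module):
    """Placeholder for a Mamba-2 block: any O(N) sequence mixer can be dropped in here."""
    def __init__(self, dim: int):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=4, padding=3, groups=dim)   # cheap local mixing
        self.gate = nn.Linear(dim, dim)

    def forward(self, v: torch.Tensor) -> torch.Tensor:                          # v: (B, Nv, D)
        mixed = self.conv(v.transpose(1, 2))[..., : v.size(1)].transpose(1, 2)   # trim to causal length
        return v + mixed * torch.sigmoid(self.gate(v))

class HybridBlock(nn.Module):
    """Text tokens cross-attend to video tokens; video tokens never run quadratic self-attention."""
    def __init__(self, dim: int, n_heads: int = 8):
        super().__init__()
        self.video_mixer = SSMVideoMixer(dim)                                    # linear in video length
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, text: torch.Tensor, video: torch.Tensor):
        # text: (B, Nt, D), video: (B, Nv, D)
        video = self.video_mixer(self.norm_v(video))
        # Cross-attention costs O(Nt * Nv) rather than O((Nt + Nv)^2) full self-attention.
        attn_out, _ = self.cross_attn(self.norm_t(text), video, video)
        return text + attn_out, video
```

Because the video path scales linearly with the number of video tokens and the cross-attention cost grows with Nt x Nv rather than the square of the full sequence, memory and compute grow far more slowly with video length, which is consistent with the reported gains in frame capacity and training memory.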