Multimodal Large Language Models

Kuaishou Releases the 8B Kwai Keye-VL! A Quick Look at the Technical Report
自动驾驶之心· 2025-07-07 12:17
Core Insights
- The article discusses the launch of Kwai Keye-VL, an 8-billion-parameter multimodal large language model (MLLM) designed to enhance understanding of short video content, addressing the limitations of existing models in processing dynamic and information-dense media [2][3].

Group 1: Model Development
- Kwai Keye-VL is built on a large-scale dataset containing over 600 billion tokens, primarily focused on high-quality video data, and employs an innovative training strategy [2][4].
- The training process consists of a four-stage pre-training phase followed by a two-stage post-training phase, aimed at aligning visual and language features effectively [4][18].

Group 2: Training Methodology
- The first post-training stage focuses on optimizing basic capabilities such as instruction following through supervised fine-tuning and mixed preference optimization [5].
- The second stage enhances reasoning abilities using a five-mode "cold start" data mixing strategy, which includes various reasoning tasks and high-quality video data [6][12].

Group 3: Performance Evaluation
- Keye-VL has demonstrated advanced performance in public benchmark tests, outperforming other leading models of similar size in user experience evaluations [3][27].
- The model's capabilities were validated through extensive evaluation experiments, including the development of a new benchmark, KC-MMBench, tailored for real-world short-video scenarios [3][28].

Group 4: Technical Innovations
- The model incorporates a hybrid parallelism strategy for efficient training, combining data and sequence parallelism to optimize memory usage and computational efficiency [22][23].
- A dynamic load-balancing mechanism is implemented to address computational load imbalances during multimodal training, significantly improving training speed [24].
- A sample-level auto-resume mechanism enhances training stability by allowing automatic recovery from interruptions (a hedged sketch of such a mechanism follows this list) [25].
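The report describes the sample-level auto-resume mechanism only at a high level. The following is a minimal Python sketch of one way such a scheme can work; all names, the checkpoint layout, and the training loop are illustrative assumptions, not Kwai's implementation. The idea is to persist the index of the last consumed training sample next to the model checkpoint so a restarted job skips already-seen samples instead of replaying the epoch.

```python
import json
import os
import torch


def save_checkpoint(path, model, optimizer, sample_idx):
    """Persist model/optimizer state together with the last consumed sample index."""
    torch.save(
        {"model": model.state_dict(), "optimizer": optimizer.state_dict()},
        os.path.join(path, "state.pt"),
    )
    with open(os.path.join(path, "progress.json"), "w") as f:
        json.dump({"sample_idx": sample_idx}, f)


def load_checkpoint(path, model, optimizer):
    """Return the sample index to resume from, or 0 if no checkpoint exists."""
    state_file = os.path.join(path, "state.pt")
    if not os.path.exists(state_file):
        return 0
    state = torch.load(state_file, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    with open(os.path.join(path, "progress.json")) as f:
        return json.load(f)["sample_idx"]


def train(model, optimizer, dataset, ckpt_dir, save_every=1000):
    """Toy training loop that can resume mid-epoch at sample granularity."""
    start = load_checkpoint(ckpt_dir, model, optimizer)
    for idx in range(start, len(dataset)):
        loss = model(dataset[idx]).loss  # placeholder forward pass
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        if (idx + 1) % save_every == 0:
            save_checkpoint(ckpt_dir, model, optimizer, idx + 1)
```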
Sweeping 6 Benchmarks! TW-GRPO Raises the Ceiling for Video Reasoning, with CLEVRER Accuracy Breaking 50.4%!
机器人大讲堂· 2025-07-06 05:23
Core Viewpoint
- The rapid development of multimodal large language models (MLLMs) is significantly enhancing video reasoning capabilities, driven by reinforcement learning (RL) as a key engine for this technological revolution [1]

Group 1: TW-GRPO Framework Introduction
- The TW-GRPO framework is proposed to address challenges in reasoning quality and reward granularity in video reasoning tasks, building on the traditional GRPO framework [2]
- TW-GRPO integrates focused thinking and multi-level soft reward mechanisms for multi-choice QA tasks [3]

Group 2: Key Improvements in TW-GRPO
- The framework enhances information weighting and reward mechanism design, applying a soft reward mechanism from video localization to video reasoning tasks [4]
- A dynamic weighting mechanism prioritizes high-information-density tokens, improving reasoning accuracy and efficiency by focusing on key content [4]
- The multi-level reward mechanism redefines rewards, allowing for partial correctness in answers and thus improving training stability and efficiency (a hedged sketch of such a partial-credit reward follows this list) [5]

Group 3: Data Augmentation and Training Efficiency
- TW-GRPO introduces a question-answer inversion (QAI) data augmentation technique to convert single-choice tasks into multi-choice formats, effectively expanding the training data pool [6]
- This approach departs from the traditional equal treatment of tokens, enhancing training efficiency and reasoning performance through differentiated information processing [6]

Group 4: Experimental Validation
- Extensive experiments demonstrate TW-GRPO's effectiveness in video reasoning and general understanding tasks, outperforming Video-R1 by 18.8%, 1.8%, and 1.6% on various benchmarks [12][15]
- The framework shows faster convergence and more stable learning than traditional GRPO, with shorter output sequences indicating more efficient reasoning [11][17]

Group 5: Qualitative Analysis of Reasoning Paths
- A qualitative comparison of reasoning paths between T-GRPO and TW-GRPO illustrates significant improvements in accuracy and efficiency on dynamic visual-cue reasoning tasks [22]
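The summary does not give the exact reward formula. Below is a minimal Python sketch under the assumption that partial credit for a multi-choice question is computed as the overlap (intersection over union) between the predicted and ground-truth option sets; the function name and scoring rule are illustrative, not the paper's exact definition.

```python
def soft_multichoice_reward(predicted: set[str], ground_truth: set[str]) -> float:
    """Partial-credit reward for multi-choice QA.

    Returns 1.0 for an exact match, 0.0 for no overlap, and a fractional
    score (intersection over union) for partially correct answer sets.
    """
    if not predicted or not ground_truth:
        return 0.0
    overlap = predicted & ground_truth
    union = predicted | ground_truth
    return len(overlap) / len(union)


# Example: ground truth is {A, C}; answering {A} earns partial credit instead of
# the all-or-nothing 0 that a hard exact-match reward would give.
print(soft_multichoice_reward({"A"}, {"A", "C"}))        # 0.5
print(soft_multichoice_reward({"A", "C"}, {"A", "C"}))   # 1.0
print(soft_multichoice_reward({"B"}, {"A", "C"}))        # 0.0
```

A graded signal like this gives the policy useful gradient information on near-miss answers, which is one plausible reason the article reports more stable training than with a binary reward.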
Just Announced: CVPR 2025 Awards Are Out. Oxford & Meta PhD Student Jianyuan Wang Wins Best Paper; Saining Xie Receives the Young Researcher Award
机器之心· 2025-06-13 15:45
Core Insights
- The CVPR 2025 conference in Nashville, Tennessee, awarded five papers, including one best paper and four honorable mentions, along with one best student paper and one honorable mention for student papers [1][2].

Submission and Acceptance Statistics
- This year, over 40,000 authors submitted 13,008 papers, marking a 13% increase from last year's 11,532 submissions. A total of 2,872 papers were accepted, resulting in an overall acceptance rate of approximately 22.1%. Among the accepted papers, 96 were oral presentations (3.3%) and 387 were highlights (13.7%) [3][5].

Conference Attendance
- The conference attracted over 9,000 attendees from more than 70 countries and regions [7].

Paper Acceptance by Field
- The image and video generation field had the highest number of accepted papers, while the highest acceptance rates were seen in 3D from multi-view and sensor data, as well as single-image 3D [8].

Best Paper Award
- The best paper, "VGGT: Visual Geometry Grounded Transformer," was presented by researchers from the University of Oxford and Meta AI. It introduced a universal 3D vision model based on a pure feedforward Transformer architecture, capable of inferring core geometric information from one or more images [13][14].

Notable Research Contributions
- The best paper demonstrated significant performance improvements over traditional optimization methods and existing state-of-the-art models across various 3D tasks, achieving inference in seconds without requiring post-processing optimization [17].

Best Student Paper
- The best student paper, "Neural Inverse Rendering from Propagating Light," proposed a physics-based multi-view dynamic light-propagation neural inverse rendering system, achieving state-of-the-art 3D reconstruction under strong indirect lighting conditions [53][55].

Awards and Recognitions
- Two Young Researcher Awards were given to Hao Su and Saining Xie for their outstanding contributions to computer vision research [68][72]. The Longuet-Higgins Award was presented to two papers that have significantly influenced the field, including the Inception architecture and fully convolutional networks for semantic segmentation [75][78][80].
Scientists Confirm That Large Models Can "Understand" Things the Way Humans Do
Ke Ji Ri Bao· 2025-06-10 22:45
Core Insights
- Researchers from the Chinese Academy of Sciences have confirmed that multimodal large language models can learn to "understand" objects in a manner similar to humans, paving the way for future AI systems that can comprehend the world like humans do [1][2]

Group 1: Research Findings
- The study used an experiment grounded in human cognitive principles, in which both a large model and humans played an odd-one-out ("find the difference") game; analyzing data from 4.7 million judgments produced a "concept map" of the model's thinking (a hedged sketch of the triplet task follows this list) [2]
- The researchers identified 66 key dimensions along which the AI "understands" objects, and these align closely with the neural activity patterns in the human brain regions responsible for object processing [2]
- The multimodal model's way of "thinking" and making choices is more similar to human cognition than that of other models [2]

Group 2: Comparison with Human Understanding
- While humans consider both the appearance and the meaning of objects, the large model relies more on "text labels" and learned abstract concepts, indicating that it is developing a somewhat human-like understanding of the world [2]
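The article describes the "find the difference" game only informally; in the cognitive-science literature this is typically a triplet odd-one-out task. The Python sketch below shows how such judgments could be collected from a model and aggregated into pairwise similarity counts, the raw material for a "concept map". The model call, prompt wording, and aggregation are assumptions for illustration, not the study's actual protocol.

```python
import random
from collections import defaultdict


def odd_one_out(model, triplet):
    """Ask the model which of three object concepts is least like the other two.

    `model` is any callable mapping a prompt string to one of the three names;
    it stands in for an MLLM API call.
    """
    prompt = (
        "Which of these three objects is least similar to the other two? "
        f"Answer with one word: {triplet[0]}, {triplet[1]}, {triplet[2]}."
    )
    return model(prompt)


def similarity_from_triplets(model, concepts, n_trials=10_000, seed=0):
    """Aggregate odd-one-out choices into pairwise similarity counts.

    When the model picks C as the odd one out of (A, B, C), the remaining pair
    (A, B) is implicitly judged the most similar pair in that triplet.
    """
    rng = random.Random(seed)
    similar_counts = defaultdict(int)
    for _ in range(n_trials):
        a, b, c = rng.sample(concepts, 3)
        odd = odd_one_out(model, (a, b, c))
        kept = [x for x in (a, b, c) if x != odd]
        if len(kept) == 2:  # ignore malformed answers
            similar_counts[frozenset(kept)] += 1
    return similar_counts
```

With millions of such judgments (the article cites 4.7 million), the resulting similarity structure can then be embedded into a small number of interpretable dimensions.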
Chinese Research Team Finds That Artificial Intelligence Can Spontaneously Form Human-Level Cognition
Xin Jing Bao· 2025-06-09 13:01
Beijing News (reporter Zhang Lu): On June 9, the reporter learned from the Institute of Automation, Chinese Academy of Sciences, that researchers, combining behavioral experiments with neuroimaging analysis, have confirmed for the first time that multimodal large language models (MLLMs) can spontaneously form object concept representation systems highly similar to those of humans. The findings were published in Nature Machine Intelligence.

The ability to conceptualize objects in the natural world has long been regarded as a core element of human intelligence. When we see a dog, a car, or an apple, we not only recognize their physical features, such as size, color, and shape, but also understand their functions, emotional value, and cultural meaning; this multidimensional conceptual representation forms the bedrock of human cognition.

The researchers extracted 66 "mental dimensions" from massive amounts of large-model behavioral data and assigned semantic labels to them. They found that these dimensions are highly interpretable and significantly correlated with the neural activity patterns of category-selective regions of the brain.

The study also revealed that humans tend to combine visual features and semantic information when making judgments, whereas large models tend to rely on semantic labels and abstract concepts. This indicates that large language models internally hold an understanding of real-world concepts similar to that of humans.

With the development of large language models (LLMs) such as ChatGPT, a fundamental question has surfaced: can these models develop human-like object concept representations from language and multimodal data?

Recently, the Neural Computation and Brain-Computer Interaction (NeuBCI) group at the Institute of Automation, Chinese Academy of Sciences, together with the Chinese ...
Can Artificial Intelligence Spontaneously Form Human-Level Cognition? New Research from a Chinese Team Confirms It for the First Time
Huan Qiu Wang Zi Xun· 2025-06-09 12:57
Core Insights
- A Chinese scientific team has demonstrated that AI, specifically multimodal large language models, can form object concept representation systems similar to human cognition, indicating that AI can achieve human-level understanding [1][3][4]
- The research was conducted by the Chinese Academy of Sciences and published in the journal Nature Machine Intelligence, providing a theoretical framework for developing human-like cognitive structures in AI [1][3]

Research Methodology
- The study combined cognitive neuroscience theories, computational modeling, behavioral experiments, and brain science to create an innovative research paradigm [3][4]
- A classic cognitive psychology task was employed, in which both AI models and humans were asked to identify the least similar option from a trio of object concepts drawn from 1,854 everyday concepts [4]

Findings
- The research team analyzed 4.7 million behavioral judgment data points to construct a "concept map" for the AI model, revealing 66 "mental dimensions" that were semantically labeled and significantly correlated with neural activity patterns in the human brain (a hedged correlation sketch follows this list) [4]
- The study found that multimodal large language models exhibited higher consistency with human behavior in decision-making tasks, although humans tended to integrate visual features and semantic information more than the AI models, which relied on semantic labels and abstract concepts [4]
- The core finding suggests that AI's "mental dimensions" align closely with human understanding of reality, marking a significant advancement from mere "machine recognition" to "machine understanding" [4]
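The summary states that the 66 model-derived dimensions correlate with human brain activity but gives no procedure. A common way to test such a claim is representational similarity analysis (RSA); the sketch below uses it purely as an illustration (hypothetical variable names and toy data, not the paper's code), comparing a dissimilarity matrix built from the model's 66-dimensional object embeddings with one built from brain responses to the same objects.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr


def rdm(features: np.ndarray) -> np.ndarray:
    """Representational dissimilarity matrix: pairwise correlation distance
    between items, returned as a condensed (upper-triangle) vector."""
    return pdist(features, metric="correlation")


def rsa_score(model_embedding: np.ndarray, brain_responses: np.ndarray) -> float:
    """Spearman correlation between model and brain RDMs.

    model_embedding: (n_objects, 66) dimension scores derived from behavior.
    brain_responses: (n_objects, n_voxels) activity patterns for the same objects.
    """
    rho, _ = spearmanr(rdm(model_embedding), rdm(brain_responses))
    return rho


# Toy usage with random data standing in for real measurements.
rng = np.random.default_rng(0)
n_objects = 100
print(rsa_score(rng.normal(size=(n_objects, 66)), rng.normal(size=(n_objects, 500))))
```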
Study Shows Multimodal Large Models Can Spontaneously Form Human-Like Object Concept Representations
news flash· 2025-06-09 10:40
Core Insights
- The research team from the Institute of Automation at the Chinese Academy of Sciences has confirmed that multimodal large language models (MLLMs) can spontaneously form object concept representation systems that are highly similar to those of humans [1]
- This study opens new pathways for cognitive science in artificial intelligence and provides a theoretical framework for constructing human-like cognitive structures in AI systems [1]
- The research findings were published in the international academic journal Nature Machine Intelligence on June 9 [1]
Ditching Autoregression! A Chinese Team Builds LLaDA-V, a Pure-Diffusion Multimodal Large Model and New SOTA for Understanding Tasks
机器之心· 2025-05-27 03:23
Core Viewpoint
- The article discusses the development of LLaDA-V, a pure diffusion multimodal large language model (MLLM) that integrates visual instruction tuning, marking a significant breakthrough in multimodal understanding compared to traditional autoregressive methods [1][16]

Group 1: Model Development
- The research team expanded LLaDA into the multimodal domain, introducing LLaDA-V, which utilizes a visual encoder (SigLIP 2) and an MLP connector to project visual features into the language embedding space, achieving effective multimodal alignment [2]
- LLaDA-V employs a discrete diffusion mechanism during both training and sampling, moving away from the autoregressive paradigm [2]

Group 2: Performance Highlights
- LLaDA-V demonstrates strong data scalability and competitive performance, outperforming the autoregressive baseline LLaMA3-V on 11 multimodal tasks, despite LLaDA-8B being slightly inferior to LLaMA3-8B on pure text tasks [5]
- The model achieves state-of-the-art (SOTA) performance on multimodal understanding tasks compared to existing mixed autoregressive-diffusion models, validating the effectiveness of an MLLM architecture based on powerful language diffusion models [8]
- LLaDA-V significantly narrows the performance gap with top autoregressive MLLMs, achieving comparable results on benchmarks like MMStar [10]

Group 3: Core Methodology
- The core of LLaDA-V lies in combining visual instruction tuning with LLaDA's masking diffusion mechanism, allowing for a robust training and inference process [13][15]
- The architecture consists of a classic "visual encoder + MLP projector + language model" setup, where the visual encoder extracts image features and the MLP projector maps them into LLaDA's embedding space [15]
- LLaDA-V's training objective supports multi-turn multimodal dialogue by masking only the model's responses during training, optimizing the model's ability to generate coherent replies (a hedged sketch of such a response-only masking objective follows this list) [15]

Group 4: Future Outlook
- The successful integration of visual instruction tuning with masking diffusion models opens a new technical pathway for MLLM development, challenging the notion that multimodal intelligence must rely on autoregressive models [16]
- The ongoing advancement of language diffusion models is expected to play a more significant role in the future, further pushing the boundaries of multimodal AI [16]
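The summary says only that responses are masked during training. The sketch below is a minimal PyTorch rendering of that idea under the standard masked-diffusion objective used by LLaDA-style models (each response token is masked independently with probability t, and the cross-entropy on masked positions is reweighted by 1/t); tensor shapes, the MASK id, the model interface, and the loss normalization are illustrative assumptions rather than LLaDA-V's exact code.

```python
import torch
import torch.nn.functional as F

MASK_ID = 126336  # placeholder mask-token id; the real vocabulary id will differ


def response_masking_loss(model, input_ids, response_mask):
    """Masked-diffusion SFT loss that corrupts only response tokens.

    input_ids:     (B, L) prompt + response token ids (image tokens counted as prompt).
    response_mask: (B, L) bool, True where the token belongs to the model's response.
    """
    b, l = input_ids.shape
    t = torch.rand(b, 1, device=input_ids.device).clamp(min=1e-3)  # noise level per sample

    # Mask each response token independently with probability t; the prompt stays clean,
    # which is what lets the model condition on multi-turn context during training.
    masked = (torch.rand(b, l, device=input_ids.device) < t) & response_mask
    noisy_ids = torch.where(masked, torch.full_like(input_ids, MASK_ID), input_ids)

    logits = model(noisy_ids).logits                       # (B, L, V)
    token_loss = F.cross_entropy(
        logits.transpose(1, 2), input_ids, reduction="none"
    )                                                      # (B, L)

    # Only masked response positions contribute, reweighted by 1/t as in masked diffusion.
    loss = (token_loss * masked) / t
    return loss.sum() / response_mask.sum().clamp(min=1)
```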
ByteDance & Tsinghua University Open-Source ChatTS, a Multimodal Time-Series Large Model for Dialogue and Reasoning over Time-Series Data
机器之心· 2025-05-22 10:25
Core Viewpoint
- The article discusses the development of ChatTS, a multimodal large language model (LLM) designed to support multivariate time-series question answering and reasoning, addressing the limitations of existing models in handling time-series data [1][6][14]

Group 1: Background and Motivation
- The rapid advancement of multimodal LLMs has led to breakthroughs in various fields, but research on integrating time-series data remains limited [1][6]
- Existing attempts, such as TimeLLM, primarily focus on predictive tasks, failing to meet the complex understanding and reasoning needs of applications like AIOps and finance [1][6]
- There is a growing demand for LLMs that can handle time-series data natively, enabling them to understand the shapes, fluctuations, and semantic meanings of time series [6][11]

Group 2: Challenges in Time-Series Modeling
- Traditional time-series analysis methods often rely on statistical or AI models that require extensive task-specific training and structured input/output, lacking generalizability and interpretability [6][11]
- Current LLMs cannot directly process raw time-series data, leading to limitations in existing approaches that convert time series into text or images [12][13]
- The scarcity of aligned time-series and text data, along with the structural complexity of time series, poses significant challenges for model training and evaluation [11][12]

Group 3: ChatTS Development
- ChatTS employs a "purely synthetic-driven" approach to overcome the lack of labeled data, creating an end-to-end data generation and model training framework [15]
- A detailed attribute system for time series is defined, ensuring that the generated time series are diverse and accurately correspond to natural language descriptions [18]
- The model architecture is based on Qwen2.5-14B-Instruct and is designed to natively perceive time-series data by segmenting it into small patches and embedding them into the text context (a hedged patch-embedding sketch follows this list) [22][23]

Group 4: Performance Evaluation
- ChatTS has been evaluated on three datasets covering real-world and synthetic time series, assessing alignment and reasoning tasks across 12 subcategories [31]
- On alignment tasks, ChatTS significantly outperformed baseline models, achieving F1 score improvements of 46% to 75% and over 80% accuracy on numerical tasks [32][33]
- On reasoning tasks, ChatTS demonstrated an average improvement of 25.8% over baseline models, showcasing its enhanced understanding capabilities [34]

Group 5: Future Potential
- ChatTS represents a new paradigm for training multimodal models with synthetic data, indicating high potential for future applications in causal reasoning and root cause analysis [35]
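The article describes the architecture only at a high level: split the series into small patches and embed them into the text context. The module below is a minimal PyTorch sketch of that idea; the patch length, hidden width, and LLM embedding dimension are placeholders and not ChatTS's actual configuration.

```python
import torch
import torch.nn as nn


class TimeSeriesPatchEmbedder(nn.Module):
    """Turn a univariate time series into a sequence of LLM-space embeddings.

    The series is split into fixed-length patches; each patch is projected by a
    small MLP into the language model's embedding dimension so the resulting
    vectors can be interleaved with ordinary text-token embeddings.
    """

    def __init__(self, patch_len: int = 16, hidden: int = 256, llm_dim: int = 5120):
        super().__init__()
        self.patch_len = patch_len
        self.proj = nn.Sequential(
            nn.Linear(patch_len, hidden),
            nn.GELU(),
            nn.Linear(hidden, llm_dim),
        )

    def forward(self, series: torch.Tensor) -> torch.Tensor:
        # series: (batch, length); pad so the length is a multiple of patch_len.
        b, n = series.shape
        pad = (-n) % self.patch_len
        if pad:
            series = torch.nn.functional.pad(series, (0, pad))
        patches = series.unfold(1, self.patch_len, self.patch_len)  # (B, P, patch_len)
        return self.proj(patches)  # (B, P, llm_dim), ready to splice into text embeddings


# Toy usage: a batch of 2 series of length 100 becomes 7 patch embeddings each.
emb = TimeSeriesPatchEmbedder()(torch.randn(2, 100))
print(emb.shape)  # torch.Size([2, 7, 5120])
```

Feeding numeric values through a learned projection like this, rather than serializing them as text or rendering them as plots, is what the article means by perceiving time series "natively."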
ICML 2025 Spotlight | Do Multimodal Large Models Have a Weak Spot? The EMMA Benchmark Takes a Deep Look at Multimodal Reasoning
机器之心· 2025-05-20 04:58
"Three point charges, +Q, -2Q, and +3Q, are placed at equal distances. Which vector best describes the direction of the net electric force acting on the +Q charge?"

A person can solve this easily by sketching a free-body diagram. But even an advanced multimodal large language model such as GPT-4o can misjudge the direction of a repulsive force when applying the basic principle that like charges repel (for example, judging the repulsion exerted by +3Q on +Q to point toward the lower right instead of the correct upper left). A hedged worked superposition of the two forces is sketched after this item.

This seemingly simple physics problem exposes a critical weakness of multimodal large models: current MLLMs still cannot perform complex multimodal reasoning that requires deep fusion of vision and text. The newly introduced EMMA benchmark acts as a revealing mirror, showing that even top MLLMs fall significantly short on this key ability.

The work has been accepted as a spotlight at ICML 2025, and its code and data are fully open-sourced. Multiple models and methods have already been evaluated on EMMA, and the study finds that even the most advanced models, such as Gemini-2.5-pro-exp-03-25, or the o3/o4-mini models capable of visual tool use, still trail human experts on EMMA by more than 20%!

Title: Can MLLMs Reason in Multi ...
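As a hedged worked version of the opening example, assume +Q sits a distance d from each of the other two charges (the problem statement only says "equal distances", so the exact geometry is an assumption). By superposition, the net force on +Q is the attraction toward -2Q plus the repulsion directed away from +3Q; the failure described above amounts to flipping the direction of the second term.

```latex
\vec{F}_{+Q}
  = \underbrace{\frac{k\,(2Q)\,Q}{d^{2}}\,\hat{r}_{+Q \to -2Q}}_{\text{attraction toward } -2Q}
  \;+\;
  \underbrace{\frac{k\,(3Q)\,Q}{d^{2}}\,\hat{r}_{+3Q \to +Q}}_{\text{repulsion away from } +3Q},
  \qquad k = \frac{1}{4\pi\varepsilon_{0}}
```

Both magnitudes scale as kQ²/d², so reversing the unit vector of the repulsive term, as the article says GPT-4o can do, yields a qualitatively wrong net-force direction even when every magnitude is computed correctly.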