Multimodal Large Models
AI Spontaneously Forms Human-Level Cognition! Chinese Scientists Reveal That Multimodal Large Models Develop Human-Like Object Concept Representations
Huan Qiu Wang· 2025-06-10 02:09
The researchers extracted 66 "mental dimensions" from massive amounts of large-model behavioral data and assigned semantic labels to each dimension. They found these dimensions to be highly interpretable and significantly correlated with neural activity patterns in the brain's category-selective regions (such as the FFA for faces, the PPA for scenes, and the EBA for bodies).

The study also compared multiple models' behavioral choice patterns with those of humans (human consistency). The results show that multimodal large models (such as Gemini_Pro_Vision and Qwen2_VL) achieve higher consistency. In addition, the study revealed that humans tend to combine visual features and semantic information when making decisions, whereas large models tend to rely on semantic labels and abstract concepts. The study indicates that large language models are not "stochastic parrots": they internally hold an understanding of real-world concepts that resembles that of humans.

The findings were published in Nature Machine Intelligence under the title "Human-like object concept representations emerge naturally in multimodal large language models". (Qingshan)

So, can large language models (LLMs) develop human-like object concept representations from language and multimodal data? Recently, the Chinese Academy of Sciences ...
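To make the "human consistency" comparison concrete, here is a minimal sketch assuming a simple setup: humans and a model answer the same triplet odd-one-out trials, and consistency is the fraction of trials on which the model picks the same odd item as the human majority. The task format, metric, and data are illustrative assumptions, not the paper's exact protocol.

```python
from collections import Counter

def human_consistency(human_votes, model_choices):
    """Fraction of trials on which the model's odd-one-out pick matches
    the majority human pick."""
    matches = 0
    for trial, votes in human_votes.items():
        majority_pick, _ = Counter(votes).most_common(1)[0]  # most frequent human answer
        if model_choices.get(trial) == majority_pick:
            matches += 1
    return matches / len(human_votes)

# Hypothetical triplet odd-one-out trials: each human vote names the item
# judged least similar to the other two.
human_votes = {
    ("cat", "dog", "hammer"): ["hammer", "hammer", "dog"],
    ("apple", "banana", "truck"): ["truck", "truck", "truck"],
}
model_choices = {
    ("cat", "dog", "hammer"): "hammer",
    ("apple", "banana", "truck"): "banana",
}
print(human_consistency(human_votes, model_choices))  # 0.5 on this toy data
```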
Shengshu Technology CEO Luo Yihang: From Models to Production, How Multimodal AI Makes Video Creation More Efficient
硬AI· 2025-06-09 14:07
Core Insights
- The article emphasizes that multimodal large models are at a critical turning point for large-scale production, driven by rapid technological iteration and strong industry demand [1][3][5]

Group 1: Industry Trends
- The rapid iteration of audio and video generation models has significantly improved effectiveness, speed, and cost [6][9]
- There is strong industry demand for video content production, addressing traditional pain points such as long cycles, high costs, and the need for specialized skills [7][9]
- The pace of industry adoption for video content-related applications is accelerating, with many sectors actively exploring and implementing solutions [7][9]

Group 2: Company Focus and Solutions
- The company focuses on four key conditions for scaling video generation: content creativity, quality, efficiency, and cost reduction, aiming for at least a hundredfold improvement in efficiency compared with traditional methods [3][9]
- The company's product, Vidu, has seen a 300% increase in professional creation since launch, with significant growth in generation volume, payments, and usage duration [4][22]
- Vidu 2.0 can generate content in just 5 seconds, with improvements in quality and features such as high-definition output and audio effects [10][11]

Group 3: Market Applications
- The company targets eight major industries and thirty application scenarios, focusing on professional and enterprise users [9][10]
- Applications of Vidu span sectors including internet advertising, animation, e-commerce, and education, with 80% of professional users in demanding scenarios [4][22]
- The company has collaborated with notable partners, including Sony Pictures, to produce high-quality promotional content efficiently [20][21]

Group 4: User Engagement and Community
- The Vidu platform has over 30 million users across more than 200 countries, with a strong community for sharing creative ideas and inspiration [11][12]
- Users engage in millions of creative expressions daily, contributing to a vibrant content-creation ecosystem [12][22]
- The company aims to empower users by enhancing their creative capabilities while maintaining efficiency and cost-effectiveness in content production [23]
Chinese Scientists Reveal the Concept-Representation Mechanism of Multimodal Large Models
Xin Hua She· 2025-06-09 09:32
Traditional AI research has focused on object-recognition accuracy but has rarely asked whether models truly "understand" what objects mean. He Huiguang said: "Current AI can tell pictures of cats and dogs apart, but the essential difference between this kind of 'recognition' and the way humans 'understand' cats and dogs remains to be revealed."

Starting from classic theories in cognitive neuroscience, the research team designed an innovative paradigm combining computational modeling, behavioral experiments, and brain science, and built a "concept map" of large AI models.

He Huiguang explained that the team extracted 66 "mental dimensions" from massive amounts of large-model behavioral data and assigned semantic labels to these dimensions. The study found these dimensions to be highly interpretable and significantly correlated with neural activity patterns in the brain's category-selective regions. The study also compared multiple models' behavioral choice patterns with those of humans, and the results show that multimodal large models are more consistent with humans. In addition, the study revealed that humans tend to combine visual features and semantic information when making decisions, whereas large models tend to rely on semantic labels and abstract concepts. The study indicates that large language models internally hold an understanding of real-world concepts similar to that of humans. (Reporter Song Chen)

The reporter learned on June 9 from the Institute of Automation of the Chinese Academy of Sciences that a joint team from the institute and the CAS Center for Excellence in Brain Science and Intelligence Technology published the study in Nature Machine Intelligence, confirming for the first time that multimodal large language models can spontaneously form object concept representation systems highly similar to those of humans, providing a new path for AI cognitive science and for building human-like ...
Why Do Floating Losses Run as Deep as the Sea, While Floating Gains Are Swallowed in One Gulp?
Ge Long Hui· 2025-06-09 01:34
Market Performance
- The Dow Jones Industrial Average surpassed 40,000 points for the first time in its history, with a weekly gain of 1.24% [1]
- The Nasdaq rose 2.1% to a new high, while the S&P 500 gained 1.5%, also reaching a new high [1]
- Technology stocks performed well, with Microsoft up 1.5%, Apple up 3.7%, and Nvidia up 2.9%, marking four consecutive weeks of gains [1]

Hong Kong Market Dynamics
- Despite the lack of significant foreign inflows, the Hong Kong stock market has risen, with the Hang Seng Index up 3.11% [3]
- The Hang Seng Tech Index increased 3.79%, outperforming U.S. markets [3]
- The rally in Hong Kong stocks is attributed to expectations that the dividend tax will be abolished, making the stocks more attractive to long-term investors [3]

Corporate Earnings
- Major internet companies such as Tencent, Baidu, and JD.com reported better-than-expected earnings, lifting their stock prices [3]
- Alibaba's earnings fell short of expectations, and its stock price declined [3]

U.S. Tariff Implications
- New U.S. tariffs will affect $18 billion worth of goods from China, including steel, aluminum, semiconductors, and batteries [6]
- The tariff on electric vehicles will increase from 25% to 100%, reflecting the rapid growth of China's automotive exports [6]

Export Data
- Belgium leads vehicle-export destinations from China with 175,437 units, an 11.3% increase year-on-year [7]
- The UK and the Philippines follow, with 125,314 and 115,423 vehicles respectively, both showing significant growth [7]

Investment Strategy
- The semiconductor and chip sectors are identified as cyclical industries with potential for significant returns, especially during bear markets [8]
- A long-term perspective on investments is emphasized; investors should focus on price trends rather than their entry costs [8]
Focus on Multimodality: With the ChatGPT Moment Yet to Arrive, Have Large Models "Slowed Down" in 2025?
Bei Jing Shang Bao· 2025-06-08 13:27
Core Insights
- The emergence of multimodal models such as Emu3 signals a shift in content generation, with the potential to understand and generate text, images, and videos through a single model [1][3]
- The rapid development of AI has led to a competitive landscape where new and existing products coexist, but the core capabilities of video generation still lag behind expectations [1][5]
- The commercial application of large models faces challenges, particularly in integrating visual generation with existing models, which limits scalability and effectiveness [7][8]

Multimodal Model Development
- Emu3, released by Zhiyuan Research Institute, is a native multimodal model that incorporates multiple data types from the beginning of training, unlike traditional models that focus on language first [3][4]
- The current learning path for multimodal models often leads to a decline in performance as they move from strong language capabilities to integrating other modalities [3][4]
- The development of multimodal models is still in its early stages, with significant technical challenges remaining, particularly in filtering effective information from diverse data types [3][4]

Video Generation Challenges
- Video generation technology is currently in a transitional phase, comparable to the evolution from GPT-2 to GPT-3, indicating substantial room for improvement [5][6]
- Key issues in video generation include narrative coherence, stability, and controllability, which are essential for producing high-quality content [6]
- The industry is awaiting a breakthrough moment akin to the "ChatGPT moment" to lift video generation capabilities [6]

Commercialization and Market Growth
- The multimodal AI market is projected to reach $2.4 billion in 2024, with a compound annual growth rate (CAGR) exceeding 28%, and is expected to grow to $128 billion by 2025, reflecting a CAGR of 62.3% from 2023 to 2025 [8]
- The integration of traditional computer vision models with large models is seen as a potential pathway for commercial applications, contingent on achieving a favorable cost-benefit ratio [7][8]
- Companies are evolving their service models from providing platforms (PaaS) to offering tools (SaaS) and, by 2025, to delivering results directly to users [8]
Multimodal Models Take On Beijing and Hangzhou Subway Maps! o3 Scores Notably Well, but Still Trails Humans
量子位· 2025-06-07 05:02
Contributed by the ReasonMap team to 量子位 (QbitAI official account)

In recent years, large language models (LLMs) and multimodal large models (MLLMs) have made breakthrough progress in scene understanding and complex reasoning tasks. Yet a key question is still worth asking: can multimodal large models (MLLMs) really "read" images? In particular, when faced with structurally complex, detail-dense images, do they possess fine-grained visual understanding and spatial reasoning abilities, for example when challenged with a high-resolution subway map?

To answer this, a team from Westlake University, the National University of Singapore, Zhejiang University, and Huazhong University of Science and Technology proposed a new evaluation benchmark, ReasonMap. The Beijing and Hangzhou subway maps clearly stumped a large number of models. ReasonMap is the first multimodal reasoning benchmark focused on high-resolution transit maps (mainly subway maps), designed specifically to evaluate large models' ability to understand fine-grained structured spatial information in images.

The results show that mainstream open-source multimodal models face a clear performance bottleneck on ReasonMap, often exhibiting visual confusion or missing stations, especially in cross-line route planning. Closed-source reasoning models post-trained with reinforcement learning (such as GPT-o3) significantly outperform existing open-source models on multiple dimensions, but still fall clearly short of human performance. On subway maps from different countries and regions, four representative MLLMs (Qwen2.5-VL-72B-I (blue), I ...
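As a rough illustration of what cross-line route planning on a subway map involves, the sketch below scores a model's predicted station sequence against a ground-truth route, reporting how many reference stations were hit and which were missed. The scoring rule, station names, and data are hypothetical and are not ReasonMap's actual metric.

```python
def score_route(predicted, reference):
    """Compare a predicted station sequence against the reference route.

    Returns the fraction of reference stations that appear in the prediction,
    plus the stations that were missed, a crude proxy for station-omission errors.
    """
    predicted_set = set(predicted)
    hit = sum(1 for station in reference if station in predicted_set)
    missed = [station for station in reference if station not in predicted_set]
    return hit / len(reference), missed

# Hypothetical cross-line route: board one line, transfer once, arrive on another.
reference = ["Central Station", "East Gate", "City Library", "Airport North"]
predicted = ["Central Station", "City Library", "Airport North"]  # one stop skipped

recall, missed = score_route(predicted, reference)
print(recall, missed)  # 0.75 ['East Gate']
```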
Foresight 2025: A Panorama of China's Multimodal Large Model Industry in 2025 (with Market Status, Competitive Landscape, Development Trends, and More)
Sou Hu Cai Jing· 2025-06-06 14:09
Major listed companies in the industry: Alibaba (09988.HK, BABA.US); Baidu (09888.HK, BIDU.US); Tencent (00700.HK, TCEHY); iFLYTEK (002230.SZ); Wondershare Technology (300624.SZ); 360 Security Technology (601360.SH); Kunlun Tech (300418.SZ); CloudWalk Technology (688327.SH); TRS Information Technology (300229.SZ); and others.

Core data in this article: number of model filings; pricing models; market size; regional share; etc.

Industry Overview

1. Definition and Characteristics

Multimodality refers to methods and techniques for integrating and processing two or more different types of information or data. In machine learning and artificial intelligence, the data types involved typically include, but are not limited to, text, images, video, audio, and sensor data. The goal of a multimodal system is to use information from multiple modalities to improve task performance, provide a richer user experience, or obtain more comprehensive data-analysis results. Multimodal Large Language Models (abbreviated ...
Ten-Thousand-Frame Video Understanding on a Single GPU! Zhiyuan Research Institute Open-Sources Video-XL-2, a Lightweight Model for Ultra-Long Video Understanding
量子位· 2025-06-04 05:21
Core Viewpoint
- The article discusses the release of Video-XL-2, a new-generation long-video understanding model developed by Zhiyuan Research Institute in collaboration with Shanghai Jiao Tong University, which significantly enhances the capabilities of open-source models in processing and understanding long video content [1][3]

Technical Overview
- Video-XL-2 is designed with three core components: a Visual Encoder, Dynamic Token Synthesis (DTS), and a Large Language Model (LLM); a minimal pipeline sketch follows this summary [4][6]
- The model uses SigLIP-SO400M as the visual encoder to process video frames into high-dimensional visual features, which are then fused and compressed by the DTS module to extract semantic dynamic information [6][11]
- The training strategy involves a four-stage progressive design to build robust long-video understanding capabilities [8][10]

Performance Improvements
- Video-XL-2 shows superior performance on long-video understanding tasks, reaching leading levels among open-source models on benchmarks such as MLVU, Video-MME, and LVBench [9][15]
- The model can efficiently process videos of up to 10,000 frames on a single high-performance GPU, significantly extending the video lengths it can handle [19][23]
- It can encode 2,048 frames of video in just 12 seconds, demonstrating remarkable speed and efficiency [24][28]

Application Potential
- Video-XL-2 has strong application potential in real-world scenarios, including film content analysis, plot understanding, and anomaly detection in surveillance video [28][30]
- Specific examples include answering questions about movie scenes and detecting unexpected events in surveillance footage [30][32]
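To make the three-component design concrete, here is a minimal PyTorch-style sketch of a visual-encoder, token-compression, LLM-projection pipeline of the kind described above. The module names, dimensions, and pooling-based compression are illustrative assumptions, not Video-XL-2's actual implementation.

```python
import torch
import torch.nn as nn

class DynamicTokenSynthesis(nn.Module):
    """Toy stand-in for a DTS-style module: let per-frame tokens exchange
    information, then compress them along the time axis before the LLM."""
    def __init__(self, dim, compress_ratio=4):
        super().__init__()
        self.mixer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.compress_ratio = compress_ratio

    def forward(self, frame_tokens):              # (batch, frames, dim)
        mixed = self.mixer(frame_tokens)
        b, t, d = mixed.shape
        t = t - t % self.compress_ratio           # drop a ragged tail for simplicity
        mixed = mixed[:, :t].reshape(b, t // self.compress_ratio, self.compress_ratio, d)
        return mixed.mean(dim=2)                  # average-pool groups of frames

class LongVideoPipeline(nn.Module):
    """Visual encoder -> DTS-style compression -> projection into LLM space."""
    def __init__(self, vision_encoder, vision_dim=1152, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder      # e.g. a SigLIP-style image encoder
        self.dts = DynamicTokenSynthesis(vision_dim)
        self.projector = nn.Linear(vision_dim, llm_dim)

    def forward(self, frames):                    # (batch, frames, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.vision_encoder(frames.flatten(0, 1))  # one feature per frame
        feats = feats.view(b, t, -1)
        compressed = self.dts(feats)              # far fewer tokens than input frames
        return self.projector(compressed)         # ready to prepend to LLM text tokens

# Smoke test with a dummy per-frame encoder standing in for SigLIP-SO400M.
dummy_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 1152))
pipe = LongVideoPipeline(dummy_encoder)
video = torch.randn(1, 64, 3, 32, 32)             # 64 low-resolution frames
print(pipe(video).shape)                           # torch.Size([1, 16, 4096])
```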
See You This Sunday! Final Call for Registration for the CVPR 2025 Beijing Paper-Sharing Session
机器之心· 2025-06-03 08:57
A few days ago, at its I/O 2025 conference, Google officially released Veo 3, its latest-generation AI video generation model, which produces high-quality video while achieving synchronized audio and visuals for the first time. The striking results drew high praise, with some calling it "an era-defining product no less significant than OpenAI's Sora," marking AI video's entry into a true "audio era."

This shows that, although existing large models in the AI community are already impressive, architectural innovation and investment in compute clusters keep producing new advances. In video generation, for example, the progress from silent output to synchronized audio is substantial; in multimodality, the field is gradually moving toward unifying understanding and generation.

To give practitioners a comprehensive view of the latest innovations and trends emerging in the AI community, 机器之心 plans to hold a "CVPR 2025 Paper-Sharing Session" in Beijing on June 8, inviting top experts and paper authors to exchange ideas with on-site attendees around hot topics such as multimodality and video generation.

As one of the most important international conferences in computer vision, CVPR carries great weight and attracts a large number of research institutions and universities every year. This year, CVPR 2025 received 13,008 paper submissions and accepted 2,878, for an overall acceptance rate of 22.1%.

As an event created for AI talent in China, this paper-sharing session ...
The State of Core Technologies in China's Multimodal Large Model Industry in 2025: Representation, Translation, Alignment, Fusion, and Co-Learning Are Key [Charts]
Qian Zhan Wang· 2025-06-03 05:12
Core Insights
- The article discusses the core technologies of multimodal large models, focusing on representation learning, translation, alignment, fusion, and collaborative learning [1][2][7][11][14]

Representation Learning
- Representation learning is fundamental for multimodal tasks, addressing challenges such as combining heterogeneous data and handling varying noise levels across modalities [1]
- Before the advent of Transformers, different modalities required distinct representation-learning models, such as CNNs for computer vision (CV) and LSTMs for natural language processing (NLP) [1]
- The emergence of Transformers has enabled the unification of multiple modalities and cross-modal tasks, leading to a surge in multimodal pre-training models after 2019 [1]

Translation
- Cross-modal translation aims to map source modalities to target modalities, such as generating descriptive sentences from images or vice versa [2]
- Syntactic templates allow for structured predictions, where specific words are filled in based on detected attributes [2]
- Encoder-decoder architectures encode source-modality data into latent features, which are then decoded to generate the target modality [2]

Alignment
- Alignment is crucial in multimodal learning, focusing on establishing correspondences between different data modalities to better understand complex scenarios [7]
- Explicit alignment categorizes instances with multiple components and measures similarity, using both unsupervised and supervised methods [7][8]
- Implicit alignment leverages latent representations for tasks without strict alignment, improving performance in applications such as visual question answering (VQA) and machine translation [8]

Fusion
- Fusion combines multimodal data or features for unified analysis and decision-making, improving task performance by integrating information from multiple modalities; a minimal fusion sketch follows this list [11]
- Early fusion merges features at the feature level, while late fusion combines outputs at the decision level; hybrid fusion incorporates both approaches [11][12]
- The choice of fusion method depends on the task and data, with neural networks becoming a popular approach to multimodal fusion [12]

Collaborative Learning
- Collaborative learning uses data from one modality to improve the model of another, and is categorized into parallel, non-parallel, and hybrid methods [14][15]
- Parallel learning requires direct associations between observations from different modalities, while non-parallel learning relies on overlapping categories [15]
- Hybrid methods connect modalities through shared datasets, allowing one modality to influence the training of another, and are applicable across various tasks [15]
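As a rough illustration of the early-versus-late fusion distinction summarized above, the sketch below fuses a toy image feature and a toy text feature either by concatenating the features before a single classifier (early, feature-level) or by averaging two per-modality classifiers' logits (late, decision-level). Dimensions and modules are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Concatenate modality features, then classify the joint vector."""
    def __init__(self, img_dim=512, txt_dim=256, num_classes=10):
        super().__init__()
        self.classifier = nn.Linear(img_dim + txt_dim, num_classes)

    def forward(self, img_feat, txt_feat):
        return self.classifier(torch.cat([img_feat, txt_feat], dim=-1))

class LateFusion(nn.Module):
    """Classify each modality separately, then combine the decisions."""
    def __init__(self, img_dim=512, txt_dim=256, num_classes=10):
        super().__init__()
        self.img_head = nn.Linear(img_dim, num_classes)
        self.txt_head = nn.Linear(txt_dim, num_classes)

    def forward(self, img_feat, txt_feat):
        # Decision-level fusion: average the per-modality logits.
        return (self.img_head(img_feat) + self.txt_head(txt_feat)) / 2

img, txt = torch.randn(4, 512), torch.randn(4, 256)   # a batch of 4 toy feature pairs
print(EarlyFusion()(img, txt).shape, LateFusion()(img, txt).shape)  # both (4, 10)
```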