Multimodal Large Models

See you this Sunday! Last call to register for the CVPR 2025 Beijing Paper Sharing Session
机器之心· 2025-06-03 08:57
A few days ago, at its I/O 2025 event, Google officially released Veo 3, its latest-generation AI video generation model, which for the first time achieves synchronized audio and visuals while generating high-quality video. Impressed by Veo 3's results, some have praised it as "a cross-era product no less significant than OpenAI's Sora," marking AI video's entry into a true "era of sound." This shows that even though the large models the AI community already has are impressive enough, architectural innovation and investment in compute clusters continue to produce new breakthroughs: video generation, for example, has advanced markedly from its initial silent form to today's sound-enabled one, while the multimodal field is gradually evolving toward unifying understanding and generation. To give practitioners a comprehensive view of the latest innovations and trends emerging in the AI community, 机器之心 plans to hold a "CVPR 2025 Paper Sharing Session" in Beijing on June 8, inviting top experts, paper authors, and on-site attendees to exchange ideas around hot topics such as multimodality and video generation. As one of the most important international conferences in computer vision, CVPR is highly regarded and attracts a large number of research institutions and universities every year. This year, CVPR 2025 received 13,008 paper submissions and accepted 2,878, for an overall acceptance rate of 22.1%. As an event created for AI talent in China, this paper sharing session ...
The State of Core Technologies in China's Multimodal Large Model Industry in 2025: The Keys Are Representation, Translation, Alignment, Fusion, and Collaborative Learning [Charts]
Qian Zhan Wang· 2025-06-03 05:12
Core Insights
- The article discusses the core technologies of multimodal large models, focusing on representation learning, translation, alignment, fusion, and collaborative learning [1][2][7][11][14].

Representation Learning
- Representation learning is fundamental for multimodal tasks, addressing challenges such as combining heterogeneous data and handling varying noise levels across different modalities [1].
- Prior to the advent of Transformers, different modalities required distinct representation learning models, such as CNNs for computer vision (CV) and LSTMs for natural language processing (NLP) [1].
- The emergence of Transformers has enabled the unification of multiple modalities and cross-modal tasks, leading to a surge in multimodal pre-training models post-2019 [1].

Translation
- Cross-modal translation aims to map source modalities to target modalities, such as generating descriptive sentences from images or vice versa [2].
- The use of syntactic templates allows for structured predictions, where specific words are filled in based on detected attributes [2].
- Encoder-decoder architectures are employed to encode source modality data into latent features, which are then decoded to generate the target modality [2].

Alignment
- Alignment is crucial in multimodal learning, focusing on establishing correspondences between different data modalities to enhance understanding of complex scenarios [7].
- Explicit alignment involves categorizing instances with multiple components and measuring similarity, utilizing both unsupervised and supervised methods [7][8].
- Implicit alignment leverages latent representations for tasks without strict alignment, improving performance in applications like visual question answering (VQA) and machine translation [8].

Fusion
- Fusion combines multimodal data or features for unified analysis and decision-making, enhancing task performance by integrating information from various modalities [11].
- Early fusion merges features at the feature level, while late fusion combines outputs at the decision level, with hybrid fusion incorporating both approaches [11][12].
- The choice of fusion method depends on the task and data, with neural networks becoming a popular approach for multimodal fusion [12].

Collaborative Learning
- Collaborative learning utilizes data from one modality to enhance the model of another modality, categorized into parallel, non-parallel, and hybrid methods [14][15].
- Parallel learning requires direct associations between observations from different modalities, while non-parallel learning relies on overlapping categories [15].
- Hybrid methods connect modalities through shared datasets, allowing one modality to influence the training of another, applicable across various tasks [15].
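The early/late fusion distinction above can be shown in a toy sketch; the feature dimensions, random linear "classifiers," and probability-averaging rule here are illustrative assumptions, not any particular system's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy per-modality features: a pooled image vector and a pooled text vector.
img_feat = rng.standard_normal(4)
txt_feat = rng.standard_normal(3)

# Illustrative linear "classifiers" over 2 classes (random weights).
W_early = rng.standard_normal((2, 7))  # operates on concatenated features
W_img   = rng.standard_normal((2, 4))  # image-only model
W_txt   = rng.standard_normal((2, 3))  # text-only model

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Early fusion: concatenate features first, then run one joint model.
early_scores = softmax(W_early @ np.concatenate([img_feat, txt_feat]))

# Late fusion: run one model per modality, then combine at the decision
# level (here by averaging the per-modality class probabilities).
late_scores = (softmax(W_img @ img_feat) + softmax(W_txt @ txt_feat)) / 2

print("early fusion probs:", early_scores)
print("late fusion probs: ", late_scores)
```

A hybrid scheme would simply combine both: fuse some features early while still merging per-modality decisions late.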
10,000 frames? A single GPU! Zhiyuan Institute open-sources Video-XL-2, a lightweight model for ultra-long video understanding
机器之心· 2025-06-03 04:06
Core Viewpoint
- The article discusses the release of Video-XL-2, a new-generation long video understanding model developed by Zhiyuan Institute in collaboration with Shanghai Jiao Tong University, which significantly enhances the capabilities of multimodal large models in understanding long video content [2][6].

Technical Overview
- Video-XL-2 consists of three core components: Visual Encoder, Dynamic Token Synthesis (DTS), and Large Language Model (LLM) [3].
- The model uses SigLIP-SO400M as the visual encoder to process video frames into high-dimensional visual features, which are then fused and compressed by the DTS module to extract semantic dynamic information [3].
- The training strategy involves a four-stage progressive training design to build strong long video understanding capabilities, utilizing image/video-text pairs and large-scale high-quality datasets [4].

Performance Metrics
- Video-XL-2 outperforms existing lightweight open-source models on mainstream long video evaluation benchmarks such as MLVU, Video-MME, and LVBench, achieving state-of-the-art performance [11].
- The model can efficiently process videos of up to 10,000 frames on a single high-performance GPU, significantly extending the length of videos it can handle compared to previous models [16].
- Video-XL-2 encodes 2048 frames of video in just 12 seconds, showcasing its superior processing speed and efficiency [19].

Efficiency Innovations
- The model incorporates a chunk-based pre-filling strategy to reduce computational costs and memory usage by dividing long videos into segments [8].
- A bi-granularity key-value (KV) decoding mechanism allows the model to selectively load dense or sparse KVs based on task requirements, enhancing decoding efficiency [8].

Application Potential
- Video-XL-2 demonstrates high application potential in various scenarios, including film plot question answering, surveillance anomaly detection, and content summarization for films and game live streams [20][22].
- The model's advanced video understanding capabilities provide effective support for complex video analysis needs in real-world applications [20].
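The chunk-based pre-filling idea described above can be sketched abstractly: process the frame sequence one chunk at a time so that only a single chunk of raw features is live at once, while a compact per-chunk summary accumulates for later decoding. The chunk size, the `encode_chunk` stand-in, and the summary format below are hypothetical simplifications, not Video-XL-2's actual implementation.

```python
CHUNK_SIZE = 256  # hypothetical chunk length in frames

def encode_chunk(frames):
    """Stand-in encoder: summarize a chunk as (frame count, mean feature).
    A real system would emit compressed KV entries here instead."""
    return (len(frames), sum(frames) / len(frames))

def chunked_prefill(frames, chunk_size=CHUNK_SIZE):
    kv_cache = []  # compact summaries, one per chunk
    for start in range(0, len(frames), chunk_size):
        # Only this slice of raw frames is materialized at a time,
        # which is what bounds peak memory during pre-fill.
        chunk = frames[start:start + chunk_size]
        kv_cache.append(encode_chunk(chunk))
    return kv_cache

# Toy "video": 10,000 frames, each represented by a scalar feature.
video = list(range(10_000))
cache = chunked_prefill(video)
print(len(cache))  # 40 chunk summaries instead of 10,000 raw entries
```

A bi-granularity decoder in this picture would choose, per task, whether to reload a chunk's dense features or make do with its sparse summary.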
The State of Models in China's Multimodal Large Model Industry in 2025: Image, Video, Audio, 3D and Other Models Will Eventually Be Connected and Fused [Charts]
Qian Zhan Wang· 2025-06-01 05:09
Core Insights
- The exploration of multimodal large models is making gradual progress, with a focus on breakthroughs in visual modalities, aiming for an "Any-to-Any" model that requires successful pathways across various modalities [1]
- The industry is currently concentrating on enhancing perception and generation models in image, video, and 3D modalities, with the goal of achieving cross-modal integration and sharing [1]

Multimodal Large Models in Image
- Prior to the rise of LLMs in 2023, the industry had already established a solid foundation in image understanding and generation, resulting in models like CLIP, Stable Diffusion, and GAN, which led to applications such as Midjourney and DALL·E [2]
- The industry is actively exploring the integration of Transformer models into image-related tasks, with significant outcomes including GLIP, SAM, and GPT-V [2]

Multimodal Large Models in Video
- Video generation is being approached by transferring image generation models to video, utilizing image data for training and aligning temporal dimensions to achieve text-to-video results [5]
- Recent advancements include models like VideoLDM and Sora, which demonstrate significant breakthroughs in video generation using the Diffusion Transformer architecture [5]

Multimodal Large Models in 3D
- The generation of 3D models is being explored by extending 2D image generation methods, with key models such as 3D GAN, MeshDiffusion, and Instant3D emerging in the industry [8][9]
- 3D data representation includes various formats like meshes, point clouds, and NeRF, with NeRF being a critical technology for 3D data representation [9]

Multimodal Large Models in Audio
- AI technologies related to audio have matured, with recent applications of Transformer models enhancing audio understanding and generation, exemplified by projects like Whisper large-v3 and VALL-E [11]
- The evolution of speech technology is categorized into three stages, with a focus on enhancing generalization capabilities across multiple languages and tasks [11]
360 open-sources a high-quality image-text alignment dataset! With 12 million images and 10 million groups of fine-grained negative samples, it lets models say goodbye to "image-text mismatch"
量子位· 2025-05-31 03:45
Contributed by the FineHARD team | 量子位 QbitAI

How can a CLIP model be made to focus more on fine-grained feature learning and avoid "nearsightedness"? The 360 AI Research team proposed FG-CLIP, which noticeably alleviates CLIP's "visual myopia" problem, letting the model attend to correct detail-level descriptions rather than more global but incorrect ones. The key to the model's success is high-quality data. Recently, Dr. Leng Dawei's team open-sourced this "secret recipe": FineHARD, a high-quality image-text alignment dataset. The dataset emphasizes two core features: fine granularity + hard negative samples. FineHARD is the high-quality image-text alignment dataset behind the FG-CLIP model, characterized by its scale and precision: it contains 12 million images with corresponding long and short captions, covering 40 million bounding boxes, each accompanied by a Fine-Grained Regional Description. In addition, FineHARD innovatively introduces 10 million groups of Hard Fine-grained Negative Samples; these algorithmically filtered distractor samples effectively improve the model's ability to distinguish between similar targets. FG-CLIP, trained on this dataset, has been accepted to ICML 2025 and significantly outperforms the original CLIP and other state-of-the-art ...
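Why do hard negatives help? A minimal contrastive-learning sketch makes it visible: a caption that is similar to the image but wrong contributes a much larger loss (and thus a stronger training signal) than an unrelated caption. The toy embeddings and the InfoNCE-style loss below are generic illustrations, not FG-CLIP's actual objective or data.

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

def contrastive_loss(img, pos_txt, neg_txts, temperature=0.07):
    """InfoNCE-style loss for one image: pull the matching caption
    toward the image embedding, push negative captions away."""
    img = normalize(img)
    sims = np.array([normalize(t) @ img for t in [pos_txt] + list(neg_txts)])
    logits = sims / temperature
    logits -= logits.max()                       # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                     # index 0 is the positive pair

# Hand-built toy embeddings (hypothetical, for illustration only).
img      = np.array([1.0, 0.0, 0.0, 0.0])   # image embedding
pos      = np.array([0.98, 0.1, 0.0, 0.0])  # matching caption
easy_neg = np.array([0.0, 1.0, 0.0, 0.0])   # unrelated caption (orthogonal)
hard_neg = np.array([0.9, 0.3, 0.3, 0.0])   # similar-but-wrong caption

loss_easy = contrastive_loss(img, pos, [easy_neg])
loss_hard = contrastive_loss(img, pos, [hard_neg])
print(loss_easy < loss_hard)  # True: the hard negative is more informative
```

In this light, FineHARD's 10 million algorithmically filtered hard negatives act as a steady supply of these high-signal training pairs, which is what pushes the model toward fine-grained distinctions.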
Yuncong Technology's multimodal large model tops the OpenCompass global multimodal leaderboard
news flash· 2025-05-29 07:12
Core Insights - Yuncong Technology's self-developed model, Congrong, has achieved the top position in the latest global multimodal ranking on the OpenCompass platform with a score of 80.7 [1] - The model excels in various professional fields, including medical health, mathematical logic, and art design, demonstrating strong performance across eight core datasets encompassing visual perception, cognitive understanding, and cross-domain applications [1]
Analysis of China's Multimodal Large Model Industry in 2025: Market Size, Value Chain, Competitive Landscape, and Development Trends: Applications Will Become More Diverse and Deeper, with Ever-Broader Prospects [Chart]
Chan Ye Xin Xi Wang· 2025-05-29 01:47
Core Insights
- The multi-modal large model market in China is projected to reach 15.63 billion yuan in 2024, an increase of 6.54 billion yuan from 2023, and is expected to grow to 23.48 billion yuan in 2025, indicating strong market demand and government support [1][6][19]

Multi-Modal Large Model Industry Definition and Classification
- Multi-modal large models are AI systems capable of processing and understanding various data forms, including text, images, audio, and video, using deep learning technologies like the Transformer architecture [2][4]

Industry Development History
- The multi-modal large model industry has evolved through several stages: task-oriented phase, visual-language pre-training phase, and the current multi-modal large model phase, focusing on enhancing cross-modal understanding and generation capabilities [4]

Current Industry Status
- The multi-modal large model industry has gained significant attention due to its data processing capabilities and diverse applications, with a market size projected to grow substantially in the coming years [6][19]

Application Scenarios
- The largest application share of multi-modal large models is in the digital human sector at 24%, followed by gaming and advertising at 13% each, and smart marketing and social media at 10% each [8]

Industry Value Chain
- The industry value chain consists of upstream components like AI chips and hardware, midstream multi-modal large models, and downstream applications across various sectors including education, gaming, and public services [10][12]

Competitive Landscape
- Major players in the multi-modal large model space include institutions and companies like the Chinese Academy of Sciences, Huawei, Baidu, Tencent, and Alibaba, with various models being developed to optimize training costs and enhance capabilities [16][17]

Future Development Trends
- The multi-modal large model industry is expected to become more intelligent and humanized, providing richer and more personalized user experiences, with applications expanding across various fields such as finance, education, and content creation [19]
Daily Market Observation - 2025-05-28
Caida Securities· 2025-05-28 13:47
Market Overview
- On May 27, the market experienced a downward trend, with the Shanghai Composite Index falling by 0.18%, the Shenzhen Component down by 0.61%, and the ChiNext Index decreasing by 0.68% [2]
- The total trading volume on May 28 was 1.02 trillion CNY, a decrease of approximately 10 billion CNY compared to the previous trading day [1]

Industry Performance
- More than half of the industries saw declines, with textiles, pharmaceuticals, beauty, and care sectors showing slight increases, while non-ferrous metals, electronics, automotive, and machinery sectors had the largest declines [1]
- Over 80% of stocks had price fluctuations limited to within 3% [1]

Economic Indicators
- From January to April, the total profit of industrial enterprises above designated size reached 2,117.02 billion CNY, reflecting a year-on-year growth of 1.4% [5]
- State-owned enterprises reported a profit of 702.28 billion CNY, down 4.4% year-on-year, while private enterprises saw a profit increase of 4.3% to 570.68 billion CNY [5]

Capital Flow
- On May 27, the Shanghai Stock Exchange saw a net inflow of 1.027 billion CNY, while the Shenzhen Stock Exchange experienced a net outflow of 2.048 billion CNY [4]
- The top three sectors for capital inflow were chemical pharmaceuticals, computer equipment, and agricultural chemicals, while the largest outflows were from industrial metals, semiconductors, and passenger vehicles [4]

Currency and Trade
- The recent appreciation of the RMB against the USD has been noted, while it remains weaker against other major currencies, indicating ongoing uncertainties in the market due to the trade environment [1]

Fund Dynamics
- Sixteen out of the first batch of 26 new floating-rate funds have begun issuance, with a focus on performance-based fee structures that align the interests of fund managers and investors [12][13]
Roundup: Daily Tech News Express (May 28)
news flash· 2025-05-27 23:27
New Energy Vehicles
- BYD's blade battery has passed the new national standard ahead of schedule [2]
- As of May 26, Xiaomi's SU7 Ultra has achieved a locked order volume of 23,000 units [2]
- Changan Automobile's chairman predicts that within two years, industry competition will return to a healthier environment [2]

Integrated Circuits (Chips)
- Samsung plans to launch a glass interposer by 2028 [2]
- Samsung is restructuring its HBM team to focus on customized HBM [2]
- Samsung will stop accepting multi-layer NAND orders after June [2]
- TSMC will produce MicroLED-based optical communication interconnect products [2]
- TSMC is establishing a European chip design center in Munich, Germany [2]

Artificial Intelligence
- Shanghai has launched its first multimodal large model in the transportation sector, which is expected to improve intersection traffic efficiency by 15% [2]
- Nvidia's supplier has resolved rack overheating issues and has begun shipping Blackwell chips [2]
- Tencent Cloud has launched the data accelerator GooseFS 2.0, providing comprehensive support for all AI business scenarios [2]

Other
- Meituan clarifies that "at all costs" refers to combating internal competition [2]
- Salesforce plans to acquire Informatica for $8 billion [2]
- Canalys forecasts a moderate 3% growth in the African smartphone market in 2025 [2]
- Douyin is trialing new regulations that may classify "unboxing" events as controversial [2]
- Texas Governor signs a law requiring age verification for Apple and Google app stores [2]
- Elon Musk's Neuralink has raised $600 million in a deal valuing the brain-computer interface company at $9 billion [2]
- Apple plans to release dedicated video game applications for its devices to enhance its influence in the gaming industry [2]
Full schedule announced | After Google's stunning Veo 3 release, this CVPR sharing session is worth a listen for every AI practitioner
机器之心· 2025-05-27 06:38