Multimodal Large Models
Ten thousand frames? A single GPU! Zhiyuan Institute open-sources Video-XL-2, a lightweight ultra-long video understanding model
机器之心· 2025-06-03 04:06
Core Viewpoint
- The article covers the release of Video-XL-2, a new-generation long-video understanding model developed by Zhiyuan Institute in collaboration with Shanghai Jiao Tong University, which significantly enhances the ability of multimodal large models to understand long video content [2][6]

Technical Overview
- Video-XL-2 consists of three core components: a visual encoder, a Dynamic Token Synthesis (DTS) module, and a large language model (LLM) [3]
- The model uses SigLIP-SO400M as the visual encoder to process video frames into high-dimensional visual features, which the DTS module then fuses and compresses to extract semantic dynamic information [3]
- Training follows a four-stage progressive design to build strong long-video understanding capabilities, using image/video-text pairs and large-scale high-quality datasets [4]

Performance Metrics
- Video-XL-2 outperforms existing lightweight open-source models on mainstream long-video benchmarks such as MLVU, Video-MME, and LVBench, achieving state-of-the-art performance [11]
- The model can efficiently process videos of up to 10,000 frames on a single high-performance GPU, far exceeding the video lengths previous models could handle [16]
- Video-XL-2 encodes 2,048 frames of video in just 12 seconds, demonstrating its processing speed and efficiency [19]

Efficiency Innovations
- A chunk-based pre-filling strategy divides long videos into segments to reduce computational cost and memory usage (sketched below) [8]
- A bi-granularity key-value (KV) decoding mechanism lets the model selectively load dense or sparse KVs depending on task requirements, improving decoding efficiency [8]

Application Potential
- Video-XL-2 shows strong application potential in scenarios such as film plot question answering, surveillance anomaly detection, and content summarization for films and game live streams [20][22]
- Its advanced video understanding capabilities provide effective support for complex video analysis needs in real-world applications [20]
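A minimal sketch of the chunk-based pre-filling idea described above, assuming a generic encoder-plus-cache setup rather than the released Video-XL-2 code: frames are encoded segment by segment so peak memory is bounded by the chunk size instead of the full video length. The `encode_chunk` function and cache layout are illustrative placeholders, not the model's actual API.

```python
# A toy illustration of chunk-based pre-filling for long videos.
# Encoder and cache names are hypothetical placeholders.
import torch

def encode_chunk(frames: torch.Tensor) -> torch.Tensor:
    """Stand-in for the visual encoder + DTS compression of one chunk."""
    # here just a toy per-frame pooling; a real encoder would produce richer tokens
    return frames.mean(dim=(-1, -2))            # (chunk, channels)

def prefill_long_video(frames: torch.Tensor, chunk_size: int = 256) -> torch.Tensor:
    """Pre-fill a KV-style cache chunk by chunk instead of all at once."""
    kv_cache = []                                # accumulated compressed tokens
    for start in range(0, frames.shape[0], chunk_size):
        chunk = frames[start:start + chunk_size]
        with torch.no_grad():                    # inference-only pre-filling
            kv_cache.append(encode_chunk(chunk))
    return torch.cat(kv_cache, dim=0)            # (num_frames, channels)

# toy usage: 10,000 fake frames of 3x32x32 pixels, processed 256 at a time
video = torch.randn(10_000, 3, 32, 32)
cache = prefill_long_video(video)
print(cache.shape)                               # torch.Size([10000, 3])
```

In the real model the per-chunk features would feed the LLM's KV cache; the point of the sketch is only the segment-wise loop that keeps peak memory flat as frame counts grow into the thousands.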
2025 China Multimodal Large Model Industry Model Landscape: Image, Video, Audio, 3D and Other Models Will Ultimately Be Connected and Fused [Charts]
Qian Zhan Wang· 2025-06-01 05:09
Core Insights
- The exploration of multimodal large models is making gradual progress, with a focus on breakthroughs in visual modalities, aiming for an "Any-to-Any" model that requires successful pathways across the various modalities [1]
- The industry is currently concentrating on enhancing perception and generation models in the image, video, and 3D modalities, with the goal of achieving cross-modal integration and sharing [1]

Multimodal Large Models in Image
- Prior to the rise of LLMs in 2023, the industry had already established a solid foundation in image understanding and generation, producing models such as CLIP, Stable Diffusion, and GANs, which led to applications such as Midjourney and DALL·E [2]
- The industry is actively exploring the integration of Transformer models into image-related tasks, with significant outcomes including GLIP, SAM, and GPT-4V [2]

Multimodal Large Models in Video
- Video generation is being approached by transferring image generation models to video, using image data for training and aligning the temporal dimension to achieve text-to-video results (see the sketch after this entry) [5]
- Recent advancements include models such as VideoLDM and Sora, which demonstrate significant breakthroughs in video generation using the Diffusion Transformer architecture [5]

Multimodal Large Models in 3D
- 3D model generation is being explored by extending 2D image generation methods, with key models such as 3D GAN, MeshDiffusion, and Instant3D emerging in the industry [8][9]
- 3D data representations include formats such as meshes, point clouds, and NeRF, with NeRF serving as a critical technology for 3D data representation [9]

Multimodal Large Models in Audio
- AI technologies for audio have matured, with recent applications of Transformer models enhancing audio understanding and generation, exemplified by projects such as Whisper large-v3 and VALL-E [11]
- The evolution of speech technology is categorized into three stages, with a focus on enhancing generalization capabilities across multiple languages and tasks [11]
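The "transfer an image generator to video by aligning the temporal dimension" approach mentioned in the video bullet above is commonly realized by interleaving a temporal layer with pretrained spatial layers. Below is a minimal, hypothetical sketch of such a spatio-temporal block, not the architecture of any specific model named above; all module names are illustrative.

```python
# Toy spatio-temporal block: spatial attention within each frame (as in an
# image model), then temporal attention across frames at each spatial location.
import torch
import torch.nn as nn

class SpatioTemporalBlock(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        # spatial attention: mixes tokens within one frame
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # temporal attention: mixes the same spatial token across frames
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens, dim)
        b, t, n, d = x.shape
        # spatial pass: fold frames into the batch so frames are processed independently
        xs = x.reshape(b * t, n, d)
        xs, _ = self.spatial_attn(xs, xs, xs)
        x = xs.reshape(b, t, n, d)
        # temporal pass: fold spatial tokens into the batch and attend along time
        xt = x.permute(0, 2, 1, 3).reshape(b * n, t, d)
        xt, _ = self.temporal_attn(xt, xt, xt)
        return xt.reshape(b, n, t, d).permute(0, 2, 1, 3)

# toy usage: 2 videos, 8 frames, 16 tokens per frame, 32-dim features
block = SpatioTemporalBlock(dim=32)
out = block(torch.randn(2, 8, 16, 32))
print(out.shape)  # torch.Size([2, 8, 16, 32])
```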
360 open-sources a high-quality image-text alignment dataset! With 12 million images + 10 million sets of fine-grained negative samples, it helps models say goodbye to image-text mismatch
量子位· 2025-05-31 03:45
Contributed by the FineHARD team / 量子位 | WeChat official account QbitAI

How can a CLIP model be made to focus more on fine-grained feature learning and avoid "short-sightedness"? The 360 AI research team proposed FG-CLIP, which noticeably alleviates CLIP's "visual myopia" problem, letting the model attend to the correct detailed description rather than a more global but incorrect one. The key to the model's success is high-quality data. Recently, Dr. Leng Dawei's team open-sourced this "secret recipe": the FineHARD high-quality image-text alignment dataset, built around two core features: fine granularity + hard negative samples.

FineHARD is the high-quality image-text alignment dataset behind the FG-CLIP model. Characterized by both scale and precision, it contains 12 million images with corresponding long and short captions, covering 40 million bounding boxes, each accompanied by a fine-grained regional description. In addition, FineHARD innovatively introduces 10 million sets of hard fine-grained negative samples; these algorithmically filtered distractor samples effectively improve the model's ability to distinguish similar targets. FG-CLIP, trained on this dataset, has been accepted to ICML 2025 and significantly outperforms the original CLIP and other state-of-the-art ...
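A minimal sketch of how such hard fine-grained negatives could enter a CLIP-style contrastive objective, assuming precomputed, normalized embeddings: each image is scored against all in-batch captions plus its own mined hard negatives, and only the true caption should win. This illustrates the general idea, not FG-CLIP's actual training code; all function and variable names are hypothetical.

```python
# Toy contrastive loss where mined hard-negative captions enlarge each row's
# candidate set. The positive caption stays at index i for image i.
import torch
import torch.nn.functional as F

def contrastive_loss_with_hard_negatives(
    image_emb: torch.Tensor,      # (B, D) L2-normalized image embeddings
    text_emb: torch.Tensor,       # (B, D) L2-normalized positive captions
    hard_neg_emb: torch.Tensor,   # (B, K, D) mined hard-negative captions
    temperature: float = 0.07,
) -> torch.Tensor:
    B = image_emb.shape[0]
    # standard in-batch image-to-text logits: (B, B)
    logits = image_emb @ text_emb.t()
    # extra logits against each image's own hard negatives: (B, K)
    hard_logits = torch.einsum("bd,bkd->bk", image_emb, hard_neg_emb)
    # concatenate candidates; the correct caption remains at index i for row i
    all_logits = torch.cat([logits, hard_logits], dim=1) / temperature
    targets = torch.arange(B, device=image_emb.device)
    return F.cross_entropy(all_logits, targets)

# toy usage with random, normalized embeddings (B=4, K=5 hard negatives, D=64)
img = F.normalize(torch.randn(4, 64), dim=-1)
txt = F.normalize(torch.randn(4, 64), dim=-1)
neg = F.normalize(torch.randn(4, 5, 64), dim=-1)
print(contrastive_loss_with_hard_negatives(img, txt, neg).item())
```

The design point worth noting is that the hard negatives only enlarge the candidate set per row; the target index is unchanged, so the objective remains a standard cross-entropy over logits.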
Yuncong Technology's multimodal large model tops the OpenCompass global multimodal leaderboard
news flash· 2025-05-29 07:12
Core Insights
- Yuncong Technology's self-developed model, Congrong, has achieved the top position on the latest OpenCompass global multimodal leaderboard with a score of 80.7 [1]
- The model excels in professional fields including healthcare, mathematical logic, and art design, performing strongly across eight core datasets spanning visual perception, cognitive understanding, and cross-domain applications [1]
2025 China Multimodal Large Model Industry: Market Size, Value Chain, Competitive Landscape, and Development Trend Analysis. Applications will become more diverse and deeper, with increasingly broad prospects [Chart]
Chan Ye Xin Xi Wang· 2025-05-29 01:47
Core Insights
- China's multimodal large model market is projected to reach 15.63 billion yuan in 2024, an increase of 6.54 billion yuan over 2023, and is expected to grow to 23.48 billion yuan in 2025, indicating strong market demand and government support [1][6][19]

Multi-Modal Large Model Industry Definition and Classification
- Multimodal large models are AI systems capable of processing and understanding multiple forms of data, including text, images, audio, and video, using deep learning technologies such as the Transformer architecture [2][4]

Industry Development History
- The industry has evolved through several stages: a task-oriented phase, a visual-language pre-training phase, and the current multimodal large model phase, which focuses on enhancing cross-modal understanding and generation capabilities [4]

Current Industry Status
- The industry has attracted significant attention for its data processing capabilities and diverse applications, with the market size projected to grow substantially in the coming years [6][19]

Application Scenarios
- Digital humans account for the largest application share of multimodal large models at 24%, followed by gaming and advertising at 13% each, and smart marketing and social media at 10% each [8]

Industry Value Chain
- The value chain consists of upstream components such as AI chips and hardware, midstream multimodal large models, and downstream applications across sectors including education, gaming, and public services [10][12]

Competitive Landscape
- Major players include institutions and companies such as the Chinese Academy of Sciences, Huawei, Baidu, Tencent, and Alibaba, with various models being developed to optimize training costs and enhance capabilities [16][17]

Future Development Trends
- The industry is expected to become more intelligent and humanized, providing richer and more personalized user experiences, with applications expanding across fields such as finance, education, and content creation [19]
Daily Market Observation - 20250528
Caida Securities· 2025-05-28 13:47
Market Overview
- On May 27, the market trended downward, with the Shanghai Composite Index falling 0.18%, the Shenzhen Component down 0.61%, and the ChiNext Index down 0.68% [2]
- Total trading volume on May 28 was 1.02 trillion CNY, a decrease of approximately 10 billion CNY from the previous trading day [1]

Industry Performance
- More than half of the industries declined, with the textiles, pharmaceuticals, beauty, and personal care sectors posting slight gains, while non-ferrous metals, electronics, automotive, and machinery saw the largest declines [1]
- Over 80% of stocks moved within 3% [1]

Economic Indicators
- From January to April, total profits of industrial enterprises above designated size reached 2,117.02 billion CNY, a year-on-year increase of 1.4% [5]
- State-owned enterprises reported profits of 702.28 billion CNY, down 4.4% year-on-year, while private enterprises saw profits rise 4.3% to 570.68 billion CNY [5]

Capital Flow
- On May 27, the Shanghai Stock Exchange saw a net inflow of 1.027 billion CNY, while the Shenzhen Stock Exchange experienced a net outflow of 2.048 billion CNY [4]
- The top three sectors for capital inflow were chemical pharmaceuticals, computer equipment, and agricultural chemicals, while the largest outflows came from industrial metals, semiconductors, and passenger vehicles [4]

Currency and Trade
- The RMB has recently appreciated against the USD while remaining weaker against other major currencies, reflecting ongoing market uncertainty stemming from the trade environment [1]

Fund Dynamics
- Sixteen of the first batch of 26 new floating-rate funds have begun issuance, with a focus on performance-based fee structures that align the interests of fund managers and investors [12][13]
Roundup: Daily Tech News Briefing (May 28)
news flash· 2025-05-27 23:27
New Energy Vehicles
- BYD's blade battery has passed the new national standard ahead of schedule [2]
- As of May 26, Xiaomi's SU7 Ultra had reached a locked-order volume of 23,000 units [2]
- Changan Automobile's chairman predicts that industry competition will return to a healthier environment within two years [2]

Integrated Circuits (Chips)
- Samsung plans to launch a glass interposer by 2028 [2]
- Samsung is restructuring its HBM team to focus on customized HBM [2]
- Samsung will stop accepting multi-layer NAND orders after June [2]
- TSMC will produce MicroLED-based optical communication interconnect products [2]
- TSMC is establishing a European chip design center in Munich, Germany [2]

Artificial Intelligence
- Shanghai has launched its first multimodal large model for the transportation sector, which is expected to improve intersection traffic efficiency by 15% [2]
- Nvidia's supplier has resolved rack overheating issues and begun shipping Blackwell chips [2]
- Tencent Cloud has launched the data accelerator GooseFS 2.0, providing comprehensive support for all AI business scenarios [2]

Other
- Meituan clarifies that "at all costs" refers to combating internal competition [2]
- Salesforce plans to acquire Informatica for $8 billion [2]
- Canalys forecasts moderate 3% growth in the African smartphone market in 2025 [2]
- Douyin is trialing new regulations that may classify "unboxing" events as controversial [2]
- Texas's governor has signed a law requiring age verification for Apple and Google app stores [2]
- Elon Musk's Neuralink has raised $600 million, valuing the brain-computer interface company at $9 billion [2]
- Apple plans to release dedicated video game applications for its devices to strengthen its influence in the gaming industry [2]
Full agenda announced | After Google's stunning Veo 3 release, this CVPR sharing session is worth a listen for every AI practitioner
机器之心· 2025-05-27 06:38
A few days ago, at its I/O 2025 conference, Google officially released its latest-generation AI video generation model, Veo 3, which generates high-quality video while achieving synchronized audio and visuals for the first time. Commenting on Veo 3's striking results, some have praised it as "an epoch-making product on par with OpenAI's Sora," marking AI video's entry into a true "era of sound."

This suggests that, even though the large models the AI community already has are impressive enough, innovations in architecture and investment in compute clusters will keep producing new breakthroughs. In video generation, for example, the progress from early silent output to today's audio-visual output is obvious; in multimodality, the field is gradually evolving toward unifying understanding and generation.

To help practitioners get a full picture of the latest innovations and trends emerging in the AI community, 机器之心 plans to hold a "CVPR 2025 Paper Sharing Session" in Beijing on June 8, inviting top experts and paper authors to exchange ideas with on-site attendees around hot topics such as multimodality and video generation.

As one of the most important international conferences in computer vision, CVPR carries great weight and attracts large numbers of research institutions and universities every year. This year, CVPR 2025 received 13,008 paper submissions and accepted 2,878 papers, for an overall acceptance rate of 22.1%.

As an event created for AI talent in China, this paper sharing session ...
Shanghai's first multimodal large model for transportation debuts, expected to improve intersection traffic efficiency by 15%
news flash· 2025-05-27 03:07
Core Insights
- The establishment of Zhongchengjiao (Shanghai) Technology Co., Ltd. marks Shanghai's first state-owned enterprise focused on vertical-domain large models, specifically in the transportation sector [1]
- The launch of the "Tongda" multimodal large model, Shanghai's first dedicated traffic model, signals an upgrade in the city's traffic intelligence [1]

Company Overview
- Zhongchengjiao (Shanghai) Technology Co., Ltd. is positioned as a key player in the development of intelligent transportation solutions [1]
- The company aims to enhance traffic management through advanced data processing and algorithmic support [1]

Model Capabilities
- The "Tongda" model serves two primary functions: acting as an "expert consultant" providing professional knowledge services and assisting in traffic organization management [1]
- The model uses video monitoring and IoT devices to capture real-time traffic flow and surrounding road conditions, enabling rapid data analysis [1]

Performance Impact
- In pilot cities, deployment of the "Tongda" model has improved intersection traffic efficiency by approximately 15% [1]
From marathons to combat competitions: how far away is humanoid robots' singularity moment in education?
36Ke· 2025-05-26 23:48
Core Insights
- The humanoid robot and embodied AI industry is transitioning from laboratory experiments to large-scale applications, becoming a focal point for global technology and capital competition [1][2]
- The humanoid robot market is currently valued at $3.28 billion globally, with projections that the domestic market will exceed 20 billion yuan within three years [2]
- Humanoid robots are recognized as a revolutionary product, akin to computers, smartphones, and electric vehicles, and are increasingly integrated with advanced technologies such as AI, high-end manufacturing, and new materials [1][2]

Policy Landscape
- Provinces across China have established policies to support humanoid robot development, including the "14th Five-Year Plan for Robot Industry Development" and the "Guidance on the Innovation and Development of Humanoid Robots" [3]
- The policy framework aims to foster innovation and collaboration in the humanoid robot sector, facilitating the establishment of innovation centers and strategic industry clusters [3]

Development History
- The evolution of humanoid robots can be categorized into several phases: early exploration (1969-2000), integrated development (2000-2015), dynamic and intelligent progress (2015-2022), and the current explosive growth phase (2022-present) [4][5][6][7]
- Recent advances in large-scale AI models and high-performance computing have significantly enhanced humanoid robots' capabilities, enabling them to perform complex tasks and interact with their environment [7][8]

Industry Chain Structure
- The industry chain is structured into three levels: upstream core technologies, midstream manufacturing, and downstream application scenarios [12]
- The core technologies comprise the "brain" (embodied intelligence), "cerebellum" (motion coordination), and "body" (biomimetic systems), which together enable humanoid robots to perceive, decide, and act [11][13][19]

Current Market Trends
- The market for humanoid robots is expanding from industrial applications to household services, with robots increasingly capable of performing complex tasks in domestic environments [23]
- The rental market for humanoid robots is gaining traction, allowing companies to validate market demand and reduce user entry barriers by offering flexible usage options [23][24]

Future Business Models
- The business model is expected to evolve from pure hardware sales to a hybrid of hardware and services, including subscription-based offerings for businesses and consumers [24]
- Competition is anticipated to shift toward an "ecosystem war," with companies such as Tesla and NVIDIA developing comprehensive solutions that integrate hardware, AI, and manufacturing capabilities [24]

Educational Applications
- Humanoid robots are poised to play a significant role in education, particularly in specialized fields, with potential applications in rehabilitation and training [25][26]
- Their integration into educational settings is currently in the validation stage, with a gradual transition toward deeper integration as technology matures and costs decrease [26][27]

Future Trends
- The deep integration of embodied intelligence with multimodal large models is expected to enhance the personalization and generalization of educational applications [34]
- The development of simulation training platforms will accelerate the intelligent iteration of educational scenarios, allowing low-cost testing and rapid deployment of humanoid robots in classrooms [35]
- The emergence of new roles and industry reshuffling will create opportunities for algorithm developers focused on educational applications of humanoid robots [39][40]