Multimodal Large Models
The First Multi-Turn, Open-Perspective Video QA Benchmark, Systematically Categorizing 9 Hallucination Tasks
36Kr· 2025-12-26 07:16
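To fill this gap, a research team from the National University of Defense Technology and Sun Yat-sen University proposed WildVideo, a systematic multi-turn, open-ended question-answering benchmark for real-world video-language interaction.

Reported by Xin Zhi Yuan.

[Overview] The WildVideo benchmark targets the "hallucination" problem of multimodal models in video question answering. It is the first to systematically define 9 categories of hallucination tasks, and it builds a large-scale, high-quality video dialogue dataset that covers dual perspectives and supports both Chinese and English. It uses a multi-turn, open-ended QA format that stays close to real interaction scenarios and evaluates model capability comprehensively.

In recent years, large models have made notable progress in multimodal understanding and can now handle image-text and even video content in the open world. However, one widespread and serious problem, "hallucination," continues to limit their practical use. Especially in dynamic, continuous visual scenes, a model may generate answers that contradict the video content, violate common sense, or are inconsistent across the turns of a multi-turn dialogue.

Current mainstream benchmarks mostly focus on single-turn, single-perspective, multiple-choice settings, so they struggle to reflect a model's real capabilities and defects in open, continuous, interactive dialogue. This limitation of the evaluation system hinders our understanding and optimization of how models perform in real applications.

Paper: https://ieeexplore.ieee.org/document/11097075
Project page: https://chandler172857.github.io/WildVideo-leaderboard ...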
ByteDance AI's 1,080-Day Lightning Comeback: From Late Awakening to an All-Out AGI Push
21 Shi Ji Jing Ji Bao Dao· 2025-12-25 03:56
Core Insights
- ByteDance has successfully transformed its AI strategy over three years, evolving from a position of initial lag to establishing a comprehensive stack of AI capabilities [1][2][11]
- The company has made significant strides in the AI sector, particularly in generative AI, by restructuring its organization and focusing on application-driven development [2][5][11]

Group 1: Initial Challenges and Strategic Shift
- In early 2023, ByteDance faced anxiety over its lag in generative AI, lacking a unified model strategy and having multiple business lines operating independently [2][4]
- The turning point came in March 2023 with the release of GPT-4, prompting ByteDance's leadership to recognize the urgency of catching up and to establish a "large model initiative" [5][11]
- By the end of 2023, ByteDance had increased its investment in AI resources, focusing on computing power and talent acquisition, although initial product releases were limited [5][6]

Group 2: Organizational Restructuring
- In early 2024, ByteDance underwent a significant AI organizational restructuring, elevating AI from a support function to a primary strategic focus [5][8]
- The restructuring involved the creation of two independent units: Seed (focused on foundational model research) and Flow (dedicated to AI product innovation), both reporting directly to senior management [5][8]

Group 3: Product Development and Market Impact
- The Seed team, led by notable AI expert Wu Yonghui, established a technology route prioritizing multi-modal capabilities and efficient deployment [6][8]
- The Flow team adopted a "special forces" model for product development, resulting in the launch of several successful applications, including Doubao, which achieved a monthly active user count of 75.23 million by December 2024 [8][9]
- ByteDance's AI-native applications collectively surpassed 120 million monthly active users, reflecting a year-on-year growth of 232% [9]

Group 4: Future Directions and Innovations
- In 2025, ByteDance aims to solidify its position in the AI landscape by focusing on advanced technologies, global expansion, and hardware integration [9][10]
- The return of founder Zhang Yiming to lead the Singapore AI lab signifies the strategic importance of AI within the company, with a focus on multi-modal models and self-developed AI chips [9][10]
- ByteDance is also accelerating its hardware initiatives, including collaborations on AR glasses and AI headphones, enhancing its ecosystem of AI capabilities [10]
Li Auto MindGPT-4o-Vision Technical Report: Condensed Version
Zi Dong Jia Shi Zhi Xin· 2025-12-25 03:24
Core Insights
- The article discusses the release of the MindGPT-4ov technical report by Li Auto, highlighting the trade-offs between general capabilities and vertical domain adaptation in multi-modal large models [1]

Group 1: Challenges in Multi-Modal Model Training
- Three key inefficiencies and biases in current multi-modal model training are identified:
  1. Inefficient resource allocation: all data is treated equally and high-value data is neglected, wasting computational resources [2]
  2. A reward mechanism that causes diversity collapse: models converge to a few safe response patterns, sacrificing output diversity and generalization ability [2]
  3. Unimodal spurious correlations: models rely too heavily on prior knowledge from the language model rather than on visual evidence, leading to factual errors in industrial applications [2]

Group 2: MindGPT-4ov Training Paradigm
- The MindGPT-4ov post-training paradigm consists of four core modules:
  1. Data construction based on Information Density Score (IDS) and a dual-label system [3]
  2. Supervised fine-tuning (SFT) through collaborative curriculum SFT [3]
  3. Reinforcement learning (RL) with a hybrid reward mechanism [3]
  4. Infrastructure improvements for parallel training and inference optimization [3]

Group 3: Information Density Score (IDS) and Data Synthesis
- IDS evaluates image data across four dimensions: subject diversity, spatial relationships, OCR text richness, and world knowledge relevance [3]
- A dynamic synthesis strategy adjusts the number of generated question-answer pairs based on IDS scores, optimizing resource allocation [3]

Group 4: Supervised Fine-Tuning (SFT) Mechanism
- The SFT mechanism employs a three-stage collaborative curriculum learning approach to address the conflict between knowledge injection and capability retention:
  1. Cross-domain knowledge learning focuses on injecting vertical domain knowledge [5]
  2. Capability restoration uses general datasets to recover any decline in general capabilities [5]
  3. Preference alignment optimizes response formats and reduces hallucinations using high-quality preference data [5]

Group 5: Reinforcement Learning with Hybrid Rewards
- The RL phase introduces multiple reward signals to balance accuracy, diversity, and conciseness:
  1. Pass@k rewards encourage exploration of different reasoning paths by rewarding any correct answer among the top k responses [6]
  2. Diversity rewards penalize semantically similar responses, promoting varied outputs [6]
  3. Length rewards impose penalties for overly long responses, ensuring concise outputs [6]

Group 6: Label Construction and Data Admission
- A hierarchical labeling system is established, with experts defining primary labels and MLLM generating secondary and tertiary labels to form a comprehensive knowledge tree [7]
- Data synthesis involves matching images with coarse and fine-grained topics, generating QA pairs based on IDS scores, and filtering low-quality data through a multi-model voting mechanism [7]

Group 7: Performance Metrics
- MindGPT-4ov demonstrates significantly shorter average response lengths compared to competing models while maintaining higher accuracy (83.3% vs 80.1%), validating the effectiveness of the length reward mechanism [8]
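The three reward signals listed under Group 5 can be illustrated with a small sketch. This is not the formulation from the MindGPT-4ov report, which the summary does not give. It is a minimal illustration under stated assumptions: the pass@k signal is shared by the whole group of sampled responses, "semantic similarity" is approximated by word-set overlap (a real system would more likely compare embeddings), length is counted in words, and the weights `w_div` and `w_len` and the cap `max_words` are hypothetical values chosen only for the example.

```python
from typing import List, Sequence


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets.

    Assumption: a crude stand-in for the semantic similarity the summary mentions.
    """
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def pass_at_k_reward(correct: Sequence[bool]) -> float:
    """Group-level reward: 1.0 if any of the k sampled responses is correct."""
    return 1.0 if any(correct) else 0.0


def diversity_penalties(responses: Sequence[str]) -> List[float]:
    """Per-response penalty: mean similarity to the other responses in the group."""
    n = len(responses)
    if n < 2:
        return [0.0] * n
    penalties = []
    for i, r in enumerate(responses):
        sims = [token_overlap(r, o) for j, o in enumerate(responses) if j != i]
        penalties.append(sum(sims) / len(sims))
    return penalties


def length_penalty(response: str, max_words: int = 64) -> float:
    """Zero up to max_words, then grows linearly with the excess length."""
    extra = len(response.split()) - max_words
    return max(0, extra) / max_words


def hybrid_rewards(
    responses: Sequence[str],
    correct: Sequence[bool],
    w_div: float = 0.5,  # hypothetical weight, not from the report
    w_len: float = 0.5,  # hypothetical weight, not from the report
    max_words: int = 64,  # hypothetical length cap, not from the report
) -> List[float]:
    """Combine the shared pass@k reward with per-response penalties."""
    if len(responses) != len(correct):
        raise ValueError("responses and correct must have the same length")
    group_reward = pass_at_k_reward(correct)
    div = diversity_penalties(responses)
    return [
        group_reward - w_div * d - w_len * length_penalty(r, max_words)
        for r, d in zip(responses, div)
    ]


# Two identical responses and one distinct response; only the second is correct.
responses = ["the cat sat", "the cat sat", "a dog ran far away"]
correct = [False, True, False]
rewards = hybrid_rewards(responses, correct)
print(rewards)  # the two duplicated responses score lower than the distinct one
```

In this simplified form the pass@k term does not distinguish which response was correct, so a distinct but wrong response can still score highest; that is the exploration-friendly behavior the summary describes, but a practical setup would presumably also weigh per-response correctness.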
All Top Talent! Traveling the Globe to Gather with Ji Qi Zhi Xin at Leading AI Academic Conferences
Ji Qi Zhi Xin· 2025-12-23 09:36
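In 2025, AI is still accelerating. From multimodal large models to the evolution of agent systems, from breakthroughs in basic theory to deeper industrial application, every leap in technology is reshaping the outline of the future. With academic output exploding, reading alone can no longer keep pace with technical iteration. We firmly believe that even the most powerful algorithm needs connection between people, and even the most cutting-edge breakthrough needs face-to-face conversation.

This year, carrying that belief, we set out. From the turning seasons of Beijing to the osmanthus-scented courtyards of Jiangnan, from night conversations in Singapore to the light summer breeze of Vienna, from the academic calm of Vancouver to the seaside starlight of San Diego... Around AI academic conferences such as ICLR, CVPR, ACL, ICML, IROS, EMNLP, and NeurIPS, we crossed 8 cities and held 11 events. On a map of shifting time zones, we found a shared frequency and wrote down these memories and numbers that belong to 2025:

2025: Highlights in Review
From in-depth paper readings to lively conversation at talent dinners, the two event series, "Paper Sharing Sessions" and "Talent Meetups," ran through the whole year in China and abroad, aiming to build an AI exchange community with warmth, depth, and value:

2026: Setting Out Again
The old chapter is written; the new one awaits. The successful close of 2025 is the starting point for an even more exciting journey in 2026. We have made initial plans covering ICLR, CVPR, ACL, IC ...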
Zhipu and One Other Company Complete Overseas Listing Filings
Sou Hu Cai Jing· 2025-12-23 06:15
Group 1
- The China Securities Regulatory Commission has confirmed the overseas listing filings of two companies, Zhipu and MiniMax, both of which are preparing to list in Hong Kong [1][3]
- Zhipu plans to issue no more than 43,032,400 shares of overseas listed common stock and will be listed on the Hong Kong Stock Exchange [1]
- Zhipu focuses on the research and development of cognitive intelligent large models, with its core business revolving around the development of general large models, service provision, and technology open-sourcing [3]

Group 2
- MiniMax intends to issue no more than 33,577,240 shares of overseas listed common stock and will also be listed on the Hong Kong Stock Exchange [1]
- MiniMax is engaged in the research and commercialization of multimodal large models, covering areas such as text generation, speech synthesis, video generation, virtual characters, agents, and multimodal interaction platforms [4]
- As of September 30, 2025, MiniMax has over 212 million individual users across more than 200 countries and regions, as well as 130,000 enterprise customers from over 100 countries [4]
Overseas Markets Contribute Over 70% of Revenue: How Does MiniMax Use a "Small Team" of 385 People to Move the Global AGI Market?
Mei Ri Jing Ji Xin Wen· 2025-12-21 14:49
Core Insights
- MiniMax is poised to set a record for the fastest IPO from establishment to listing among AI companies, having been founded only four years ago [1]
- The company aims to be the "first global AGI stock" listed on the Hong Kong Stock Exchange, showcasing its technological advancements and global market reach [2]

Technological Advancements
- MiniMax has developed a multi-modal general model matrix, with its open-source text model M2 ranking in the top five globally and first in the open-source category according to the Artificial Analysis (AA) evaluation [2][14]
- The company has launched a series of AI-native products, including MiniMax Agent and Talkie, which support multiple languages and cater to both consumer and business markets [3][7]

Global Market Strategy
- MiniMax has established a global presence, serving over 200 million individual users and 100,000 enterprises across more than 200 countries and regions [3]
- The company reported a revenue growth of over 170% year-on-year for the first nine months of 2025, with over 70% of its revenue coming from international markets [3]

Business Model and Revenue Streams
- MiniMax has diversified its revenue channels, including subscription services, in-app purchases, and enterprise APIs [9]
- The number of paid users for its AI-native products is projected to grow from approximately 119,700 in 2023 to about 177,160 by September 30, 2025 [8]

Team and Management
- The company has a youthful workforce, with an average employee age of 29 and 74% of its staff in research and development roles [11]
- The flat management structure has contributed to its rapid technological advancements and global product development [11]

Vision and Future Outlook
- MiniMax's founder emphasized the importance of serving customers directly and maintaining a technology-driven approach as core principles for the company's growth [6]
- The company aims to make AGI accessible and beneficial to the public, moving beyond theoretical models to practical applications [15][16]
With Doubao Model Daily Token Usage Past 50 Trillion, Volcano Engine Bets Its Main Battlefield on Agents
Tai Mei Ti APP· 2025-12-19 10:05
Core Insights
- The release of Doubao Model 1.8 and Seedance 1.5 pro marks a significant update in AI capabilities, particularly in multi-modal understanding and Agent functionalities [2][4]
- Doubao Model 1.8 has achieved a daily token usage of over 50 trillion, a tenfold increase from the previous year, with over 100 enterprise clients utilizing more than 1 trillion tokens [2][5]
- The advancements in Agent capabilities are seen as a pivotal development, allowing for complex applications in enterprise scenarios [4][7]

Group 1: Model Updates
- Doubao Model 1.8 has significantly improved its tool-calling ability, allowing for the simultaneous use of over 20 tools, reducing planning steps by 37% and increasing execution success rates by 21% [5]
- The model has enhanced capabilities in visual understanding, long video comprehension, and document structuring, along with native support for intelligent context management [5][6]
- Seedance 1.5 pro is designed to meet the growing demand for video creation, featuring cinematic narrative tension and breakthroughs in audio-visual synchronization technology [2][5]

Group 2: Industry Trends
- The industry is still in its early stages, with ongoing technical limitations, but there is a strong demand for multi-modal models [3][7]
- The Agent era is expected to continue its growth, with predictions of enterprises utilizing 50 to 200 Agents by 2025, necessitating improved management and operational capabilities [10]
- Key sectors such as internet, retail, automotive, and education are rapidly adopting Agent technologies, while traditional industries are slower but have high potential [7][10]

Group 3: Competitive Landscape
- Major players like Anthropic, Google, and OpenAI are refining their models to enhance practical applications, with a focus on economic value and real-world utility [8][10]
- The competition among large model vendors is anticipated to intensify as the Agent capabilities become more critical in the market [10]
Volcano Engine President Tan Dai: It Is Too Early to Talk About a Conflict Between Agents and Apps
Di Yi Cai Jing· 2025-12-19 06:51
Core Insights
- The article discusses the recent advancements in AI models by ByteDance's Volcano Engine, highlighting the launch of Doubao Model 1.8 and Seedance 1.5 pro, with Doubao's daily token usage exceeding 50 trillion, up from 30 trillion in September [2]

Group 1: AI Model Developments
- Doubao Model's daily token usage has significantly increased, indicating growing adoption and demand for AI solutions [2]
- The industry is still in the early stages of AI implementation, with the transition from the APP era to the Agent era being characterized as a conflict of perspectives rather than a definitive shift [2][3]
- The core value of AI lies in optimizing unmet needs and enhancing efficiency, rather than merely replacing existing platforms [2]

Group 2: Challenges and Ecosystem Readiness
- The exploration of AI and Agents is still in a trial phase, with market demand present but models not yet fully developed, a situation expected to persist for about three more years [3]
- The readiness of the ecosystem for comprehensive Agent integration is contingent on the improvement of Agent tools [3][4]
- Key challenges for Agents include foundational capabilities and real-world application requirements, such as stability, scalability, and data security [4]

Group 3: Multi-Modal AI and Future Trends
- The introduction of multi-modal capabilities in AI models allows them to perform tasks similar to human functions, marking a shift towards deeper application scenarios [4]
- The rapid evolution of models is addressing many issues, with significant advancements made since last year [4]
- The competition among AI firms should focus on expanding the market and accelerating AI implementation across various industries [4]

Group 4: Cloud Services and Market Dynamics
- Volcano Engine emphasizes the value of cloud services in the AI era, drawing parallels between the growth of AI cloud services and the GPU market surpassing CPUs [5]
- The shift towards AI-driven cloud services is expected to render traditional private deployment models obsolete, as the technology continues to evolve rapidly [5]
- The importance of cloud infrastructure is underscored by the challenges faced by fixed-capacity machines in supporting diverse AI applications [5]
In the AI Era, How to Define a New Paradigm for E-commerce Marketing
Sou Hu Cai Jing· 2025-12-19 03:08
Core Viewpoint
- The e-commerce industry is undergoing a significant transformation through AI, with Douyin e-commerce leading the way by launching "Qianchuan·Chengfang," which simplifies operations for merchants and enhances user experience, ultimately achieving a win-win for merchants, users, and the platform [3][4][7]

Group 1: AI Transformation in E-commerce
- The e-commerce sector has long discussed AI transformation, but actual implementations have been limited to isolated capabilities like "recommended for you" and "image search," lacking a comprehensive system-level upgrade [2]
- The breakthrough in AI application in e-commerce is attributed to the availability of abundant data and mature technology, which Douyin e-commerce currently possesses [4][7]
- Douyin's internal data shows significant engagement, with 11.6 billion daily views of e-commerce short videos and 4.86 billion views of user-generated content, indicating a strong foundation for AI-driven marketing [4]

Group 2: Key Technological Breakthroughs
- Three major technological advancements have enabled AI to play a central role in e-commerce: the integration of agent capabilities with reinforcement learning, the maturity of model control technologies like MCP, and the successful deployment of multimodal large models [5][6]
- The agent system allows AI to make real-time decisions based on various performance metrics, optimizing budget allocation dynamically [5]
- The MCP technology enables large models to operate marketing tools more effectively, reducing the need for manual intervention [6]

Group 3: Components of Qianchuan·Chengfang
- Qianchuan·Chengfang consists of three main components: Qianxun, Qianshe, and Qianyi, each addressing different aspects of e-commerce marketing [8][9]
- Qianxun focuses on accurately predicting user needs and personalizing recommendations by integrating content, products, and user data [8]
- Qianshe automates marketing strategy formulation, allowing merchants to input basic parameters and receive optimized marketing plans without needing extensive expertise [16]
- Qianyi enhances dynamic content generation and customer service, enabling real-time adjustments based on user interactions and feedback [20][23]

Group 4: Implications for Merchants
- The introduction of Qianchuan·Chengfang significantly lowers the marketing barrier for small and medium-sized enterprises, allowing them to focus on product quality rather than complex marketing strategies [24]
- Larger businesses can save time and resources, enabling them to concentrate on innovation and brand development [24]
- The combination of Qianxun, Qianshe, and Qianyi represents a shift towards a more efficient and effective e-commerce marketing landscape, driven by AI [24][25]
Volcano Engine President Tan Dai: It Is Too Early to Talk About a Conflict Between Agents and Apps
Di Yi Cai Jing· 2025-12-18 15:26
Core Insights
- ByteDance's cloud platform Volcano Engine has released the Doubao model 1.8 and the Seedance 1.5 pro audio-video creation model, with Doubao's daily token usage exceeding 50 trillion, up from 30 trillion in September [2]
- The industry views the targeted restrictions on internet apps as a conflict between the "Agent era and the APP era," but the president of Volcano Engine, Tan Dai, believes that the core value for users lies in achieving goals more conveniently and at lower costs, regardless of the medium used [2]
- Tan Dai emphasizes that AI's primary role should be to serve unmet needs more efficiently, suggesting that Web, APP, and Agent will coexist rather than replace one another [2]

Industry Readiness
- The exploration of AI and Agents is still in a trial phase, with market demand present but models not yet fully developed, a situation expected to last for about three more years [3]
- The core issue regarding the industry's readiness for Agent integration lies in the improvement of Agent tools, with Volcano Engine investing significant resources to make existing functions recognizable and callable by Agents [3]
- Tan Dai notes that both Doubao AI assistants and APPs consist of complex Agent collections, facing challenges in foundational capabilities and real-world application requirements [3]

Multi-Modal Models
- At the end of 2025, leading domestic and international model makers are intensifying their efforts, with multi-modal models like Seedance 1.5 pro marking a shift towards deeper AI applications [4]
- Multi-modal capabilities allow models to "see, hear, speak, and act," moving beyond text-based interactions to practical applications such as traffic recognition and quality inspection [4]
- Tan Dai believes that while multi-modal models face data challenges, significant progress has been made compared to last year, and the pace of model advancement is rapid [4]

Cloud Services in AI Era
- Volcano Engine continues to highlight the value of cloud services in the AI era, with AWS aiming for its generative AI platform Bedrock to become the "largest inference engine globally," comparable to its core computing service EC2, which is currently valued at around $40 billion [4]
- Tan Dai acknowledges this trend and compares the development of MaaS (Model as a Service) to the chip business, indicating a shift from GPU training to inference processes [4]

Future of AI Hardware
- Tan Dai cites the early 2025 AI wave as evidence of the importance of cloud business, noting that many users faced issues with fixed-capacity AI hardware due to rapid technological iterations [5]
- Technologies like Agents cannot readily be deployed privately, and the fixed capabilities of all-in-one machines hinder the successful implementation of diverse AI applications [5]
- Consequently, the private all-in-one machine model from the software era is expected to be phased out in the AI era [5]