Multimodal Field
LatePost Exclusive | Aishi Technology Completes New 100 Million RMB B+ Round, ARR Surpasses 40 Million USD
LatePost (晚点) · 2025-10-17 07:29
Core Insights
- The article surveys the competitive landscape of AI video generation, highlighting the rapid growth and potential of companies such as Aishi Technology and of OpenAI's Sora [5][7][11].

Company Developments
- Aishi Technology has completed a B+ financing round of 100 million RMB, bringing its total funding to more than 100 million USD since its founding in April 2023 [5].
- Aishi's products, PixVerse and Pai Wo AI, have more than 100 million users in total and over 16 million monthly active users, with annual recurring revenue (ARR) surpassing 40 million USD [5].
- OpenAI launched the Sora 2 video generation model and the Sora app, which quickly topped the US App Store free chart and passed 1 million downloads in less than two weeks [8][13].

Market Dynamics
- The market for video generation apps is vast, and existing tools do not yet cover all potential users: TikTok and Douyin's monthly active users exceed 2 billion [9].
- Aishi's CEO noted that AI is reshaping content consumption, much as short video did before it [8].
- Despite Sora's rapid growth, Aishi's PixVerse has not been negatively affected, which suggests the market is large enough for multiple players [9].

Competitive Landscape
- The top of the video generation model rankings is currently dominated by Chinese companies: Kuaishou's Kling, Aishi's PixVerse, and MiniMax hold the top three places, while Sora ranks 31st [11].
- ByteDance's video generation models, Seedance and Waver, are also strong competitors, with significant daily active user growth targets [12].
- Competition in the multimodal field is intensifying, driven by its enormous consumer and entertainment potential [13].
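MiniMax Open-Sources the First Unified Visual RL Framework, Led by Yan Junjie! Handling Both Reasoning and Perception, It Sweeps MEGA-Bench
QbitAI (量子位) · 2025-05-27 12:31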
Core Insights
- The article discusses the introduction of the V-Triune framework by MiniMax, which allows for unified learning of visual reasoning and perception tasks within a single reinforcement learning (RL) system [1][11].
- The framework addresses the limitations of traditional RL methods that typically focus on either reasoning or perception tasks, enabling a more comprehensive approach to visual tasks [2][8].

Framework and Model Development
- V-Triune employs a three-layer component design and a dynamic Intersection over Union (IoU) reward mechanism to effectively balance multiple tasks [2][22].
- The Orsta model series, developed based on V-Triune, ranges from 7 billion to 32 billion parameters and has shown significant performance improvements on the MEGA-Bench Core benchmark, with enhancements ranging from +2.1% to +14.1% [3][30].

Technical Implementation
- The framework allows for sample-level data formatting, enabling custom reward settings and verifiers for each sample, thus supporting dynamic routing and weight adjustments [13][14].
- An asynchronous client-server architecture is utilized to decouple reward calculation from the main training loop, enhancing flexibility in task expansion and reward logic updates [15][18].

Monitoring and Stability
- The system includes a monitoring mechanism that tracks various metrics such as reward values, IoU, mean Average Precision (mAP), response length, and reflection rates to ensure learning stability [19][21].
- Dynamic IoU rewards are introduced to alleviate cold start issues and guide models in improving localization accuracy through phased threshold adjustments [22][24].

Performance Metrics
- The Orsta models have been trained on a diverse dataset covering four types of reasoning tasks and four types of perception tasks, leading to significant improvements in performance metrics, particularly in perception tasks [30][31].
- The article highlights the effectiveness and scalability of the unified approach, as evidenced by the substantial gains in mAP metrics during testing [30].

Company Background
- MiniMax, recognized as one of the "Six Little Giants" in AI, has been actively expanding its capabilities in the multimodal field, developing models that span language, audio, and video [32].
- The company aims to innovate in multimodal architecture, focusing on a unified generative understanding model [35].
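
The dynamic IoU reward with phased threshold adjustments mentioned in this summary can be illustrated with a minimal Python sketch. It assumes axis-aligned boxes given as (x1, y1, x2, y2) and a three-phase schedule keyed to training progress. The function names, the threshold values (0.50, 0.75, 0.95) and the phase boundaries are hypothetical choices made for illustration; they are not taken from the article or from V-Triune's code.

```python
from typing import Sequence, Tuple

Box = Sequence[float]  # (x1, y1, x2, y2), with x2 >= x1 and y2 >= y1


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (empty if the boxes do not overlap).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical schedule: (end of phase as a fraction of training, IoU threshold).
# A loose threshold early on eases the cold start; later phases demand tighter boxes.
SCHEDULE: Tuple[Tuple[float, float], ...] = (
    (0.10, 0.50),
    (0.25, 0.75),
    (1.00, 0.95),
)


def threshold_at(progress: float) -> float:
    """Return the IoU threshold in force at a training progress in [0, 1]."""
    for phase_end, threshold in SCHEDULE:
        if progress <= phase_end:
            return threshold
    return SCHEDULE[-1][1]


def dynamic_iou_reward(pred: Box, target: Box, progress: float) -> float:
    """Reward equals the IoU if it clears the current threshold, otherwise 0."""
    score = iou(pred, target)
    return score if score >= threshold_at(progress) else 0.0


if __name__ == "__main__":
    pred, target = (0, 0, 10, 8), (0, 0, 10, 10)
    for progress in (0.05, 0.20, 0.90):
        print(progress, dynamic_iou_reward(pred, target, progress))
```

Under these assumed values, the same slightly loose prediction earns a reward in the first two phases and none in the last, which is one way a phased threshold can give a model usable signal early in training while still pushing toward precise localization later.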