Multimodal Reasoning Models

The "Multimodal Overachiever" Fires Three Arrows in a Row!
中国基金报 (China Fund News) · 2025-07-26 08:44
Core Insights
- StepFun (阶跃星辰) announced three major developments: the launch of its new-generation foundation model Step 3, a strategic partnership with Shanghai State-owned Capital Investment Co., and the founding of the Model Ecosystem Innovation Alliance together with nearly ten chip manufacturers and computing-power platforms [2][3][6].

Group 1: New Model Launch
- Step 3 is designed to balance intelligence with efficiency, aiming to be the model best suited to the inference era; it will be open-sourced to enterprises and developers worldwide on July 31 [3].
- On domestic chips, Step 3's decoding efficiency reaches up to 300% of DeepSeek-R1's, and the model is compatible with all major chip types [3].

Group 2: Strategic Partnership
- The collaboration with Shanghai State-owned Capital Investment Co. marks a significant step in StepFun's commercialization, focusing on capital linkage, ecosystem construction, business synergy, and application empowerment [6][9].
- Shanghai State-owned Capital Investment Co. has a registered capital of 10 billion yuan and engages in strategic equity management and market-oriented investment projects [9].

Group 3: Ecosystem Alliance
- The Model Ecosystem Innovation Alliance aims to improve model adaptability and computing efficiency through collaborative innovation among foundational technology vendors [11].
- Initial members include Huawei Ascend, MuXi, and other major vendors, with the goal of delivering efficient, easy-to-use large-model solutions [11][13].
The "Multimodal Overachiever" Fires Three Arrows in a Row!
中国基金报 (China Fund News) · 2025-07-26 08:31
Core Viewpoint
- StepFun (阶跃星辰) announced three major developments: the launch of its new-generation foundation model Step 3, a strategic partnership with Shanghai State-owned Capital Investment Co., and the founding of the Model Ecosystem Innovation Alliance together with nearly ten chip manufacturers and computing-power platforms [1][7][14].

Group 1: New-Generation Foundation Model
- Step 3 is designed to balance intelligence with efficiency, aiming to be the model best suited to the inference era and to contribute a powerful multimodal reasoning model to the open-source community [1][2].
- On domestic chips, Step 3 achieves decoding efficiency of up to 300% of DeepSeek-R1's, reflecting substantial system- and architecture-level innovation [2].
- In distributed inference on NVIDIA Hopper-architecture chips, Step 3 delivers more than 70% higher throughput than DeepSeek-R1 [4].

Group 2: Strategic Partnership
- The partnership with Shanghai State-owned Capital Investment Co. marks a significant step in StepFun's commercialization efforts, focusing on capital linkage, ecosystem construction, business collaboration, and application empowerment [7].
- Shanghai State-owned Capital Investment Co. is a large state-owned capital investment platform with a registered capital of 10 billion yuan, engaged in strategic equity management and market-oriented investment [8].

Group 3: Commercialization Progress
- More than half of domestic smartphone manufacturers now work with StepFun, and the company has partnered with Geely Auto on smart-cockpit solutions [10].
- Driven by rapid growth in the first half of 2025, the company aims to reach 1 billion yuan in annual revenue in 2025 [10].
Group 4: Model Ecosystem Innovation Alliance
- The alliance, initiated by StepFun together with nearly ten chip and infrastructure manufacturers, aims to improve model adaptability and computing efficiency through collaborative innovation [14][15].
- Initial members include Huawei Ascend, MuXi, and several other technology firms, with the goal of providing efficient, easy-to-use large-model solutions for enterprises and developers [14][15].
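The efficiency claims above can be sanity-checked with simple arithmetic. A minimal sketch, reading "300%" as a 3x decoding-efficiency ratio and ">70% improvement" as a 1.7x throughput ratio (these are interpretations of the article's percentages, not figures it states in this form):

```python
# Convert the article's percentage claims into speedup ratios and the
# implied relative decoding cost per token (cost ~ 1 / throughput).
claims = {
    "domestic chips, decoding efficiency": 3.0,   # "300% of DeepSeek-R1"
    "NVIDIA Hopper, distributed inference": 1.7,  # ">70% improvement"
}

relative_cost = {}
for setting, speedup in claims.items():
    relative_cost[setting] = 1.0 / speedup  # cost per token vs. DeepSeek-R1
    print(f"{setting}: {speedup:.1f}x speed, "
          f"{relative_cost[setting]:.2f}x cost per token")
```

Under these readings, a 3x decoding-efficiency gain implies roughly one third of the baseline's serving cost per token on domestic chips.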
StepFun (阶跃星辰) Releases Its New-Generation Foundation Model Step 3, a Native Multimodal Reasoning Model with Open-Source SOTA Performance
Founder Park· 2025-07-26 04:53
Core Viewpoint
- The article covers the launch of Step 3, StepFun's new-generation foundation model, aimed at advancing intelligent applications and efficiency in the reasoning era, with an emphasis on meeting customer needs and real-world application scenarios [3][6].

Group 1: Step 3 Model Overview
- Step 3 is positioned as the company's primary foundation model for global enterprises and developers, and will be open-sourced on July 31 [3][20].
- The model has 321 billion total parameters, of which 38 billion are active, and shows strong visual perception and complex reasoning capabilities [9].
- Step 3 balances performance and cost, achieving state-of-the-art (SOTA) results among open-source models on multimodal reasoning tasks [9][18].

Group 2: Technological Innovations
- Step 3 uses a Mixture-of-Experts (MoE) architecture, delivering large performance gains while keeping operating costs low [9][18].
- On domestic chips, decoding efficiency reaches up to 300% of previous models'; on NVIDIA Hopper-architecture chips, throughput improves by more than 70% [18][20].

Group 3: Industry Collaboration
- StepFun has launched the "MoXin" (Model-Chip) Ecosystem Innovation Alliance with leading chip and platform manufacturers to drive joint innovation across the model and chip industries [5][22].
- A strategic partnership with Shanghai State-owned Capital Investment Co., Ltd. has been established to strengthen capital linkage and ecosystem-level business cooperation [5][22].

Group 4: Application and Market Focus
- The company is focusing on key application scenarios such as automobiles, smartphones, and IoT devices, with significant collaborations across domestic smartphone manufacturers and the automotive industry [23].
- It aims to build scenario-based applications in vertical industries, working with leading firms in finance, content creation, and retail [23].
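Step 3's 321B-total / 38B-active split is the signature of Mixture-of-Experts inference: a router activates only a few experts per token, so per-token compute tracks the active parameter count rather than the total. A minimal top-k routing sketch in NumPy (the expert count, top-k, and dimensions here are illustrative, not Step 3's actual configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

n_experts, k = 8, 2   # activate 2 of 8 experts per token (illustrative)
d_model = 16          # hidden size (illustrative)

# Router: a linear layer scoring every expert for each token.
router_w = rng.standard_normal((d_model, n_experts))
# Experts: modeled here as simple per-expert linear maps.
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]

def moe_layer(x: np.ndarray) -> np.ndarray:
    """x: (tokens, d_model) -> (tokens, d_model), top-k expert mixture."""
    logits = x @ router_w                       # (tokens, n_experts)
    topk = np.argsort(logits, axis=-1)[:, -k:]  # indices of the k best experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        scores = logits[t, topk[t]]
        gates = np.exp(scores - scores.max())
        gates /= gates.sum()                    # softmax over selected experts
        for gate, e in zip(gates, topk[t]):
            out[t] += gate * (x[t] @ experts[e])
    return out

y = moe_layer(rng.standard_normal((4, d_model)))
print(y.shape)  # (4, 16)
# Only k / n_experts of the expert parameters run per token -- the reason a
# 321B-parameter model can decode with only ~38B parameters active.
print(f"active fraction: {k / n_experts:.2f}")
```

The same principle scales up: routing cost is small, so serving cost follows the active fraction, which is how MoE models keep inference cheap relative to dense models of equal total size.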
Latest from Stanford! Analyzing Hallucinations in Large Models: Does Over-Thinking Make the Truth Disappear?
自动驾驶之心· 2025-06-19 10:47
Core Viewpoint
- The paper examines the relationship between reasoning capability and hallucination in multimodal reasoning models, asking whether stronger reasoning comes at the cost of visual perception accuracy [2][3][37].

Group 1: Reasoning Models and Hallucinations
- As their reasoning capabilities improve, multimodal reasoning models tend to amplify hallucinations, increasing the risk of misinterpreting visual data [2][3][5].
- The study introduces a new metric, RH-AUC, to quantify the balance between reasoning length and perception accuracy; longer reasoning chains are associated with more hallucination [4][30].

Group 2: Attention Mechanism and Performance
- Attention analysis shows reasoning models attend markedly less to visual elements, relying on language-based priors rather than visual evidence [5][18].
- Experiments show reasoning models underperform non-reasoning models on perception tasks, with higher hallucination rates regardless of model size [8][37].

Group 3: Training Paradigms and Data Quality
- Two main training paradigms are compared: pure reinforcement learning (RL-only) and supervised fine-tuning followed by reinforcement learning (SFT+RL); RL-only models generally balance reasoning and perception better [10][35].
- Data quality matters more than quantity: models trained on high-quality, domain-specific data better maintain the reasoning-hallucination balance [39][42].

Group 4: Evaluation Metrics and Future Directions
- The paper introduces RH-Bench, a benchmark of 1,000 multimodal tasks for jointly evaluating reasoning and perception [30][32].
- Future directions include exploring broader model architectures and mechanisms for dynamically adjusting reasoning length to improve reliability [44].
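RH-AUC, as described above, compresses the accuracy-vs-reasoning-length trade-off into one number. A simplified sketch, assuming the metric is the length-normalized area under a perception-accuracy curve sampled at increasing reasoning lengths (the paper's exact construction may differ; the two model curves below are hypothetical):

```python
def rh_auc(lengths, accuracies):
    """Area under the accuracy-vs-reasoning-length curve, normalized by the
    length range, computed with the trapezoidal rule."""
    assert len(lengths) == len(accuracies) >= 2
    area = 0.0
    for i in range(1, len(lengths)):
        dx = lengths[i] - lengths[i - 1]
        area += dx * (accuracies[i] + accuracies[i - 1]) / 2.0
    return area / (lengths[-1] - lengths[0])

# Hypothetical models: B holds perception accuracy as chains grow, A degrades.
lengths = [64, 128, 256, 512]       # reasoning tokens (illustrative)
model_a = [0.80, 0.72, 0.60, 0.45]  # accuracy collapses with longer chains
model_b = [0.78, 0.77, 0.75, 0.74]  # accuracy roughly stable

print(f"model A: {rh_auc(lengths, model_a):.3f}")
print(f"model B: {rh_auc(lengths, model_b):.3f}")  # higher -> better balance
```

A model that sustains accuracy across reasoning lengths scores higher than one whose perception collapses as chains grow, which matches the paper's intent of rewarding reasoning gains that do not trade away visual grounding.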