MiniMax Releases M2.5 Model: $1 to Run for an Hour, 1/20 the Price of GPT-5, Performance Rivaling Claude Opus
硬AI·2026-02-13 13:25

Core Viewpoint
- MiniMax has launched its latest M2.5 model series, claiming a breakthrough in both performance and cost: the company says the model reaches or refreshes industry SOTA (state-of-the-art) levels in programming, tool invocation, and office scenarios, and is aimed at making complex agent applications economically feasible [3][4].

Cost Efficiency
- M2.5 carries a substantial price advantage, costing only 1/10 to 1/20 of mainstream models such as Claude Opus, Gemini 3 Pro, and GPT-5 when generating at 50 tokens per second [3][4].
- Running continuously at 100 tokens per second costs just $1 per hour, falling to about $0.3 per hour at 50 tokens per second; at that rate, a $10,000 budget can keep four agents working around the clock for a year [3][4].
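A quick back-of-envelope check of these figures is sketched below. All inputs are the article's own numbers; the implied per-million-token price is derived here and is not an official MiniMax list price.

```python
# Sanity-check the quoted cost figures. Inputs are the article's numbers;
# the implied per-million-token price is a derived estimate, not list pricing.

def hourly_tokens(tps: float) -> float:
    """Tokens generated in one hour at a given tokens-per-second rate."""
    return tps * 3600

# Quoted scenarios: $1/hour at 100 tokens/s, $0.3/hour at 50 tokens/s.
for tps, usd_per_hour in [(100, 1.0), (50, 0.3)]:
    tokens = hourly_tokens(tps)
    usd_per_million = usd_per_hour / tokens * 1e6
    print(f"{tps} tok/s -> {tokens:,.0f} tokens/h, "
          f"implied ${usd_per_million:.2f} per million output tokens")

# Budget claim: $10,000 supports four always-on agents for a year
# at the 50 tok/s rate.
agents, usd_per_hour = 4, 0.3
annual_cost = agents * usd_per_hour * 24 * 365
print(f"4 agents x $0.3/h x 8,760 h = ${annual_cost:,.0f}/year")  # ~$10,512
```

The annual-budget claim checks out (about $10,512 for four agents). Note that the two quoted scenarios imply different effective per-token prices ($2.78/M vs. $1.67/M), which would suggest speed-tiered pricing; the article does not break this down further.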
Performance Metrics
- M2.5 performed strongly in core programming tests, taking first place on the multi-language Multi-SWE-Bench task, with overall performance comparable to the Claude Opus series [4].
- Task completion speed improved by 37% over the previous-generation M2.1, with end-to-end runtime cut to 22.8 minutes, matching Claude Opus 4.6 [4].

Internal Validation
- MiniMax reports that 30% of its internal tasks are now completed autonomously by M2.5, spanning core functions such as R&D, product, and sales [4].
- In programming scenarios, M2.5-generated code accounts for 80% of newly submitted code, indicating high penetration and usability in real production environments [4].

Task Efficiency
- M2.5 aims to remove the cost barrier to running complex agents by optimizing inference speed and token efficiency, reaching a generation speed of 100 TPS (tokens per second), roughly double that of current mainstream models [7].
- In SWE-Bench Verified evaluations, average total token consumption per task fell to 3.52 million from 3.72 million for M2.1, a reduction of about 5.4%, making near-unlimited agent construction and operation economically viable [9].

Programming Capability
- M2.5 emphasizes system design as well as code generation: it has evolved a native specification behavior, decomposing functions, structures, and UI designs from an architect's perspective before writing code [11].
- The model was trained on more than 10 programming languages, including Go, C++, Rust, and Python, across tens of thousands of real environments [12].

Testing and Validation
- On programming scaffolds such as Droid and OpenCode, M2.5 achieved pass rates of 79.7% and 76.1% respectively, outperforming its predecessors and Claude Opus 4.6 [14].

Advanced Task Handling
- In search and tool invocation, M2.5 shows greater decision maturity, favoring leaner solutions over mere correctness and consuming roughly 20% fewer rounds than previous generations [16].
- For office scenarios, M2.5 incorporates industry-specific knowledge developed in collaboration with finance and law professionals, achieving an average win rate of 59.0% against mainstream models and producing industry-standard reports, presentations, and complex financial models [18].

Technical Foundation
- M2.5's performance gains are driven by large-scale reinforcement learning (RL) via Forge, a native agent RL framework that decouples the underlying training engine from the agent and supports integration with any scaffold [23].
- The engineering team optimized asynchronous scheduling and tree-structured sample-merging strategies, achieving roughly 40x training acceleration and validating a near-linear improvement in model capability as compute and task count increase [23].

Deployment
- M2.5 is fully deployed across MiniMax Agent, the API, and Coding Plan, with model weights to be open-sourced on HuggingFace, supporting local deployment [25].
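Since the weights are slated for release on HuggingFace with local deployment support, a minimal local-inference sketch with the Hugging Face transformers library might look like the following. The repository ID is hypothetical: the article does not name one.

```python
# Minimal local-inference sketch using Hugging Face transformers.
# The repo ID below is hypothetical; the article only says the M2.5
# weights will be open-sourced on HuggingFace, without naming the repo.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "MiniMaxAI/MiniMax-M2.5"  # hypothetical; check HuggingFace for the real repo

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype="auto",      # use the dtype stored in the checkpoint
    device_map="auto",       # spread layers across available GPUs
    trust_remote_code=True,  # large open-weight models often ship custom code
)

messages = [{"role": "user", "content": "Draft a project plan for a CLI tool."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```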