Grok-4 Takes the Top Spot, Kimi K2 Is SOTA Among Non-Thinking Models, New Doubao and DeepSeek Models Improve | xbench Monthly Report
红杉汇·2025-07-18 00:47

Core Insights
- The article surveys the competitive landscape of large AI models, focusing on the recent releases of xAI's Grok-4 and Kimi's K2, which have set off a new round of advances in the field [1][4].

Model Performance Summary
- Grok-4 raised the ScienceQA evaluation score from 42.6 to 65.0, a relative improvement of just over 50%, surpassing OpenAI's o3 to become the state-of-the-art (SOTA) model [4][8].
- Kimi K2, a non-thinking model, scored 49.6, placing it in the top ten; its best-of-N (BoN, N=5) score of 73.0 indicates strong performance on multi-step reasoning tasks [11][24].
- OpenAI's o3-pro scored 59.6, an improvement over its predecessor, but at the cost of longer response times and higher API prices [11][25].

Cost and Efficiency Analysis
- Grok-4 is competitively priced at $15 per million tokens, far below o3-pro's $80, while maintaining high performance [15][21].
- Doubao-Seed-1.6 is one of the best-value models, scoring 56.6 at an output price of $1.1 [15][18].
- Longer reasoning times tend to correlate with higher scores; Grok-4 has the longest average response time, at 227 seconds [17].

Model Innovations
- Grok-4 adds real-time web retrieval and multi-agent collaboration to strengthen its reasoning [23].
- Kimi K2 is noted for its training techniques, including the MuonClip optimizer and a comprehensive agent simulation pipeline, which support its large parameter count and strong performance [24].
- OpenAI's o3-pro has been optimized for scientific and programming tasks, with improved reliability and reasoning [25].

Leaderboard Updates
- The leaderboard covers 43 model versions from 16 companies, and the rankings of major players such as OpenAI, Google, and ByteDance remain consistent [5][8].
- The leaderboard will continue to evolve with monthly updates, providing ongoing insights into model performance and capabilities [1][5].
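
The score and price figures quoted in this summary can be cross-checked with a little arithmetic. The sketch below is illustrative only: the "score per dollar" ratio is an assumed, simplistic value metric rather than the one xbench uses, and all prices are assumed to be USD per million output tokens, a unit the summary states explicitly only for Grok-4.

```python
# Scores and prices as quoted in the summary. Assumption: every price is in
# USD per million output tokens (the summary gives the unit only for Grok-4).
models = {
    "Grok-4": {"score": 65.0, "price": 15.0},
    "o3-pro": {"score": 59.6, "price": 80.0},
    "Doubao-Seed-1.6": {"score": 56.6, "price": 1.1},
}


def relative_gain(old: float, new: float) -> float:
    """Relative improvement of `new` over `old`, as a fraction."""
    return (new - old) / old


def score_per_dollar(score: float, price: float) -> float:
    """Hypothetical value metric: benchmark points per dollar of output price."""
    return score / price


# Relative size of the 42.6 -> 65.0 jump reported for Grok-4.
grok_gain = relative_gain(42.6, 65.0)
print(f"Grok-4 relative gain: {grok_gain:.1%}")

# Rank the three models by the assumed value metric, best first.
ranking = sorted(
    models,
    key=lambda name: score_per_dollar(**models[name]),
    reverse=True,
)
for name in ranking:
    value = score_per_dollar(**models[name])
    print(f"{name}: {value:.2f} points per dollar")
```

Under these assumptions the 42.6 to 65.0 jump works out to a relative gain of roughly 53%, and Doubao-Seed-1.6 ranks first on points per dollar, consistent with the summary's description of it as one of the best-value models.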