Core Insights
- The article covers the latest xbench leaderboard updates across various AI models, with particular focus on the BabyVision benchmark and the competitive landscape among leading models [1][14].

Group 1: Model Performance and Rankings
- The latest leaderboard update shows Doubao-Seed-2.0-pro ranking first among domestic models with an average score of 69.2, while its output token cost is only one-fourth that of Gemini 3 Pro [5].
- Qwen3.5-plus scored 65.6, a notable 10.6-point improvement over its predecessor, reflecting a shift in focus toward stability and cost-effectiveness [7].
- GLM-5 scored 65.0, a 4.2-point increase over GLM-4.7, while maintaining high inference efficiency [8][9].

Group 2: Benchmarking and Evaluation
- The BabyVision benchmark, developed by xbench in collaboration with various AI companies and researchers, has been adopted by several new models, underscoring its relevance in the industry [14].
- Doubao-Seed-2.0-pro leads the BabyVision leaderboard with a score of 62.60%, demonstrating strong capabilities in multimodal visual understanding tasks [12].
- The competitive landscape is evolving, with models increasingly targeting real-world agent tasks rather than single-point benchmarks [28].

Group 3: Technological Advancements
- Seed2.0, launched by ByteDance, strengthens visual perception and reasoning, markedly improving the processing of complex documents and multimedia content [29][30].
- Qwen3.5 adopts a hybrid attention mechanism and a sparse architecture, enabling efficient deployment and higher inference throughput [33].
- GLM-5 introduces advanced capabilities in automated code generation and complex system reconstruction, marking a significant evolution in AI model functionality [34].
Leaderboard update: ByteDance's Seed2.0 shines, and we also tested the viral "lobster" | xbench Monthly Report