Upset! ByteDance Seed Solved Only One "Check-in" Problem at the CCPC Finals, While DeepSeek R1 Scored Zero?
AI前线 · 2025-05-16 07:48
Core Viewpoint
- The performance of large language models (LLMs) in algorithm competitions, specifically the China Collegiate Programming Contest (CCPC), has revealed significant limitations: while these models excel at many tasks, they struggle with the novel, creative problem-solving that competitive programming demands [10][11].

Group 1: Competition Overview
- The finals of the 10th China Collegiate Programming Contest (CCPC) recently took place, with ByteDance's Seed sponsoring the event and entering its Seed-Thinking model, which managed to solve only a single easy "check-in" problem (a minimal example of that difficulty tier appears after this summary) [1][3].
- A CCPC final typically features 10 to 13 problems, but details of this year's problem set have not been disclosed [1].

Group 2: Model Performance
- Several models, including Seed-Thinking, o3, o4-mini, Gemini 2.5 Pro, and DeepSeek R1, were run on the contest; most struggled badly, and DeepSeek R1 failed to solve any problem at all [5][9].
- Measured against expectations derived from the models' prior ratings, the scores were strikingly low, and many observers expressed surprise [3][11].

Group 3: Model Architecture and Training
- Seed-Thinking uses a Mixture-of-Experts (MoE) architecture with 200 billion total parameters and 20 billion active parameters, integrating various training methods for STEM problems and logical reasoning (a toy sketch of top-k MoE routing follows this summary) [8].
- o3 features a specialized reasoning architecture built on a 128-layer Transformer, while o4-mini is optimized for efficiency, cutting parameters substantially while maintaining performance [8].
- Gemini 2.5 Pro supports multi-modal inputs and offers a large context window, allowing it to ingest extensive documents and codebases [8].

Group 4: Insights on Model Limitations
- The CCPC results indicate that large models have inherent weaknesses in solving algorithmic problems that their training may not adequately address [10][11].
- Competitive programming demands original problem-solving that differs from the models' training data, which makes it hard for them to perform well [11][12].

Group 5: Comparative Analysis
- A benchmark test conducted by Microsoft on various models showed that although all models performed well on known problems, their success rates dropped sharply on unseen problems, particularly in the medium and hard categories (a small evaluation sketch along these lines also follows) [14][17].
- Models run with reasoning modes enabled outperformed their base versions, underscoring how much reasoning capability matters for complex algorithmic challenges [17][18].
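
In competitive-programming jargon, a "check-in" problem (签到题) is the warm-up task that essentially every team is expected to solve. Since the CCPC finals' actual problems were not disclosed, the following is a hypothetical example at that difficulty tier, in the classic "A+B" style:

```python
# Hypothetical "check-in"-difficulty task (not an actual CCPC problem):
# read two integers from standard input and print their sum.
a, b = map(int, input().split())
print(a + b)
```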
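
To make Group 3's architecture note concrete, here is a minimal NumPy sketch of top-k MoE routing. All names and sizes are illustrative, not Seed-Thinking's actual design; what it demonstrates is why a model with 200 billion total parameters can activate only about 20 billion per token: the gate selects a few experts, and only those experts run.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_forward(x, gate_w, experts, k=2):
    """Route one token through a top-k MoE layer: pick k of n experts,
    then mix their outputs with renormalized gate weights."""
    logits = gate_w @ x                      # one score per expert
    top = np.argsort(logits)[-k:]            # indices of the k highest-scoring experts
    gates = softmax(logits[top])             # renormalize over the chosen k only
    # Only the selected experts execute, so active params << total params.
    return sum(g * experts[i](x) for g, i in zip(gates, top))

# Toy setup: 8 experts over a hidden size of 4; with k=2, only a quarter
# of the expert parameters are active for any given token.
rng = np.random.default_rng(0)
d = 4
experts = [(lambda W: (lambda v: np.tanh(W @ v)))(rng.normal(size=(d, d)))
           for _ in range(8)]
gate_w = rng.normal(size=(8, d))
print(moe_forward(rng.normal(size=d), gate_w, experts))
```

Production MoE layers add load-balancing losses and per-expert capacity limits, but the routing idea is the same.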
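
Group 5's seen-vs-unseen comparison amounts to bucketing solve results by whether a problem could have appeared in the training data and by difficulty. The records below are invented placeholders (the article does not publish per-problem data); the function is a plain sketch of that bucketing.

```python
from collections import defaultdict

# Invented placeholder records, NOT the benchmark's actual data:
# (problem_public_before_training_cutoff, difficulty, solved)
results = [
    (True, "easy", True), (True, "medium", True), (True, "hard", True),
    (False, "easy", True), (False, "medium", False), (False, "hard", False),
]

def solve_rates(records):
    """Solve rate per (seen, difficulty) bucket."""
    tally = defaultdict(lambda: [0, 0])      # bucket -> [solved, total]
    for seen, diff, ok in records:
        tally[(seen, diff)][0] += ok
        tally[(seen, diff)][1] += 1
    return {key: solved / total for key, (solved, total) in tally.items()}

for (seen, diff), rate in sorted(solve_rates(results).items()):
    print(f"{'seen' if seen else 'unseen':6} {diff:6} {rate:.0%}")
```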