Our Evaluation of Coding Agents May Be Headed in the Wrong Direction

Core Viewpoint
- The evaluation of Coding Agents has been misdirected: it focuses too heavily on outcomes and too little on adherence to process specifications, which is crucial for effective collaboration in software engineering [2][4][7].

Group 1: Issues with Current Evaluation Systems
- User dissatisfaction with Coding Agents often stems from poor execution rather than an inability to perform tasks, highlighting the need for adherence to explicit instructions and engineering norms [3][4].
- Current evaluation systems, such as SWE-bench Verified, rely primarily on outcome-based metrics and neglect process and user experience, which creates a disconnect between evaluation and real-world usage [4][7].

Group 2: Introduction of OctoCodingBench
- MiniMax has introduced a new evaluation set, OctoCodingBench, that assesses whether Coding Agents follow rules while completing tasks, addressing the blind spots identified in existing evaluations [5][8].
- The evaluation metrics are Check-level Success Rate (CSR) and Instance-level Success Rate (ISR), which measure the proportion of rules followed and overall compliance, respectively [9][10].

Group 3: Evaluation Results
- Even the strongest models fail to comply with process norms: Claude 4.5 Opus achieves an ISR of only 36.2%, indicating significant room for improvement in process adherence [13].
- Open-source models are rapidly catching up to closed-source models, with MiniMax M2.1 and DeepSeek V3.2 showing competitive ISR scores of 26.1% and 26%, respectively, surpassing some established closed-source models [13][14].

Group 4: Future Directions
- The next generation of Coding Agents should incorporate Process Supervision to improve compliance with process specifications, as current models show a decline in adherence over longer tasks [15][16].
- The evolution of Coding Agents is shifting from merely producing runnable code to effectively collaborating under complex constraints, emphasizing the importance of process specification in their development [16][17][18].
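
The two metrics described in Group 2 can be illustrated with a small sketch. The source says only that CSR measures the proportion of rules followed and that ISR measures overall compliance; the code below assumes that ISR counts an instance as successful only when every one of its checks passes, which may differ from the benchmark's exact definition. The `Instance` class, the function names, and the sample data are all hypothetical and are not taken from OctoCodingBench.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """One evaluated task with a pass/fail result per rule check (hypothetical shape)."""
    name: str
    checks: list[bool]


def check_level_success_rate(instances: list[Instance]) -> float:
    """CSR: fraction of individual rule checks that passed, pooled over all instances."""
    total = sum(len(inst.checks) for inst in instances)
    passed = sum(sum(inst.checks) for inst in instances)
    return passed / total if total else 0.0


def instance_level_success_rate(instances: list[Instance]) -> float:
    """ISR (assumed definition): fraction of instances in which every check passed."""
    if not instances:
        return 0.0
    compliant = sum(1 for inst in instances if all(inst.checks))
    return compliant / len(instances)


# Made-up results for three tasks; not real benchmark data.
sample = [
    Instance("task-a", [True, True, True]),
    Instance("task-b", [True, False, True, True]),
    Instance("task-c", [True, True, False]),
]

csr = check_level_success_rate(sample)     # 8 of 10 checks passed
isr = instance_level_success_rate(sample)  # 1 of 3 instances fully compliant

print(f"CSR = {csr:.1%}, ISR = {isr:.1%}")
```

Under this assumed definition, a single violated rule fails the whole instance, so ISR can sit far below CSR. That would be consistent with the low ISR figures reported in Group 3, although the source does not give the corresponding CSR values.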