The CodeAgent 2.0 Era Begins | GitTaskBench Sets a Disruptive New Standard for Real-World Code-Agent Delivery
机器之心 (Synced) · 2025-08-30 10:06

Core Insights
- The article discusses the limitations of current AI coding benchmarks, which focus primarily on code generation and closed problems while neglecting real-world developer needs such as environment setup and dependency management [2]
- A new evaluation paradigm called GitTaskBench has been proposed by researchers from several prestigious institutions, aiming to assess the full lifecycle capabilities of code agents, from repository understanding to project delivery [2][5]
- GitTaskBench incorporates the economic benefit of each "framework × model" pairing into its evaluation metrics, providing valuable insights for academia, industry, and entrepreneurs [2]

Evaluation Framework
- GitTaskBench covers 7 modalities across 7 domains, with 24 subdomains and 54 real tasks, built on 18 backend repositories averaging 204 files, 1,274.78 functions, and 52.63k lines of code each [3]
- Each task is linked to a complete GitHub repository, natural-language instructions, clear input/output formats, and a task-specific automated evaluation (a hypothetical task-record sketch follows the Conclusion) [4]

Capability Assessment
- GitTaskBench evaluates code agents along three dimensions: autonomous environment setup, overall coding control, and task-oriented execution [8][9]
- The evaluation process covers repository selection, completeness verification, execution-framework design, and automated assessment (a harness sketch follows the Conclusion) [10]

Economic Feasibility
- The concept of "cost-effectiveness" is introduced, quantifying the economic viability of agent solutions through metrics that reflect cost savings and efficiency improvements [12][13]
- The average net benefit (α value) of an agent is computed from task completion, market value, a quality coefficient, and operational cost (a hedged formula sketch follows the Conclusion) [15]

Performance Results
- Among the frameworks and models analyzed, OpenHands achieved the highest execution completion rate (ECR) at 72.22% and the highest task pass rate (TPR) at 48.15% (a small calculation sketch follows the Conclusion) [15][16]
- GPT-4.1 delivered strong performance at lower cost than the Claude models, indicating a balance between effectiveness and cost [24]

Market Value Insights
- Tasks with higher human market values yield greater positive alpha when agents complete them successfully [18]
- Conversely, tasks with lower market values, such as image processing, can produce negative alpha once operational costs exceed certain thresholds [19][20]

Conclusion
- The choice of "framework × model" should weigh effectiveness, cost, and API usage; the Claude series excels at code tasks, while GPT-4.1 offers cost-effective and stable performance [24]
- GitTaskBench can be applied in a range of scenarios, aiding the evaluation of code agents across multiple modalities [25]
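
As noted under Evaluation Framework, each task bundles a complete GitHub repository, natural-language instructions, explicit input/output formats, and a task-specific automated check. A minimal sketch of what such a task record might look like, assuming a flat schema; every field name and value here is invented for illustration rather than taken from the benchmark:

```python
from dataclasses import dataclass

@dataclass
class TaskRecord:
    """Illustrative shape of one GitTaskBench task; all fields are assumptions."""
    task_id: str          # hypothetical identifier
    repo_url: str         # the complete GitHub repository the agent must work inside
    instruction: str      # natural-language description of the expected deliverable
    input_spec: dict      # declared input format
    output_spec: dict     # declared output format the evaluator checks against
    eval_command: str     # task-specific automated evaluation entry point

# Made-up image-processing example (image tasks are one of the benchmark's modalities).
example = TaskRecord(
    task_id="image_restoration_demo",
    repo_url="https://github.com/<org>/<repo>",   # placeholder, not a real benchmark entry
    instruction="Restore the degraded photo in inputs/ and save it as outputs/restored.png",
    input_spec={"type": "png"},
    output_spec={"type": "png"},
    eval_command="python eval/check_quality.py outputs/restored.png",
)
print(example.task_id)
```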
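
The evaluation process listed under Capability Assessment (repository selection, completeness verification, execution-framework design, automated assessment) can be pictured as a harness loop over task records. This is only an assumed wiring, not the benchmark's actual code; every helper below is a stub:

```python
def verify_completeness(task):
    """Stub: check that the selected repository and test assets needed by the task are present."""
    return True

def run_agent(task):
    """Stub: hand the repository and instruction to the code agent inside the execution framework."""
    return {"finished": True, "api_cost_usd": 0.40}

def automated_assessment(task):
    """Stub: run the task-specific evaluator against the agent's output."""
    return {"passed": True}

def run_benchmark(tasks):
    records = []
    for task in tasks:
        if not verify_completeness(task):        # completeness verification
            continue
        trace = run_agent(task)                  # agent attempts the task end to end
        result = automated_assessment(task)      # task-specific automated evaluation
        records.append({"task_id": task["task_id"],
                        "executed": trace["finished"],   # feeds the execution completion rate (ECR)
                        "passed": result["passed"],      # feeds the task pass rate (TPR)
                        "cost_usd": trace["api_cost_usd"]})
    return records

print(run_benchmark([{"task_id": "image_restoration_demo"}]))
```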
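
The article states that the average net benefit (α) combines task completion, human market value, a quality coefficient, and operational cost, without giving the exact formula. One plausible reading, sketched with invented numbers, is "quality-weighted market value when the task passes, minus what the run cost":

```python
def alpha(passed: bool, market_value_usd: float, quality: float, cost_usd: float) -> float:
    """Hedged reconstruction: net benefit = quality-weighted market value if the task passed, minus operating cost."""
    return (market_value_usd * quality if passed else 0.0) - cost_usd

# High-value task: even an imperfect pass leaves a large positive net benefit.
print(alpha(passed=True, market_value_usd=300.0, quality=0.8, cost_usd=2.5))   # 237.5

# Low-value image-processing task: once cost exceeds the quality-weighted value, alpha turns negative.
print(alpha(passed=True, market_value_usd=5.0, quality=0.9, cost_usd=6.0))     # -1.5
```

Under this reading, the break-even point is simply cost equal to quality × market value, which is why low-value tasks such as basic image processing can turn α negative once operational costs pass that threshold, while high-value tasks stay comfortably positive; averaging α over all tasks then gives the per-agent figure the article refers to.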
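
ECR and TPR are reported as percentages of tasks. Assuming both are computed over all 54 tasks, with ECR counting runs that executed end to end and TPR counting runs that also passed their automated check, the reported OpenHands figures back out to roughly 39 executed and 26 passed tasks:

```python
def summarize(records):
    """Compute execution completion rate (ECR) and task pass rate (TPR) over per-task run records."""
    total = len(records)
    ecr = sum(r["executed"] for r in records) / total
    tpr = sum(r["passed"] for r in records) / total
    return ecr, tpr

# Toy records: 54 tasks, 39 executed to completion, 26 passing their checks
# (counts back-calculated from the published percentages, not taken from the paper).
records = [{"executed": i < 39, "passed": i < 26} for i in range(54)]
ecr, tpr = summarize(records)
print(f"ECR={ecr:.2%}, TPR={tpr:.2%}")   # ECR=72.22%, TPR=48.15%
```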