Core Viewpoint
- The article introduces MLE-Dojo, an interactive framework for training and evaluating large language model (LLM) agents on machine learning engineering tasks, addressing the limitation that existing benchmarks do not simulate real-world iterative workflows [1][2].

Group 1: Existing Problems and Solutions
- Current LLM benchmarks are mostly static and fail to capture the dynamic workflows of machine learning engineering, offering no assessment of continuous experimentation or structured feedback [6].
- Many platforms do not support advanced training paradigms such as supervised fine-tuning (SFT) or reinforcement learning (RL), limiting the development of more autonomous AI agents [7].
- Existing benchmarks often focus on isolated tasks, missing the complexity and interconnections of end-to-end machine learning processes; MLE-Dojo addresses this by providing a comprehensive training and evaluation environment [8].

Group 2: MLE-Dojo Features
- MLE-Dojo comprises over 200 real Kaggle competitions spanning domains such as tabular data, computer vision (CV), and natural language processing (NLP), providing unprecedented breadth and depth for evaluating AI agents [12].
- The framework offers a Gym-style interactive environment in which agents can request task information, validate code, and execute code in a secure sandbox (see the interaction-loop sketch after this summary) [13].
- MLE-Dojo also provides detailed error reports and a HumanRank score that measures the agent's relative position on human leaderboards, giving a standardized performance metric across tasks (see the HumanRank sketch below) [14].

Group 3: Evaluation of LLMs
- The research team evaluated eight leading LLMs with a multi-dimensional assessment system rather than a single metric [16].
- The HumanRank score reflects a model's performance relative to human competitors, while the Elo rating system provides a dynamic ranking based on head-to-head match results (see the Elo sketch below) [17][18].
- The AUP (Area Under the Performance Profile) metric assesses the robustness and consistency of models across tasks, with higher scores indicating better performance stability (see the AUP sketch below) [18].

Group 4: Performance Analysis
- Gemini-2.5-Pro emerged as the top performer in the Elo rating, demonstrating strong competitive capability and surpassing 61.95% of human players on the HumanRank score [20].
- Different models exhibited distinct problem-solving strategies: some executed code more aggressively while others were more conservative, which affected their efficiency and overall performance [23].
- The analysis revealed that stronger models tend to generate longer and more complex solutions, indicating deeper reasoning and multi-step problem solving [24].

Group 5: Cost-Performance Trade-off
- High-performing models often incur significant computational cost, with the top reasoning models consuming more tokens and resources [25].
- Some models, such as DeepSeek-r1, show potential for competitive performance at better cost-effectiveness, pointing to a direction for future model optimization [25].
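The Gym-style loop described in Group 2 can be pictured with a minimal sketch. MLE-Dojo's actual API is not spelled out in this digest, so the environment class, action format, and reward convention below are illustrative assumptions rather than the framework's real interface.

```python
# Minimal sketch of a Gym-style agent loop; MockMLEEnv and ScriptedAgent are
# illustrative stand-ins, not MLE-Dojo's documented classes.
from typing import Any, Dict, Tuple


class MockMLEEnv:
    """Toy stand-in for an MLE-Dojo competition environment (assumed interface)."""

    def reset(self) -> Dict[str, Any]:
        self.steps = 0
        return {"task": "binary classification on tabular data", "feedback": None}

    def step(self, action: Dict[str, Any]) -> Tuple[Dict[str, Any], float, bool, Dict[str, Any]]:
        self.steps += 1
        if action["type"] == "request_info":
            obs = {"feedback": "columns: f0..f9, target; metric: AUC"}
            reward = 0.0
        elif action["type"] == "validate_code":
            obs = {"feedback": "syntax OK"}   # a real env would return detailed error reports
            reward = 0.0
        else:  # "execute_code": run in a sandbox and score the resulting submission
            obs = {"feedback": "submission scored"}
            reward = 0.42                     # e.g. a HumanRank-style score used as reward
        done = self.steps >= 3
        return obs, reward, done, {}


class ScriptedAgent:
    """Stand-in for an LLM agent; a real agent would prompt a model with the observation."""

    def __init__(self) -> None:
        self.plan = [
            {"type": "request_info"},
            {"type": "validate_code", "code": "..."},
            {"type": "execute_code", "code": "..."},
        ]

    def act(self, observation: Dict[str, Any]) -> Dict[str, Any]:
        return self.plan.pop(0)


env, agent = MockMLEEnv(), ScriptedAgent()
obs, done = env.reset(), False
while not done:
    obs, reward, done, info = env.step(agent.act(obs))
    print(obs["feedback"], reward)
```

The point of the sketch is the reset/step contract: the agent alternates between requesting information, validating drafts, and executing code, and each step returns structured feedback it can iterate on.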
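HumanRank is described only as the agent's relative position on the human leaderboard; the exact formula is not given in this digest. A natural reading, assumed below, is the fraction of human leaderboard entries that the agent's submission outperforms.

```python
def human_rank(agent_score: float, human_scores: list[float]) -> float:
    """Fraction of human leaderboard entries the agent's score beats (assumed definition)."""
    if not human_scores:
        raise ValueError("leaderboard is empty")
    beaten = sum(score < agent_score for score in human_scores)
    return beaten / len(human_scores)


# Beating 62 of 100 human entries gives a HumanRank of 0.62, in the same spirit
# as the "surpassing 61.95% of human players" figure quoted above.
print(human_rank(0.90, [0.80] * 62 + [0.95] * 38))  # 0.62
```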
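The Elo ranking in Group 3 is built from head-to-head comparisons between models. The update rule below is the standard Elo formula; the K-factor and the way matches are paired per task are assumptions for illustration, not details from the article.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def elo_update(rating_a: float, rating_b: float, outcome_a: float, k: float = 32.0):
    """Update both ratings after one match; outcome_a is 1 (A wins), 0.5 (draw), or 0."""
    e_a = expected_score(rating_a, rating_b)
    rating_a += k * (outcome_a - e_a)
    rating_b += k * ((1.0 - outcome_a) - (1.0 - e_a))
    return rating_a, rating_b


# Two models start at 1000; model A wins one task-level comparison.
print(elo_update(1000.0, 1000.0, outcome_a=1.0))  # (1016.0, 984.0)
```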
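AUP is described as the area under a model's performance profile across tasks. A common construction for such profiles (Dolan-Moré style) is sketched below under two assumptions made for illustration: task scores are higher-is-better, and the profile measures, for each tolerance tau, the fraction of tasks on which a model is within a factor tau of the best score. This is a sketch of the general technique, not necessarily the paper's exact definition.

```python
import numpy as np


def aup(scores: np.ndarray, taus: np.ndarray) -> np.ndarray:
    """scores: (n_models, n_tasks) higher-is-better task scores.
    Returns the area under each model's performance profile over the tau grid."""
    best = scores.max(axis=0)                          # best score achieved on each task
    ratios = best / scores                             # >= 1; equals 1 where a model is best
    # rho(tau): fraction of tasks on which a model is within factor tau of the best
    profiles = np.stack([(ratios <= t).mean(axis=1) for t in taus], axis=1)
    widths = np.diff(taus)                             # trapezoidal integration over tau
    return (profiles[:, :-1] + profiles[:, 1:]) / 2 @ widths


scores = np.array([[0.90, 0.70, 0.80],                 # hypothetical model A
                   [0.85, 0.75, 0.60]])                # hypothetical model B
taus = np.linspace(1.0, 2.0, 101)
print(aup(scores, taus))                               # larger area => more consistent model
```

A model that is always close to the best on every task reaches rho(tau) = 1 quickly and accumulates a large area, which is why higher AUP indicates robustness and consistency rather than a single lucky result.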
A "ranked ladder" for large models is here: letting agents evolve on real Kaggle tasks | Open-sourced by Georgia Tech and Stanford
量子位·2025-07-26 09:01