Breaking the "data contamination" and "inflated capability" dilemma in LLM coding: the Meituan-M17 team builds OIBench, a new-generation standard for AI programming evaluation
机器之心 · 2025-07-11 02:43
Core Insights
- The article highlights the significant gap between the proclaimed programming capabilities of large language models (LLMs) and their actual performance in rigorous evaluations, pointing to a "cognitive gap" between marketing claims and reality [3][28].

Evaluation Framework
- The Meituan-M17 team developed the OIBench dataset to provide a more accurate and better-differentiated assessment of LLMs' programming abilities, addressing the limitations of existing evaluation systems [3][8].
- OIBench consists of 212 high-difficulty algorithm problems, specifically designed to avoid data leakage and ensure high-quality assessment (a minimal judging sketch follows this summary) [10][11].

Model Performance
- The evaluation of 18 mainstream models revealed that even the top-performing model, o4-mini-high, scored only 36.35, indicating a substantial gap from human competition levels [5][19].
- Many models, such as GPT-4o and Claude 3.5 Sonnet, demonstrated low success rates on complex problems, highlighting the limitations of their capabilities [4][19].

Comparison with Human Competitors
- OIBench innovatively compared model performance with that of human competitors from top universities, providing more reliable and reproducible data than traditional Elo rating systems (see the percentile sketch at the end) [24][23].
- The results showed that models like o4-mini-high performed better than 42% of human competitors, but overall, many models struggled to surpass even 20% of human participants [30][31].

Future Directions
- The article emphasizes the need for ongoing collaboration between academia and industry to enhance the evaluation of LLMs and their integration into real-world applications [28][34].
- The introduction of a new competition focusing on human-machine collaboration aims to bridge the gap between current evaluation methods and practical applications in software development [39].
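The article does not spell out OIBench's judging harness, but OI-style evaluation conventionally means compiling a generated solution and running it against hidden test cases under strict time limits. The sketch below assumes a C++ solution file, a directory of paired `*.in` / `*.out` test files, a 2-second per-case limit, and a simple pass-fraction score; all of these are illustrative choices, not details taken from OIBench.

```python
# Minimal sketch of OI-style judging, under the assumptions stated above:
# compile a model-generated C++ solution and run it against hidden test cases
# with a per-case time limit. This is NOT the actual OIBench harness.
import subprocess
from pathlib import Path


def judge(solution_cpp: str, case_dir: str, time_limit: float = 2.0) -> float:
    """Return the fraction of hidden test cases the solution passes."""
    # Compile the candidate solution (illustrative flags).
    subprocess.run(["g++", "-O2", "-o", "solution", solution_cpp], check=True)

    cases = sorted(Path(case_dir).glob("*.in"))
    passed = 0
    for case in cases:
        expected = case.with_suffix(".out").read_text().strip()
        try:
            result = subprocess.run(
                ["./solution"],
                stdin=case.open(),
                capture_output=True,
                text=True,
                timeout=time_limit,
            )
        except subprocess.TimeoutExpired:
            continue  # time-limit exceeded counts as a failed case
        if result.returncode == 0 and result.stdout.strip() == expected:
            passed += 1
    return passed / len(cases) if cases else 0.0


# Hypothetical usage: judge("model_answer.cpp", "problem_001/tests")
```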
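As a rough illustration of the percentile-style comparison the article credits with being more reproducible than Elo ratings, the sketch below ranks a model's score within a field of human competitors who solved the same problems. The function name and the human scores are invented for illustration; they are not OIBench data.

```python
# Minimal sketch of reporting a model's standing as "beats X% of human
# competitors" rather than as an Elo rating. Scores below are made up.
def human_percentile(model_score: float, human_scores: list[float]) -> float:
    """Fraction of human competitors the model strictly outscores."""
    if not human_scores:
        return 0.0
    return sum(score < model_score for score in human_scores) / len(human_scores)


# Hypothetical field of human scores on the same problem set.
humans = [12.0, 25.5, 33.0, 40.0, 58.5, 71.0, 88.0, 95.0]
print(f"model beats {human_percentile(36.35, humans):.0%} of this field")
```

Reporting a raw percentile against a fixed, published pool of human results is directly reproducible from the score tables, whereas an Elo rating depends on the sequence of pairwise comparisons used to fit it.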