68-page paper hammers the large-model arena again! Llama 4 was privately tested in 27 versions before release, with only the best score reported
量子位· 2025-05-02 04:36
Core Viewpoint
- The credibility of large model rankings, particularly the Chatbot Arena, has been called into question due to systemic issues highlighted in a recent paper titled "The Leaderboard Illusion" [2][3].

Group 1: Issues Identified
- The paper identifies four main issues with the current ranking system [8].
- First, selective reporting and private testing by major model providers (e.g., Meta, Google, Amazon) allow them to disclose only the best-performing versions of their models [10][11].
- This "best-of-N" strategy inflates rankings: privately testing many variants and publishing only the top one significantly raises the expected reported score (see the simulation sketch at the end of this summary) [13][14].
- Second, data access is unequal, with major providers receiving a disproportionate share of user feedback compared to open-source models [23].
- Third, training on Arena data yields substantial performance gains, with win rates rising as the proportion of Arena data used in training increases [24][25].
- Fourth, many models are "silently deprecated": 205 of 243 public models have effectively been abandoned, which undermines the reliability of the rankings [27][28].

Group 2: Recommendations and Responses
- The research team offered five suggestions for improving the ranking system's credibility [30].
- LMArena's official response acknowledged some issues but defended the ranking system's integrity, emphasizing that it reflects community preferences [6][34].
- Alternative platforms such as OpenRouter are suggested as options for more reliable model comparisons [36][37].
- The paper's findings have prompted a reconsideration of relying on a single ranking system and highlight the need for diverse benchmarks [35].
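The "best-of-N" inflation effect is easy to see with a small Monte Carlo sketch. The snippet below assumes, purely for illustration, that each privately tested variant would land on an Arena-style score drawn from a normal distribution (mean 1200, standard deviation 30, both hypothetical values not taken from the paper); publishing only the maximum of N such draws systematically raises the expected reported score even though no individual variant is better on average.

```python
import random
import statistics

def expected_best_of_n(n_variants, n_trials=20_000, mean=1200.0, sd=30.0):
    """Estimate the expected *reported* score when a provider privately tests
    n_variants model versions (scores drawn i.i.d. from a normal distribution,
    a simplifying assumption) and publishes only the best one."""
    best_scores = []
    for _ in range(n_trials):
        scores = [random.gauss(mean, sd) for _ in range(n_variants)]
        best_scores.append(max(scores))
    return statistics.mean(best_scores)

if __name__ == "__main__":
    baseline = expected_best_of_n(1)   # honest single-submission baseline
    for n in (1, 3, 10, 27):           # 27 mirrors the variant count cited for Llama 4
        inflated = expected_best_of_n(n)
        print(f"variants tested: {n:>2}  expected reported score: {inflated:7.1f}  "
              f"inflation vs. single submission: +{inflated - baseline:5.1f}")
```

Under these toy assumptions, the 27-variant maximum lands well above the single-submission baseline, which is the inflation mechanism the paper attributes to private pre-release testing.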