Apple's AI Paper Is a Trap! Ground Truth Written by GPT Kept a Beijing Programmer Working All Night

Core Viewpoint
- The article discusses an incident in which a paper from Apple was found to have serious flaws, including a Ground Truth (GT) error rate potentially as high as 30%, leading a researcher to publicly call for its retraction [10][21][31].

Group 1: Incident Overview
- The incident began when a researcher, Lei Yang, was excited to adapt a benchmark from an Apple paper that aligned with his recent research [2][12].
- After working on the adaptation, he discovered that the paper, which claimed results that outperform GPT-5, had a substantial GT error rate and bugs in its official code [3][21].
- Fixing the bugs produced even lower performance metrics, which prompted Lei Yang to investigate the GT data itself for errors [17][19].

Group 2: Research Findings
- Of the 20 questions Lei Yang checked, 6 were clearly incorrect because of problems in the GT data, which appeared to have been poorly quality-checked [19][20].
- From this sample (6/20 = 30%), he estimated that the GT error rate could be as high as 30%, raising concerns about the integrity of the data used in the paper [21][22].

Group 3: Response and Retraction
- After Lei Yang reported the issues to the authors, he received a brief response, and the issue was closed without a proper resolution [23][25].
- After he publicly commented on the data quality problems, the authors eventually retracted the paper and removed the associated GitHub repository [31][32].
- The authors acknowledged the oversight in data quality and expressed regret for how they initially handled the feedback [37][39].
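
The 30% figure comes from a small spot check (6 errors among 20 checked questions), so it may help to see how much uncertainty such a sample carries. The following is a minimal sketch, not anything taken from the paper or from Lei Yang's analysis: the function names are hypothetical, and the choice of a 95% Wilson score interval is an assumption made here for illustration.

```python
import math


def point_estimate(errors: int, checked: int) -> float:
    """Observed error rate in the spot-checked sample."""
    return errors / checked


def wilson_interval(errors: int, checked: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to roughly 95% confidence (an assumption of this
    sketch, not a figure from the article).
    """
    p = errors / checked
    denom = 1 + z**2 / checked
    center = (p + z**2 / (2 * checked)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / checked + z**2 / (4 * checked**2)) / denom
    )
    return center - half_width, center + half_width


if __name__ == "__main__":
    errors, checked = 6, 20  # counts reported in the article
    low, high = wilson_interval(errors, checked)
    print(f"point estimate: {point_estimate(errors, checked):.0%}")
    print(f"95% Wilson interval: {low:.1%} to {high:.1%}")
```

With only 20 checked questions the interval is wide, which is consistent with the article's hedged wording that the error rate "could be as high as" 30% rather than a precise measurement.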