Apple's AI Paper Is a Trap: Ground Truth Written by GPT Kept a Beijing Programmer Working All Night

Core Insights
- The incident highlights significant quality-control problems in research output from major tech companies. An Apple paper that claimed to present a benchmark for visual reasoning tasks was found to have a high error rate in its ground truth data [2][6][17].

Group 1: Incident Overview
- Lei Yang, a researcher at the company, was excited to adapt a benchmark from an Apple paper that claimed to outperform GPT-5, only to discover a bug in the official code and a ground truth error rate of approximately 30% [2][6][9].
- When Yang reported the issues to the authors, their initial response was dismissive. Yang then commented publicly on the problems, which ultimately resulted in the paper being retracted [10][11][15].

Group 2: Authors' Response
- The authors acknowledged the lapse in data quality and admitted that the automated processes used to generate the ground truth data were flawed, leading to significant errors [17][18].
- They clarified that the example inference code provided in the paper was a dummy example and not intended as a formal demonstration, and expressed regret for their initial handling of Yang's feedback [18].