Core Viewpoint
- OpenAI's internal model has demonstrated significant progress on real-world mathematical problems, indicating an evolution of its reasoning capabilities in research-level contexts [1][2][52].

Group 1: Model Performance
- The model attempted ten real mathematical problems, and five of its solutions were judged fundamentally correct [2][11].
- The problems were not standard test questions but were drawn from actual research scenarios faced by mathematicians, which reduces the likelihood that the model simply recalled answers from its training data [5][6].
- The performance is noteworthy because the model produced reliable answers to specific problems, demonstrating autonomous reasoning rather than mere knowledge recall [52][54].

Group 2: Testing Methodology
- The evaluation ran over a week, primarily querying the current training model without providing proof strategies or mathematical hints [14].
- Expert feedback was used to refine the model's answers, a collaborative approach to validating its outputs [16][18].
- The test set consisted of ten research-level mathematical questions from the First Proof project, which aims to assess AI capabilities in a research-like environment [45][49].

Group 3: Community Engagement and Feedback
- The community actively participated in validating the model's answers, with discussions highlighting its advances in mathematical reasoning [46][52].
- Experts note that the framework captures progress in both competition-level mathematics and research-oriented mathematical reasoning [47][48].
- The evaluation paradigm is shifting from traditional test scores to real-world problem-solving assessments, which could drive transformative changes in STEM research [49][51][54].
The IMO problem set is "outdated"! OpenAI's internal model takes on the latest First Proof: 7 days of work, half the answers wrong
量子位·2026-02-15 08:00