No Retraining or Fine-Tuning Needed: An Auxiliary System Pushes GPT-5.2 Accuracy to a Record 75%
机器之心 (Synced) · 2025-12-25 05:26
Core Insights
- The article argues that AI performance now depends more on how inference is orchestrated than on the underlying foundation models: a well-designed agentic system can substantially improve results without modifying the models themselves [1]

Group 1: Poetiq's Testing Results
- Poetiq reported that its meta-system scored 75% on the PUBLIC-EVAL dataset using the GPT-5.2 X-High model. This is roughly 15% higher than the previous state of the art (SOTA), at a cost of under $8 per question [3][7]
- The PUBLIC-EVAL dataset includes basic reasoning tasks as well as standard NLP and mathematical reasoning tests, which makes it suitable for broad model evaluation [3]
- Poetiq did not retrain or specifically optimize GPT-5.2. Even so, the combination showed significant gains in both accuracy and cost over previous models tested on the same dataset [7]

Group 2: Future Implications and Model Swapping
- If the performance trends seen on PUBLIC-EVAL carry over to the ARC Prize's SEMI-PRIVATE tests, the "GPT-5.2 X-High + Poetiq" combination could outperform every previously tested system configuration [7]
- Greg Kamradt, president of the ARC Prize, was optimistic about Poetiq's results. He noted that the system appears to handle swapping in new models well, although full validation awaits the resolution of infrastructure issues with the OpenAI API [7]

Group 3: System Efficiency and Mechanisms
- Poetiq's meta-system is designed to work with any leading model without extensive retraining. As a result, it can be adapted quickly to boost performance as new models are released [15]
- Unlike traditional single-pass answer generation, the meta-system uses an iterative reasoning process built on two main mechanisms: iterative problem-solving cycles and self-auditing [16]
- The iterative problem-solving cycle lets the system generate candidate solutions, receive feedback, and refine them. Self-auditing lets it monitor its own progress and decide when to stop, which avoids unnecessary computational cost [16]
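The generate, feedback, and refine cycle with a self-auditing stop decision can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not Poetiq's implementation. The article does not say how feedback is produced or how the system decides to stop, so every name here is hypothetical: `iterative_solve`, the `Model` and `Evaluator` callables, and the `target`/`patience` stopping rule. The model is passed in as a plain callable, so any backend could be plugged in without changing the loop, loosely mirroring the model-agnostic design described above.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# Hypothetical interface: a "model" is any callable that takes the task and the
# critiques so far and returns a new candidate answer. Swapping backends means
# passing a different callable; the loop itself does not change.
Model = Callable[[str, List[str]], str]
# Hypothetical scorer: returns (score, critique); higher scores are better.
Evaluator = Callable[[str], Tuple[float, str]]


@dataclass
class RunResult:
    answer: Optional[str]
    score: float
    iterations: int
    history: List[Tuple[str, float]] = field(default_factory=list)


def iterative_solve(task: str, model: Model, evaluate: Evaluator,
                    max_iters: int = 10, target: float = 1.0,
                    patience: int = 2) -> RunResult:
    """Generate -> receive feedback -> refine, with a simple self-audit stop rule."""
    feedback: List[str] = []
    history: List[Tuple[str, float]] = []
    best_answer: Optional[str] = None
    best_score = float("-inf")
    stale = 0  # consecutive rounds that did not improve the best score

    for i in range(1, max_iters + 1):
        candidate = model(task, feedback)      # generate (or refine) a candidate
        score, critique = evaluate(candidate)  # receive feedback on it
        history.append((candidate, score))
        feedback.append(critique)              # the critique informs the next round

        if score > best_score:
            best_answer, best_score, stale = candidate, score, 0
        else:
            stale += 1

        # Self-audit (assumed rule): stop once the target is met or progress stalls,
        # instead of always spending the full max_iters budget.
        if best_score >= target or stale >= patience:
            return RunResult(best_answer, best_score, i, history)

    return RunResult(best_answer, best_score, max_iters, history)


# --- Toy usage: a bisection "model" standing in for an LLM call ---
def make_bisection_model(lo: int, hi: int) -> Model:
    state = {"lo": lo, "hi": hi, "guess": None}

    def model(task: str, feedback: List[str]) -> str:
        # Narrow the search interval using the most recent critique.
        if feedback and state["guess"] is not None:
            if feedback[-1] == "too low":
                state["lo"] = state["guess"] + 1
            elif feedback[-1] == "too high":
                state["hi"] = state["guess"] - 1
        state["guess"] = (state["lo"] + state["hi"]) // 2
        return str(state["guess"])

    return model


def evaluate_square(candidate: str) -> Tuple[float, str]:
    # Score is the negated distance from x*x to 49, so 0.0 means solved.
    x = int(candidate)
    gap = x * x - 49
    if gap == 0:
        return 0.0, "correct"
    return -float(abs(gap)), ("too low" if gap < 0 else "too high")


if __name__ == "__main__":
    result = iterative_solve("find x with x*x == 49", make_bisection_model(0, 100),
                             evaluate_square, target=0.0)
    print(result.answer, result.iterations)  # prints: 7 7
```

The stopping rule illustrates the cost argument for self-auditing. The loop ends as soon as a candidate meets the target, or when `patience` consecutive rounds bring no improvement. Without that check, it would always run for the full `max_iters` rounds.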