Surpassing GPT-4o! Chinese Team's New Framework Boosts Qwen's Cross-Domain Reasoning by 10%, Setting New Marks on 12 Benchmarks
量子位·2025-06-04 00:17

Core Insights
- A new reinforcement learning method called General-Reasoner significantly improves the performance of Qwen-series models, surpassing GPT-4o on various benchmarks [1][2].

Group 1: Methodology and Innovations
- The General-Reasoner framework raises cross-domain reasoning accuracy by nearly 10%. It addresses two limitations of existing Zero-RL methods: training on single-domain data and relying on rigid answer validation [2][4].
- The research team built a comprehensive reasoning dataset, WebInstruct-verified, containing approximately 230,000 high-quality, verifiable reasoning questions across fields such as physics, chemistry, and finance [5][9].
- The dataset was derived from WebInstruct, which initially contained around 5 million natural instructions, through a rigorous filtering process for quality and relevance [6][7].

Group 2: Validation Mechanism
- A new generative answer verifier, General-Verifier, replaces traditional rule-based validation and significantly improves the accuracy of answer verification across diverse domains [13].
- General-Verifier has only 1.5 billion parameters. It generates a reasoning process and then outputs a binary correctness judgment, giving reinforcement learning accurate and interpretable feedback [13].

Group 3: Performance Metrics
- The General-Reasoner framework was evaluated on 12 benchmarks and showed an improvement of about 10% on cross-domain tasks over the base models; for example, it reached 58.9% accuracy on MMLU-Pro with Qwen2.5-7B-Base as the base model [15].
- The best model, General-Reasoner-Qwen3-14B, achieved results competitive with GPT-4o, with accuracy of 56.1% on GPQA and 54.4% on TheoremQA [15][16].

Group 4: Future Directions
- The research team aims to further optimize model performance, expand high-quality reasoning data to more domains, and make the verifier more robust, so that large language models can be applied more broadly to complex real-world tasks [17].
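
To make the validation mechanism in Group 2 more concrete, here is a minimal Python sketch of how a generative verifier's binary judgment could be turned into a reinforcement learning reward, next to a rigid exact-match rule for contrast. It is an illustration under stated assumptions, not the team's implementation: the prompt layout, the "Final Decision: Yes/No" verdict line, and all function names (`build_verifier_prompt`, `parse_verdict`, `generative_reward`, `stub_generate`) are hypothetical, and `stub_generate` is a hard-coded stand-in for a real verifier model.

```python
from typing import Callable


def rule_based_reward(prediction: str, reference: str) -> float:
    """Rigid baseline: reward 1.0 only on an exact string match after trimming."""
    return 1.0 if prediction.strip() == reference.strip() else 0.0


def build_verifier_prompt(question: str, reference: str, prediction: str) -> str:
    """Assemble the verifier input (hypothetical layout, not the real one)."""
    return (
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Student answer: {prediction}\n"
        "Explain whether the student answer is equivalent to the reference, "
        "then end with 'Final Decision: Yes' or 'Final Decision: No'."
    )


def parse_verdict(verifier_output: str) -> bool:
    """Read the binary judgment from the last non-empty line of the output."""
    lines = [ln.strip() for ln in verifier_output.splitlines() if ln.strip()]
    if not lines:
        return False  # no output is treated as "not verified"
    return lines[-1].lower() == "final decision: yes"


def generative_reward(
    question: str,
    reference: str,
    prediction: str,
    generate: Callable[[str], str],
) -> float:
    """Binary reward: 1.0 if the verifier judges the answer correct, else 0.0."""
    output = generate(build_verifier_prompt(question, reference, prediction))
    return 1.0 if parse_verdict(output) else 0.0


def stub_generate(prompt: str) -> str:
    """Stand-in for a verifier model; the reply is hard-coded for this demo."""
    return "0.5 and 1/2 denote the same value.\nFinal Decision: Yes"


if __name__ == "__main__":
    question = "What is the probability of heads for a fair coin?"
    print(rule_based_reward("1/2", "0.5"))
    print(generative_reward(question, "0.5", "1/2", stub_generate))
```

In this toy case the exact-match rule rejects "1/2" against the reference "0.5", while the verifier path accepts it because the stubbed reasoning ends in a positive verdict. In practice `generate` would wrap a call to a trained verifier model, and the reasoning text before the verdict line is what would make the feedback inspectable.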