OpenAI's Strongest AI Models Turn Out to Be "Big Bluffers"? o3/o4-mini Reportedly Too Clever for Their Own Good, with Frequent Hallucinations
36Kr · 2025-04-21 11:07

Core Insights

- OpenAI has launched its new reasoning models, o3 and o4-mini, claimed to be its most intelligent models to date, excelling at complex tasks such as coding and mathematics [1][3][8]
- Despite their strong performance across various benchmarks, these models have been found to hallucinate at a significantly higher rate than previous versions [9][11]

Performance Metrics

- In the Codeforces programming test, o3 achieved an Elo score of 2706, surpassing the previous model o1, which scored 1891 [3]
- In the GPQA Diamond scientific question-answering test, o3 reached an accuracy of 83.3% and o4-mini 81.4%, while o1 achieved only 78% [5]
- Both o3 and o4-mini outperformed the older o1 model on the MMMU benchmark [7]

Hallucination Rates

- The hallucination rates of the new models are concerning: o3 hallucinates at a rate of 33% and o4-mini at a staggering 48%, compared with o1's 16% and o3-mini's 14.8% [9][11]
- Traditional non-reasoning models such as GPT-4o show lower hallucination rates than the new reasoning models [9]

Design Philosophy and Implications

- The increase in hallucination rates may stem from the design philosophy of the o series, which emphasizes reasoning over rote memorization [12]
- The shift to a reasoning-first approach has driven significant advances in areas such as programming and mathematics, but it also introduces side effects such as overconfidence and verbosity in responses [12][13]

User Experience

- Users have expressed mixed feelings about o3: they appreciate its coding efficiency but struggle with its high hallucination rate, which necessitates additional verification steps [14][16]
- Developers have reported that o3 generates nonsensical code, posing risks to code integrity [15][16]