GPT-4o scores only 24% accuracy: an authoritative Chinese education benchmark poses a dual test of knowledge and emotional intelligence
36Ke·2025-11-14 07:20

Core Insights
- East China Normal University has launched OmniEduBench, a benchmark that evaluates the educational capabilities of large language models (LLMs) along two dimensions, knowledge and cultivation, and reveals significant shortcomings in how effectively current AI can support education [1][20].

Group 1: Evaluation Framework
- OmniEduBench introduces a dual-dimensional assessment system covering both knowledge and cultivation capabilities, addressing the limitation of existing benchmarks that primarily measure knowledge (an illustrative per-dimension scoring sketch follows this summary) [5][17].
- The knowledge dimension comprises 18,121 items spanning multiple educational levels and subjects, while the cultivation dimension comprises 6,481 items that assess the soft skills essential for teaching [6][7].

Group 2: Limitations of Current Models
- Even top models such as GPT-4o performed poorly on the knowledge dimension, reaching an accuracy of only 24.17%, which indicates poor adaptation to the diverse and highly localized nature of Chinese educational assessment [14][16].
- On the cultivation dimension, all models showed a significant gap relative to human performance, with the best model achieving only 70.27% accuracy, pointing to a widespread deficiency in emotional intelligence and heuristic guidance [16][21].

Group 3: Importance of OmniEduBench
- OmniEduBench matters because it systematically quantifies the interactive capabilities of educational AI, emphasizing that these models should not merely solve problems but also facilitate meaningful educational interactions [17][19].
- The benchmark is tailored to the distinctive linguistic and cultural context of Chinese education, making it a more relevant tool for assessing model performance in local settings [19][20].

Group 4: Future Directions
- The research team plans to explore more complex question types within the cultivation dimension and to incorporate multimodal educational scenarios, further strengthening the comprehensive evaluation of LLMs in education [21].
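To make the dual-dimension setup concrete, below is a minimal, hypothetical sketch of how per-dimension accuracy, the metric cited above, could be computed over benchmark items. The item schema, field names, and exact-match scoring are illustrative assumptions and are not taken from the OmniEduBench release.

```python
# Hypothetical sketch: scoring a model's predictions on a dual-dimension benchmark.
# The Item schema and exact-match grading below are illustrative assumptions,
# not the actual OmniEduBench data format or evaluation protocol.

from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Item:
    dimension: str  # e.g. "knowledge" or "cultivation"
    question: str
    answer: str     # gold answer (assumed single-label for simplicity)


def per_dimension_accuracy(items: list[Item], predictions: list[str]) -> dict[str, float]:
    """Return accuracy grouped by benchmark dimension."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for item, pred in zip(items, predictions, strict=True):
        total[item.dimension] += 1
        if pred.strip() == item.answer.strip():
            correct[item.dimension] += 1
    return {dim: correct[dim] / total[dim] for dim in total}


if __name__ == "__main__":
    # Toy data standing in for the 18,121 knowledge items and 6,481 cultivation items.
    items = [
        Item("knowledge", "7 x 8 = ?", "56"),
        Item("knowledge", "Capital of France?", "Paris"),
        Item("cultivation", "How should a teacher respond to a frustrated student?", "B"),
    ]
    predictions = ["56", "Lyon", "B"]
    print(per_dimension_accuracy(items, predictions))
    # e.g. {'knowledge': 0.5, 'cultivation': 1.0}
```

Exact match is only a stand-in here; free-form cultivation-style answers would in practice need rubric- or judge-based grading rather than string comparison.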