Core Insights
- The research aligns with Yao Shunyu's view that AI has reached its "halftime," a phase in which evaluation becomes more important than training: models should be tested on real-world tasks rather than simply scaled up [2].

Group 1: Model Performance and Evaluation
- On CL-bench, the current leading model, GPT-5.1 (High), achieves a task-solving rate of only 23.7%, failing more than three-quarters of tasks even when given all the information needed to solve them [4][19].
- Ten advanced language models were assessed, with an average task-solving rate of just 17.2%, exposing a substantial gap in their ability to learn from complex contexts [19][27].
- The models struggle to learn from context: GPT-5.1 (High) ignores the provided context in 55.3% of cases and misuses it in 1.5%, indicating a reliance on static parametric knowledge rather than adaptation to new information [24].

Group 2: Context Learning Challenges
- The CL-bench framework comprises 500 complex contexts and 18,999 tasks designed to require models to learn new knowledge from context, which current models fail to do effectively [6][8].
- The knowledge required spans multiple domains, including new field knowledge, unfamiliar rule systems, and complex workflows, much of which is absent from the training data of leading models [8][14].
- Models perform especially poorly on tasks requiring inductive reasoning from experimental data, with success rates typically below 10%, pointing to a need for stronger contextual learning capabilities [25][29].

Group 3: Future Directions and Implications
- The research stresses that models must genuinely learn from context: merely supplying the context is not sufficient for task success [27].
- The collaboration between Tencent Hunyuan and Fudan University aims to advance the understanding of context learning in AI, with the goal of making contextual learning applicable in real-world scenarios [27].
- The findings suggest that enhancing reasoning capabilities alone is not enough; models must also effectively absorb and organize contextual information to improve performance [29].
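The headline metrics above (task-solving rate, context-ignored rate, context-misused rate) can be illustrated with a minimal sketch. The outcome labels and the sample data here are hypothetical placeholders, not the actual CL-bench grading scheme or results:

```python
from collections import Counter

# Hypothetical per-task grades; the real CL-bench harness labels each task
# differently, so treat these categories as illustrative only.
results = [
    "solved", "context_ignored", "context_ignored", "other_failure",
    "solved", "context_misused", "context_ignored", "other_failure",
]

counts = Counter(results)
total = len(results)

# Task-solving rate: fraction of tasks the model completes correctly.
solve_rate = counts["solved"] / total
# Failure-mode breakdown, e.g. how often the model ignores the context.
ignore_rate = counts["context_ignored"] / total
misuse_rate = counts["context_misused"] / total

print(f"solve: {solve_rate:.1%}, ignore: {ignore_rate:.1%}, misuse: {misuse_rate:.1%}")
```

The same tally logic scales from this toy list to a full benchmark run: grade each task once, then report the category frequencies.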
Yao Shunyu's first paper at Tencent: "context learning" charts the course for AI's second half