Breaking: Tencent's Yao Shunyu team releases its first result, revealing the real bottleneck of large language models
Tencent (HK:00700) · 36Kr · 2026-02-03 14:26

Core Insights
- Tencent's Hunyuan team has launched a new benchmark called CL-bench, aimed at evaluating the ability of large language models to learn new knowledge from context and apply it correctly [1][7][30]

Group 1: Benchmark Overview
- CL-bench comprises 500 complex contexts, 1,899 tasks, and 31,607 validation checks, focusing on knowledge that models must learn from the context because it is absent from their pre-training data [9][28]
- The benchmark aims to bridge the gap between models' static memorized knowledge and humans' dynamic learning ability, emphasizing that models must adapt to new information to handle real-world tasks [5][7]

Group 2: Model Performance
- The average success rate across models on CL-bench is only 17.2%; the best-performing model, GPT-5.1 (High), achieves 23.7% [15][16]
- The evaluation revealed that many models fail to use the provided context effectively, ignoring or misusing it on a significant share of tasks [17][18]

Group 3: Key Findings
- Ignoring or misusing context is identified as the primary cause of model failures, indicating that models often fall back on static pre-trained knowledge rather than adapting to new information [17]
- Inductive reasoning from experimental data proves harder for models than deductive reasoning from explicitly provided rules [20]
- The complexity of the context, not merely its length, drives task difficulty, underscoring the need for models to improve their context-learning capabilities [25][30]

Group 4: Future Directions
- The Hunyuan team plans to focus on strengthening models' context-learning abilities and on retaining knowledge learned from context, which may shift the role of humans from data providers to context providers [30]
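The benchmark's three-level structure (contexts contain tasks, tasks contain validation checks) implies a natural scoring rule. The sketch below is a hypothetical illustration, not the actual CL-bench code: the function names and the all-checks-must-pass rule are assumptions based on the counts described above.

```python
# Hypothetical CL-bench-style scoring (names and pass rule are assumptions,
# not the released implementation). Each context holds several tasks; a task
# counts as solved only if every one of its validation checks passes.

def task_solved(check_results):
    """A task is solved only when all of its validation checks pass."""
    return all(check_results)

def success_rate(contexts):
    """contexts: list of contexts; each context is a list of tasks,
    each task a list of boolean check results."""
    tasks = [checks for ctx in contexts for checks in ctx]
    solved = sum(task_solved(checks) for checks in tasks)
    return solved / len(tasks)

# Toy example: 2 contexts, 4 tasks total, 2 tasks fully pass their checks.
contexts = [
    [[True, True], [True, False]],       # context 1: task A solved, B not
    [[False, True], [True, True, True]], # context 2: task C not, D solved
]
print(f"{success_rate(contexts):.1%}")   # → 50.0%
```

Under this rule a single failed check sinks the whole task, which is consistent with strict per-task success rates such as the 17.2% average reported above, though the real benchmark's aggregation may differ.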
