On Par with OpenAI Simple Codex: A Chinese Team Breaks into Second Place Worldwide on Terminal-Bench!
机器之心 (Jiqizhixin) · 2026-02-10 11:03
Core Insights
- The competition between Anthropic and OpenAI has intensified with the launch of Claude Opus 4.6 and GPT-5.3-Codex, marking a significant phase in the practical application of large models [1]
- The models are designed to enhance autonomous operational capabilities, addressing the commercial viability and user expectations of large models [1]

Model Performance
- In the Terminal-Bench 2.0 evaluation, Claude Opus 4.6 achieved a score of 65.4%, while GPT-5.3-Codex reached 77.3%, claiming the best coding performance [1]
- Feeling AI's CodeBrain-1, based on GPT-5.3-Codex, ranked second globally with a score of 72.9%, making it the only Chinese team in the top 10 [2][3]

CodeBrain-1 Features
- CodeBrain-1 focuses on efficiently completing coding tasks by utilizing useful context and reducing noise, which helps mitigate the hallucination issues of large language models [9]
- It employs a validation feedback mechanism that allows it to learn from errors, thus shortening the generate-validate cycle [9][10]
- The model dynamically adjusts plans and strategies, enhancing its operational success rate in real terminal environments [10][11]

Terminal-Bench 2.0 Overview
- Terminal-Bench 2.0, developed by Stanford University and Laude Institute, is a rigorous benchmark for evaluating AI agents in real command-line environments, with tasks that are complex and require multi-step solutions [13][17]
- The benchmark's high difficulty level means that even top models typically score below 65%, highlighting the challenges AI faces in complex system-level tasks [17]

Strategic Implications
- The emergence of CodeBrain-1 signifies a shift towards a more dynamic interaction model in AI, where the focus is on workflow and application rather than just model capabilities [18]
- The competitive landscape is evolving, with Chinese teams like Feeling AI positioning themselves as framework definers in the AI technology innovation path [19]
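
The validation feedback mechanism described under CodeBrain-1 Features can be illustrated with a small sketch. This is not CodeBrain-1's actual implementation, which the source does not disclose; it is a generic generate-validate loop under stated assumptions. The names `generate_validate_loop`, `Attempt`, `toy_generate`, and `toy_validate` are hypothetical, and the toy functions stand in for a model call and a terminal sandbox respectively.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Attempt:
    """One round of the loop: what was tried and what the validator said."""
    candidate: str
    ok: bool
    feedback: str


def generate_validate_loop(
    generate: Callable[[str, List[Attempt]], str],
    validate: Callable[[str], Tuple[bool, str]],
    task: str,
    max_rounds: int = 5,
) -> Tuple[Optional[str], List[Attempt]]:
    """Generate candidates until one validates or the round budget runs out.

    Each new candidate is generated with the full history of earlier
    attempts, so validator feedback can steer the next try.
    """
    history: List[Attempt] = []
    for _ in range(max_rounds):
        candidate = generate(task, history)
        ok, feedback = validate(candidate)
        history.append(Attempt(candidate, ok, feedback))
        if ok:
            return candidate, history
    return None, history


# Hypothetical stand-ins: a real agent would query a model here and run
# the candidate command in an isolated terminal environment.
def toy_generate(task: str, history: List[Attempt]) -> str:
    # Start with a flawed command; switch once the error has been seen.
    if history and "unknown flag" in history[-1].feedback:
        return "ls -a"
    return "ls --al"


def toy_validate(candidate: str) -> Tuple[bool, str]:
    if candidate == "ls -a":
        return True, ""
    return False, "unknown flag: --al"


result, history = generate_validate_loop(
    toy_generate, toy_validate, "list hidden files"
)
print(result, len(history))  # prints: ls -a 2
```

In this toy run the first candidate fails validation, the error message is fed back through `history`, and the second candidate passes. Passing the whole history rather than only the last error is a design choice made here for illustration; it lets a generator avoid repeating any earlier failed attempt.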