机器之心
Rivaling OpenAI Simple Codex: A Chinese Team Breaks into Second Place Globally on Terminal-Bench!
机器之心· 2026-02-10 11:03
Core Insights
- The competition between Anthropic and OpenAI has intensified with the launch of Claude Opus 4.6 and GPT-5.3-Codex, marking a significant phase in the practical application of large models [1]
- The models are designed to enhance autonomous operational capabilities, addressing the commercial viability and user expectations of large models [1]

Model Performance
- In the Terminal-Bench 2.0 evaluation, Claude Opus 4.6 achieved a score of 65.4%, while GPT-5.3-Codex reached 77.3%, claiming the best coding performance [1]
- Feeling AI's CodeBrain-1, based on GPT-5.3-Codex, ranked second globally with a score of 72.9%, making it the only Chinese team in the top 10 [2][3]

CodeBrain-1 Features
- CodeBrain-1 focuses on efficiently completing coding tasks by utilizing useful context and reducing noise, which helps mitigate the hallucination issues of large language models [9]
- It employs a validation-feedback mechanism that allows it to learn from errors, thus shortening the generate-validate cycle [9][10]
- The model dynamically adjusts plans and strategies, enhancing its success rate in real terminal environments [10][11]

Terminal-Bench 2.0 Overview
- Terminal-Bench 2.0, developed by Stanford University and the Laude Institute, is a rigorous benchmark for evaluating AI agents in real command-line environments, with tasks that are complex and require multi-step solutions [13][17]
- The benchmark's high difficulty means that top models have typically scored below 65%, highlighting the challenges AI faces in complex system-level tasks [17]

Strategic Implications
- The emergence of CodeBrain-1 signals a shift toward a more dynamic interaction model in AI, where the focus is on workflow and application rather than just model capabilities [18]
- The competitive landscape is evolving, with Chinese teams like Feeling AI positioning themselves as framework definers on the path of AI technology innovation [19]
Tsinghua and Qwen Join Forces to Reshape the Normalization Paradigm, Returning the Transformer to "Deep" Learning
机器之心· 2026-02-10 11:03
In the nineteenth-century Kingdom of Siam, a pair of conjoined brothers was born: each had a full set of limbs and an independent brain, yet their sixty-plus years of life were permanently bound together by a band of tissue, less than ten centimeters long, joining them at the waist. Their conjoined state brought endless constraint, until they left Siam and took to the circus stage. Over a decade, the brothers toured Europe and America with near-perfect, almost single-bodied coordination and achieved great success. People later named this phenomenon after their homeland: Siamese Twins. The name eventually crossed the boundaries of biology. In 1993, Yann LeCun brought it into neural networks, creating the weight-sharing Siamese Network for measuring the similarity of inputs.

Today, in the twenty-first century, artificial intelligence has its own pair of "twins": Pre-Norm and Post-Norm. Born to stabilize large-model training, they quickly became the key paradigms for stabilizing signal flow in the Transformer architecture. Yet the training stability that normalization brings is not free, and the two paradigms seem to face a hard-to-reconcile trade-off. Although Pre-Norm has been adopted by well-known open-source base models such as GPT-3, LLaMA, DeepSeek, and Qwen, multiple studies point to the same sobering fact: Pr ...
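The Pre-Norm / Post-Norm contrast described above can be sketched in a few lines. This is a minimal numpy sketch, not any particular model's implementation: `tanh` stands in for the attention/FFN sublayer, and the learned LayerNorm gain and bias are omitted.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each row to zero mean and unit variance (no learned affine).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def sublayer(x):
    # Placeholder for the attention/FFN sublayer of a real block.
    return np.tanh(x)

def pre_norm_block(x):
    # Pre-Norm: normalize *before* the sublayer; the residual path
    # carries x through untouched, which eases gradient flow.
    return x + sublayer(layer_norm(x))

def post_norm_block(x):
    # Post-Norm: apply the sublayer first, then normalize the sum,
    # so the residual signal itself is renormalized at every block.
    return layer_norm(x + sublayer(x))

x = np.random.randn(2, 8)
print(pre_norm_block(x).shape, post_norm_block(x).shape)
```

The ordering difference is the whole story: in Pre-Norm the identity path is never rescaled, while in Post-Norm every block renormalizes it, which is exactly the tension between trainability and "depth" the article alludes to.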
The First Year of "Embodied-Native" AI! An Interview with 原力灵机's Wang Tiancai on Embodied Intelligence's "PyTorch Moment"
机器之心· 2026-02-10 08:52
Editor: Panda

In the digital world, AI agents are already carrying out semantic negotiation and co-evolving through social networks like MoltBook; in the physical world, embodied intelligence has likewise just reached a milestone.

In 原力灵机's lab, a 3D-printed SO-101 robot arm, open-sourced by Hugging Face, deftly places objects of all shapes into designated boxes. The motion looks simple, but it in fact involves very high-frequency visual feedback, force sensing, and intuitive judgment of a complex physical environment. This leap from "computation" to "intuition" does not come from tedious tuning for this particular hardware, but from a standardized layer of underlying infrastructure.

At its technology developer day on February 10, 原力灵机 officially released the open-source embodied-native framework Dexbotic 2.0 and announced a strategic partnership with RLinf, a reinforcement learning framework backed by Tsinghua University and 无问芯穹.

原力灵机 partner Wang Tiancai (汪天才) defines the deep integration of Dexbotic 2.0 with RLinf as the "PyTorch moment" of the embodied-intelligence industry. Just as PyTorch unleashed the productivity of deep learning through standardized tensor computation and automatic differentiation, the combination of Dexbotic 2.0 and RLinf aims to establish a universal base and shared infrastructure for the heavily fragmented embodied-intelligence field. Alongside this framework ...
TTCS, the First Test-Time Co-Evolving Synthesis Framework: Breaking Through Reasoning Bottlenecks by "Sparring with Itself"
机器之心· 2026-02-10 08:52
As large language models (LLMs) have advanced, the industry consensus has shifted from simply "scaling parameters in pre-training" to mining the potential of Test-Time Scaling.

Paper title: TTCS: Test-Time Curriculum Synthesis for Self-Evolving
Paper link: https://arxiv.org/abs/2601.22628
Project code: https://github.com/XMUDeepLIT/TTCS
HuggingFace page: https://huggingface.co/papers/2601.22628

In the era of post-training and test-time scaling ushered in by DeepSeek-R1 and OpenAI o1, how to use test-time compute for effective training has become a focal point. However, when facing extremely hard test problems, existing Test-Time Training (TTT) often degenerates into "blind guessing" because its pseudo-labels are too noisy. The DeepLIT group at Xiamen University proposes a new test-time curriculum synthesis framework, TTCS (Test-Time Curriculum Synthesis). The framework does not rely ...
Solving Robots' "Half-Beat Lag": NTU Fixes a Fatal VLA Shortcoming, Leading the Field by a Wide Margin in Dynamic Worlds
机器之心· 2026-02-10 03:46
While an object is rolling, sliding, or being knocked away, the robot is still executing an action predicted several hundred milliseconds earlier. In a dynamic world, that delay often means failure.

Over the past few years, Vision-Language-Action (VLA) models have rapidly become a focal point in robotics: robots can "see" a scene, "understand" a language instruction, and directly output continuous actions, making notable progress on static grasping, placement, and tabletop manipulation tasks. But a long-overlooked problem remains: the real world is almost never static. When objects start to move, accelerate, collide, or change trajectory, today's mainstream VLA models often react sluggishly, produce mismatched actions, or fail outright.

Paper link: https://arxiv.org/abs/2601.22153
Project link: https://haozhexie.com/project/dynamic-vla/
GitHub link: https://github.com/hzxie/DynamicVLA

In static scenes, VLA models usually follow a fixed pipeline. The problem is not that the models aren't smart; it is that they cannot keep up with time. Recently, a research team from NTU S-Lab proposed DynamicVLA, the first work to systematically revisit and address dynamic object manipulation (Dyn ...
The Keyword Opening 2026: Self-Distillation, as Large Models Truly Move Toward "Continual Learning"
机器之心· 2026-02-10 03:46
机器之心 editorial desk

As 2026 gets under way, researchers in the large language model (LLM) field seem to have reached a tacit consensus. Flip through the most-watched recent papers on arXiv and one high-frequency term stands out: Self-Distillation.

In recent years, foundation models have achieved remarkable success, powering AI applications across language, vision, and robotics. But in real deployment and long-term use, researchers have gradually found that letting a model absorb new knowledge without losing its existing core abilities, i.e. "continual learning", is becoming the key bottleneck constraining large-model evolution. The traditional strong-teacher paradigm, with its cost and data dependence, is ill-suited to high-frequency continual evolution. Self-Distillation has emerged as the way out: through well-designed contextual guidance or feedback mechanisms, a model can construct a temporary self that is smarter than its current weights, achieving endogenous growth without an external strong teacher.

Building on this insight, a tight-knit academic circle spanning MIT, ETH Zurich, Meta, and Stanford released three studies in quick succession in January 2026.

1. Self-Distillation Enables Continual Learning

In continual learning, traditional supervised fine-tuning (SFT) often suffers from "catastrophic ...
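The core loop of self-distillation, training a model against targets produced by a "better temporary self" derived from the model itself, can be illustrated with a deliberately toy numpy sketch. Here the "teacher" is just the current model's own temperature-smoothed logits, a minimal stand-in for the context-guided teacher described above; this reproduces none of the three papers.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q):
    # Total KL divergence between row-wise distributions p and q.
    return float(np.sum(p * (np.log(p) - np.log(q))))

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))       # toy inputs
W = rng.normal(size=(4, 3)) * 0.5  # toy linear "model"

def teacher(X, W, temp=2.0):
    # "Smarter temporary self": the model's own logits, smoothed by a
    # temperature -- a stand-in for a context- or feedback-guided teacher.
    return softmax(X @ W / temp)

kl_before = kl(teacher(X, W), softmax(X @ W))
for _ in range(200):
    t = teacher(X, W)                     # frozen self-teacher targets
    s = softmax(X @ W)                    # student predictions
    W -= 0.5 * (X.T @ (s - t)) / len(X)   # cross-entropy gradient step
kl_after = kl(teacher(X, W), softmax(X @ W))
print(kl_before, kl_after)
```

The point of the sketch is the loss structure: the distillation target comes from the model itself rather than an external strong teacher, so each update pulls the student toward its own smoothed predictions.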
Want Robots to Make Dumplings at the Spring Festival Gala? Alibaba DAMO Academy: Not So Fast, Optimize the "Brain" First
机器之心· 2026-02-10 03:46
In 2026, with so many robots appearing on the Spring Festival Gala, could they make dumplings for the audience? That is a question many people are surely curious about. But judging from recent rehearsal reports, it is unlikely; the robots are more likely to be designed to carry in trays of dumplings.

As industry insiders know, without scripted programming or teleoperation, getting a robot to make dumplings is far more complex than locomotion or navigation. It also involves dumpling wrappers, a deformable object that amounts to a Turing test for robots, and it simply cannot be done without a sufficiently smart "brain". That is why, over the past year, more and more research effort and funding have been flowing toward the "brain".

Editor: Zhang Qian

Alibaba DAMO Academy's recent work, RynnBrain, also targets this direction. Unlike some research that demonstrates folding clothes or making breakfast, the questions it focuses on are more fundamental: if a robot doing housework is interrupted to pick up a delivery at the door, can it come back and keep washing the dishes? If it is asked to complete a task that requires many tools, will the plan it drafts include tools it does not even have on hand? In the grand narratives around robotics, these questions may seem unremarkable; even the relevant benchmarks are missing. Yet they are thresholds a robot must cross to leave the lab. In building RynnBrain, the DAMO embodied-intelligence team chose to start from the bottom, training spatiotemporal memory and physical-space reasoning directly into the model, and achieved decent ...
This Spring Festival, AI Stopped Chatting and Started Paying My Bills
机器之心· 2026-02-09 05:12
Core Viewpoint
- The article discusses the competitive landscape of AI applications during the Chinese New Year, highlighting how major tech companies are leveraging AI to attract user attention and enhance consumer experiences through innovative strategies.

Group 1: AI Competition and Strategies
- Tencent initiated the AI Spring Festival battle by launching a 1 billion cash red envelope campaign and introducing a new AI social feature called "Yuanbao" [1]
- Baidu followed with a 500 million red envelope initiative, collaborating with Beijing TV for the Spring Festival Gala [1]
- ByteDance brought its Volcano Engine to the backstage of the CCTV Spring Festival Gala, intensifying the competition [1]
- Alibaba's Qianwen APP entered the fray with a 3 billion "Spring Festival Treat Plan," integrating its ecosystem to offer a comprehensive consumer experience [1]

Group 2: User Engagement and Order Volume
- On the first day of the Qianwen APP event, over 10 million AI orders were completed within 9 hours, with more than 30 million "help me buy" requests received [2]
- The overwhelming user engagement led to server congestion, prompting the Qianwen team to request leniency from users [5]

Group 3: New Consumption Habits and AI Capabilities
- The 3 billion investment not only aimed at distributing cash but also tested new consumer habits, encouraging users to interact with the Qianwen APP [6]
- Qianwen demonstrated impressive cross-application coordination, efficiently managing travel planning by integrating services from Fliggy and Gaode [10]
- In the family consumption sector, Qianwen acted as a "universal shopping guide," quickly filtering products from Taobao and Tmall based on user needs [13]

Group 4: Differentiation in AI Development
- The article notes the distinct strategies among tech companies, with Qianwen focusing on integrating AI into shopping and ticketing, thus targeting higher-value consumer scenarios [15]
- The differences in AI development paths between China and the U.S. are highlighted, with U.S. AI primarily focusing on high-value B2B markets, while China's AI is more consumer-oriented [20][21]

Group 5: Future of AI Applications
- The article posits that 2023 and 2024 were pivotal years for AI technology, with 2025 marking a year of application exploration, as evidenced by Qianwen's efforts to break the boundaries of AI applications [26]
- Qianwen's approach combines AI capabilities with a robust ecosystem, enabling seamless integration of AI into everyday life [28]
- The evolution of user interaction with digital platforms is shifting toward a more streamlined experience, emphasizing efficiency and convenience [29]
CVPR 2026 Workshop Call for Papers | The 6th AdvML@CV: Safety of Multimodal Large-Model Agents
机器之心· 2026-02-09 05:12
Core Viewpoint
- The article announces the 6th AdvML@CV workshop focusing on the safety and robustness of vision-language agents, scheduled during the CVPR 2026 conference in Denver, Colorado from June 3 to June 7, 2026 [2][3].

Group 1: Workshop Themes
- The workshop will address the safety and robustness of vision-language agents, which have seen significant advancements due to multimodal foundational models [4][5].
- Vision-language agents are becoming integral in fields such as autonomous driving and intelligent robotics, but their increased autonomy introduces complex security risks, including adversarial prompts, instruction injection, and jailbreak manipulations [5].

Group 2: Call for Papers
- The workshop invites submissions related to various topics, including attacks and defenses on vision-language agents, as well as datasets and benchmarks for evaluating these agents [6][7].
- Specific areas of interest include adversarial/jailbreak attacks, improving agent robustness, and aligning vision-language agents [10].

Group 3: Submission Guidelines
- Long papers should be a maximum of 8 pages (excluding references), while extended abstracts should be no more than 4 pages (including references) [10].
- All submissions must be anonymous and adhere to the CVPR 2026 Author Kit template [10].

Group 4: Important Dates
- The abstract and paper submission deadline is March 5, 2026, with author notifications on March 17, 2026, and camera-ready submissions due by April 1, 2026 [10].
Understand Behavior First, Then Train the Agent: CMU Open-Sources the First Agentic Search Log Dataset, Taking Agents Apart for You
机器之心· 2026-02-09 01:18
As LLM-driven Agentic Search becomes routine, how agents in real environments issue queries, rewrite them, and whether they actually use the retrieved information has long lacked systematic characterization and analysis.

Building on the reproducible retrieval platform DeepResearchGym, the CMU team curated more than 14 million search requests and roughly 4 million sessions from half a year of real traffic on a unified backend and, after strict anonymization and cleaning, built and open-sourced on Hugging Face the first behavior-log dataset for Agentic Search.

On top of this, the work proposes a three-layer analysis framework: session intent (Declarative / Procedural / Reasoning) → trajectory actions (specialization / generalization / exploration / repetition) → retrieved-information adoption rate (CTAR). Using an LLM for session segmentation and label inference, it characterizes the pervasive drill-down preference in agentic search, retry loops on factual tasks, and significant differences in how much different rewriting patterns depend on previously retrieved information.

Overall, the study provides the first large-scale open log for observing and evaluating Agentic Search behavior, as well as a reproducible data foundation and quantifiable behavioral signals for explicitly modeling "whether to search" in future agent training and system design.

Paper title: Agentic Search in th ...
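The excerpt does not define how CTAR is computed, so purely as a hypothetical illustration of what a "retrieved-information adoption rate" could measure, here is a lexical-overlap sketch; the paper's actual metric is LLM-inferred and may differ entirely, and `adoption_rate` is an invented name.

```python
def adoption_rate(answer_tokens, retrieved_tokens):
    """Hypothetical stand-in for an adoption-rate metric: the fraction
    of answer tokens that also occur in the retrieved results. This is
    an illustration only, not the paper's CTAR definition."""
    retrieved = set(retrieved_tokens)
    if not answer_tokens:
        return 0.0
    return sum(t in retrieved for t in answer_tokens) / len(answer_tokens)

rate = adoption_rate(
    ["paris", "is", "the", "capital"],
    ["paris", "capital", "france", "city"],
)
print(rate)  # 0.5
```

Whatever the exact definition, the intuition is the same: a high adoption rate means the agent's output is grounded in what it retrieved, while a low one suggests the searches were decorative.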