机器之心
ICLR 2026 Oral | Does DPO "only look at the total score and ignore the details"? TI-DPO reshapes LLM alignment with token importance
机器之心· 2026-02-11 03:00
Core Viewpoint
- The article discusses the emergence of the TI-DPO framework, which addresses the limitations of Direct Preference Optimization (DPO) in fine-tuning large language models, particularly in identifying the critical tokens that influence model performance [2][24].

Research Background and Significance
- Mainstream methods face two core challenges: the sequence-level binary-classification trap, which oversimplifies data into "good" and "bad" categories, and "pseudo" importance tied to biases in token evaluation, leading to a lack of fine-grained semantic control [5][7].

TI-DPO Core Mechanism
- TI-DPO introduces a hybrid weighting mechanism and a triplet loss to sharpen the identification of key tokens while suppressing noise, yielding more accurate alignment than traditional DPO [9][10].
- The hybrid weighting mechanism combines data-driven and structural-prior signals to compute token weights, while the triplet loss frames the optimization as a geometric problem, encouraging the model to generate responses closer to preferred answers [9][10].

Experimental Results
- TI-DPO was tested on models such as Llama-3 and Mistral-7B, outperforming more than 10 alignment algorithms, including DPO and GRPO, with an average score of 62.3 on Llama-3.1-8B-Instruct [13][14].
- On tasks such as instruction following, truthfulness, and code generation, TI-DPO significantly surpassed DPO, SimPO, and GRPO, demonstrating its effectiveness at fine-grained control [17][20].

Case Demonstration
- A medical-consultation case illustrates TI-DPO's ability to identify critical tokens, showing that the model genuinely internalized human values rather than merely memorizing responses [22][24].

Summary and Contribution
- TI-DPO marks a significant shift from coarse sequence-level optimization to precise token-level control, clarifying each token's contribution to value alignment. The framework's performance gains across tasks validate the effectiveness of finer data granularity for model capability [25].
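The summary above names a hybrid token-weighting mechanism plus a triplet loss; the paper's exact formulas are not given here, but the weighting idea can be sketched numerically. In this minimal sketch the function names, the per-token `weights`, and `beta` are illustrative assumptions, not the paper's API: importance-weighted per-token log-ratios replace the uniform sum used in vanilla DPO.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def weighted_logratio(logp_policy, logp_ref, weights):
    # Importance-weighted sum of per-token log-ratios log pi_theta(y_t)/pi_ref(y_t).
    return sum(w * (lp - lr) for w, lp, lr in zip(weights, logp_policy, logp_ref))

def ti_dpo_loss(pol_w, ref_w, wts_w, pol_l, ref_l, wts_l, beta=0.1):
    # DPO-style logistic loss over the weighted margin between the chosen (w)
    # and rejected (l) responses (hypothetical form for illustration).
    margin = (weighted_logratio(pol_w, ref_w, wts_w)
              - weighted_logratio(pol_l, ref_l, wts_l))
    return -math.log(sigmoid(beta * margin))
```

With all weights equal to 1 this reduces to the standard DPO objective; upweighting a token on which the policy already beats the reference enlarges the margin and lowers the loss.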
RLinf-USER officially released! Stop relying on simulation: real-world training can also be "extremely efficient and systematic"
机器之心· 2026-02-11 03:00
Published by 机器之心

Key takeaways:
- First unified system: elevates physical robots to compute resources on par with GPUs, breaking down hardware barriers.
- ⚡️ Extreme efficiency: a fully asynchronous architecture boosts real-world training throughput by 5.7x.
- Heterogeneous collaboration: lets robots of different brands and morphologies (e.g., Franka + ARX) co-evolve under the same model.
- Large-model support: native cloud-edge-device online fine-tuning for VLA models such as PI0.

Code: https://github.com/RLinf/RLinf
Paper: https://arxiv.org/abs/2602.07837

Today we officially introduce RLinf-USER, a unified and extensible system built for real-world online policy learning. More than a training framework, it is the "nervous system" connecting the digital brain to the physical body, and a key step toward policy evolution for thousands of robots in the physical world.

02. What is RLinf-USER? RLinf-USER (Unified and Extensible System for Real-World Online Policy Learning) is a dedicated system built on the RLinf infrastructure. Its core idea is simple: encapsulate the complexity of the physical world into clean comp…
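The article does not give implementation details for the fully asynchronous architecture, but the decoupling idea behind such systems can be sketched as a generic producer-consumer loop: robot actors stream transitions into a shared buffer while the learner consumes them without ever blocking a rollout. All names here are invented for illustration, not RLinf-USER's API.

```python
import queue
import threading

def actor(robot_id, buf, n_steps):
    # Each robot streams its transitions without waiting for the learner.
    for step in range(n_steps):
        buf.put((robot_id, step))

def learner(buf, updates, total):
    # The learner consumes whatever has arrived, regardless of which robot sent it,
    # so heterogeneous robots can feed one model concurrently.
    for _ in range(total):
        updates.append(buf.get())

buf, updates = queue.Queue(), []
actors = [threading.Thread(target=actor, args=(i, buf, 5)) for i in range(3)]
learn = threading.Thread(target=learner, args=(buf, updates, 15))
learn.start()
for a in actors:
    a.start()
for a in actors:
    a.join()
learn.join()
```

Because neither side waits for the other's full cycle, slow hardware never stalls the gradient pipeline, which is the intuition behind the reported throughput gain.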
A milestone moment! A 100B diffusion language model hits 892 tokens/sec: AI's other path is now viable
机器之心· 2026-02-11 01:59
Core Insights
- The article discusses significant advancements in diffusion language models (dLLM), particularly the release of LLaDA2.1, which marks a transformative moment for this research area [2][4].
- LLaDA2.1 demonstrates a peak speed of 892 tokens per second (TPS) for its 100-billion-parameter version, showcasing its efficiency and practical applicability [13][14].
- The model introduces a novel error-correcting, editable mechanism that allows real-time corrections during text generation, addressing a limitation of traditional autoregressive models [16][17].

Group 1: Model Features and Innovations
- LLaDA2.1 comes in two versions, LLaDA2.1-Mini (16B) and LLaDA2.1-Flash (100B), with the latter achieving remarkable performance metrics [2][4].
- The model employs a dual-mode system that lets users switch between a speed-focused mode and a quality-focused mode, enhancing usability [20][26].
- Introducing reinforcement learning into training allows LLaDA2.1 to better follow instructions and align with user intent, improving its overall reliability [21][22].

Group 2: Performance Metrics and Comparisons
- In benchmark tests, LLaDA2.1 outperformed its predecessor LLaDA2.0 across tasks, particularly in quality mode, where it exceeded previous scores [24][30].
- Its speed advantage is most evident in coding tasks: it reached a peak of 891.74 TPS on the HumanEval+ benchmark, significantly enhancing its practicality for programming [28][30].
- Comparative data indicate that LLaDA2.1 consistently surpasses other models in speed and efficiency across multiple benchmarks [25][27].

Group 3: Implications for the Industry
- The advances in LLaDA2.1 suggest a potential shift in the AI language model landscape, moving beyond the dominance of autoregressive models to explore the capabilities of diffusion models [33].
- Successfully scaling a diffusion model to 100 billion parameters indicates a breakthrough against previous limitations on model size and performance [14][33].
- While autoregressive models have been the primary focus, LLaDA2.1 illustrates the viability of alternative approaches, potentially leading to a more diverse range of solutions in the AI language model space [33].
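The parallel-unmasking decoding that gives diffusion LLMs their throughput edge can be illustrated with a toy loop. This is not LLaDA2.1's actual algorithm: `toy_predictor`, the confidence rule, and `tokens_per_step` are all stand-ins for a real denoiser.

```python
MASK = "_"

def toy_predictor(seq, target):
    # Stand-in for the denoiser: proposes a token and a confidence per position.
    # Here confidence simply decays left-to-right (purely illustrative).
    return [(target[i], 1.0 / (i + 1)) for i in range(len(seq))]

def diffusion_decode(target, tokens_per_step=2):
    # Start fully masked; each step commits several positions at once.
    seq = [MASK] * len(target)
    steps = 0
    while MASK in seq:
        preds = toy_predictor(seq, target)
        masked = [i for i, t in enumerate(seq) if t == MASK]
        # Commit the k most confident masked positions in parallel --
        # this parallel unmasking is where dLLM throughput comes from.
        ranked = sorted(masked, key=lambda i: preds[i][1], reverse=True)
        for i in ranked[:tokens_per_step]:
            seq[i] = preds[i][0]
        steps += 1
    return "".join(seq), steps
```

Unlike an autoregressive decoder, which needs one forward pass per token, this loop finishes a length-n sequence in roughly n/k passes; LLaDA2.1's editable mechanism additionally allows already-committed positions to be revised, which this sketch omits.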
ICLR 2026 | Beyond Moltbook, SJTU and Shanghai AI Lab simulate the "real dark side" of AI-native social networks
机器之心· 2026-02-11 01:59
The main authors of this paper are from Shanghai Jiao Tong University and the Shanghai AI Laboratory. Core contributors include Ren Qibing, Zheng Zhijie, and Guo Jiaxuan, advised by Ma Lizhuang and Shao Jing; their research focuses on safe and controllable large models and agents.

Will it produce collective intelligence, or... collective malice?

Recently, a new study from Shanghai Jiao Tong University and the Shanghai AI Laboratory, published at ICLR 2026, takes a deep look at the coordinated financial-fraud behavior that can emerge among multiple agents in social networks. The intent is not to stoke anxiety, but in deep stress tests within a high-fidelity environment the team found trends the whole community should heed. The project is now open source and supports a Clawdbot interface: you can plug your Clawdbot into the project environment and, by sparring with bad actors, train it into an "anti-fraud expert." The platform also supports multiple Clawdbots competing in the same environment in real time, suitable for co-evolution evaluation.

Paper: https://arxiv.org/pdf/2511.06448
Project page: https://zheng977.github.io/MutiAgent4Fraud
Code: https://github.com/zheng977/MutiAgent4Fraud

Recently, Moltbook's explosive popularity and subsequent rapid "collapse" have become an unavoidable topic in AI circles…
Rivaling OpenAI's Simple Codex, a Chinese team breaks into second place worldwide on Terminal-Bench!
机器之心· 2026-02-10 11:03
Core Insights
- Competition between Anthropic and OpenAI has intensified with the launch of Claude Opus 4.6 and GPT-5.3-Codex, marking a significant phase in the practical application of large models [1]
- The models are designed to enhance autonomous operational capabilities, addressing large models' commercial viability and user expectations [1]

Model Performance
- In the Terminal-Bench 2.0 evaluation, Claude Opus 4.6 scored 65.4%, while GPT-5.3-Codex reached 77.3%, claiming the best coding performance [1]
- Feeling AI's CodeBrain-1, built on GPT-5.3-Codex, ranked second globally at 72.9%, making it the only Chinese team in the top 10 [2][3]

CodeBrain-1 Features
- CodeBrain-1 focuses on completing coding tasks efficiently by retaining useful context and reducing noise, which helps mitigate large language models' hallucination issues [9]
- It employs a validation-feedback mechanism that lets it learn from errors, shortening the generate-validate cycle [9][10]
- The model dynamically adjusts plans and strategies, raising its success rate in real terminal environments [10][11]

Terminal-Bench 2.0 Overview
- Terminal-Bench 2.0, developed by Stanford University and the Laude Institute, is a rigorous benchmark for evaluating AI agents in real command-line environments, with complex tasks that require multi-step solutions [13][17]
- Its difficulty is such that even top models typically score below 65%, highlighting the challenges AI faces in complex system-level tasks [17]

Strategic Implications
- The emergence of CodeBrain-1 signals a shift toward a more dynamic interaction model in AI, where the focus is on workflow and application rather than raw model capability [18]
- The competitive landscape is evolving, with Chinese teams like Feeling AI positioning themselves as framework definers on the AI technology innovation path [19]
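The validation-feedback mechanism described above, in which the agent learns from errors to shorten the generate-validate cycle, can be sketched generically. The function names and the toy task below are invented for illustration and are not CodeBrain-1's interface.

```python
def generate_validate_loop(generate, validate, max_rounds=3):
    # Shortened generate-validate cycle: feed the validator's error report
    # back to the generator as context for the next attempt.
    feedback = None
    for attempt in range(1, max_rounds + 1):
        candidate = generate(feedback)
        ok, feedback = validate(candidate)
        if ok:
            return candidate, attempt
    return None, max_rounds

# Toy stand-ins: the "model" only emits correct code once it has seen the error.
def fake_generate(feedback):
    return "fixed" if feedback else "buggy"

def fake_validate(code):
    return (code == "fixed", None if code == "fixed" else "SyntaxError: ...")
```

In a real agent, `validate` would run tests or the compiler inside the terminal sandbox; the point of the loop is that each failure becomes new context rather than a dead end.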
Tsinghua and Qwen rework the normalization paradigm, bringing the Transformer back to "deep" learning
机器之心· 2026-02-10 11:03
In the nineteenth-century Kingdom of Siam, a pair of conjoined brothers was born: each had a complete set of limbs and an independent brain, yet their sixty-plus years of life were forever bound together by a band of tissue less than ten centimeters long joining them at the waist. Their conjoined bodies brought endless constraint, until they left Siam and stepped onto the circus stage. Over a decade, the brothers toured Europe and America with near-perfect coordination, to enormous success.

People later used the name of their homeland to call this phenomenon "Siamese twins." The name eventually crossed the boundary of biology: in 1993, Yann LeCun brought it into neural networks, creating the weight-sharing Siamese Network for measuring the similarity of inputs.

Today, in the twenty-first century, artificial intelligence has its own pair of "twins": Pre-Norm and Post-Norm. Born to solve the stability of large-model training, they quickly became the key paradigms for stabilizing signal flow in the Transformer architecture. Yet the training stability that normalization brings is not free: the two paradigms seem to face an irreconcilable trade-off.

Although Pre-Norm has been adopted by well-known open-source base models such as GPT-3, LLaMA, DeepSeek, and Qwen, multiple studies jointly point to a stark fact: Pr…
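Pre-Norm and Post-Norm differ only in where the normalization sits relative to the residual connection. A minimal numpy sketch, with a plain linear map standing in for the attention/FFN sublayer:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize a vector to zero mean and unit variance.
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def sublayer(x, W):
    # Stand-in for attention or the feed-forward block.
    return W @ x

def block(x, W, pre_norm=True):
    if pre_norm:
        # Pre-Norm: x + F(LN(x)) -- the residual path bypasses the norm.
        return x + sublayer(layer_norm(x), W)
    # Post-Norm: LN(x + F(x)) -- the whole residual stream is renormalized.
    return layer_norm(x + sublayer(x, W))

x = np.array([1.0, 2.0, 3.0, 4.0])
W = np.eye(4)
```

With `pre_norm=True` the output keeps the input's scale (the identity path is untouched, which eases optimization of very deep stacks); with `pre_norm=False` every block's output is re-centered, which constrains the signal but changes the identity path. That placement difference is exactly the trade-off the passage describes.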
Year one of "embodied native"! An interview with Wang Tiancai of 原力灵机, decoding embodied intelligence's "PyTorch moment"
机器之心· 2026-02-10 08:52
Core Viewpoint
- The article discusses significant advancements in embodied intelligence, particularly the launch of the Dexbotic 2.0 framework and its collaboration with RLinf, marking a pivotal moment in the industry's move toward a "native embodied" era of AI [3][5][9].

Group 1: Framework and Collaboration
- The Dexbotic 2.0 framework aims to standardize the infrastructure of embodied intelligence, much as PyTorch did for deep learning [5][16].
- The collaboration with Tsinghua University and RLinf focuses on enhancing embodied AI through a unified framework that integrates perception, decision-making, and execution [3][5][19].
- The introduction of the DM0 model and the DFOL workflow signals a comprehensive approach to developing and deploying embodied applications [6][51].

Group 2: Embodied Native Concept
- "Embodied Native" denotes a closed-loop system of perception, decision-making, and execution that lets AI interact effectively with the physical world [15][13].
- The framework promotes real-world data and multi-modal training to deepen the model's understanding of, and interaction with, its environment [17][41].
- The transition from a "big-model brain + mechanical limbs" approach to a fully integrated embodied system is highlighted as a key evolution in the field [12][13].

Group 3: Technical Innovations
- Dexbotic 2.0 features a modular design that keeps high flexibility while ensuring end-to-end processing, allowing independent upgrades of the perception, cognition, and control modules [21][33].
- The framework integrates multiple models and capabilities, including vision-language-action (VLA) and navigation, to achieve comprehensive task execution [37][38].
- A standardized data format (Dexdata) and a unified training pipeline address the fragmentation in embodied-intelligence development [45][46].

Group 4: Performance and Evaluation
- The DM0 model, with 2.4 billion parameters, achieves high performance in real-world evaluations, demonstrating its capability in both single-task and multi-task scenarios [57][58].
- The RoboChallenge benchmark is established to evaluate embodied models fairly, ensuring that performance metrics reflect true capability rather than optimized scores [46][57].
- The DFOL workflow enables continuous improvement of robotic systems through real-time data feedback, enhancing their operational efficiency [62][65].

Group 5: Future Insights
- The article emphasizes integrating multi-modal sensory inputs, such as touch and hearing, to enrich modeling of the physical world [74].
- Embodied intelligence is evolving rapidly, with significant advances expected in the near future, akin to the pace of large-model development [73][75].
- The company advocates an open-source approach to foster collaboration and innovation within the embodied-intelligence community, aiming to lower barriers for developers [68][71].
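The perception-decision-execution closed loop with independently upgradable modules can be sketched as a minimal interface. The stage names and toy implementations below are invented for illustration, not Dexbotic's API.

```python
from typing import Callable

def run_step(perceive: Callable, decide: Callable, act: Callable, world_state):
    # One tick of the closed loop. Because each stage is just a callable with a
    # fixed contract, any module can be swapped without touching the others --
    # the modularity property the article attributes to the framework.
    obs = perceive(world_state)
    action = decide(obs)
    return act(action)

# Toy stages: perception extracts a feature, decision maps it to a command,
# control "executes" it.
perceive = lambda s: {"target_x": s["object_x"]}
decide = lambda obs: ("move_to", obs["target_x"])
act = lambda cmd: f"executed {cmd[0]} -> {cmd[1]}"
```

Upgrading, say, the perception module then means replacing one callable while the decision and control stages keep running unchanged.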
TTCS, the first test-time co-evolving synthesis framework: breaking through reasoning bottlenecks via "self-play"
机器之心· 2026-02-10 08:52
Core Insights
- The article discusses the emergence of the Test-Time Curriculum Synthesis (TTCS) framework, which addresses challenges in Test-Time Training (TTT) by generating curriculum data aligned with the model's capability frontier, thus enhancing performance on difficult test problems [2][10][30]

Group 1: Motivation and Background
- A core motivation is the field's shift from merely expanding the parameters of large language models (LLMs) to leveraging Test-Time Scaling for effective training [5]
- Existing TTT methods struggle on high-difficulty test questions because noisy pseudo-labels lead to ineffective learning [2][7]

Group 2: Methodology
- TTCS operates as a co-evolutionary framework with two agents: a Synthesizer, which generates questions at the model's capability frontier, and a Solver, which attempts to solve them [11][14]
- A capability-adaptive reward mechanism ensures the generated questions are neither too easy nor too difficult, creating a dynamic learning environment [16]

Group 3: Experimental Results
- TTCS delivered significant gains in mathematical reasoning, with Qwen2.5-Math-1.5B rising from an average score of 17.30 to 41.49, an increase of +24.19 [3][20]
- On challenging AIME competition problems, TTCS outperformed strong baselines such as TTRL, showcasing its effectiveness on high-difficulty questions [22][23]

Group 4: Broader Implications
- The framework also generalizes across varied reasoning tasks beyond mathematics, indicating that the model learns universal reasoning logic rather than overfitting [22]
- The findings suggest that adaptive teaching (a dynamic Synthesizer) is more effective than a static high-capability model, emphasizing the importance of tailored learning experiences [25][26]

Group 5: Conclusion and Future Outlook
- TTCS represents a reconstruction of the Test-Time Computing paradigm, positioning models as active curriculum designers rather than passive problem solvers [30]
- The framework addresses the critical issues of data scarcity and difficulty gaps in test-time training, paving the way for future self-evolving agents capable of continuous evolution in unknown environments [30]
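The capability-adaptive reward is described only qualitatively above: synthesized questions should be neither too easy nor too hard. One simple functional form consistent with that description, chosen here purely for illustration and not taken from the paper, is a tent function that peaks when the Solver answers about half the time.

```python
def frontier_reward(success_rate, target=0.5):
    # Hypothetical capability-adaptive reward for the Synthesizer: maximal when
    # the Solver's empirical success rate sits at the capability frontier
    # (`target`), falling off linearly toward trivial (1.0) or hopeless (0.0).
    return 1.0 - abs(success_rate - target) / max(target, 1.0 - target)
```

Under this shaping, a question every rollout solves (rate 1.0) and one no rollout solves (rate 0.0) both earn zero reward, so the Synthesizer is pushed to generate problems right at the edge of what the Solver can do.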
Solving robots' "half-a-beat-behind" problem: NTU fixes a fatal VLA weakness, leading by a wide margin in dynamic worlds
机器之心· 2026-02-10 03:46
While an object is rolling, sliding, or being knocked away, the robot is still executing an action predicted hundreds of milliseconds earlier. In a dynamic world, that latency often means failure.

Over the past few years, Vision-Language-Action (VLA) models have rapidly become a focus of robotics: robots can "read" a scene, "understand" language instructions, and directly output continuous actions, making notable progress on static grasping, placement, and tabletop manipulation tasks.

But a long-overlooked problem remains: the real world is almost never static. When objects start moving, accelerating, colliding, or changing trajectory, today's mainstream VLA models often react sluggishly, produce mismatched actions, or fail outright.

Paper: https://arxiv.org/abs/2601.22153
Project: https://haozhexie.com/project/dynamic-vla/
GitHub: https://github.com/hzxie/DynamicVLA

In static scenes, VLA models typically follow a standard pipeline. The problem is not that the models aren't smart; it's that they cannot keep up with time. Recently, a research team from NTU's S-Lab proposed DynamicVLA, the first work to systematically revisit and solve dynamic object manipulation (Dyn…
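The latency problem the article opens with, acting on observations that are hundreds of milliseconds stale, is commonly mitigated by extrapolating object state to the moment the action will actually land. The constant-velocity sketch below is a generic illustration of that idea, not DynamicVLA's mechanism.

```python
def compensate_latency(pos, vel, latency_s):
    # Constant-velocity extrapolation: aim at where the object WILL be when
    # the command takes effect, not where it was when the frame was captured.
    # (Generic latency compensation for illustration only.)
    return [p + v * latency_s for p, v in zip(pos, vel)]

# An object at (0.0, 1.0) m moving at (0.5, -0.5) m/s, with a 200 ms
# perception-to-action delay, should be intercepted near (0.1, 0.9).
predicted = compensate_latency([0.0, 1.0], [0.5, -0.5], 0.2)
```

Real systems replace the constant-velocity assumption with learned dynamics, but the principle is the same: the controller must target the future state, not the observed one.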
The keyword opening 2026: Self-Distillation, as large models truly move toward "continual learning"
机器之心· 2026-02-10 03:46
机器之心 Editorial Team

2026 has barely begun, and researchers in the large language model (LLM) field seem to have reached a tacit consensus. Open the most-watched recent papers on arXiv and one term keeps appearing: Self-Distillation.

In recent years, foundation models have achieved remarkable success, powering AI applications in language, vision, robotics, and beyond. But in real deployment and long-term use, researchers have gradually found that letting a model absorb new knowledge without losing its existing core capabilities, that is, "continual learning," is becoming the key bottleneck constraining large-model evolution.

The traditional strong-teacher paradigm, with its cost and data dependence, is ill-suited to high-frequency continual evolution. Self-Distillation has emerged as the way out: through suitable in-context guidance or feedback mechanisms, a model can construct a temporary self that is smarter than its current weights, achieving endogenous growth without an external strong teacher.

Building on this insight, a tight-knit academic circle spanning MIT, ETH Zurich, Meta, and Stanford released three studies in quick succession in January 2026.

1. Self-Distillation Enables Continual Learning
In continual learning, traditional supervised fine-tuning (SFT) often suffers from "catastrophic …
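The self-distillation recipe these papers share, keeping the updated model close to a snapshot of itself while fitting new data, can be sketched as a regularized loss. Here `alpha` and the plain KL penalty are illustrative choices, not any of the papers' exact formulations.

```python
import math

def kl_div(p, q):
    # KL(p || q) for discrete probability distributions.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def self_distill_loss(new_task_nll, student_probs, teacher_probs, alpha=0.5):
    # Total loss = fit the new data (negative log-likelihood) + stay close to a
    # frozen copy of yourself on old behavior. The "teacher" is the model's own
    # earlier snapshot, so no external strong teacher is needed.
    return new_task_nll + alpha * kl_div(teacher_probs, student_probs)
```

When the student's predictions on old inputs match the snapshot's, the penalty vanishes and only the new-task loss remains; drifting away from the snapshot (the precursor to catastrophic forgetting) raises the loss.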