机器之心 - filings, earnings calls, financial reports, news

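Is the fine-tuning paradigm for large models being upended again? New research from UIUC and Amazon suggests catastrophic forgetting in SFT may have been misunderstood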

机器之心· 2025-10-21 03:43

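In large-model fine-tuning practice, SFT (supervised fine-tuning) has become a near-standard part of the pipeline and is widely applied to downstream tasks and specialized scenarios. In medicine, for example, researchers often fine-tune a large model on domain-specific data to significantly improve its performance on tasks in that domain. This raises a question: does SFT make the model "forget" its original general capabilities? Many earlier studies report that domain fine-tuning improves specialized performance at the cost of significant degradation on general benchmarks such as mathematical reasoning, code generation, and instruction following. This phenomenon is widely called "catastrophic forgetting". That long-standing view may deserve a second look. A recent study by a team from UIUC, Amazon, UT Austin, and the University at Buffalo gives a different answer. It shows that domain-specific SFT does not always severely weaken a model's general capabilities. On the contrary, with a smaller learning rate during training, the model can strike a balance on two fronts: forgetting of general-task capabilities is greatly mitigated, while performance in the target domain remains comparable to that obtained with a large learning rate. In other words, forgetting may stem more from the choice of training strategy than from the SFT paradigm itself. Authors: Jiacheng Lin, Zhongruo Wang ...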

Amazon(US:AMZN)

Catastrophic forgetting

Token-adaptive loss reweighting (TALR)

Curriculum learning

Artificial Intelligence

SFT (supervised fine-tuning)

Anthropic has just launched Claude Code on the web

机器之心· 2025-10-21 00:15

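Early this morning, Anthropic launched "Claude Code on the web", a new way for users to delegate coding tasks directly from the browser. Blog post: https://www.anthropic.com/news/claude-code-on-the-web Claude Code on the web is currently in beta and is available as a research preview to Pro and Max users. Users can hand multiple coding tasks to Claude; the tasks run on Anthropic-managed cloud infrastructure, which suits bug backlogs, routine fixes, and parallel development work. For some users, Claude Code on the web is something they "urgently needed", and being able to delegate coding tasks directly from the browser is a key step toward efficient, smooth software development. Specifically, Claude Code on the web has three highlights. First, coding tasks run in parallel. On the web, users can start a coding session without opening a terminal: connect a GitHub repository, describe what you need, and Claude handles the implementation. Each session runs in its own isolated environment with real-time progress tracking. Users can also ...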

Efficient Software Development

Artificial Intelligence

Claude Code

Claude Code on the web

Goodbye to "lopsided" models: UniVid unifies video understanding and generation

机器之心· 2025-10-21 00:15

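In video generation and understanding, models usually specialize: some focus on generating video, others on understanding it (question answering, classification, retrieval, and so on). Recently, an open-source project, UniVid, proposed a "fusion" direction: combining understanding and generation in a single unified model that can both "understand video" and "generate video". It is like handing "recognizing objects in pictures" and "drawing pictures" to the same brain: understand a piece of text, understand existing video content, then "draw" a new, coherent video. This is technically very challenging. What problem does UniVid aim to solve? UniVid attempts to merge video "understanding" and "generation" into a truly general Unified Video Model, a multimodal video model that can both "understand" and "generate". Paper title: UniVid: The Open-Source Unified Video Model Paper: https://arxiv.org/abs/2509.24200 Key innovations: 1. Unified structure: Adapter-based Unified Architecture. In conventional approaches, the understanding model and the generation model are completely separate systems, which makes training expensive and interoperability difficult. Merging them requires retraining a huge ...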

Unified video understanding and generation

Unified video model

Artificial Intelligence

UniVid

ICCV 2025 | The first practical use of diffusion models to generate handwritten text lines, with impressive results and open-source code

机器之心· 2025-10-20 09:15

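Previously, the research team published two works on stylized handwritten text generation, "SDT" (CVPR 2023) and "One-DM" (ECCV 2024), both of which 机器之心 covered. One-DM can generate arbitrary text that closely matches a writer's style from a single handwriting sample. However, existing handwritten text generation work generally focuses on "character-level" generation, that is, producing only a single word or Chinese character; to generate an entire text line, several characters have to be stitched together. This is like writing characters on different sheets of paper, cutting each one out, and assembling them into a line. The approach easily leads to misaligned characters, some too high or too low, some too large or too small, so the line looks crooked and does not match how people actually write. So, if AI could write exactly like you, how would you feel? Eager to generate a font of your own, worried that signatures are no longer reliable, or regretful that the technology did not arrive early enough to do your homework for you ... Either way, handwriting imitation technology has matured. Now you only need to write a few characters on paper, and AI can accurately learn and imitate your handwriting to write any text. Using AI to imitate handwritten text not only faithfully reproduces the writer's style and makes it easy to create a personal font library, but also has broad application prospects in font design, handwriting verification, and other fields. Today's article introduces DiffBrush ...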

Impressive! DeepSeek just open-sourced a new model that compresses everything visually

机器之心· 2025-10-20 09:15

Core Insights
- DeepSeek has released a new OCR model, DeepSeek-OCR, which demonstrates the potential for nearly 10x lossless contextual compression through text-to-image methods [1][3]
- The model has a parameter count of 3 billion and has already seen over 100 downloads shortly after its release [1]
- The research team behind DeepSeek-OCR includes Haoran Wei, Yaofeng Sun, and Yukun Li, with Wei having previously developed the GOT-OCR2.0 system [1]

Model Architecture
- DeepSeek-OCR consists of two main components: DeepEncoder and the DeepSeek3B-MoE-A570M decoder [3][10]
- DeepEncoder is designed to maintain low activation states under high-resolution inputs while achieving high compression ratios, generating a moderate number of visual tokens [3][14]
- The model achieves an OCR accuracy of 97% when the number of text tokens is within 10 times the number of visual tokens, and maintains about 60% accuracy at a compression ratio of 20x [3][28]

Performance and Practical Applications
- In the OmniDocBench benchmark, DeepSeek-OCR outperformed GOT-OCR2.0 using only 100 visual tokens compared to 256 tokens for GOT-OCR2.0 [5]
- The model can generate over 200,000 pages of LLM/VLM training data daily on a single A100-40G GPU [5]
- DeepSeek-OCR shows strong practical capabilities, achieving superior performance compared to existing models like MinerU2.0 while using significantly fewer visual tokens [30][32]

Training and Data
- The training process for DeepSeek-OCR involves two main phases, utilizing a variety of OCR datasets and general visual data [21][24]
- The model was trained using 20 nodes, each equipped with 8 A100-40G GPUs, achieving a global batch size of 640 [25]
- The training speed reached 90 billion tokens per day for pure text data and 70 billion tokens per day for multimodal data [25]

Compression and Recognition Capabilities
- DeepSeek-OCR's method of using visual modalities as efficient compression media allows for significantly higher compression rates compared to traditional text representations [9][10]
- The model supports recognition of nearly 100 languages, showcasing its versatility in processing diverse document types [42]
- It can effectively parse complex layouts and extract structured data from charts, which is crucial for financial and scientific documents [35][40]
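
The compression ratio quoted in the summary above is the number of text tokens divided by the number of visual tokens that encode the same content. The sketch below only illustrates that arithmetic; the function names and the token counts in the usage note are hypothetical and do not come from the DeepSeek-OCR code base.

```python
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """Return text tokens per visual token (hypothetical helper)."""
    if text_tokens < 0 or vision_tokens <= 0:
        raise ValueError("text_tokens must be >= 0 and vision_tokens > 0")
    return text_tokens / vision_tokens


def within_10x_regime(text_tokens: int, vision_tokens: int) -> bool:
    """True when the ratio is at most 10x, the range for which the
    summary reports about 97% OCR accuracy."""
    return compression_ratio(text_tokens, vision_tokens) <= 10.0
```

For instance, a page with 1,000 text tokens rendered into 100 visual tokens has a ratio of 10 and falls inside that range, while the same page squeezed into 50 visual tokens has a ratio of 20, the point at which the summary reports about 60% accuracy.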

Vision-text compression

Long-context compression

Artificial Intelligence

DeepSeek-OCR

NeurIPS 2025 | CMU, Tsinghua, and UT Austin open-source ReinFlow, which fine-tunes robot flow-matching policies with online RL

机器之心· 2025-10-20 09:15

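About the authors: The first author, Tonghe Zhang, is a graduate student at the Robotics Institute of Carnegie Mellon University whose research focuses on large models for robot manipulation and whole-body control algorithms. His collaborator, Sichang Su, is a PhD student at the University of Texas at Austin working on reinforcement learning and generalist robot policies. The advisors are Prof. Chao Yu of Tsinghua University and Beijing Zhongguancun Academy and Prof. Yu Wang of Tsinghua University. Flow matching is one of the hottest topics in robot learning this year. As an elegant variant of diffusion models, it is simple and easy to use, has become a mainstream approach for low-level robot manipulation policies, and is widely used in advanced VLA models, including models from Physical Intelligence, LeRobot's SmolVLA, NVIDIA's GR00T, and Tsinghua University's recently released RDT2. Beyond increasing data diversity, reinforcement learning is a highly effective way to further strengthen open-source VLA models. A research team from Carnegie Mellon University, Tsinghua University, and the University of Texas at Austin has proposed ReinFlow, an online reinforcement learning framework for fine-tuning flow-matching policies. The work has been accepted to NeurIPS 2025, and the team has open-sourced a detailed reproduction tutorial, including code, trained weights, and training results. ...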

Breaking the FHE bottleneck: the Lancelot architecture performs robust aggregation on encrypted data, balancing privacy protection and robustness

机器之心· 2025-10-20 07:48

Core Insights
- The article discusses the integration of Fully Homomorphic Encryption (FHE) with Byzantine Robust Federated Learning (BRFL) through a new framework called Lancelot, which addresses privacy and efficiency challenges in sensitive applications like finance and healthcare [2][15].

Group 1: Framework Overview
- The Lancelot framework combines FHE and BRFL to enable robust aggregation calculations while maintaining data privacy [2][15].
- The framework effectively addresses the high computational costs associated with traditional FHE, particularly in complex operations like sorting and aggregation [2][15].

Group 2: Innovations in Encryption and Computation
- The introduction of Masked-based Encrypted Sorting allows for distance calculations and sorting of model parameters without decryption, overcoming a significant barrier in FHE applications [6][7].
- Lancelot optimizes FHE computation efficiency by improving ciphertext multiplication strategies and polynomial matrix operations, significantly reducing resource consumption [8][9].

Group 3: Hardware Optimization
- The framework includes hardware deployment optimizations that reduce unnecessary computational burdens, thereby accelerating the training process [9][10].
- Specific techniques such as Lazy Relinearization and Dynamic Hoisting enhance the overall throughput of the system, achieving training time reductions from hours to minutes [12][13].

Group 4: Practical Applications and Compliance
- Lancelot supports various federated robust aggregation algorithms and can integrate with differential privacy mechanisms, ensuring compliance with regulations like GDPR and HIPAA [15].
- Experimental results in medical scenarios demonstrate that Lancelot maintains diagnostic accuracy while preventing information leakage, establishing a foundation for trustworthy AI in healthcare [15].

Byzantine-robust federated learning (BRFL)

Privacy-preserving computation

Artificial Intelligence

Lancelot framework

Fully homomorphic encryption (FHE)

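AGILE: a new paradigm for visual learning! Self-supervision plus interactive reinforcement learning improves VLMs' perception and reasoning across the board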

机器之心· 2025-10-20 07:48

Core Insights
- Existing Vision-Language Models (VLMs) exhibit significant limitations in fine-grained visual information understanding and reasoning capabilities, which have not been fully activated [2]
- AGILE introduces a novel self-supervised learning paradigm that enhances VLMs' visual perception and reasoning through an interactive agent-based approach [2][22]

Methodology
- AGILE employs a "puzzle" task as an efficient agent task that combines perception and reasoning, structured as a controllable and verifiable interactive form [8]
- The training process consists of two phases: a Cold-Start phase using Gemini 2.5 Pro to generate 1.6K high-quality expert puzzle interaction trajectories, and a Reinforcement Learning phase training on 15.6K images using the GRPO algorithm [9][10]

Experimental Results
- In the simplest 2x2 puzzle task, AGILE improved accuracy from 9.5% to 82.8%, surpassing Gemini 2.5 Pro by 36.4 percentage points. In the more challenging 3x3 puzzle, accuracy increased from 0.4% to 20.8% [13]
- The model's performance was evaluated using two metrics: Acc (the proportion of puzzles in which all blocks are placed correctly) and Score (the proportion of correctly placed blocks) [13][14]

Generalization Capability
- After puzzle training, the model demonstrated an average improvement of 3.1% across nine general visual tasks, indicating strong generalization capabilities [15]

Scaling Experiments
- The study explored the impact of puzzle data scale on performance, revealing that as training data expanded from 0 to 16K, puzzle task accuracy increased from 22.0% to 82.8% [20]
- Replacing 10K of conventional QA data with puzzle data in a 20K sample led to better model performance, highlighting the potential of puzzle tasks in alleviating data scarcity in multi-modal reinforcement learning [20]
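
The two puzzle metrics named in the summary above can be sketched as follows. This is a minimal illustration under stated assumptions, not the AGILE evaluation code: it assumes Acc counts a puzzle as correct only when every block is in its target position, and that Score pools correctly placed blocks over all puzzles. The function name and the position encoding are hypothetical.

```python
from typing import Sequence, Tuple


def puzzle_metrics(
    predictions: Sequence[Sequence[int]],
    targets: Sequence[Sequence[int]],
) -> Tuple[float, float]:
    """Return (acc, score) for a batch of puzzles.

    Each puzzle is a sequence of block positions. Assumed definitions:
    acc   = fraction of puzzles with every block placed correctly
    score = fraction of all blocks placed correctly
    """
    if len(predictions) != len(targets) or not targets:
        raise ValueError("need the same non-zero number of puzzles")
    solved = 0
    correct_blocks = 0
    total_blocks = 0
    for pred, tgt in zip(predictions, targets):
        if len(pred) != len(tgt):
            raise ValueError("prediction and target sizes differ")
        hits = sum(p == t for p, t in zip(pred, tgt))
        correct_blocks += hits
        total_blocks += len(tgt)
        if hits == len(tgt):
            solved += 1
    return solved / len(targets), correct_blocks / total_blocks


# Two 2x2 puzzles: the first is solved, the second has two blocks swapped.
acc, score = puzzle_metrics(
    predictions=[[0, 1, 2, 3], [1, 0, 2, 3]],
    targets=[[0, 1, 2, 3], [0, 1, 2, 3]],
)
```

Under these assumptions Acc is the stricter metric: in the example one of two puzzles is fully solved (Acc 0.5) while six of eight blocks are in place (Score 0.75).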

Self-supervised learning paradigm

Interactive reinforcement learning

Artificial Intelligence

AGILE

Vision-language models (VLMs)

Gemini 2.5 Pro

Microsoft's BitDistill compresses LLMs to 1.58 bits: 10x memory savings and 2.65x faster CPU inference

机器之心· 2025-10-20 07:48

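Reported by 机器之心. Editors: +0, 陈陈. Large language models (LLMs) have not only played a key role in advancing general natural language processing; more importantly, they have become the core engine behind many downstream applications such as recommendation, classification, and retrieval. Despite this broad applicability, deploying LLMs efficiently for downstream tasks remains a major challenge. The challenge grows as model sizes increase sharply, especially on resource-constrained devices such as smartphones, where memory footprint and compute overhead become extremely expensive. As shown in Figure 1, applying 1.58-bit quantization-aware training (QAT) directly to an existing full-precision LLM often makes training on a specific downstream task unstable, fails to preserve the original performance, and scales poorly: as the model grows from 0.6B to 4B parameters, the performance gap relative to the full-precision baseline widens from 13.9 to 15.3. To address these problems, recent research has proposed extreme low-bit LLMs, such as BitNet, which uses a 1.58-bit (ternary {-1, 0, 1}) representation. This approach aims to significantly reduce memory footprint and speed up inference, offering a viable path to efficient LLM deployment in downstream applications. However, getting a 1.58-bit BitNe ...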
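
The "1.58 bits" in the summary above follows from the ternary value set: a weight restricted to {-1, 0, 1} carries log2(3) ≈ 1.58 bits of information. The sketch below illustrates this together with an absmean-style ternary quantizer of the kind commonly described for BitNet b1.58. The excerpt does not specify the quantizer, so the scaling rule, the epsilon guard, the FP16 baseline, and the function names are all assumptions, not BitDistill's implementation.

```python
import math
from typing import List, Tuple


def bits_per_ternary_weight() -> float:
    """Information content of one weight drawn from {-1, 0, 1}."""
    return math.log2(3)


def absmean_ternary_quantize(weights: List[float]) -> Tuple[List[int], float]:
    """Quantize weights to {-1, 0, 1} (assumed absmean scheme).

    Each weight is divided by the mean absolute value of the tensor,
    rounded to the nearest integer, and clipped to [-1, 1].
    """
    if not weights:
        raise ValueError("weights must be non-empty")
    scale = sum(abs(w) for w in weights) / len(weights)
    scale = max(scale, 1e-8)  # guard against an all-zero tensor
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale


quantized, scale = absmean_ternary_quantize([0.9, -0.05, -1.2, 0.3])
# Rough memory ratio against a 16-bit baseline (assumed FP16 weights).
memory_ratio = 16 / bits_per_ternary_weight()
```

With the sample weights the mean absolute value is 0.6125, so the quantized tensor is [1, 0, -1, 0]. The ratio of 16 bits to log2(3) bits is roughly 10, which is in line with the 10x memory saving quoted in the headline if the baseline stores weights in 16 bits.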

Behind Xiaohongshu's (小红书) RecSys 2025 Best Paper nomination: cracking the video watch-time prediction problem

机器之心· 2025-10-20 04:50

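Reported by 机器之心. Editor: Panda. Recently, an anecdote circulated on social media: while the Nobel Committee was still struggling to locate a newly announced laureate in Physiology or Medicine, a Xiaohongshu (小红书) user had apparently already run into him in the Rocky Mountains in the United States and chatted with him. This story of "letting the world find you first" once again drew attention to Xiaohongshu. It really is a community that keeps creating remarkable connections! Such "remarkable connections" are no accident, as we know well from working in tech media. We first learn of many key figures' updates and top-conference awards in AI from Xiaohongshu, and it is the platform's powerful recommender system that ensures this information reaches us precisely. That same recommendation engine, so important to our work, recently shone on the world stage. At RecSys 2025, the top recommender-systems conference that recently concluded in Prague, a paper from Xiaohongshu's recommendation algorithm team, "Multi-Granularity Distribution Modeling for Video Watch Time Prediction via Exponential-Gaussian Mixture Network", drew focused attention and lively discussion from the engineers and experts on site. The paper ultimately stood out from hundreds of top submissions worldwide and earned one of only five "Best Paper ...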