机器之心
Impressive! DeepSeek has just open-sourced a new model that compresses everything visually
机器之心· 2025-10-20 09:15
Core Insights
- DeepSeek has released a new OCR model, DeepSeek-OCR, which demonstrates that representing text as images can give nearly 10x near-lossless context compression [1][3]
- The model has 3 billion parameters and had already seen over 100 downloads shortly after its release [1]
- The research team behind DeepSeek-OCR includes Haoran Wei, Yaofeng Sun, and Yukun Li; Wei previously developed the GOT-OCR2.0 system [1]

Model Architecture
- DeepSeek-OCR consists of two main components: DeepEncoder and the DeepSeek3B-MoE-A570M decoder [3][10]
- DeepEncoder is designed to keep activations low under high-resolution inputs while achieving high compression ratios, so that it emits a moderate number of visual tokens [3][14]
- The model reaches 97% OCR accuracy when the number of text tokens is within 10 times the number of visual tokens (a compression ratio below 10x), and still maintains about 60% accuracy at a compression ratio of 20x [3][28]

Performance and Practical Applications
- On the OmniDocBench benchmark, DeepSeek-OCR outperformed GOT-OCR2.0 using only 100 visual tokens, compared with 256 tokens for GOT-OCR2.0 [5]
- The model can generate over 200,000 pages of LLM/VLM training data per day on a single A100-40G GPU [5]
- DeepSeek-OCR shows strong practical capabilities, outperforming existing models such as MinerU2.0 while using significantly fewer visual tokens [30][32]

Training and Data
- Training proceeds in two main phases and uses a variety of OCR datasets together with general visual data [21][24]
- The model was trained on 20 nodes, each equipped with 8 A100-40G GPUs, with a global batch size of 640 [25]
- Training throughput reached 90 billion tokens per day for pure text data and 70 billion tokens per day for multimodal data [25]

Compression and Recognition Capabilities
- Using the visual modality as a compression medium allows significantly higher compression rates than traditional text representations [9][10]
- The model supports recognition of nearly 100 languages, which makes it suitable for a wide range of document types [42]
- It can parse complex layouts and extract structured data from charts, which is crucial for financial and scientific documents [35][40]
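The compression ratio quoted above can be read as the number of text tokens divided by the number of visual tokens that represent them. The following is a minimal sketch of that arithmetic; the function name and the example token counts are hypothetical and are not taken from the DeepSeek-OCR code.

```python
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """Text tokens represented per visual token (hypothetical helper)."""
    if vision_tokens <= 0:
        raise ValueError("vision_tokens must be positive")
    return text_tokens / vision_tokens


# Hypothetical page: 1,000 text tokens encoded into 100 visual tokens.
ratio_10x = compression_ratio(1000, 100)

# Hypothetical denser page: 2,000 text tokens in the same 100 visual tokens.
ratio_20x = compression_ratio(2000, 100)

# The summary reports about 97% OCR accuracy within 10x and about 60% at 20x.
in_high_accuracy_regime = ratio_10x <= 10
```

Under this reading, the first page sits at the edge of the high-accuracy regime, while the second sits at the 20x point where the summary reports accuracy dropping to about 60%.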
NeurIPS 2025 | CMU, Tsinghua, and UT Austin open-source ReinFlow, fine-tuning robot flow matching policies with online RL
机器之心· 2025-10-20 09:15
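About the authors: The first author is Tonghe Zhang, a graduate student at the Robotics Institute of Carnegie Mellon University, whose research focuses on large models for robot manipulation and whole-body control algorithms. The co-author is Sichang Su, a PhD student at the University of Texas at Austin, who works on reinforcement learning and generalist robot policies. The advisors are Professor Chao Yu of Tsinghua University and Beijing Zhongguancun Academy, and Professor Yu Wang of Tsinghua University.

Flow matching has undoubtedly been one of the hottest topics in robot learning this year. As an elegant variant of diffusion models, it is simple and easy to use, which has made it the mainstream approach for low-level robot manipulation policies and led to wide adoption in advanced VLA models, including those from Physical Intelligence, LeRobot's SmolVLA, NVIDIA's GR00T, and RDT2, recently released by Tsinghua University.

Beyond increasing data diversity, reinforcement learning is a highly effective way to further strengthen open-source VLA models. A research team from Carnegie Mellon University, Tsinghua University, and the University of Texas at Austin has proposed ReinFlow, an online reinforcement learning framework for fine-tuning flow matching policies. The work has been accepted to NeurIPS 2025, and the team has open-sourced a detailed reproduction tutorial, including code, trained weights, and training results. ...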
Breaking the FHE bottleneck: the Lancelot architecture performs robust aggregation on encrypted data, balancing privacy protection and robustness
机器之心· 2025-10-20 07:48
Core Insights
- The article discusses the integration of Fully Homomorphic Encryption (FHE) with Byzantine Robust Federated Learning (BRFL) through a new framework called Lancelot, which addresses privacy and efficiency challenges in sensitive applications such as finance and healthcare [2][15].

Group 1: Framework Overview
- Lancelot combines FHE and BRFL to enable robust aggregation while keeping data private [2][15].
- The framework addresses the high computational cost of traditional FHE, particularly in complex operations such as sorting and aggregation [2][15].

Group 2: Innovations in Encryption and Computation
- Mask-based Encrypted Sorting allows distances between model parameters to be computed and sorted without decryption, overcoming a significant barrier in FHE applications [6][7].
- Lancelot improves FHE computation efficiency by optimizing ciphertext multiplication strategies and polynomial matrix operations, significantly reducing resource consumption [8][9].

Group 3: Hardware Optimization
- The framework includes hardware deployment optimizations that remove unnecessary computation and thereby accelerate training [9][10].
- Techniques such as Lazy Relinearization and Dynamic Hoisting raise overall system throughput, cutting training time from hours to minutes [12][13].

Group 4: Practical Applications and Compliance
- Lancelot supports various federated robust aggregation algorithms and can integrate with differential privacy mechanisms, supporting compliance with regulations such as GDPR and HIPAA [15].
- Experimental results in medical scenarios show that Lancelot maintains diagnostic accuracy while preventing information leakage, laying a foundation for trustworthy AI in healthcare [15].
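To make the "distance calculation and sorting" step concrete, here is a plaintext sketch of one well-known distance-based robust aggregation rule, Krum. The summary does not say which rules Lancelot implements, so the choice of Krum is an assumption for illustration, and the arithmetic below runs on ordinary floats rather than on ciphertexts as an FHE system would require.

```python
from typing import List

Vector = List[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two client updates."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def krum_select(updates: List[Vector], num_byzantine: int) -> int:
    """Index of the update with the smallest Krum score (plaintext sketch).

    Each update is scored by the sum of squared distances to its
    n - f - 2 nearest neighbours; outliers get large scores.
    """
    n = len(updates)
    k = n - num_byzantine - 2
    if k < 1:
        raise ValueError("need more than num_byzantine + 2 updates")
    scores = []
    for i in range(n):
        dists = sorted(
            squared_distance(updates[i], updates[j]) for j in range(n) if j != i
        )
        scores.append(sum(dists[:k]))  # sorting is the costly step under FHE
    return min(range(n), key=scores.__getitem__)


# Hypothetical round: three honest clients and one outlier.
updates = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [10.0, -10.0]]
chosen = krum_select(updates, num_byzantine=1)
```

The sort inside the scoring loop is the kind of comparison-heavy operation that is expensive on encrypted data, which is the bottleneck the summary attributes to traditional FHE.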
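AGILE: a new paradigm for visual learning! Self-supervision plus interactive reinforcement learning boosts VLM perception and reasoning across the board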
机器之心· 2025-10-20 07:48
Core Insights
- Existing Vision-Language Models (VLMs) show significant limitations in fine-grained visual understanding and reasoning, capabilities that have not yet been fully activated [2]
- AGILE introduces a self-supervised learning paradigm that improves VLMs' visual perception and reasoning through an interactive, agent-based approach [2][22]

Methodology
- AGILE uses a "puzzle" task as an efficient agent task that combines perception and reasoning, structured as a controllable and verifiable form of interaction [8]
- Training has two phases: a Cold-Start phase that uses Gemini 2.5 Pro to generate 1.6K high-quality expert puzzle interaction trajectories, and a Reinforcement Learning phase that trains on 15.6K images with the GRPO algorithm [9][10]

Experimental Results
- On the simplest 2x2 puzzle task, AGILE improved accuracy from 9.5% to 82.8%, surpassing Gemini 2.5 Pro by 36.4 percentage points. On the more challenging 3x3 puzzle, accuracy increased from 0.4% to 20.8% [13]
- Performance was evaluated with two metrics: Acc (the proportion of puzzles in which all blocks are placed correctly) and Score (the proportion of individual blocks placed correctly) [13][14]

Generalization Capability
- After puzzle training, the model improved by an average of 3.1% across nine general visual tasks, indicating strong generalization [15]

Scaling Experiments
- The study examined how the amount of puzzle data affects performance: as training data grew from 0 to 16K, puzzle task accuracy rose from 22.0% to 82.8% [20]
- Replacing 10K of conventional QA data with puzzle data in a 20K-sample set led to better model performance, highlighting the potential of puzzle tasks to alleviate data scarcity in multi-modal reinforcement learning [20]
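The two puzzle metrics can be illustrated with a short sketch. It reads Acc as the share of puzzles solved completely and Score as the share of individual blocks placed correctly, which is an interpretation of the summary's wording; the function name and the sample data are hypothetical.

```python
def puzzle_metrics(predictions, targets):
    """Return (acc, score) for lists of per-puzzle block orderings."""
    if not targets or len(predictions) != len(targets):
        raise ValueError("predictions and targets must be non-empty and aligned")
    solved = 0
    correct_blocks = 0
    total_blocks = 0
    for pred, gold in zip(predictions, targets):
        matches = sum(p == g for p, g in zip(pred, gold))
        correct_blocks += matches
        total_blocks += len(gold)
        if matches == len(gold):
            solved += 1  # every block of this puzzle is in the right place
    return solved / len(targets), correct_blocks / total_blocks


# Two hypothetical 2x2 puzzles: one solved, one with two blocks swapped.
golds = [[0, 1, 2, 3], [0, 1, 2, 3]]
preds = [[0, 1, 2, 3], [0, 1, 3, 2]]
acc, score = puzzle_metrics(preds, golds)  # acc = 0.5, score = 0.75
```

Acc is the stricter metric: the second puzzle contributes nothing to it even though half of its blocks are correct, which is why Score is reported alongside it.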
Microsoft's BitDistill compresses LLMs to 1.58 bits: 10x memory savings and 2.65x faster CPU inference
机器之心· 2025-10-20 07:48
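Report by 机器之心
Editors: +0, 陈陈

Large language models (LLMs) have not only played a key role in advancing general natural language processing; more importantly, they have become the core engine behind many downstream applications such as recommendation, classification, and retrieval. Despite their broad applicability, deploying LLMs efficiently for downstream tasks still poses major challenges. These challenges are amplified as model sizes grow sharply, especially on resource-constrained devices such as smartphones, where both memory footprint and compute overhead become extremely expensive.

As shown in Figure 1, when 1.58-bit quantization-aware training (QAT) is applied directly to an existing full-precision LLM, training on a specific downstream task is often unstable, struggles to preserve the original performance, and scales poorly: as model size increases from 0.6B to 4B, the performance gap relative to the full-precision baseline widens from 13.9 to 15.3.

To address these problems, recent research has proposed extreme low-bit LLMs, such as BitNet, which uses a 1.58-bit representation (that is, ternary values {-1, 0, 1}). This approach aims to significantly reduce memory footprint and accelerate inference, offering a viable path to efficient deployment of LLMs in downstream applications.

However, for a 1.58-bit BitNe ...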
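The "1.58-bit" figure comes from the information content of a three-valued weight, log2(3) ≈ 1.585 bits. The sketch below shows one way to map weights to {-1, 0, 1}, assuming absmean scaling (scale by the mean absolute weight, round, then clip); the excerpt does not describe the quantizer, so that choice, the function name, and the sample weights are assumptions for illustration.

```python
import math


def ternary_quantize(weights, eps=1e-8):
    """Map weights to {-1, 0, 1} using an assumed absmean scale."""
    scale = max(sum(abs(w) for w in weights) / len(weights), eps)
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale


# Hypothetical weight row.
weights = [0.9, -0.05, 0.4, -1.2]
quantized, scale = ternary_quantize(weights)  # quantized == [1, 0, 1, -1]

# Information content of one ternary weight.
bits_per_weight = math.log2(3)

# Rough memory ratio if the baseline stores 16 bits per weight (an assumption).
memory_ratio = 16 / bits_per_weight
```

With a 16-bit baseline the ratio works out to roughly 10x, in line with the memory saving quoted in the headline; the actual saving depends on how the ternary values are packed and on which tensors remain in higher precision.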
Behind Xiaohongshu's RecSys 2025 Best Paper nomination: cracking the video watch-time prediction problem
机器之心· 2025-10-20 04:50
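Report by 机器之心
Editor: Panda

An anecdote has recently been circulating on social media: while the Nobel Committee was still struggling to reach a newly announced laureate in Physiology or Medicine, a Xiaohongshu user had apparently already run into him in the Rocky Mountains in the United States and chatted with him. This story of "letting the world find you one step sooner" has once again drawn attention to Xiaohongshu, a community that always seems to create magical connections.

These "magical connections" are no accident, and as a technology media outlet we know this firsthand. We learn about the activities of many key figures in AI, and about awards at top conferences, on Xiaohongshu before anywhere else. Its powerful recommendation system is what ensures that this information reaches us so precisely.

That same recommendation engine, so important to our work, recently shone on the world stage.

At RecSys 2025, the top recommender systems conference that recently concluded in Prague, the paper "Multi-Granularity Distribution Modeling for Video Watch Time Prediction via Exponential-Gaussian Mixture Network" from Xiaohongshu's recommendation algorithm team drew close attention and lively discussion from engineers and experts on site. Selected from hundreds of top papers worldwide, it was one of only five to receive a "Best Paper ...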
Lightweight, efficient, plug-and-play: Video-RAG brings a new paradigm to long video understanding
机器之心· 2025-10-20 04:50
Core Insights
- The article discusses the challenges that existing large vision-language models (LVLMs) face in understanding long, complex videos, including context length limits, difficulty with cross-modal alignment, and high computational cost [2][5]
- Researchers from Xiamen University, the University of Rochester, and Nanjing University have proposed a new framework, Video-RAG, which offers a lightweight and efficient solution for long video understanding without requiring model fine-tuning [2][21]

Challenges
- Current mainstream methods fall into two categories, both of which struggle with visual-semantic alignment over long time spans and often trade efficiency for accuracy, which makes them impractical and hard to scale [5][6]
- Existing approaches such as LongVA and VideoAgent either rely on large-scale data for fine-tuning or incur high costs from frequent calls to commercial APIs [6]

Innovations
- Video-RAG uses retrieval to bridge visual and language understanding, applying a Retrieval-Augmented Generation (RAG) method that depends on neither model fine-tuning nor expensive commercial models [9][21]
- The core idea is to extract text cues from the video that are strongly aligned with its visual content, then retrieve the relevant cues and inject them into the input stream of an existing LVLM as additional semantic guidance [9]

Process Overview
1. **Query Decoupling**: User queries are automatically decomposed into multiple retrieval requests, so that the system can search the different modal databases for relevant information while significantly reducing the initial computational load [10]
2. **Multi-modal Text Construction and Retrieval**: Three semantically aligned databases are built with open-source tools, ensuring that the retrieved texts are synchronized with the visuals and carry clear semantic labels [11]
3. **Information Fusion and Response Generation**: The retrieved text segments, the original query, and a few key video frames are fed to an existing LVLM for final inference, with no model fine-tuning, which lowers deployment barriers and computational cost [12]

Technical Components
- **OCR Text Library**: Uses EasyOCR to extract text from frames, combined with Contriever encoding and FAISS vector indexing for fast retrieval [13]
- **Speech Transcription Library (ASR)**: Uses the Whisper model to extract and embed audio content [13]
- **Object Semantic Library (DET)**: Uses the APE model to detect objects and their spatial relationships in key frames and generates structured descriptive text [13]

Performance and Advantages
- After retrieval, Video-RAG lets LVLMs focus on the relevant visual information, which narrows the modality gap; the framework is described as lightweight, efficient, and high-performing [15]
- The framework is plug-and-play and compatible with any open-source LVLM, requiring no changes to model architecture and no retraining [16]
- In benchmark tests, Video-RAG combined with a 72B-parameter open-source LVLM outperformed commercial closed-source models such as GPT-4o and Gemini 1.5 [18]

Outcomes and Significance
- The success of Video-RAG supports an important direction for improving cross-modal understanding: introducing high-quality, visually aligned auxiliary text to work around context window limits [21]
- The framework mitigates "hallucination" and "attention dispersion" in long video understanding and establishes a low-cost, highly scalable paradigm applicable to real-world scenarios such as education, security, and medical imaging analysis [21]
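The retrieve-then-inject flow described above can be sketched with a toy example. This is not the Video-RAG implementation: a bag-of-words cosine similarity stands in for Contriever embeddings and a FAISS index, the three library entries are invented, and all function names are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for a learned text encoder."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, store, top_k=2):
    """Return the top_k (source, text) entries most similar to the query."""
    q = embed(query)
    ranked = sorted(store, key=lambda entry: cosine(q, embed(entry[1])), reverse=True)
    return ranked[:top_k]


def build_prompt(question, store, top_k=2):
    """Prepend retrieved auxiliary text, tagged by source, to the question."""
    lines = [f"[{source}] {text}" for source, text in retrieve(question, store, top_k)]
    return "\n".join(lines) + f"\nQuestion: {question}"


# Invented auxiliary texts, one per library named in the summary.
store = [
    ("OCR", "slide title quarterly revenue report"),
    ("ASR", "the speaker welcomes the audience"),
    ("DET", "a dog on the left of a red car"),
]
prompt = build_prompt("what is left of the red car", store, top_k=1)
```

In this toy run the object-description entry ranks first for a spatial question, so only that text is prepended to the prompt. In the framework as summarized, the resulting text would be passed to the LVLM together with a few key frames.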
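Not enough hard Codeforces problems to practice on? Saining Xie and colleagues built an AI problem setter that generates original programming problems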
机器之心· 2025-10-20 04:50
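As large language models (LLMs) advance toward general capability, with artificial general intelligence (AGI) as the ultimate goal, testing their ability to generate problems is becoming increasingly important. This is especially true when LLMs are applied to advanced programming tasks, because the future development and economic integration of LLM coding ability will require a great deal of verification work.

First, setting problems for programming competitions requires a deeper algorithmic understanding than solving them.

For example, basic problems can often be reduced to recognizable templates and solved with simple tricks; many standard programming problems also tend to accept partially correct or boilerplate solutions, which can mask flawed reasoning. Competitive programming problems, by contrast, are held to strict standards and are designed to assess a deeper understanding of underlying algorithm design principles, data structures, and complexity trade-offs. Verifying the vast number of possible solutions while adequately covering shortcuts and edge cases is extremely challenging, yet it is required for competitive programming problems. Problem setting therefore involves all the challenges of problem solving and goes beyond them.

Second, better problem-setting ability will lead to more rigorous competitive programming benchmarks. Because the official test data of top platforms such as Codeforces and AtCoder is not public, researchers currently rely on synthetic datasets such as CodeContests+, TACO, and HardTests.

However, analysis shows that existing test datasets can suffer from both a high false positive rate (FPR) and a high false negative rate (F ...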
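The two error rates can be made concrete with a small sketch. The excerpt is cut off before it defines them, so the definitions here are assumptions: a false positive is an incorrect solution that the test data accepts, and a false negative is a correct solution that the test data rejects. The function name and the sample verdicts are hypothetical.

```python
def error_rates(results):
    """Return (fpr, fnr) from (is_truly_correct, passed_tests) pairs."""
    wrong = [passed for correct, passed in results if not correct]
    right = [passed for correct, passed in results if correct]
    if not wrong or not right:
        raise ValueError("need at least one correct and one incorrect solution")
    fpr = sum(wrong) / len(wrong)  # wrong solutions that slipped through
    fnr = sum(not passed for passed in right) / len(right)  # correct ones rejected
    return fpr, fnr


# Hypothetical verdicts for five submissions against one generated test set.
results = [
    (True, True),    # correct and accepted
    (True, False),   # correct but rejected (false negative)
    (False, True),   # wrong but accepted (false positive)
    (False, False),  # wrong and rejected
    (False, False),  # wrong and rejected
]
fpr, fnr = error_rates(results)
```

Here one of three wrong submissions passes and one of two correct submissions fails. Weak tests inflate the first rate, while overly strict or invalid tests inflate the second.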
SIGGRAPH Asia 2025 | The OmniPart framework makes 3D content creation as simple as snapping building blocks together
机器之心· 2025-10-20 04:50
Core Viewpoint
- The article introduces OmniPart, a framework for part-aware 3D generation that addresses the challenge of creating, editing, and combining 3D object components, improving the quality and efficiency of 3D content creation [2][23].

Summary by Sections

Introduction
- Researchers from the University of Hong Kong, VAST, Harbin Institute of Technology, and Zhejiang University developed OmniPart, which has been accepted for presentation at SIGGRAPH Asia 2025 [2].

Methodology
- OmniPart uses a two-stage "planning-generation" strategy that decouples the complex generation task into controllable structure planning and spatially conditioned part synthesis [8][10].

First Stage: Structure Planning
- The first stage plans the component layout of the 3D object with an autoregressive Transformer that predicts bounding boxes from 2D images. Users can control the decomposition granularity through flexible 2D part masks [10][11].

Second Stage: Part Generation
- The second stage generates high-quality 3D parts from the spatial blueprint produced in the first stage. It fine-tunes a pre-trained 3D generator (TRELLIS) efficiently, ensuring high consistency among parts [12][13].

Experimental Results
- OmniPart produces higher-quality results than existing methods such as Part123 and PartGen, excelling in geometric detail, semantic accuracy, and structural consistency [14][16].
- OmniPart is also markedly more efficient, completing end-to-end generation in approximately 0.75 minutes, compared with 15 minutes for Part123 and 5 minutes for PartGen [16].

Applications
- OmniPart supports various downstream applications, including mask-controlled generation, multi-granularity generation, material editing, and geometry processing, which broadens the editing and customization options for 3D content [18][20][21].

Conclusion
- The OmniPart framework sets a new benchmark in quality and efficiency for part-level 3D content generation, paving the way for advances in game development, animation, and virtual reality [23].
Better performance without retraining! An HKU team proposes the GPC framework for robot "policy composition"
机器之心· 2025-10-19 09:17
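The first author, 曹嘉航, is a PhD student at the University of Hong Kong and a former intern at the Beijing Humanoid Robot Innovation Center; the co-first author, 黄翊泽, is an undergraduate at Shanghai Jiao Tong University; the corresponding advisor, Andrew F. Luo, is an assistant professor at the University of Hong Kong.

In robot learning, improving the performance of control policies based on generative models usually means paying a heavy price in additional data collection and model training, which greatly limits how quickly robot capabilities can be iterated and upgraded. When model performance hits a bottleneck, how can the potential of existing policies be further tapped and strengthened without adding to the training burden?

The University of Hong Kong team proposes the GPC (General Policy Composition) framework, a new training-free solution to this challenge. By performing "policy composition" over multiple pre-trained models at test time, the framework can produce a "composed policy" whose performance exceeds that of any single parent policy.

As a plug-and-play general framework, GPC can flexibly combine robot policies with different architectures (such as diffusion-based and flow-based policies) and different modalities (such as vision-action (VA) and vision-language-action (VLA) models), removing the dependence of traditional performance gains on data and compute.

Paper title: Compose Your Poli ...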