RewardMap: Solving Sparse Rewards in Fine-Grained Visual Reasoning via Multi-Stage Reinforcement Learning
机器之心· 2025-10-21 03:43
Core Insights
- The article discusses the development of RewardMap, a multi-stage reinforcement learning framework designed to enhance the fine-grained visual reasoning capabilities of multi-modal large language models (MLLMs) in complex scenarios such as high-resolution subway maps [3][9][17].

Group 1: Problem Identification
- Recent advancements in large language models (LLMs) and multi-modal large models (MLLMs) have raised questions about their ability to interpret complex visual information, particularly in high-resolution, densely structured environments [3].
- The team's earlier work, ReasonMap, revealed that even state-of-the-art MLLMs frequently make path-planning errors such as misreading lines, missing stations, and repeating routes [3][12].

Group 2: Proposed Solution
- The team introduced RewardMap, a multi-stage reinforcement learning framework that combines fine-grained rewards with curriculum-based training to improve MLLMs' visual understanding and spatial reasoning [3][10].
- RewardMap breaks complex route-planning tasks into smaller, assessable sub-goals, enabling a more nuanced feedback mechanism than a binary correct/incorrect signal [10][11].

Group 3: Implementation Details
- RewardMap builds on ReasonMap and includes a dataset covering 30 cities with 4,018 problem samples, categorized into five types to provide detailed supervision during the reinforcement learning phase [6][12].
- The reward function consists of three components: format compliance, final correctness, and detail, with difficulty weights applied to reflect the true complexity of each task (a minimal sketch follows this summary) [11][12].

Group 4: Performance Results
- RewardMap delivered consistent improvements across benchmarks, with a maximum gain of 13.51% on SpatialEval compared to traditional methods [13][14].
- Qualitative comparisons showed that models trained with RewardMap exhibited fewer visual confusions and hallucinations and provided more accurate route information [14][15].

Group 5: Future Outlook
- The value of RewardMap extends beyond performance metrics: it offers a reusable reinforcement learning paradigm for high-resolution visual tasks by systematically breaking complex problems into measurable sub-goals [17].
- The framework's effectiveness in enhancing the general capabilities of multi-modal large models has been validated, suggesting that real-world data such as maps will play a significant role in future developments [18].
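To make the reward decomposition concrete, here is a minimal Python sketch of a RewardMap-style composite reward. The tag-based answer format, substring sub-goal checks, and scalar difficulty weight are illustrative assumptions, not the paper's exact formulation.

```python
def composite_reward(response: str, answer: str, subgoals: list[str],
                     difficulty: float) -> float:
    # Format compliance: did the model wrap its answer as instructed?
    r_format = 1.0 if "<answer>" in response and "</answer>" in response else 0.0
    # Final correctness: exact match on the extracted answer span.
    extracted = response.split("<answer>")[-1].split("</answer>")[0].strip()
    r_correct = 1.0 if extracted == answer else 0.0
    # Detail reward: fraction of verifiable sub-goals (e.g., required
    # transfer stations or line names) mentioned in the response.
    r_detail = sum(g in response for g in subgoals) / len(subgoals) if subgoals else 0.0
    # Difficulty weight: harder maps and routes earn proportionally more.
    return r_format + difficulty * (r_correct + r_detail)

# Example: a fully correct route earns format + weighted correctness + detail.
resp = ("Take Line 2 to Renmin Square, then transfer to Line 1. "
        "<answer>Line 2 -> Line 1</answer>")
print(composite_reward(resp, "Line 2 -> Line 1",
                       ["Line 2", "Line 1", "Renmin Square"], difficulty=1.5))  # -> 4.0
```

The point of the detail term is that a partially correct route still receives graded credit instead of a flat zero, which is what densifies the otherwise sparse reward.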
Is the Fine-Tuning Paradigm for Large Models Being Overturned Again? New UIUC and Amazon Research Suggests SFT's Catastrophic Forgetting Problem May Be Misunderstood
机器之心· 2025-10-21 03:43
Core Insights
- The article discusses the impact of supervised fine-tuning (SFT) on the general capabilities of large language models (LLMs), suggesting that SFT does not necessarily cause a significant decline in general performance when a smaller learning rate is used [2][34].
- The research challenges the long-held belief that domain-specific fine-tuning inevitably causes catastrophic forgetting of general capabilities, proposing instead that the choice of training strategy plays the crucial role [2][34].

Experiment Details
- The study used two domain-specific datasets, MedCalc and ESCI, which represent scenarios where open-source LLMs typically perform poorly, making them ideal testbeds for domain-specific SFT [5].
- Several open-source LLMs were selected for experimentation, including Qwen3-8B and Gemma3-4B, with a focus on controlling the learning rate during SFT [6].

Findings
- Finding 1: A smaller learning rate (e.g., 1e-6) lets models maintain strong performance in the target domain while greatly reducing the decline in general capabilities [11].
- Finding 2: For classification tasks where the training objective includes only the final label, a wider range of learning rates achieves ideal performance, as seen on the ESCI dataset [12][14].

Theoretical Analysis
- The team's analysis concludes that smaller learning rates effectively limit the decline in general performance, consistent with the experimental findings [17].
- It also shows that when training targets include only final labels, the number of "hard tokens" encountered decreases, widening the acceptable learning-rate range [17].

Token Adaptive Loss Reweighting (TALR)
- TALR dynamically adjusts the loss weight of each token based on its prediction probability, reducing the influence of hard tokens during training (a minimal sketch follows this summary) [20].
- Token weights are updated in real time, so the model's own confidence guides the training process [21].

Experimental Results
- In comparisons with other strategies for mitigating catastrophic forgetting, TALR performed best, especially at higher learning rates, preserving domain gains while minimizing losses in general performance [26][27].

Conclusion and Future Directions
- The research underscores the continued importance of SFT for enhancing LLM capabilities; while smaller learning rates and TALR are effective, more robust strategies against forgetting remain to be explored [34][35].
- Future work should focus on balancing domain-specific performance with general capabilities, particularly in specialized fields like medicine, where retaining foundational knowledge is crucial [35].
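Below is a minimal PyTorch sketch of the TALR idea as described above: down-weight "hard" (low-probability) tokens so they do not dominate the gradient. The specific weighting rule used here, each token's detached predicted probability, is an assumption; the paper's exact schedule may differ.

```python
import torch
import torch.nn.functional as F

def talr_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # Per-token log-probabilities of the target tokens.
    log_probs = F.log_softmax(logits, dim=-1)                    # (batch, seq, vocab)
    tgt_logp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Weight each token by the model's (detached) confidence in it, so
    # low-probability "hard" tokens contribute less to the gradient.
    weights = tgt_logp.exp().detach()
    return -(weights * tgt_logp).mean()

# Toy usage: a batch of 2 sequences, length 5, vocabulary of 100.
logits = torch.randn(2, 5, 100)
targets = torch.randint(0, 100, (2, 5))
print(talr_loss(logits, targets))
```

Because the weights are recomputed from the current logits at every step, the reweighting adapts in real time as the model's confidence changes, matching the behavior described in the summary.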
After Months of Being Fed Junk Tweets, Large Models Get "Brain Rot", and the Condition Can't Be Cured
机器之心· 2025-10-21 03:43
Core Viewpoint
- The article discusses a study indicating that large language models (LLMs) can experience cognitive decline, dubbed "brain rot," from prolonged exposure to low-quality internet content, mirroring the effect observed in humans [4][10][12].

Group 1: Research Findings
- The study, conducted by Texas A&M University, the University of Texas at Austin, and Purdue University, demonstrates that LLMs suffer cognitive degradation when continually trained on viral Twitter data characterized by short, engagement-optimized posts [4][6].
- Cognitive functions such as reasoning and long-term memory declined significantly: reasoning ability fell by 23% and long-term memory by 30% after exposure to low-quality data [14][15].
- The research formalizes a "brain rot hypothesis": continuous exposure to poor-quality text leads to a sustained decline in the cognitive abilities of LLMs [12][29].

Group 2: Experimental Methodology
- The researchers ran a controlled experiment on real Twitter data, constructing datasets selected by engagement (M1) and by semantic quality (M2) to isolate the impact of low-quality content (a sketch of the two selection rules follows this summary) [13][20].
- M1 targets the popularity and brevity of posts, while M2 scores how sensationalist or superficial the content is; both selections showed a negative correlation between data quality and cognitive performance [13][22].

Group 3: Implications and Recommendations
- The findings argue for regular "cognitive health checks" on deployed LLMs and underline the importance of data quality for maintaining their capabilities [17][29].
- The damage from low-quality data is not easily undone by standard fine-tuning techniques, pointing to a need for better data-curation practices [29].
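A small sketch of how the two selection rules might be operationalized. The field names, thresholds, and quality scorer are hypothetical; the study's exact criteria are not given in this summary.

```python
def split_m1(tweets: list[dict], popularity_cut: int = 500, max_len: int = 30):
    """M1 (engagement): short, highly viral posts form the 'junk' set."""
    def is_junk(t):
        return (t["likes"] + t["retweets"] >= popularity_cut
                and len(t["text"].split()) <= max_len)
    return ([t for t in tweets if is_junk(t)],
            [t for t in tweets if not is_junk(t)])

def split_m2(tweets: list[dict], quality_score):
    """M2 (semantic quality): a scorer flags sensationalist or superficial text."""
    return ([t for t in tweets if quality_score(t["text"]) < 0.5],
            [t for t in tweets if quality_score(t["text"]) >= 0.5])

# Toy usage of the engagement split.
tweets = [{"likes": 900, "retweets": 40, "text": "YOU WON'T BELIEVE THIS"},
          {"likes": 3, "retweets": 0, "text": "Notes on sparse attention kernels"}]
junk, control = split_m1(tweets)
print(len(junk), len(control))  # -> 1 1
```

Keeping the junk and control sets matched on size lets the degradation be attributed to data quality rather than data quantity, which is the core of the controlled design.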
Just Now: Anthropic Launches the Web Version of Claude Code
机器之心· 2025-10-21 00:15
Core Insights
- Anthropic has launched "Claude Code on the web," allowing users to delegate programming tasks directly from the browser; it is currently in beta for Pro and Max users [1][2].

Group 1: Features of Claude Code
- The web version lets users run multiple programming tasks in parallel without opening a terminal, connecting to GitHub repositories and providing real-time progress tracking [9].
- The interface is designed to be flexible and to accommodate users' existing workflows [10].
- Each task runs in a secure, isolated sandbox environment, protecting user code and credentials through controlled Git interactions [12].

Group 2: User Experience and Accessibility
- Cloud execution is particularly useful for working through backlog issues, routine fixes, or parallel development work [3].
- Claude Code is also available on mobile: an iOS app lets developers code on the go [11].
- Users can steer Claude and adjust a task's direction while it executes, keeping them in control [9].
Farewell to Lopsided Specialization: UniVid Unifies Video Understanding and Generation
机器之心· 2025-10-21 00:15
In the race between video generation and video understanding, models usually pick a lane: some focus on video generation, others on video understanding (question answering, classification, retrieval, and so on). Recently, an open-source project called UniVid proposed a "fusion" direction: merge understanding and generation into one, using a single unified model that can both "understand video" and "generate video".

This is like handing both "recognizing what is in a picture" and "drawing a new picture" to the same brain: understand a piece of text plus the content of an existing video, then "draw" a new, coherent video. Technically, this is extremely challenging.

What problem does UniVid try to solve? UniVid attempts to fuse video "understanding" and "generation" into a truly general Unified Video Model: a video multi-modal model that can both understand and generate.

Paper title: UniVid: The Open-Source Unified Video Model
Paper link: https://arxiv.org/abs/2509.24200

Core innovations
1. Unified structure: Adapter-based Unified Architecture (a generic adapter sketch follows this excerpt)
In traditional approaches, the understanding model and the generation model are completely separate systems, which makes training expensive and interoperation difficult. Fusing them would otherwise require retraining a massive ...
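For intuition, here is a generic bottleneck-adapter sketch in PyTorch, the standard way to bridge frozen pretrained components without retraining them. The module shape, dimensions, and placement are assumptions for illustration, not UniVid's published architecture.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Small trainable module inserted between frozen pretrained backbones.
    Dimensions are illustrative, not UniVid's published design."""
    def __init__(self, dim: int = 1024, bottleneck: int = 128):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the frozen backbone's features intact.
        return x + self.up(self.act(self.down(x)))

# Only the adapter trains; both backbones stay frozen, which is why an
# adapter-based design can unify understanding and generation without
# retraining either system from scratch.
adapter = BottleneckAdapter()
tokens = torch.randn(2, 16, 1024)  # e.g., features from a frozen video encoder
print(adapter(tokens).shape)       # -> torch.Size([2, 16, 1024])
```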
ICCV 2025 | The First Practical Use of Diffusion Models to Generate Handwritten Text Lines: Stunning Results, and Open Source
机器之心· 2025-10-20 09:15
Core Insights
- The article introduces a generative model called Diffusion Brush, which applies diffusion models to generate realistic handwritten text lines in multiple languages, achieving high fidelity in style, content accuracy, and natural layout [2][4][6].

Research Background
- Handwriting-imitation technology has advanced to the point where AI can accurately replicate an individual's handwriting style, opening applications in font design and handwriting verification [4][6].
- Previous models focused on character-level generation, which often produced misalignment and unnatural spacing when characters were assembled into text lines [6][7].

Key Issues
- The researchers identify two main challenges in handwritten text-line generation: making generated text follow human writing habits, and maintaining style fidelity and content readability at the same time [16][17].

Technical Solutions
- Diffusion Brush decouples style learning from content learning to avoid interference, enabling more accurate style extraction while preserving content accuracy [11][12].
- A multi-scale discriminator provides detailed content supervision at both the line and character levels, balancing global and local accuracy [14][19].

Method Framework
- The framework comprises a style module for content decoupling, a style-content fusion module, a conditional diffusion generator, and a multi-scale content discriminator [13][20].
- The style module uses column- and row-masking strategies to strengthen style learning while preserving essential character information (a masking sketch follows this summary) [17][30].

Experimental Evaluation
- Quantitative assessments show that Diffusion Brush outperforms existing methods on both English and Chinese datasets, with significant improvements across performance metrics [22][23].
- Qualitative evaluations indicate that generated text lines closely match reference samples in character slant, ink depth, and stroke width [24][25].

Summary and Outlook
- Diffusion Brush represents a significant advance in personalized handwritten-text generation, with potential applications in custom font creation, historical handwriting restoration, and robust training data for text-line recognition [35].
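A toy PyTorch sketch of the column-masking idea: hide random vertical strips of the style reference so the style encoder cannot simply copy character content and is pushed toward style cues instead. The masking ratio and fill value are assumptions, not the paper's settings.

```python
import torch

def column_mask(style_batch: torch.Tensor, ratio: float = 0.3) -> torch.Tensor:
    """Mask random vertical strips (columns) of a batch of style references."""
    masked = style_batch.clone()
    w = masked.shape[-1]                      # image width in pixels
    cols = torch.randperm(w)[: int(w * ratio)]
    masked[..., cols] = 1.0                   # white out the selected columns
    return masked

# Toy usage: a batch of 2 grayscale text-line images, 64 x 512 pixels.
lines = torch.rand(2, 1, 64, 512)
print(column_mask(lines).shape)  # -> torch.Size([2, 1, 64, 512])
```

A row-masking variant would index the height dimension the same way; together the two masks corrupt content along both axes while leaving enough strokes visible to carry the writing style.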
Impressive! DeepSeek Just Open-Sourced a New Model That Compresses Everything Visually
机器之心· 2025-10-20 09:15
Core Insights
- DeepSeek has released a new OCR model, DeepSeek-OCR, which demonstrates the potential for nearly 10x near-lossless contextual compression via text-to-image methods [1][3].
- The model has 3 billion parameters and accumulated over 100 downloads shortly after its release [1].
- The research team behind DeepSeek-OCR includes Haoran Wei, Yaofeng Sun, and Yukun Li; Wei previously developed the GOT-OCR2.0 system [1].

Model Architecture
- DeepSeek-OCR consists of two main components: the DeepEncoder and the DeepSeek3B-MoE-A570M decoder [3][10].
- DeepEncoder is designed to keep activations low under high-resolution inputs while achieving high compression ratios, producing a moderate number of visual tokens [3][14].
- The model achieves 97% OCR accuracy when the number of text tokens is within 10 times the number of visual tokens, and retains about 60% accuracy at a 20x compression ratio (see the ratio illustration after this summary) [3][28].

Performance and Practical Applications
- On the OmniDocBench benchmark, DeepSeek-OCR outperformed GOT-OCR2.0 while using only 100 visual tokens versus GOT-OCR2.0's 256 [5].
- The model can generate over 200,000 pages of LLM/VLM training data per day on a single A100-40G GPU [5].
- DeepSeek-OCR shows strong practical capability, outperforming existing models such as MinerU2.0 while using significantly fewer visual tokens [30][32].

Training and Data
- Training proceeds in two main phases, drawing on a variety of OCR datasets and general visual data [21][24].
- The model was trained on 20 nodes, each with 8 A100-40G GPUs, at a global batch size of 640 [25].
- Training throughput reached 90 billion tokens per day on pure text data and 70 billion tokens per day on multimodal data [25].

Compression and Recognition Capabilities
- Using the visual modality as an efficient compression medium allows significantly higher compression rates than traditional text representations [9][10].
- The model supports nearly 100 languages, demonstrating versatility across diverse document types [42].
- It can parse complex layouts and extract structured data from charts, which is crucial for financial and scientific documents [35][40].
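The compression figures above reduce to a simple token ratio; the snippet below just makes the two accuracy regimes explicit. This is an illustration of the reported numbers, not a model run.

```python
def compression_ratio(n_text_tokens: int, n_vision_tokens: int) -> float:
    """How many text tokens each visual token stands in for."""
    return n_text_tokens / n_vision_tokens

# Reported regimes: roughly 97% decoding accuracy up to about 10x
# compression, falling to about 60% near 20x.
for text_toks in (1000, 2000):
    r = compression_ratio(text_toks, 100)
    regime = "~97% accuracy regime" if r <= 10 else "~60% accuracy regime"
    print(f"{text_toks} text tokens -> 100 vision tokens: {r:.0f}x ({regime})")
```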
NeurIPS 2025 | CMU, Tsinghua, and UT Austin Open-Source ReinFlow: Fine-Tuning Robot Flow Matching Policies with Online RL
机器之心· 2025-10-20 09:15
Core Insights
- The article discusses ReinFlow, an online reinforcement learning framework for fine-tuning flow matching policies; it has been accepted at NeurIPS 2025 and is open-sourced with comprehensive documentation [2][5][27].

Group 1: ReinFlow Overview
- ReinFlow is a general framework applicable to any policy defined by an ordinary differential equation, such as Rectified Flow and Shortcut Models, and supports inference with very few steps [12].
- The framework cuts training time by over 60% compared to DPPO while maintaining similar performance [14][16].

Group 2: Algorithm Characteristics
- ReinFlow uses policy-gradient theory to convert deterministic flows into discrete-time Markov processes, optimizing the entire flow matching chain [5][7].
- The algorithm injects a small amount of learnable noise into the deterministic path of the flow policy, yielding a stochastic diffusion process that encourages exploration while bounding deviation from the pre-trained policy (a minimal sketch follows this summary) [8][10].

Group 3: Performance Metrics
- On D4RL locomotion tasks, ReinFlow-fine-tuned Rectified Flow policies achieved an average net performance increase of 135.36% while cutting fine-tuning wall-clock time by 82.63% [16].
- On long-horizon manipulation tasks, ReinFlow-fine-tuned Shortcut Model policies improved success rates by an average of 40.34% with fewer diffusion steps, saving an average of 23.20% in training time [18].

Group 4: Experimental Validation
- Ablation studies assessed how various factors affect training outcomes, showing that reinforcement learning fine-tuning improves performance beyond mere data augmentation [24].
- The framework has been validated on multiple benchmark tasks, with significant gains over pre-trained models [14].

Group 5: Open Source and Future Directions
- The ReinFlow GitHub project is fully open-sourced and actively maintained, with a complete codebase, model checkpoints, and detailed documentation for community engagement [27].
- Future updates will add support for more flow models, classic RL environments, and comprehensive installation and usage guides [29].
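A minimal sketch of the noise-injection idea described above: perturb each Euler step of a deterministic flow with learnable Gaussian noise, so the ODE rollout becomes a discrete-time Markov process with tractable per-step log-probabilities for policy gradients. The function signatures and noise head are assumptions for illustration, not ReinFlow's released API.

```python
import torch

def noisy_flow_rollout(v_theta, sigma_phi, x0: torch.Tensor, n_steps: int = 4):
    """Roll out a flow policy with learnable per-step Gaussian noise."""
    x, log_prob = x0, torch.zeros(x0.shape[0])
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = torch.full((x.shape[0],), k * dt)
        mean = x + v_theta(x, t) * dt            # deterministic flow step
        std = sigma_phi(x, t)                     # learnable noise scale
        x = mean + std * torch.randn_like(mean)   # stochastic transition
        dist = torch.distributions.Normal(mean, std)
        log_prob = log_prob + dist.log_prob(x).sum(dim=-1)
    return x, log_prob  # log_prob feeds a policy-gradient (e.g., PPO-style) update

# Toy usage with stand-in networks.
v = lambda x, t: -x                       # dummy velocity field
s = lambda x, t: torch.full_like(x, 0.1)  # dummy noise head
actions, logp = noisy_flow_rollout(v, s, torch.randn(8, 3))
print(actions.shape, logp.shape)  # -> torch.Size([8, 3]) torch.Size([8])
```

Keeping the noise scale small is what bounds how far the fine-tuned stochastic policy can drift from the pre-trained deterministic flow.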
Breaking the FHE Bottleneck: The Lancelot Architecture Enables Robust Aggregation over Encrypted Data, Balancing Privacy Protection and Robustness
机器之心· 2025-10-20 07:48
Core Insights
- The article discusses the integration of Fully Homomorphic Encryption (FHE) with Byzantine-Robust Federated Learning (BRFL) through a new framework called Lancelot, which addresses privacy and efficiency challenges in sensitive applications such as finance and healthcare [2][15].

Group 1: Framework Overview
- Lancelot combines FHE and BRFL to enable robust aggregation while keeping data private [2][15].
- The framework tackles the high computational cost of traditional FHE, particularly for complex operations such as sorting and aggregation [2][15].

Group 2: Innovations in Encryption and Computation
- Mask-based encrypted sorting allows distance computation and sorting of model parameters without decryption, removing a major barrier in FHE applications (a plaintext mock of the idea follows this summary) [6][7].
- Lancelot improves FHE efficiency through better ciphertext multiplication strategies and polynomial matrix operations, significantly reducing resource consumption [8][9].

Group 3: Hardware Optimization
- Hardware-level deployment optimizations eliminate unnecessary computational burden and accelerate training [9][10].
- Techniques such as lazy relinearization and dynamic hoisting raise overall throughput, cutting training time from hours to minutes [12][13].

Group 4: Practical Applications and Compliance
- Lancelot supports various federated robust aggregation algorithms and can integrate with differential privacy mechanisms, helping meet regulations such as GDPR and HIPAA [15].
- Experimental results in medical scenarios show that Lancelot preserves diagnostic accuracy while preventing information leakage, laying a foundation for trustworthy AI in healthcare [15].
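Since FHE circuits cannot branch on ciphertext values, ranking has to go through masking. Below is a plaintext mock of one order-preserving blinding scheme; it only illustrates the shape of the idea and is an assumption, not Lancelot's actual protocol, which operates on real ciphertexts.

```python
import random

def masked_argmin(scores: list[float]) -> int:
    """Blind every score with a shared order-preserving random mask, then
    rank the blinded values without ever seeing the raw scores."""
    a = random.uniform(1.0, 10.0)   # multiplicative mask; a > 0 preserves order
    b = random.uniform(0.0, 5.0)    # additive mask
    blinded = [a * s + b for s in scores]
    return min(range(len(blinded)), key=blinded.__getitem__)

# Byzantine-robust selection: pick the client update with the smallest
# (blinded) distance score while keeping the raw distances hidden.
print(masked_argmin([4.2, 1.7, 9.9, 2.3]))  # -> 1
```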
AGILE: A New Paradigm for Visual Learning! Self-Supervision Plus Interactive Reinforcement Learning Comprehensively Boosts VLM Perception and Reasoning
机器之心· 2025-10-20 07:48
Core Insights
- Existing Vision-Language Models (VLMs) show significant limitations in fine-grained visual understanding, and their reasoning capabilities have not been fully activated [2].
- AGILE introduces a novel self-supervised learning paradigm that strengthens VLMs' visual perception and reasoning through an interactive, agent-based approach [2][22].

Methodology
- AGILE uses a jigsaw-puzzle task as an efficient agent task that combines perception and reasoning in a controllable, verifiable interactive form [8].
- Training has two phases: a cold-start phase in which Gemini 2.5 Pro generates 1.6K high-quality expert puzzle-interaction trajectories, followed by a reinforcement learning phase training on 15.6K images with the GRPO algorithm [9][10].

Experimental Results
- On the simplest 2x2 puzzles, AGILE raised accuracy from 9.5% to 82.8%, surpassing Gemini 2.5 Pro by 36.4 percentage points; on the harder 3x3 puzzles, accuracy rose from 0.4% to 20.8% [13].
- Performance was evaluated with two metrics: Acc, which credits a puzzle only when all blocks are placed correctly, and Score, the proportion of blocks placed correctly (a small sketch of both follows this summary) [13][14].

Generalization Capability
- After puzzle training, the model improved by an average of 3.1% across nine general visual tasks, indicating strong generalization [15].

Scaling Experiments
- Scaling studies showed that expanding puzzle training data from 0 to 16K samples raised puzzle accuracy from 22.0% to 82.8% [20].
- Replacing 10K of conventional QA data with puzzle data in a 20K-sample mix yielded a better model, highlighting the potential of puzzle tasks to alleviate data scarcity in multi-modal reinforcement learning [20].
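A small sketch of the two metrics, under the natural reading of the summary: Acc is all-or-nothing per puzzle, while Score gives per-block partial credit.

```python
def puzzle_metrics(pred: list[int], truth: list[int]) -> tuple[float, float]:
    """Return (acc, score) for one puzzle: acc is 1.0 only if every block is
    placed correctly; score is the fraction of correctly placed blocks."""
    correct = sum(p == t for p, t in zip(pred, truth))
    score = correct / len(truth)
    acc = 1.0 if correct == len(truth) else 0.0
    return acc, score

# A 2x2 puzzle with pieces 0-3; the model swaps the last two pieces.
print(puzzle_metrics([0, 1, 3, 2], [0, 1, 2, 3]))  # -> (0.0, 0.5)
```

The gap between the two metrics is informative during training: Score rises smoothly as partial placements improve, while Acc only moves once whole puzzles are solved.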