Efficient AI
AAAI 2026 Oral: 明略科技 pioneers "Information Bottleneck Dynamic Compression" for sparse data, achieving dual SOTA in accuracy and speed
机器之心 · 2025-12-02 06:47
Core Insights
- The article discusses the challenges of "Efficient AI": transformer models are becoming larger and more general, yet too computationally heavy for edge devices such as robots [1][2]
- A paper titled "CompTrack," accepted for oral presentation at AAAI 2026, asks whether models really need to process all of their input data, showing how compression techniques can sharply reduce computational cost while maintaining or even improving model performance [2][14]

Redundancy Challenges
- Current AI models face "Dual-Redundancy" challenges:
  1. Spatial redundancy: unrelated background points and blank areas are processed anyway, wasting compute and degrading accuracy [3][5]
  2. Informational redundancy: even within the relevant foreground target, much of the information is redundant and low-value, leading to inefficiency [5][7]

CompTrack Framework
- CompTrack is an end-to-end framework that tackles both types of redundancy simultaneously [7]
- It comprises two modules (both are sketched in code after this summary):
  1. A Spatial Foreground Predictor (SFP) that filters out low-information background noise using information-entropy theory [8]
  2. An Information Bottleneck-guided Dynamic Token Compression (IB-DTC) module that dynamically compresses informational redundancy within the foreground [10][11]

Efficiency and Performance
- The IB-DTC module matters for Efficient AI because it:
  1. Is grounded in the Information Bottleneck principle, retaining only the information valuable for prediction [11]
  2. Uses online Singular Value Decomposition (SVD) to set the compression rate dynamically from the intrinsic rank of the input [12]
  3. Remains end-to-end trainable, with SVD serving only as guidance toward the optimal compression rate [12]

Application and Results
- CompTrack is applied to the challenging task of 3D point-cloud tracking, demonstrating that systematically compressing informational redundancy is highly effective [14]
- Beyond efficiency, the framework sets a precedent for tackling information redundancy in other areas, including sensor fusion in robotics and multimodal processing in vision-language models [14][15]
- CompTrack runs in real time at 80 FPS on an RTX 3090, surpassing state-of-the-art methods while cutting the computational load to 0.94 GFLOPs [15]
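To make the SFP idea concrete, here is a minimal sketch of entropy-based token filtering. It is an illustration under stated assumptions, not the paper's code: the function name `entropy_filter`, the softmax-over-features scoring, and the fixed `keep_ratio` are all invented for this example, whereas the actual predictor is learned end-to-end.

```python
# Illustrative sketch only: score each point-cloud token by Shannon entropy
# and keep the most informative ones, mimicking how a spatial foreground
# predictor could drop blank background regions.
import torch
import torch.nn.functional as F

def entropy_filter(tokens: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """tokens: (N, D) token features; returns the top-k tokens by entropy."""
    probs = F.softmax(tokens, dim=-1)                          # (N, D) pseudo-distributions
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1)   # (N,) entropy per token
    k = max(1, int(keep_ratio * tokens.size(0)))
    keep = entropy.topk(k).indices                             # most informative tokens
    return tokens[keep]

# Example: keep the 512 most informative of 1024 candidate tokens.
foreground = entropy_filter(torch.randn(1024, 256), keep_ratio=0.5)
print(foreground.shape)  # torch.Size([512, 256])
```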
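The IB-DTC step can be approximated in the same spirit: take an online SVD of the retained tokens, estimate their intrinsic rank, and compress to that many summary tokens. The 95% energy threshold and the diag(S)·Vh compression below are assumptions for illustration; per the article, the actual module uses SVD only as guidance for a compression rate trained end-to-end.

```python
# Illustrative sketch only: pick a compression rate from the spectrum of the
# token matrix and keep r rank-1 "summary" tokens.
import torch

def svd_compress(tokens: torch.Tensor, energy: float = 0.95):
    """Compress (N, D) tokens to (r, D), where r is the smallest rank whose
    singular values capture `energy` of the total spectral energy."""
    U, S, Vh = torch.linalg.svd(tokens, full_matrices=False)
    cum = torch.cumsum(S**2, dim=0) / (S**2).sum()
    r = int((cum < energy).sum().item()) + 1        # estimated intrinsic rank
    return S[:r, None] * Vh[:r], r                  # (r, D) compressed tokens

# Example: a rank-8 token matrix compresses to roughly 8 summary tokens.
tokens = torch.randn(512, 8) @ torch.randn(8, 256)
compressed, r = svd_compress(tokens)
print(r, compressed.shape)  # r close to 8; shape (r, 256)
```

The appeal of this scheme is that the compression rate is data-dependent: a sparse, simple scene yields a low intrinsic rank and aggressive compression, while a cluttered scene retains more tokens.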
A compact 3B Image Captioning powerhouse arrives, with performance rivaling Qwen2.5-VL-72B
机器之心 · 2025-10-28 04:31
Core Insights
- The article introduces CapRL (Captioning Reinforcement Learning), a new approach to Dense Image Captioning that successfully applies reinforcement learning to captioning by redefining the reward around practicality [2][6][10]
- The CapRL-3B model achieves captioning performance comparable to Qwen2.5-VL-72B, a significant advance for the field and a useful reference for applying GRPO-style strategies to open-ended tasks [2][12]

Summary by Sections

Introduction to CapRL
- CapRL addresses the difficulty of designing rewards for the subjective task of image description by defining objective, verifiable rewards based on practicality [6][10]
- The model is trained to generate high-quality captions that improve on previous methods while avoiding failure modes such as reward hacking [8][10]

Limitations of Existing Methods
- Most current image-captioning models rely on supervised fine-tuning (SFT), which is costly and generalizes poorly because it depends on large, manually annotated datasets [7][8]
- The subjective nature of image description makes reliable reward functions hard to design, which can derail model training [7][8]

CapRL Framework
- CapRL uses a two-stage decoupled training process: a language model answers visual questions using only the generated caption, and the accuracy of those answers serves as an objective reward signal (a minimal reward sketch follows this summary) [10][13]
- This approach significantly improves the quality of generated captions, raising accuracy and detail coverage while reducing hallucinations [10][11]

Experimental Results
- Evaluated after training on the CapRL-5M dataset, CapRL-3B shows significant gains across 12 benchmarks compared with prior models such as ShareGPT4V and DenseFusion [12][14]
- In direct assessments of caption quality, CapRL-3B performs comparably to much larger models, with an average improvement of 8.4% over baseline models [12][15]

Conclusion and Future Work
- The CapRL framework has been open-sourced and continues to be iterated on, with the community invited to use and extend it [12][19]
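As a rough illustration of the practicality-based reward, the sketch below scores a caption by how many visual questions a text-only QA model answers correctly when given the caption alone. The `qa_model.answer(...)` interface, the `QAItem` type, and the exact-match scoring are hypothetical stand-ins invented for this example; the real pipeline uses an LLM answerer and GRPO-style policy optimization.

```python
# Illustrative sketch only: caption reward = QA accuracy from the caption alone.
from dataclasses import dataclass

@dataclass
class QAItem:
    question: str
    answer: str

def caption_reward(caption: str, qa_items: list[QAItem], qa_model) -> float:
    """Fraction of visual questions answered correctly using only the caption.

    `qa_model` is any object exposing answer(context=..., question=...) -> str,
    a hypothetical interface standing in for the text-only LLM answerer.
    """
    correct = 0
    for item in qa_items:
        pred = qa_model.answer(context=caption, question=item.question)
        correct += int(pred.strip().lower() == item.answer.strip().lower())
    return correct / max(1, len(qa_items))
```

Because this reward is a verifiable accuracy rather than a learned preference score, it is difficult to reward-hack: a caption only scores well if it actually carries the information needed to answer the questions.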