AEPO: Agentic Entropy-Balanced Policy Optimization for Steadier Exploration and Deeper Reasoning
机器之心· 2025-11-01 04:22
Core Insights
- The article presents Agentic Entropy-Balanced Policy Optimization (AEPO), a new algorithm that balances exploration and stability in multi-turn agentic reinforcement learning [2][10][11].

Group 1: Algorithm Overview
- AEPO targets two failure modes of existing agentic RL methods, "high-entropy rollout collapse" and "high-entropy gradient clipping", with two core mechanisms: dynamic entropy-balanced rollout sampling and entropy-balanced policy optimization (a toy sketch of the sampling idea follows this summary) [2][11].
- The algorithm shows consistent gains over seven mainstream reinforcement learning algorithms across 14 cross-domain benchmarks, particularly on deep-search tasks [4][12].

Group 2: Performance Metrics
- AEPO reaches a Pass@5 score of 61.5% on deep-search tasks, outperforming methods such as ARPO and GRPO by an average of 5.8% [36][37].
- It maintains training stability while improving sampling diversity and reasoning efficiency, offering a new optimization paradigm for scalable reinforcement training of general-purpose agents [4][12].

Group 3: Research Motivation
- AEPO is motivated by the need to strike a balance in high-entropy settings, where unchecked exploration leads to instability and local optima [8][10].
- The research highlights the dual nature of high-entropy signals: they are necessary for exploration, yet they can distort the allocation of rollout and gradient resources and thereby hinder learning [14][20].

Group 4: Future Directions
- Future work may extend AEPO to multimodal inputs, complex tool ecosystems, and multi-agent reinforcement learning scenarios to improve collaborative strategies and performance [41].
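The summary names AEPO's two mechanisms without giving formulas, so as a rough illustration of the sampling side only, here is a minimal Python sketch of how an entropy-balanced rollout budget might be split: a reserved fraction goes to global rollouts, and the rest is spread across tool-call steps in proportion to their token entropy. The function names and the `reserve_frac` knob are our assumptions, not AEPO's actual rule.

```python
import math

def entropy(probs):
    """Shannon entropy of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def split_budget(step_entropies, total_budget, reserve_frac=0.5):
    """Split a fixed rollout budget between global trajectories and
    per-step branches: high-entropy steps get more branches, but a
    reserved fraction always goes to global rollouts so that a few
    high-entropy tool-call steps cannot consume every sample (the
    'rollout collapse' the summary mentions). reserve_frac is a
    hypothetical knob, not a value taken from the paper."""
    n_global = max(1, int(total_budget * reserve_frac))
    n_branch = total_budget - n_global
    z = sum(step_entropies) or 1.0
    branch_alloc = [round(n_branch * h / z) for h in step_entropies]
    return n_global, branch_alloc

# Toy usage: three tool-call steps with rising uncertainty get an
# increasing share of the branch budget.
steps = [[0.9, 0.1], [0.6, 0.4], [0.4, 0.3, 0.3]]
print(split_budget([entropy(p) for p in steps], total_budget=16))
```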
Shanghai AI Lab Releases the Hybrid Diffusion Language Model SDAR: the First Open-Source Diffusion Language Model to Break 6600 tgs
机器之心· 2025-11-01 04:22
Core Insights
- The article introduces SDAR (Synergistic Diffusion-AutoRegression), a new paradigm that tackles the slow inference and high serving costs of large models, which stem largely from the serial decoding of autoregressive (AR) models [2][3][4].

Group 1: SDAR Paradigm
- SDAR decouples training from inference, combining the strong performance of AR models with the parallel-decoding advantage of diffusion models, so that any AR model can be converted into a parallel-decoding model at low cost (an illustrative decoding sketch follows this summary) [4][11].
- Experiments show that SDAR matches and often surpasses the original AR models across multiple benchmarks, with an advantage of up to 12.3 percentage points on complex scientific reasoning tasks [6][28].

Group 2: Performance and Efficiency
- SDAR preserves AR-level performance while markedly improving inference speed and reducing cost; larger models gain more from parallelization without sacrificing accuracy [17][19].
- The adaptation applies to any mainstream AR model at low cost, reaching comparable or better performance on downstream tasks [19][29].

Group 3: Experimental Validation
- Rigorous head-to-head experiments against AR models confirm substantial real-world speedups, with SDAR-8B-chat achieving a 2.3x acceleration over its AR counterpart [23][20].
- SDAR's generation mechanism does not compromise complex reasoning: it retains long-chain reasoning ability and excels on tasks that require understanding structured information [28][29].

Group 4: Future Implications
- SDAR marks a significant advance for large models, offering a powerful and flexible tool that lowers application barriers and opens new avenues toward higher performance and efficiency in AI reasoning paradigms [29][31].
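To make the "AR across blocks, parallel within a block" idea concrete, here is a minimal mask-predict decoding sketch under our own assumptions: the `decode_block` helper and its confidence-based commit schedule are hypothetical, and the released SDAR implementation may differ.

```python
import torch

def decode_block(model, prefix_ids, block_len=16, steps=4, mask_id=0):
    """Block-wise parallel decoding in the spirit of the SDAR description:
    blocks are produced left-to-right like an AR model, but the tokens
    inside a block are filled in parallel over a few mask-predict steps.
    `model(ids)` is assumed to return per-position logits of shape
    [1, seq_len, vocab]; this is a sketch, not the released code."""
    block = torch.full((1, block_len), mask_id, dtype=torch.long)
    ids = torch.cat([prefix_ids, block], dim=1)
    masked = torch.ones(block_len, dtype=torch.bool)
    start = ids.shape[1] - block_len
    per_step = max(1, block_len // steps)
    for _ in range(steps):
        logits = model(ids)[:, -block_len:, :]
        conf, pred = logits.softmax(-1).max(-1)    # [1, block_len]
        conf = conf.squeeze(0).masked_fill(~masked, -1.0)
        k = min(per_step, int(masked.sum()))       # commit the k most
        if k == 0:                                 # confident masked slots
            break
        top = conf.topk(k).indices
        ids[0, start + top] = pred[0, top]
        masked[top] = False
    return ids

# Toy usage with a dummy "model" that returns random logits.
dummy = lambda ids: torch.randn(1, ids.shape[1], 100)
print(decode_block(dummy, torch.tensor([[1, 2, 3]]), block_len=8).shape)
```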
Five Minutes to Learn, No License Needed: the ¥910,000 "Aerial F1" Is Booked Out to 2027
机器之心· 2025-11-01 04:22
Core Viewpoint
- The article covers the emergence of personal flying vehicles, focusing on the Jetson ONE, a single-seat electric vertical takeoff and landing aircraft, and its potential to reshape transportation [5][6].

Group 1: Jetson ONE Overview
- The Jetson ONE is an electric personal aircraft with eight motors and an airframe of aluminum and carbon fiber [5].
- It has drawn significant attention: one video of its flight has amassed over 4 million views on social media [6].
- It was shown at the 2025 UP.Summit, where a flying competition highlighted its performance [6][7].

Group 2: Specifications and Features
- The Jetson ONE weighs 121 pounds (about 55 kg) without the battery and 253 pounds (about 115 kg) with it, for a maximum payload of 210 pounds (about 95 kg) [7].
- Top speed is 63 mph (about 101 km/h), and maximum flight altitude exceeds 1,500 feet (about 457 m) [7].
- The FAA classifies it as an ultralight vehicle, so no pilot's license is required, and it is controlled with a single joystick [7].

Group 3: Market and Pricing
- The current pre-order price is $128,000 (about ¥910,000), soon rising to $148,000 (about ¥1,050,000) [7].
- The 2025 and 2026 production runs are sold out; new orders ship in 2027 at the earliest [8].

Group 4: Industry Context and Comparisons
- Other companies are also developing personal flying vehicles, such as Alef Aeronautics, whose flying car can drive on ordinary roads and has a flight range of 170 km [18].
- The article also mentions a human-powered flying bicycle built by students in Japan, showing the diversity of designs in this space [20].
- The Volonaut Airbike, created by the same inventor as the Jetson ONE, has no exposed propellers but offers limited practicality at a high price of $880,000 (about ¥6.2 million) [25][28].
When Language Priors Are Too Strong: What Can Be Done About Visual Attenuation in MLLMs?
机器之心· 2025-11-01 02:30
Core Viewpoint
- The article examines why Multimodal Large Language Models (MLLMs) fail to integrate visual information effectively, pointing to a systemic bias toward text and a decay of attention to visual tokens over long reasoning chains (a diagnostic sketch follows this summary) [1].

Group 1: Visual Information Neglect in MLLMs
- Built on the Transformer architecture, MLLMs have advanced tasks such as visual question answering and image description by combining language-model reasoning with visual encoding [5].
- Their attention distribution shows a systemic bias: over-reliance on language and neglect of visual information, especially in complex reasoning scenarios [5][6].
- As reasoning chains grow longer, attention to image content drops markedly while attention to language tokens rises, so the model leans on language cues rather than visual evidence [5][6].

Group 2: Amplification of Visual Errors in Deep Reasoning
- The modality imbalance originates in training: text corpora, often trillions of tokens, dwarf the visual data and give the LLM backbone strong language priors [8].
- Visual features, despite their high-dimensional representation, are overshadowed by language features and get sidelined during early fusion [8][9].
- Training objectives also favor language data, which is more abstract and compact, so the model learns shortcut strategies that prioritize text over complex visual information [9].
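One cheap way to observe the decay the article describes is to track how much attention mass the generated tokens place on the image tokens at each step. A minimal diagnostic sketch, assuming an attention map already averaged over heads and layers; this is our own probe, not the paper's protocol.

```python
import torch

def visual_attention_share(attn, visual_slice):
    """Fraction of each generated token's attention mass that lands on
    the image tokens. `attn` is an attention map averaged over heads and
    layers, shape [gen_len, ctx_len]; `visual_slice` marks where the
    visual tokens sit in the context. Plotting this share against the
    generation step exposes the visual attention decay described above."""
    on_vision = attn[:, visual_slice].sum(dim=-1)   # [gen_len]
    return on_vision / attn.sum(dim=-1).clamp_min(1e-9)

# Toy usage: 6 generated tokens over a 20-token context, images at 0..7.
attn = torch.rand(6, 20).softmax(dim=-1)
print(visual_attention_share(attn, slice(0, 8)))
```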
30 fps on a Single RTX 4090: Fan Haoqiang's Team Gets VLA Running in Real Time
机器之心· 2025-10-31 07:57
Core Insights
- The RT-VLA paper shows that a VLA model can run in real time, reaching 30 frames per second (fps) with a 3-billion-parameter model on a consumer-grade RTX 4090 GPU [2][6].
- Structural optimization cuts inference time from over 100 milliseconds to as low as 27 milliseconds in dual-view scenarios, far ahead of prior results [2][6].
- A new algorithmic framework targets 480 Hz closed-loop control, enabling real-time operation of VLA models (a scheduling sketch follows this summary) [3][12].

Model Optimization
- The Pi0 model comprises a visual encoder, an encoder, and a decoder, which decompose into many matrix multiplications and scalar operations [8].
- The team profiled the inference pipeline, then merged and parallelized computations to remove bottlenecks, yielding a streamlined inference path [8][10].
- The result is a high-performance model fast enough for real-time tasks, likened to "the Flash" in speed [8][10].

Performance Demonstration
- In one demonstration the model reacts to a falling pen with an end-to-end response time under 200 milliseconds, comparable to a human [10][12].
- The framework supports streaming real-time robot control, with plans to generate control signals at up to 480 Hz [12][15].

Future Prospects
- The work opens the door to VLA models participating in real-time control, with further gains expected from edge computing [14].
- Future directions include pushing visual processing beyond 30 fps and scaling model size while preserving real-time constraints [15].
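The article does not spell out how 27 ms inference becomes 480 Hz control. A common pattern, and purely an assumption here, is to decouple the slow policy from a fast interpolating control loop; the sketch below illustrates that split (`ChunkStreamer` and all its parameters are hypothetical, not the paper's framework).

```python
import time
import threading
import numpy as np

class ChunkStreamer:
    """One plausible way to get a high-rate control stream out of a
    ~30 Hz VLA policy: an inference thread refreshes an action chunk,
    and the 480 Hz control loop interpolates inside the freshest chunk."""

    def __init__(self, horizon=16, dof=7, policy_hz=30.0):
        self.chunk = np.zeros((horizon, dof))  # future actions, one per tick
        self.t0 = time.monotonic()
        self.policy_hz = policy_hz
        self.lock = threading.Lock()

    def update_chunk(self, chunk):
        """Called by the (slow) inference thread after each forward pass."""
        with self.lock:
            self.chunk = np.asarray(chunk)
            self.t0 = time.monotonic()

    def current_action(self):
        """Called by the (fast) control loop, e.g. at 480 Hz: linearly
        interpolate between the two chunk entries bracketing 'now'."""
        with self.lock:
            i = (time.monotonic() - self.t0) * self.policy_hz
            lo = min(int(i), len(self.chunk) - 2)
            frac = min(max(i - lo, 0.0), 1.0)
            return (1 - frac) * self.chunk[lo] + frac * self.chunk[lo + 1]

# Toy usage: push one chunk and read an interpolated action.
s = ChunkStreamer()
s.update_chunk(np.linspace(0, 1, 16)[:, None].repeat(7, axis=1))
print(s.current_action())
```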
Just In: Kimi Open-Sources a New Architecture, Betting on Linear Attention
机器之心· 2025-10-31 04:11
Core Insights
- The article covers advances in attention mechanisms, focusing on the Kimi Linear architecture, which combines linear attention and full attention to improve efficiency and performance across tasks [1][2][4].

Group 1: Kimi Linear Architecture
- Kimi Linear introduces a hybrid linear-attention design built on Kimi Delta Attention (KDA), which uses a more efficient gating mechanism to manage the limited state memory of RNN-style linear attention (a recurrence sketch follows this summary) [4][10].
- The architecture interleaves KDA layers with periodic full-attention layers at a 3:1 ratio, sharply reducing memory use while matching or exceeding full-attention quality [10][32].
- Kimi Linear has 48 billion total parameters with 3 billion activated, and handles context lengths up to 1 million tokens [5][10].

Group 2: Performance and Efficiency
- Kimi Linear outperforms traditional full attention across tasks, especially long-context ones, cutting key-value cache requirements by up to 75% [5][10].
- Decoding throughput on long contexts is up to six times that of full attention models [5][59].
- In comparative evaluations it consistently beats baselines such as MLA and GDN-H on general knowledge, reasoning, and Chinese-language tasks [44][49].

Group 3: Technical Innovations
- KDA adds fine-grained control over memory decay and positional awareness, improving the model's expressiveness and efficiency [20][24].
- A block-wise recursive, intra-block parallel strategy maximizes matrix-multiplication throughput and makes effective use of Tensor Cores [26][59].
- The NoPE (No Position Encoding) design enables efficient long-context training by delegating positional information to the KDA layers [34][39].

Group 4: Experimental Results
- Kimi Linear posts the highest average scores on long-context benchmarks, confirming its effectiveness on very long sequences [52][53].
- In reinforcement learning scenarios it improves faster and further than MLA, notably on mathematical reasoning [56][57].
- Efficiency stays high: latency overhead versus GDN-H is negligible at pre-fill, and the speed advantage grows with sequence length [59][60].
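The summary describes KDA as delta-rule linear attention with finer-grained gating. Below is a minimal token-by-token recurrence in that spirit, assuming a per-channel decay gate (`alpha`) and a scalar write strength (`beta`); the production kernel is block-parallel rather than this loop, so treat this only as intuition for the state update.

```python
import torch

def kda_like_step(S, q, k, v, alpha, beta):
    """One recurrent step of a gated delta rule, matching the summary's
    high-level description of KDA. A sketch, not Kimi's kernel.
      S:     [d_k, d_v] recurrent state (the linear-attention "memory")
      q, k:  [d_k]   query / key
      v:     [d_v]   value
      alpha: [d_k]   per-channel decay gate in (0, 1), finer than a scalar
      beta:  scalar  write strength in (0, 1)
    """
    S = alpha.unsqueeze(-1) * S                 # fine-grained forgetting
    v_pred = k @ S                              # what the state recalls for k
    S = S + beta * torch.outer(k, v - v_pred)   # delta-rule correction
    return S, q @ S                             # new state, output token

# Toy usage over a short sequence.
d = 8
S = torch.zeros(d, d)
for _ in range(5):
    q, k, v = torch.randn(d), torch.randn(d), torch.randn(d)
    alpha = torch.sigmoid(torch.randn(d))       # learned gates in the model
    S, y = kda_like_step(S, q, k, v, alpha, beta=0.5)
print(y.shape)  # torch.Size([8])
```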
The Big Direction for L4 Is Set: Li Auto's Autonomous Driving Team Unveils a New Paradigm at a Top Global AI Conference
机器之心· 2025-10-31 04:11
Core Viewpoint
- The article discusses AI entering its "second half," arguing that AI needs new evaluation and configuration methods to surpass human intelligence, particularly in autonomous driving [1][5].

Group 1: AI Paradigm Shift
- AI is moving from reliance on human-generated data to experience-based learning, as argued in Rich Sutton's essay "The Era of Experience" [1].
- Former OpenAI researcher Yao Shunyu asserts that AI must develop new evaluation methods to tackle real-world tasks effectively [1].

Group 2: Advancements in Autonomous Driving
- At ICCV 2025, Li Auto expert Zhan Kun presented a talk on evolving from a data closed loop to a training closed loop in autonomous driving [2][4].
- Li Auto introduced a systematic approach that brings world models and reinforcement learning into mass-produced autonomous driving systems, a significant technological milestone [5].

Group 3: Li Auto's Technological Innovations
- Li Auto's advanced driver-assistance system, LiAuto AD Max, is built on a Vision-Language-Action (VLA) model, marking the shift from rule-based algorithms to end-to-end solutions [7].
- Its driver-assistance capability has improved markedly, with mileage per human takeover (MPI) rising substantially over the past year [9].

Group 4: Challenges and Solutions in Data Utilization
- Li Auto found that plain end-to-end learning hit diminishing returns once training data reached 10 million clips, largely because data for critical driving scenarios is sparse [11].
- The company aims to move from a single data closed loop to a fuller training closed loop that spans data collection and iterative training driven by environmental feedback [12][14].

Group 5: World Model and Synthetic Data
- Li Auto is developing a VLA vehicle model with prior knowledge and driving ability, backed by a cloud-based world-model training environment that mixes real, synthetic, and exploratory data [14].
- Synthetic data generation has improved the training data distribution, strengthening the stability and generalization of the driver-assistance system [24].

Group 6: Research Contributions and Future Directions
- Since 2021 Li Auto's research team has published numerous papers, expanding from perception tasks to advanced topics such as VLM/VLA and world models [28].
- The company is tackling open problems in interactive intelligent agents and reinforcement learning engines, both critical to the future of autonomous driving [35][38].

Group 7: Commitment to AI Development
- Li Auto commits nearly half of its R&D budget to AI, with multiple teams spanning driver assistance and smart industrial applications [43].
- Its strategic AI products iterate rapidly, including the VLA driver model launched with the Li Auto i8 [43].
HKUST's New Algorithm Rethinks LLM Reasoning: Random-Policy Valuation Proves a "Masterstroke" for Mathematical Reasoning
机器之心· 2025-10-31 04:11
Core Insights
- The article introduces ROVER (Random Policy Valuation for Diverse Reasoning), which simplifies reasoning-oriented RL for large language models: instead of running traditional RL iterations, it evaluates a completely random policy and reads optimal reasoning paths off those values [3][4][11].

Group 1: ROVER's Methodology and Advantages
- ROVER significantly outperforms existing methods on mathematical reasoning benchmarks, achieving higher quality and diversity from a minimalist design [4][9].
- It needs neither a value network nor a reference model, making it far lighter than traditional RL pipelines [9][16].
- The method has three steps: estimate Q-values under the random policy, build the policy by softmax sampling over those Q-values to preserve diversity, and train with a simplified objective that cuts compute and improves stability (a toy sketch follows this summary) [19][21][24].

Group 2: Performance Metrics
- On hard benchmarks (AIME24, AIME25, HMMT25), ROVER improves pass@1 by +8.2 and pass@256 by +16.8 [9][26].
- It scores 30.6 pass@1 on AIME24, beating the best baseline (DAPO) by 19.1 points, and 14.6 pass@1 on HMMT25, a 106% gain over the strongest baseline [26][27].
- Strategy diversity improves by 17.6% over baselines, covering more distinct solution paths [29][31].

Group 3: Implications and Future Directions
- ROVER reflects a methodological shift: for structured tasks, simplification rather than added complexity can drive performance gains [38].
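The three steps map naturally onto a toy implementation. Here is a minimal sketch of steps 1 and 2, assuming a hypothetical `env_step` interface in place of the actual reasoning environment; nothing below is the paper's estimator.

```python
import math
import random

def softmax_sample(q_values, tau=1.0):
    """Step 2 of the summary: sample an action from a softmax over
    Q-values instead of taking the argmax, which preserves diversity."""
    weights = [math.exp(q / tau) for q in q_values]
    return random.choices(range(len(q_values)), weights=weights)[0]

def random_policy_q(env_step, state, action, rollouts=64, gamma=1.0):
    """Step 1 of the summary: Monte-Carlo estimate of Q(state, action)
    under a *uniformly random* policy. `env_step(s, a) -> (s', r, done,
    actions)` is a hypothetical interface; in ROVER the 'environment' is
    the reasoning task and the reward is answer correctness."""
    total = 0.0
    for _ in range(rollouts):
        s, a, ret, disc = state, action, 0.0, 1.0
        while True:
            s, r, done, actions = env_step(s, a)
            ret += disc * r
            disc *= gamma
            if done:
                break
            a = random.choice(actions)  # the random policy being evaluated
        total += ret
    return total / rollouts

# Toy usage: a 1-step bandit where action 1 pays off.
bandit = lambda s, a: (s, 1.0 if a == 1 else 0.0, True, [0, 1])
qs = [random_policy_q(bandit, "s0", a) for a in (0, 1)]
print(qs, "->", softmax_sample(qs, tau=0.2))
```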
The Highest Form of "Wrapping": OpenAI Unveils OWL, the Architecture Behind the Atlas Browser
机器之心· 2025-10-31 03:01
Core Viewpoint
- OpenAI's new AI browser, Atlas, is built on a restructured architecture called OWL that separates the Chromium runtime from the main application process, improving browser performance and user experience [1][3][11].

Group 1: Foundation and Architecture
- OpenAI treats Chromium as a foundational building block: it supplies a mature web engine, security model, performance, and compatibility, backed by a global developer community [5].
- The OWL architecture runs Chromium's browser process independently of the Atlas main application process, gaining modularity and performance [12][14].
- Rather than lightly patching Chromium, OpenAI redesigned the integration outright, prioritizing rapid development and its engineering culture [10][11].

Group 2: User Experience Enhancements
- Atlas aims to redefine the browsing experience: near-instant startup, smooth performance with many tabs open, and a solid foundation for agent scenarios [7].
- The user interface is rebuilt almost entirely from scratch on modern native frameworks, not a re-skin of the open-source Chromium UI [9][10].
- The architecture brings faster load times, crash isolation, and fewer merge conflicts, enabling a quicker development cycle [18].

Group 3: Technical Implementation
- Atlas acts as the OWL client while the Chromium browser process acts as the OWL host; the two communicate over Mojo, Chromium's inter-process communication system (an illustrative process-split sketch follows this summary) [17].
- The OWL client library exposes a simplified Swift API for the key functionality, keeping the codebase clean and the application design modern [18].
- Input events are captured and forwarded efficiently, keeping interaction seamless between the Atlas interface and the Chromium rendering engine [30][32].

Group 4: Agent Mode and Security
- Agent mode poses unique challenges: it needs complete screen images as input while preserving security through sandboxing and session isolation [36][37].
- Each agent session runs independently and clears all cookies and data when it ends, so multiple concurrent sessions cannot interfere with one another [37].

Conclusion
- OpenAI credits the global Chromium community for enabling these advances; OWL points toward a decoupled engine-and-application architecture that pairs a top-tier web platform with modern native frameworks [38].
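As a loose analogy for the client/host split (not OpenAI's API; every name below is invented), a Python toy that runs the "engine" in its own process behind an IPC pipe shows the crash-isolation property the article emphasizes: the UI process holds only a thin handle, so an engine crash cannot take the app down.

```python
import multiprocessing as mp

def engine_host(conn):
    """Stand-in for the 'OWL host' (the Chromium browser process in
    Atlas): it serves requests over an IPC channel. Atlas uses Mojo;
    this toy uses a multiprocessing Pipe."""
    while True:
        msg = conn.recv()
        if msg == "quit":
            break
        conn.send(f"rendered:{msg}")

class OwlLikeClient:
    """Toy analogue of the OWL client in the app process. Purely
    illustrative of the process split; none of these names come from
    OpenAI's code."""

    def start(self):
        self.conn, child = mp.Pipe()
        self.proc = mp.Process(target=engine_host, args=(child,), daemon=True)
        self.proc.start()

    def load(self, url):
        self.conn.send(url)
        return self.conn.recv()

    def stop(self):
        self.conn.send("quit")
        self.proc.join()

if __name__ == "__main__":
    client = OwlLikeClient()
    client.start()
    print(client.load("https://example.com"))
    client.stop()
```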
Redefining the Flow-Matching Paradigm for Cross-Modal Generation: VAFlow Lets Videos "Voice Themselves"
机器之心· 2025-10-31 03:01
Core Viewpoint
- The article introduces VAFlow, a novel video-to-audio generation framework that models the mapping from video to audio directly, overcoming the limits of traditional methods built on noise-based priors [6][9][29].

Background
- The shift from "noise to sound" toward "video to sound" marks an evolution in multimodal generation, particularly in video-to-audio (V2A) tasks [3].

Traditional Methods
- Early V2A methods used autoregressive and mask-prediction approaches, whose discrete audio representations capped generation quality [4][5].

VAFlow Framework
- VAFlow drops the Gaussian-noise prior and generates audio directly from the video distribution, yielding significant gains in generation quality, semantic alignment, and synchronization accuracy (a training-loss sketch follows this summary) [6][8][9].

Comparison of Generation Paradigms
- Against traditional diffusion models and standard flow matching, VAFlow converges faster and scores better on audio-quality metrics [19][20].

Prior Analysis
- Comparing the Gaussian prior with a video prior, the study finds the video prior aligns better with the audio latent space, producing superior generation quality [12][15].

Performance Metrics
- VAFlow beats existing state-of-the-art (SOTA) methods on audio-quality metrics, achieving the best scores on several benchmarks without complex video-conditioning modules [24][25].

Visual Results
- Visual comparisons of generated audio against ground truth show VAFlow accurately interpreting complex scenes while maintaining audio-visual synchronization [27].

Future Directions
- The team plans to extend VAFlow to broader audio domains, including speech and music, pointing to its potential for general multimodal generation [29].
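To make "replacing the Gaussian prior with a video prior" concrete: in flow-matching terms, the source end of the probability path becomes a video-derived latent rather than noise. A minimal training-loss sketch under that assumption follows; the velocity network and the linear (rectified-flow-style) path are standard flow matching, not VAFlow's exact design.

```python
import torch

def video_prior_fm_loss(model, x_video, x_audio):
    """Flow matching from a *video prior* to audio latents, per the
    article's framing: the path starts at a video-derived latent
    (assumed already projected into the audio latent space) instead of
    Gaussian noise. `model(x_t, t)` is an assumed velocity network."""
    t = torch.rand(x_audio.shape[0], 1)          # one timestep per sample
    x_t = (1 - t) * x_video + t * x_audio        # straight path video -> audio
    v_target = x_audio - x_video                 # constant velocity along it
    return torch.mean((model(x_t, t) - v_target) ** 2)

# Toy usage with matching latent dims and a dummy velocity net.
net = lambda x, t: torch.zeros_like(x)
print(video_prior_fm_loss(net, torch.randn(4, 32), torch.randn(4, 32)))
```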