Error Accumulation
Physical Intelligence's Newly Released VLA Model: Why Is It the Inflection Point for Robots on the Path to Scaled Deployment? | Jinqiu Select
锦秋集· 2025-11-18 11:13
Core Insights
- The article discusses the limitations of current robot foundation models that rely primarily on demonstration data, and introduces a structured reinforcement learning (RL) framework called Recap to enhance robot performance and reliability [2][3][10].

Group 1: Limitations of Current Models
- Current models depend heavily on demonstration data, which incurs high human cost and caps policies at human-level performance, leaving them without self-improvement capabilities [2][10].
- Merely increasing model size is insufficient; a restructured training paradigm is essential for robots to move from "can demonstrate" to "can deploy at scale" [3][10].

Group 2: Introduction of the Recap Framework
- Recap integrates three training phases: demonstrations, corrections, and autonomous robot rollouts, allowing policy quality to improve continuously [2][10].
- The framework addresses the compounding-error problem in robot policies by systematically using correction data, value functions, and advantages [3][10][12] (a minimal sketch of advantage-weighted training follows this summary).

Group 3: Performance of the π*(0.6) Model
- The π*(0.6) model, with 5 billion parameters, can handle heterogeneous prompts and reaches performance thresholds suitable for commercial deployment [3][20].
- The model shows significant improvements in task execution, achieving success rates above 90% on complex tasks such as making espresso, folding clothes, and assembling boxes [25][20].

Group 4: Learning Process and Challenges
- Learning proceeds in three stages: offline reinforcement-learning pre-training, task-specific fine-tuning, and continuous improvement through real-world experience [19][20].
- The article outlines the challenges of high-throughput autonomous execution, particularly in tasks that require complex physical manipulation and adaptation to varied conditions [24][20].

Group 5: Data Sources for Learning
- Three data sources for robot learning are identified: expert demonstrations to define new behaviors, corrections to refine policies, and autonomous experience to improve behavior [27][28].
- Autonomous experience may become a crucial data source as robots are deployed more widely in real-world applications, potentially enabling performance that surpasses human capability [27][28].
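The compounding-error fix described in Group 2 hinges on scoring executed actions with a value function and weighting policy updates by advantage. The summary does not give Recap's exact formulation, so the sketch below shows a generic advantage-weighted behavior-cloning update of the kind used in offline-RL recipes; `policy`, `value_net`, and `beta` are illustrative assumptions, not Physical Intelligence's API:

```python
import torch

def advantage_weighted_bc_loss(policy, value_net, obs, actions, returns, beta=1.0):
    """Generic advantage-weighted behavior cloning (AWR/AWAC-style).
    Illustrative only; module names are assumptions, not Recap's code.

    obs:     batch of observations (e.g., image + proprioception embeddings)
    actions: actions from demonstrations, corrections, or autonomous rollouts
    returns: empirical reward-to-go for each transition
    """
    with torch.no_grad():
        # Advantage = how much better the taken action did than the
        # value function expected from this state.
        advantage = returns - value_net(obs).squeeze(-1)
        # Exponential weighting up-weights better-than-expected actions
        # (corrections, successful rollouts) and down-weights mistakes.
        weights = torch.exp(advantage / beta).clamp(max=20.0)

    log_prob = policy.log_prob(obs, actions)  # log pi(a | s)
    return -(weights * log_prob).mean()
```

The property this preserves from the summary: demonstrations define new behaviors, corrections supply targeted signal near the policy's own mistakes, and the value function converts autonomous rollouts into a self-improvement signal instead of capping the policy at human level.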
Making AI-Generated Video "Both Long and Fast": Rolling Forcing Achieves Minute-Scale Real-Time Generation
机器之心· 2025-11-05 00:18
Core Insights
- The article discusses a breakthrough in real-time long-video generation via a new method called Rolling Forcing, developed by researchers from Nanyang Technological University and Tencent ARC Lab [2][4][12].

Group 1: Challenges in Real-Time Video Generation
- Real-time long-video generation faces an "impossible triangle": high quality, consistency, and real-time performance are difficult to achieve simultaneously [8].
- The core challenges include generating frames sequentially with low latency, eliminating error accumulation while maintaining consistency, and the limitations of autoregressive frame-generation methods [10][11].

Group 2: Rolling Forcing Methodology
- Rolling Forcing introduces a "sliding window" that processes the frames inside a window in parallel, enabling real-time generation while correcting errors as they occur [12][14].
- The method incorporates three key innovations (a minimal sketch of the windowed loop follows this list):
  1. A sliding window for joint denoising, optimizing multiple frames simultaneously [14].
  2. An attention-sink mechanism that caches the initial frames as global anchors to preserve long-term consistency [14].
  3. An efficient training algorithm that conditions on self-generated historical frames to match real inference conditions [14].

Group 3: Experimental Results
- Rolling Forcing shows significant improvements over existing methods, reaching a generation speed of 16 frames per second (fps) with low error accumulation [17][20].
- In qualitative comparisons, Rolling Forcing maintains high fidelity in long-video generation, avoiding the color drift and detail degradation that affect other models [20][21].

Group 4: Future Directions
- Future research may optimize memory mechanisms to better retain key information, improve training efficiency to reduce computational cost, and minimize interaction latency for applications that require ultra-low latency [25].
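The three Group 2 innovations compose into one generation loop: frames in the window carry progressively higher noise, the nearly clean front frame is emitted each step while a fresh noise frame enters at the back, and a few cached initial frames act as the attention sink. The sketch below is a schematic of that loop under assumed names; `denoiser`, the window size, and the noise schedule are illustrative stand-ins, not the paper's API:

```python
import torch

def rolling_forcing_generate(denoiser, num_frames, window=8,
                             sink_frames=2, frame_shape=(3, 64, 64)):
    """Schematic sliding-window generation loop (illustrative only).

    The oldest frame in the window is almost clean and is emitted each
    step; a pure-noise frame enters at the back. The first few emitted
    frames are kept as an "attention sink" anchoring long-range
    consistency.
    """
    sink = []                                        # clean anchor frames
    buf = [torch.randn(frame_shape) for _ in range(window)]
    noise_levels = torch.linspace(0.1, 1.0, window)  # low noise at the front

    out = []
    for _ in range(num_frames):
        # One call jointly refines every frame in the window,
        # attending to the sink frames as global context.
        buf = denoiser(buf, noise_levels, context=sink)
        front = buf.pop(0)                           # front frame is now (nearly) clean
        out.append(front)
        if len(sink) < sink_frames:
            sink.append(front.detach())              # cache initial frames as anchors
        buf.append(torch.randn(frame_shape))         # new noise frame enters the window
    return out

# Usage with a dummy denoiser that just damps each frame toward zero:
frames = rolling_forcing_generate(lambda b, n, context: [f * 0.9 for f in b], 16)
```

The design point the summary emphasizes: because several frames are denoised jointly, errors in a partially denoised frame can still be corrected by later passes before that frame leaves the window, which is what keeps error accumulation low over long horizons.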
The Real-Time AI Video Generation Model Even Karpathy Invested In: Converts Live Streams Instantly, with Unlimited Duration and Near-Zero Latency
量子位· 2025-07-19 05:15
Core Viewpoint
- The article discusses the innovative AI startup Decart and its video model MirageLSD, which enables real-time, zero-latency video generation and could reshape live streaming, gaming, and video communication [4][5][7].

Group 1: Technology and Features
- MirageLSD is the first AI model to achieve zero-latency, unlimited-duration real-time video generation, producing continuous video streams without a time limit [4][5].
- The model runs 16 times faster than previous models, generating video at 24 frames per second while accepting ongoing prompts, transitions, and edits during generation [6][28].
- It addresses the "error accumulation" problem of traditional autoregressive video models, preserving temporal coherence while generating content frame by frame [9][11].

Group 2: Innovations and Mechanisms
- The model employs a custom real-time Live-Stream Diffusion model that generates each frame from previously generated frames and the user prompt, rather than from the entire video sequence [14] (a minimal sketch of this recipe follows this list).
- It uses Diffusion Forcing to denoise individual frames independently during training, keeping frame-by-frame generation coherent [15].
- It applies a history-augmentation strategy that simulates artifacts on past frames during training, so the model learns to anticipate and correct potential errors [16].

Group 3: Performance and User Interaction
- MirageLSD's architecture includes an improved Transformer and a specially designed visual encoder, which raise processing speed and reduce latency [18][20].
- A dynamic input mechanism processes player inputs with ultra-low latency, allowing immediate responses to changes in the environment [22].
- Users can perform actions like changing outfits or transforming objects with minimal delay, showcasing the model's interactive capabilities [23].

Group 4: Company Background and Future Developments
- Decart, the company behind MirageLSD, was founded in 2023 and previously launched the Oasis model, which also supports real-time interaction [25][26].
- The team plans regular upgrades and new features for MirageLSD, including facial consistency, voice control, and precise object manipulation to enhance the user experience [28].
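The Group 2 mechanisms combine into a simple training recipe: condition on a short history of frames, corrupt that history the way inference-time drift would, and denoise only the current frame. The sketch below illustrates that recipe under stated assumptions; `frame_model`, the corruption scheme, and the linear noising schedule are all illustrative, not Decart's implementation:

```python
import torch

def corrupt_history(history, p=0.3, noise_scale=0.1):
    """History augmentation: randomly degrade past frames during training
    so the model learns to recover from the artifacts its own outputs
    will contain at inference time. (Illustrative; the real corruptions
    are not specified in the summary.)"""
    out = []
    for frame in history:
        if torch.rand(()) < p:
            frame = frame + noise_scale * torch.randn_like(frame)
        out.append(frame)
    return torch.stack(out)

def lsd_train_step(frame_model, history, target_frame, prompt_emb):
    """One Diffusion-Forcing-style step: only the current frame is noised
    and denoised, conditioned on the (corrupted) history and the live
    prompt, so generation can proceed causally frame by frame."""
    t = torch.rand(())                          # random noise level for this frame
    noise = torch.randn_like(target_frame)
    noisy = (1 - t) * target_frame + t * noise  # simple linear noising schedule
    pred = frame_model(noisy, t, corrupt_history(history), prompt_emb)
    return torch.nn.functional.mse_loss(pred, target_frame)
```

Because each frame depends only on past frames and the current prompt, the prompt can change mid-stream, which is what enables the live editing and zero-latency interaction the article highlights.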