Say Goodbye to Robot "Memory Blackouts"! KAIST and UC Berkeley Team Give VLA Models Memory, Doubling Success Rates in Real Tests
机器人大讲堂· 2026-02-16 15:31
Core Insights
- The article discusses the limitations of existing Vision-Language-Action (VLA) models in robotics, particularly their lack of "historical memory," which hampers their ability to perform complex tasks that require context [1][4]
- A new framework called HAMLET has been introduced, which enhances VLA models by integrating a lightweight memory system, resulting in a significant increase in task success rates [3][17]

Group 1: Current Limitations of VLA Models
- Current VLA models, such as GR00T N1.5 and CogACT, rely solely on the current visual frame and text instructions, leading to poor performance on tasks that require context [4]
- For example, in a task where a robot must cover a block with a cup, the lack of historical memory leaves GR00T N1.5 with a success rate of only 37.5%, causing the robot to repeat actions unnecessarily [4][14]
- Simply stacking historical frames is not an effective fix: it slows inference by 35% and increases peak memory usage 3.6-fold [4]

Group 2: HAMLET Framework
- HAMLET addresses the historical-memory gap by adding two core components: moment tokens and a lightweight memory module [5][9]
- Moment tokens compress and store scene information at each time step, allowing the model to focus on dynamic changes relevant to the task [6][8]
- The memory module uses a two-layer Transformer architecture to filter and integrate these moment tokens, enabling the model to make more informed decisions based on historical context [9][11]

Group 3: Performance Improvements
- Extensive experiments show that HAMLET significantly improves success rates on long-horizon tasks, with an average success-rate increase of 47.2% over baseline models [12][14]
- On specific tasks, HAMLET raised the success rate from 12.5% to 66.7% on "Pick-and-Place Twice" and from 37.5% to 83.3% on "Swap Cubes" [14]
- HAMLET also maintains high efficiency, with only a 7% increase in inference speed and a 1x increase in memory usage, whereas naive frame stacking drastically slows performance [15]

Group 4: Cross-Task Transferability
- HAMLET's memory module demonstrates cross-task transferability, improving success rates even when applied to different datasets, indicating a generalizable capability for processing historical information [16]

Conclusion
- HAMLET effectively resolves the core issue of historical memory in VLA models without requiring extensive retraining or restructuring, marking a significant step toward more capable and versatile robotic systems [17]
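The moment-token idea described above can be illustrated with a small sketch: per-step features are compressed into a few tokens via learned-query attention, kept in a bounded buffer, and pooled into a history context. Everything here is an illustrative assumption — the class name, dimensions, and the attention form are hypothetical, and the actual HAMLET memory module is a two-layer Transformer rather than the simple mean pooling used below.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class MomentMemory:
    """Hypothetical sketch of a HAMLET-style memory: each frame's features
    are compressed into a handful of "moment tokens" via query attention,
    stored in a bounded buffer, and fused into a single history context.
    Names, shapes, and the pooling step are assumptions, not the paper's
    exact architecture (which uses a two-layer Transformer for fusion)."""

    def __init__(self, dim=32, num_tokens=4, max_steps=64, seed=0):
        rng = np.random.default_rng(seed)
        # Learned queries that summarize a frame into a few moment tokens
        self.queries = rng.standard_normal((num_tokens, dim)) / np.sqrt(dim)
        self.max_steps = max_steps
        self.buffer = []  # per-step (num_tokens, dim) moment tokens

    def step(self, frame_feats):
        """frame_feats: (seq_len, dim) visual features for the current frame."""
        # Compress: each query attends over the frame's feature sequence
        attn = softmax(self.queries @ frame_feats.T)   # (num_tokens, seq_len)
        moment = attn @ frame_feats                    # (num_tokens, dim)
        self.buffer.append(moment)
        self.buffer = self.buffer[-self.max_steps:]    # bounded history cost
        # Fuse: pool the stored history into one context per token slot
        history = np.stack(self.buffer)                # (steps, num_tokens, dim)
        return history.mean(axis=0)                    # (num_tokens, dim)

mem = MomentMemory()
for _ in range(3):
    ctx = mem.step(np.random.default_rng(1).standard_normal((10, 32)))
print(ctx.shape)  # (4, 32)
```

The bounded buffer is what keeps the cost flat regardless of episode length, in contrast to the frame-stacking approach whose memory footprint grows with history.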
Embodied AI Enters the Real World! RoboChallenge: From Simulation to Physical Robots, the World's First Large-Scale Multi-Task Real-Robot Benchmark
具身智能之心· 2025-10-15 11:03
Core Insights
- The article discusses the launch of RoboChallenge, a large-scale, multi-task benchmark testing platform for embodied intelligence, initiated by Dexmal and Hugging Face and aimed at addressing the lack of real-machine testing in the field [5][41]

Group 1: Challenges in the Embodied Intelligence Field
- The embodied intelligence sector has seen rapid advancements, but the absence of real-machine testing and the limitations of existing evaluation systems have become significant bottlenecks [3][4]
- Current mainstream benchmarks rely primarily on simulation environments, so algorithms that perform well in simulation often fail in real-world applications [4][10]

Group 2: Introduction of RoboChallenge
- RoboChallenge is the first large-scale benchmark platform on which real robots perform tasks in a physical environment, providing a more reliable and comparable evaluation standard for vision-language-action (VLA) models [5][10]
- The platform aims to overcome challenges in validating performance in real environments, standardizing testing conditions, and ensuring accessibility [5][10]

Group 3: Features of RoboChallenge
- RoboChallenge features a "remote robot" paradigm that lets users run real machines without owning hardware, lowering the entry barrier for researchers and developers [15][19]
- The platform supports a wide range of tasks; its initial benchmark set (Table30) comprises 30 diverse tasks designed to evaluate the core capabilities of VLA models [12][26]

Group 4: Evaluation Mechanism
- The evaluation mechanism combines end-to-end task success rates with process scoring, ensuring a rigorous and transparent assessment of models [16][20]
- RoboChallenge employs a "visual input matching" method to keep testing conditions consistent, reducing variability introduced by human testers [23][25]

Group 5: Open and Collaborative Ecosystem
- RoboChallenge promotes an open ecosystem by providing free access to evaluation services, publicly sharing task demonstration data, and ensuring transparency of results [34][41]
- The platform encourages collaboration among researchers, developers, and industry professionals, fostering innovation in the field of embodied intelligence [38][41]

Group 6: Future Directions
- RoboChallenge plans to expand by introducing more robot types and more challenging tasks, aiming to strengthen the evaluation of embodied intelligence in real-world scenarios [42]
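The evaluation mechanism summarized above blends an end-to-end success flag with process (milestone) credit. A minimal sketch of what such blended scoring could look like follows; the function names, milestone model, and weighting are assumptions for illustration, not RoboChallenge's actual scoring rules.

```python
def episode_score(milestones_hit, total_milestones, task_success):
    """Hypothetical blend of end-to-end success and process scoring:
    a fully successful episode gets full credit, otherwise partial
    credit for completed milestones. The weighting is an assumption."""
    process = milestones_hit / total_milestones if total_milestones else 0.0
    return 1.0 if task_success else process

def benchmark_summary(episodes):
    """episodes: list of (milestones_hit, total_milestones, success) tuples.
    Returns the strict end-to-end success rate and the mean blended score."""
    scores = [episode_score(*ep) for ep in episodes]
    success_rate = sum(1 for _, _, ok in episodes if ok) / len(episodes)
    return success_rate, sum(scores) / len(scores)

# Four trials of one task: two full successes, one partial, one failure
eps = [(3, 3, True), (2, 3, False), (0, 3, False), (3, 3, True)]
sr, avg = benchmark_summary(eps)
print(sr, round(avg, 3))  # 0.5 0.667
```

Reporting both numbers distinguishes a model that fails early from one that completes most subtasks but misses the final goal, which a bare success rate cannot.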
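The "visual input matching" step can be sketched as a check that the live camera frame agrees with a reference image of the canonical scene setup before a trial begins. The metric and threshold below are assumptions — the article does not specify the exact comparison method.

```python
import numpy as np

def scene_matches(reference, current, tol=0.05):
    """Hypothetical sketch of "visual input matching": compare the live
    camera frame to a reference image of the canonical scene layout and
    accept only if the mean absolute pixel difference is under a
    tolerance. Both the metric and tol=0.05 are illustrative assumptions."""
    ref = reference.astype(np.float64) / 255.0
    cur = current.astype(np.float64) / 255.0
    return float(np.abs(ref - cur).mean()) <= tol

# Toy frames: an identical scene passes, a visibly brighter one fails
ref = np.full((4, 4, 3), 128, dtype=np.uint8)
same = ref.copy()
shifted = np.clip(ref.astype(int) + 40, 0, 255).astype(np.uint8)
print(scene_matches(ref, same), scene_matches(ref, shifted))  # True False
```

Gating each trial on such a check is one way to remove the variability that human testers introduce when resetting a physical scene by hand.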