AnywhereVLA: Running VLA in Real Time on Consumer-Grade Hardware
具身智能之心· 2025-09-29 02:08
Core Background and Objectives
- Mobile manipulation is expanding from closed, structured workcells to open, unstructured large indoor environments, requiring robots to explore unfamiliar and cluttered spaces, interact with diverse objects and humans, and respond to natural language commands for tasks such as home service, retail automation, and warehouse logistics [3]
- AnywhereVLA proposes a modular architecture that combines the robustness of classical navigation with the semantic understanding of VLA models to perform language-driven pick-and-place in unknown large indoor environments, running in real time on consumer-grade hardware [3]

Review of Existing Solutions: Advantages and Limitations
- VLA models and lightweight optimization strategies are reviewed, highlighting their limited spatial perception and poor adaptability to large environments [4]
- Existing approaches such as MoManipVLA and SmolVLA come close to the performance of larger models while reducing resource requirements, but they lack the spatial awareness needed for large environments [4]
- The limitations of vision-language navigation (VLN) and classical navigation frameworks are outlined, underscoring the need for better language understanding and semantic reasoning [4]

AnywhereVLA Architecture: Four Core Modules and Workflow
- The AnywhereVLA architecture processes natural language commands through four modules and outputs low-level control commands that drive the base wheels and the robotic arm joints [4]
- The workflow covers parsing the language instruction, actively exploring and building a 3D semantic map, navigating to the identified target, and executing VLA-guided manipulation (a minimal sketch of this four-stage flow follows this summary) [7]

VLA Model Fine-tuning and Hardware Platform
- The SmolVLA model is fine-tuned to strengthen its manipulation capability, with the input data and key fine-tuning steps described for optimizing performance [13][15]
- The HermesBot mobile manipulation platform is purpose-built for AnywhereVLA, balancing sensing and onboard compute [16]

Experimental Results: Performance and Effectiveness Validation
- In an unknown multi-room laboratory environment, 50 pick-and-place tasks were executed with an overall success rate of 46%, while the fine-tuned SmolVLA manipulation module alone reached an 85% success rate [17][22]
- Per-module metrics indicate robust SLAM performance and varying success rates for active exploration, navigation, object detection, and VLA manipulation [22]
- On time efficiency, the average task completion time is under 133 seconds for a 5 m exploration radius, meeting real-time requirements [23]
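The four-module workflow above maps naturally onto a staged pipeline: parse the instruction, explore and build a semantic map, navigate to the target, then hand off to the VLA policy for manipulation. The following is a minimal, hypothetical Python sketch of that flow; all function names, the Target type, and the stubbed return values are illustrative assumptions, not code from the AnywhereVLA authors.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Target:
    label: str                         # object class parsed from the command
    pose: Tuple[float, float, float]   # (x, y, z) in the map frame

def parse_instruction(command: str) -> str:
    """Stage 1: extract the object phrase that conditions navigation and the VLA policy."""
    return command.lower().replace("pick up the", "").strip()

def explore_and_map(radius_m: float) -> List[Target]:
    """Stage 2: active exploration + SLAM producing a 3D semantic map (stubbed here)."""
    return [Target("red cup", (1.2, 0.4, 0.8))]

def navigate_to(target: Target) -> bool:
    """Stage 3: classical planner drives the base toward the target pose (stubbed)."""
    return True

def manipulate(target: Target, instruction: str) -> bool:
    """Stage 4: fine-tuned VLA policy (e.g., SmolVLA) outputs arm joint commands (stubbed)."""
    return True

def run_task(command: str, radius_m: float = 5.0) -> bool:
    label = parse_instruction(command)
    candidates = [t for t in explore_and_map(radius_m) if label in t.label]
    if not candidates:
        return False          # exploration found no matching object
    target = candidates[0]
    return navigate_to(target) and manipulate(target, command)

if __name__ == "__main__":
    print(run_task("Pick up the red cup"))
```

The point of the sketch is the module boundary: classical exploration/navigation handles spatial grounding, and the VLA policy is only invoked once the robot is posed in front of a matched target, which is what lets the system run on consumer-grade hardware.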
VLA-Adapter: New Heights in Robot Intelligence with 0.5B Parameters, No Pre-training Required
具身智能之心· 2025-09-17 03:14
Core Viewpoint
- The VLA-Adapter model, developed jointly by leading institutions, is a breakthrough in Vision-Language-Action (VLA) models for robotics: a lightweight design with 0.5 billion parameters that achieves performance comparable to much larger models, lowering the barriers to training and deploying VLA models in robotic applications [4][11][30]

Summary by Sections

Introduction to VLA-Adapter
- VLA-Adapter was jointly developed by top institutions and is designed to improve how efficiently and intelligently robots understand their environment and execute tasks [4][11]

Challenges in VLA Models
- Traditional VLA models rely on large-scale pre-trained backbones and incur high computational costs, which hinders practical deployment [3][11]

VLA-Adapter's Innovations
- VLA-Adapter introduces a new bridging paradigm that efficiently passes multimodal information into the action space, significantly reducing model size and training cost [11][12]
- The model uses a lightweight backbone with only 0.5 billion parameters yet achieves performance comparable to 7-billion-parameter models, without requiring extensive pre-training on robot datasets [11][12]

Key Technologies
- The innovative Bridge Attention mechanism is central to VLA-Adapter's results, efficiently connecting visual-language representations to action generation (a minimal sketch follows this summary) [12][14]
- Training efficiency is a highlight: training completes in about 8 hours on a single consumer-grade GPU, versus the days or weeks traditional models may require [15][19]

Experimental Validation
- VLA-Adapter shows strong performance across robotic tasks, reaching an average success rate of 97.3% on the LIBERO benchmark and outperforming several baseline models [19][20]
- In zero-shot generalization tasks it achieves an average task completion length of 4.42, indicating strong adaptability to unseen environments [21][22]

Real-World Applications
- The model performs robustly in real-world tasks, including complex operations with a 6-DOF robot, demonstrating potential for industrial automation, smart homes, and medical assistance [23][28]

Future Potential
- The lightweight design and high efficiency make VLA-Adapter a promising option for real-time applications and make it easier for smaller research groups and companies to develop and deploy VLA models [28][30]
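The summary describes Bridge Attention as the link between frozen vision-language features and action generation, without reproducing the exact formulation. Below is a minimal PyTorch sketch of one plausible reading: a small set of learnable action queries cross-attends to vision-language tokens and is projected to action outputs. The dimensions (vl_dim=896, 7-dimensional actions, 8 queries) and the layer layout are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class BridgeAttention(nn.Module):
    """Hypothetical bridging layer: action queries attend to frozen VLM tokens."""

    def __init__(self, vl_dim: int = 896, action_dim: int = 7,
                 num_queries: int = 8, num_heads: int = 8):
        super().__init__()
        # Learnable action queries, shared across the batch.
        self.action_queries = nn.Parameter(torch.randn(num_queries, vl_dim))
        self.cross_attn = nn.MultiheadAttention(vl_dim, num_heads, batch_first=True)
        # Project each bridged query to one action vector (e.g., one step of an action chunk).
        self.to_action = nn.Linear(vl_dim, action_dim)

    def forward(self, vl_tokens: torch.Tensor) -> torch.Tensor:
        # vl_tokens: (batch, seq_len, vl_dim) features from the frozen VLM backbone.
        batch = vl_tokens.size(0)
        queries = self.action_queries.unsqueeze(0).expand(batch, -1, -1)
        bridged, _ = self.cross_attn(queries, vl_tokens, vl_tokens)
        return self.to_action(bridged)   # (batch, num_queries, action_dim)

if __name__ == "__main__":
    layer = BridgeAttention()
    actions = layer(torch.randn(2, 64, 896))
    print(actions.shape)  # torch.Size([2, 8, 7])
```

Keeping the backbone frozen and training only a small bridging head like this is consistent with the reported training budget (hours on a single consumer GPU), since gradients flow only through the adapter parameters.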
Major GPT Update; Hugging Face Releases Open-Source Robot AI Model
Mei Ri Jing Ji Xin Wen· 2025-06-05 00:57
[Market Recap]
On Wednesday (June 4, 2025), as of the close, the 科创人工智能ETF华夏 (589010) rose 0.2%; among its holdings, 奥普特 led with a 4.65% gain, followed by 有方科技 up 2.96% and 金山办公 up 2.72%. The 机器人ETF (562500) rose 0.6%; among its holdings, 亿嘉和 led with a 5.65% gain, with 奥普特 up 4.65% and 绿的谐波 up 4.61% among the top gainers. The day's turnover was 441 million yuan, the highest among ETFs tracking the same index, with a turnover rate of 3.43%, indicating active trading.

[Hot News]
1. At 1:00 a.m. on June 5, OpenAI held a technical livestream announcing major updates to ChatGPT, including a meeting-recording mode for macOS users that can transcribe any meeting, brainstorming session, or voice note, quickly extract the key points, and turn them into new content. Another important feature is official ChatGPT support for the MCP protocol, for example connecting directly to commonly used tools such as GitHub and SharePoint to enable cross-platform data integration, search, and reasoning. In short, OpenAI aims to turn ChatGPT into an intelligent collaboration platform.
2. On June 4, OpenAI announced that its paying enterprise users had surpassed 3 million, explosive growth from the 2 million reported in February, alongside several product updates and upgrades. According to OpenAI, these 3 million users ...