Vision-Language-Action (VLA) Models
What room for evolution remains in whole-body motion control schemes that let robots "dance better"?
具身智能之心· 2026-01-04 00:32
Core Insights
- The article discusses advancements in reinforcement learning (RL) and its integration with various models, particularly in the context of embodied intelligence and robotics. It highlights the importance of data quality for pretraining models and the innovative approaches being developed to enhance RL training paradigms [3][4][5].

Group 1: Reinforcement Learning Innovations
- The discussion emphasizes the standardization of training paradigms in RL, particularly imitation learning followed by reinforcement learning in simulated environments [3][4].
- A significant point is the introduction of the Simple Policy Optimization (SPO) algorithm, recognized in the context of the Pi0.6 model, where it serves as a baseline for RL tasks [3][4].
- The article notes that the data used for pretraining models in different domains, such as language models and autonomous driving, varies significantly, affecting the quality and applicability of the models [4][5].

Group 2: Data Utilization and Challenges
- The article highlights the difficulty of using real-world driving data for pretraining: only about 1% of collected data is suitable for model training due to various imperfections [4][5].
- It discusses the potential of RL to evaluate and exploit suboptimal data, suggesting that even flawed data can contribute to learning, much as humans learn from mistakes [5][6].
- Effective data collection and utilization strategies are needed in embodied intelligence, particularly given the high volume of data discarded during training [5][6].

Group 3: Framework Development
- The article introduces the RLinf framework, designed to support RL for vision-language-action (VLA) models, addressing the limitations of existing frameworks that do not cater to the specific needs of RL in VLA contexts [8][10].
- The framework aims to support various RL methodologies, including on-policy and off-policy learning, and is built to accommodate diverse hardware requirements [10][11].
- Its development is seen as a significant investment, reflecting the growing demand for robust RL tooling in embodied intelligence [10][11].

Group 4: Sim-to-Real Transfer and Practical Applications
- The article discusses the challenges of sim-to-real transfer in robotics, particularly in locomotion and manipulation tasks, where the gap between simulated and real-world performance remains substantial [19][29].
- It highlights 3D generative models as a means to improve the realism of simulations, thereby enhancing the effectiveness of RL training [24][25].
- The integration of advanced perception technologies, such as dual-camera systems, is noted as a promising approach to bridging the sim-to-real gap and improving performance in real-world applications [22][29].
Li Auto Releases Two VLA Robotics Papers in Quick Succession
理想TOP2· 2025-12-02 07:29
On November 24, 2025, Li Auto released Compressor-VLA and AVA-VLA.

Compressor-VLA proposes an efficient visual compression scheme for robotic manipulation scenarios, aimed at the deployment problem that end-to-end models are too heavy and too slow. It teaches the robot to observe with a purpose: language instructions filter out visual "junk", achieving more precise manipulation with less compute.

Today's embodied-intelligence foundation models waste compute severely. Like a compulsive, they burn enormous compute processing irrelevant visual information such as background wallpaper and floor texture, driving inference latency too high to meet robots' real-time control requirements.

A counterintuitive phenomenon: the traditional way to slim a model down is to discard visual tokens outright. Such blind pruning has catastrophic consequences: to cut computation, the model may keep a sharply textured tablecloth pattern while discarding a blurry but critical object edge or handle position.

Existing compression algorithms are task-agnostic. The compressor is effectively blind: it looks only at the image itself, with no idea whether the robot's current task is picking up an apple or closing a drawer. As a result, key task cues are deleted as noise during compression.

AVA-VLA targets the core "forgetfulness" problem of current end-to-end robot foundation models with an engineering-oriented solution.

In more detail, the solution framework adopts a dual-channel complementary structure, like a commander plus a craftsman ...
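The task-aware compression idea described above can be sketched as a language-conditioned top-k token pruning: score each visual token by its similarity to the instruction embedding and keep only the most relevant fraction. This is an illustrative sketch under assumed embeddings and dimensions, not the actual Compressor-VLA algorithm; the function name and shapes are hypothetical.

```python
import numpy as np

def prune_visual_tokens(visual_tokens, instruction_emb, keep_ratio=0.25):
    """Illustrative task-conditioned pruning: keep the visual tokens most
    similar to the language instruction, so task-relevant regions (e.g. a
    drawer handle) survive compression while background texture is dropped."""
    # Cosine similarity between each visual token and the instruction.
    v = visual_tokens / np.linalg.norm(visual_tokens, axis=1, keepdims=True)
    q = instruction_emb / np.linalg.norm(instruction_emb)
    scores = v @ q                        # (num_tokens,)
    k = max(1, int(len(scores) * keep_ratio))
    keep = np.argsort(scores)[-k:]        # indices of the k most relevant tokens
    return visual_tokens[np.sort(keep)]   # preserve original spatial order

rng = np.random.default_rng(0)
tokens = rng.normal(size=(64, 32))        # 64 visual tokens, 32-dim each
instr = rng.normal(size=32)               # instruction embedding
compressed = prune_visual_tokens(tokens, instr, keep_ratio=0.25)
print(compressed.shape)                   # (16, 32): 75% of tokens dropped
```

A task-agnostic compressor, by contrast, would score tokens only on image statistics (e.g. feature magnitude), which is exactly how a sharp tablecloth pattern can outrank a blurry handle.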
Li Auto Finds Language Has a Larger Impact Than Vision on VLA Action Accuracy
理想TOP2· 2025-08-16 12:11
Core Viewpoint
- The article discusses the release of DriveAction, a benchmark for evaluating Vision-Language-Action (VLA) models, emphasizing that both visual and language inputs are needed for accurate action prediction [1][3].

Summary by Sections

DriveAction Overview
- DriveAction is the first action-driven benchmark specifically designed for VLA models, containing 16,185 question-answer pairs generated from 2,610 driving scenarios [3].
- The dataset is derived from real-world driving data collected from mass-produced assisted-driving vehicles [3].

Model Performance Evaluation
- The experiments indicate that even the most advanced vision-language models (VLMs) require guidance from both visual and language inputs for accurate action prediction: average accuracy drops by 3.3% without visual input, 4.1% without language input, and 8.0% when both are absent [3][6].
- In the comprehensive evaluation modes, all models achieved their highest accuracy in the full V-L-A mode and their lowest in the no-information mode (A) [6].

Specific Task Performance
- Performance metrics for specific tasks such as navigation, efficiency, and dynamic/static tasks show varying strengths among the models [8].
- For instance, GPT-4o scored 66.8 on navigation-related visual questions, 75.2 on language questions, and 78.2 on execution questions, highlighting the models' diverse capabilities [8].

Stability Analysis
- Stability was assessed by repeating each setting three times and computing means and standard deviations. GPT-4.1 mini and Gemini 2.5 Pro exhibited strong stability, with standard deviations below 0.3 [9].
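The stability protocol above amounts to a three-run mean/standard-deviation summary per evaluation setting, with "stable" informally meaning a standard deviation below 0.3. A minimal sketch with hypothetical scores (not the benchmark's actual numbers):

```python
import statistics

# Each setting is evaluated three times; the model is summarized by the
# mean and sample standard deviation of its scores. All values hypothetical.
runs = {
    "model_a": [75.1, 75.3, 75.2],  # tight spread -> stable
    "model_b": [70.4, 71.9, 69.1],  # wide spread  -> unstable
}

for name, scores in runs.items():
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)   # sample std over the 3 repeats
    stable = std < 0.3               # the article's informal stability bar
    print(f"{name}: mean={mean:.2f} std={std:.2f} stable={stable}")
```

Reporting the standard deviation alongside the mean matters here because a single run of an LLM-based evaluator can fluctuate by more than the gaps between models.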