No VLA Needed! From Video Generation Models to Robot Control

Core Insights

- The article discusses a new paradigm in embodied intelligence: using video generation for robot control, specifically through a model called LVP (Large Video Planner) [8][12][18].

Group 1: Model Architecture and Contributions

- The LVP model has 14 billion parameters and is designed for embodied decision-making, using video data to strengthen robot control [18].
- Rather than relying solely on scarce, high-quality robot action data, the model draws on the vast amount of human activity video available online, which is rich in information about physical interactions [11][19].
- Key innovations include the Diffusion Forcing and History Guidance techniques, introduced to improve the accuracy and coherence of video generation so that generated videos are physically consistent and relevant to the robot's current state [24][26].

Group 2: Dataset and Training

- The LVP-1M dataset, comprising approximately 1.4 million video clips, was constructed specifically for training the model. It combines diverse sources: robot data, egocentric human data, and general internet videos [29][30].
- The dataset covers many types of interactions and scenarios, which improves the model's ability to generalize across tasks and environments [30][31].

Group 3: Action Extraction and Execution

- A visual action extraction pipeline translates generated videos into executable robot movements without requiring additional training [32].
- The pipeline includes detailed action descriptions and aligns the timing of robot movements with the human actions in the video to ensure smooth execution [34].

Group 4: Performance and Testing

- In real-world tasks, the LVP model outperformed existing video generation models and robot policy models, achieving higher success rates on novel tasks [41][42].
- The model's zero-shot generalization allows it to perform tasks it has never encountered before, such as tearing tape and scooping coffee beans [42].

Group 5: Limitations and Future Directions

- The article acknowledges several limitations: slow video generation, reliance on external components for action extraction, and the challenges of open-loop execution [48].
- Future work aims to add real-time closed-loop control and to further improve the model's understanding of the physical world [48].
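
The article names Diffusion Forcing and History Guidance under Group 1 without describing their mechanics. As these techniques are generally described, Diffusion Forcing gives each frame its own noise level instead of one shared level for the whole clip, and History Guidance applies a classifier-free-guidance style combination of predictions made with and without the history frames. The sketch below illustrates only those two ideas on toy data. All function names, the noise parameterization, and the choice to keep history frames completely clean are assumptions made for illustration; none of this is taken from the LVP implementation.

```python
import math
import random


def sample_noise_levels(num_frames, num_history, rng):
    """Return one noise level per frame (hypothetical helper).

    Assumption: observed history frames stay clean (level 0.0), while each
    future frame draws an independent level in [0, 1).
    """
    if not 0 <= num_history <= num_frames:
        raise ValueError("num_history must be between 0 and num_frames")
    future = [rng.random() for _ in range(num_frames - num_history)]
    return [0.0] * num_history + future


def noise_frames(frames, noise_levels, rng):
    """Mix each frame with Gaussian noise at its own level.

    Level 0.0 keeps the frame unchanged; levels near 1.0 are mostly noise.
    Frames are plain lists of floats standing in for image latents.
    """
    if len(frames) != len(noise_levels):
        raise ValueError("need exactly one noise level per frame")
    noisy = []
    for frame, level in zip(frames, noise_levels):
        signal_scale = math.sqrt(1.0 - level)
        noise_scale = math.sqrt(level)
        noisy.append(
            [signal_scale * x + noise_scale * rng.gauss(0.0, 1.0) for x in frame]
        )
    return noisy


def history_guided(pred_with_history, pred_without_history, weight):
    """Combine two denoiser predictions, classifier-free-guidance style.

    weight == 1.0 returns the history-conditioned prediction; larger
    weights push the result further toward agreement with the history.
    """
    return [
        u + weight * (c - u)
        for c, u in zip(pred_with_history, pred_without_history)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    clip = [[0.5, -0.5, 1.0] for _ in range(6)]  # 6 toy frames, 3 values each
    levels = sample_noise_levels(num_frames=6, num_history=2, rng=rng)
    noisy_clip = noise_frames(clip, levels, rng)
    print("noise levels:", [round(level, 3) for level in levels])
    print("history frames unchanged:", noisy_clip[:2] == clip[:2])
```

In a real system the two predictions passed to `history_guided` would come from a trained video denoiser evaluated with and without the observed frames; here they are just lists of numbers so the combination rule can be inspected in isolation.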