A Sober Look at VLA: Neither a Savior nor "Garbage"

Core Viewpoint
- The article critiques the VLA (Vision-Language-Action) approach: it has real merits, but also significant limitations that must be addressed before it performs well in complex environments [1].

Group 1: Challenges and Limitations
- The main challenge is getting models to generalize effectively [2].
- Current models struggle in complex environments because task settings are simplistic, often limited to "grab-and-drop" scenarios with minimal obstacles [6].
- Reliance on large datasets, together with the black-box nature of these systems, makes it hard to understand what the models can actually do [6].

Group 2: Proposed Solutions
- Designing effective subgoal embeddings is crucial for generalization; one option is a cross-attention mechanism that links task text tokens with image patch tokens [3][4].
- Learning-based methods may outperform traditional methods in complex environments, because they can adapt to errors in visual observation and continuously correct their actions [4].
- An explicit VLA approach is recommended, in which a large model breaks a task down into subgoals, giving a clearer structure and reducing training requirements [8].
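
To make the cross-attention idea concrete, here is a minimal sketch of how task text tokens could attend over image patch tokens to produce one visually grounded vector per text token. It is an illustration under stated assumptions, not the article's implementation. The function names, the token counts, the embedding size, and the omission of learned query/key/value projections are all hypothetical choices. Plain Python lists are used so the sketch has no dependencies.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(text_tokens, patch_tokens):
    """Each text token (query) attends over all image patches (keys/values).

    Returns (embeddings, attention_weights). Learned projection matrices
    are omitted; this is an assumption made to keep the sketch short.
    """
    dim = len(text_tokens[0])
    scale = 1.0 / math.sqrt(dim)  # scaled dot-product attention
    embeddings, weights = [], []
    for query in text_tokens:
        scores = [
            scale * sum(q * k for q, k in zip(query, patch))
            for patch in patch_tokens
        ]
        attn = softmax(scores)
        # Weighted sum of patch vectors: the visual context for this text token.
        context = [
            sum(w * patch[d] for w, patch in zip(attn, patch_tokens))
            for d in range(dim)
        ]
        embeddings.append(context)
        weights.append(attn)
    return embeddings, weights


# Hypothetical toy inputs: 3 task-text tokens and a 4x4 grid of image patches.
random.seed(0)
DIM = 8
text_tokens = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(3)]
patch_tokens = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(16)]

subgoal_embeddings, attention = cross_attention(text_tokens, patch_tokens)
print(len(subgoal_embeddings), len(subgoal_embeddings[0]))
```

Each row of `attention` is a distribution over patches, so it can be inspected to see which image regions a given word (for example, the object named in the task) is linked to. That inspectability is one reason an explicit subgoal representation can be easier to reason about than an end-to-end black box.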