How should we view current VLA-based embodied intelligence technology? Is VLA still "feeble-minded"?

Core Viewpoint
- The article critiques the VLA (Vision-Language-Action) framework, arguing that it is fundamentally flawed and overly simplistic, and that it focuses primarily on trivial tasks that do not reflect real-world complexity [1][18].

Group 1: VLA Framework Limitations
- VLA is essentially an upgraded version of BC (Behavior Cloning) with minimal innovation, which leads to misleading success rates [1][2].
- The tasks selected for VLA are overly simple, often limited to basic pick-and-place actions, and do not demonstrate real versatility or effectiveness [3][4].
- The framework relies on 2D scenarios and so fails to account for the 3D nature of real-world environments, which limits its applicability [10][11].

Group 2: Data and Performance Issues
- VLA requires an excessive amount of data even for simple tasks, which undermines its efficiency and practicality [14][15].
- The success rates reported for VLA tasks are artificially inflated by the simplicity of the chosen tasks, and claims of 100% success are misleading [5][6].
- The framework is unclear about its own capabilities, making it difficult to determine which tasks it can perform at each stage of development [16][17].

Group 3: Overall Critique
- The article argues that VLA is a superficial approach to AI that lacks depth in understanding and modeling real-world tasks and environments [18][19].
- The author expresses frustration with the lack of meaningful progress in VLA and suggests that it is a product of laziness and opportunism within the AI community [18][20].