Targeting a core bottleneck in embodied intelligence, Gao Yang's team at 千寻智能 (Qianxun Intelligence) proposes Point-VLA: the first to achieve precise execution of language instructions via visual grounding
机器之心·2026-03-31 02:59

Core Insights
- The article discusses the limitations of traditional Vision-Language-Action (VLA) models in accurately interpreting complex spatial instructions and proposes a new method, Point-VLA, to overcome these challenges [5][27].

Group 1: Limitations of Traditional VLA Models
- Language often fails to express certain spatial scenarios accurately, leading to ambiguity in communication [6][8].
- Even when detailed descriptions are provided, VLA models struggle to generalize and execute complex spatial commands, resulting in low success rates [7][20].
- Advanced vision-language models (VLMs) can achieve 60-70% accuracy in locating targets from complex text descriptions, but text-only VLA models succeed only around 25% of the time [14][9].

Group 2: Introduction of Point-VLA
- Point-VLA introduces visually grounded instructions by overlaying bounding boxes on images, allowing robots to understand commands more intuitively, much as humans point at objects [10][11].
- This method combines high-level intentions expressed in language with precise spatial information encoded visually, enhancing the model's performance [12][15].

Group 3: Experimental Results
- Point-VLA achieved an average success rate of 92.5% across a range of challenging tasks, significantly outperforming the 32.4% success rate of traditional text-only VLA models [20][19].
- In specific tasks such as cluttered-scene grasping, Point-VLA raised the success rate from 43.3% to 94.3%, demonstrating its effectiveness in real-world applications [20][23].

Group 4: Data Annotation and Scalability
- An automated data annotation pipeline efficiently generates visual grounding signals, reducing the cost of acquiring training data [18][27].
- As training data increases, Point-VLA's performance continues to improve, while traditional text-only VLA models reach a performance plateau [25][30].
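The visually grounded instruction described in Group 2 can be illustrated with a minimal sketch: mark a bounding box directly on the observation image, then pair the annotated image with a short language command. This is not the team's actual implementation; the image is a toy 2D pixel grid, and the box coordinates are assumed to come from an upstream VLM grounding step.

```python
def overlay_box(image, box):
    """Mark a bounding-box outline on a grayscale observation so the
    policy receives the target location visually rather than as text.

    `image` is a 2D list of pixel values; `box` is (top, left, bottom,
    right) in pixel coordinates, assumed to be produced by an upstream
    grounding model (hypothetical here)."""
    top, left, bottom, right = box
    annotated = [row[:] for row in image]  # copy; keep the original intact
    for x in range(left, right + 1):       # horizontal edges
        annotated[top][x] = 255
        annotated[bottom][x] = 255
    for y in range(top, bottom + 1):       # vertical edges
        annotated[y][left] = 255
        annotated[y][right] = 255
    return annotated

# Usage: an 8x8 blank observation with a hypothetical target box.
obs = [[0] * 8 for _ in range(8)]
marked = overlay_box(obs, (2, 2, 5, 5))
# The VLA input then pairs `marked` with a short high-level command,
# e.g. "pick up the marked object", instead of a long spatial description.
```

The point of the design, as the article describes it, is that the box carries the precise spatial information while the language carries only the high-level intent, so neither channel has to encode what the other does better.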
Group 5: Implications for Future Development
- Point-VLA addresses a fundamental issue in the VLA field by bypassing the limitations of language expression, paving the way for new advances in VLA models [27].
- The demonstrated capabilities of Point-VLA provide a technical foundation for practical applications in industrial and service sectors, highlighting the effectiveness of human-like interaction methods in human-robot collaboration [27][29].
