从 2D 感知到 3D 预测：GeoPredict 重构VLA模型的几何推理能力

Core Insights - The article discusses the GeoPredict framework, which addresses limitations in existing Visual-Language-Action (VLA) models by integrating predictive kinematics and 3D Gaussian geometry for enhanced robotic manipulation capabilities [2][3][17]. Group 1: Technical Challenges - Existing VLA models are limited by three core challenges: lack of spatial modeling, insufficient long-term prediction, and efficiency contradictions in inference [3][4][5]. - Traditional models operate in a 2D-centric reactive decision-making paradigm, failing to provide explicit 3D geometric modeling necessary for precise task execution [3]. - Reactive strategies rely on instantaneous observations, which do not capture motion inertia and dynamic scene evolution, making them inadequate for long-term manipulation tasks [4]. Group 2: GeoPredict Framework Design - GeoPredict employs a three-layer technical architecture: kinematic prediction, geometric modeling, and attention fusion, which injects future-aware geometric priors into VLA models without increasing inference burden [6]. - The first layer focuses on trajectory-level kinematic prediction, capturing future inertia by encoding motion history and predicting multi-step trajectories [8]. - The second layer utilizes predictive 3D Gaussian geometry to model dynamic scene evolution effectively [8]. - The third layer implements block-level causal attention to ensure efficient information flow across different types of tokens [8]. Group 3: Performance Validation - GeoPredict has demonstrated superior performance across various benchmarks, significantly surpassing existing methods in both simulated and real-world tasks [10][14]. - In the RoboCasa benchmark, GeoPredict achieved an average success rate of 52.4%, improving by 10.1% over baseline models [10]. - In real-world experiments, GeoPredict achieved success rates of 85.0% in spatial tasks, 95.0% in geometric tasks, and 90.0% in robustness tasks, showcasing its 3D reasoning capabilities [18]. Group 4: Future Directions - The framework has potential for expansion, including integrating multi-attribute Gaussian representations and optimizing real-time performance through model compression techniques [17][18]. - Future work may also explore adaptive prediction horizons to balance long-term task performance while addressing cumulative error issues [18].