Core Insights
- The article presents Fast-in-Slow (FiS-VLA), a new dual-system vision-language-action model that unifies high-frequency action execution with slower, complex reasoning for robotic control [4][29].

Group 1: Research Background and Challenges
- The goal of a robotic manipulation system is to generate precise control signals from sensor inputs and language instructions in complex environments. Large vision-language models (VLMs), however, are constrained by their parameter counts and slow inference, which limits their use in high-frequency control tasks [7].
- The research draws on Kahneman's dual-system theory, in which System 1 performs fast, intuitive decision-making and System 2 performs slower, deliberate reasoning. Earlier methods attempted dual-system structures but lacked efficient collaboration between the two systems [8][9].

Group 2: FiS-VLA Architecture and Design
- FiS-VLA reconstructs the last few Transformer layers of the VLM into a System 1 execution module and embeds it within System 2, forming a single unified model for efficient reasoning and control. System 2 processes 2D images and language instructions at low frequency, while System 1 responds to real-time sensory inputs at high frequency [11][13].
- The architecture combines a visual encoder, a lightweight 3D tokenizer, a large language model (LLaMA2-7B), and several MLP modules for modality fusion and diffusion-based action modeling. This design lets System 1 inherit pretrained knowledge while sustaining high-frequency execution [13].

Group 3: Dual-System Collaboration
- FiS-VLA pairs a slow System 2, which converts task-related visual observations and language instructions into high-dimensional features, with a fast System 1, which takes the current sensory inputs together with periodically updated System 2 features and outputs actions in real time [14][15].
- The model uses asynchronous sampling to control the operating frequencies of the two systems, ensuring temporal consistency in action generation (a minimal sketch of this loop follows the summary) [14].

Group 4: Performance Evaluation
- In simulation, FiS-VLA reached an average success rate of 69% on RLBench tasks, outperforming CogACT (61%) and π0 (55%), with a control frequency of 21.9 Hz, more than double that of CogACT [17].
- On real robot platforms (Agilex and AlphaBot), FiS-VLA achieved average success rates of 68% and 74% across eight tasks, clearly surpassing the π0 baseline [19].
- In generalization tests with unseen objects, complex backgrounds, and lighting changes, FiS-VLA showed a smaller accuracy drop than π0 [21].

Group 5: Ablation Studies and Future Directions
- Ablation studies indicate that System 1 performs best when sharing two Transformer layers with System 2, and that the optimal collaboration frequency ratio between Systems 1 and 2 is 1:4. When predicting eight actions at once, the theoretical control frequency reaches 117.7 Hz [23].
- The article concludes that FiS-VLA innovatively merges reasoning and control within a unified VLM, achieving high-frequency, high-precision, and strongly generalizing robotic manipulation. Future work may dynamically adjust the shared structure and the collaboration frequency strategy to improve adaptability and robustness in real-world tasks [29].
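To make the asynchronous dual-system collaboration concrete, here is a minimal Python sketch of such a control loop. The `SlowSystem2`, `FastSystem1`, and `DummyEnv` classes and the `control_loop` function are hypothetical stand-ins, not the authors' implementation; only the slow-reasoning/fast-acting split and the 1:4 update ratio come from the article, while FiS-VLA's shared Transformer layers and diffusion action head are omitted.

```python
# Minimal sketch of an asynchronous dual-system control loop, assuming
# hypothetical SlowSystem2 / FastSystem1 modules. Only the slow/fast split and
# the 1:4 update ratio reflect the article; FiS-VLA's shared Transformer layers
# and diffusion action head are not reproduced here.
from typing import Any, Dict, List


class SlowSystem2:
    """Low-frequency reasoning: 2D image + language instruction -> latent features."""

    def reason(self, image_2d: Any, instruction: str) -> Dict[str, Any]:
        # Placeholder for a full VLM forward pass (e.g., a LLaMA2-7B backbone).
        return {"latent": (image_2d, instruction)}


class FastSystem1:
    """High-frequency execution: cached latent + current sensory input -> action."""

    def act(self, latent: Dict[str, Any], sensors: List[float]) -> List[float]:
        # Placeholder for the lightweight execution head (e.g., a diffusion decoder).
        return [0.0] * 7  # e.g., a 7-DoF end-effector command


class DummyEnv:
    """Stand-in environment so the sketch runs end to end."""

    def observe(self) -> Dict[str, Any]:
        return {"image_2d": None, "sensors": [0.0] * 6}

    def step(self, action: List[float]) -> None:
        pass


def control_loop(env: DummyEnv, instruction: str, steps: int = 8, slow_every: int = 4) -> None:
    """Run System 2 once every `slow_every` System 1 ticks (asynchronous sampling)."""
    system2, system1 = SlowSystem2(), FastSystem1()
    cached_latent: Dict[str, Any] = {}
    for t in range(steps):
        obs = env.observe()
        if t % slow_every == 0:
            # Slow path: refresh task-level reasoning features at low frequency.
            cached_latent = system2.reason(obs["image_2d"], instruction)
        # Fast path: generate an action at every control tick from cached features.
        action = system1.act(cached_latent, obs["sensors"])
        env.step(action)


if __name__ == "__main__":
    control_loop(DummyEnv(), "pick up the red block")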
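```

In the actual model, System 1 is not a separate module but the last few Transformer layers of the VLM itself, so the two paths share weights, and the fast path predicts chunks of actions (up to eight at once) rather than a single action per tick.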
Mimicking the brain's functional specialization! Peking University and CUHK release Fast-in-Slow VLA, letting "fast action" and "slow reasoning" collaborate in one unified model
机器之心·2025-07-12 02:11