Workflow
Stable Training, Data Efficient: Tsinghua University Proposes SAC Flow, a New Reinforcement Learning Method for Flow Policies
机器之心·2025-10-18 05:44

Core Insights

- The article introduces a new scheme for training flow-based policies with SAC, a data-efficient off-policy reinforcement learning algorithm, optimizing the real flow policy end-to-end without surrogate objectives or policy distillation [2][10].

Group 1: Research Background

- Flow-based policies have become popular in robot learning because they can model multi-modal action distributions and are simpler than diffusion policies, which has led to their wide use in advanced VLA models [4].
- Previous work has trained flow policies with on-policy RL algorithms, but switching to data-efficient off-policy methods such as SAC is often unstable: backpropagating through the multi-step sampling chain tends to cause gradient explosion [4][5].

Group 2: Methodology

- The proposed approach treats training a flow policy as training a recurrent neural network (RNN), which allows modern recurrent architectures such as GRU and Transformer to be used to stabilize training [7][11]; a minimal sketch of this sequential view appears after this summary.
- SAC Flow injects Gaussian noise with a drift correction at each rollout step so that the terminal action distribution is unchanged, which lets SAC's actor/critic losses be written in terms of the log-likelihood of the flow policy's multi-step sampling [15]; see the rollout and actor-loss sketches below.

Group 3: Training Paradigms

- Two training paradigms are supported:
  - From-scratch training for dense-reward tasks, where SAC Flow is trained directly [16].
  - Offline-to-online training for sparse-reward tasks, where the policy is pre-trained on a dataset and then fine-tuned online [19]; an offline pre-training sketch is given below.

Group 4: Experimental Results

- In experiments, both Flow-G and Flow-T achieved state-of-the-art performance on MuJoCo environments, showing stable training and high sample efficiency [22][24].
- The results indicate that SAC Flow is robust to the number of sampling steps K, training stably across a range of K values, with Flow-T showing particularly strong robustness [30].

Group 5: Comparison with Similar Works

- Unlike FQL/QC-FQL, which distill the flow policy into a single-step model before off-policy RL training, SAC Flow keeps the flow policy's full modeling capacity and requires no distillation [33].
- SAC Flow-T and Flow-G converged faster and reached higher final returns than diffusion-policy baselines and other flow-based methods across a range of environments [34][35].

Group 6: Conclusion

- The key attributes of SAC Flow are its sequential formulation, stable training, and data efficiency, borrowing from GRU and Transformer designs to stabilize gradient backpropagation [37].
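The sequential view of the flow policy described in Group 2 can be illustrated with a short sketch. The code below rolls a flow policy forward for K Euler steps from Gaussian noise, injects Gaussian noise at each step so every step becomes a conditional Gaussian, and accumulates the per-step log-probabilities. It is a minimal sketch assuming a plain MLP velocity field and an Euler–Maruyama-style noisy step; `VelocityNet`, `noise_std`, and the noise schedule are illustrative assumptions and do not reproduce the paper's exact drift correction.

```python
# Minimal sketch (not the authors' code): a K-step flow-policy rollout treated as a
# sequential model. Injecting Gaussian noise makes each Euler step a conditional
# Gaussian, so its log-probability can be accumulated across steps -- the quantity
# SAC's actor/critic losses need from the flow policy.
import torch
import torch.nn as nn


class VelocityNet(nn.Module):
    """Velocity field v_theta(a_k, s, t) -- a simple MLP stand-in (assumption)."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, a, s, t):
        return self.net(torch.cat([a, s, t], dim=-1))


def noisy_flow_rollout(vel: VelocityNet, obs: torch.Tensor, act_dim: int,
                       K: int = 8, noise_std: float = 0.1):
    """Roll the flow forward K steps from Gaussian noise; return the final action
    and the summed per-step log-likelihood (used as SAC's entropy term)."""
    B = obs.shape[0]
    dt = 1.0 / K
    a = torch.randn(B, act_dim)            # a_0 ~ N(0, I)
    log_prob = torch.zeros(B)
    for k in range(K):
        t = torch.full((B, 1), k * dt)
        drift = vel(a, obs, t)
        mean = a + drift * dt               # deterministic Euler step
        std = noise_std * (dt ** 0.5)       # illustrative noise scale per step
        a_next = mean + std * torch.randn_like(a)
        step_dist = torch.distributions.Normal(mean, std)
        log_prob = log_prob + step_dist.log_prob(a_next).sum(dim=-1)
        a = a_next
    # Action squashing (e.g., tanh) and its log-det Jacobian correction are omitted.
    return a, log_prob
```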
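Given such a rollout, the multi-step log-likelihood drops into SAC's usual actor objective. The sketch below shows that substitution with hypothetical twin critics `q1`, `q2` and temperature `alpha`; apart from using the rollout's log-probability as the entropy term, this is standard SAC rather than the paper's exact implementation.

```python
# Continuing the previous sketch: the flow policy's multi-step log-likelihood
# replaces the Gaussian-head log-prob in a vanilla SAC actor update.
import torch


def actor_loss(vel, q1, q2, obs, act_dim, alpha: float = 0.2, K: int = 8):
    """Hypothetical critics q1(obs, action), q2(obs, action) return (B, 1) values."""
    action, log_prob = noisy_flow_rollout(vel, obs, act_dim, K=K)
    q = torch.min(q1(obs, action), q2(obs, action)).squeeze(-1)
    # Maximize Q minus the entropy-weighted log-prob, exactly as in SAC,
    # except log_prob comes from the K-step flow rollout.
    return (alpha * log_prob - q).mean()
```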
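The article's Flow-G variant parameterizes the velocity with a GRU-like gated update so that gradients stay stable across the K sequential steps. The sketch below shows one plausible gated form; the layer sizes and the exact gating are assumptions, not the paper's architecture.

```python
# Sketch of a GRU-style gated velocity parameterization (in the spirit of "Flow-G").
# Gating the per-step update the way a GRU gates its hidden state is one way to
# keep gradients bounded through many sequential steps.
import torch
import torch.nn as nn


class GatedVelocityNet(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        in_dim = obs_dim + act_dim + 1
        self.gate = nn.Sequential(nn.Linear(in_dim, hidden), nn.SiLU(),
                                  nn.Linear(hidden, act_dim), nn.Sigmoid())
        self.cand = nn.Sequential(nn.Linear(in_dim, hidden), nn.SiLU(),
                                  nn.Linear(hidden, act_dim), nn.Tanh())

    def forward(self, a, s, t):
        x = torch.cat([a, s, t], dim=-1)
        z = self.gate(x)        # update gate in (0, 1)
        h = self.cand(x)        # candidate action update
        # With the caller's Euler step, a_{k+1} = a + dt * z * (h - a):
        # a GRU-like partial interpolation from a toward the candidate h.
        return z * (h - a)
```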
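For the offline-to-online paradigm in Group 3, the offline stage can be sketched as ordinary flow-matching regression on a demonstration dataset before online SAC fine-tuning takes over. The linear interpolation path and batch format below are common flow-matching choices used here as assumptions, not necessarily the paper's exact pre-training objective.

```python
# Sketch of the offline pre-training stage: standard flow-matching regression of
# the velocity field toward the straight-line path from noise to an expert action.
import torch


def flow_matching_pretrain_loss(vel, obs, expert_action):
    B = obs.shape[0]
    noise = torch.randn_like(expert_action)        # a_0 ~ N(0, I)
    t = torch.rand(B, 1)                           # random interpolation time in [0, 1)
    a_t = (1.0 - t) * noise + t * expert_action    # point on the linear path
    target_v = expert_action - noise               # constant target velocity along the path
    pred_v = vel(a_t, obs, t)
    return ((pred_v - target_v) ** 2).mean()
```

After this regression converges on the dataset, the same velocity network is handed to the online SAC Flow loop for fine-tuning.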