Flow Policies

Stable Training, Data Efficient: Tsinghua University Proposes SAC Flow, a New Reinforcement Learning Method for Flow Policies
具身智能之心· 2025-10-20 00:03
Core Viewpoint
- The article introduces a new approach called SAC Flow, which uses the data-efficient off-policy reinforcement learning algorithm SAC to train flow-based policies end-to-end, without surrogate objectives or policy distillation. The method achieves high data efficiency and state-of-the-art performance on various benchmarks [1][4][20].

Group 1: Research Background
- Flow-based policies are gaining popularity in robotic learning because they can model multi-modal action distributions and are simpler than diffusion policies; they are widely used in advanced VLA models [4].
- Previous attempts to train flow policies with off-policy reinforcement learning (RL) often suffered from gradient explosion caused by the multi-step sampling process inherent in flow policies [4][5].

Group 2: Methodology
- SAC Flow treats flow policies as sequential models, allowing modern recurrent structures such as GRU and Transformer to stabilize training and to optimize flow policies directly within an off-policy framework [7][10].
- SAC Flow injects Gaussian noise with a drift correction into each rollout step so that the final action distribution remains unchanged, which allows the actor/critic losses to be expressed with the log-likelihood of the flow policy's multi-step samples [14] (see the code sketch after this summary).

Group 3: Training Paradigms
- Two training paradigms are supported:
  - From-scratch training for dense-reward tasks, where SAC Flow can be trained directly [18].
  - Offline-to-online training for sparse-reward tasks, where pre-training on a dataset is followed by online fine-tuning [18][20].

Group 4: Experimental Results
- SAC Flow-T and Flow-G showed stable and faster convergence in environments such as Hopper, Walker2D, and Ant, achieving state-of-the-art performance [20][21].
- The offline-to-online results showed that SAC Flow maintains stable gradients and prevents gradient explosion, leading to superior performance compared to naive SAC training [24][26].

Group 5: Comparison with Similar Works
- SAC Flow outperforms existing methods such as FlowRL and diffusion-policy baselines in convergence speed and efficiency, particularly on challenging sparse-reward tasks [30][31].
- The method retains the modeling capacity of flow policies without distilling them into single-step models, a common workaround in other methods [31].

Group 6: Key Takeaways
- The key attributes of SAC Flow are serialization, stable training, and data efficiency, enabling off-policy RL algorithms to train flow policies directly and effectively [32].
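The summary above describes SAC Flow's central trick only in prose. The following is a minimal sketch, assuming a K-step Euler discretization of the flow, of how injecting Gaussian noise into each integration step turns the rollout into a stochastic sequential model whose per-step log-probabilities sum into the multi-step log-likelihood that SAC's actor/critic losses need. The network shapes, the noise scale `sigma`, and the omission of SAC Flow's exact drift-correction term are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class NoisyFlowPolicy(nn.Module):
    """Sketch of a flow policy whose K Euler steps are treated as a stochastic
    sequential model (shapes and hyperparameters are placeholders)."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256,
                 K: int = 8, sigma: float = 0.1):
        super().__init__()
        self.act_dim, self.K, self.sigma = act_dim, K, sigma
        # Velocity field v_theta(s, a_k, t_k): the drift applied at each step.
        self.vel = nn.Sequential(
            nn.Linear(obs_dim + act_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, act_dim),
        )

    def sample(self, obs: torch.Tensor):
        """Roll the flow forward K noisy steps; return the final action and
        the accumulated multi-step log-likelihood."""
        batch, dt = obs.shape[0], 1.0 / self.K
        a = torch.randn(batch, self.act_dim, device=obs.device)  # a_0 ~ N(0, I)
        logp = torch.zeros(batch, device=obs.device)
        for k in range(self.K):
            t = torch.full((batch, 1), k * dt, device=obs.device)
            drift = self.vel(torch.cat([obs, a, t], dim=-1))
            mean = a + drift * dt                   # deterministic Euler update
            std = self.sigma * dt ** 0.5            # injected Gaussian noise scale
            a = mean + std * torch.randn_like(a)    # stochastic "recurrent" step
            logp = logp + torch.distributions.Normal(mean, std).log_prob(a).sum(-1)
        return a, logp  # tanh squashing and its Jacobian correction omitted
```

With this interface, the K-step sampler can stand in for the single-step Gaussian actor of an off-the-shelf SAC implementation, which is what makes the end-to-end off-policy training described above possible.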
Stable Training, Data Efficient: Tsinghua University Proposes SAC Flow, a New Reinforcement Learning Method for Flow Policies
机器之心· 2025-10-18 05:44
Core Insights
- The article introduces a new scheme for training flow-based policies with the data-efficient reinforcement learning algorithm SAC, optimizing true flow policies end-to-end without surrogate objectives or policy distillation [2][10].

Group 1: Research Background
- Flow-based policies have gained popularity in robotic learning because they can model multi-modal action distributions and are simpler than diffusion policies, leading to widespread use in advanced VLA models [4].
- Previous work has mostly trained flow policies with on-policy RL algorithms; attempts to use data-efficient off-policy methods such as SAC are often unstable because of gradient explosion during multi-step sampling [4][5].

Group 2: Methodology
- The proposed approach views the multi-step sampling of a flow policy as the unrolling of a recurrent neural network (RNN), allowing modern recurrent structures such as GRU and Transformer to stabilize training [7][11].
- SAC Flow injects Gaussian noise with a drift correction into each rollout step so that the final action distribution remains unchanged, allowing SAC's actor/critic losses to be expressed with the log-likelihood of the flow policy's multi-step samples [15] (see the sketch after this summary).

Group 3: Training Paradigms
- Two training paradigms are supported:
  - From-scratch training for dense-reward tasks, where SAC Flow can be trained directly [16].
  - Offline-to-online training for sparse-reward tasks, where pre-training on a dataset is followed by online fine-tuning [19].

Group 4: Experimental Results
- In experiments, both Flow-G and Flow-T achieved state-of-the-art performance in the MuJoCo environments, demonstrating stability and high sample efficiency [22][24].
- SAC Flow is robust to the number of sampling steps K, maintaining stable training across a range of K values, with Flow-T showing particularly strong robustness [30].

Group 5: Comparison with Similar Works
- Unlike FQL/QC-FQL, which distill flow policies into single-step models before off-policy RL training, SAC Flow retains the full modeling capacity of flow policies without distillation [33].
- SAC Flow-T and Flow-G showed faster convergence and higher final returns across various environments than diffusion-policy baselines and other flow-based methods [34][35].

Group 6: Conclusion
- The key attributes of SAC Flow are serialization, stable training, and data efficiency, leveraging GRU and Transformer structures to stabilize gradient backpropagation [37].
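To make the actor update described in Group 2 concrete, here is a hedged sketch of the standard SAC actor objective with the flow policy's multi-step log-likelihood standing in for log pi(a|s). It reuses a `sample()` interface shaped like the earlier sketch and assumes twin critics `q1`/`q2` taking (obs, action) and a fixed temperature `alpha`; all of these are placeholders rather than the authors' exact implementation.

```python
import torch

def sac_flow_actor_loss(policy, q1, q2, obs, alpha: float = 0.2):
    """SAC actor objective where logp is the K-step flow log-likelihood."""
    action, logp = policy.sample(obs)                          # K noisy flow steps
    q = torch.min(q1(obs, action), q2(obs, action)).squeeze(-1)  # clipped double-Q
    return (alpha * logp - q).mean()                           # entropy-regularized loss
```

Because the critic and temperature terms are unchanged from standard SAC, only the actor's sampling and log-likelihood computation differ, which is consistent with the article's claim that the flow policy is trained directly inside the off-policy framework.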