Multi-Modal Large Model Reinforcement Learning Training Framework: EasyR1 Code Walkthrough (GRPO)
自动驾驶之心 · 2025-07-15 12:30
Core Insights
- The article explores the EasyR1 framework for multi-modal reinforcement learning, focusing on its implementation and the configuration used to train models such as Qwen2.5-VL [1][4][6].

Group 1: Framework Overview
- EasyR1 is derived from verl, a framework built for reinforcement learning on language models, and extends it to multi-modal training [1][6].
- The code walked through is roughly the June 10 snapshot, so details may shift as the project continues to be updated and improved [1].

Group 2: Configuration Details
- The configuration file is organized into four top-level categories: data, algorithm, worker, and trainer, each with its own parameters [6][11] (a sketch of this layout follows this section).
- Data settings cover the paths to the training and validation files, the maximum prompt and response lengths, and the batch size drawn per training iteration [9][10].
- Algorithm settings specify the advantage estimator, the discount factor, and the KL divergence options [11][13].

Group 3: Training Workflow
- Training is launched from a main script that builds the data loaders and enters the training loop [42][43] (a condensed loop sketch appears below).
- Each step prepares a batch, generates sequences, and computes rewards, with particular attention to balancing batch sizes across distributed processes [46][50][64].
- The article stresses the handling of multi-modal data, ensuring the training pipeline accommodates mixed input types such as text and images [65][66].

Group 4: Data Handling
- The dataset must include the keys problem, answer, and images, stored as JSON so the loading functions can consume it [40][41] (an example record is sketched below).
- Data loading supports multiple file formats and is designed to feed a single seamless pipeline into training [41][32].

Group 5: Model Update Mechanism
- The article details the mechanism for updating the actor model: how the policy loss is computed and how gradients are managed during training [82][86] (a loss sketch closes this walkthrough).
- KL divergence against the reference model plays a central role in constraining the policy update [71][80].
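To make the four-part layout from Group 2 concrete, here is a minimal sketch expressed as a Python dict. The key names mirror the verl/EasyR1 YAML convention but may differ across versions, and every value (paths, lengths, batch size, coefficients) is illustrative rather than taken from the article.

```python
# Sketch of the data / algorithm / worker / trainer config layout.
# Key names and values are illustrative, not the exact EasyR1 schema.
config = {
    "data": {
        "train_files": "data/train.jsonl",  # hypothetical path
        "val_files": "data/val.jsonl",      # hypothetical path
        "max_prompt_length": 2048,          # truncation limit for prompts
        "max_response_length": 2048,        # truncation limit for generations
        "rollout_batch_size": 512,          # prompts drawn per training iteration
    },
    "algorithm": {
        "adv_estimator": "grpo",            # group-relative advantage, no value network
        "gamma": 1.0,                       # discount factor
        "kl_coef": 1.0e-2,                  # weight on the KL penalty vs. the reference model
    },
    "worker": {},   # actor / rollout / reference-model placement and parallelism
    "trainer": {},  # epochs, nodes/GPUs, logging, checkpointing
}
```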
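For the algorithm settings above, the GRPO formulation (from the DeepSeekMath paper, which EasyR1's advantage estimator follows) defines the advantage by normalizing rewards within a group of G responses sampled for the same prompt, and penalizes drift from the reference model with a low-variance KL estimator. The article's exact settings may differ; the formulas below are the standard GRPO forms.

```latex
% GRPO advantage: rewards are normalized within each group of G sampled
% responses to the same prompt (no learned value function is needed).
\hat{A}_i = \frac{r_i - \mathrm{mean}(\{r_1,\dots,r_G\})}
                 {\mathrm{std}(\{r_1,\dots,r_G\})}

% Low-variance KL estimator against the reference policy (the "k3" form),
% nonnegative by construction:
\mathbb{D}_{\mathrm{KL}}\big[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big]
  \approx \frac{\pi_{\mathrm{ref}}(o_t \mid q)}{\pi_\theta(o_t \mid q)}
        - \log \frac{\pi_{\mathrm{ref}}(o_t \mid q)}{\pi_\theta(o_t \mid q)} - 1
```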
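The training workflow in Group 3 condenses to the loop below. Every method name here (generate_sequences, compute_log_probs, balance_across_ranks, and so on) is illustrative, standing in for the real EasyR1/verl APIs; only grpo_advantages is fully implemented, following the group-normalization formula above.

```python
import torch


def grpo_advantages(rewards: torch.Tensor, group_size: int) -> torch.Tensor:
    """Group-relative advantages: normalize rewards within each prompt's group."""
    grouped = rewards.view(-1, group_size)
    centered = grouped - grouped.mean(dim=1, keepdim=True)
    return (centered / (grouped.std(dim=1, keepdim=True) + 1e-6)).view(-1)


def train_loop(dataloader, actor, rollout, ref_model, reward_fn,
               group_size: int, total_epochs: int):
    """Condensed sketch of the main training loop; object APIs are hypothetical."""
    for _ in range(total_epochs):
        for batch in dataloader:
            # GRPO samples a group of responses per prompt, so each prompt
            # is repeated group_size times before rollout.
            batch = batch.repeat(group_size)
            sequences = rollout.generate_sequences(batch)

            # Redistribute sequences so each distributed rank receives a
            # comparable token count (the batch-balancing step).
            sequences = sequences.balance_across_ranks()

            ref_logp = ref_model.compute_log_probs(sequences)  # for the KL penalty
            rewards = reward_fn(sequences)                     # rule-based scoring vs. `answer`
            advantages = grpo_advantages(rewards, group_size)  # group-normalized
            actor.update_policy(sequences, advantages, ref_logp)
```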
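A dataset record matching Group 4 might look like the following. The field contents are invented for illustration; only the key names (problem / answer / images) come from the article. EasyR1 builds on the Hugging Face `datasets` library, so a JSON/JSONL file of such records loads in one call (the path is illustrative).

```python
from datasets import load_dataset

# One hypothetical training record in the expected JSON format.
record = {
    "problem": "<image> How many vehicles are visible in this scene?",
    "answer": "3",
    "images": ["images/scene_0001.jpg"],
}

# Load a JSONL file of such records into a training split.
train_ds = load_dataset("json", data_files="data/train.jsonl", split="train")
```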
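Finally, the actor update described in Group 5 typically combines a PPO-style clipped surrogate with the per-token KL penalty against the reference model. The sketch below follows the standard GRPO paper formulation rather than the literal EasyR1 code; the function name, argument names, and default coefficients are assumptions.

```python
import torch


def grpo_policy_loss(logp, old_logp, ref_logp, advantages, mask,
                     clip_eps: float = 0.2, kl_coef: float = 0.01):
    """Sketch of a GRPO actor loss.

    logp / old_logp / ref_logp / mask: (batch, seq_len) tensors;
    advantages: (batch, 1), one group-normalized value broadcast per sequence.
    """
    ratio = torch.exp(logp - old_logp)                       # importance ratio
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    pg_loss = -torch.min(surr1, surr2)                       # clipped surrogate

    # Low-variance k3 estimator of KL(pi_theta || pi_ref), always >= 0.
    log_ratio = ref_logp - logp
    kl = torch.exp(log_ratio) - log_ratio - 1

    loss = pg_loss + kl_coef * kl
    return (loss * mask).sum() / mask.sum()                  # mean over response tokens
```

Keeping the KL term inside the loss (rather than folding it into the reward) is the choice the GRPO paper makes, which matches the article's emphasis on the reference model's role during the actor update.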