Industry Observation: [AI Industry Tracking] Qwen Open-Sources 4B On-Device Model
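Summary: Tracks the latest industry trends and comments on new industry developments. Baidu releases Intelligent Cloud digital employees.
Industry Research Center

| Name | Role | Phone | Email | Registration No. |
| --- | --- | --- | --- | --- |
| 李嘉琪 (Li Jiaqi) | Analyst | 010-83939821 | lijiaqi2@gtht.com | S0880524040001 |
| 刘峰 (Liu Feng) | Research Assistant | 0755-23976068 | liufeng6@gtht.com | S0880124060013 |

Past Issues
- [New Materials Industry Weekly] Asahi Kasei signs an agreement with Toyota to supply lithium battery separators; Tianyi Pengbo and several other new-materials companies complete financing rounds (2025.08.11)
- Long March 12 carrier rocket launched successfully ...

AI Tech Frontier
- World's first general-purpose visual perception system for humanoid robots, Humanoid Occupancy, released
- Brain-inspired computer "Wukong" released
- World's first high-resolution 3D radio map dataset and diffusion-based mapping framework released

Risk Warning
- AI software sales below expectations; changes to capex investment plans; AI product and large-model R&D below expectations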
Industry Observation: [AI Industry Tracking] ByteDance Open-Sources AI Agent Coze
AI Industry Trends
- ByteDance has open-sourced its AI agent platform "Coze," which permits commercial use, has over 6,000 stars on GitHub, and lets users build intelligent agents without coding[14]
- Jieyue (StepFun)'s "Step 3" model has 321 billion total parameters and 38 billion activated parameters, with inference efficiency reported at 300% of DeepSeek-R1's; the company expects revenue of nearly $1 billion in 2025[11]
- Ant Group released the financial reasoning model "Agentar-Fin-R1," which is built on a comprehensive financial dataset and outperforms comparable models on multiple financial evaluations[16]
AI Applications and Platforms
- SenseTime launched the "Wuneng" embodied intelligence platform, featuring a multimodal reasoning model whose cross-modal reasoning accuracy is reported to be 5 times that of Gemini 2.5 Pro[8]
- Huawei introduced the AI-Box platform for lightweight edge deployment, which runs multimodal large models locally with low power consumption[9]
- Tencent's Tairos platform offers modular services for multimodal perception and planning, aimed at strengthening robots' software capabilities[10]
AI Model Developments
- Zhipu AI released the GLM-4.5 model, which integrates reasoning, programming, and agent capabilities and achieves top results among open-source models on global benchmarks[17]
- JD Cloud open-sourced "JoyAgent," an enterprise-grade intelligent agent that supports multi-agent collaboration and has been tested in over 20,000 internal applications[18]
- ByteDance and Nanjing University developed the CriticLean framework, which raises the accuracy of mathematical formalization from 38% to 84%[19]
Market Risks
- Risks include AI software sales falling short of expectations, changes to capital expenditure plans, and slower-than-expected iteration of core AI products[34]
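Training Time Halved, Performance Rises Instead of Falling! Tencent Hunyuan Open-Sources MixGRPO, an Efficient Reinforcement Learning Scheme for Image Generation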
量子位 (QbitAI) · 2025-08-02 08:33
Core Viewpoint
- The article introduces MixGRPO, a new framework that combines Stochastic Differential Equations (SDE) and Ordinary Differential Equations (ODE) to improve the efficiency and performance of reinforcement-learning training for image generation [1][81].
Group 1: MixGRPO Framework
- MixGRPO uses a mixed sampling strategy to simplify optimization of the Markov Decision Process (MDP) formed by the denoising steps, improving both efficiency and performance [1][17].
- The framework improves human preference alignment across multiple dimensions and outperforms DanceGRPO while cutting training time by nearly 50% [2][60].
- MixGRPO-Flash, a faster variant of MixGRPO, further reduces training time by 71% while maintaining similar performance [2][60].
Group 2: Performance Metrics
- MixGRPO achieved a Unified Reward score of 3.418, higher than DanceGRPO's 3.397, indicating better alignment with human preferences [60].
- MixGRPO-Flash averaged 112.372 seconds per training iteration, well below DanceGRPO's 291.284 seconds [60].
Group 3: Sampling Strategy
- MixGRPO uses hybrid sampling: SDE sampling within a defined interval of the denoising process and ODE sampling outside it [14][20].
- Restricting stochastic sampling to this interval reduces computational overhead and optimization difficulty; because the SDE and ODE samplers share the same marginal distributions, switching between them keeps the sampling process consistent [30][81].
Group 4: Sliding Window Strategy
- A sliding window selects which denoising steps are optimized, so that at any point in training the model focuses on a specific subset of time steps [32][35].
- The research team found that the window's key hyperparameters, including window size and shift interval, significantly affect performance [34][70].
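The hybrid SDE/ODE sampling and the sliding window described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Tencent Hunyuan's implementation: the 1-D toy velocity field, the function names `sliding_window` and `mixed_sample`, and the parameters `shift_interval` and `stride` are all hypothetical. The stochastic step is a plain Euler-Maruyama step that adds Gaussian noise on top of the Euler drift; it omits the drift correction a real flow-matching SDE needs to share the ODE's marginals, so it shows only the control flow.

```python
import math
import random
from typing import Callable, List, Tuple

# Hypothetical 1-D velocity field v(x, t); a real model would be a neural network.
Velocity = Callable[[float, float], float]


def sliding_window(iteration: int, window_size: int, shift_interval: int,
                   stride: int, num_steps: int) -> Tuple[int, int]:
    """Return the [start, end) range of denoising steps that use SDE sampling.

    Assumed schedule: the window starts at step 0 and advances by `stride`
    steps every `shift_interval` training iterations, clamped so that it
    stays inside the denoising schedule.
    """
    start = min((iteration // shift_interval) * stride, num_steps - window_size)
    return start, start + window_size


def mixed_sample(x: float, velocity: Velocity, num_steps: int,
                 window: Tuple[int, int], sigma: float,
                 rng: random.Random) -> Tuple[List[float], List[int]]:
    """Integrate from t=1 (noise) toward t=0 with a fixed step size.

    Steps inside `window` are toy Euler-Maruyama (SDE) steps that add
    sigma * sqrt(dt) * N(0, 1) noise; all other steps are deterministic Euler
    (ODE) steps. Returns the trajectory and the indices of stochastic steps,
    i.e. the only steps a GRPO-style update would be computed on.
    """
    dt = 1.0 / num_steps
    start, end = window
    trajectory, sde_steps = [x], []
    for i in range(num_steps):
        t = 1.0 - i * dt
        x = x - velocity(x, t) * dt          # shared deterministic drift
        if start <= i < end:                 # inject noise only inside the window
            x += sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
            sde_steps.append(i)
        trajectory.append(x)
    return trajectory, sde_steps


# Usage: training iteration 50, a 4-step window over 16 denoising steps.
rng = random.Random(0)
window = sliding_window(50, window_size=4, shift_interval=25, stride=1, num_steps=16)
traj, sde_steps = mixed_sample(rng.gauss(0.0, 1.0), lambda x, t: x, 16, window, 0.7, rng)
# window == (2, 6); sde_steps == [2, 3, 4, 5]
```

Only the steps in `sde_steps` are stochastic, so in a GRPO-style update only they would need log-probabilities and gradients; the deterministic ODE steps could run without gradient tracking, which is one plausible source of the reduced overhead the article reports.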
Group 5: High-Order ODE Solvers
- High-order ODE solvers such as DPM-Solver++ speed up sampling during GRPO training, balancing computational cost against performance [45][76].
- Experiments showed that a second-order midpoint method was the best high-order solver setting [76].
Group 6: Experimental Validation
- The experiments used the HPDv2 dataset, which contains diverse prompts, and showed that MixGRPO achieves effective human preference alignment with a limited number of training prompts [49][50].
- Results across several reward models confirmed MixGRPO's robustness, with superior performance in both single-reward and multi-reward settings [56][82].
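To see why a second-order solver can pay off, the sketch below compares first-order Euler with the explicit midpoint method at an equal budget of function evaluations (NFE). It is a generic numerical illustration, not DPM-Solver++ and not the MixGRPO code; the scalar decay ODE dx/dt = -x is an assumed stand-in for a diffusion model's probability-flow ODE.

```python
import math
from typing import Callable

# dx/dt = f(x, t); here an assumed stand-in for a diffusion model's ODE.
ODE = Callable[[float, float], float]


def euler(f: ODE, x: float, t0: float, t1: float, n: int) -> float:
    """First-order Euler method: one function evaluation (NFE) per step."""
    h = (t1 - t0) / n
    for i in range(n):
        x = x + h * f(x, t0 + i * h)
    return x


def midpoint(f: ODE, x: float, t0: float, t1: float, n: int) -> float:
    """Second-order explicit midpoint method: two NFEs per step."""
    h = (t1 - t0) / n
    for i in range(n):
        t = t0 + i * h
        x_half = x + 0.5 * h * f(x, t)        # half step to the interval midpoint
        x = x + h * f(x_half, t + 0.5 * h)    # full step using the midpoint slope
    return x


decay = lambda x, t: -x          # exact solution at t = 1 is e^{-1}
exact = math.exp(-1.0)

# Equal budget of 10 NFEs: 10 Euler steps vs. 5 midpoint steps.
e_err = abs(euler(decay, 1.0, 0.0, 1.0, 10) - exact)
m_err = abs(midpoint(decay, 1.0, 0.0, 1.0, 5) - exact)
print(f"Euler error {e_err:.4f}, midpoint error {m_err:.4f}")
# prints: Euler error 0.0192, midpoint error 0.0029
```

Each midpoint step costs two evaluations, but its error shrinks quadratically with step size, so at the same budget it lands several times closer to the exact value. In a diffusion model every evaluation is a network forward pass, which is why fewer, higher-order steps can shorten sampling.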