Reinforcement Learning
HKUST and collaborators propose WMPO: a world-model-based VLA policy optimization framework
具身智能之心· 2025-11-14 01:02
Core Insights
- The article introduces WMPO (World Model-based Policy Optimization), a framework developed by the Hong Kong University of Science and Technology and the ByteDance Seed team, which enhances sample efficiency, task performance, generalization ability, and lifelong learning for VLA (Vision-Language-Action) models through pixel-level video generation [5][25].

Research Background and Pain Points
- Existing solutions struggle to balance scalability and effectiveness: human intervention requires continuous supervision, and adapting simulators to diverse scenarios is costly [4].
- Traditional latent-space world models are misaligned with web-scale pre-trained visual features and fail to fully leverage pre-trained knowledge [4][6].

Core Framework Design
- WMPO generates trajectories in an "imagination" space using a high-fidelity pixel-level world model, replacing real-environment interactions and supporting stronger on-policy reinforcement learning [5][11].
- The iterative process follows "imagined trajectory generation → trajectory sampling and evaluation → policy update" [5].

Key Modules
- **Generative World Model**: Simulates the dynamics between the robot and the environment, generating visual trajectories aligned with VLA pre-trained features [8].
- **Lightweight Reward Model**: Automatically judges the success or failure of imagined trajectories, providing sparse reward signals and avoiding complex reward shaping [9].
- **On-Policy Optimization (GRPO)**: Adapts Group Relative Policy Optimization to sparse-reward settings, balancing stability and scalability [10].

Core Innovations
- **Pixel-Space Priority**: Generates trajectories directly in pixel space, matching VLA pre-trained visual features and maximizing the value of pre-trained knowledge [11].
- **Trajectory Generation Logic**: Predicts action chunks from initial frames and language instructions, generating subsequent frames iteratively [12].
- **Dynamic Sampling Strategy**: Generates multiple imagined trajectories from the same initial state, filtering out all-success or all-failure groups to ensure informative training samples [12].

Experimental Validation and Key Results
- In simulation, WMPO outperformed baseline methods (GRPO, DPO) across four fine-manipulation tasks, achieving an average success rate of 47.1% with a rollout budget of 128 and 57.6% with a budget of 1280, demonstrating superior sample efficiency [13][14].
- In the real world, WMPO achieved a 70% success rate on a "block insertion" task, significantly higher than baseline policies [15].

Emergent Behaviors
- WMPO exhibits self-correcting behavior, autonomously adjusting actions in response to failure states, whereas baseline policies continue erroneous actions until timeout [17].

Generalization Ability
- WMPO achieved an average success rate of 29.6% in out-of-distribution scenarios, outperforming all baselines, indicating that it learns general manipulation skills rather than spurious visual cues [19][20].

Lifelong Learning
- WMPO improved steadily through iterative trajectory collection, while DPO suffered from instability and required more expert demonstrations [23].

Conclusion and Significance
- WMPO establishes a new paradigm for VLA optimization by integrating world models with on-policy reinforcement learning, addressing the high cost and low sample efficiency of real-environment interaction. It improves performance, generalization, and lifelong learning, paving the way for scalable general-purpose robotic manipulation [25].
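The group-relative advantage computation and the dynamic sampling filter described above can be sketched as follows. This is a minimal illustration under our own naming (`grpo_advantages` is hypothetical, not WMPO's code): sparse 0/1 rewards are normalized within a group, and a group whose outcomes are all-success or all-failure has zero variance and carries no learning signal, so it is filtered out.

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages for sparse 0/1 rewards: normalize within
    the group; a zero-variance group (all success or all failure) yields
    no learning signal, so we return None."""
    r = np.asarray(rewards, dtype=float)
    if r.std() == 0.0:
        return None
    return (r - r.mean()) / r.std()

# Dynamic sampling: keep only groups with mixed outcomes
groups = [[1, 1, 1, 1], [1, 0, 1, 0], [0, 0, 0, 0]]
kept = [g for g in groups if grpo_advantages(g) is not None]
```

In practice this filter matters because degenerate groups would otherwise contribute only zero-advantage (and thus wasted) gradient steps.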
Google DeepMind's latest paper, just published in Nature, reveals its strongest IMO math model
36Kr· 2025-11-13 10:05
Core Insights
- DeepMind's AlphaProof achieved a silver-medal score at the International Mathematical Olympiad (IMO), earning 28 points, just one point shy of gold, a significant milestone in AI's mathematical problem-solving capabilities [3][4][20].

Group 1: AlphaProof's Performance
- AlphaProof is the first AI system to earn a medal-level score at a competition as prestigious as the IMO, demonstrating a leap in AI's ability to tackle complex mathematical challenges [4][20].
- At the 2024 IMO, AlphaProof solved 4 of the 6 problems, including the most difficult one, showcasing advanced problem-solving skills [18][20].
- Its performance is comparable to that of a highly trained high-school contestant at the international level; only about 10% of human participants achieve gold [18][20].

Group 2: Technical Mechanisms
- AlphaProof combines the intuitive reasoning of large language models with reinforcement learning, learning from a dataset of nearly one million mathematical problems [8][10].
- The system uses the Lean formal language for proofs, ensuring every reasoning step is machine-verifiable and free of the errors typical of natural-language models [6][7][10].
- AlphaProof employs a strategy similar to Monte Carlo tree search, breaking complex problems into manageable sub-goals to improve problem-solving efficiency [11][17].

Group 3: Limitations and Future Directions
- Despite these achievements, AlphaProof's efficiency is limited: it took nearly three days on problems human competitors complete in 4.5 hours, leaving room for improvement in speed and resource use [21].
- The AI struggles with problems that require genuinely novel ideas, highlighting the need for better adaptability and generalization [21][23].
- Future work aims to let AlphaProof read natural-language problems directly, removing the need for manual translation into formal statements [23][24].
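As a toy illustration of how value-guided search can break a problem into sub-goals, here is a best-first search over abstract "proof states". The article only says AlphaProof's strategy resembles Monte Carlo tree search; this sketch, and every name in it, is our own simplification, not DeepMind's implementation.

```python
import heapq

def best_first_proof_search(initial_state, expand, is_proved, value, budget=1000):
    """Expand the most promising state first, guided by a learned or
    heuristic value function, until a proved state is found or the
    search budget is exhausted. All names are illustrative."""
    tie = 0  # tiebreaker so the heap never compares states directly
    frontier = [(-value(initial_state), tie, initial_state)]
    seen = {initial_state}
    for _ in range(budget):
        if not frontier:
            return None
        _, _, state = heapq.heappop(frontier)
        if is_proved(state):
            return state
        for nxt in expand(state):  # candidate sub-goals / next proof states
            if nxt not in seen:
                seen.add(nxt)
                tie += 1
                heapq.heappush(frontier, (-value(nxt), tie, nxt))
    return None
```

For example, with integer states, `expand` stepping toward zero, and `value = -abs(state)`, the search walks straight to the goal state 0.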
Nature discloses the technical details of Google's IMO gold-medal model: a core team of only 10 people, generating 80 million math problems in a year to train the AI
36Kr· 2025-11-13 09:01
Core Insights
- Google DeepMind has publicly released the complete technology and training methodology behind its IMO gold-medal model AlphaProof, continuing its tradition of transparency in AI research [1][22].

Group 1: Development and Team Structure
- The AlphaProof team was relatively small, typically about 10 members, with additional personnel joining closer to the IMO competition [3].
- The core breakthrough is attributed to IMO gold medalist Miklós Horváth, who devised a method for creating varied problem variants to train the AI [3][5].

Group 2: Technical Architecture
- AlphaProof's "brain" is a 3-billion-parameter encoder-decoder Transformer, designed to read the current proof state and output tactics plus estimates of the steps needed to complete the proof [8][9].
- The system casts mathematical proving as a game-like environment, built on a reinforcement learning framework around the Lean theorem prover [6].

Group 3: Training Methodology
- Sourcing enough mathematical problems was a challenge; the model was first pre-trained on roughly 300 billion tokens of code and math text [11].
- A dedicated translation system converts natural-language math problems into Lean's formal language, generating around 80 million formalized problems from 1 million natural-language questions [11][14].

Group 4: Performance and Achievements
- AlphaProof performed impressively at the 2024 IMO, solving three problems, including the most difficult one, with 2-3 days of training per problem [19][20].
- Generating related problem variants at test time significantly boosted its problem-solving ability [17][19].

Group 5: Future Directions and Limitations
- Following its success, DeepMind has opened AlphaProof to researchers, who report strengths in finding counterexamples and proving complex statements [22][23].
- Limitations appear when handling custom definitions, indicating dependence on concepts already present in the Mathlib library [24].
- Reliance on the Lean theorem prover poses challenges because Lean itself is still evolving, which may affect AlphaProof's performance in advanced mathematical fields [24].
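The idea of expanding one seed problem into many training variants can be illustrated with a trivial template-substitution sketch. This is purely hypothetical: the article does not describe the actual variant-generation method, and `make_variants` and its parameters are our own invention.

```python
import random

def make_variants(template, slots, n=5, seed=0):
    """Hypothetical sketch: expand one seed problem into n variants by
    substituting sampled values into a parameterized template."""
    rng = random.Random(seed)
    return [
        template.format(**{name: rng.choice(values) for name, values in slots.items()})
        for _ in range(n)
    ]

variants = make_variants(
    "Prove that {a} + {b} is even whenever {a} and {b} are both odd.",
    {"a": [3, 5, 7], "b": [9, 11, 13]},
)
```

The real system works at the level of formal Lean statements rather than English templates, but the multiplication effect (1 million questions to 80 million formal problems) comes from this kind of one-to-many expansion.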
Shock yourself awake when you get drowsy at work? Whoever invented the "electronic coffee wristband" belongs in jail
36Kr· 2025-11-13 08:55
Core Viewpoint
- The emergence of wearables like electric wristbands, which use mild electric shocks to keep users alert and focused, reflects a growing cultural trend that treats exhaustion as a challenge to overcome rather than a signal to rest [1][5][6].

Group 1: Product and Market Trends
- Products such as "wake-up wristbands" and "smart electric-shock wristbands" are growing in popularity, driven by a culture that increasingly stigmatizes rest as laziness [6][7].
- Their marketing targets overworked individuals, pitching the devices as tools to fight fatigue and boost productivity [9][10].

Group 2: Psychological and Health Implications
- Continuous workplace pressure produces chronic stress, which can cause serious health problems, including cardiovascular disease and metabolic disorders [14][16].
- The "effort-reward imbalance" model holds that high effort without adequate reward leads to chronic stress, compounding these health risks [14][16].

Group 3: Comparison with Traditional Stimulants
- Unlike caffeine, which blocks fatigue signals in a gentler way, electric-shock devices trigger stress responses that may produce fear rather than genuine alertness [17][18].
- Caffeine consumption has been associated with various health benefits, whereas shock devices may entrench a harmful cycle of ignoring the body's signals to rest [20][21].

Group 4: Recommendations for Healthy Work Practices
- The article argues that truly high-energy people are those who prioritize breaks and self-care rather than pushing through fatigue [22][23].
- Short breaks, mindfulness, and regular physical activity are recommended for sustaining energy and overall well-being [22][23].
GRPO training no longer fools itself: Kuaishou Kling and Sun Yat-sen University release "GRPO-Guard", significantly alleviating over-optimization in visual generation
机器之心· 2025-11-13 04:12
Core Insights
- The article introduces GRPO-Guard, a solution designed to mitigate the over-optimization observed when applying GRPO to flow models, converging faster while significantly reducing the risk of over-optimization [3][35].

Group 1: GRPO and Over-Optimization Issues
- GRPO delivers significant gains for image- and video-generation flow models, but a systematic bias in its importance-ratio clipping mechanism causes over-optimization: proxy rewards keep rising while the model's real performance degrades [2][14].
- Empirical analysis shows the mean importance ratio stays consistently below 1, so the clip fails to constrain overconfident positive gradients, leaving the model suboptimal in real applications [2][14].

Group 2: Introduction of GRPO-Guard
- GRPO-Guard introduces two key improvements: RatioNorm, which normalizes the importance-ratio distribution so its mean sits close to 1, and Cross-Step Gradient Balancing, which ensures uniform exploration across the noise schedule [19][21].
- Together these restore the effectiveness of the clipping mechanism and stabilize policy updates, alleviating the over-optimization phenomenon [35].

Group 3: Experimental Results
- Experiments on multiple GRPO variants and diffusion backbones show that GRPO-Guard significantly alleviates over-optimization while matching or improving performance relative to baseline methods [26][35].
- In baseline methods the gold score trends clearly downward, while GRPO-Guard largely prevents this decline, indicating improved model robustness [26][28].

Group 4: Future Directions
- GRPO-Guard mitigates but does not eliminate over-optimization: a sizable gap between proxy scores and gold scores remains [35].
- Future work should develop more accurate reward models to further reduce reward hacking, providing a more reliable technical foundation for GRPO in flow models and broader generative tasks [35].
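The interaction between importance-ratio clipping and RatioNorm can be sketched as follows. This is our reading of the idea, not the paper's code: recentring the log-ratio pushes the mean importance ratio back toward 1, so the clip once again constrains overconfident positive gradients.

```python
import numpy as np

def clipped_objective(logp_new, logp_old, advantages, eps=0.2, ratio_norm=True):
    """PPO/GRPO-style clipped surrogate. With ratio_norm=True the log-ratio
    is recentred so the mean importance ratio sits near 1 (our reading of
    RatioNorm; the paper's exact formulation may differ)."""
    log_ratio = np.asarray(logp_new) - np.asarray(logp_old)
    if ratio_norm:
        log_ratio = log_ratio - log_ratio.mean()  # recentre the distribution
    ratio = np.exp(log_ratio)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Pessimistic (min) combination, as in PPO's clipped objective
    return float(np.minimum(unclipped, clipped).mean())
```

When the ratio distribution's mean drifts below 1, the `1 + eps` ceiling never activates for positive advantages; recentring restores that ceiling's bite.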
桥介数物 (Qiaojie Shuwu) completes its Pre-A+ financing round: 深创投 (Shenzhen Capital Group) is the sole investor; founder 尚阳星 is only 26
Sohu Caijing· 2025-11-13 01:46
瑞财经 (by 刘治颖) Recently, 桥介数物(深圳)科技有限公司 ("桥介数物") announced the completion of its Pre-A+ financing round, with 深创投 (Shenzhen Capital Group) as the sole investor.

The proceeds will mainly fund the iteration, upgrade, and commercialization of the company's next-generation cloud-native robot motion development platform, and accelerate its overseas expansion strategy.

Notably, this is the third financing round 桥介数物 has closed this year; its Pre-A round was completed only three months ago.

Founded in 2023, 桥介数物 is a provider of control systems for legged robots. It has helped multiple humanoid-robot companies build their first reinforcement-learning locomotion-control demos from scratch; at the World Robot Conference (WRC) in August 2024, 11 of the 20-plus humanoid robot manufacturers present had purchased its motion-control solutions.

As of Q3 2025, the company's behavior-control solutions had been deployed on more than 50 robot models of different configurations, covering humanoid, quadruped, and wheel-legged platforms.

According to 天眼查 (Tianyancha), the company's actual controller is 尚阳星, with a total shareholding of 55.14% and 63.82% of voting rights; he currently serves as chairman.

Founder 尚阳星, born in 1999, earned his bachelor's degree at Huazhong University of Science and Technology (2017-2021), then entered Southern University of Science and Technology via recommended admission, studying under Professor 张巍, founder of 逐际动力 (LimX Dynamics), and founded 桥介数物 in 2023. ...
Design, Implementation, and Future Development of Reinforcement Learning AI Systems
AI前线· 2025-11-12 04:53
Core Insights
- The article discusses the application of reinforcement learning (RL) in the design of large language model systems and offers preliminary suggestions for future development [3].
- It emphasizes the complexity of RL systems, particularly their engineering and infrastructure requirements, and traces the evolution from traditional RLHF systems to more advanced RL applications [4][24].

Group 1: RL Theory and Engineering
- The engineering demands of RL algorithms are multifaceted, centering on the integration of large language models with RL systems [4].
- The interaction between agents and their environments is crucial; here the environment is defined by how the language model interacts with users or tools [7][8].
- Reward functions are essential for evaluating actions, and advances in reward modeling have strongly shaped how RL is applied to language models [9][10].

Group 2: Algorithmic Developments
- The article traces the evolution of algorithms such as PPO, GRPO, and DPO, noting the advantages and limitations of each across applications [13][19].
- RL practice is shifting from human feedback to machine feedback, which demands more robust evaluation mechanisms [11][24].
- GRPO's distinctive approach of estimating advantages without a traditional critic model is discussed, with emphasis on its use in reasoning-heavy scenarios [19].

Group 3: Large-Scale RL Systems
- RL applications are advancing rapidly, moving from simple human alignment to more ambitious model-intelligence objectives [24].
- Large-scale RL systems face challenges in integrating inference engines and performing dynamic weight updates, requiring efficient resource management [28][35].
- Future RL systems will need greater inference efficiency and flexibility, along with more sophisticated evaluation frameworks [41][58].

Group 4: Open Source and Community Collaboration
- The article highlights open-source RL frameworks such as OpenRLHF and veRL, which aim to foster community collaboration and resource sharing [50][56].
- It stresses building a vibrant ecosystem that balances performance and compatibility in RL systems, encouraging industry participation in collaborative design [58].
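The agent-environment interaction described above can be reduced to a minimal rollout loop. This is a generic sketch with illustrative names, not any specific framework's API: the policy maps observations to actions, and the environment (user, tool, or simulator) returns observations and rewards.

```python
def collect_rollout(policy, env_step, env_reset, horizon=8):
    """Minimal agent-environment loop: the policy maps observations to
    actions; the environment returns (observation, reward, done).
    All names here are illustrative, not a specific framework's API."""
    obs = env_reset()
    trajectory = []
    for _ in range(horizon):
        action = policy(obs)
        obs, reward, done = env_step(action)
        trajectory.append((action, reward))
        if done:
            break
    return trajectory
```

In LLM settings the "observation" is the conversation or tool state so far, the "action" is generated text, and the reward often arrives only at the end of the trajectory, which is what makes reward modeling so central.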
Judging from current information, the ceiling for end-to-end deployment should be quite high...
自动驾驶之心· 2025-11-12 00:04
Core Insights
- The article highlights significant developments in the autonomous driving industry, particularly the performance of Horizon HSD and the advances in Xpeng's VLA 2.0, signaling a shift toward end-to-end production models [1][3].

Group 1: Industry Developments
- Horizon HSD's performance has exceeded expectations, marking the industry's return to one-stage end-to-end production, an approach with a high potential ceiling [1].
- Xpeng's VLA 2.0, which integrates visual and language inputs, reinforces the view that VLA (vision-language-action) capabilities are central to autonomous-driving technology [1].

Group 2: Educational Initiatives
- The article introduces a new course, "Practical Class for End-to-End Production," aimed at sharing production experience in autonomous driving and covering one-stage and two-stage frameworks, reinforcement learning, and trajectory optimization [3][8].
- Enrollment is capped at 40 participants, reflecting a targeted approach to skills development in the industry [3][5].

Group 3: Course Structure
- The course comprises eight chapters: an end-to-end task overview, two-stage and one-stage algorithm frameworks, navigation-information applications, reinforcement-learning algorithms, trajectory-output optimization, fallback solutions, and production experience sharing [8][9][10][11][12][13][14][15].
- Each chapter builds on the previous one, providing a comprehensive understanding of the end-to-end production process in autonomous driving [16].

Group 4: Target Audience and Requirements
- The course targets advanced learners with backgrounds in autonomous-driving algorithms, reinforcement learning, and programming, though it remains accessible to those with less experience [16][17].
- Participants need a GPU meeting the recommended specifications and a foundational grasp of the relevant mathematics [17].
6666! The perfect-score NeurIPS paper is here
量子位· 2025-11-11 11:11
Core Insights
- The article discusses a groundbreaking paper that challenges the prevailing belief that reinforcement learning (RL) is essential for enhancing reasoning capabilities in large language models (LLMs), suggesting instead that model distillation may be more effective [1][5][12].

Group 1: Research Findings
- The paper, "Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?", received a perfect score at NeurIPS, underscoring its significance [5][6].
- The research team from Tsinghua University and Shanghai Jiao Tong University found that RL primarily reinforces existing reasoning paths rather than discovering new ones, contradicting the common assumption that RL expands a model's reasoning capabilities [10][12].
- Using the pass@k metric, the study shows that RL-trained models win at small sampling budgets but are overtaken by base models at larger budgets, suggesting the base models' reasoning ability has been underestimated [14][20].

Group 2: Methodology
- Models were tested across three key application areas (mathematical reasoning, code generation, and visual reasoning) on authoritative benchmark datasets [17][19].
- Compared models included mainstream LLMs such as Qwen2.5 and LLaMA-3.1, with RL models trained via algorithms including PPO, GRPO, and Reinforce++ [18][19].
- The analysis focused on pass@k differences between RL and base models, and on how performance trends as sampling increases [21][22].

Group 3: Implications for the Industry
- The findings suggest that the substantial investment and exploration around RLVR may need re-evaluation, as RL's actual contribution to reasoning capability could be overestimated [4][12].
- The research highlights model distillation as a more promising route to expanding reasoning capabilities in LLMs, which could shift industry focus and funding [10][12].
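The pass@k metric central to the study is commonly computed with the standard unbiased estimator from the code-generation literature (whether the paper uses this exact estimator is an assumption on our part): given n samples of which c are correct, pass@k = 1 - C(n-c, k)/C(n, k), the probability that at least one of k drawn samples is correct.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n attempts (c of them correct) is correct."""
    if n - c < k:
        return 1.0  # any k-subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The paper's crossover finding translates directly into this metric: RL models score higher at k=1 while base models, with more diverse samples, catch up and pass them as k grows.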
No fear of a Claude cutoff: the Doubao coding model arrives, building a "Minecraft" clone in 5 minutes for 0.2 yuan
36Kr· 2025-11-11 09:25
Core Insights
- Doubao-Seed-Code, the first coding model in the Doubao family from ByteDance's Volcano Engine, launches with a focus on optimizing Agentic Coding tasks and a competitive price-performance ratio [1][3][33].

Performance and Features
- Doubao-Seed-Code outperforms several domestic models, including DeepSeek-V3.1, Kimi-K2, and GLM-4.6, scoring second only to Claude Sonnet 4.5, and offers a native 256K context window, surpassing Claude Sonnet 4.5's 200K [1][3].
- The model supports visual understanding, generating code from UI design drafts, screenshots, or hand-drawn sketches, significantly improving front-end development efficiency [3][19].
- It integrates seamlessly with popular development tools, letting users switch from Claude Code with minimal learning curve [7][31].

Cost Efficiency
- The model uses tiered pricing: 1.20 yuan per million input tokens and 8.00 yuan per million output tokens, with full transparent caching cutting overall usage costs by 62.7% [4][31].

Real-World Application
- In real programming scenarios the model autonomously plans development tasks, rapidly builds front-end web pages, and modifies databases while proactively fixing errors and optimizing structure [6][16].
- Given detailed prompts, it can produce working prototypes, demonstrating effective handling of complex development tasks [17][25].

Training and Development
- The model was trained with a large-scale agentic reinforcement learning system over a dataset covering 100,000 container images, with an end-to-end sandbox environment for evaluation [27][29].
- Training emphasizes pure reinforcement learning, achieving state-of-the-art performance on software-engineering tasks without distilled or labeled cold-start data [29].

Market Position
- The emergence of Doubao-Seed-Code addresses the supply risk posed by overseas AI coding models, giving developers a stable and controllable alternative [33].
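At the quoted per-million-token prices, the cost of a job is simple arithmetic. The sketch below ignores the tiered and caching discounts mentioned above, and `doubao_cost_yuan` is our own helper name, not an official API.

```python
def doubao_cost_yuan(input_tokens, output_tokens, in_price=1.20, out_price=8.00):
    """Cost at the quoted per-million-token prices (yuan); tier and
    caching discounts are not modeled."""
    return (input_tokens / 1e6) * in_price + (output_tokens / 1e6) * out_price

# e.g. a job with 500K input tokens and 100K output tokens
cost = doubao_cost_yuan(500_000, 100_000)
```

At these rates a job of that size costs 1.4 yuan before discounts, which is consistent with the headline claim of building a small game for a few mao.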