Reinforcement Learning
Registration is open! Join Zhang Yaqin and Sun Maosong at the MEET2026 Intelligent Future Conference
量子位· 2025-11-14 05:38
Core Insights
- The article emphasizes the transformative impact of artificial intelligence (AI) on industries and society as a whole, marking the beginning of a new era in 2025 [1].

Event Overview
- The MEET2026 Intelligent Future Conference will focus on cutting-edge technologies and industry developments related to AI [2].
- The theme of the conference is "Symbiosis Without Boundaries, Intelligence to Ignite the Future," highlighting how AI transcends industry, discipline, and scenario boundaries [3].
- Key topics of discussion will include reinforcement learning, multimodal AI, chip computing power, AI in various industries, and AI's global expansion [4].

Academic and Industry Contributions
- The conference will feature the latest advancements from academia and industry, showcasing leading technologies across infrastructure, models, and products [5].
- An authoritative annual AI ranking and trend report will be released during the conference [6].

Notable Speakers
- The conference will host prominent figures such as Zhang Yaqin, a renowned scientist and entrepreneur in AI and digital video [12][13].
- Other notable speakers include Sun Maosong, Wang Zhongyuan, Zhao Junbo, and Liu Fanping, all of whom have made significant contributions to AI research and development [17][21][27][43].

Awards and Recognition
- The "Artificial Intelligence Annual Ranking," initiated by Quantum Bit, has become one of the most influential rankings in the AI industry, evaluating companies, products, and individuals across three dimensions [60].
- The ranking results will be officially announced at the MEET2026 conference [60].

Trend Report
- The "2025 Annual AI Top Ten Trends Report" will focus on the main themes of technological development, analyzing the maturity, implementation status, and potential value of AI trends [65].
- The report will nominate representative institutions and best cases related to these trends [65].

Conference Logistics
- The MEET2026 conference will take place at the Beijing Jinmao Renaissance Hotel, attracting thousands of tech professionals and millions of online viewers [72].
HKUST and collaborators propose WMPO: a world-model-based policy optimization framework for VLA
具身智能之心· 2025-11-14 01:02
Core Insights
- The article introduces WMPO (World Model-based Policy Optimization), a framework developed by the Hong Kong University of Science and Technology and the ByteDance Seed team, which improves sample efficiency, task performance, generalization, and lifelong learning for VLA (Vision-Language-Action) models through pixel-level video generation [5][25].

Research Background and Pain Points
- Existing solutions struggle to balance scalability and effectiveness: human intervention requires continuous supervision, and adapting simulators to diverse scenarios is costly [4].
- Traditional latent-space world models are misaligned with web-scale pre-trained visual features and fail to fully leverage pre-trained knowledge [4][6].

Core Framework Design
- WMPO generates trajectories in an "imagination" space using a high-fidelity pixel-level world model, replacing real-environment interaction and supporting stronger on-policy reinforcement learning [5][11].
- The iterative process follows "imagined trajectory generation → trajectory sampling and evaluation → policy update" [5] (a minimal code sketch of this loop follows the summary).

Key Modules
- **Generative World Model**: Simulates the dynamics between the robot and its environment, generating visual trajectories aligned with VLA pre-trained features [8].
- **Lightweight Reward Model**: Automatically judges the success or failure of imagined trajectories, providing sparse reward signals and avoiding complex reward shaping [9].
- **On-Policy Policy Optimization (GRPO)**: Adapts Group Relative Policy Optimization to sparse-reward settings, balancing stability and scalability [10].

Core Innovations
- **Pixel-Space Priority**: Generates trajectories directly in pixel space, matching VLA pre-trained visual features and maximizing the value of pre-trained knowledge [11].
- **Trajectory Generation Logic**: Predicts action chunks from the initial frame and language instruction, then generates subsequent frames iteratively [12].
- **Dynamic Sampling Strategy**: Generates multiple imagined trajectories from the same initial state and filters out all-success or all-failure groups to ensure effective training samples [12].

Experimental Validation and Key Results
- In simulation, WMPO outperformed baseline methods (GRPO, DPO) across four fine-manipulation tasks, achieving an average success rate of 47.1% with a rollout budget of 128 and 57.6% with a budget of 1280, demonstrating superior sample efficiency [13][14].
- In the real world, WMPO reached a 70% success rate on a "block insertion" task, significantly higher than the baseline policies [15].

Emergent Behaviors
- WMPO exhibits self-correction, autonomously adjusting its actions after failure states, whereas baseline policies repeat erroneous actions until timeout [17].

Generalization Ability
- WMPO achieved an average success rate of 29.6% in out-of-distribution scenarios, outperforming all baselines, indicating that it learns general manipulation skills rather than spurious visual cues [19][20].

Lifelong Learning
- WMPO showed stable performance improvement through iterative trajectory collection, while DPO suffered from instability and required more expert demonstrations [23].

Conclusion and Significance
- WMPO establishes a new paradigm for VLA optimization by combining world models with on-policy reinforcement learning, addressing the high cost and low sample efficiency of real-environment interaction. It improves performance, generalization, and lifelong learning, paving the way for scalable general robotic manipulation [25].
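The imagine → evaluate → update loop above maps naturally onto a few dozen lines of code. The following is a minimal sketch under stated assumptions: WorldModel, RewardModel, Policy, and wmpo_step are illustrative stand-ins (random stubs, not the authors' implementation), included only so the loop runs end to end and shows where the dynamic-sampling filter and the GRPO-style group-relative advantages fit.

```python
import numpy as np

rng = np.random.default_rng(0)

class WorldModel:
    """Stub pixel-level world model: 'generates' the next frame from the current frame and an action chunk."""
    def rollout(self, frame, policy, horizon=8):
        frames, actions = [frame], []
        for _ in range(horizon):
            a = policy.act(frames[-1])            # action chunk conditioned on the last imagined frame
            frames.append(frames[-1] + 0.1 * a)   # stand-in for a generated video frame
            actions.append(a)
        return frames, actions

class RewardModel:
    """Stub lightweight reward model: sparse success/failure label for an imagined trajectory."""
    def success(self, frames):
        return float(frames[-1].mean() > 0.0)

class Policy:
    """Stub VLA policy with a single weight vector."""
    def __init__(self, dim=4):
        self.w = np.zeros(dim)
    def act(self, frame):
        return np.tanh(self.w + 0.1 * rng.standard_normal(self.w.shape))

def wmpo_step(policy, world_model, reward_model, init_frame, group_size=8, lr=1e-2):
    # 1) Imagine a group of trajectories from the same initial state.
    rollouts = [world_model.rollout(init_frame, policy) for _ in range(group_size)]
    rewards = np.array([reward_model.success(frames) for frames, _ in rollouts])
    # 2) Dynamic sampling: discard all-success or all-failure groups (no group-relative signal).
    if rewards.std() == 0:
        return False
    # 3) GRPO-style group-relative advantages drive the policy update.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    for (frames, actions), a_hat in zip(rollouts, adv):
        grad = np.mean(actions, axis=0)           # stand-in for the true policy gradient
        policy.w += lr * a_hat * grad
    return True

policy = Policy()
frame = rng.standard_normal(4)
frame -= frame.mean()                             # center the toy initial frame so outcomes are mixed
print("policy updated:", wmpo_step(policy, WorldModel(), RewardModel(), frame))
```

In the real framework the world model generates video frames and the reward model scores them; the point of the sketch is only the control flow that replaces real-environment rollouts with imagined ones.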
Google DeepMind's latest paper, just published in Nature, unveils its strongest IMO math model
36Ke· 2025-11-13 10:05
Core Insights
- DeepMind's AlphaProof achieved a silver-medal score at the International Mathematical Olympiad (IMO), scoring 28 points, just one point shy of gold, marking a significant milestone in AI's mathematical problem-solving capabilities [3][4][20].

Group 1: AlphaProof's Performance
- AlphaProof is the first AI system to earn a medal-level score in a competition as prestigious as the IMO, demonstrating a leap in AI's ability to tackle complex mathematical challenges [4][20].
- In the 2024 IMO, AlphaProof solved 4 of the 6 problems, including the most difficult one, showcasing advanced problem-solving skills [18][20].
- Its performance is comparable to that of a highly trained international high-school contestant; only about 10% of human participants achieve gold [18][20].

Group 2: Technical Mechanisms
- AlphaProof combines the intuitive reasoning of large language models with reinforcement learning, learning from a dataset of nearly one million mathematical problems [8][10].
- The system uses the Lean formal language for mathematical proofs, ensuring that each reasoning step is verifiable and free of the errors typical of natural-language models (a toy example of this format appears after this summary) [6][7][10].
- AlphaProof employs a strategy similar to Monte Carlo tree search, breaking complex problems into manageable sub-goals to improve problem-solving efficiency [11][17].

Group 3: Limitations and Future Directions
- Despite its achievements, AlphaProof's efficiency is limited: it took nearly three days to solve problems that human competitors complete in 4.5 hours, leaving room for improvement in speed and resource use [21].
- The system struggles with certain problem types, particularly those requiring innovative thinking, highlighting the need for better adaptability and generalization [21][23].
- Future work aims to let AlphaProof understand natural-language problems directly, removing the need for manual translation into formal expressions [23][24].
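The Lean point above is worth making concrete: a proof is accepted only if every tactic step passes the kernel's check, so a finished proof cannot contain a hidden gap. Below is a toy Lean 4 proof of commutativity of addition (a minimal illustration, not an AlphaProof output):

```lean
-- Toy illustration: every step below must be accepted by Lean's kernel.
theorem my_add_comm (a b : Nat) : a + b = b + a := by
  induction b with
  | zero => simp                                       -- goal: a + 0 = 0 + a
  | succ n ih => rw [Nat.add_succ, Nat.succ_add, ih]   -- reduce to the induction hypothesis
```

AlphaProof's proof states and tactic sequences are far more complex, but the same kind of kernel-level check is what makes each of its reasoning steps verifiable.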
Nature discloses the technical details of Google's IMO gold-medal model: a core team of only 10 people who generated 80 million math problems in a year to train the AI
36Ke· 2025-11-13 09:01
Core Insights
- Google DeepMind has publicly released the complete technology and training methodology behind its IMO gold-medal model, AlphaProof, continuing its tradition of transparency in AI research [1][22].

Group 1: Development and Team Structure
- The AlphaProof team was relatively small, typically about 10 members, with additional personnel joining closer to the IMO competition [3].
- A core breakthrough is attributed to IMO gold medalist Miklós Horváth, who developed a method for creating many problem variants to train the AI [3][5].

Group 2: Technical Architecture
- AlphaProof uses a 3-billion-parameter encoder-decoder transformer as its "brain," designed to understand the current proof state and output tactics and step estimates for completing the proof [8][9].
- The system turns mathematical proving into a game-like environment, built on a reinforcement learning framework around the Lean theorem prover (a toy sketch of this framing follows the summary) [6].

Group 3: Training Methodology
- Training faced a shortage of suitable mathematical problems; the model was first pre-trained on roughly 300 billion tokens of code and math text [11].
- A dedicated translation system was built to convert natural-language math problems into Lean's formal language, generating around 80 million formalized problems from 1 million natural-language questions [11][14].

Group 4: Performance and Achievements
- AlphaProof performed impressively at the 2024 IMO, solving three problems, including the most difficult one, with 2-3 days of computation per problem [19][20].
- Its ability to generate related problem variants at test time significantly enhanced its problem-solving capability [17][19].

Group 5: Future Directions and Limitations
- Following its success, DeepMind opened AlphaProof to researchers, who report strengths in finding counterexamples and proving complex statements [22][23].
- Limitations appear when handling custom definitions, indicating a dependency on concepts already present in the Mathlib library [24].
- Reliance on the Lean theorem prover brings its own challenges, since Lean's evolving nature may affect AlphaProof's performance in advanced mathematical fields [24].
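The "proof as game" framing in Group 2 can be illustrated with a toy loop: states are open goals, actions are tactic strings, and the reward is sparse (1 only when every goal is closed). The TRANSITIONS table below is a hypothetical stand-in for calls to the Lean prover, and all names are invented for illustration; none of this is DeepMind's code.

```python
import random

# Hypothetical tactic transitions: (goal, tactic) -> remaining goals ([] means the goal is closed).
TRANSITIONS = {
    ("a+b = b+a", "induction b"): ["a+0 = 0+a", "a+succ b = succ b+a"],
    ("a+0 = 0+a", "simp"): [],
    ("a+succ b = succ b+a", "rw [add_succ, succ_add, ih]"): [],
}

def episode(policy, goal, max_steps=10):
    """Run one proof attempt; return (sparse reward, tactic trace)."""
    open_goals, trace = [goal], []
    for _ in range(max_steps):
        if not open_goals:
            return 1.0, trace                     # all goals closed -> success reward
        g = open_goals.pop()
        tactic = policy(g)
        trace.append((g, tactic))
        if (g, tactic) not in TRANSITIONS:
            return 0.0, trace                     # the prover (here: the stub table) rejects the step
        open_goals.extend(TRANSITIONS[(g, tactic)])
    return 0.0, trace

def random_policy(goal):
    candidates = [t for (g, t) in TRANSITIONS if g == goal] or ["sorry"]
    return random.choice(candidates)

reward, trace = episode(random_policy, "a+b = b+a")
print(reward, trace)
```

In the described system, the policy proposing tactics is the 3-billion-parameter transformer, and Lean itself decides whether each step is legal.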
Zap yourself when you get drowsy at work? Whoever invented the "electronic coffee wristband" really ought to be behind bars
36Ke· 2025-11-13 08:55
You put on the wristband, the screen shows "focus mode activated," and then, with a faint buzz, a mild electric current runs through your wrist.

This is not science fiction, nor a scene from a dystopian novel. In an age of day-after-day workplace fatigue, shock wristbands are quietly entering some people's routines: when your body flags and your thinking slows, this "high-efficiency wake-up gadget" uses an electric shock to keep you going.

Netizens jokingly call it a "wearable Yang Yongxin" | via WeChat official account

Similar products are starting to attract a following, and what they reflect is a culture: we find it harder and harder to stop, increasingly treating fatigue as an "error to be overcome" and rest as the "crime of slacking off."

Workers who get "shocked awake"

You are alert again, so you keep staring at the screen, keep watching the clock, and keep pushing with everything you have toward the task that must be finished an hour from now.

As a psychologist who has spent many years studying human stress, I could not help but worry as I watched this so-called "electronic coffee" spread across social media and among young office workers.

Reportedly, more and more "productivity tools" are appearing under the banner of "waking up" or "rebooting," steering users toward an electric shock instead of a short rest or nap when they feel tired and unfocused. Public reporting has not yet measured the scale systematically, but search interest in keywords like "wake-up wristband," "smart shock wristband," and "office alertness gadget" is clearly trending upward.

Why did such a product appear, and why has it caught on?

First, there is ever-more-compressed time and an ever-faster pace of work. In the modern workplace, overtime, fragmented tasks, and scattered attention ...
GRPO training no longer "fools itself": Kuaishou Kling and Sun Yat-sen University introduce "GRPO-Guard," significantly alleviating over-optimization in visual generation
机器之心· 2025-11-13 04:12
Core Insights
- The article introduces GRPO-Guard, a solution designed to mitigate the over-optimization problem observed when applying GRPO to flow models, preserving fast convergence while significantly reducing the risk of over-optimization [3][35].

Group 1: GRPO and Over-Optimization Issues
- GRPO has brought significant improvements to image and video generation flow models, but it suffers from a systematic bias in the importance-ratio clipping mechanism, leading to over-optimization in which real quality degrades even as the proxy reward keeps rising [2][14].
- Empirical analysis shows that the mean of the importance ratio stays consistently below 1, so the clip fails to constrain over-confident positive gradients, resulting in suboptimal performance in real applications [2][14].

Group 2: Introduction of GRPO-Guard
- GRPO-Guard introduces two key improvements: RatioNorm, which normalizes the importance-ratio distribution so its mean is close to 1, and Cross-Step Gradient Balancing, which ensures uniform exploration across the noise schedule [19][21] (a code sketch of the ratio-normalization idea follows this summary).
- Together these restore the effectiveness of the clipping mechanism and stabilize policy updates, alleviating the over-optimization phenomenon [35].

Group 3: Experimental Results
- Experiments on several GRPO variants and diffusion backbones show that GRPO-Guard significantly alleviates over-optimization while matching or improving performance relative to baselines [26][35].
- In the baselines the gold score shows a clear downward trend, whereas GRPO-Guard effectively mitigates this decline, indicating improved robustness [26][28].

Group 4: Future Directions
- GRPO-Guard mitigates but does not eliminate over-optimization, as a significant gap remains between proxy scores and gold scores [35].
- Future work should build more accurate reward models to further reduce reward hacking and improve optimization outcomes, providing a more reliable technical foundation for GRPO in flow models and broader generative tasks [35].
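The mechanism in Group 2 can be made concrete with a small numerical sketch. It reproduces the pathology described above (mean importance ratio below 1, so the clip at 1 + eps rarely binds for positive advantages) and a RatioNorm-style correction that re-centers the log-ratios so the mean ratio is approximately 1. This is an illustrative reconstruction, not the authors' code, and the exact normalization used in GRPO-Guard may differ.

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, adv, eps=0.2, ratio_norm=False):
    log_ratio = logp_new - logp_old
    if ratio_norm:
        # RatioNorm-style correction (assumed form): shift log-ratios so that E[ratio] is ~1,
        # which restores the intended effect of the clip on over-confident positive updates.
        log_ratio = log_ratio - np.log(np.mean(np.exp(log_ratio)))
    ratio = np.exp(log_ratio)
    clipped = np.clip(ratio, 1 - eps, 1 + eps)
    return -np.mean(np.minimum(ratio * adv, clipped * adv))

rng = np.random.default_rng(0)
logp_old = rng.normal(-1.0, 0.3, size=1024)
logp_new = logp_old - 0.3 + 0.05 * rng.standard_normal(1024)   # systematically low ratios (mean < 1)
adv = rng.standard_normal(1024)

print("mean ratio (raw):   ", np.exp(logp_new - logp_old).mean())
print("loss without norm:  ", ppo_clip_loss(logp_new, logp_old, adv))
print("loss with RatioNorm:", ppo_clip_loss(logp_new, logp_old, adv, ratio_norm=True))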
桥介数物 completes its Pre-A+ financing round: 深创投 is the sole investor, and founder 尚阳星 is only 26
Sohu Caijing· 2025-11-13 01:46
Rui Caijing (瑞财经) | Liu Zhiying

Recently, 桥介数物(深圳)科技有限公司 (hereinafter "桥介数物") announced the completion of its Pre-A+ financing round, with 深创投 (Shenzhen Capital Group) as the sole investor.

The proceeds will mainly fund the iteration, upgrade, and commercialization of its next-generation cloud-native robot motion development platform, and accelerate the company's overseas expansion.

Notably, this is the third round 桥介数物 has closed this year; it completed its Pre-A round only three months ago.

Founded in 2023, 桥介数物 is a provider of control systems for legged robots. It has helped several humanoid-robot companies build their first reinforcement-learning locomotion-control demos from zero to one, and at the World Robot Conference (WRC) in August 2024, 11 of the more than 20 humanoid-robot makers present had purchased its motion-control solutions.

As of the third quarter of 2025, 桥介数物's behavior-control solutions had been deployed on more than 50 robot models of different configurations, covering humanoid, quadruped, and wheel-legged platforms.

According to Tianyancha, the company's actual controller is 尚阳星, with a total shareholding of 55.14% and 63.82% of voting rights; he currently serves as chairman.

Founder 尚阳星 was born in 1999, earned his bachelor's degree at Huazhong University of Science and Technology (2017-2021), went on to a recommended master's at Southern University of Science and Technology under Professor Zhang Wei (张巍), founder of 逐际动力 (LimX Dynamics), and founded 桥介数物 in 2023. ...
Design, Implementation, and Future Development of Reinforcement Learning AI Systems
AI前线· 2025-11-12 04:53
Core Insights
- The article discusses the application of reinforcement learning (RL) in the design of large language model systems and offers preliminary suggestions for future development [3].
- It emphasizes the complexity of RL systems, particularly their engineering and infrastructure requirements, and traces the evolution from traditional RLHF systems to more advanced RL applications [4][24].

Group 1: RL Theory and Engineering
- The engineering demands of RL algorithms are multifaceted, centering on the integration of large language models with RL systems [4].
- The interaction between agents and their environments is crucial, with the environment defined by how the language model interacts with users or tools [7][8].
- Reward functions are essential for evaluating actions, and advances in reward modeling have significantly shaped how RL is applied to language models [9][10].

Group 2: Algorithmic Developments
- The article outlines the evolution of algorithms such as PPO, GRPO, and DPO, noting their respective advantages and limitations across applications [13][19].
- It highlights the shift from human feedback to machine feedback in RL practice, which demands more robust evaluation mechanisms [11][24].
- GRPO's distinctive approach of estimating advantages without a traditional critic model is discussed, with emphasis on its use in inference-heavy scenarios [19] (see the sketch after this summary).

Group 3: Large-Scale RL Systems
- RL applications are advancing rapidly, shifting from simple human alignment toward more complex model-intelligence objectives [24].
- Large-scale RL systems face challenges in integrating inference engines and handling dynamic weight updates, requiring efficient resource management [28][35].
- Future RL systems will need to improve inference efficiency and flexibility and to build more sophisticated evaluation frameworks [41][58].

Group 4: Open Source and Community Collaboration
- The article mentions open-source RL frameworks such as OpenRLHF and veRL, which aim to strengthen community collaboration and resource sharing [50][56].
- It stresses the importance of a vibrant ecosystem that balances performance and compatibility, encouraging industry participation in collaborative design [58].
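The contrast between PPO's learned critic and GRPO's critic-free advantage estimate is easiest to see side by side. The sketch below uses invented toy rewards and a made-up value estimate purely for illustration:

```python
import numpy as np

rewards = np.array([0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0])  # G sampled answers to one prompt

# PPO-style: advantage = reward - V(s), where V comes from a separately trained critic model.
value_estimate = 0.4                        # stand-in for a critic's prediction
adv_ppo = rewards - value_estimate

# GRPO-style: no critic; the group's own mean and std serve as the baseline.
adv_grpo = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

print("PPO advantages :", adv_ppo)
print("GRPO advantages:", adv_grpo)
```

Dropping the critic removes a second large model from training, which is part of why GRPO is attractive in inference-heavy RL pipelines.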
Judging from the information available so far, the ceiling for end-to-end deployment should be quite high...
自动驾驶之心· 2025-11-12 00:04
Core Insights
- The article highlights notable developments in the autonomous driving industry, particularly the performance of Horizon's HSD and the progress of Xiaopeng's VLA 2.0, signaling a shift toward end-to-end production models [1][3].

Group 1: Industry Developments
- Horizon HSD's performance has exceeded expectations, bringing the industry's focus back to one-stage end-to-end production, which has a high potential ceiling [1].
- Xiaopeng's VLA 2.0, which takes visual and language inputs directly, reinforces the view that VLA-style capabilities are central to autonomous driving technology [1].

Group 2: Educational Initiatives
- The article promotes a new course, "Practical Class for End-to-End Production," aimed at sharing production experience in autonomous driving, covering one-stage and two-stage frameworks, reinforcement learning, and trajectory optimization [3][8].
- The course is limited to 40 participants, emphasizing a targeted approach to skill development [3][5].

Group 3: Course Structure
- The course consists of eight chapters covering an overview of end-to-end tasks, two-stage and one-stage algorithm frameworks, applications of navigation information, reinforcement learning algorithms, trajectory output optimization, fallback solutions, and production experience sharing [8][9][10][11][12][13][14][15].
- Each chapter builds on the previous one, giving a comprehensive view of the end-to-end production process in autonomous driving [16].

Group 4: Target Audience and Requirements
- The course targets advanced learners with a background in autonomous driving algorithms, reinforcement learning, and programming, though it is also accessible to those with less experience [16][17].
- Participants need a GPU meeting the recommended specifications and a foundation in the relevant mathematical concepts [17].
6666! The perfect-score NeurIPS paper is here
量子位· 2025-11-11 11:11
Core Insights
- The article discusses a groundbreaking paper that challenges the prevailing belief that reinforcement learning (RL) is essential for enhancing reasoning in large language models (LLMs), suggesting instead that model distillation may be more effective [1][5][12].

Group 1: Research Findings
- The paper, "Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?", received a perfect score at NeurIPS, underscoring its impact [5][6].
- The research team from Tsinghua University and Shanghai Jiao Tong University found that RL primarily reinforces existing reasoning paths rather than discovering new ones, contradicting the common assumption that RL expands a model's reasoning capabilities [10][12].
- Using the pass@k metric (sketched in code after this summary), the study shows that RL-tuned models perform better at low sampling budgets but are overtaken by their base models at high sampling budgets, indicating that the base models' reasoning abilities may be underestimated [14][20].

Group 2: Methodology
- The research tested various models across three application areas - mathematical reasoning, code generation, and visual reasoning - using authoritative benchmark datasets [17][19].
- The models compared include mainstream LLMs such as Qwen2.5 and LLaMA-3.1, with RL models trained using algorithms such as PPO, GRPO, and Reinforce++ [18][19].
- The analysis focuses on the pass@k gap between RL and base models and on how performance trends change as the sampling budget grows [21][22].

Group 3: Implications for the Industry
- The findings suggest that the substantial investment and exploration around RLVR may need to be re-evaluated, as the actual benefit of RL for enhancing reasoning capabilities could be overestimated [4][12].
- The research highlights model distillation as a more promising approach for expanding reasoning capabilities in LLMs, which could shift industry focus and funding [10][12].
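The comparison above hinges on pass@k, the probability that at least one of k sampled answers to a problem is correct. The sketch below uses the standard unbiased estimator from n samples of which c are correct; it is an assumption that the paper uses this common form, and the numbers are toy values.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Toy numbers for a single problem: even with few correct samples, pass@k approaches 1 as k grows,
# which is the large-k regime where the paper reports base models overtaking RL-tuned ones.
print(pass_at_k(n=256, c=16, k=1))    # 0.0625
print(pass_at_k(n=256, c=16, k=64))   # close to 1
```

Under this metric, an RL model that concentrates probability on a few reasoning paths can dominate at k = 1 yet fall behind the broader-coverage base model at large k, which is the trend the paper reports.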