Reinforcement Learning
A Hands-On Introduction to Robot Learning: HuggingFace and the University of Oxford Open-Source a SOTA Resource Library in a New Tutorial
机器之心· 2025-10-26 07:00
Core Viewpoint
- The article emphasizes the significant advances in robotics, particularly in robot learning, driven by AI technologies such as large models and multi-modal models. This shift has transformed traditional robotics into a learning-based paradigm, opening new possibilities for autonomous decision-making robots [2].

Group 1: Introduction to Robot Learning
- The article highlights the evolution of robotics from explicit modeling to implicit modeling, a fundamental change in how motion is generated. Traditional robotics relied on explicit modeling, while learning-based methods use deep reinforcement learning and learning from expert demonstrations for implicit modeling [15].
- A comprehensive tutorial from HuggingFace and researchers at the University of Oxford serves as a valuable resource for newcomers to modern robot learning, covering the foundations of reinforcement learning and imitation learning [3][4].

Group 2: Learning-Based Robotics
- Learning-based robotics simplifies the pipeline from perception to action by training a unified high-level controller that directly handles high-dimensional, unstructured perception-motion information without relying on a dynamics model [33].
- The tutorial addresses challenges in real-world applications, such as safety and efficiency during early training and the high cost of trial and error in physical environments, and introduces techniques like simulator training and domain randomization to mitigate these risks [34][35].

Group 3: Reinforcement Learning
- Reinforcement learning lets robots autonomously learn optimal behavior policies through trial and error, showing significant potential across a range of scenarios [28].
- The tutorial discusses the "offline-to-online" reinforcement learning framework, which improves sample efficiency and safety by using pre-collected expert data. The HIL-SERL method exemplifies this approach, enabling robots to master complex real-world tasks with near-100% success rates after just 1-2 hours of training [36][39].

Group 4: Imitation Learning
- Imitation learning offers a more direct learning path: the robot replicates expert actions through behavior cloning, avoiding complex reward-function design and keeping training safe [41].
- The tutorial presents advanced imitation learning methods based on generative models, such as Action Chunking with Transformers (ACT) and Diffusion Policy, which model multi-modal data effectively by learning the latent distribution of expert behaviors [42][43].

Group 5: Universal Robot Policies
- The article envisions the future of robotics in universal robot policies that operate across tasks and devices, inspired by the emergence of large-scale open robot datasets and powerful vision-language models (VLMs) [52].
- Two cutting-edge VLA models, π₀ and SmolVLA, are highlighted for their ability to understand visual and language instructions and generate precise robot control commands; SmolVLA is a compact, open-source model that significantly lowers the barrier to application [53][56].
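The behavior-cloning idea under Group 4 — treating imitation as supervised regression from observations to expert actions — can be sketched in miniature. This is a hypothetical toy example (a 1-D linear "policy" fit by plain gradient descent against an invented expert controller), not code from the tutorial:

```python
# Behavior cloning in miniature: fit a policy to expert (obs, action)
# pairs by supervised regression. The "expert" is a made-up controller
# a = 2 * obs + 1; the cloned policy is linear, a_hat = w * obs + b.

expert_data = [(k / 10.0, 2 * (k / 10.0) + 1) for k in range(-10, 11)]

w, b, lr = 0.0, 0.0, 0.1
for _ in range(500):                  # gradient descent on mean squared error
    gw = gb = 0.0
    for obs, act in expert_data:
        err = (w * obs + b) - act     # prediction error on one demo pair
        gw += 2 * err * obs / len(expert_data)
        gb += 2 * err / len(expert_data)
    w -= lr * gw
    b -= lr * gb

# the cloned policy should now closely imitate the expert
print(round(w, 2), round(b, 2))       # → 2.0 1.0
```

Real pipelines replace the linear map with a deep network and the demo list with teleoperated trajectories, but the supervised-learning core is the same, which is why behavior cloning sidesteps reward design entirely.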
From World Models to VLA to Reinforcement Learning: How Embodied "Brain" and "Cerebellum" Algorithms Actually Work!
具身智能之心· 2025-10-26 04:02
Core Insights
- The article discusses the evolution and current state of embodied intelligence, focusing on the "brain and cerebellum" division of labor in robotics: the brain handles perception and planning, while the cerebellum is responsible for execution [3][10].

Technical Evolution
- The development of embodied intelligence has progressed through several stages, from grasp-pose detection, to behavior cloning, to diffusion policies and VLA models [7][10].
- The first stage focused on static object grasping with limited decision-making capability [7].
- The second stage introduced behavior cloning, allowing robots to learn from expert demonstrations but facing challenges in generalization and error accumulation [8].
- The third stage, marked by the introduction of diffusion policies, improved stability and generalization by modeling action sequences [8].
- The fourth stage, emerging in 2025, explores integrating VLA models with reinforcement learning and world models to enhance robots' predictive and interactive capabilities [9][10].

Current Trends and Applications
- Integrating VLA with reinforcement learning improves robots' trial-and-error learning and self-improvement, while combining it with world models enables future prediction and better planning [10].
- The article highlights growing demand for embodied-intelligence applications across industrial, home, restaurant, and medical-rehabilitation settings, driving job opportunities and research interest in the field [10].

Educational Initiatives
- The article outlines a structured learning program aimed at equipping individuals with comprehensive knowledge of embodied-intelligence algorithms, including practical applications and real-world projects [11][14].
- The course targets individuals with a foundational understanding of embodied intelligence and aims to bridge the gap between theory and practical deployment [18][24].
Calling for Collaborators! Seeking Autonomous Driving Enthusiasts Everywhere (Product / 4D Annotation / World Models, etc.)
自动驾驶之心· 2025-10-25 16:03
Core Viewpoint
- The article emphasizes the need for collaboration in the autonomous driving industry, inviting professionals to participate in training, course development, and research support to drive the industry forward [2].

Group 1: Collaboration and Opportunities
- The company is seeking partnerships with professionals in the autonomous driving field to enhance its training and job-guidance services [2].
- Collaborators will receive high compensation and abundant industry resources [3].
- The main areas of collaboration include autonomous driving product management, 4D annotation / data loops, world models, VLA, autonomous driving large models, reinforcement learning, and end-to-end systems [4].

Group 2: Training and Development
- The positions primarily target business-facing training for enterprises, universities, and research institutions, as well as consumer-facing training for students and job seekers [5].
- Interested individuals are encouraged to reach out for further consultation via WeChat [6].
Kuaishou's Klear Team Proposes CE-GPPO: Coordinating Entropy via Gradient Preservation to Fix Entropy Instability in Reinforcement Learning
机器之心· 2025-10-25 01:03
Core Insights
- The article discusses CE-GPPO, a new reinforcement learning algorithm designed to balance exploration and exploitation when training large language models [3][11][21].
- The Klear team at Kuaishou Technology has made significant advances in AI, particularly in language models, achieving state-of-the-art results on mathematical and coding benchmarks [2][21].

Research Motivation
- The core challenge in optimizing large models for complex reasoning tasks with reinforcement learning is balancing policy entropy, which measures the uncertainty in action selection [6][21].
- Existing methods suffer instability from entropy collapse and entropy explosion, leading to either too little or too much exploration [6][21].

Algorithm Design
- CE-GPPO introduces a new approach to gradient clipping that retains and rescales the gradients of low-probability tokens, balancing exploration against convergence [11][15].
- The algorithm uses two adjustable hyperparameters, β₁ and β₂, to control the gradient weights of different token types, allowing a flexible trade-off between exploration and exploitation [15][24].

Experimental Results
- CE-GPPO was tested on multiple mathematical-reasoning benchmarks and outperformed other methods, particularly on high-difficulty tasks [20][21].
- The results indicate that larger models benefit more from CE-GPPO, demonstrating its scalability [21][24].

Comparison with Other Algorithms
- CE-GPPO outperformed other recent reinforcement learning algorithms such as CISPO and GSPO while maintaining training stability and performance [35][36].
- The method also showed advantages over traditional entropy regularization, keeping a stable entropy curve throughout training [37].
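The gradient-preservation idea attributed to CE-GPPO can be illustrated against vanilla PPO-Clip, which contributes zero gradient for tokens whose importance ratio is clipped. The sketch below is a hypothetical simplification based only on the description above (β₁ scaling tokens below the clip interval, β₂ those above it), not the authors' exact formulation:

```python
def token_grad_weight(ratio, advantage, eps=0.2,
                      beta1=0.1, beta2=0.1, preserve=True):
    """Per-token policy-gradient coefficient for a PPO-style update.

    PPO-Clip zeroes the gradient of tokens whose importance ratio is
    clipped; with preserve=True this sketch keeps those gradients but
    rescales them (beta1 below the clip interval [1-eps, 1+eps],
    beta2 above it). Hypothetical illustration of the idea only.
    """
    clipped = (advantage > 0 and ratio > 1 + eps) or \
              (advantage < 0 and ratio < 1 - eps)
    if not clipped:
        return advantage                   # gradient flows as usual
    if not preserve:
        return 0.0                         # PPO-Clip discards it
    scale = beta1 if ratio < 1 - eps else beta2
    return scale * advantage               # retained, but bounded

# a token pushed outside the clip range by the policy update:
print(token_grad_weight(1.5, 1.0, preserve=False))  # → 0.0 (PPO-Clip)
print(token_grad_weight(1.5, 1.0))                  # → 0.1 (preserved)
```

Tuning β₁ and β₂ then trades exploration (larger scales keep rare-token gradients alive) against convergence stability (smaller scales bound their influence), which matches the exploration-exploitation balance the article describes.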
How Does Reinforcement Learning Empower Humanoids, Quadrupeds, Robotic Arms, and Other Embodiments? And How Is Academia Approaching It?
具身智能之心· 2025-10-24 10:00
Core Insights
- Reinforcement learning (RL) remains a significant field, with growing applications in robotics, including humanoid and quadruped robots, as well as in product optimization across industries [1][2][3].
- The complexity of RL poses challenges for newcomers, making it difficult to produce publishable research papers without a structured learning system [5][9].
- To address these challenges, a specialized 1v6 mentoring course in RL has been launched, aimed at helping students produce quality research papers [6][9].

Group 1: Importance of Reinforcement Learning
- RL is crucial for tasks such as gait control in embodied intelligent robots, which is essential for achieving general-purpose capability [2].
- Companies like Yushu and Zhiyuan use RL to make humanoid robots perform complex actions such as climbing stairs, running, and dancing, enabling applications in rescue and hazardous environments [2][8].

Group 2: Challenges in Learning and Research
- The vast and intricate nature of RL makes it hard for beginners to enter the field, often leading to frustration and abandoned studies [5][9].
- Producing a research paper requires proficiency in methodology, experimental results, and writing; any misstep can draw low scores from reviewers [5].

Group 3: Course Offerings and Structure
- The 1v6 mentoring course is designed for graduate students and others needing guidance on research papers, with small class sizes and weekly live sessions [7][9].
- The course spans 14 weeks of intensive training followed by 8 weeks of maintenance support, covering various aspects of RL and robotics [9][15].
- Participants receive guidance on paper ideas, project implementation, experiments, writing refinement, and first-draft preparation for venues such as RA-L, ICRA, IROS, and CoRL [7][9][15].

Group 4: Course Content and Deliverables
- The curriculum covers RL fundamentals, simulation environments, sim2real techniques, and writing guidance, with a focus on practical applications for quadruped, humanoid, and robotic-arm platforms [17][19][20].
- Students will produce a research-paper draft by the end of the course, with support for revisions and the submission process [23][28].
Some Students Haven't Even Gotten Started with Embodied AI, While Others Already Have CCF-A Papers!?
具身智能之心· 2025-10-24 10:00
Group 1
- The article introduces a new paper-tutoring service offering one-on-one customized guidance in advanced research areas such as multimodal models, reinforcement learning, and robotics simulation [1].
- The service covers a wide range of academic levels, from CCF-A to CCF-C and SCI Zone 1 to Zone 4, including support for graduation theses and doctoral applications [1].
- The team consists of experienced PhD mentors and researchers from top universities and leading companies, with experience reviewing for prestigious conferences such as ICML, ICLR, and NeurIPS [1].

Group 2
- The service emphasizes a dual industry-academia perspective, focusing not only on publishing papers but also on their practical value [2].
- The first ten students who inquire receive free matching with a dedicated mentor for in-depth analysis and tailored publication-strategy suggestions [3].
The Best Money-Making AI Right Now Is Qwen3: Six Major Models Battle It Out Globally, with the Top 2 from China
36Kr · 2025-10-23 12:49
Core Insights
- Qwen3 Max has emerged as the leading model in the AI trading competition, surpassing DeepSeek and achieving significant profitability [1][32].
- The competition, Alpha Arena, showcases the capabilities of various AI models under real market conditions, underscoring the financial market as a training ground for AI [30][32].

Performance Summary
- Qwen3 Max achieved a return of +44.38%, with an account value of $14,438 and total profit of $4,438 [11].
- DeepSeek V3.1 follows with a return of +20.92%, an account value of $12,092, and total profit of $2,092 [11].
- Other models, such as Claude 4.5 Sonnet, Grok 4, Gemini 2.5 Pro, and GPT-5, reported negative returns, with GPT-5 showing the largest loss at -71.48% [10][11].

Competition Dynamics
- The competition began on October 18; Qwen3 Max has steadily improved its position, particularly after all models dropped sharply on October 22 [22][24].
- Qwen3 Max's strategy has been characterized as "quick and precise," allowing it to capitalize on market opportunities effectively [8][32].
- The competition has highlighted the contrasting performance of the models, with Qwen3 Max and DeepSeek the only two performing consistently well [22][24].

Market Implications
- The success of Qwen3 Max signals the growing competitiveness of Chinese AI models in the global market, particularly in high-risk financial environments [33].
- Alpha Arena demonstrates how AI can adapt and thrive in real-world financial scenarios, reinforcing the notion that financial markets are ideal AI training grounds [30][32].
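The reported figures are internally consistent if each model started from $10,000 of capital (an inference from the numbers above, not a figure stated in the article); the returns are then simple percentage gains on that base:

```python
def pct_return(start, end):
    """Total return as a percentage of starting capital."""
    return (end - start) / start * 100

START = 10_000  # assumed starting capital, implied by $14,438 at +44.38%

print(round(pct_return(START, 14_438), 2))  # Qwen3 Max → 44.38
print(round(pct_return(START, 12_092), 2))  # DeepSeek V3.1 → 20.92
```

The same formula applied to GPT-5's -71.48% implies an account value near $2,852, consistent with it being the largest loser in the field.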
LatePost Exclusive: Former Zhipu COO Zhang Fan Founds Yuanli Intelligence, Closing an $8 Million Seed Round Led by Lanchi Ventures
晚点LatePost· 2025-10-23 10:21
Core Insights
- The article discusses the recent $8 million seed funding of Yuanli Intelligence, a company focused on training "digital employees" through commercial reinforcement learning [4].
- Founder Zhang Fan emphasizes the shift from knowledge modeling to productivity modeling in AI development [7].

Company Overview
- Yuanli Intelligence was founded by Zhang Fan, who has a background in AI and previously held senior roles at companies including Sogou and Tencent [4].
- The company aims to improve business outcomes by combining commercial know-how with reinforcement learning to build models suited to business applications [7].

Industry Trends
- The article identifies three categories of companies exploring AI-native enterprise services: new startups, SaaS companies leveraging AI, and large tech firms building service platforms [6].
- The current mainstream approach in the agent-to-B sector is "customization + full-parameter fine-tuning," which carries high deployment costs for each individual scenario [7].

Investment Landscape
- The article notes a cautious investment climate in the agent-to-B sector, owing to earlier failures in the enterprise-service market; investors are focusing on top-tier founders and companies [7].
$68 Million: Tsinghua, PKU, and SJTU Alumni Among the Winners as Amazon Announces Its AI PhD Fellowships
机器之心· 2025-10-23 07:45
Group 1
- Amazon has announced the recipients of its AI PhD Fellowship, funding over 100 PhD students at nine universities to research machine learning, computer vision, and natural language processing [1].
- The participating universities are CMU, Johns Hopkins University, MIT, Stanford University, UC Berkeley, UCLA, University of Illinois Urbana-Champaign, University of Texas at Austin, and University of Washington [1].
- The program will provide $10 million in funding in each of the 2025-2026 and 2026-2027 academic years, along with an additional $24 million per year in Amazon Web Services (AWS) cloud credits, for a total of $68 million over two years [2].

Group 2
- Several universities have already announced their selected PhD candidates, including notable Chinese scholars [3].
- Jenny Huang from MIT focuses on data-driven machine learning and uncertainty quantification [4][6].
- David Jin from MIT is interested in scalable computing and AI-driven decision systems [8][6].
- Songyuan Zhang from MIT is researching safe multi-agent systems and intelligent assistive robots [11][6].

Group 3
- Yuxiao Qu from CMU aims to endow AI agents with human-like curiosity to advance scientific research [12][14].
- Danqing Wang from CMU is working on integrating safety and functionality into training for reliable AI agents [15][17].
- Mengdi Wu from CMU focuses on machine learning for optimizing computational-kernel strategies [18][20].

Group 4
- Dacheng Li from UC Berkeley is developing efficient AI and artificial worlds through visual and text generation models [34][36].
- Hao Wang from UC Berkeley is researching practical secure code generation through controlled reasoning [37][39].
- Melissa Pan from UC Berkeley is interested in sustainability in large-scale machine learning and data-center systems [40][42].

Group 5
- Haoyu Li from UT Austin is using AI to enhance modern system performance and availability [49][51].
- Junbo Li from UT Austin is focused on agentic large language models and reinforcement learning [52][54].
- Kaizhao Liang from UT Austin is researching efficient training methods and sparse neural networks [56][58].

Group 6
- Zeping Liu from UT Austin is advancing geospatial AI research with a focus on geographic foundation models [59][61].
- Haoran Xu from UT Austin is expanding reinforcement learning methods and integrating generative AI [62][64].
- Chutong Yang from UT Austin is interested in algorithm design and analysis in trustworthy machine learning [65][67].

Group 7
- Xiao Zhang from UT Austin is focusing on networked and distributed systems to achieve predictable AI performance in 5G edge environments [68][69].
- The list of awardees will be updated as more universities announce their recipients [70].
Qwen3 Max Leads the "AI Live-Trading Competition": Alibaba's Tongyi Qianwen Beats GPT-5 and Gemini in Alpha Arena
Jing Ji Guan Cha Wang· 2025-10-23 07:27
Core Insights
- The "Alpha Arena" AI investment competition, initiated by the US research lab nof1.ai, is becoming a public test of AI models' autonomous trading capabilities [1][7].
- Six major AI models are participating, including Qwen3 Max, which currently leads in returns, showcasing its ability to self-optimize through real-time reinforcement learning [1][2].

Performance Comparison
- Qwen3 Max has a return of +19.57% and an account value of $11,957, significantly outperforming the other models [3].
- By contrast, Gemini 2.5 Pro and GPT-5 have suffered losses exceeding 50%, reflecting more aggressive strategies that led to poor performance [2][3].
- Qwen3 Max's trading behavior balances efficiency and stability, with an average holding period of about 7 hours and a return that climbed from 8.43% to 13.41% [2][3].

Strategy and Risk Management
- Qwen3 Max focuses on opportunity capture and risk balance, executing trades quickly during market volatility while maintaining low risk exposure [2].
- The competition highlights differences in risk management and strategy-adjustment mechanisms among the AI models, with Qwen3 Max demonstrating superior performance [2][4].

Technological Advancements
- The competition reveals the advantages of reinforcement learning and real-time decision-making in AI models that adapt to high-volatility environments [4][7].
- The Qwen series is evolving toward multi-modal capability, enhancing its ability to generate strategies, control risk, and self-correct in complex trading environments [4][7].