Reinforcement Learning
Amazon's "blind" robot dazzles in its 30-second parkour debut, led by Chinese researchers
量子位· 2025-10-06 05:42
henry | 量子位 QbitAI. Have you ever seen a "blind" robot demo like this? With no cameras, lidar, or any other perception unit, it picks up a 4.5 kg (9 jin) chair, climbs onto a 1-meter-high table, and then somersaults off. And it is not just showing off: it handles real work too, carrying boxes with ease, leaping onto a table in a single bound, and clambering up slopes on all fours. These smooth combos come from OmniRetarget, the first humanoid (legged) robotics research result released by Amazon's robotics team FAR (Frontier AI for Robotics). OmniRetarget enables reinforcement learning policies to learn long-horizon loco-manipulation skills in complex environments and to transfer zero-shot from simulation to a humanoid robot. As one commenter put it: it can both parkour and work, which makes it ten times better than Tesla's Optimus. In addition, preserving task-relevant interactions makes the data amenable to efficient augmentation, generalizing from a single demonstration to different robot embodiments, terrains, and object configurations, and reducing the cost of collecting data for each variant. In comparisons with other motion-retargeting methods, OmniRetarget showed across-the-board advantages on every key dimension: hard constraints, object interaction, terrain interaction, and data augmentation.
What are the applications of reinforcement learning to robotic arms, quadrupeds, and humanoids?
具身智能之心· 2025-10-05 16:03
Core Viewpoint
- The article discusses the importance of reinforcement learning (RL) in the development of embodied intelligent robots, highlighting its applications in various complex tasks and the challenges faced by newcomers in the field [3][4][10].

Group 1: Reinforcement Learning Applications
- Reinforcement learning is crucial for gait control in humanoid and quadruped robots, enabling them to perform tasks such as climbing stairs, running, and dancing [3][9].
- The VLA+RL approach for robotic arms is gaining popularity in academia, enhancing the efficiency and smoothness of robot operations [4][9].

Group 2: Challenges in Learning and Research
- The complexity and breadth of reinforcement learning make it difficult for beginners to enter the field, often leading to frustration and abandonment of studies [6][10].
- A lack of a comprehensive learning system can result in repeated mistakes and missed opportunities for aspiring researchers [7][10].

Group 3: Educational Offerings
- To address the challenges faced by newcomers, the company has launched a 1v6 small-class paper-guidance program in reinforcement learning, aimed at graduate students and others needing publication guidance [7][8].
- The course includes 14 weeks of concentrated online guidance followed by 8 weeks of follow-up support, focusing on paper-idea confirmation, project implementation, experimental guidance, and writing refinement [10][12].

Group 4: Course Structure and Content
- The course covers various topics, including paper direction and submission analysis, reinforcement learning basics, simulation environments, and writing guidance [10][18].
- Students work on specific ideas involving quadruped robots, humanoid robots, and robotic arms, following a structured path toward a paper suitable for submission to top conferences [19][30].

Group 5: Expected Outcomes
- Participants are expected to produce a paper draft that meets the requirements of specific conferences or journals, with support through the writing and submission process [29][34].
- The course emphasizes the full research cycle: methodology, engineering, evaluation, writing, submission, and maintenance [36].
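The "reinforcement learning basics" such a course starts from can be shown with the smallest possible example. Below is a hedged sketch of tabular Q-learning on an invented 1-D corridor environment; the states, rewards, and hyperparameters are all illustrative and not taken from the course:

```python
import random

# Invented toy environment: a 1-D corridor with states 0..4.
# The agent starts at 0 and earns reward 1.0 on reaching the goal state 4.
N_STATES, GOAL = 5, 4
ACTIONS = [-1, +1]  # step left / step right

def step(state, action):
    nxt = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward, nxt == GOAL

def train(episodes=500, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # Q[state][action_index]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # Epsilon-greedy action selection.
            if rng.random() < eps:
                a = rng.randrange(2)
            else:
                a = max(range(2), key=lambda i: q[s][i])
            nxt, r, done = step(s, ACTIONS[a])
            # Standard Q-learning update toward the bootstrapped target.
            q[s][a] += alpha * (r + gamma * max(q[nxt]) * (not done) - q[s][a])
            s = nxt
    return q

q = train()
# Greedy policy after training: 1 means "step right" toward the goal.
policy = [max(range(2), key=lambda i: q[s][i]) for s in range(N_STATES)]
```

After training, the greedy policy steps right from every non-goal state, and the value of the final step converges to the terminal reward.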
From "knowing the problem" to "knowing the person": UserRL teaches agents to put people first
机器之心· 2025-10-05 06:42
"Knowing others is wisdom; knowing yourself is enlightenment." (Tao Te Ching) The ancients saw it long ago: true human intelligence lies not only in deriving formulas and mastering skills, but in understanding others and reading their intentions. Today's large language models already perform impressively on code, math, and tool use, yet they still lack that ability to "know people" needed to become true partners to users. This is largely because real-world interaction is far more complex than problem solving. That is the next-era challenge for agents: moving from "solving problems" to "understanding users." Truly answering it requires a new dynamic evaluation framework and training mechanism: one that not only measures a model's performance in interaction, but also drives it to learn, in a world of user uncertainty and competing objectives, to ask well, judge soundly, and answer with evidence. To that end, a research team from UIUC and Salesforce proposes a systematic solution whose two components complement each other, turning "user-centric" from a slogan into reproducible pipelines, interfaces, and evaluation metrics.
UserBench paper: https://arxiv.org/pdf/2507.22034
UserBench code: https://github.com/SalesforceAIResearch/UserBench
In real interactions, user goals are often not fully formed at the outset (underspecification), but rather ...
With just one demonstration, can a robot grasp anything like a human hand? DemoGrasp raises the ceiling for dexterous grasping
具身智能之心· 2025-10-04 13:35
Core Viewpoint
- The article discusses the innovative DemoGrasp framework, which enables robots to perform dexterous grasping tasks from a single demonstration, overcoming traditional challenges in robotic manipulation [2][20].

Group 1: Traditional Challenges in Robotic Grasping
- Traditional reinforcement learning methods struggle with high-dimensional action spaces, requiring complex reward functions and often generalizing poorly [1][2].
- Robots trained in simulation often fail in real-world scenarios due to imprecise physical parameters and environmental variation [1][2].

Group 2: Introduction of DemoGrasp
- DemoGrasp, developed by a collaboration of Peking University, Renmin University of China, and BeingBeyond, uses a single successful demonstration to redefine the grasping task [2][4].
- The framework significantly improves performance in both simulated and real environments, marking a breakthrough in robotic grasping technology [2][4].

Group 3: Core Design of DemoGrasp
- The core innovation comprises three components: demonstration trajectory editing, single-step reinforcement learning (RL), and vision-guided sim-to-real transfer [4][10].
- The design lets robots optimize "editing parameters" instead of exploring new actions, greatly reducing the dimensionality of the action space [6][7].

Group 4: Performance Results
- DemoGrasp outperforms existing methods in simulation, achieving success rates of 95.5% on seen categories and 94.4% on unseen categories [10].
- The framework adapts to six different robotic embodiments without hyperparameter adjustments, achieving an average success rate of 84.6% on unseen datasets [11].

Group 5: Real-World Performance
- In real-world tests, DemoGrasp achieved an overall success rate of 86.5% across 110 unseen objects, handling a wide range of everyday items [14].
- The framework successfully grasps small, thin objects such as coins and cards, which defeat traditional methods due to collision issues [14].

Group 6: Limitations and Future Directions
- Despite its strengths, DemoGrasp still struggles with functional grasping tasks and highly cluttered scenes [17][19].
- Future improvements may include segmenting demonstration trajectories for finer-grained decision-making and integrating visual feedback for dynamic scene adjustment [19][20].
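The "optimize editing parameters instead of exploring raw actions" idea can be caricatured as a one-step decision problem: the policy tunes a single low-dimensional edit applied to the fixed demonstration and receives one reward per rollout. The 1-D "trajectory", the offset, and the hill-climbing optimizer below are invented stand-ins, not DemoGrasp's actual formulation:

```python
import random

# Invented stand-in: a recorded 1-D "demo" wrist path, and a new object
# displaced by an unknown offset. The policy optimizes ONE scalar edit
# parameter that shifts the whole demo, instead of re-learning every action.
DEMO = [0.0, 0.2, 0.5, 0.9]   # wrist positions from the single demonstration
TRUE_OFFSET = 0.3             # how far the new object has moved (unknown to policy)

def rollout_reward(edit):
    """One reward per rollout: how well the shifted demo endpoint reaches the object."""
    shifted_end = DEMO[-1] + edit
    object_pos = DEMO[-1] + TRUE_OFFSET
    return -abs(shifted_end - object_pos)

def optimize_edit(iters=200, sigma=0.2, seed=0):
    """Tiny hill-climbing stand-in for single-step RL over the edit parameter."""
    rng = random.Random(seed)
    best, best_r = 0.0, rollout_reward(0.0)
    for _ in range(iters):
        cand = best + rng.gauss(0.0, sigma)  # perturb the current best edit
        r = rollout_reward(cand)
        if r > best_r:                       # keep only improvements
            best, best_r = cand, r
    return best

edit = optimize_edit()
```

Because the search space is a single scalar rather than a long action sequence, even this naive optimizer recovers an edit close to the true object offset.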
Peking University alumnus and Chinese scholar Chi Jin takes on a new role: tenured associate professor at Princeton University
机器之心· 2025-10-04 05:30
Core Insights
- Chi Jin, a Chinese scholar, has been promoted to tenured associate professor at Princeton University, effective January 16, 2026, marking a significant milestone in his academic career and recognition of his foundational contributions to machine learning theory [1][4].

Group 1: Academic Contributions
- Jin joined Princeton's Department of Electrical Engineering and Computer Science in 2019 and has rapidly gained influence in the AI field over his six-year tenure [3].
- His work addresses fundamental challenges in deep learning, particularly the effectiveness of simple optimization methods like Stochastic Gradient Descent (SGD) in non-convex optimization scenarios [8][12].
- Jin's research has established a theoretical foundation for two core issues: efficiently training large and complex models, and ensuring these models are reliable and beneficial in human interactions [11].

Group 2: Non-Convex Optimization
- One of the main challenges in deep learning is non-convex optimization, where loss functions have multiple local minima and saddle points, complicating the optimization process [12].
- Jin has demonstrated across multiple papers that even simple gradient methods can effectively escape saddle points in the presence of minimal noise, allowing continued exploration toward better solutions [12][17].
- His findings provide a theoretical basis for the practical success of deep learning, alleviating concerns about the robustness of optimization in large-scale model training [18].

Group 3: Reinforcement Learning
- Jin's research has also significantly advanced reinforcement learning (RL), particularly in establishing sample efficiency, which is crucial for applications with high interaction costs [19].
- He has provided rigorous regret bounds for foundational RL algorithms, proving that model-free algorithms like Q-learning can remain sample-efficient even in complex settings [22].
- This theoretical groundwork not only answers academic questions but also guides the development of more robust RL algorithms for deployment in high-risk applications [23].

Group 4: Academic Background
- Jin holds a Bachelor's degree in Physics from Peking University and a Ph.D. in Electrical Engineering and Computer Science from the University of California, Berkeley, where he was advised by renowned professor Michael I. Jordan [25].
- This background equipped him with the strong mathematical and analytical foundation essential for his theoretical research in AI and machine learning [25].

Group 5: Recognition and Impact
- Jin, along with other scholars, received the 2024 Sloan Award, highlighting his contributions to the field [6].
- His papers have garnered 13,588 citations in total on Google Scholar, indicating the impact of his research in the academic community [27].
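The saddle-point result can be illustrated on the classic toy function f(x, y) = x^2 - y^2, which has a saddle at the origin: plain gradient descent started exactly there never moves, while a small amount of per-step noise lets it escape along the descending y direction. This is a sketch of the phenomenon only, not Jin's actual perturbed-gradient algorithm or its guarantees:

```python
import random

def f(p):
    """Toy objective with a saddle point at the origin: f(x, y) = x^2 - y^2."""
    return p[0] ** 2 - p[1] ** 2

def grad(p):
    x, y = p
    return (2 * x, -2 * y)

def descend(p, steps=100, lr=0.1, noise=0.0, seed=0):
    """Gradient descent with optional per-step Gaussian perturbation."""
    rng = random.Random(seed)
    x, y = p
    for _ in range(steps):
        gx, gy = grad((x, y))
        x -= lr * gx + noise * rng.gauss(0, 1)
        y -= lr * gy + noise * rng.gauss(0, 1)
    return x, y

plain = descend((0.0, 0.0))              # gradient is zero at the saddle: stuck
noisy = descend((0.0, 0.0), noise=1e-3)  # tiny noise breaks the symmetry and escapes
```

The unperturbed run stays frozen at the saddle, while the perturbed run drifts into the negative-curvature direction and reaches much lower function values.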
RuscaRL: recent work that the head of Li Auto's foundation-model team is very pleased with
理想TOP2· 2025-10-03 09:55
Core Viewpoint
- The article discusses the importance of reinforcement learning (RL) in enhancing the intelligence of large models, emphasizing the need for effective interaction between models and their environments to obtain high-quality feedback [1][2].

Summary by Sections

Section 1: Importance of Reinforcement Learning
- RL is crucial to the advancement of large-model intelligence, with a focus on enabling models to interact with broader environments to achieve capability generalization [1][8].
- Key techniques under exploration include RLHF (Reinforcement Learning from Human Feedback), RLAIF (Reinforcement Learning from AI Feedback), and RLVR (Reinforcement Learning with Verifiable Rewards) [1][8].

Section 2: RuscaRL Framework
- The RuscaRL framework is introduced as a solution to the exploration bottleneck in RL, applying scaffolding theory from educational psychology to enhance the reasoning capabilities of large language models (LLMs) [12][13].
- The framework employs explicit scaffolding and verifiable rewards to guide model training and improve response quality [13][15].

Section 3: Mechanisms of RuscaRL
- Explicit scaffolding: structured guidance through rubrics helps the model generate diverse, high-quality responses, with external support gradually withdrawn as the model's capabilities improve [14].
- Verifiable rewards: RuscaRL designs rewards based on rubrics, providing stable and reliable feedback during training, which enhances exploration diversity and keeps knowledge consistent across tasks [15][16].

Section 4: Future Implications
- Both MindGPT and MindVLA, which target the digital and physical worlds respectively, could benefit from the advances made through RuscaRL, indicating a promising future for self-evolving models [9][10].
- The current challenges in RL are not purely algorithmic: they also involve the systemic integration of algorithms and infrastructure, calling for innovative approaches to capability building [9].
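A rubric-based verifiable reward can be sketched as a checklist scorer: each rubric item is a programmatic check on the response, and the reward is the fraction of checks passed. The rubric items below are invented placeholders for illustration, not RuscaRL's actual rubrics:

```python
# Invented rubric: each item pairs a description with a programmatic check
# on the response text. Real rubrics would encode task-specific criteria.
RUBRIC = [
    ("states a final answer", lambda r: "answer:" in r.lower()),
    ("shows at least two reasoning steps", lambda r: r.count("\n") >= 2),
    ("stays under the length budget", lambda r: len(r) <= 500),
]

def rubric_reward(response, rubric=RUBRIC):
    """Verifiable reward: the fraction of rubric checks the response passes."""
    passed = sum(1 for _, check in rubric if check(response))
    return passed / len(rubric)

good = "Step 1: compare terms.\nStep 2: simplify.\nAnswer: 42"
bad = "not sure"
r_good = rubric_reward(good)
r_bad = rubric_reward(bad)
```

Because every check is deterministic, the reward is stable across training runs, which is the property the scaffolding mechanism relies on for reliable feedback.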
Dreams have it all? Google's new world model trains purely on "imagination" and learns to mine diamonds in Minecraft
机器之心· 2025-10-02 01:30
Core Insights
- Google DeepMind's Dreamer 4 supports the idea that agents can learn skills for interacting with the physical world through imagination, without direct interaction [2][4].
- Dreamer 4 is the first agent to obtain diamonds in the challenging game Minecraft purely from standard offline datasets, demonstrating significant advances in offline learning [7][21].

Group 1: World Model and Training
- World models enable agents to understand the world deeply and select successful actions by predicting future outcomes from their own perspective [4].
- Dreamer 4 uses a novel shortcut forcing objective and an efficient Transformer architecture to accurately learn complex object interactions while supporting real-time human interaction on a single GPU [11][19].
- The model can be trained on large amounts of unlabeled video, requiring only a small amount of action-paired video, opening the possibility of learning general world knowledge from diverse online videos [13].

Group 2: Experimental Results
- In the offline diamond challenge, Dreamer 4 significantly outperformed OpenAI's offline agent VPT [15] while using 100 times less data [22].
- Dreamer 4 surpassed behavior cloning both in which key items it acquired and in how quickly it obtained them, indicating that world-model representations are better suited for decision-making [24].
- The agent succeeded in 14 of 16 interaction tasks in the Minecraft environment, showcasing robust capabilities [29].

Group 3: Action Generation
- With only 10 hours of action-labeled training, Dreamer 4 reached a PSNR of 53% and SSIM of 75%, indicating that the world model absorbs most of its knowledge from unlabeled videos with minimal action data [32].
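The "learning in imagination" idea behind world models can be sketched in miniature: a dynamics model predicts future states, and the agent selects actions purely by evaluating rollouts inside that model, never touching the real environment while planning. Everything below (the 1-D environment, the hand-coded dynamics, the exhaustive planner) is an invented toy, not Dreamer 4's learned Transformer world model:

```python
from itertools import product

# Invented toy: the real world is a 1-D line; the "world model" here is a
# hand-coded stand-in for a learned dynamics model. Planning happens only
# inside the model ("in imagination"), never in the real environment.
TARGET = 3

def model_step(state, action):
    """Imagined dynamics: s' = s + a, reward 1.0 on reaching the target."""
    nxt = state + action
    reward = 1.0 if nxt == TARGET else 0.0
    return nxt, reward

def imagine_return(state, actions):
    """Roll out a candidate action sequence entirely inside the model."""
    total = 0.0
    for a in actions:
        state, r = model_step(state, a)
        total += r
    return total

def plan(state, horizon=4):
    """Exhaustively score imagined action sequences; return the best first action."""
    best = max(product((-1, +1), repeat=horizon),
               key=lambda seq: imagine_return(state, seq))
    return best[0]

first_action = plan(0)
```

Even this brute-force planner shows the key property: the agent commits to "step toward the target" based solely on predicted, never experienced, outcomes.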
SemiAnalysis founder Dylan's latest interview: AI, semiconductors, and US-China
傅里叶的猫· 2025-10-01 14:43
Core Insights
- The article distills a podcast featuring Dylan Patel, founder of SemiAnalysis, focusing on the semiconductor industry and AI computing demand, particularly the collaboration between OpenAI and Nvidia [2][4][20].

OpenAI and Nvidia Collaboration
- OpenAI's partnership with Nvidia is not merely a financial arrangement but a strategic move to meet its substantial computing needs for model training and operation [4][5].
- OpenAI has 800 million users but generates only $1.5 to $2 billion in revenue, and faces competition from trillion-dollar companies like Meta and Google [4][5].
- Nvidia's $10 billion investment in OpenAI aims to support the construction of a 10 GW cluster, with Nvidia capturing a significant portion of the GPU orders [5][6].

AI Industry Dynamics
- The AI industry is characterized by a race to build computing clusters, where the first to establish such infrastructure gains a competitive edge [7].
- The risk for OpenAI lies in converting its investments into sustainable revenue, especially given its $30 billion contract with Oracle [6][20].

Model Scaling and Returns
- Dylan argues against the notion of diminishing returns in model training, suggesting that large computational increases can still yield substantial performance improvements [8][9].
- He likens the current state of AI capability to a "high school" level, with room to grow to "college graduate" levels [9].

Tokenomics and Inference Demand
- The concept of "tokenomics" is introduced, emphasizing the economic value of AI outputs relative to their computational cost [10][11].
- OpenAI must maximize its computing capacity while managing inference demand that roughly doubles every two months [10][11].

Reinforcement Learning and Memory Mechanisms
- Reinforcement learning is highlighted as a critical area for AI development, in which models learn through iterative interactions with their environment [12][13].
- AI models need improved memory mechanisms, with a focus on optimizing long-context processing [12].

Hardware, Power, and Supply Chain Issues
- AI data centers currently consume 3-4% of U.S. electricity, and the rapid growth of AI infrastructure is putting significant pressure on the power grid [14][15].
- The industry faces labor shortages and supply-chain challenges, particularly in building new data centers and power-generation facilities [17].

U.S.-China AI Stack Differences and Geopolitical Risks
- Dylan argues that without AI the U.S. risks losing its global dominance, while China is making long-term investments across many sectors, including semiconductors [18][19].

Company Perspectives
- OpenAI is viewed positively but criticized for spreading its focus across many applications, which may dilute its execution [20][21].
- Anthropic is seen as a strong competitor thanks to its concentrated effort on software development, particularly the coding market [21].
- AMD offers competitive pricing but lacks revolutionary breakthroughs compared to Nvidia [22].
- xAI's potential is acknowledged, but concerns remain about its business model and funding challenges [23].
- Oracle is positioned as a low-risk player benefiting from its established cloud business, in contrast to OpenAI's high-stakes approach [24].
- Meta is viewed as having a comprehensive strategy with significant potential, while Google has made a notable turnaround in its AI strategy [25][26].
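The "tokenomics" framing reduces to unit arithmetic: serving cost per token follows from GPU cost and throughput, and margin compares that cost to the price charged. All numbers below are hypothetical, chosen only to show the calculation:

```python
def cost_per_million_tokens(gpu_hour_cost, tokens_per_second):
    """Serving cost per 1M output tokens on one GPU; both inputs are assumptions."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hour_cost / tokens_per_hour * 1_000_000

# Hypothetical numbers: a $2/hour GPU generating 400 tokens/s.
cost = cost_per_million_tokens(2.0, 400)
# Hypothetical list price of $5 per 1M tokens -> gross margin fraction.
margin = 1 - cost / 5.0
```

With these made-up figures, serving costs come to roughly $1.39 per million tokens, so the margin depends entirely on sustaining throughput and the price per token, which is why inference demand doubling so quickly matters.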
A new SOTA synthesis framework: reinforcement learning as the engine, task synthesis as the fuel, jointly from Ant Group and HKU
量子位· 2025-10-01 03:03
Core Insights
- The article discusses the launch of PromptCoT 2.0 by Ant Group and the University of Hong Kong, focusing on task synthesis as the direction for the second half of large models [1][5].
- The team emphasizes task synthesis and reinforcement learning as foundational technologies for advancing large models and intelligent agents [6][7].

Summary by Sections

Introduction to PromptCoT 2.0
- PromptCoT 2.0 is a comprehensive upgrade of the PromptCoT framework, initially introduced a year ago [4][16].
- The framework aims to enhance the capabilities of large models through task synthesis, particularly in the context of complex real-world problems [5][9].

Importance of Task Synthesis
- Task synthesis is viewed as a critical area spanning problem synthesis, answer synthesis, environment synthesis, and evaluation synthesis [9].
- The team argues that without a sufficient amount of high-quality task data, reinforcement learning cannot be used effectively [9].

Framework and Methodology
- The team has developed a general, powerful problem-synthesis framework, decomposed into concept extraction, logic generation, and training of the problem-generation model [10][13].
- PromptCoT 2.0 introduces an Expectation-Maximization (EM) cycle to iteratively optimize the reasoning chain, yielding more challenging and diverse generated problems [15][23].

Performance and Data Upgrades
- PromptCoT 2.0 shows significant performance improvements, allowing strong reasoning models to reach new state-of-the-art results [17].
- The framework has generated 4.77 million synthetic problems, which are harder and more discriminative than existing datasets [19][20].

Future Directions
- The team plans to explore agentic environment synthesis, multi-modal task synthesis, and self-rewarding mechanisms to further enhance the capabilities of large models [27][28].
- Integrating self-rewarding and game-theoretic approaches is seen as a potential avenue for improving model performance [29].
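The EM cycle can be caricatured with a toy alternation: an E-step selects the rationale that, under the current model, best explains verifiably solvable problems, and an M-step refits the model toward that selection. The "model" here is just a dict of preference weights and the solve rates are invented, so this only mirrors the shape of the loop, not PromptCoT 2.0's actual method:

```python
# Invented stand-in for an EM-style synthesis loop. "Rationales" are templates
# with a hidden usefulness score (how often problems built from them turn out
# to be verifiably solvable); the model's state is a weight per rationale.
RATIONALES = ["shallow", "medium", "deep"]
SOLVE_RATE = {"shallow": 0.2, "medium": 0.5, "deep": 0.9}  # hidden, invented

def em_iterations(rounds=10, lr=0.5):
    weights = {r: 0.5 for r in RATIONALES}  # model's current preferences
    for _ in range(rounds):
        # E-step: score each rationale by how well it explains solvable
        # problems under the current model.
        scores = {r: weights[r] * SOLVE_RATE[r] for r in RATIONALES}
        best = max(scores, key=scores.get)
        # M-step: refit the weights toward the best-explaining rationale.
        for r in RATIONALES:
            target = 1.0 if r == best else 0.0
            weights[r] += lr * (target - weights[r])
    return weights

w = em_iterations()
```

After a few rounds the loop concentrates its preference on the rationale that yields solvable problems, which is the self-improving behavior the EM cycle is meant to produce at scale.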
Heavyweight release from Fudan, Tongji, CUHK and others: a comprehensive survey of reinforcement learning across the full lifecycle of large language models
机器之心· 2025-09-30 23:49
Core Insights
- The article discusses the significant advances in reinforcement learning (RL) techniques that enhance the capabilities of large language models (LLMs), particularly in understanding human intent and following user instructions [2][3].
- A comprehensive survey titled "Reinforcement Learning Meets Large Language Models," conducted by researchers from top institutions, summarizes the role of RL throughout the entire lifecycle of LLMs [2][3].

Summary by Sections

Overview of Reinforcement Learning in LLMs
- The survey details the application strategies of RL across the stages of LLMs, including pre-training, alignment fine-tuning, and reinforced reasoning [3][6].
- It organizes existing datasets, evaluation benchmarks, and mainstream open-source tools and training frameworks relevant to RL fine-tuning, providing a clear reference for future research [3][6].

Lifecycle of LLMs
- The survey systematically covers the complete application lifecycle of RL in LLMs, detailing the objectives, methods, and challenges at each stage from pre-training onward [11][12].
- A classification overview of how RL operates within LLMs is presented, highlighting the interconnections between the different stages [5][6].

Focus on Verifiable Rewards
- The survey emphasizes Reinforcement Learning with Verifiable Rewards (RLVR), summarizing its applications in improving the stability and accuracy of LLM reasoning [7][9].
- It discusses how RLVR optimizes the reasoning process and improves the model's adaptability to complex tasks through automatically verifiable reward mechanisms [7][9].

Key Contributions
- The survey makes three main contributions: a comprehensive lifecycle overview of RL applications in LLMs, a focus on advanced RLVR techniques, and an integration of key research resources essential for experiments and evaluation [9][11].
- It provides valuable references for researchers exploring RL in the context of LLMs [11][12].

Challenges and Future Directions
- Despite significant progress, scalability and training stability remain challenges for large-scale RL on LLMs, which is still computationally intensive and often unstable [12][13].
- Reward design and credit assignment, particularly over long-horizon reasoning, remain difficult for model learning [12][13].
- The survey highlights the need for standardized datasets and evaluation benchmarks to facilitate comparison and validation of RL fine-tuning methods [12][13].
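The RLVR primitive the survey centers on can be sketched as an automatically checkable reward: extract the final answer from a response and compare it to ground truth, yielding a binary reward with no learned reward model. The `Answer:` format and the extraction regex below are assumptions for illustration:

```python
import re

def verifiable_reward(response, ground_truth):
    """Binary RLVR-style reward: 1.0 iff the response's labelled answer matches."""
    # Assumed format: the response contains a line like "Answer: <number>".
    match = re.search(r"answer:\s*(-?\d+(?:\.\d+)?)", response, re.IGNORECASE)
    if not match:
        return 0.0  # unverifiable responses earn no reward
    return 1.0 if float(match.group(1)) == float(ground_truth) else 0.0

r_good = verifiable_reward("reasoning...\nAnswer: 12", "12")
r_bad = verifiable_reward("reasoning...\nAnswer: 13", "12")
r_none = verifiable_reward("I give up", "12")
```

Because the check is fully automatic, this reward scales to millions of training problems without human labeling, which is exactly why the survey highlights RLVR for reasoning stability.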