Reinforcement Learning
Embodied robots open up many new application scenarios for reinforcement learning!
具身智能之心· 2025-10-11 00:02
Core Insights
- The article discusses the importance of reinforcement learning (RL) in the development of embodied intelligent robots, highlighting its application in complex tasks such as stair climbing, running, and dancing [3][9]
- It emphasizes the challenges newcomers face in reinforcement learning, particularly in producing academic papers, and introduces a specialized tutoring program to address them [6][10]

Group 1: Reinforcement Learning Applications
- Reinforcement learning is crucial for gait control in humanoid and quadruped robots, enabling them to perform tasks in challenging environments [3][9]
- The VLA+RL approach for robotic arms is gaining popularity in academia, enhancing the efficiency and smoothness of robot operations [4][9]

Group 2: Educational Program
- The program is designed for graduate students and others needing guidance on academic papers, featuring small class sizes and weekly live sessions [8][10]
- The course aims to help participants confirm research ideas, implement projects, and produce initial drafts for submission to venues such as RAL, ICRA, IROS, and CoRL [8][10]

Group 3: Course Structure and Content
- The course spans 14 weeks of intensive online tutoring followed by 8 weeks of maintenance support, covering various aspects of reinforcement learning and its applications [10][19]
- Weekly milestones and quantifiable indicators are set to ensure participants complete a draft paper by the end of the course [18][19]

Group 4: Learning Outcomes
- Participants will gain a comprehensive understanding of reinforcement learning algorithms, simulation environments, and the entire process from research idea to paper submission [23][24]
- The program includes practical training on robot tasks and writing guidance, ensuring that even those without mature ideas can develop a publishable paper [17][24]
"Reasoning models are still at the RNN stage": a transcript of Li Jianzhong's dialogue with Lukasz Kaiser, inventor behind GPT-5 and the Transformer
AI科技大本营· 2025-10-10 09:52
Core Insights
- The dialogue emphasizes the evolution of AI, particularly the transition from language models to reasoning models, highlighting the need for a new level of innovation akin to the Transformer architecture [1][2][4]

Group 1: Language and Intelligence
- Language plays a crucial role in AI development, with the emergence of large language models marking a significant leap in AI intelligence [6][8]
- Understanding language as a time-dependent sequence is essential for expressing intelligence, as it allows continuous generation and processing of information [7][9]
- Current models exhibit the ability to form abstract concepts, similar to human learning processes, despite criticisms that they lack true understanding [9][10]

Group 2: Multimodal and World Models
- The pursuit of unified models across modalities is ongoing, with current models like GPT-4 already demonstrating multimodal capabilities [12][13]
- There is skepticism that language models alone suffice for AGI, with some experts advocating world models that learn physical-world rules through observation [14][15]
- Improvements in model architecture and data quality are needed to bridge the gap between language models and world models [15][16]

Group 3: AI Programming
- AI programming is seen as a significant application of language models, with a potential shift toward natural-language-based programming [17][19]
- Two main perspectives on the future of AI programming exist: one advocating AI-native programming and the other AI as a copilot, suggesting a hybrid approach [18][20]

Group 4: Agent Models and Generalization
- The concept of agent models is discussed, with generalization to new tasks being a key challenge [21][22]
- The effectiveness of agent systems relies on the ability to learn from interactions and use external tools, which is currently limited [22][23]

Group 5: Scaling Laws and Computational Limits
- The scaling laws in AI development are debated, with concerns that over-reliance on computational power may overshadow algorithmic advances [24][25]
- The economic limits of scaling models are acknowledged, suggesting a need for new architectures beyond current paradigms [25][28]

Group 6: Embodied Intelligence
- The slow progress in embodied intelligence, particularly in robotics, is attributed to data scarcity and the fundamental differences between bits and atoms [29][30]
- Future models capable of understanding and acting in the physical world are anticipated, requiring advances in multimodal training [30][31]

Group 7: Reinforcement Learning
- The shift toward reinforcement-learning-driven reasoning models is highlighted, with potential for significant scientific discoveries [32][33]
- The current limitations of RL training methods are acknowledged, emphasizing the need for further exploration and improvement [34]

Group 8: AI Organization and Collaboration
- The development of next-generation reasoning models is seen as essential for large-scale agent collaboration [35][36]
- More parallel processing and effective feedback mechanisms in agent systems are needed to enhance collaborative capabilities [36][37]

Group 9: Memory and Learning
- The limitations of current models' memory capabilities are discussed, with a focus on the need for more sophisticated memory mechanisms [37][38]
- Continuous learning is identified as a critical area for future development, with ongoing efforts to integrate memory tools into models [39][40]

Group 10: Future Directions
- The potential for next-generation reasoning models to achieve higher data efficiency and generate innovative insights is highlighted [41]
Compute costs plummet as the Markovian Thinker arrives: LLM reasoning cost drops to linear
36Kr· 2025-10-10 07:27
Core Insights
- The article discusses the effectiveness and high costs of using reinforcement learning to enhance reasoning capabilities in large language models (LLMs) [1]
- A new paradigm called the Markovian Thinker is introduced, which limits the computational complexity of reasoning in LLMs by maintaining a fixed state size [4][20]

Group 1: Markovian Thinker Concept
- The core idea of the Markovian Thinker is to reconstruct the components of reinforcement learning so that the effective state size remains bounded regardless of the total thinking length [4]
- This approach lets longer reasoning require only linear compute and constant memory, decoupling how long the model thinks from how much context it must handle [4][20]

Group 2: Delethink Implementation
- Delethink is a reinforcement learning environment that organizes the reasoning process into fixed-size chunks, resetting the context at chunk boundaries [4][9]
- Delethink yields linear scaling for both the generation and backpropagation phases, in contrast to the quadratic scaling of traditional LongCoT environments [6][15]

Group 3: Experimental Results
- Even with an 8K chunk size, the DeepSeek R1-Distill 1.5B model trained with Delethink can reason up to 24K tokens, outperforming LongCoT-RL on mathematical benchmarks [9][12]
- The model achieved 49% accuracy on a 96K-token reasoning task with minimal additional training steps, demonstrating significant efficiency gains [14][15]

Group 4: Implications for Future Models
- The success of the Markovian Thinker indicates that decoupling thinking length from context size could enable next-generation reasoning models to handle millions of tokens effectively [20]
- The findings suggest that non-quadratic-complexity sequence architectures may greatly benefit reasoning models, as the thinking process can be effectively transformed into a Markovian style [20]
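The chunked reasoning loop described above can be illustrated with a short sketch. This is my own schematic reading of the summary, not the paper's code: `generate` is a hypothetical stand-in for an LLM call, and the carryover tail is an assumed mechanism for passing state across chunk boundaries.

```python
# Sketch of a Delethink-style "Markovian" reasoning loop (hypothetical API).
# Instead of growing the context with every token, reasoning is split into
# fixed-size chunks; at each chunk boundary the context is reset to the
# original prompt plus a short carryover tail of the previous chunk, so the
# effective state stays bounded no matter how long the model thinks.

def markovian_think(generate, prompt, chunk_size=8192, carryover=512, max_chunks=12):
    """generate(context, max_tokens) -> list of tokens (stand-in for an LLM call)."""
    state = list(prompt)   # bounded state: prompt + carryover tail
    trace = []             # full reasoning trace (never fed back in whole)
    for _ in range(max_chunks):
        chunk = generate(state, chunk_size)
        trace.extend(chunk)
        if not chunk or chunk[-1] == "<eos>":
            break
        # Reset: keep only the prompt and the tail of the last chunk.
        state = list(prompt) + chunk[-carryover:]
    return trace
```

Under this scheme each call sees at most `len(prompt) + carryover + chunk_size` tokens, so total attention work grows linearly with the trace length rather than quadratically.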
Compute costs plummet! The Markovian Thinker arrives: LLM reasoning cost drops to linear
机器之心· 2025-10-10 06:36
Core Insights
- The article discusses the effectiveness and high costs associated with using reinforcement learning to enhance reasoning capabilities in large language models (LLMs) [1]
- A new paradigm called the Markovian Thinker is introduced, which prevents quadratic growth in computational requirements by maintaining a fixed state size during reasoning [3][9]

Group 1: Markovian Thinker
- The Markovian Thinker redefines the structure of reinforcement learning so that the effective state size remains bounded regardless of the total thinking length, making computational requirements linear [9][32]
- The Delethink framework exemplifies this approach by organizing the reasoning process into fixed-size chunks and resetting the context at chunk boundaries [10][12]

Group 2: Performance and Efficiency
- Experiments show that Delethink lets models think up to 24K tokens with significant gains over traditional LongCoT methods, even reaching 49% accuracy on complex tasks at 96K tokens [20][23][26]
- Delethink's computational efficiency is striking: training requires only 7 H100-months versus 27 H100-months for LongCoT-RL at an average thinking length of 94K tokens [26]

Group 3: Implications for Future Models
- The success of the Markovian Thinker suggests that decoupling thinking length from context size could enable future reasoning models to handle millions of tokens effectively [32][33]
- The findings indicate that non-quadratic-complexity architectures may significantly benefit reasoning models, allowing more efficient processing of thought sequences [33]
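The quadratic-versus-linear claim above can be made concrete with back-of-envelope arithmetic. The numbers below are my own illustration, not figures from the paper: they count attention token pairs only, ignoring constant factors.

```python
# With full-context (LongCoT-style) reasoning, token i attends to all i
# previous tokens, so attention work over an L-token trace is ~L*(L+1)/2
# token pairs. With a bounded state of at most S tokens (prompt + carryover
# + chunk), each token attends to at most S tokens, so work is ~L*S,
# i.e. linear in L.

def longcot_pairs(L):
    return L * (L + 1) // 2

def delethink_pairs(L, state_size):
    return L * state_size

# Illustrative comparison at the 96K-token trace length mentioned above,
# with an assumed 8K-token bounded state.
full = longcot_pairs(96_000)
chunked = delethink_pairs(96_000, 8_192)
ratio = full / chunked  # roughly 5.9x fewer token pairs in this toy count
```

The gap widens with trace length: doubling `L` doubles the chunked cost but roughly quadruples the full-context cost.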
DemoGrasp: how does a single demonstration achieve universal dexterous grasping?
具身智能之心· 2025-10-10 00:02
Core Insights
- The article discusses DemoGrasp, a novel method for universal dexterous grasping that lets robots learn grasping strategies from a single demonstration [2][3][6]

Group 1: Methodology
- DemoGrasp uses a simple, efficient reinforcement learning framework that enables any dexterous hand to learn a universal grasping strategy from just one successful grasping demonstration [6]
- The method edits the trajectory of robot actions to adapt to new objects and poses, determining grasp positions and styles by adjusting the wrist pose and hand joint angles [2][3]

Group 2: Performance and Validation
- In simulation, DemoGrasp achieved a 95% success rate using the Shadow hand on objects from the DexGraspNet dataset, outperforming existing methods [2]
- The method transfers well, achieving an average success rate of 84.6% on six unseen object datasets despite being trained on only 175 objects [2]

Group 3: Applications and Capabilities
- The policy successfully grasped 110 previously unseen real-world objects, including small and thin items, and is robust to variations in spatial positioning, background, and lighting [3]
- DemoGrasp supports both RGB and depth inputs and can be extended to language-guided grasping in cluttered environments [3]
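The trajectory-editing idea can be sketched in a few lines. This is a deliberately minimal reading of the summary, not the authors' method: the demonstration format, the uniform `object_offset` shift, and the single `joint_delta` correction are all simplifying assumptions (the real system presumably learns these adjustments with RL).

```python
# Minimal sketch of demonstration-trajectory editing: a single demonstrated
# grasp is stored as a sequence of (wrist_position, joint_angles) waypoints;
# to reuse it on a new object pose, shift the wrist waypoints by the object's
# position offset and apply a correction to the finger joint angles.

def edit_trajectory(demo, object_offset, joint_delta):
    """demo: list of (wrist_xyz_tuple, joint_angle_list) pairs from one demonstration."""
    edited = []
    for wrist, joints in demo:
        new_wrist = tuple(w + o for w, o in zip(wrist, object_offset))
        new_joints = [q + joint_delta for q in joints]  # uniform correction (toy)
        edited.append((new_wrist, new_joints))
    return edited
```

In the actual method, per-step wrist and joint adjustments would be the actions an RL policy outputs, with the demonstration serving as a strong structural prior.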
DexCanvas: must embodied data really always lack one of scale, realism, and force sensing?
具身智能之心· 2025-10-10 00:02
Core Viewpoint
- The article discusses the challenges and advances in dexterous manipulation in robotics, highlighting the need for high-quality, multi-modal data to improve robotic grasping and introducing the DexCanvas dataset as a solution [1][15]

Group 1: Challenges in Dexterous Manipulation
- Dexterous manipulation remains a significant challenge due to the need for precise control, high-dimensional motion planning, and real-time adaptation to dynamic environments [2][11]
- Existing hardware falls into two categories, two-finger grippers and multi-finger humanoid hands, with the latter better suited to complex tasks thanks to their higher degrees of freedom [2][3]
- Current learning methods include imitation learning and reinforcement learning, each with its own advantages and limitations in data requirements and training complexity [4][9]

Group 2: Data Collection and Quality Issues
- Data collection for dexterous manipulation is expensive and often lacks tactile and force information, and existing datasets are insufficient for large-scale pre-training [9][10]
- The article emphasizes the data-collection trade-off: achieving scale, realism, and tactile feedback simultaneously is difficult [6][7]
- The DexCanvas dataset addresses the lack of force and tactile information in existing datasets, providing a comprehensive solution for high-quality data collection [17][21]

Group 3: DexCanvas Dataset Introduction
- DexCanvas is a large-scale dataset launched by Lingqiao Intelligent Technology, designed to bridge the gap between cognitive and physical intelligence in robotics [15][16]
- The dataset includes complete multi-finger force/contact annotations optimized for systems with over 20 degrees of freedom, significantly enhancing data quality [17][21]
- DexCanvas offers a structured data-collection framework based on 22 types of human hand operation modes, integrating over 1,000 hours of real human demonstration data and 100,000 hours of physically simulated data [21][22]

Group 4: Data Generation and Enhancement
- The dataset is generated by capturing human demonstrations with high precision and using physical simulation to recover the missing force-control data [25][27]
- DexCanvas expands the dataset by varying object properties and initial conditions, greatly increasing data volume while preserving force-control information [28][29]
- Unlike pure simulation, DexCanvas is grounded in real human demonstrations, allowing better generalization across robotic platforms and tasks [30]

Group 5: Industry Impact and Future Prospects
- The introduction of DexCanvas is expected to accelerate advances in robotics by providing the physical-interaction data that existing datasets have lacked [32]
- The article expresses anticipation for the open-sourcing of the dataset to further support research and development in related areas [32]
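The augmentation recipe in Group 4 can be sketched as a loop. Everything here is a hypothetical schematic, not DexCanvas code: `replay` stands in for a physics simulator, and the parameter names and ranges are invented for illustration.

```python
# Sketch of demonstration-based data augmentation: take each real human
# demonstration, vary object properties and initial pose, replay it in a
# physics simulator, and keep only replays that still succeed. Force and
# contact labels then come from simulation while the motion stays
# human-derived.
import random

def augment(demos, replay, n_variants=50, seed=0):
    """replay(demo, params) -> (success, force_labels); stand-in for a simulator."""
    rng = random.Random(seed)  # seeded so the augmented set is reproducible
    out = []
    for demo in demos:
        for _ in range(n_variants):
            params = {
                "mass_scale": rng.uniform(0.5, 2.0),    # invented range
                "friction": rng.uniform(0.3, 1.2),      # invented range
                "pose_jitter": rng.uniform(-0.02, 0.02),
            }
            ok, forces = replay(demo, params)
            if ok:  # discard physically implausible replays
                out.append((demo, params, forces))
    return out
```

Filtering on simulated success is what lets the dataset scale far beyond the raw capture hours without fabricating motions the hand never performed.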
Ren Shaoqing's non-consensus take on intelligent driving: world models, long-horizon agents, and "extreme" engineering
晚点Auto· 2025-10-09 12:17
Core Viewpoint
- The article discusses NIO's innovative approach to autonomous driving, emphasizing world models and reinforcement learning as key components for achieving advanced artificial general intelligence (AGI) in automotive technology [4][9][26]

Group 1: NIO's Approach to Autonomous Driving
- NIO positions itself as an AI company, developing autonomous driving through a unique combination of high computing power, multiple sensors, and a new architecture based on world models and reinforcement learning [5][8][34]
- The company has built a three-layer data system to support its autonomous driving capabilities, considered among the most advanced in the industry [36][54]
- NIO's strategy shifts from traditional end-to-end models to a more complex world model that integrates spatial and temporal understanding, aiming to improve the vehicle's handling of real-world scenarios [10][13][26]

Group 2: Reinforcement Learning and World Models
- Reinforcement learning is viewed as essential for long-term decision-making in autonomous systems, moving beyond short-horizon imitation learning [7][29][33]
- The world model is defined as a high-bandwidth cognitive system that lets AI understand and predict physical interactions in the environment, which is crucial for effective autonomous driving [10][16][26]
- NIO believes integrating language models with world models will yield a more comprehensive grasp of both concepts and physical reality, ultimately contributing to the development of AGI [13][28][33]

Group 3: Data Utilization and Training
- NIO trains its models on a combination of real-world driving data and simulated environments, including gaming data, to cover a wide range of driving scenarios [27][30]
- The company stresses large-scale, diverse training datasets over expert-only data, which may lack the complexity of real-world situations [28][30]
- NIO's approach to data collection and training is designed to improve performance in edge cases and overall safety [41][44]

Group 4: Future Developments and Industry Position
- NIO plans an open-set interaction system allowing more natural communication between users and the vehicle, moving beyond limited command sets [18][20]
- The company is committed to continuous innovation and exploration in autonomous driving, even in the face of initial industry skepticism [8][25][39]
- NIO's advances are expected to position it ahead of competitors, particularly with the upcoming release of its open-set interaction capabilities [22][47]
Verlog, an open-source RL framework built for LLM agents, arrives: 400 turns is no problem
机器之心· 2025-10-08 04:13
Core Insights
- The article discusses the challenges intelligent agents face in maintaining clear reasoning and robust decision-making over long-horizon tasks, particularly when a task extends to hundreds of steps [2][3]
- It introduces Verlog, a multi-turn reinforcement learning framework designed to handle long-horizon tasks effectively, overcoming the limitations of traditional frameworks [3][20]

Group 1: Framework Overview
- Verlog builds on VeRL and BALROG, incorporating specialized optimization techniques to ensure stable, efficient training on tasks extending beyond 400 steps [3][20]
- The framework has been validated in complex environments such as BabyAI, BabaIsAI, and Crafter, demonstrating strong performance across varying episode lengths [3][19]

Group 2: Methodology
- The base model is the Qwen-2.5 Instruct variant, which integrates seamlessly with BALROG and allows benchmark prompts to be used with minimal modification [6][7]
- A memory mechanism retains only the latest n + 1 rounds of interaction, optimizing performance for the 3B-parameter Qwen model [9][10]

Group 3: Algorithmic Innovations
- The Dual Discounting GAE algorithm decouples tokens from steps, encouraging agents to complete tasks in fewer environment steps [11][20]
- The recursive GAE calculation improves training stability, allowing effective learning even under sparse rewards [12][14]

Group 4: Experimental Results
- Verlog was tested on three challenging benchmarks, Crafter, BabyAI, and BabaIsAI, showcasing its ability to handle long-duration tasks with sparse rewards [16][19]
- Training Qwen2.5-7B-Instruct in the Crafter environment used 8 H100 GPUs for roughly 36 hours, while Qwen2.5-3B-Instruct for BabyAI and BabaIsAI was trained on 4 A40 GPUs for about 24 hours [19]

Group 5: Future Directions
- Verlog aims to serve as a flexible research platform to advance long-horizon LLM-agent reinforcement learning [21][20]
- The framework addresses key engineering challenges such as managing long interaction histories, ensuring training stability under sparse rewards, and handling variable trajectory lengths [20][23]
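The dual-discounting idea can be illustrated with a small backward GAE recursion. This is my reconstruction of the concept from the summary, not Verlog's implementation: the flat per-token layout, the boundary test, and the default discount values are all assumptions.

```python
# Hedged sketch of "dual discounting" GAE: advantages are computed backward
# over a flat token sequence, but the discount/lambda pair applied at each
# position depends on whether it crosses an environment-step boundary. This
# decouples token-level discounting (within a turn) from step-level
# discounting (across turns), penalizing long episodes without penalizing
# long reasoning inside a turn.

def dual_gae(rewards, values, step_ends, g_tok=1.0, l_tok=1.0, g_step=0.99, l_step=0.95):
    """rewards/values are per token; step_ends[i] is True if token i ends an env step."""
    adv = [0.0] * len(rewards)
    running = 0.0
    for i in reversed(range(len(rewards))):
        nxt = values[i + 1] if i + 1 < len(values) else 0.0
        g, l = (g_step, l_step) if step_ends[i] else (g_tok, l_tok)
        delta = rewards[i] + g * nxt - values[i]   # TD error with chosen discount
        running = delta + g * l * running          # standard GAE recursion
        adv[i] = running
    return adv
```

With `g_tok = l_tok = 1.0`, tokens inside a step pass advantage back undiscounted, while each step boundary applies the smaller `g_step * l_step` factor, which is what nudges the agent toward finishing in fewer environment steps.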
We are looking for partners in the embodied intelligence field......
具身智能之心· 2025-10-08 02:49
Core Viewpoint
- The company is seeking collaboration with global practitioners in the embodied intelligence field to strengthen capabilities in technical services, training, course development, and research guidance [1]

Group 1: Collaboration Opportunities
- Partners and small companies increasingly ask the company to empower them through solutions, data collection, technology upgrades, and corporate training [1]
- The company invites outstanding partners to join in driving significant industry progress [1]

Group 2: Compensation and Resources
- The company will offer high compensation and abundant industry resources to collaborators [2]

Group 3: Focus Areas
- Key focus areas include, but are not limited to: VLA, VLN, Diffusion Policy, reinforcement learning, VLA+RL, teleoperation, motion capture, sim2real, multimodal large models, simulation, motion control, end-to-end systems, and 3D perception [3]

Group 4: Job Description
- The positions primarily cover embodied course development, solution research and development, hardware development, and training collaboration, targeting both B-end (enterprises, universities, research institutes) and C-end (students, job seekers) audiences [4]

Group 5: Contact Information
- Interested parties can add WeChat oooops-life for further inquiries [5]
A "blind" robot stuns with a 30-second parkour debut while completely unable to see!
具身智能之心· 2025-10-07 03:03
Core Insights
- The article discusses advances in humanoid robotics, focusing on Amazon's FAR (Frontier AI for Robotics) team and its new technology, OmniRetarget, which enables robots to perform complex tasks without visual sensors [9][49]

Group 1: OmniRetarget Technology
- OmniRetarget allows reinforcement learning policies to learn long-horizon loco-manipulation skills in complex environments, achieving zero-shot transfer from simulation to humanoid robots [12][29]
- The technology uses an interaction mesh to model spatial and contact relationships between the robot, objects, and terrain, improving data efficiency and reducing data-collection costs [15][25]
- OmniRetarget outperforms other motion-retargeting methods on key criteria such as hard constraints, object interaction, terrain interaction, and data augmentation [16][40]

Group 2: Experimental Results
- The research team demonstrated OmniRetarget's broad capabilities, including natural object manipulation and terrain interaction, achieving a 79.1% success rate on augmented datasets [39][42]
- In comparative tests, OmniRetarget showed superior kinematic-quality metrics, such as penetration and contact preservation, outperforming baseline methods [41][42]
- Its high-quality retargeted motions directly improve downstream reinforcement learning policy success rates by more than 10% over baselines [42]

Group 3: Team and Background
- Amazon's recently established FAR team is led by prominent robotics scholars, including researchers from the renowned company Covariant [43][44]
- The team aims to revolutionize automation in humanoid robotics, marking Amazon's first significant foray into this area [49][50]
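The interaction-mesh idea can be illustrated with a toy scoring function. This is a deliberate simplification of my own, not OmniRetarget's formulation: it records pairwise relative vectors between keypoints and scores a retargeted pose by how much those vectors change, whereas the real system operates on a full mesh with constraints.

```python
# Illustrative sketch of the interaction-mesh idea: record the relative
# vectors between selected keypoints on the body, object, and terrain in
# the source motion, then score a retargeted pose by how much those
# relative vectors deviate. Minimizing this score preserves spatial and
# contact relationships rather than raw joint angles.

def relative_vectors(points):
    """All pairwise relative vectors between keypoints (tuples of coordinates)."""
    return {(i, j): tuple(b - a for a, b in zip(points[i], points[j]))
            for i in range(len(points)) for j in range(len(points)) if i < j}

def mesh_deviation(src_points, tgt_points):
    """Sum of squared differences between source and target relative vectors."""
    src, tgt = relative_vectors(src_points), relative_vectors(tgt_points)
    return sum(sum((t - s) ** 2 for s, t in zip(src[k], tgt[k])) for k in src)
```

Note the invariance this buys: rigidly translating every keypoint leaves all relative vectors unchanged, so the score penalizes only distortions of the spatial relationships, exactly the property one wants when mapping a human motion onto a robot with different limb proportions.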