Reinforcement Learning
Just "seeing" and "speaking" isn't enough; robots must also "compute"! Tool-Use + Reinforcement Learning: TIGeR Enables Precise Robot Manipulation
具身智能之心· 2025-10-11 16:02
Core Insights
- The article discusses the limitations of current Vision-Language Models (VLMs) in accurately interpreting and executing spatial commands in robotics, emphasizing the need for precise geometric reasoning and tool integration [2][5].

Group 1: TIGeR Framework
- The Tool-Integrated Geometric Reasoning (TIGeR) framework enhances VLMs by integrating tool usage and reinforcement learning to improve their ability to perform precise calculations in three-dimensional space [2][6].
- TIGeR allows AI models to transition from qualitative perception to quantitative computation, addressing the core pain points of existing VLMs [2][7].

Group 2: Advantages of TIGeR
- TIGeR provides precise localization by integrating depth information and camera parameters, enabling the accurate conversion of commands like "10 centimeters above" into three-dimensional coordinates [7].
- The framework supports multi-view unified reasoning, allowing information from various perspectives to be merged and reasoned over within a consistent world coordinate system [7].
- The model's reasoning process is transparent, making it easier to debug and optimize by clearly showing the tools used, the parameters input, and the results obtained [7].

Group 3: Training Process
- TIGeR is trained in two phases: supervised learning first teaches basic tool usage and reasoning chains, then reinforcement learning refines the model's tool-use skills through a hierarchical reward mechanism [8][10].
- The hierarchical reward mechanism evaluates not only the correctness of the final answer but also the accuracy of the process, including tool selection and parameter precision [8].

Group 4: Data Utilization
- The TIGeR-300K dataset, consisting of 300,000 samples, was created to train the model on geometric problems, ensuring both accuracy and diversity in the tasks covered [10][13].
- Dataset construction combined template-based generation with large-model rewriting to improve generalization and flexibility, ensuring the model can handle complex real-world instructions [13].

Group 5: Performance Metrics
- TIGeR outperforms other leading VLMs on spatial understanding benchmarks, scoring 93.85 on 2D-Rel and 96.33 on 3D-Depth [10][14].
- Its performance across spatial reasoning tasks demonstrates the ability to execute operations requiring precise three-dimensional positioning, which other models struggle to achieve [16].
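The "precise localization" capability described above amounts to standard pinhole back-projection: a pixel plus metric depth, combined with camera intrinsics, yields a 3D point that can then be offset by a metric command. A minimal sketch of that computation, with hypothetical intrinsics and a y-down camera frame (this illustrates the geometry, not TIGeR's actual tool API):

```python
import numpy as np

def pixel_to_camera_xyz(u, v, depth_m, fx, fy, cx, cy):
    """Back-project pixel (u, v) with metric depth into 3D camera coordinates."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m])

# Hypothetical intrinsics and a detected object pixel at 0.8 m depth.
fx = fy = 600.0
cx, cy = 320.0, 240.0
obj = pixel_to_camera_xyz(350, 260, 0.8, fx, fy, cx, cy)

# "10 centimeters above": in a y-down camera frame, subtract 0.10 m from y.
target = obj + np.array([0.0, -0.10, 0.0])
```

The key point is that depth and intrinsics turn a qualitative relation ("above the cup") into exact coordinates a controller can execute.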
Professor Ji Xiaoqiang's Lab at CUHK-Shenzhen Recruits Fully Funded PhD Students and Postdocs
具身智能之心· 2025-10-11 16:02
Core Viewpoint
- The article emphasizes the opportunities in the field of embodied intelligence, highlighting the need for skilled researchers and the benefits of joining a collaborative academic environment focused on artificial intelligence and robotics.

Research Content
- The research focuses on interdisciplinary areas such as artificial intelligence control theory, embodied intelligence control, and reinforcement learning control [11].
- Candidates are expected to have a deep understanding of and interest in the core research directions, with the ability to conduct theoretical innovation and experimental validation independently [2].

Candidate Requirements
- **Postdoctoral Researchers**: Must hold a PhD in relevant fields from prestigious institutions, with a strong publication record in top-tier journals or conferences [2].
- **PhD Candidates**: Should possess a master's degree or an outstanding bachelor's degree in related disciplines [3].
- **Master's Candidates**: Expected to have a bachelor's degree in relevant fields from recognized universities [5].
- Candidates should demonstrate a solid foundation in mathematics and programming, with a keen interest in control theory, AI, and robotics [4].

Skills and Experience
- Familiarity with deep learning and AI models such as CLIP, BLIP, and LLaVA is essential [6].
- Experience with classic models like VAE, Transformer, and BERT, along with strong algorithm design and programming skills, particularly in high-performance languages like C++ or Rust, is preferred [7][8].
- Practical experience in training, tuning, and deploying deep learning models is highly valued [12].

Mentor Introduction
- Professor Ji Xiaoqiang, with a PhD from Columbia University, leads the AI Control and Decision Laboratory at The Chinese University of Hong Kong (Shenzhen) [13].
- His research focuses on intelligent control systems, and he has published over 50 papers in top international journals and conferences [13].

Benefits and Compensation
- **Postdoctoral Researchers**: Eligible for annual pre-tax living allowances of 210,000 CNY, with additional subsidies and the potential for significant research funding [14].
- **PhD Candidates**: Full or half scholarships available, with top candidates eligible for a principal's scholarship of 180,000 CNY per year [15].
- **Master's Candidates**: Opportunities to transition to PhD programs and additional living stipends for outstanding candidates [16].

Application Materials
- Applicants must submit a complete CV in both Chinese and English, along with any published papers and evidence of research capabilities [19].
Tencent Open-Sources a New Reinforcement Learning Algorithm! Agents "Teach Themselves" Without Expert Demonstrations, with Plug-and-Play, Zero-Cost Integration
量子位· 2025-10-11 06:04
Contributed by the Youtu-Agent team
量子位 | WeChat official account QbitAI

Let agents explore new methods on their own, while imitating their own successful experience.

Tencent Youtu Lab has open-sourced a reinforcement learning algorithm: SPEAR (Self-imitation with Progressive Exploration for Agentic Reinforcement Learning). The pitch: AI that teaches itself!

The algorithm is the first to let agents driven by large language models (LLMs) achieve an entropy-stable learning process through "self-imitation + progressive exploration," without large amounts of expert demonstrations.

On benchmarks such as ALFWorld, WebShop, and AIME24/25 it improves average performance by more than 16%, setting new state-of-the-art results and providing a plug-and-play paradigm for training agents in long-horizon, sparse-reward settings.

△ Schematic of SPEAR's core concepts

In short, SPEAR can boldly try new methods while reliably exploiting strategies it has already validated, without going to either extreme. The details follow.

What is traditional self-imitation learning? Imagine a novice chef: self-imitation learning (SIL) brings this "copy only your own best homework" idea into reinforcement learning. Self-imitation 2.0: learn from your own "brilliant moves". The terminator of entropy collapse ...
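The "copy only your own best homework" idea behind self-imitation can be sketched as a buffer that stores only the agent's own above-baseline episodes and replays them for extra policy updates. A minimal illustration (names, capacity, and the baseline rule are hypothetical; this is not the SPEAR implementation):

```python
import random

class SelfImitationBuffer:
    """Keep only the agent's own above-baseline trajectories for replay."""

    def __init__(self, capacity=100):
        self.capacity = capacity
        self.trajectories = []  # list of (return, trajectory), best first

    def add(self, ret, traj, baseline):
        # Store only episodes that beat the running performance baseline.
        if ret > baseline:
            self.trajectories.append((ret, traj))
            self.trajectories.sort(key=lambda x: x[0], reverse=True)
            del self.trajectories[self.capacity:]  # evict the weakest

    def sample(self):
        # Replay one of the agent's own successes for an imitation update.
        return random.choice(self.trajectories)[1] if self.trajectories else None
```

SPEAR's contribution, per the article, is pairing this self-imitation signal with progressive exploration so that policy entropy stays stable instead of collapsing onto early successes.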
Embodied Robots Give Reinforcement Learning Many New Application Scenarios!
具身智能之心· 2025-10-11 00:02
Main Functions and Deployment Scenarios of Reinforcement Learning

For embodied robots, humanoid or quadruped alike, gait control is an unavoidable core task and a hurdle that must be cleared on the road to general embodiment. The dominant approach today is reinforcement learning: humanoid robots from companies such as Unitree (宇树) and AgiBot (智元) mostly learn their tasks this way, including stair climbing, hill climbing, running, dancing, somersaults, and other demanding motions, which equips the products for rescue, surveying, and hazardous-environment scenarios.

Beyond locomotion, VLA+RL approaches for robotic arms are increasingly popular in academia, with RL making robot execution more efficient, smooth, and fluid.

Reinforcement learning, however, covers a great deal of material and leans heavily on research experience. The field is large and the content complex; many newcomers have no idea how to get started, and publishing a paper is harder still. Producing a paper that meets the bar requires concentrated work on methodology, experimental results, and writing; a slip in any one of them can draw a low score from reviewers.

Without a complete learning system, you will stumble at every step and struggle to get started, ultimately ...
"Reasoning Models Are Still at the RNN Stage": A Transcript of Li Jianzhong's Conversation with Lukasz Kaiser, Inventor Behind GPT-5 and the Transformer
AI科技大本营· 2025-10-10 09:52
Core Insights
- The dialogue emphasizes the evolution of AI, particularly the transition from language models to reasoning models, highlighting the need for a new level of innovation akin to the Transformer architecture [1][2][4].

Group 1: Language and Intelligence
- Language plays a crucial role in AI development, with the emergence of large language models marking a significant leap in AI intelligence [6][8].
- Understanding language as a time-dependent sequence is essential for expressing intelligence, as it allows for continuous generation and processing of information [7][9].
- Current models exhibit the ability to form abstract concepts, similar to human learning processes, despite criticisms of lacking true understanding [9][10].

Group 2: Multimodal and World Models
- The pursuit of unified models for different modalities is ongoing, with current models like GPT-4 already demonstrating multimodal capabilities [12][13].
- There is skepticism regarding the sufficiency of language models alone for achieving AGI, with some experts advocating world models that learn physical-world rules through observation [14][15].
- Improvements in model architecture and data quality are necessary to bridge the gap between language and world models [15][16].

Group 3: AI Programming
- AI programming is seen as a significant application of language models, with potential shifts toward natural-language-based programming [17][19].
- Two main perspectives on the future of AI programming exist: one advocating AI-native programming and the other AI as a copilot, suggesting a hybrid approach [18][20].

Group 4: Agent Models and Generalization
- The concept of agent models is discussed, with generalization to new tasks being a key challenge [21][22].
- The effectiveness of agent systems relies on the ability to learn from interactions and utilize external tools, which is currently limited [22][23].

Group 5: Scaling Laws and Computational Limits
- The scaling laws of AI development are debated, with concerns that over-reliance on computational power may overshadow algorithmic advancements [24][25].
- The economic limits of scaling models are acknowledged, suggesting a need for new architectures beyond the current paradigms [25][28].

Group 6: Embodied Intelligence
- The slow progress in embodied intelligence, particularly robotics, is attributed to data scarcity and fundamental differences between bits and atoms [29][30].
- Future models capable of understanding and acting in the physical world are anticipated, requiring advances in multimodal training [30][31].

Group 7: Reinforcement Learning
- The shift toward reinforcement-learning-driven reasoning models is highlighted, with potential for significant scientific discoveries [32][33].
- Current limitations of RL training methods are acknowledged, emphasizing the need for further exploration and improvement [34].

Group 8: AI Organization and Collaboration
- The development of next-generation reasoning models is seen as essential for achieving large-scale agent collaboration [35][36].
- More parallel processing and effective feedback mechanisms in agent systems are needed to enhance collaborative capabilities [36][37].

Group 9: Memory and Learning
- The limitations of current models' memory capabilities are discussed, with a focus on the need for more sophisticated memory mechanisms [37][38].
- Continuous learning is identified as a critical area for future development, with ongoing efforts to integrate memory tools into models [39][40].

Group 10: Future Directions
- The potential for next-generation reasoning models to achieve higher data efficiency and generate innovative insights is highlighted [41].
Compute Costs Drop Sharply: The Markovian Thinker Arrives, Making LLM Reasoning Cost Linear
36Kr· 2025-10-10 07:27
Core Insights
- The article discusses the effectiveness and high costs of using reinforcement learning to enhance reasoning capabilities in large language models (LLMs) [1].
- A new paradigm called the Markovian Thinker is introduced, which bounds the computational complexity of reasoning in LLMs by maintaining a fixed state size [4][20].

Group 1: Markovian Thinker Concept
- The core idea of the Markovian Thinker is to reconstruct the components of reinforcement learning so that the effective state size remains bounded regardless of the total thinking length [4].
- This approach allows longer reasoning processes to require only linear computational resources and constant memory, decoupling the duration of model thinking from the amount of context it must handle [4][20].

Group 2: Delethink Implementation
- Delethink is a reinforcement learning environment that organizes the reasoning process into fixed-size chunks, resetting the context at chunk boundaries [4][9].
- Delethink yields linear scaling in both the generation and backpropagation phases, in contrast to the quadratic scaling of traditional LongCoT environments [6][15].

Group 3: Experimental Results
- Even with an 8K chunk size, the DeepSeek R1-Distill 1.5B model trained with Delethink can reason up to 24K tokens, outperforming LongCoT-RL on mathematical benchmarks [9][12].
- The model achieved 49% accuracy on a 96K-token reasoning task with minimal additional training steps, demonstrating significant efficiency improvements [14][15].

Group 4: Implications for Future Models
- The success of the Markovian Thinker indicates that decoupling thinking length from context size could enable next-generation reasoning models to handle millions of tokens effectively [20].
- The findings suggest that sequence architectures with non-quadratic complexity may greatly benefit reasoning models, since the thinking process can be effectively cast in a Markovian style [20].
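The chunked reasoning loop can be pictured as follows: the model generates in fixed-size passes, and only the prompt plus a short carryover tail crosses each chunk boundary, so the input stays bounded no matter how long the full trace grows. A minimal sketch (the `step_fn` interface and the carry size are illustrative, not Delethink's actual mechanism):

```python
def markovian_generate(step_fn, prompt, carry_size=512, max_chunks=12):
    """Chunk-wise reasoning with a bounded state (Delethink-style sketch).

    step_fn(context) -> (new_tokens, done) stands in for one fixed-budget
    generation pass. Because the context fed to each pass is bounded, the
    per-chunk cost is constant and the total cost is linear in the number
    of chunks, rather than quadratic in the full trace length.
    """
    carry = []   # bounded state carried across chunk boundaries
    trace = []   # full reasoning trace (never re-fed to the model)
    for _ in range(max_chunks):
        context = list(prompt) + carry        # bounded input at every step
        new_tokens, done = step_fn(context)
        trace.extend(new_tokens)
        carry = new_tokens[-carry_size:]      # reset context; keep only the tail
        if done:
            break
    return trace
```

The design choice that makes the RL problem Markovian is visible in the loop: the state handed to the next chunk is `prompt + carry`, never the accumulated `trace`.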
Compute Costs Drop Sharply! The Markovian Thinker Arrives, Making LLM Reasoning Cost Linear
机器之心· 2025-10-10 06:36
Core Insights
- The article discusses the effectiveness and high costs associated with using reinforcement learning to enhance reasoning capabilities in large language models (LLMs) [1].
- A new paradigm called the Markovian Thinker is introduced, which prevents quadratic growth in computational requirements by maintaining a fixed state size during reasoning [3][9].

Group 1: Markovian Thinker
- The Markovian Thinker redefines the structure of reinforcement learning so that the effective state size remains bounded regardless of the total thinking length, leading to linear computational requirements [9][32].
- The Delethink framework exemplifies this approach by organizing the reasoning process into fixed-size chunks and resetting the context at chunk boundaries [10][12].

Group 2: Performance and Efficiency
- Experiments show that Delethink allows models to think up to 24K tokens with significant performance improvements over traditional LongCoT methods, even achieving 49% accuracy on complex 96K-token tasks [20][23][26].
- Delethink's computational efficiency is notable: training required only 7 H100-months, compared to 27 H100-months for LongCoT-RL at an average thinking length of 94K tokens [26].

Group 3: Implications for Future Models
- The success of the Markovian Thinker suggests that decoupling thinking length from context size could enable future reasoning models to handle millions of tokens effectively [32][33].
- The findings indicate that architectures with non-quadratic complexity may significantly benefit reasoning models, allowing more efficient processing of thought sequences [33].
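The efficiency gap reported above is a direct consequence of the scaling difference. A back-of-the-envelope count of attended token pairs at a 96K-token thinking length, with illustrative chunk and carryover sizes (not the paper's exact cost accounting):

```python
def longcot_pairs(n_tokens):
    """LongCoT: token t attends to all t predecessors, so cost grows ~ n^2 / 2."""
    return n_tokens * (n_tokens + 1) // 2

def delethink_pairs(n_tokens, chunk=8192, carry=512):
    """Delethink-style chunking: attention stays within a bounded window, ~ O(n)."""
    n_chunks = -(-n_tokens // chunk)  # ceiling division
    # Within a chunk, token i attends to the carryover plus its i predecessors.
    per_chunk = sum(carry + i for i in range(1, chunk + 1))
    return n_chunks * per_chunk

quad = longcot_pairs(96_000)   # grows quadratically with thinking length
lin = delethink_pairs(96_000)  # grows linearly with the number of chunks
```

Under these assumed constants the quadratic count is roughly an order of magnitude larger at 96K tokens, and the gap keeps widening as thinking length grows, which is the intuition behind the 7 vs 27 H100-month figures.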
DemoGrasp: How Does a Single Demonstration Enable Universal Dexterous-Hand Grasping?
具身智能之心· 2025-10-10 00:02
Core Insights
- The article discusses DemoGrasp, a novel method for universal dexterous grasping that allows robots to learn grasping strategies from a single demonstration [2][3][6].

Group 1: Methodology
- DemoGrasp utilizes a simple and efficient reinforcement learning framework that enables any dexterous hand to learn a universal grasping strategy from just one successful grasping demonstration [6].
- The method edits the trajectory of robot actions to adapt to new objects and poses, determining grasp positions and styles through adjustments of the wrist pose and hand joint angles [2][3].

Group 2: Performance and Validation
- In simulation experiments, DemoGrasp achieved a 95% success rate using the Shadow hand on objects from the DexGraspNet dataset, outperforming existing methods [2].
- The method demonstrated excellent transferability, achieving an average success rate of 84.6% on six unseen object datasets despite being trained on only 175 objects [2].

Group 3: Applications and Capabilities
- The policy successfully grasped 110 previously unseen real-world objects, including small and thin items, and is robust to variations in spatial position, background, and lighting [3].
- DemoGrasp supports both RGB and depth inputs and can be extended to language-guided grasping tasks in cluttered environments [3].
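The "trajectory editing" in Group 1 can be pictured as re-anchoring the demonstrated wrist trajectory to a new object's pose: express each demo wrist pose relative to the demo object, then map it into the new object's frame. A minimal SE(3) sketch (function names and frame conventions are hypothetical, not DemoGrasp's code; finger joint angles would be adjusted separately, e.g. by the learned policy):

```python
import numpy as np

def retarget_demo(wrist_poses, T_demo_obj, T_new_obj):
    """Re-anchor a demonstrated wrist trajectory to a new object pose.

    wrist_poses: list of 4x4 homogeneous wrist poses from the single demo.
    T_demo_obj, T_new_obj: 4x4 object poses in the demo and the new scene.
    The rigid transform mapping the demo object onto the new object is
    applied to every wrist pose, so the grasp keeps its relative geometry.
    """
    delta = T_new_obj @ np.linalg.inv(T_demo_obj)  # demo-object -> new-object
    return [delta @ T for T in wrist_poses]
```

The appeal of this formulation is that a single demonstration becomes a template: adapting to a new object pose is one rigid-body edit rather than a new demonstration.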
DexCanvas: Can Embodied Data Really Not Break the "Pick Two of Scale, Realism, and Force Sensing" Barrier?
具身智能之心· 2025-10-10 00:02
Why Is Dexterous Grasping So Hard?

Over the past two years, the embodied-AI field has made notable progress at the cognition, perception, and planning levels, but getting robots to perform fine-grained hand manipulation in the physical world and execute complex dexterous operations the way humans do remains a very hard problem. The field has largely cracked human language understanding, object and scene recognition, and planning of concrete task steps, yet flexible grasping and force-aware regulation still face many open problems.

In real scenes, dexterous grasping must contend with precise control, high-dimensional motion planning, and real-time adaptation to dynamic environments; the task complexity demands robust mechanical design and advanced control algorithms.

The hardware behind dexterous manipulation is chiefly the dexterous hand, which falls into two classes: two-finger grippers and multi-finger anthropomorphic hands. Two-finger grippers are widely used for their reliability, simplicity, and ease of control, but with typically only one degree of freedom they struggle to fit complex tasks. Hence human-like dexterous hands with 20+ degrees of freedom have emerged; these anthropomorphic hands are better suited to interacting with objects and environments designed for humans.

1) Existing dexterous-grasping and data-collection approaches

Although major robotics companies at home and abroad keep releasing massive datasets, millions of trajectories and thousands of hours of demonstrations, the force-control information is missing. Dexterous-hand data seems bound by a law: of scale, realism, and force sensing, you can pick only two. The way the data is acquired rules out having all three!

Current learning methods for dexterous grasping fall into two classes: reinforcement learning and imitation learning. Imitation learning does not require building a complex world model or designing reward ...
Ren Shaoqing's Non-Consensus Views on Intelligent Driving: World Models, Long-Horizon Agents, and "Hardcore" Engineering
晚点Auto· 2025-10-09 12:17
Core Viewpoint
- The article discusses NIO's innovative approach to autonomous driving, emphasizing world models and reinforcement learning as key components for achieving advanced artificial general intelligence (AGI) in automotive technology [4][9][26].

Group 1: NIO's Approach to Autonomous Driving
- NIO positions itself as an AI company, developing autonomous driving technology through a unique combination of high computing power, multiple sensors, and a new architecture based on world models and reinforcement learning [5][8][34].
- The company has established a three-layer data system to support its autonomous driving capabilities, considered one of the most advanced in the industry [36][54].
- NIO's strategy shifts from traditional end-to-end models to a more complex world model that integrates spatial and temporal understanding, aiming to enhance the vehicle's ability to navigate real-world scenarios [10][13][26].

Group 2: Reinforcement Learning and World Models
- Reinforcement learning is viewed as essential for developing long-term decision-making capabilities in autonomous systems, moving beyond short-term imitation learning [7][29][33].
- The world model is defined as a high-bandwidth cognitive system that allows AI to understand and predict physical interactions in the environment, which is crucial for effective autonomous driving [10][16][26].
- NIO believes that integrating language models with world models will lead to a more comprehensive understanding of both concepts and physical realities, ultimately contributing to the development of AGI [13][28][33].

Group 3: Data Utilization and Training
- NIO combines real-world driving data with simulated environments, including gaming data, to train its models, ensuring a robust understanding of varied driving scenarios [27][30].
- The company emphasizes large-scale, diverse datasets for training rather than relying solely on expert data, which may lack the complexity of real-world situations [28][30].
- NIO's approach to data collection and training is designed to enhance performance in edge cases and improve overall safety [41][44].

Group 4: Future Developments and Industry Position
- NIO plans to introduce an open-set interaction system that allows more natural communication between users and the vehicle, moving beyond limited command sets [18][20].
- The company is committed to continuous innovation and exploration in autonomous driving, even in the face of initial skepticism from the industry [8][25][39].
- NIO's advancements are expected to position it ahead of competitors, particularly with the upcoming release of its open-set interaction capabilities [22][47].