Reinforcement Learning
Newsflash | ARR tops $500M faster than Cursor: AI expert platform Mercor reaches a $10 billion valuation with a $350 million raise
Z Potentials· 2025-10-29 05:16
Core Insights
- Mercor has successfully completed a $350 million financing round, raising its valuation to $10 billion [1]
- The company initially started as an AI-driven recruitment platform but has pivoted to providing domain experts for AI model training [1][2]
- Mercor's annual recurring revenue is projected to exceed $500 million, outpacing competitors [2]

Financing and Valuation
- Felicis Ventures led the previous $100 million Series B round at a $2 billion valuation and continues to lead the current round [1]
- The company had previously set a target of $8 billion for its Series C round but has since increased it to $10 billion due to strong investor interest [1]

Business Model and Operations
- Mercor charges for talent recommendation and matching services based on hourly work from domain experts [1]
- The company currently pays contractors over $1.5 million daily and has a talent pool of over 30,000 experts, with an average hourly income exceeding $85 [4]
- The focus areas for Mercor include expanding its talent network, optimizing contractor-client matching systems, and developing new products for greater process automation [4]

Market Context
- The shift in partnerships among leading AI labs, such as OpenAI and Google DeepMind, has created opportunities for Mercor, especially after Scale AI lost significant contracts [2]
- The rapid development of AI technology poses challenges in understanding the economic value of work, which Mercor aims to address [2]
A single demonstration is enough to grasp anything: Peking University team achieves a breakthrough in general grasping that adapts to all dexterous-hand embodiments
量子位· 2025-10-29 05:11
Core Insights
- The article discusses the challenges of traditional reinforcement learning (RL) in high-dimensional action spaces for robotic grasping tasks and introduces the DemoGrasp framework as a solution [1][2][4]

Group 1: DemoGrasp Framework
- DemoGrasp is a simple and efficient learning method for general robotic grasping, initiated from a single successful demonstration trajectory [2][4]
- The framework transforms multi-step Markov Decision Processes (MDP) into a single-step MDP by editing demonstration trajectories, enhancing learning efficiency and performance transfer to real robots [4][7]

Group 2: Learning Process
- The learning process involves editing the robot's actions in the demonstration trajectory to adapt to different objects and poses, focusing on wrist and finger adjustments [9][16]
- DemoGrasp utilizes a simulation environment with thousands of parallel worlds to train the policy network, which outputs editing parameters based on observations [10][11]

Group 3: Training Efficiency
- The training efficiency is notable, with a single RTX 4090 GPU achieving over 90% success rate in just 24 hours on a compact action space [12]
- The framework can adapt to various robotic hands without adjusting training hyperparameters, achieving an average success rate of 84.6% across 175 objects [20]

Group 4: Performance Metrics
- DemoGrasp outperforms existing methods on the DexGraspNet dataset, achieving a visual policy success rate of 92% with minimal generalization gap [17][18]
- In real-world tests, DemoGrasp successfully grasped 110 unseen objects, maintaining over 90% success rates for regular objects and 70% for challenging flat and small objects [21][22]

Group 5: Future Directions
- The framework aims to support more complex tasks such as functional grasping and tool usage, with potential for real-time adjustments and error recovery in future research [25][26]
- DemoGrasp can integrate with multimodal large models for autonomous grasping in open environments [27]
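The core mechanism summarized above, collapsing the multi-step grasping MDP into a single-step decision in which the policy observes the scene once and outputs parameters that edit a fixed successful demonstration, can be illustrated with a rough PyTorch-style sketch. Everything below is a simplified assumption for illustration (the network sizes, the REINFORCE-style update, and the hypothetical `simulate_rollout` helper), not the authors' released code.

```python
import torch
import torch.nn as nn

class EditPolicy(nn.Module):
    """Maps an observation to edit parameters applied to one fixed demonstration."""
    def __init__(self, obs_dim: int, wrist_dim: int = 6, finger_dim: int = 16):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU())
        self.mean = nn.Linear(256, wrist_dim + finger_dim)
        self.log_std = nn.Parameter(torch.zeros(wrist_dim + finger_dim))

    def forward(self, obs):
        h = self.trunk(obs)
        return torch.distributions.Normal(self.mean(h), self.log_std.exp())

def train_step(policy, optimizer, obs_batch, demo, simulate_rollout):
    """Single-step MDP: one action (the trajectory edit) per episode, reward = grasp success."""
    dist = policy(obs_batch)
    edits = dist.sample()                        # wrist-pose offset + finger-joint deltas
    rewards = simulate_rollout(demo, edits)      # parallel sim replays the edited demo, returns 0/1
    log_prob = dist.log_prob(edits).sum(-1)
    loss = -(log_prob * (rewards - rewards.mean())).mean()   # REINFORCE with a mean baseline
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return rewards.mean().item()
```

Because each episode reduces to a single sampled edit followed by a deterministic replay, the policy update above needs no value bootstrapping or long-horizon credit assignment, which is consistent with the efficiency figures reported in the summary.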
The father of AlphaGo finds a new way to create reinforcement learning algorithms: let AI design them itself
机器之心· 2025-10-28 04:31
Core Insights
- The article discusses a significant advancement in reinforcement learning (RL) where Google's DeepMind team has demonstrated that machines can autonomously discover state-of-the-art RL algorithms, outperforming human-designed rules [1][5]

Methodology
- The research employs meta-learning based on the experiences of numerous agents in complex environments to discover RL rules [4][7]
- The team utilized two types of optimization: agent optimization and meta-optimization, allowing the agent to update its parameters to minimize the distance between its predictions and the targets set by a meta-network [7][19][22]

Experimental Results
- The discovered RL rule, named DiscoRL, was evaluated using the Atari benchmark, achieving a normalized score of 13.86, surpassing all existing RL methods [26][29]
- Disco57, a variant of DiscoRL, demonstrated superior performance on previously unseen benchmarks, including ProcGen, indicating its strong generalization capabilities [33][34]

Generalization and Robustness
- Disco57 showed robustness across various agent-specific settings and environments, achieving competitive results without using domain-specific knowledge [36][35]
- The research highlights the importance of diverse and complex environments for the discovery process, leading to stronger and more generalizable rules [39][40]

Efficiency and Scalability
- The discovery process was efficient, requiring significantly fewer experiments compared to traditional methods, thus saving time and resources [40]
- The performance of the discovered rules improved with the number and diversity of environments used for discovery, indicating a scalable approach [40]

Qualitative and Information Analysis
- Qualitative analysis revealed that the discovered predictions could identify significant events before they occurred, enhancing the learning process [45]
- Information analysis indicated that the discovered predictions contained unique information about upcoming rewards and strategies, which were not captured by traditional methods [46]

Emergence of Bootstrapping Mechanism
- Evidence of a bootstrapping mechanism was found, where future predictions influenced current targets, demonstrating the interconnectedness of the learning process [47]
- The performance of the discovered rules was significantly impacted by the use of these predictions for strategy updates, emphasizing their importance in the learning framework [47]

Conclusion
- This research marks a pivotal step towards machine-designed RL algorithms that can compete with or exceed the performance of human-designed algorithms in challenging environments [48]
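The two nested optimizations mentioned under Methodology (the agent moving its predictions toward targets proposed by a meta-network, and the meta-network being updated so that the post-update agent performs better) can be sketched as a toy meta-gradient loop. This is a heavily simplified illustration under assumed shapes and a made-up performance proxy, not DeepMind's discovered rule or actual code.

```python
import torch

# Toy agent: a single linear prediction head (shapes are arbitrary assumptions).
agent = torch.nn.Linear(32, 8)
# Meta-network: proposes prediction targets from the observation and reward.
meta_net = torch.nn.Sequential(
    torch.nn.Linear(32 + 1, 64), torch.nn.ReLU(), torch.nn.Linear(64, 8))
meta_opt = torch.optim.Adam(meta_net.parameters(), lr=1e-4)

def agent_update(obs, reward, lr=1e-2):
    """Inner loop: move the agent's predictions toward meta-proposed targets."""
    preds = agent(obs)
    targets = meta_net(torch.cat([obs, reward], dim=-1))
    inner_loss = ((preds - targets) ** 2).mean()
    grads = torch.autograd.grad(inner_loss, list(agent.parameters()), create_graph=True)
    # Return "virtual" updated parameters so the meta-gradient can flow through the update.
    return [p - lr * g for p, g in zip(agent.parameters(), grads)]

def meta_update(obs, reward, new_obs, returns):
    """Outer loop: adjust the meta-network so the *updated* agent scores better."""
    w, b = agent_update(obs, reward)
    new_preds = new_obs @ w.t() + b                            # evaluate the updated agent
    outer_loss = -(returns * new_preds.mean(dim=-1)).mean()    # crude performance proxy
    meta_opt.zero_grad()
    outer_loss.backward()     # gradient reaches meta_net through the inner update
    meta_opt.step()

# Usage with random toy data: obs [N, 32], reward [N, 1], returns [N].
obs, reward = torch.randn(16, 32), torch.randn(16, 1)
meta_update(obs, reward, torch.randn(16, 32), torch.randn(16))
```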
Why is there still plenty of RL work to be done on humanoid, quadruped, robotic-arm, and other embodiments?
具身智能之心· 2025-10-28 04:00
Core Insights
- Reinforcement Learning (RL) remains a significant field, with increasing applications in robotics, including humanoid and quadruped robots, as well as in product optimization across various industries [1][2][3]
- The complexity of RL poses challenges for newcomers, making it difficult to produce publishable research papers without a structured learning system [5][9]
- To address these challenges, a specialized 1v6 mentoring course in RL has been launched, aimed at helping students produce quality research papers [6][9]

Group 1: Importance of Reinforcement Learning
- RL is crucial for tasks such as gait control in embodied intelligent robots, which is essential for achieving general-purpose capabilities [2]
- Companies like Yushun and Zhiyuan utilize RL for humanoid robots to perform complex actions like climbing stairs, running, and dancing, enhancing their adaptability in various scenarios [2][8]
- The integration of RL with Vision-Language-Action (VLA) models in robotic arms is gaining popularity in academia, leading to more efficient and smooth robot operations [3][8]

Group 2: Challenges in Learning and Research
- The vast and intricate nature of RL makes it difficult for beginners to find a clear entry point, often resulting in frustration and abandonment of learning [5][9]
- Producing a research paper that meets the standards of peer review requires proficiency in methodology, experimental results, and writing style, which can be overwhelming for newcomers [5][9]

Group 3: Course Offerings and Structure
- The 1v6 mentoring course is designed for graduate students and others seeking guidance on research papers, featuring small class sizes and weekly live sessions [7][9]
- The course spans 14 weeks of intensive online training followed by 8 weeks of maintenance support, focusing on various aspects of RL and its applications in robotics [9][15]
- Participants will receive guidance on paper ideas, project implementation, experimental support, and writing refinement, with the goal of producing a draft suitable for submission to top conferences [7][9][15]

Group 4: Course Content and Deliverables
- The curriculum includes topics such as RL fundamentals, simulation environments, and specific applications in quadruped, humanoid, and robotic arm training [17][19]
- Students will engage in hands-on projects, culminating in a research paper draft that adheres to the requirements of conferences like RAL, ICRA, IROS, and CoRL [23][24]
- The course emphasizes a structured approach to research, covering the entire process from methodology to writing and submission [30]
Just now: Thinking Machines Lab blog proposes on-policy distillation, with Qwen name-checked 38 times
36Kr· 2025-10-28 02:00
Core Insights
- Thinking Machines Lab (TML) has introduced a new training method called on-policy distillation, which combines reinforcement learning (RL) error correlation with supervised fine-tuning (SFT) reward density, achieving superior performance at a lower cost [1][17]

Group 1: Methodology and Applications
- On-policy distillation is effective for small models, enhancing their domain performance and continuous learning capabilities [1][17]
- The method is inspired by the Qwen team's research and heavily utilizes the Qwen3 series models during experiments [3][34]
- The training process consists of three stages: pre-training, mid-training, and post-training, focusing on general capabilities, domain knowledge, and target behavior respectively [6][7]

Group 2: Advantages of On-Policy Distillation
- Small models trained with on-policy distillation often outperform larger general models in specialized fields due to benefits like local deployment, easier continuous training, and reduced inference costs [7][17]
- The method provides dense reward signals, allowing for more efficient learning compared to traditional RL, which offers sparse feedback [9][18]

Group 3: Performance and Cost Efficiency
- TML's experiments show that on-policy distillation can achieve performance comparable to RL at a fraction of the cost, with reported costs being only one-tenth of traditional RL methods [34][41]
- The method has demonstrated significant computational efficiency, requiring 7-10 times fewer gradient steps to achieve similar performance levels as RL [58]

Group 4: Continuous Learning and Personalization
- On-policy distillation is positioned as a promising tool for continuous learning, allowing models to update without degrading previously learned behaviors [66][70]
- The approach can effectively personalize models, enabling them to adapt to specific tasks while retaining core capabilities [42][53]
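As a rough illustration of the "dense reward signal" point above: in on-policy distillation the student samples its own continuation and every token position is scored against the teacher's distribution, rather than waiting for a sparse end-of-episode reward. The sketch below assumes both models have already scored the student's sampled tokens; the function name and tensor layout are hypothetical, not TML's implementation.

```python
import torch.nn.functional as F

def on_policy_distill_loss(student_logits, teacher_logits):
    """Mean reverse KL(student || teacher) over the student's own sampled sequence.

    Both logits tensors have shape [batch, seq, vocab] and are computed on the
    same tokens, which were sampled from the student (hence "on-policy").
    """
    log_p_student = F.log_softmax(student_logits, dim=-1)
    log_p_teacher = F.log_softmax(teacher_logits, dim=-1)
    # Dense, per-position signal: the teacher grades every token the student produced.
    reverse_kl = (log_p_student.exp() * (log_p_student - log_p_teacher)).sum(dim=-1)
    return reverse_kl.mean()
```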
Thinking Machines' new research goes viral! Combining the advantages of RL and fine-tuning, small-model training is now more cost-effective
量子位· 2025-10-28 01:18
Core Insights
- The article discusses the innovative research by Thinking Machines Lab, focusing on a new training method for small language models called On-Policy Distillation, which enhances their understanding of specialized fields [1][4]

Summary by Sections

Methodology
- On-Policy Distillation combines the strengths of two traditional training methods: reinforcement learning (self-exploration) and supervised fine-tuning (direct answers), creating a more efficient training framework [3][8]
- This method allows AI to learn through practical problem-solving while receiving immediate guidance when it encounters difficulties, improving training efficiency by 50-100 times [4][5]

Training Phases
- The training process consists of three main phases: pre-training (general capabilities), mid-training (domain-specific knowledge), and post-training (target behavior guidance) [9]
- The focus of the research is on the post-training phase, where the model learns to perform specific tasks effectively [6][9]

Evaluation Metrics
- The method uses negative reverse KL divergence as its key training signal: the student model is graded token by token on its own sampled outputs and learns by minimizing its divergence from the teacher model's distribution [12][15]

Experimental Results
- Experiment 1 demonstrated that, using On-Policy Distillation, a smaller model (8B) could achieve a performance score of 70% on a math benchmark with significantly lower computational costs compared to traditional methods [19][22]
- Experiment 2 showed that the method effectively mitigates "catastrophic forgetting" in AI models, allowing them to retain general capabilities while learning new knowledge [23][25]

Implications
- The research indicates that On-Policy Distillation can empower resource-constrained individuals or small companies to train effective specialized models, enhancing accessibility in AI development [5][19]
- The findings suggest a promising avenue for achieving lifelong learning in AI systems, addressing the challenge of balancing new knowledge acquisition with the retention of existing skills [26]
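A plausible formalization of the objective described under "Evaluation Metrics" (hedged, since the summary gives no explicit formula): with student policy $\pi_\theta$ and teacher $\pi_{\mathrm{T}}$, the student samples its own trajectories and minimizes the per-token reverse KL divergence, i.e. its per-token reward is the negative of

```latex
\mathcal{L}(\theta)
  = \mathbb{E}_{x_{1:T}\sim \pi_\theta}\!\left[
      \frac{1}{T}\sum_{t=1}^{T}
      D_{\mathrm{KL}}\!\big(\pi_\theta(\cdot \mid x_{<t}) \,\big\|\, \pi_{\mathrm{T}}(\cdot \mid x_{<t})\big)
    \right]
```

Because the expectation is taken over the student's own rollouts, the teacher only ever grades states the student actually visits, which is what distinguishes this from ordinary (off-policy) distillation on teacher-generated data.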
Just now: Thinking Machines Lab blog proposes on-policy distillation, with Qwen name-checked 38 times
机器之心· 2025-10-28 00:41
Core Viewpoint
- Thinking Machines Lab (TML) has introduced a new training method called on-policy distillation, which combines reinforcement learning (RL) error correlation with supervised fine-tuning (SFT) reward density, achieving superior performance at a lower cost compared to other methods [1][2][27]

Group 1: Methodology and Advantages
- On-policy distillation allows small models to exhibit strong domain performance and continuous learning capabilities [1][2]
- The training process is divided into three stages: pre-training for general capabilities, mid-training for domain knowledge, and post-training for guiding target behaviors [6][7]
- On-policy training samples trajectories from the student model itself, providing direct feedback to avoid errors, while off-policy training relies on external sources [8][9][12]

Group 2: Comparison with Other Methods
- On-policy distillation combines the advantages of on-policy training's reliability and the dense reward signals from SFT, making it a cost-effective alternative to traditional RL methods [28][92]
- In experiments, on-policy distillation achieved a score of 74.4% on the AIME'24 benchmark with significantly lower computational costs compared to RL, which required 17,920 GPU hours for a score of 67.6% [47][46]

Group 3: Applications and Future Directions
- The method has been successfully applied to train models for mathematical reasoning and to develop assistant models with domain knowledge and instruction-following capabilities [26][27]
- TML aims to continue exploring new applications of on-policy distillation, improving teacher supervision methods, and enhancing data efficiency and continuous learning [92][93]
Can drones play volleyball too? A Tsinghua team explores the way forward with reinforcement learning
具身智能之心· 2025-10-28 00:02
Core Insights
- The article discusses a new embodied AI task proposed by Tsinghua University, focusing on "multi-drone volleyball," which aims to enhance the capabilities of drones in a three-dimensional space through teamwork and strategy [1][2]

Group 1: Task Overview
- The "multi-drone volleyball" task requires drones to demonstrate high maneuverability and precise control while collaborating as a team to hit a ball over a net and compete against opposing teams [2]
- The Tsinghua team has developed the VolleyBots testing platform to simulate the human learning process in volleyball, incorporating various tasks for single and multiple drones [2][6]

Group 2: Algorithm Development
- The Hierarchical Co-Self-Play (HCSP) algorithm was designed to enable drones to learn cooperation, division of roles, and offensive/defensive transitions through hierarchical strategy learning and self-play mechanisms [2][12]
- The research incorporated various reinforcement learning and game-theoretic algorithms, with the HCSP showing an average win rate of 82.9% against multiple baseline algorithms [15]

Group 3: Training Phases
- The training process consists of three phases: low-level skill learning, high-level strategy game playing, and collaborative self-play, allowing drones to evolve their strategies and skills in a competitive environment [14]
- The drones demonstrated the ability to form clear roles during matches, such as defense, passing, and offense, and even developed new tactics like "setter's lob" during training [15]

Group 4: Real-World Application
- The JuggleRL system was introduced to enable drones to perform continuous juggling in the real world, achieving a record of 462 consecutive juggles without any real data fine-tuning [16][18]
- This achievement marks a significant step in embodied reinforcement learning, transitioning from virtual environments to real physical interactions [18][19]
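To make the three-phase training described above more concrete, here is a schematic of the final phase: self-play over a pool of frozen past opponents, sitting on top of pre-trained low-level flight skills. All interfaces (`env`, `act`, `store`, `update`) are hypothetical placeholders; this is a sketch of the general hierarchical self-play pattern, not the VolleyBots or HCSP code.

```python
import copy
import random

def self_play_train(high_level_policy, skills, env, n_iters=1000, snapshot_every=50):
    """The high-level policy picks a skill id; the chosen low-level skill outputs motor commands."""
    opponent_pool = [copy.deepcopy(high_level_policy)]      # start with a frozen copy of itself
    for it in range(n_iters):
        opponent = random.choice(opponent_pool)             # play against a past snapshot
        obs = env.reset(opponent=opponent)
        done = False
        while not done:
            skill_id = high_level_policy.act(obs)           # e.g. defend / set / spike
            action = skills[skill_id].act(obs)              # pre-trained low-level flight skill
            obs, reward, done, _ = env.step(action)
            high_level_policy.store(obs, skill_id, reward, done)
        high_level_policy.update()                          # any on-policy RL update (e.g. PPO)
        if (it + 1) % snapshot_every == 0:
            opponent_pool.append(copy.deepcopy(high_level_policy))  # grow the opponent pool
```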
Course officially wrapped! Industry veterans lead a three-month program on end-to-end autonomous driving
自动驾驶之心· 2025-10-27 00:03
Core Viewpoint
- 2023 marked the first year of end-to-end production deployment, and 2024 is expected to be another significant year for end-to-end systems in the automotive industry, as leading new-force automakers and established manufacturers have already achieved end-to-end production [1][3]

Group 1: End-to-End Production Development
- The automotive industry is witnessing rapid development in end-to-end methods, particularly the one-stage approach exemplified by UniAD, which directly models vehicle trajectories from sensor inputs [1][3]
- There are two main paradigms in the industry: one-stage and two-stage methods, with the one-stage approach gaining traction and leading to various derivatives based on perception, world models, diffusion models, and VLA [3][5]

Group 2: Course Overview
- A course titled "End-to-End and VLA Autonomous Driving" has been launched, focusing on cutting-edge algorithms in both one-stage and two-stage end-to-end methods, aimed at bridging academic and industrial advancements [5][15]
- The course is structured into several chapters, covering the history and evolution of end-to-end methods, background knowledge on VLA, and detailed discussions of both one-stage and two-stage approaches [9][10][12]

Group 3: Key Technologies
- The course emphasizes critical technologies such as BEV perception, vision-language models (VLM), diffusion models, and reinforcement learning, which are essential for mastering the latest advancements in autonomous driving [5][11][19]
- The second chapter of the course is highlighted as covering the technical keywords most frequently asked about in job interviews over the next two years [10]

Group 4: Practical Applications
- The course includes practical assignments, such as RLHF fine-tuning, allowing participants to apply their knowledge in real-world scenarios and understand how to build and experiment with pre-trained and reinforcement learning modules [13][19]
- The curriculum also covers various subfields of one-stage end-to-end methods, including those based on perception, world models, diffusion models, and VLA, providing a comprehensive understanding of the current landscape in autonomous driving technology [14][19]
HuggingFace and Oxford University team up on a new tutorial, open-sourcing a SOTA resource library!
具身智能之心· 2025-10-27 00:02
Core Viewpoint
- The article emphasizes the significant advancements in robotics, particularly in robot learning, driven by the development of large models and multi-modal AI technologies, which have transformed traditional robotics into a more learning-based paradigm [3][4]

Group 1: Introduction to Robot Learning
- The article introduces a comprehensive tutorial on modern robot learning, covering foundational principles of reinforcement learning and imitation learning, leading to the development of general-purpose, language-conditioned models [4][12]
- HuggingFace and Oxford University researchers have created a valuable resource for newcomers to the field, providing an accessible guide to robot learning [3][4]

Group 2: Classic Robotics
- Classic robotics relies on explicit modeling through kinematics and control planning, while learning-based methods utilize deep reinforcement learning and expert demonstration for implicit modeling [15]
- Traditional robotic systems follow a modular pipeline, including perception, state estimation, planning, and control [16]

Group 3: Learning-Based Robotics
- Learning-based robotics integrates perception and control more closely, adapts to tasks and entities, and reduces the need for expert modeling [26]
- The tutorial highlights the challenges of safety and efficiency in real-world applications, particularly during the initial training phases, and discusses advanced techniques like simulation training and domain randomization to mitigate risks [34][35]

Group 4: Reinforcement Learning
- Reinforcement learning allows robots to autonomously learn optimal behavior strategies through trial and error, showcasing significant potential in various scenarios [28]
- The tutorial discusses the complexity of integrating multiple system components and the limitations of traditional physics-based models, which often oversimplify real-world phenomena [30]

Group 5: Imitation Learning
- Imitation learning offers a more direct learning path for robots by replicating expert actions through behavior cloning, avoiding complex reward function designs [41]
- The tutorial addresses challenges such as compound errors and handling multi-modal behaviors in expert demonstrations [41][42]

Group 6: Advanced Techniques in Imitation Learning
- The article introduces advanced imitation learning methods based on generative models, such as Action Chunking with Transformers (ACT) and Diffusion Policy, which effectively model multi-modal data [43][45]
- Diffusion Policy demonstrates strong performance in various tasks with minimal demonstration data, requiring only 50-150 demonstrations for training [45]

Group 7: General Robot Policies
- The tutorial envisions the development of general robot policies capable of operating across tasks and devices, inspired by large-scale open robot datasets and powerful vision-language models [52][53]
- Two cutting-edge vision-language-action (VLA) models, π₀ and SmolVLA, are highlighted for their ability to understand visual and language instructions and generate precise control commands [53][56]

Group 8: Model Efficiency
- SmolVLA represents a trend towards model miniaturization and open-sourcing, achieving high performance with significantly reduced parameter counts and memory consumption compared to π₀ [56][58]
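As a concrete companion to the imitation-learning section summarized above: the simplest baseline the tutorial builds on, before ACT and Diffusion Policy, is behavior cloning, which regresses expert actions from observations with a supervised loss. The sketch below assumes in-memory tensors of paired observations and expert actions; it is illustrative, not code from the tutorial.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train_bc(observations, expert_actions, epochs=10, lr=1e-3):
    """observations: [N, obs_dim] float tensor; expert_actions: [N, act_dim] float tensor."""
    obs_dim, act_dim = observations.shape[1], expert_actions.shape[1]
    policy = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                           nn.Linear(256, 256), nn.ReLU(),
                           nn.Linear(256, act_dim))
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    loader = DataLoader(TensorDataset(observations, expert_actions),
                        batch_size=256, shuffle=True)
    for _ in range(epochs):
        for obs, act in loader:
            loss = nn.functional.mse_loss(policy(obs), act)  # clone the expert's action
            opt.zero_grad()
            loss.backward()
            opt.step()
    return policy
```

The MSE regression above is exactly where the compound-error and multi-modality problems mentioned in Group 5 arise, which is what motivates the generative approaches (ACT, Diffusion Policy) discussed in Group 6.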