Reinforcement Learning
Post-2000s Founders Enter Embodied Intelligence, Taking Aim at the Robotics World's "Model 3": a 21-Degree-of-Freedom Dexterous Hand Already Launched
量子位· 2025-06-22 04:46
Core Viewpoint
- Lingchu Intelligent has launched a self-developed dexterous hand with 21 degrees of freedom, targeting operational precision beyond what common 6-degree-of-freedom grippers can achieve [1][3][2]

Group 1: Product Features and Goals
- The dexterous hand supports 16 active degrees of freedom, enabling complex tasks such as gripping, rotating, and precise insertion [1][2]
- The company aims to bring the price of a complete robot system down to $10,000 (approximately 71,885 yuan), echoing Tesla's Model 3 pricing strategy [3][29]
- Lingchu's humanoid robot uses a "wheeled base + two dexterous hands" design, moving away from traditional gripper layouts [4][6]

Group 2: Technical Challenges and Innovations
- Achieving 21 degrees of freedom brings significant manufacturing and stability challenges, making it difficult to lower production costs [3][10]
- Lingchu argues that tasks such as tool use and precision assembly demand dexterity and degrees of freedom beyond what simple grippers can offer [8][10]
- The company has adopted a layered end-to-end algorithm architecture that combines fast and slow processing to improve task execution and decision-making [22][23]

Group 3: Market Strategy and Vision
- Lingchu's strategy is to integrate hardware and software deeply, defining the user experience at the system level rather than selling individual components [27][26]
- The company aims to build a complete product ecosystem spanning the robot, action systems, data, and task delivery, positioning itself for scale [28][29]
- Lingchu is working to make humanoid robots commercially viable, much as Tesla's Model 3 transformed the electric vehicle market [36][35]
From RLHF and PPO to GRPO and Training Reasoning Models: The Reinforcement Learning Primer You Need
机器之心· 2025-06-22 04:26
Core Insights
- Reinforcement Learning (RL) has become an essential technology in AI, particularly for large language models (LLMs) [1]
- The Unsloth team has released a comprehensive reinforcement learning tutorial covering concepts from RLHF to GRPO, accessible to beginners and advanced users alike [2][3]

Group 1: Understanding Reinforcement Learning
- The goal of reinforcement learning is to increase the likelihood of "good" outcomes while reducing the chances of "bad" ones [8][10]
- The key components of RL are the environment, the agent, actions, and the reward function, which together define the learning process [9][14]
- RLHF (Reinforcement Learning from Human Feedback) gained popularity largely through OpenAI's implementation, which trains a model to generate outputs that humans rate as useful [16][19]

Group 2: GRPO and Its Advantages
- GRPO (Group Relative Policy Optimization) was developed to train reasoning models; unlike PPO (Proximal Policy Optimization), it removes the value model and relies on custom reward functions [22][24]
- GRPO estimates a baseline by sampling multiple outputs for each question and averaging their rewards, which guides the policy update [27][28]
- The approach yields significant memory savings and extends beyond coding and mathematics to tasks such as email automation and legal applications [30]

Group 3: Training with Unsloth
- Unsloth provides a detailed guide for training reasoning models with GRPO; local training of models up to 1.5 billion parameters requires a minimum of 5GB VRAM [44]
- The training loop generates multiple answer variants per question, scores them with a reward function, and updates the model weights accordingly (see the sketch following this entry) [45][57]
- Effective training requires a well-designed reward function and sufficient data, with at least 500 rows of data recommended for good results [49][50]

Group 4: Reward Functions and Validators
- Reward functions and validators play complementary roles in evaluating model outputs: the former assigns scores based on correctness and quality, while the latter verifies that the outputs are accurate [46][56]
- Example reward functions reward correct answers and penalize incorrect or overly verbose responses [61]
- Reward function design is critical, as a poorly constructed one can inadvertently degrade model performance [57]
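Since the tutorial's GRPO recipe revolves around sampling several answers per question, scoring them with a custom reward function, and normalizing rewards within the group, here is a minimal, dependency-free sketch of those core pieces. The reward values, the length-penalty threshold, and the `Completion` container are illustrative assumptions, not Unsloth's actual API.

```python
# A minimal sketch of the GRPO idea described above: sample a group of
# candidate answers per question, score each with a reward function, and
# use the group-relative (mean-normalized) reward as the advantage signal.
# The reward shaping below (reward correct answers, penalize verbosity) is
# a hypothetical illustration, not Unsloth's implementation.
from dataclasses import dataclass

@dataclass
class Completion:
    text: str    # full model output, including any chain of thought
    answer: str  # the final answer extracted from the output

def reward_fn(completion: Completion, gold_answer: str) -> float:
    """Toy reward: +2 for a correct answer, -1 for an incorrect one,
    with a small penalty for overly long responses."""
    score = 2.0 if completion.answer.strip() == gold_answer.strip() else -1.0
    if len(completion.text) > 2000:  # discourage rambling outputs
        score -= 0.5
    return score

def group_relative_advantages(completions: list[Completion], gold: str) -> list[float]:
    """GRPO-style advantages: reward minus the group mean, divided by the
    group standard deviation. No learned value model is needed."""
    rewards = [reward_fn(c, gold) for c in completions]
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # guard against identical rewards
    return [(r - mean) / std for r in rewards]

# Usage: 4 sampled variants of one question; correct answers get positive
# advantages, wrong ones negative.
group = [Completion("....", "42"), Completion("....", "41"),
         Completion("....", "42"), Completion("....", "7")]
print(group_relative_advantages(group, "42"))
```

Because the advantage is computed relative to the group's own mean, no separate value network is needed, which is where GRPO's memory savings over PPO come from.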
VR-Robo: real2sim2real, a New Paradigm for Robot Visual Reinforcement Learning Navigation and Locomotion Control!
具身智能之心· 2025-06-20 00:44
Core Viewpoint
- The article presents VR-Robo, a unified framework for legged-robot navigation and motion control that tackles the challenge of transferring strategies learned in simulation to real-world deployment [3][16]

Related Work
- Previous research has explored various ways to bridge the Sim-to-Real gap, but many approaches depend on specific sensors and struggle to balance high-fidelity rendering with accurate geometric modeling [3][4]

Solution
- The VR-Robo framework combines geometric priors from images to reconstruct consistent scenes, uses a GS-mesh hybrid representation to build interactive simulation environments, and employs neural reconstruction methods such as NeRF to generate high-fidelity scene images [4][5][16]

Experimental Analysis
- Comparative experiments against baseline methods, including imitation learning and textured-mesh approaches, evaluate the performance of the VR-Robo framework [11][12]
- Reported metrics include Success Rate (SR) and Average Reaching Time (ART), on which VR-Robo outperforms the baselines across difficulty levels (illustrated in the sketch following this entry) [14][15]

Summary and Limitations
- VR-Robo trains visual navigation strategies from RGB images alone, enabling autonomous navigation in complex environments without additional sensors; however, it currently applies only to static indoor environments and remains limited in training efficiency and in the structural accuracy of the reconstructed meshes [16]
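For readers unfamiliar with the two reported metrics, below is a small illustrative sketch, not code from the paper, of how Success Rate (SR) and Average Reaching Time (ART) are typically computed from episode logs; the `Episode` record and the sample values are hypothetical.

```python
# SR: fraction of episodes in which the robot reaches the goal.
# ART: mean time-to-goal, usually averaged over successful episodes only.
from dataclasses import dataclass

@dataclass
class Episode:
    reached_goal: bool
    time_to_goal: float  # seconds; meaningful only when reached_goal is True

def success_rate(episodes: list[Episode]) -> float:
    return sum(e.reached_goal for e in episodes) / len(episodes)

def average_reaching_time(episodes: list[Episode]) -> float:
    times = [e.time_to_goal for e in episodes if e.reached_goal]
    return sum(times) / len(times) if times else float("inf")

episodes = [Episode(True, 12.4), Episode(False, 60.0), Episode(True, 18.9)]
print(f"SR={success_rate(episodes):.2f}, ART={average_reaching_time(episodes):.1f}s")
```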
What Xpeng Wants Is More Than Just "Staying at the Table"
虎嗅APP· 2025-06-19 23:55
Core Viewpoint
- The article discusses the significant growth and strategic positioning of two electric vehicle manufacturers, Xiaopeng and Leap Motor, highlighting their sales performance, product strategies, and marketing approaches in a competitive market

Group 1: Sales Performance
- In the first five months of the year, both Xiaopeng and Leap Motor maintained rapid growth, with Leap Motor's sales increasing 161% year-on-year and Xiaopeng's 293% [3][4]
- Both companies reported substantial revenue growth in Q1, with Leap Motor's revenue up 187% and Xiaopeng's up 142% year-on-year [4]
- Net losses shrank by 87% for Leap Motor and 52% for Xiaopeng, indicating improved financial health [4]

Group 2: Product Strategy
- Xiaopeng's sales rebound is attributed to the successful launch of the MONA M03, a best-seller that has accounted for over 50% of Xiaopeng's monthly sales in several months [7]
- The MONA M03 is positioned as a cost-effective option with a CLTC range of 620 kilometers, which alleviates range anxiety for consumers [7][12]
- The vehicle includes user-friendly features such as smart parking and enhanced comfort, appealing to a younger demographic [12][14]

Group 3: Marketing and Branding
- Xiaopeng has adopted an aggressive marketing strategy, including multiple product launches and media events to increase brand visibility [4][6]
- The company has attracted a significant female consumer base, with women accounting for 50% of MONA M03 orders, a notable increase over the market average [16][14]
- Xiaopeng's marketing events are designed to resonate with younger consumers, incorporating engaging elements and celebrity endorsements [16][18]

Group 4: Technological Advancements
- Xiaopeng is focusing on technological innovation, introducing the self-developed "Turing AI chip" to enhance its autonomous driving capabilities [20][21]
- The company is leveraging large models and reinforcement learning to improve its autonomous driving technology, showcasing its commitment to advancing in-vehicle AI [28][30]
- Xiaopeng's AI team has validated the effectiveness of scaling laws in autonomous driving, indicating a strategic approach to enhancing vehicle intelligence [28][29]
What Xpeng Wants Is More Than Just "Staying at the Table"
Hu Xiu· 2025-06-19 23:13
Core Insights
- Both Leapmotor and Xpeng have significantly increased their sales, with Leapmotor growing 161% and Xpeng 293% year-on-year from January to May; Q1 revenues also rose substantially (Leapmotor up 187%, Xpeng up 142%), and net losses shrank by 87% and 52% respectively [2]
- Xpeng's proactive marketing and product launch strategy contrasts with Leapmotor's more reserved approach, indicating a different mindset in responding to market opportunities [2]
- Xpeng's recent product, the MONA M03, has been a key driver of its sales rebound, accounting for over 50% of monthly sales since its launch [7][12]

Sales and Marketing Strategy
- Xpeng's marketing strategy includes extensive media engagement and product launch events, such as the recent X9 launch in Hong Kong, which attracted nearly 500 media representatives [3][4]
- The company has focused on building a strong brand presence through promotional activities, including events targeting actual car owners [2][3]
- The MONA M03's competitive pricing and features, such as a 620 km range, have made it appealing to consumers, particularly by addressing range anxiety [9][8]

Product Development and Features
- The MONA M03 was designed around user needs, balancing cost control with essential features, which has resonated well with consumers [8][12]
- The vehicle adds enhancements like an electric tailgate and smart parking while simplifying other features to reduce costs [10][11]
- Xpeng's product team demonstrated efficiency in refining the MONA model within a short timeframe after acquiring it from Didi [12]

Consumer Demographics and Feedback
- The MONA M03 has attracted a notably high share of female buyers, with 38.6% of users being women, significantly above the industry average [18][19]
- Feedback from female users highlights the vehicle's aesthetics and practical features, contributing to its popularity with this demographic [20][21]
- Xpeng has adapted quickly to market feedback, introducing new interior options that appeal to female consumers and further boosting sales [21][25]

Technological Advancements
- Xpeng is focusing on technological innovation, particularly its self-developed "Turing AI chip", which will enhance the capabilities of its vehicles, including the upcoming G7 [27][30]
- The G7 will feature advanced computing power significantly exceeding that of competitors, part of Xpeng's strategy to differentiate itself in the market [30][31]
- The company is also exploring the application of scaling laws in AI to improve autonomous driving capabilities, indicating a commitment to ongoing technological development [40][42]

Future Outlook
- Xpeng's CEO has emphasized the importance of building a robust system rather than relying solely on individual product successes, indicating a long-term vision for the company [26][51]
- The company aims to maintain its focus on technological advancement and market responsiveness to secure its competitive position in the automotive industry [51]
How Does a Badminton Robot "See Clearly" and "Move Precisely"? (创新汇)
Ren Min Ri Bao· 2025-06-19 21:51
Group 1
- A new legged robot developed by the Swiss Federal Institute of Technology in Zurich (ETH Zurich) can predict the trajectory of a badminton shuttlecock and adjust its position to return it to a human opponent, showcasing advanced perception and whole-body coordination [2][3]
- The robot's shuttlecock tracking relies on a perception noise model that quantifies how its own movements affect target tracking, allowing it to cope with motion blur and occlusion [3][4]
- The robot can sustain rallies of 10 consecutive hits, with a near-100% success rate on shots landing in the center of the court, coordinating its 18 joints through a unified control framework [3][4]

Group 2
- The robot takes approximately 0.35 seconds on average from detecting an opponent's hit to executing its swing, leaving room to improve its perception and response capabilities [4]
- Future enhancements will integrate more sensors and optimize the visual algorithms, aiming to extend the robot's applications beyond sports to complex scenarios requiring rapid response and coordination [4]
- Legged robots are expected to gain traction in fields such as industry, entertainment, home life, and elder care, as advances in AI and robotics drive down production costs and expand functionality [5]
A Large Model for Recommendation? Decoding the OneRec Paper: How End-to-End Training Wins on Both Effectiveness and Cost
机器之心· 2025-06-19 09:30
Core Viewpoint
- The article discusses how large language models (LLMs) are transforming recommendation systems, highlighting Kuaishou's "OneRec" system, which aims to improve both the effectiveness and the cost-efficiency of recommendation [2][35]

Group 1: Challenges in Traditional Recommendation Systems
- Traditional recommendation systems face significant challenges, including low computational efficiency, conflicting optimization objectives, and an inability to leverage the latest AI advances [5]
- For instance, Kuaishou's SIM model reaches a Model FLOPs Utilization (MFU) of only 4.6%/11.2%, far below the 40%-50% achieved by LLMs (see the MFU sketch following this entry) [5][28]

Group 2: Introduction of OneRec
- OneRec is an end-to-end generative recommendation system built on an Encoder-Decoder architecture that models user behavior to improve recommendation accuracy [6][11]
- The system has demonstrated a tenfold increase in effective compute and raised MFU to 23.7%/28.8%, while cutting operating costs to just 10.6% of traditional methods [8][31]

Group 3: Performance Improvements
- OneRec has shown substantial gains in user engagement, achieving a 0.54%/1.24% increase in app usage duration and 0.05%/0.08% growth in the 7-day user lifecycle (LT7) [33]
- In local life-service scenarios, OneRec has driven a 21.01% increase in GMV and an 18.58% rise in the number of purchasing users [34]

Group 4: Technical Innovations
- The system employs a multi-modal fusion approach, integrating data such as video titles, tags, and user behavior to enhance recommendation quality [14]
- OneRec's architecture enables significant computational optimizations, including a 92% reduction in the number of key operators, which improves overall efficiency [27][28]

Group 5: Future Directions
- Kuaishou's technical team identifies areas for further improvement, including stronger inference capabilities, a more integrated multi-modal architecture, and a reward system better aligned with user preferences [38]
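As a quick aid to the MFU figures quoted above, the sketch below shows the standard back-of-the-envelope definition of Model FLOPs Utilization: FLOPs the model actually executes per second divided by the hardware's theoretical peak. The step cost, throughput, and peak numbers are made-up placeholders, not Kuaishou's measurements.

```python
# MFU = achieved model FLOPs per second / theoretical peak FLOPs per second.
# A low MFU (like SIM's 4.6%/11.2%) means the hardware mostly sits idle;
# OneRec's 23.7%/28.8% indicates far better utilization of the same chips.
def mfu(model_flops_per_step: float, steps_per_second: float,
        peak_flops_per_second: float) -> float:
    achieved = model_flops_per_step * steps_per_second
    return achieved / peak_flops_per_second

# Hypothetical example: a 2e12-FLOP forward/backward step run 60x per second
# on an accelerator with a 1e15 FLOP/s peak gives MFU = 12%.
print(f"MFU = {mfu(2e12, 60, 1e15):.1%}")
```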
From OpenAI Back to Tsinghua, Wu Yi Opens Up About His Reinforcement Learning Path: "Picked at Random", and Joking About "the Me Who Didn't Understand Equity Back Then" | AGI Tech 50
AI科技大本营· 2025-06-19 01:41
Core Viewpoint
- The article highlights the journey of Wu Yi, a prominent figure in the AI field, emphasizing his contributions to reinforcement learning and the development of open-source systems like AReaL, which aims to enhance reasoning capabilities in AI models [1][6][19]

Group 1: Wu Yi's Background and Career
- Wu Yi, born in 1992, excelled in computer science competitions and was mentored by renowned professors at Tsinghua University and UC Berkeley, leading to significant internships at Microsoft and Facebook [2][4]
- After completing his PhD at UC Berkeley, Wu joined OpenAI, where he contributed to notable projects, including the "multi-agent hide-and-seek" experiment, which showcased complex behaviors emerging from simple rules [4][5]
- In 2020, Wu returned to China to teach at Tsinghua University, focusing on integrating cutting-edge technology into education and research while exploring industrial applications [5][6]

Group 2: AReaL and Reinforcement Learning
- AReaL, developed in collaboration with Ant Group, is an open-source reinforcement learning framework designed to enhance reasoning models, providing efficient and reusable training solutions [6][19]
- The framework addresses the need for models to "think" before generating answers, a concept that has gained traction in recent AI developments [19][20]
- AReaL differs from traditional RLHF (Reinforcement Learning from Human Feedback) by focusing on improving the intelligence of models rather than merely making them compliant with human expectations [21][22]

Group 3: Challenges in AI Development
- Wu Yi discusses the significant challenges of entrepreneurship in the AI sector, emphasizing the critical nature of timing and the risks of missing key opportunities [12][13]
- The growth of model sizes presents new challenges for reinforcement learning, as modern models can have billions of parameters, necessitating adaptations in training and inference processes [23][24]
- Data quality and system efficiency are critical in training reinforcement learning models, arguably more so than algorithmic advancements [30][32]

Group 4: Future Directions in AI
- Wu Yi expresses optimism about future breakthroughs in AI, particularly in areas like memory expression and personalization, which remain underexplored [40][41]
- While multi-agent systems are valuable, they may not be essential for all tasks, as advancements in single models could render multi-agent approaches unnecessary [42][43]
- The ongoing pursuit of scaling laws indicates that improvements in model performance will remain a focal point for researchers and developers [26][41]
[GF Securities Financial Engineering] Reinforcement Learning and Price-Based Market Timing
Core Viewpoint
- The article explores the potential of Reinforcement Learning (RL) in quantitative investment, particularly for timing strategies that maximize cumulative returns through a trial-and-error learning mechanism [1][2]

Summary by Sections
1. Introduction to Reinforcement Learning
- Reinforcement Learning is a machine learning method that lets a decision-making system learn the optimal action in each situation so as to maximize cumulative reward; it is particularly suited to environments with clear goals but no direct guidance on how to reach them [6][12]
2. Timing Strategy
- The report centers on a Double Deep Q-Network (DDQN) model that takes 10-minute-frequency price and volume data as input and learns to emit buy/sell/hold signals at each time step to maximize end-of-period returns; in backtesting, the model outputs a timing signal every 10 minutes under a t+1 trading rule (a minimal DDQN sketch follows this entry) [2][3]
3. Empirical Analysis
- The strategy was tested on several liquid ETFs and stocks from January 1, 2023 to May 31, 2025, generating 72, 30, 73, and 188 timing signals across the four assets, with average win rates of 52.8%, 53.3%, 54.8%, and 51.6%; cumulative returns outperformed the benchmark assets by 10.9%, 35.5%, 64.9%, and 37.8% [3][74][80]
4. Summary and Outlook
- Despite RL's impressive performance in other fields, challenges such as training stability remain in quantitative investment; future reports will explore more RL algorithms in search of superior strategies [5]
5. Data Description
- The timing strategy was applied to the CSI 300 Index, CSI 500 Index, CSI 1000 Index, and one individual stock, using the liquid ETFs tracking these indices; training data spanned January 1, 2014 to December 31, 2019, with separately defined validation and testing periods [74][75]
6. Performance Metrics
- Reported metrics include total return, annualized return, maximum drawdown, annualized volatility, Sharpe ratio, information ratio, and return-to-drawdown ratio, demonstrating the strategy's effectiveness relative to the benchmark assets [77][80]
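As a concrete reference for the model described in the Timing Strategy section, below is a minimal PyTorch sketch of a Double DQN update for a three-action (sell/hold/buy) timing policy. The feature count, network width, and hyperparameters are illustrative assumptions, not the report's actual configuration.

```python
# Double DQN (DDQN): the online network selects the next action while the
# target network evaluates it, reducing the Q-value overestimation bias of
# vanilla DQN. States would be 10-minute price/volume features in practice.
import torch
import torch.nn as nn

N_FEATURES, N_ACTIONS, GAMMA = 16, 3, 0.99  # actions: 0=sell, 1=hold, 2=buy

class QNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_FEATURES, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, N_ACTIONS))
    def forward(self, x):
        return self.net(x)

online, target = QNet(), QNet()
target.load_state_dict(online.state_dict())
opt = torch.optim.Adam(online.parameters(), lr=1e-3)

def ddqn_loss(s, a, r, s_next, done):
    """Double DQN target: action chosen by the online net,
    value read from the target net."""
    q = online(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_a = online(s_next).argmax(dim=1, keepdim=True)
        next_q = target(s_next).gather(1, next_a).squeeze(1)
        y = r + GAMMA * (1 - done) * next_q
    return nn.functional.smooth_l1_loss(q, y)

# One illustrative update on a random batch of 32 transitions.
s, s_next = torch.randn(32, N_FEATURES), torch.randn(32, N_FEATURES)
a = torch.randint(0, N_ACTIONS, (32,))
r, done = torch.randn(32), torch.zeros(32)
loss = ddqn_loss(s, a, r, s_next, done)
opt.zero_grad(); loss.backward(); opt.step()
```

In a full pipeline, transitions would come from a replay buffer built over the 2014-2019 training window, and the target network would be refreshed periodically from the online network.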
MiniMax Open-Sources Its First Reasoning Model: 456B Parameters, Performance Surpassing DeepSeek-R1, Technical Report Released
36Kr· 2025-06-17 08:15
Core Insights
- MiniMax has launched the world's first open-source large-scale hybrid-architecture reasoning model, MiniMax-M1, kicking off a five-day series of daily releases [2]

Model Specifications
- The M1 model has 456 billion parameters, activating 45.9 billion per token; it supports 1-million-token context inputs and the industry's longest reasoning output of 80,000 tokens, 8 times that of DeepSeek-R1 [4]
- Two versions of MiniMax-M1 were trained, with thinking budgets of 40k and 80k tokens [4]

Training and Cost
- Training used 512 H800 GPUs for three weeks at a cost of approximately $537,400 (around 3.859 million RMB), an order of magnitude below initial cost expectations [7]
- The M1 model is available for unlimited free use on the MiniMax app and web [7]

API Pricing Structure
- The API pricing for M1 is tiered by input length [7][11]:
  - 0-32k input: 0.8 RMB/million tokens input, 8 RMB/million tokens output
  - 32k-128k input: 1.2 RMB/million tokens input, 16 RMB/million tokens output
  - 128k-1M input: 2.4 RMB/million tokens input, 24 RMB/million tokens output
- Compared with DeepSeek-R1, M1's first-tier input price is 80% and its output price 50% of DeepSeek-R1's, while its second-tier input price is 1.2 times DeepSeek-R1's [9]

Performance Evaluation
- MiniMax-M1 outperforms models such as DeepSeek-R1 and Qwen3-235B on complex software engineering, tool use, and long-context tasks [13][14]
- On the MRCR test, M1 scores slightly below Gemini 2.5 Pro but above the other models compared [13]
- On the SWE-bench Verified test set, M1-40k and M1-80k score slightly below DeepSeek-R1-0528 but above other open-source models [14]

Technical Innovations
- M1 combines a mixture-of-experts (MoE) architecture with a lightning attention mechanism, allowing efficient scaling to long inputs and complex tasks [16]
- The model was trained with large-scale reinforcement learning (RL), using a new CISPO algorithm that improves performance by optimizing importance-sampling weights (a sketch follows this entry) [16][17]

Future Directions
- MiniMax emphasizes the need for "Language-Rich Mediator" agents to handle complex scenarios requiring dynamic resource allocation and multi-round reasoning [19]
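The summary above only says that CISPO optimizes importance-sampling weights; MiniMax's public description of the algorithm is that it clips the IS weight itself (and stops gradients through it) rather than clipping the token update the way PPO does, so no token's gradient contribution is dropped. The sketch below reconstructs that idea under those assumptions; the epsilon values and tensor shapes are illustrative, and this is an outside reconstruction, not MiniMax's code.

```python
# CISPO-style loss vs. PPO's clipped surrogate, per-token.
import torch

def cispo_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
               advantages: torch.Tensor,
               eps_low: float = 0.2, eps_high: float = 0.2) -> torch.Tensor:
    ratio = torch.exp(logp_new - logp_old.detach())
    # Clip the importance-sampling weight and detach it; the gradient then
    # flows only through logp_new, REINFORCE-style, so every token still
    # contributes a (bounded) policy-gradient signal.
    w = torch.clamp(ratio, 1 - eps_low, 1 + eps_high).detach()
    return -(w * advantages * logp_new).mean()

def ppo_loss(logp_new, logp_old, advantages, eps=0.2):
    # For contrast: PPO's min-of-clipped surrogate zeroes the gradient for
    # tokens whose ratio falls outside the clip range.
    ratio = torch.exp(logp_new - logp_old.detach())
    return -torch.min(ratio * advantages,
                      torch.clamp(ratio, 1 - eps, 1 + eps) * advantages).mean()

logp_new = torch.randn(8, requires_grad=True)
logp_old, adv = torch.randn(8), torch.randn(8)
print(cispo_loss(logp_new, logp_old, adv), ppo_loss(logp_new, logp_old, adv))
```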