Reinforcement Learning (RL)
The first RL paradigm for text-to-3D generation arrives, tackling geometric and physical plausibility
具身智能之心· 2025-12-20 16:03
Paper: https://arxiv.org/pdf/2512.10949  Code: https://github.com/Ivan-Tang-3D/3DGen-R1
Can reinforcement learning be applied to text-to-3D generation to strengthen the step-by-step reasoning and generation process of 3D autoregressive models? In large language models and text-to-image generation, RL has become a key method for improving chain-of-thought reasoning and generation quality. But does the recipe still hold when we turn to the more complex task of text-to-3D generation? A recent study jointly conducted by Northwestern Polytechnical University, Peking University, The Chinese University of Hong Kong, Shanghai AI Laboratory, and The Hong Kong University of Science and Technology systematically explores this question. In LLM reasoning and 2D text-to-image generation, RL has been shown to significantly boost CoT reasoning ability and generation quality, but 3D objects involve longer, denser token sequences with stronger geometric constraints. Research in this direction therefore faces several questions: 1. How can a reward jointly capture semantic alignment, geometric consistency, and visual quality? (A hedged sketch of one possible composite reward follows below.) 2. Are existing RL algorithms suited to autoregressive ...
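As a rough illustration of question 1, a composite reward for text-to-3D RL might weight several model-based scores; the weights, score sources, and names below are illustrative assumptions, not the design used in 3DGen-R1.

```python
# Hypothetical composite reward for text-to-3D RL (illustrative only; not
# the reward used in 3DGen-R1). Each term is assumed to come from a
# separately trained or off-the-shelf scorer, normalized to [0, 1].
from dataclasses import dataclass

@dataclass
class RewardWeights:
    semantic: float = 0.4   # text-shape alignment (e.g., a CLIP-style score)
    geometry: float = 0.4   # multi-view / mesh consistency score
    visual: float = 0.2     # rendered-image quality score

def composite_reward(sem_score: float, geo_score: float, vis_score: float,
                     w: RewardWeights = RewardWeights()) -> float:
    """Combine per-aspect scores into a single scalar reward for the RL objective."""
    return w.semantic * sem_score + w.geometry * geo_score + w.visual * vis_score

# Example: a sample that aligns well with the prompt but has weak geometry.
print(composite_reward(sem_score=0.9, geo_score=0.5, vis_score=0.7))  # 0.70
```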
Accuracy cut in half: large models' visual abilities "break down" once they leave daily life
36Kr · 2025-12-09 06:59
Core Insights
- The EgoCross project focuses on evaluating cross-domain first-person video question answering, revealing the limitations of existing MLLMs in specialized fields such as surgery, industry, extreme sports, and animal perspectives [1][3][4]

Group 1: Project Overview
- EgoCross is the first cross-domain EgocentricQA benchmark, covering four high-value professional fields and containing nearly 1,000 high-quality QA pairs [3][9]
- The project provides both closed (CloseQA) and open (OpenQA) evaluation formats, addressing a significant gap in the assessment of models in these specialized areas [3][9]

Group 2: Model Evaluation
- Eight mainstream MLLMs were tested, revealing that even the best-performing models had a CloseQA accuracy below 55% and OpenQA accuracy below 35% in cross-domain scenarios (a minimal scoring sketch follows after this summary) [4][9]
- The study found that reinforcement learning (RL) methods could significantly improve performance, with an average increase of 22% in accuracy [10][16]

Group 3: Task and Domain Challenges
- The research highlights the significant domain shift between everyday activities and specialized fields, with models performing well in daily tasks but struggling in professional contexts [8][9]
- The study identified that prediction tasks showed a more severe decline in performance compared to basic identification tasks [13][16]

Group 4: Improvement Strategies
- Three improvement methods were explored: prompt learning, supervised fine-tuning (SFT), and reinforcement learning (RL), with RL showing the most substantial performance gains [15][16]
- The findings suggest that current models have limitations in generalization, indicating a need for further development to create more capable multimodal systems [16]
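To make the CloseQA/OpenQA distinction concrete, here is a minimal sketch of how multiple-choice (CloseQA) accuracy might be scored on a benchmark like this; the data format and field names are assumptions, not EgoCross's actual schema.

```python
# Minimal CloseQA scoring loop (illustrative; field names are hypothetical,
# not the actual EgoCross data schema).
from typing import Callable

def closeqa_accuracy(samples: list[dict], model: Callable[[str, list[str]], str]) -> float:
    """samples: [{"question": str, "options": [str, ...], "answer": str}, ...]
    model(question, options) -> the option string the model selects."""
    correct = 0
    for s in samples:
        prediction = model(s["question"], s["options"])
        correct += int(prediction == s["answer"])
    return correct / len(samples)

# Open-ended (OpenQA) answers would instead be judged by semantic match,
# e.g. an LLM judge or embedding similarity, rather than exact string equality.
```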
Horizon Robotics RAD: an end-to-end driving policy based on large-scale reinforcement learning with 3DGS
自动驾驶之心· 2025-11-29 02:06
Core Insights
- The article discusses a novel approach to reinforcement learning (RL) for end-to-end (e2e) policy development in autonomous driving, utilizing 3D Gaussian Splatting (3DGS) to construct the training environments [1][2]
- The proposed method significantly reduces collision rates, achieving a threefold decrease compared to pure imitation learning (IL) [1]
- Limitations of the 3DGS environment include a lack of interaction, reliance on log replay, and inadequate rendering of non-rigid pedestrians and low-light scenarios [1]

Summary by Sections

Methodology
- The approach consists of three main phases: training a basic Bird's Eye View (BEV) and perception model, freezing perception to train a planning head using IL, and generating a sensor-level environment with 3DGS for mixed training of RL and IL [3][5][6]
- The training process involves pre-training perception models, followed by IL training on human expert data, and finally fine-tuning with RL to enhance sensitivity to critical risk scenarios [10][12]

State and Action Space
- The state space includes various encoders for BEV features, static map elements, traffic participant information, and planning-related features [7]
- The action space is discretized into lateral and longitudinal movements, with 61 actions in each dimension [8]

Reward Function
- The reward function penalizes collisions and deviations from expert trajectories, with specific thresholds for dynamic and static collisions as well as positional and heading deviations (a hedged sketch follows after this summary) [17][19]
- Auxiliary tasks are introduced to stabilize training and accelerate convergence, focusing on behaviors like deceleration and acceleration [20][23]

Experimental Results
- The results indicate that the proposed method outperforms other IL-based algorithms, demonstrating the advantages of closed-loop training in dynamic environments [28][29]
- The optimal ratio of RL to IL data is found to be 4:1, contributing to improved performance metrics [28]

Conclusion
- The article emphasizes the practical engineering improvements achieved through the integration of 3DGS in training environments, leading to better performance in autonomous driving applications [1][2]
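As a rough illustration of this kind of penalty-based reward, the sketch below combines collision and deviation penalties with hypothetical thresholds and weights; it is not the actual RAD reward specification.

```python
# Hypothetical per-step reward in the spirit of the description above
# (collision penalties plus deviation-from-expert penalties). Thresholds
# and weights are made up for illustration; they are not RAD's values.
def driving_reward(dyn_collision: bool, static_collision: bool,
                   pos_error_m: float, heading_error_rad: float,
                   pos_thresh: float = 0.5, heading_thresh: float = 0.1) -> float:
    reward = 0.0
    if dyn_collision:                        # hit a moving traffic participant
        reward -= 5.0
    if static_collision:                     # hit a static obstacle / left the road
        reward -= 5.0
    if pos_error_m > pos_thresh:             # drifted too far from the expert trajectory
        reward -= 1.0
    if heading_error_rad > heading_thresh:   # heading deviates from the expert's
        reward -= 1.0
    return reward

# Example: no collision, but the ego car drifted 0.8 m off the expert path.
print(driving_reward(False, False, pos_error_m=0.8, heading_error_rad=0.05))  # -1.0
```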
Ilya Sutskever's landmark 30,000-word interview: AI leaves the scaling era and returns to the essence of the "research era"
创业邦· 2025-11-27 03:51
Core Insights
- The AI industry is transitioning from a "Scaling Era" back to a "Research Era," emphasizing fundamental innovation over mere model size expansion [4][7][40]
- Current AI models exhibit high performance in evaluations but lack true generalization capabilities, akin to students who excel in tests without deep understanding [10][25]
- SSI's strategy focuses on developing safe superintelligence without commercial pressures, aiming for a more profound understanding of AI's alignment with human values [15][16]

Group 1: Transition from Scaling to Research
- The period from 2012 to 2020 was characterized as a "Research Era," while 2020 to 2025 is seen as a "Scaling Era," with a return to research now that computational power has significantly increased [4][7][40]
- Ilya Sutskever argues that simply scaling models will not yield further breakthroughs, as data and resources are finite, necessitating new learning paradigms [7][39]

Group 2: Limitations of Current Models
- Current models are compared to students who have practiced extensively but lack the intuitive understanding of true experts, leading to poor performance in novel situations [10][25]
- The reliance on pre-training and reinforcement learning has resulted in models that excel in benchmarks but struggle with real-world complexities, often introducing new errors while attempting to fix existing ones [20][21]

Group 3: Pursuit of Superintelligence
- SSI aims to avoid the "rat race" of commercial competition, focusing instead on building a safe superintelligence that can care for sentient life [15][16]
- Ilya emphasizes the importance of a value function in AI, akin to human emotions, which guides decision-making and learning efficiency [32][35]

Group 4: Future Directions and Economic Impact
- The future of AI is predicted to be marked by explosive economic growth once continuous learning challenges are overcome, leading to a diverse ecosystem of specialized AI companies [16][18]
- Ilya suggests that human roles may evolve to integrate with AI, maintaining balance in a world dominated by superintelligent systems [16][18]
Ilya's latest 20,000-word interview: human emotions are not baggage, but the "ultimate algorithm" AI is missing
36Kr · 2025-11-26 04:26
Core Insights
- The discussion centers on the limitations of current AI models and the new pathways toward superintelligence, emphasizing the disconnect between model performance in evaluations and real-world applications [3][4][20]
- Ilya Sutskever highlights the need to transition back to a research-focused paradigm, moving away from mere scaling of models, as the diminishing returns of scaling become evident [3][34]
- The concept of a "value function" is introduced as a critical element that enables human-like learning efficiency, which current AI lacks (a toy illustration follows after this summary) [3][5][6]

Group 1: Current AI Limitations
- Current AI models perform well in evaluation tests but often make basic errors in practical applications, indicating a lack of true understanding and generalization [4][18][20]
- The over-optimization of reinforcement learning (RL) for evaluations has led to models that excel in competitive programming but struggle with real-world problem-solving [4][21]
- Sutskever compares AI models to competitive programmers who are skilled in solving specific problems but lack the broader intuition and creativity of more versatile learners [4][22]

Group 2: Human Learning Insights
- Human learning is characterized by high sample efficiency, allowing individuals to learn complex skills with minimal data, attributed to innate value functions that guide decision-making [5][6][40]
- The evolutionary advantages in human learning, particularly in areas like vision and motor skills, suggest that humans possess superior learning algorithms compared to current AI systems [5][38]
- The discussion emphasizes the importance of emotional and intuitive feedback in human learning, which AI currently lacks [6][30][31]

Group 3: Strategic Directions for SSI
- Ilya Sutskever's new company, SSI, aims to explore safe superintelligence, advocating for a gradual release of AI capabilities to raise public awareness about safety [7][52]
- The shift from a secretive development approach to a more transparent, gradual release strategy is seen as essential for fostering a collaborative safety environment [7][52]
- SSI's focus on research over immediate market competition is intended to prioritize safety and ethical considerations in AI development [52][54]

Group 4: Research Paradigm Shift
- The transition from an era of scaling (2020-2025) back to a research-focused approach is necessary as the limits of scaling become apparent [34][46]
- Sutskever argues that while scaling has been beneficial, it has also led to a homogenization of ideas, necessitating a return to innovative research [34][46]
- The need for more efficient use of computational resources in research is highlighted, suggesting that breakthroughs may come from novel approaches rather than sheer scale [35][46]
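For readers unfamiliar with the RL term, a "value function" estimates how good a state is in terms of expected future reward. The tabular TD(0) update below is a standard textbook illustration of the concept, not anything drawn from the interview itself.

```python
# Textbook tabular TD(0) value-function update: V(s) estimates the expected
# future return from state s and is nudged toward the bootstrapped target
# r + gamma * V(s'). Purely illustrative of the "value function" concept.
from collections import defaultdict

def td0_update(V: dict, s, r: float, s_next, alpha: float = 0.1, gamma: float = 0.99):
    """One temporal-difference update after observing the transition (s, r, s')."""
    target = r + gamma * V[s_next]
    V[s] += alpha * (target - V[s])

V = defaultdict(float)
# Example transition: state "A" yields reward 1.0 and moves to state "B".
td0_update(V, "A", 1.0, "B")
print(V["A"])  # 0.1 (moved a fraction of the way toward the target of 1.0)
```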
Z Event | NeurIPS 2025 special session: RL x Agent, writing the final prophecy for AGI's 2026
Z Potentials· 2025-11-25 03:28
Core Insights
- The article emphasizes the growing importance of Reinforcement Learning (RL) and Agents in the context of large models, highlighting a shift from merely generating text to enabling models to perform actions through decision-making processes [1][2]

Group 1: Event Overview
- The NeurIPS 2025 event aims to create a relaxed environment for researchers and engineers from leading organizations like OpenAI, DeepMind, and Meta FAIR to discuss RL, decision-making, and the underlying capabilities of large models [1]
- The event will not feature formal presentations but will encourage informal discussions about technology, ideas, and experiences, fostering a collaborative atmosphere [1]

Group 2: Focus on RL and Agents
- There is a renewed focus on RL, moving beyond traditional fine-tuning methods to enable models to improve through interaction with the environment [2]
- The development of executable Agents requires a robust Action Layer, which is essential for models to perform tasks effectively [2][3]

Group 3: Industry Developments
- Platforms like Composio are emerging to build the next generation of AI Agents by creating an Action Layer that integrates various tools and APIs into a unified interface, highlighting the infrastructure needed for operational Agents (a minimal sketch of such an interface follows below) [3]
- Investment in AI infrastructure is being driven by funds like Hattrick Capital, which have been early supporters of AI advancements, particularly in the areas of Agents and robotics [4]
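To illustrate what an "Action Layer" might look like in practice, here is a minimal tool-registry sketch in Python; the class and method names are hypothetical and are not Composio's actual API.

```python
# Minimal "action layer" sketch: register tools behind one uniform call
# interface so an agent can invoke any of them by name. Hypothetical design;
# not Composio's actual API.
from typing import Callable, Any

class ActionLayer:
    def __init__(self):
        self._tools: dict[str, Callable[..., Any]] = {}

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        self._tools[name] = fn

    def invoke(self, name: str, **kwargs) -> Any:
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        return self._tools[name](**kwargs)

actions = ActionLayer()
actions.register("search_web", lambda query: f"results for {query!r}")
actions.register("send_email", lambda to, body: f"sent to {to}")

# An agent's decision step then reduces to choosing (tool name, kwargs) pairs.
print(actions.invoke("search_web", query="NeurIPS 2025 RL workshops"))
```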
GEO from an AI startup perspective: how to drive traffic, how to measure results, and where the opportunities are
Founder Park· 2025-08-10 01:33
Core Insights
- GEO (Generative Engine Optimization) is not a completely new concept but rather an evolution of SEO in the era of AI search and LLMs [2][4]
- There is ongoing debate about the potential of GEO as a significant business opportunity, with some viewing it as a new frontier while others see it as merely an extension of SEO [4][5]
- The article emphasizes the importance of understanding GEO's principles, strategies for content optimization, and monitoring effectiveness [5]

Group 1: Understanding GEO
- GEO is fundamentally about optimizing content for AI retrieval and summarization, focusing on making content easily accessible and understandable for AI systems [10][30]
- The shift from traditional SEO to GEO involves changes in how content is ranked and made visible, with LLMs generating structured responses that complicate traditional ranking methods [9][14]
- Effective GEO strategies include content optimization, evaluation metrics, and conducting commercial GEO experiments [9][10]

Group 2: Content Optimization Strategies
- RAG (Retrieval-Augmented Generation) workflows are essential for GEO, emphasizing the need for clear structure and readability in content [19][20]
- Content should be designed to be easily retrievable and quotable, with a focus on clarity and reducing ambiguity in expression [21][22]
- Strategies for enhancing content visibility include using specific terminology, avoiding vague references, and employing structured data formats like Schema.org (a minimal markup example follows after this summary) [27][28]

Group 3: Agent Optimization Strategies
- AEO (Agentic Engine Optimization) is a subset of GEO, focusing on optimizing content for agent-based interactions [30]
- Content should be task-oriented and contextually rich to facilitate agent understanding and action [31][32]
- Clear definitions and user-friendly documentation are crucial for enhancing agent interactions and ensuring effective task completion [33][34]

Group 4: Practical Implementation of GEO
- A closed-loop process of content creation, exposure, retention, and optimization is vital for successful GEO [36]
- Establishing authority signals (E-E-A-T) is important for building trust with AI systems, which prefer credible and expert sources [37]
- Continuous content updates and engagement with external authoritative sources can enhance visibility and credibility in AI-driven environments [38][39]

Group 5: Measuring GEO Effectiveness
- Evaluating the visibility and citation of content across AI search platforms is essential for understanding its impact [39][40]
- Various methods, such as SERP detection and AI citation monitoring, can be employed to assess content performance [40][41]
- Analyzing user behavior and conversion rates from AI-driven traffic can provide insights into the effectiveness of GEO strategies [44][46]

Group 6: GEO Tools and Companies
- Several tools and companies are emerging in the GEO space, focusing on enhancing visibility and citation in AI search environments [49][50]
- Platforms like Profound and Goodie AI are designed to optimize content for AI retrieval and improve brand exposure [56][57]
- The competitive landscape for GEO tools is evolving, with a focus on integrating AI capabilities into traditional SEO practices [66][68]
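As a concrete example of the Schema.org point, the snippet below builds a minimal JSON-LD block for an article page; the field values are placeholders, and which properties matter most for AI retrieval is an assumption rather than a guarantee.

```python
# Minimal Schema.org JSON-LD markup for an article page, built in Python.
# Field values are placeholders; the resulting <script> tag would go in the
# page's <head> so crawlers and AI retrieval systems can parse it.
import json

article_jsonld = {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "What is Generative Engine Optimization (GEO)?",
    "author": {"@type": "Person", "name": "Jane Doe"},
    "datePublished": "2025-08-10",
    "about": "GEO, AI search, LLM content optimization",
}

snippet = (
    '<script type="application/ld+json">\n'
    + json.dumps(article_jsonld, indent=2)
    + "\n</script>"
)
print(snippet)
```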
China Humanoid Robot | WAIC 2025 takeaways: broader applications, with wheel-based robot demos more common than bipedal
2025-07-29 02:31
Summary of WAIC 2025 Takeaways

Industry Overview
- The conference showcased significant advancements in the AI and robotics industry, with a 35% increase in venue size to 70,000 sqm and a 31% increase in ticket prices to Rmb168 per day, featuring 800 exhibitors (up 60% year-over-year) and over 1,200 speakers [1][2]

Core Insights
1. **Application Scenarios**: There was a more targeted exploration of application scenarios across various sectors including manufacturing, logistics, retail, and elderly care, indicating a shift towards early commercialization [2][7]
2. **Product Improvements**: Humanoid robots demonstrated meaningful product improvements, moving from static displays to engaging in interactive task demonstrations [2][8]
3. **Prototype Trends**: A noticeable shift towards AGV-style wheeled bases was observed, suggesting a pragmatic approach to achieving near-term commercial viability, which may negatively impact stocks related to planetary roller screw components [2][9]
4. **Cost Trends**: Cost curves for humanoid robots are decreasing but not significantly, with the lowest ASP reported at Rmb40,000 for Unitree's new model [2][14]
5. **Manipulation Challenges**: Manipulation remains a core challenge, with issues around success rates, robustness, and reliability still prevalent [2][12]

Notable Exhibitors and Innovations
- **Noematrix**: Showcased wheel-based prototypes performing various tasks, indicating a focus on practical applications [7][18]
- **Galbot**: Demonstrated retail automation robots capable of complex tasks, achieving efficiency levels comparable to human workers [17][18]
- **AgiBot**: Introduced multiple humanoid robots targeting various applications, including logistics and customer interaction [17]
- **Unitree**: Highlighted advancements in dynamic locomotion with their humanoid robots, showcasing improved autonomous capabilities [20]

Future Outlook
- The exhibition reinforced a constructive view on humanoid robots as a long-term technology trend, with expectations for a technology inflection point approaching, although not yet realized [3][12]
- Upcoming updates from Tesla's Gen 3 Optimus are anticipated to be significant for the sector [3]

Investment Recommendations
- **Sanhua Intelligent Controls**: Rated as a Buy due to growth potential in auto/EV thermal management and HVAC systems [21]
- **Zhejiang Supcon Technology Co.**: Also rated as a Buy, with strong market share in process automation and potential for vertical expansion [22]
- **Best Precision**: Neutral rating, with expectations of becoming a competitive supplier for humanoid robots [23]
- **Leader Harmonious Drive Systems**: Neutral rating, with potential growth in harmonic reduction gear applications [26]
- **Shanghai Baosight Software**: Neutral rating, with concerns over reliance on related-party transactions [27]

Conclusion
WAIC 2025 highlighted significant advancements in humanoid robotics, with a clear trend towards practical applications and commercialization. The investment landscape appears promising for select companies within the sector, although challenges remain in manipulation and cost efficiency.
MiniMax closed-door technical session: long context is the game changer for Agents
Founder Park· 2025-07-18 18:24
Core Insights
- The article discusses the advancements in Reinforcement Learning (RL) and its potential to enhance model capabilities, particularly in the context of limited context lengths and the importance of pre-training data diversity [6][8][10]

Group 1: RL and Model Capabilities
- RL can indeed provide new capabilities to models, especially when dealing with limited context lengths, by altering the output distribution and reducing the number of tokens needed to solve specific problems [6]
- The pass@k metric is highlighted as a useful measure for evaluating model capabilities, with the definition of k being crucial depending on the problem context (a standard estimator is sketched after this summary) [7]
- Reward modeling remains a significant challenge in RL, particularly for non-outcome-based rewards, which complicates the training process [7]

Group 2: Pre-training and Data Distribution
- Pre-training is essential for exposing models to diverse data distributions, which is currently more varied than the narrower distributions used in RL training [8]
- The article emphasizes that while RL can potentially fill gaps in pre-training, the quality and diversity of pre-training data are critical for effective model training [8]

Group 3: Long Context and Agent Workflows
- Long context windows are identified as game-changers for agent workflows, allowing for the processing of extensive information in a single pass, which enhances output quality [15][16]
- The application of long context models is particularly beneficial in fields such as legal compliance analysis and customer research, where comprehensive data processing is required [17][18]

Group 4: Hybrid Architectures
- Hybrid attention mechanisms are positioned as the future of model design, combining the strengths of linear and full attention models to improve efficiency and performance [19][20]
- The article notes that the effective deployment of hybrid architectures is currently limited by infrastructure challenges, despite their proven potential [20]

Group 5: Practical Applications and Challenges
- The implementation of hybrid architectures in real-world applications is crucial, especially for handling large-scale requests efficiently [22]
- The article discusses the need for unified abstraction layers to optimize both traditional and hybrid architectures in inference engines [21]

Group 6: Future Directions
- The exploration of latent reasoning and self-training models is highlighted as an exciting frontier in RL research, with implications for the development of more autonomous AI systems [13][14]
- The importance of evaluating model performance based on computational budgets rather than fixed output lengths is emphasized for a more accurate assessment of efficiency [24]
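For reference, pass@k is commonly computed with the unbiased estimator popularized by OpenAI's HumanEval work: generate n samples per problem, count the c correct ones, and estimate 1 - C(n-c, k)/C(n, k). The sketch below assumes this standard formulation, which the closed-door session may or may not have used.

```python
# Unbiased pass@k estimator: with n samples per problem and c of them
# correct, the probability that at least one of k drawn samples is correct
# is 1 - C(n-c, k) / C(n, k). Standard formulation, shown for reference.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:          # too few incorrect samples to fill all k draws
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 20 samples per problem, 5 of them correct.
print(round(pass_at_k(n=20, c=5, k=1), 3))   # 0.25
print(round(pass_at_k(n=20, c=5, k=10), 3))  # 0.984
```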
A review of recent progress in RL for VLA models
自动驾驶之心· 2025-07-03 12:41
Core Viewpoint
- The article discusses the recent advancements in Vision-Language-Action (VLA) models, particularly focusing on the integration of Reinforcement Learning (RL) techniques to enhance their performance and stability in various tasks [1]

Group 1: Early Exploration of iRe-VLA
- The core algorithm of iRe-VLA is PPO, which introduces a two-stage training paradigm to address instability in online reinforcement learning [2]
- The implementation utilizes BLIP-2 3B as the VLM backbone, replacing the final fully connected layer with an action head that includes a token learner and an MLP [2]
- The experimental setup involves simulation environments such as Meta-World and Franka Kitchen, with tasks divided into three categories for evaluation [2]

Group 2: Preference Alignment with GRAPE
- GRAPE introduces preference alignment into VLA training, specifically designed for VLA characteristics [6]
- The reward for each trajectory is composed of three parts: a success reward, a self-reward, and an external reward based on a custom cost function (a hedged sketch of this composition follows at the end of this summary) [8]
- The external reward is calculated by decomposing trajectories into stages and evaluating them with a VLM task decomposer [9]

Group 3: LOOP and RIPT-VLA
- LOOP combines RLOO and PPO to address the challenges of sparse rewards and long sequences in multi-task scenarios [11]
- RIPT-VLA employs the LOOP algorithm for online RL and provides open-source code for implementation [13]
- The approach includes various tricks to enhance training efficiency, such as dynamic rejection mechanisms and multi-task sampling [15]

Group 4: System and Algorithm Innovations in RL4VLA
- RL4VLA models the action generation process as a multi-modal dialogue, using PPO training with dense pseudo-rewards to guide the training process [18]
- The training involves a Robotic Process Reward Model that predicts the likelihood of action sequences, enhancing the reward signal [20]
- The article emphasizes adaptive curriculum selection strategies to improve sample efficiency and generalization capabilities [21][23]

Group 5: Engineering Challenges and Future Directions
- The article highlights the need for new RL algorithms suited to VLA-RL, particularly addressing sparse-reward issues and improving sample efficiency [30]
- It points out the engineering challenges of improving sampling efficiency and managing memory costs in VLA scenarios [30]
- The exploration of effective reward design and the implementation of RL in non-autoregressive VLA structures are identified as critical areas for future research [30]
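As a rough illustration of the three-part trajectory reward described for GRAPE, the sketch below simply sums a success term, a policy self-reward term, and a stage-wise external cost term; the weights, names, and exact aggregation are assumptions for illustration, not GRAPE's actual formulation.

```python
# Hypothetical three-part trajectory score in the spirit of the GRAPE
# description above (success reward + self-reward + external, stage-wise
# reward). Weights and aggregation are illustrative assumptions.
def trajectory_reward(success: bool,
                      self_log_prob: float,
                      stage_costs: list[float],
                      w_success: float = 1.0,
                      w_self: float = 0.1,
                      w_ext: float = 0.5) -> float:
    r_success = 1.0 if success else 0.0   # did the rollout complete the task?
    r_self = self_log_prob                # policy's own log-likelihood of the trajectory
    r_external = -sum(stage_costs)        # lower cost per decomposed stage is better
    return w_success * r_success + w_self * r_self + w_ext * r_external

# Example: a successful rollout with two decomposed stages.
print(trajectory_reward(True, self_log_prob=-2.0, stage_costs=[0.3, 0.1]))
# 1.0 + 0.1*(-2.0) + 0.5*(-0.4) = 0.6
```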