Reinforcement Learning
AI Voices | A Star-Studded Guest List: What Has the Dwarkesh Podcast Been Discussing Lately?
红杉汇· 2025-12-11 00:04
Core Insights
- The "Dwarkesh Podcast" has become a crucial source of information in the AI industry, featuring in-depth discussions with key figures such as Satya Nadella, Ilya Sutskever, and Andrej Karpathy [2]

Group 1: Insights from Ilya Sutskever
- The era of blindly stacking computational power is over; the focus has shifted from scaling laws to research and intuition in AI development [5]
- Emotions are not a hindrance for humans but an evolutionary gift; AI lacks emotions, which limits its intelligence, and incorporating them may be essential for achieving true intelligence [6]
- AGI should be viewed as a "15-year-old genius" with strong learning capabilities rather than an all-knowing entity [7]

Group 2: Insights from Satya Nadella
- Model vendors may face a "winner's curse" because models are interchangeable; Microsoft emphasizes integrating AI into applications like Excel to maintain a competitive edge [10]
- GitHub is envisioned as the headquarters for future AI agents, focused on managing multiple AI models working on code [11]
- The SaaS model is evolving; future revenue may come from providing resources for AI agents rather than from traditional per-user subscriptions [12][13]

Group 3: Insights from Andrej Karpathy
- What is being built is not "animals" but "ghosts" of the internet: current AI models have vast knowledge yet lack physical intuition [16]
- Reinforcement learning (RL) is criticized as inefficient because it reduces complex reasoning to a single reward signal, leading to issues like "hallucinations" in AI [17]
- Future AGI may require only 1 billion parameters, separating memory from cognition to enhance efficiency [18]

Group 4: Insights from Richard Sutton
- Current LLMs merely mimic human speech without understanding truth, lacking the objective reality necessary for true intelligence [21]
- Supervised learning is not natural; AI should learn from experience rather than labeled data, similar to how animals learn in the wild [22]
- Humanity is transitioning from a "copying era" to a "design era," in which AI is designed with an understanding of its principles [23]

Group 5: Insights from Sergey Levine
- Robots do not need all-encompassing world models; they need a focused approach to complete tasks effectively [25]
- High-level intelligence may involve "forgetting," allowing robots to react quickly without cognitive overload [26]
- The failure of early autonomous driving was attributed to a lack of common sense, which modern robots are beginning to incorporate [27]
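Nanjing University, LibLib.ai, and the Institute of Automation of the Chinese Academy of Sciences Jointly Propose PosterCopilot, a "Poster-Design Large Model" for Layout Reasoning and Precise Editing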
机器之心· 2025-12-10 08:13
Core Viewpoint
- The article discusses the development of PosterCopilot, a professional-level poster design and editing model that addresses significant challenges in graphic design automation, particularly layout reasoning and controllable editing [2][6][40].

Industry Pain Points
- Graphic design is still far from true automation: existing models such as Stable Diffusion struggle with layered structures, which leads to material distortion and a lack of fine control [6].
- Current multimodal models exhibit four critical shortcomings: severe element overlap, lack of visual feedback, regression to a single ground truth, and inability to perform layer-specific edits [8][10].

Core Achievements
- PosterCopilot aims to bridge the gap between single-step generation and professional workflows through a systematic solution built on a three-stage training strategy [13][14].
- The three training stages are:
  1. Perturbation Supervised Fine-Tuning (PSFT), to address geometric distortions [15].
  2. Visual-Reality Alignment Reinforcement Learning (RL-VRA), to correct overlaps and proportional issues [15].
  3. Aesthetic Feedback Reinforcement Learning (RLAF), to encourage exploration beyond ground-truth layouts [15].

Generative Agent
- PosterCopilot functions as a comprehensive design assistant, moving from abstract design concepts to concrete materials through a reception model and a T2I model [16][17].
- The model supports various professional scenarios, including full poster generation from provided assets, intelligent completion of missing materials, global theme transitions, intelligent size reconstruction, and multi-round fine-grained editing [21][23][28][29][31].

Experimental Results
- PosterCopilot outperforms existing commercial competitors and state-of-the-art models across multiple metrics, with an average win rate exceeding 74% in human evaluations [34][35].
- In assessments of layout rationality, text legibility, and element preservation, PosterCopilot performs better than models such as Microsoft Designer and CreatiPoster [35][37].

Conclusion and Outlook
- By decoupling layout reasoning from generative editing and using reinforcement learning to align with human aesthetics, PosterCopilot sets a new benchmark for intelligent design tools and offers a new paradigm for AI-assisted creative workflows [40].
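The summary says the RL-VRA stage corrects element overlap but does not say how overlap is measured. As a rough illustration only, the sketch below scores a layout by the mean pairwise intersection-over-union (IoU) of its element boxes; the `(x, y, width, height)` box format and the function names `iou` and `overlap_penalty` are assumptions made for this example, not PosterCopilot's actual reward.

```python
from itertools import combinations

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    # Width and height of the intersection rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def overlap_penalty(boxes):
    """Mean pairwise IoU over all element boxes; 0.0 means no overlap at all.

    Hypothetical metric: a reward could subtract this value to discourage overlap.
    """
    pairs = list(combinations(boxes, 2))
    if not pairs:
        return 0.0
    return sum(iou(a, b) for a, b in pairs) / len(pairs)
```

In a real poster some overlap is intentional (for example, text placed on a background image), so a practical metric would likely exclude such pairs; the sketch treats every pair the same way.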
No More Reliance on Experts: Robots Learn Self-Reference, and Performance Soars to 99.2% in Just 200 Steps
机器之心· 2025-12-10 05:10
Core Insights
- The article discusses the Self-Referential Policy Optimization (SRPO) framework, which improves the performance of Vision-Language-Action (VLA) models on robotic tasks by addressing sparse rewards and dependence on expert demonstrations [3][11].

Motivation and Contribution
- Recent research indicates that reinforcement learning (RL) can significantly improve VLA models' performance both within and outside their training distribution. Sparse reward signals remain a challenge, however, particularly in VLA tasks where high computational cost and inefficient use of failed-trajectory information hinder training efficiency [6][11].
- SRPO reduces the dependence on expert demonstrations and task-specific reward engineering by using self-generated successful trajectories to provide progressive rewards for failed attempts [11][12].

Technical Approach
- SRPO follows a "learn from success" paradigm: trajectories generated during policy inference are collected and split into successful and failed attempts. The framework uses a latent world representation to model behavior similarity and compute progressive rewards [14][16].
- The framework formalizes robotic decision-making as a partially observable Markov decision process (POMDP) and introduces a world-model-driven reward mechanism that provides progressive reward signals for failed trajectories [18][19].

Experimental Results
- SRPO reached a success rate of 99.2% with only 200 steps of reinforcement learning, significantly outperforming baseline models that rely on sparse rewards or require manual reward design [27].
- In the LIBERO-Plus generalization tests, SRPO showed a performance improvement of 167%, even without training on any generalization-scenario data [30].

Efficiency and Real-World Application
- SRPO improved success rates from 17.3% to 98.6% on long-horizon tasks with minimal training steps, showing better information utilization than traditional methods [34].
- SRPO's reward modeling has been tested in real-world environments, showing significant success-rate improvements across various tasks [37].

Conclusion
- SRPO represents a significant advance in VLA reinforcement learning, enabling robots to move from imitation to autonomous exploration without expensive data labeling or complex reward design [51].
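The idea of rewarding failed attempts by their similarity to self-generated successes can be sketched in a few lines. This is a minimal illustration under stated assumptions, not SRPO's published formulation: each trajectory is assumed to be already encoded as a fixed-length embedding (in SRPO this would come from the world model), and the Euclidean distance, the exponential mapping, and the `scale` parameter are choices made for this example.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def progressive_reward(embedding, success_embeddings, scale=1.0):
    """Map distance to the nearest successful trajectory into a reward in (0, 1].

    Assumption: closer to a success means more task progress. With no
    successes in the batch there is nothing to compare against, so return 0.0.
    """
    if not success_embeddings:
        return 0.0
    nearest = min(euclidean(embedding, s) for s in success_embeddings)
    return math.exp(-nearest / scale)

def score_batch(trajectories, scale=1.0):
    """Successful trajectories get 1.0; failed ones get a progressive reward."""
    successes = [t["embedding"] for t in trajectories if t["success"]]
    rewards = []
    for t in trajectories:
        if t["success"]:
            rewards.append(1.0)
        else:
            rewards.append(progressive_reward(t["embedding"], successes, scale))
    return rewards
```

With a batch containing one success and two failures, the failure whose embedding lies closer to the success receives the larger reward, which gives the policy a gradient of progress instead of an all-or-nothing signal.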
Learn at Your Own Pace! The End-to-End and VLA Autonomous Driving Small-Group Course Has Officially Concluded
自动驾驶之心· 2025-12-09 19:00
Core Viewpoint
- 2023 marks the year of end-to-end production, and 2024 is expected to be a significant year for end-to-end production in the automotive industry, as leading new-force carmakers and established manufacturers have already achieved end-to-end production [1][3].

Group 1: End-to-End Production Development
- End-to-end autonomous driving has two main paradigms, single-stage and two-stage; UniAD is a representative single-stage approach that models vehicle trajectories directly from sensor inputs [1].
- Since last year, single-stage end-to-end development has advanced rapidly, producing derivatives such as perception-based, world model-based, diffusion model-based, and VLA-based single-stage methods [3][5].
- Major players in the autonomous driving sector, both solution providers and car manufacturers, are focusing on in-house research and production of end-to-end autonomous driving technologies [3].

Group 2: Course Overview
- A course titled "End-to-End and VLA Autonomous Driving" has been launched to teach cutting-edge single-stage and two-stage end-to-end algorithms, with a focus on the latest developments in industry and academia [5][14].
- The course is organized into chapters, starting with an introduction to end-to-end algorithms, followed by background on technologies such as VLA, diffusion models, and reinforcement learning [8][9].
- The second chapter is highlighted as covering the technical keywords expected to come up most often in job interviews over the next two years [9].

Group 3: Technical Focus Areas
- The course covers the subfields of single-stage end-to-end methods: perception-based (UniAD), world model-based, diffusion model-based, and the currently popular VLA-based approaches [10][12].
- The curriculum includes practical assignments, such as RLHF fine-tuning, and aims to give students hands-on experience building and experimenting with pre-training and reinforcement learning modules [11][12].
- The course emphasizes understanding BEV perception, multi-modal large models, and the latest advances in diffusion models, which are crucial for the future of autonomous driving [12][16].
End-to-End Deployment Small-Group Course: Core Algorithms & Hands-On Walkthroughs (7 Projects)
自动驾驶之心· 2025-12-09 19:00
Core Insights
- The article discusses the evolving recruitment landscape in the autonomous driving sector, highlighting a shift in demand from perception roles to end-to-end, VLA, and world model positions [2]
- A new advanced course on end-to-end production in autonomous driving has been designed, emphasizing practical applications and real-world experience [2][4]

Course Overview
- The course covers core algorithms including one-stage and two-stage end-to-end methods, navigation information applications, reinforcement learning, and trajectory optimization [2]
- It aims to provide the in-depth knowledge and practical skills needed for production in autonomous driving, with a focus on real-world applications and challenges [2][4]

Chapter Summaries
- **Chapter 1: Overview of End-to-End Tasks** Discusses the integration of perception tasks and the learning-based design of control algorithms, which are essential skills for companies in the end-to-end era [7]
- **Chapter 2: Two-Stage End-to-End Algorithm Framework** Introduces how two-stage frameworks are modeled and how information is passed between perception and planning, with practical examples [8]
- **Chapter 3: One-Stage End-to-End Algorithm** Focuses on one-stage frameworks that allow lossless information transfer, presenting various methods and hands-on exercises [9]
- **Chapter 4: Production Application of Navigation Information** Covers the critical role of navigation information in autonomous driving, detailing mainstream navigation map formats and their integration into models [10]
- **Chapter 5: Introduction to RL Algorithms in Autonomous Driving** Explains why reinforcement learning is needed alongside imitation learning to improve the model's ability to generalize [11]
- **Chapter 6: Trajectory Output Optimization** Engages participants in practical projects on algorithms based on imitation learning and reinforcement learning [12]
- **Chapter 7: Safety Net Solutions - Spatiotemporal Joint Planning** Discusses post-processing logic that ensures accuracy and stability of trajectory outputs, introducing common smoothing algorithms [13]
- **Chapter 8: Experience Sharing on End-to-End Production** Shares practical production experience, covering data, models, scenarios, and strategies for improving system capability [14]

Target Audience
- The course is aimed at advanced learners with a foundational understanding of autonomous driving algorithms, reinforcement learning, and programming [15][17]
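GeekPark Innovation Festival 2026 Concludes in Beijing; Luo Yonghao, Zhang Nan, He Xiaopeng, Liu Jingkang, and Others Discuss "The Process Is Up to Me" in the AI Era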
Xin Lang Cai Jing· 2025-12-09 10:23
Group 1
- The core theme of the GeekPark Innovation Festival 2026 is "On The Loop!", emphasizing the importance of human judgment and action in the AI era [2][26]
- The festival has been held for 16 years, has featured notable figures such as Elon Musk and Jack Ma, and has evolved into a platform for entrepreneurs that combines content, community, and early-stage investment [2][25]
- The event gathered over 40 global innovators to discuss technology trends, product innovation, and the future of humanity through formats including main-stage speeches and "AI Product Flash" sessions [3][25]

Group 2
- The main-stage discussions focused on "non-consensus" questions and the "next 1500 days," exploring the intersection of technology trends and human values [5]
- Keynote speakers included He Xiaopeng of XPeng Motors, who discussed bringing AI into the physical world, and Liu Jingkang of Insta360, who addressed the competition between optics and computation in image creation [6][8]
- The festival featured in-depth discussions on the evolution of human-machine relationships, with entrepreneurs sharing views on the essence and boundaries of AI companionship [10][19]

Group 3
- The event included specialized sessions on individual empowerment through AI, the redefinition of relationships with AI, and the integration of AI into physical spaces [17][21]
- The "AI Product Flash" sessions presented new AI applications and products aimed at improving user experience and filling market gaps [25]
- The festival reinforced GeekPark's mission to discover, connect, and empower innovators, underscoring its role as a significant annual summit in China's technology and innovation landscape [25][26]
AI Needs to Be Able to Improve Itself! More and More People in AI Believe "Current AI Training Methods Cannot Deliver a Breakthrough"
Hua Er Jie Jian Wen· 2025-12-09 01:49
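A small but growing group of AI developers from companies such as OpenAI and Google believe that the current technical path cannot deliver major breakthroughs in fields such as biology and medicine, and cannot avoid simple mistakes either. This view is prompting the industry to question where billions of dollars of investment are going.

According to a report by The Information on Tuesday, many researchers discussed the topic at the Conference on Neural Information Processing Systems (NeurIPS), held last week in San Diego. They argue that developers must create AI that keeps acquiring new capabilities after deployment. This "continual learning" ability resembles the way humans learn, but it has not yet been achieved in AI.

However, technical limitations have already slowed enterprise customers' purchases of new products such as AI agents. Models keep making mistakes on simple problems, and AI agents often perform poorly unless the AI provider does a great deal of work to ensure they run correctly.

These doubts contrast with the optimistic predictions of some AI leaders. Anthropic CEO Dario Amodei said last week that scaling existing training techniques would be enough to reach artificial general intelligence (AGI), while OpenAI CEO Sam Altman believes AI will be able to improve itself in a little over two years. But if the skeptics are right, the billions of dollars that OpenAI and Anthropic plan to spend next year on techniques such as reinforcement learning could be at risk.

Despite the technical limitations, current AI's performance on tasks such as writing, design, shopping, and data analysis still drives ...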
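Dachen and Huakong Lead as Jiga Vision (极佳视界) Raises Another 200 Million Yuan in Its Series A2 Round, Betting on a Native "World Model + Action Model" Architecture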
Tai Mei Ti A P P· 2025-12-08 07:17
Group 1
- The company, Jiga Vision, has completed a new financing round, raising 200 million yuan in Series A2 funding led by Dachen Caizhi with participation from several notable investors; this brings its total funding over the last three months to 500 million yuan [2]
- The founder and CEO, Dr. Huang Guan, has a strong background in AI and robotics. He previously worked at leading research institutions and has been instrumental in the evolution of physical AI from its inception to industrial application [2][3]
- Jiga Vision has introduced a new paradigm for artificial general intelligence (AGI) built on a "world model + action model + reinforcement learning" framework, indicating an industry shift towards general action models [3]

Group 2
- The company has officially launched two core models for physical AGI: GigaBrain-0, an end-to-end decision and control model, and GigaWorld-0, a high-quality world model, along with the Maker H01 robot platform [4]
- GigaBrain-0 strengthens 3D spatial perception and structured reasoning, significantly improving navigation accuracy and task execution in complex environments and outperforming current state-of-the-art methods on various benchmarks [5]
- GigaWorld-0 generates high-fidelity, controllable, and diverse interactive data, achieving a nearly 300% performance improvement on key generalization dimensions, which makes it a cost-effective solution in the current market [6]

Group 3
- Maker H01 is designed for open environments in home, commercial, and light industrial applications. It has dual arms and an omnidirectional mobile chassis and can perform precise operations and complex tasks [6][7]
- The integration of GigaBrain-0, GigaWorld-0, and Maker H01 accelerates the transition of embodied intelligence from the laboratory to scalable applications, marking a significant step towards reliable and generalizable physical AGI [7]
Job Hunting for End-to-End Roles: Core Algorithms & Hands-On Walkthroughs (7 Projects)
自动驾驶之心· 2025-12-08 00:02
Core Insights
- The article discusses the evolving recruitment landscape in the autonomous driving industry, highlighting a shift in demand from perception roles to end-to-end, VLA, and world model positions [2]
- A new course titled "End-to-End Practical Class for Mass Production" has been designed to address the skills gap in the industry, focusing on practical applications and mass production experience [2][4]

Course Overview
- The course covers core algorithms such as one-stage and two-stage end-to-end methods, navigation information applications, reinforcement learning, and trajectory optimization [2]
- It is structured into eight chapters, each focusing on a different aspect of end-to-end autonomous driving systems, including task overview, algorithm frameworks, navigation applications, and production experience [5][7][8][9][10][11][12][13][14]

Target Audience
- The course is designed for advanced learners with a background in autonomous driving perception, reinforcement learning, and programming tools such as Python and PyTorch [15][16]
- It emphasizes practical skills and aims to prepare participants for real-world applications in the autonomous driving sector [2][15]

Course Schedule
- The course will commence on November 30 and run for approximately three months, featuring offline video lectures and online Q&A sessions [15][17]
Is Agent Fine-Tuning Back? NVIDIA Open-Sources a New 8B Model That Carries GPT-5: 37 Points on HLE, and at a Lower Cost
量子位· 2025-12-07 04:35
Core Insights
- The article introduces a new paradigm in AI model orchestration: a small 8B model acts as a conductor that coordinates various tools and larger models, achieving better performance at lower cost [1][13].

Group 1: Model Performance
- Orchestrator-8B scored 37.1% on Humanity's Last Exam (HLE), ahead of GPT-5 at 35.1%, while also reducing computational cost by a factor of 2.5 [1][9].
- On the FRAMES benchmark, Orchestrator-8B scored 76.3 compared with GPT-5's 74.0, and on τ²-Bench it scored 80.2 against GPT-5's 77.7 [9][10].
- The average cost for Orchestrator-8B was only 9.2 cents, with a latency of 8.2 minutes, significantly lower than GPT-5 [9][10].

Group 2: ToolOrchestra Framework
- ToolOrchestra exposes various tools through a unified JSON interface, allowing the 8B conductor to think, call tools, and read their feedback over multiple rounds until it converges [4].
- The framework uses GRPO reinforcement learning to maximize three rewards: correctness, efficiency, and user preference [4][5].

Group 3: User Preferences and Biases
- The article highlights two biases in large models: self-enhancement bias, where models prefer to call models similar to themselves, and blind reliance on the strongest models, which drives up cost [4][5].
- User preferences are taken into account, allowing the conductor to balance local versus cloud search, speed, and cost [5][15].

Group 4: Application Scenarios
- Orchestrator-8B can be applied in scenarios such as internal Q&A and report analysis, where it defaults to local indexing and code execution for 80% of tasks [16].
- In research and development, it can set time and cost limits while taking source preferences into account [16].
- The framework allows end-to-end orchestration of functions and tools, moving away from rigid programming structures [16].

Group 5: Future Directions
- All code, models, and datasets from the paper have been made publicly available for academic and industrial follow-up [14].
- The approach marks a shift from relying solely on the strongest models to a more efficient use of diverse tools and models, improving cost-effectiveness and performance [15].
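To make the "three rewards plus GRPO" description more concrete, here is a minimal sketch. The group-relative normalization (each rollout's reward minus the group mean, divided by the group standard deviation) is the standard GRPO advantage; everything about `combined_reward`, including its linear form and the weights `w_cost`, `w_latency`, and `w_pref`, is a hypothetical illustration, since the summary does not state how ToolOrchestra actually combines correctness, efficiency, and user preference.

```python
from statistics import mean, pstdev

def combined_reward(correct, cost_usd, latency_min, preference_score,
                    w_cost=0.1, w_latency=0.01, w_pref=0.5):
    """Hypothetical scalar reward: correctness, minus efficiency penalties,
    plus a bonus for matching the user's stated tool preferences."""
    outcome = 1.0 if correct else 0.0
    efficiency_penalty = w_cost * cost_usd + w_latency * latency_min
    return outcome - efficiency_penalty + w_pref * preference_score

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize rewards within one group of rollouts
    sampled for the same task, so no separate value network is needed."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Under this sketch, two rollouts that both answer correctly are still ranked against each other: the one that used cheaper, faster tools gets the higher reward and therefore the positive advantage, which is the pressure that would steer a conductor away from always calling the strongest model.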