Reinforcement Learning
Classes Starting Soon! A Full-Stack Learning Roadmap for Autonomous Driving VLA
自动驾驶之心· 2025-10-15 23:33
Core Insights
- The focus of academia and industry has shifted toward Vision-Language-Action (VLA) models in autonomous driving, which provide human-like reasoning capabilities for vehicle decision-making [1][4]
- Traditional methods in perception and lane detection have matured, leading to decreased attention in these areas, while VLA is now a critical area of development among major autonomous driving companies [4][6]

Summary by Sections

Introduction to VLA
- VLA is categorized into modular VLA, integrated VLA, and reasoning-enhanced VLA, which are essential for improving the reliability and safety of autonomous driving [1][4]

Course Overview
- A comprehensive course on autonomous driving VLA has been designed, covering foundational principles through practical applications, including cutting-edge techniques such as CoT, MoE, RAG, and reinforcement learning [6][12]

Course Structure
- The course consists of six chapters: an introduction to VLA algorithms, foundational algorithms, VLM as an interpreter, modular and integrated VLA, reasoning-enhanced VLA, and a final project [12][20]

Chapter Highlights
- Chapter 1 provides an overview of VLA algorithms and their development history, along with benchmarks and evaluation metrics [13]
- Chapter 2 focuses on foundational knowledge of the Vision, Language, and Action modules, including deployment of large models [14]
- Chapter 3 discusses the VLM's role as an interpreter in autonomous driving, covering classic and recent algorithms [15]
- Chapter 4 delves into modular and integrated VLA, emphasizing the evolution of language models in planning and control [16]
- Chapter 5 explores reasoning-enhanced VLA, introducing new modules for decision-making and action generation [17][19]

Learning Outcomes
- The course aims to deepen understanding of VLA's current advancements, core algorithms, and project applications, benefiting participants in internships and job placements [24]
Boston Dynamics' Robot Dog Is Back in Action, with "Five Legs" Working Together
36Kr· 2025-10-15 13:02
Core Insights
- Boston Dynamics' Spot robot can lift a 15 kg tire in as little as 3.7 seconds, showcasing advanced dynamic whole-body manipulation techniques [1][11]
- The robot's performance exceeds what static lifting assumptions would predict, demonstrating that coordinated whole-body motion lets it handle loads beyond the arm's rated capacity [13]

Group 1: Dynamic Whole-Body Manipulation
- The method combines sampling and learning to enable the robot to perform tasks requiring coordination of arms, legs, and torso [1][2]
- A hierarchical control approach divides the control problem into two layers: low-level control for balance and stability, and high-level control for task-specific strategies [2][14]

Group 2: Control Strategies
- The low-level control uses reinforcement learning to manage motor torques for stability, while the high-level control employs sampling-based strategies for tasks like tire alignment and stacking [2][7]
- The sampling controller simulates multiple future scenarios in parallel to identify the most effective actions for task completion (a minimal sketch follows after this summary) [3][5]

Group 3: Performance Metrics
- The robot achieved an average time of 5.9 seconds per tire, nearly matching human operational speed [11]
- Dynamic coordination allows the robot to handle weights significantly exceeding its peak static lifting capability, expanding its operational range [13][14]

Group 4: Learning and Adaptation
- The training process randomizes object properties to bridge the gap between simulation and the real world [10]
- An asymmetric actor-critic training architecture enhances the robot's ability to cope with complex dynamics and contact mechanics [8][10]
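Boston Dynamics has not released this controller as code; the following is a minimal sketch of the general idea of a sampling-based high-level controller, assuming a generic rollout model. The names `simulate` and `task_cost`, and the horizon and population sizes, are illustrative placeholders, not the company's implementation.

```python
import numpy as np

def sampling_controller(state, simulate, task_cost,
                        horizon=10, n_candidates=256, action_dim=6, rng=None):
    """Random-shooting sampling controller (illustrative sketch).

    simulate(state, actions) -> predicted trajectory for one action sequence
    task_cost(trajectory)    -> scalar cost, e.g. tire-to-stack alignment error
    Returns the first action of the lowest-cost candidate sequence.
    """
    rng = rng or np.random.default_rng()
    # Draw many candidate action sequences and evaluate each one in simulation.
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon, action_dim))
    costs = np.empty(n_candidates)
    for i, actions in enumerate(candidates):
        trajectory = simulate(state, actions)   # roll the candidate forward in a model
        costs[i] = task_cost(trajectory)        # score how well the task would be completed
    best = candidates[int(np.argmin(costs))]
    return best[0]                              # execute only the first action, then replan
```

In the article's framing, a controller like this sits on top of a learned low-level policy that keeps the robot balanced, so the sampled "actions" would be task-space commands rather than raw motor torques.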
After Sutton Declared "LLMs Are a Dead End," a New Interview Reveals AI's Dilemma
机器之心· 2025-10-15 07:33
Core Viewpoint
- The article discusses Rich Sutton's critical perspective on large language models (LLMs), suggesting they may not align with the principles outlined in his work "The Bitter Lesson" and highlighting their limitations in learning from real-world interactions [1][3][22]

Group 1: Limitations of LLMs
- Sutton argues that LLMs have significant flaws, particularly their inability to learn from ongoing interactions with the environment [3][21]
- He emphasizes that true intelligence should emerge from continuous reinforcement learning through dynamic interactions, rather than relying on extensive pre-training and supervised fine-tuning [3][4][22]
- The reliance on human knowledge and data in LLMs may lead to a lack of scalability and potential failure to meet expectations, as they are fundamentally limited by the biases present in the training data [24][25][26]

Group 2: Alternative Perspectives on Intelligence
- Experts in the discussion, including Suzanne Gildert and Niamh Gavin, express skepticism about achieving pure reinforcement learning, suggesting that current systems often revert to imitation learning due to the difficulty of defining universal reward functions [7][11]
- The conversation highlights the need for systems that can autonomously learn in new environments, akin to how a squirrel learns to hide nuts, rather than relying solely on pre-existing data [8][10]
- There is a consensus that while LLMs exhibit impressive capabilities, they do not equate to true intelligence, as they lack the ability to explore and learn from their environment effectively [33][35]

Group 3: The Future of AI Development
- The article suggests that the AI field is at a crossroads, where the dominance of certain paradigms may hinder innovation and lead to a cycle of self-limitation [28][29]
- Sutton warns that the current trajectory of LLMs, heavily reliant on human imitation, may not yield the breakthroughs needed for genuine understanding and reasoning capabilities [22][24]
- The discussion indicates a shift toward exploring more robust learning mechanisms that prioritize experience and exploration over mere data absorption [28][30]
Tencent Releases an Ultra-Low-Cost AI Training Method: ¥120 Outperforms a ¥70,000 Fine-Tuning Scheme
量子位· 2025-10-15 06:27
Core Viewpoint
- Tencent proposes a new method for upgrading large model agents called Training-Free GRPO, which significantly reduces costs and improves performance without the need for parameter tuning [1][5][11]

Group 1: Methodology
- The Training-Free GRPO method allows for performance enhancement by learning from brief experiences embedded in prompts, eliminating the need for parameter adjustments [2][11]
- This approach maintains the model parameters in a frozen state while dynamically updating an external knowledge base to optimize performance (a rough sketch of the idea follows after this summary) [14][22]
- The method leverages the core logic of traditional GRPO but transforms it into a non-parametric reasoning process [13]

Group 2: Experimental Results
- Experiments demonstrate that the DeepSeek-V3.1-Terminus model using Training-Free GRPO shows significant performance improvements in mathematical reasoning and web search tasks [4][25]
- Compared to fine-tuning a 32B model, Training-Free GRPO requires less training data and incurs lower costs, with a notable example being a cost of approximately $18 compared to over $10,000 for traditional methods [5][28]
- In the AIME24 and AIME25 tests, the model's performance improved from 80.0% to 82.7% and from 67.9% to 73.3%, respectively, showcasing a clear advantage with minimal training samples [28]

Group 3: Performance Evaluation
- The method achieved a Pass@1 score of 67.8% on the WebWalkerQA benchmark, a significant increase from the baseline score of 63.2% [35]
- The results indicate that the learned experiences help the model avoid redundant tool calls and improve decision-making efficiency [31][30]
- The effectiveness of Training-Free GRPO is contingent on the underlying model's reasoning and tool-usage capabilities, as demonstrated by its lower performance on less capable models [40]
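The article describes Training-Free GRPO only at a high level. Below is a minimal sketch of the general idea under stated assumptions: model weights stay frozen, a group of rollouts per query is scored, and the group comparison is distilled into natural-language "experiences" that are prepended to later prompts. The functions `generate`, `score`, and `summarize_lesson` are hypothetical placeholders, not Tencent's API.

```python
def training_free_grpo_step(query, experience_lib, generate, score,
                            summarize_lesson, group_size=8):
    """One non-parametric 'update': the model is never fine-tuned; only the
    external experience library changes (illustrative sketch)."""
    context = "\n".join(experience_lib)                    # inject prior experiences via the prompt
    rollouts = [generate(query, context) for _ in range(group_size)]
    rewards = [score(query, r) for r in rollouts]

    # Group-relative comparison, as in GRPO, but used for reflection instead of gradients.
    mean_reward = sum(rewards) / len(rewards)
    best = rollouts[rewards.index(max(rewards))]
    worst = rollouts[rewards.index(min(rewards))]

    # Ask the frozen model to contrast high- and low-reward rollouts and extract a lesson.
    lesson = summarize_lesson(query, best, worst, mean_reward)
    if lesson:
        experience_lib.append(lesson)                      # the "update" is purely textual
    return experience_lib
```

Because the weights never change, the per-task cost is only inference, which is consistent with the large cost gap the article reports relative to fine-tuning a 32B model.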
Just In: UCLA's Bolei Zhou Has Also Joined a Robotics Company
机器之心· 2025-10-15 02:54
Core Insights
- Coco Robotics has appointed Bolei Zhou, a UCLA associate professor, as Chief AI Scientist to lead its newly established Physical AI Lab, which focuses on autonomous navigation for sidewalk delivery robots [1][2][4]
- The company aims to achieve full automation in last-mile delivery, leveraging the extensive operational data collected over the past five years [2][4][6]

Group 1: Company Overview
- Coco Robotics, founded in 2020, specializes in last-mile delivery robotics and initially relied on teleoperators to navigate around obstacles [2][4]
- The company has accumulated millions of miles of data in complex urban environments, which is crucial for training reliable AI systems [4][6]

Group 2: Research and Development
- The Physical AI Lab will use the collected data to enhance automation and operational efficiency, focusing on the local models that run on the robots [6]
- The lab operates independently of Coco Robotics' collaboration with OpenAI, which allows the use of OpenAI's models while sharing data for mutual benefit [5][6]

Group 3: Bolei Zhou's Background
- Bolei Zhou holds a PhD from MIT and has a strong research background in machine perception and intelligent decision-making, with over 100 publications and significant contributions to explainable AI [9][11][13]
- His notable works include Class Activation Mapping (CAM), sketched below, and the creation of the Places database, which contains over 10 million labeled scene images and has advanced scene recognition [11][18][20]
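Class Activation Mapping, mentioned above as one of Zhou's best-known contributions, can be computed in a few lines for any CNN that ends in global average pooling followed by a linear classifier. The sketch below uses torchvision's ResNet-18 purely as a stand-in model; it is an illustration of the CAM idea, not code from the original paper.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(weights="IMAGENET1K_V1").eval()

def class_activation_map(image, class_idx):
    """CAM: class-specific weighted sum of the last conv feature maps (sketch)."""
    feats = {}
    hook = model.layer4.register_forward_hook(
        lambda module, inputs, output: feats.update(maps=output))  # grab last conv features
    with torch.no_grad():
        model(image)                                 # image: (1, 3, 224, 224) tensor
    hook.remove()

    fmap = feats["maps"][0]                          # (C, h, w) feature maps
    weights = model.fc.weight[class_idx]             # (C,) classifier weights for the class
    cam = torch.einsum("c,chw->hw", weights, fmap)   # weighted sum over channels
    cam = F.relu(cam)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return cam                                       # upsample to image size for visualization
```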
Karpathy Hand-Builds ChatGPT in 8,000 Lines of Code for Only $100; After 12 Hours of Training, Its CORE Score Beats GPT-2
程序员的那些事· 2025-10-15 00:44
Core Insights
- The article discusses the launch of "nanochat," a simplified version of ChatGPT created by Andrej Karpathy, which can be built with minimal resources and code [1][2][4]
- The project aims to provide an accessible framework for training language models, emphasizing ease of use and modification [11][13]

Project Overview
- "Nanochat" is a full-stack training and inference pipeline that lets users create a basic ChatGPT-like model in approximately 8,000 lines of code [2][4]
- The total cost to train this model is around $100, using a cloud GPU server for about 4 hours [4][16]
- The tokenizer is a custom byte-pair-encoding implementation written in Rust (a toy BPE sketch follows after this summary), and pre-training is conducted on the FineWeb dataset [5][19]

Performance Metrics
- After approximately 12 hours of training, the model's performance on the CORE metric surpasses that of GPT-2 [8]
- Specific scores include: CORE 0.2219, ARC-Easy 0.3876, GSM8K 0.0758, HumanEval 0.0854, MMLU 0.3151 [7][56]

Training Process
- The training process involves several stages: pre-training, mid-training, supervised fine-tuning (SFT), and reinforcement learning (RL) [45][50]
- The pre-training phase uses a large dataset to teach the model about the world, while mid-training adapts the model for conversational tasks [28][45]
- The SFT phase further refines the model using high-quality dialogue data [48]

Community Engagement
- The project gained significant attention, with over 4.8k stars on GitHub shortly after release, indicating strong community interest [14]
- The framework is designed to be easily modifiable, allowing users to experiment with different parameters and configurations [59]

Future Potential
- Karpathy envisions "nanochat" evolving into a research tool or benchmark framework, similar to previous projects like nanoGPT [13]
- The project is still in its early stages, with room for further optimization and enhancement [13][50]
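The article notes that nanochat trains its own tokenizer. As a rough illustration of the byte-pair-encoding idea behind such tokenizers (a toy Python version, not nanochat's Rust implementation), a minimal byte-level BPE trainer looks like this:

```python
from collections import Counter

def train_bpe(text, num_merges=50):
    """Toy byte-level BPE trainer (illustrative only)."""
    ids = list(text.encode("utf-8"))   # start from raw byte ids 0..255
    merges = {}                        # (id_a, id_b) -> new token id
    next_id = 256
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break                      # nothing left worth merging
        merges[(a, b)] = next_id
        out, i = [], 0
        while i < len(ids):            # replace every occurrence of the best pair
            if i + 1 < len(ids) and ids[i] == a and ids[i + 1] == b:
                out.append(next_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids, next_id = out, next_id + 1
    return merges

merges = train_bpe("the quick brown fox jumps over the lazy dog " * 20)
print(f"learned {len(merges)} merges")
```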
CoreWeave: A Multi-Trillion-Dollar Feast
36Kr· 2025-10-15 00:29
Core Viewpoint
- The integration of large language models (LLM) and reinforcement learning (RL) is accelerating the development of autonomous intelligent agents, positioning CoreWeave as a key cloud service provider for the AI infrastructure needed in this new phase [1]

Group 1: Business Strategy and Expansion
- CoreWeave's acquisition of OpenPipe is a significant move to enhance its capabilities in the reinforcement learning space, allowing it to train intelligent agents and gain developer recognition [2]
- The transition from a "hardware + API" model to a comprehensive "intelligent agent support platform" represents a qualitative leap in CoreWeave's offerings [3]
- The integration of reinforcement learning services is expected to significantly enhance profit margins, creating a competitive barrier that traditional hardware rental models cannot match [4]

Group 2: Infrastructure Requirements
- Intelligent agents require high-performance infrastructure that includes high-throughput system interconnects, fast memory, rollback architecture, data monitoring, error recovery, and modular subroutines, which traditional cloud providers cannot adequately supply [5]
- The computational demands of intelligent agents are projected to be several orders of magnitude greater than traditional static inference, with global data center spending on computing expected to rise from hundreds of billions to trillions of dollars [6][7]

Group 3: Financial Performance and Market Potential
- CoreWeave's quarterly sales surged 200% year-over-year to approximately $1.21 billion, with a backlog of nearly $30 billion, indicating strong future demand [8]
- The shift toward intelligent agent models is expected to drive significant market growth, with conservative estimates suggesting that by 2030 annual spending on computational resources could reach $1.8 trillion [9]
- CoreWeave's ability to capture value from the entire decision-making cycle of intelligent agents positions it favorably against competitors, enhancing its long-term profitability [10]

Group 4: Valuation and Future Outlook
- CoreWeave's current valuation aligns with GPU-intensive cloud service peers, with an estimated enterprise value (EV) range of $80-100 billion, potentially increasing to $120 billion if demand for reinforcement learning training accelerates [13]
- The company's strategic shift toward becoming a comprehensive provider of reinforcement learning training solutions is expected to expand its valuation range as the revenue structure increasingly leans toward software services [14]
CoreWeave: A Multi-Trillion-Dollar Feast
美股研究社· 2025-10-14 12:30
The convergence of large language models (LLM) and reinforcement learning (RL) is accelerating the rise of "autonomous agents" (AI systems that can make decisions and execute tasks on their own).

CoreWeave's business footprint is expanding rapidly, allowing it to bring its infrastructure and services to more markets and enterprises and laying the groundwork for serving the agent era at scale.

From "Compute Supplier" to "Agent Runtime Platform"

The acquisition of OpenPipe is CoreWeave's key move to push upstream in the value chain. OpenPipe's core strength is a reinforcement learning toolkit: developers can use it to train agents and adapt models to new task requirements. With this acquisition, CoreWeave not only gains the core technology for agent training but also wins recognition from the developer community, closing the loop on the full agent-training workflow. This is not a minor upgrade; it is a qualitative shift from "hardware layer + API" to a "full-lifecycle agent support platform."

- Agent-related workloads are growing exponentially, and compute demand keeps surging;
- In-house reinforcement learning tools and runtime services will significantly expand profit margins;
- In power supply, cooling efficiency, and GPU availability, CoreWeave holds durable advantages over hyperscalers.

In the past, every development team had to build its own " ...
Top Conferences Favor Work That Combines RL with These Directions
具身智能之心· 2025-10-14 10:00
Core Insights
- Reinforcement learning (RL) remains a significant field, with ongoing developments and applications in various domains, including robotics and product optimization [1][2][3]
- Gait control is central to embodied intelligent robots, and RL is the primary method for achieving complex movements (a generic reward-shaping sketch follows after this summary) [2][8]
- The complexity of RL poses challenges for newcomers, necessitating structured guidance to ease entry into the field and support successful paper publication [5][9]

Group 1: Importance of Reinforcement Learning
- RL is not an outdated discipline; it remains relevant, with numerous applications in robotics such as humanoid and quadruped robots [1][2]
- Companies such as Unitree and Zhiyuan use RL to train robots to perform challenging tasks, including climbing stairs and running [2][8]
- Combining RL with Vision-Language-Action (VLA) models for robotic arms is gaining traction in academic research, improving the efficiency of robotic manipulation [3][8]

Group 2: Challenges in Learning and Research
- The breadth and complexity of RL make it difficult for beginners to navigate, often leading to frustration and abandoned studies [5][9]
- The lack of a comprehensive learning framework can result in repeated mistakes and missed research opportunities [6][9]
- A specialized 1v6 mentoring course has been introduced to address these challenges by providing structured support for students in the RL field [6][9]

Group 3: Course Structure and Offerings
- The course spans 14 weeks of intensive online guidance followed by 8 weeks of maintenance support, focused on producing a publishable paper [10][11]
- Weekly live sessions will cover topics including RL fundamentals, simulation environments, and writing guidance, with an emphasis on practical application [17][21]
- Participants can work on specific ideas in quadruped, humanoid, and robotic-arm research, following a structured approach to project development and writing [18][25]
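RL-based gait control of the kind described above usually optimizes a shaped reward rather than a single sparse signal. The function below is a generic, illustrative example of such a reward for a quadruped; the terms and weights are common choices in the locomotion literature, not any specific company's configuration.

```python
import numpy as np

def locomotion_reward(base_lin_vel, cmd_vel, joint_torques, base_height,
                      target_height=0.30, fell_over=False):
    """Generic shaped reward for quadruped locomotion (illustrative weights)."""
    # Track the commanded planar velocity (forward/lateral).
    tracking = np.exp(-4.0 * np.sum((base_lin_vel[:2] - cmd_vel[:2]) ** 2))
    # Penalize actuation effort to encourage efficient gaits.
    energy = -1e-4 * np.sum(np.square(joint_torques))
    # Keep the body near its nominal height.
    height = -2.0 * (base_height - target_height) ** 2
    # Small alive bonus, large penalty for falling.
    alive = -10.0 if fell_over else 0.2
    return tracking + energy + height + alive

# Example call with dummy values:
r = locomotion_reward(np.array([0.9, 0.0, 0.0]), np.array([1.0, 0.0]),
                      np.zeros(12), base_height=0.29)
print(round(r, 3))
```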
Gradient Updates with Zero Human Involvement: MIT's New Framework Lets AI Generate Its Own Fine-Tuning Data and Update Its Weights Autonomously
36Kr· 2025-10-14 07:16
Core Insights
- MIT has introduced a new reinforcement learning framework called SEAL (Self-Adapting LLMs), enabling models to generate fine-tuning data and self-update instructions autonomously, allowing for model weight updates without human intervention [1][3]

Group 1: SEAL Framework Overview
- SEAL employs a nested learning mechanism that calculates rewards based on the updated model's performance on tasks, optimizing the generation strategy for self-update instructions [3]
- The framework provides large models with self-driven update capabilities at the weight level, overcoming the limitations of relying solely on external supervised data [3]

Group 2: Knowledge Incorporation Experiment
- In the knowledge incorporation experiment, the Qwen2.5-7B model was tested on the SQuAD dataset, generating training data from new paragraphs without seeing the corresponding answers [5]
- The accuracy of the original Qwen model was 32.7%, which improved to 33.5% with plain fine-tuning, 46.3% with GPT-4.1 synthetic data, and reached 47.0% using the SEAL method, demonstrating superior knowledge integration [6][10]

Group 3: Large-Scale Data Testing
- SEAL achieved an accuracy of 58.2% when tested with longer paragraphs, significantly outperforming the unoptimized version and indicating its ability to generalize to larger data organization tasks [8]

Group 4: Few-Shot Learning Experiment
- In the few-shot learning experiment, the LLaMA-3.2-1B-Instruct model was used on a subset of tasks from the ARC-AGI dataset, where SEAL generated a training configuration and executed LoRA fine-tuning [11][13]
- The success rate of tasks trained with SEAL reached 72.5%, far exceeding the 0% success rate of fixed few-shot prompts and the 20% of random sampling strategies, showcasing SEAL's strong task adaptation ability [15][16]

Group 5: SEAL's Operational Mechanism
- SEAL operates through a dual-loop system that automatically generates training instructions, allowing the model to read new information, rewrite it in its own words, and perform gradient updates for self-learning (a minimal sketch follows after this summary) [17][18]
- The outer loop generates self-edit instructions based on new input, while the inner loop executes fine-tuning according to these instructions, constructing synthetic training data and updating weights [18][20]
- SEAL uses a non-traditional reinforcement learning method called ReSTEM, which relies on behavior cloning and filtered sampling to optimize the generation of effective self-edit strategies [20]
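The dual-loop mechanism above is sketched below under stated assumptions: the outer loop asks the model to produce a "self-edit" (synthetic training text plus fine-tuning settings), the inner loop applies a small weight update such as LoRA, and ReSTEM-style filtering keeps only the edits whose updated model actually improves on the task. The functions `propose_self_edit`, `lora_finetune`, and `eval_task` are hypothetical placeholders, not MIT's released code.

```python
def seal_outer_step(model, new_passage, eval_task, propose_self_edit,
                    lora_finetune, n_candidates=4):
    """One SEAL-style outer-loop step (illustrative sketch).

    Keeps only self-edits whose fine-tuned model beats the current model,
    in the spirit of ReSTEM's filtered behavior cloning.
    """
    baseline = eval_task(model)
    kept_edits = []                                   # positive edits become training signal
    best_model, best_score = model, baseline

    for _ in range(n_candidates):
        edit = propose_self_edit(model, new_passage)  # model rewrites the passage, picks hyperparams
        candidate = lora_finetune(model, edit)        # inner loop: small weight update from the edit
        score = eval_task(candidate)                  # reward = downstream accuracy after the update
        if score > baseline:
            kept_edits.append(edit)
            if score > best_score:
                best_model, best_score = candidate, score

    return best_model, kept_edits                     # kept_edits would supervise the edit generator
```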