Reinforcement Learning
Judging from the information available so far, the ceiling for end-to-end deployment should be quite high ...
自动驾驶之心· 2025-11-12 00:04
Core Insights
- The article highlights significant developments in the autonomous driving industry, particularly the performance of Horizon HSD and the advancements in Xiaopeng's VLA2.0, indicating a shift towards end-to-end production models [1][3].

Group 1: Industry Developments
- Horizon HSD's performance has exceeded expectations, marking a return of the industry's focus to one-stage end-to-end production, which has a high potential ceiling [1].
- Xiaopeng's VLA2.0, which integrates visual and language inputs, reinforces the notion that VLA (vision-language-action) capabilities are central to autonomous driving technology [1].

Group 2: Educational Initiatives
- The article discusses a new course titled "Practical Class for End-to-End Production," aimed at sharing production experience in autonomous driving and covering one-stage and two-stage frameworks, reinforcement learning, and trajectory optimization [3][8].
- The course is limited to 40 participants, emphasizing a targeted approach to skill development in the industry [3][5].

Group 3: Course Structure
- The course consists of eight chapters covering an end-to-end task overview, two-stage and one-stage algorithm frameworks, navigation information applications, reinforcement learning algorithms, trajectory output optimization, fallback solutions, and production experience sharing [8][9][10][11][12][13][14][15].
- Each chapter builds on the previous one, providing a comprehensive understanding of the end-to-end production process in autonomous driving [16].

Group 4: Target Audience and Requirements
- The course targets advanced learners with a background in autonomous driving algorithms, reinforcement learning, and programming, though it remains accessible to those with less experience [16][17].
- Participants need a GPU meeting the recommended specifications and a foundational understanding of the relevant mathematical concepts [17].
6666! A perfect-score NeurIPS paper is here
量子位· 2025-11-11 11:11
Core Insights
- The article discusses a paper that challenges the prevailing belief that reinforcement learning (RL) is essential for enhancing reasoning capabilities in large language models (LLMs), suggesting instead that model distillation may be more effective [1][5][12].

Group 1: Research Findings
- The paper, "Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?", received a perfect score at NeurIPS, indicating its significant impact [5][6].
- The research team from Tsinghua University and Shanghai Jiao Tong University found that RL primarily reinforces existing reasoning paths rather than discovering new ones, contradicting the common assumption that RL expands a model's reasoning capabilities [10][12].
- The study used the pass@k metric to evaluate model performance: RL models perform better at low sampling budgets but are overtaken by base models at high sampling budgets, suggesting the base model's reasoning abilities may be underestimated [14][20].

Group 2: Methodology
- Models were tested across three key application areas (mathematical reasoning, code generation, and visual reasoning) using authoritative benchmark datasets [17][19].
- The comparison covered mainstream LLMs such as Qwen2.5 and LLaMA-3.1, with RL variants trained using algorithms such as PPO, GRPO, and Reinforce++ [18][19].
- The analysis focused on differences in pass@k performance between RL and base models, and on how performance trends as sampling increases [21][22].

Group 3: Implications for the Industry
- The findings suggest that the substantial investment and exploration surrounding RLVR may need to be reevaluated, as the actual benefit of RL for enhancing reasoning could be overestimated [4][12].
- The research highlights model distillation as a more promising route to expanding reasoning capabilities in LLMs, which could shift industry focus and funding [10][12].
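The pass@k comparison above can be made concrete. The standard unbiased estimator (widely used since the Codex paper; the article does not show the team's exact evaluation code) takes n sampled solutions, of which c are correct, and estimates the probability that at least one of k random draws is correct. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: with n samples drawn and c of them
    correct, the probability that a random subset of size k contains
    at least one correct sample is 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer incorrect samples than k: a hit is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# A model with few but diverse correct paths looks weak at k=1
# yet strong at large k, which is the effect the paper measures.
print(pass_at_k(n=256, c=8, k=1))    # → 0.03125
print(pass_at_k(n=256, c=8, k=128))  # close to 1.0
```

This is why an RL model that concentrates probability mass on a few reinforced paths can win at pass@1 while the base model, sampling more diversely, wins at pass@128.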
No fear of a Claude supply cutoff: the Doubao coding model is here, building a "Minecraft" clone in 5 minutes for 0.2 yuan
36Kr· 2025-11-11 09:25
Doubao-Seed-Code is also the first coding model in China to support visual understanding: it can generate code from UI design drafts, screenshots, or hand-drawn sketches, visually compare the generated page against the reference, and autonomously fix styling issues and bugs, substantially improving front-end development efficiency.

The first Doubao coding model is here! As 智东西 reported on November 11, Volcano Engine (火山引擎), ByteDance's cloud and AI service platform, released Doubao-Seed-Code, the first coding model in the Doubao model family. It is optimized specifically for agentic coding tasks and achieves a breakthrough in cost-effectiveness.

On performance, Doubao-Seed-Code outscores domestic models such as DeepSeek-V3.1, Kimi-K2, and GLM-4.6 on several mainstream coding benchmarks, ranking just behind Claude Sonnet 4.5, the current top model in AI coding. Doubao-Seed-Code also offers a native 256K context window, larger than Claude Sonnet 4.5's 200K.

Beyond the leaderboards, Doubao-Seed-Code emphasizes deployment in real coding scenarios. Thanks to optimizations targeting mainstream development tools, users of Claude Code, Trae, or veCLI can get started easily and obtain stable output ...
Shanghai Jiao Tong University × Ant Group release DiagGym: driving interactive medical diagnosis agents with a world model
机器之心· 2025-11-11 08:40
Core Insights
- The article discusses a new training framework for AI diagnostic agents, emphasizing the need for dynamic decision-making in clinical diagnosis rather than reliance on static data [2][6][10].

Group 1: Framework and Model Development
- A novel "environment-agent" training framework is proposed, built around a medical diagnostic world model called DiagGym that trains self-evolving diagnostic agents known as DiagAgent [2][10].
- DiagGym simulates a virtual clinical environment in which diagnostic agents interact with virtual patients, refining their decision-making strategies through continuous feedback [10][14].
- The framework includes a comprehensive evaluation benchmark, DiagBench, consisting of 750 cases and 973 detailed, physician-developed assessment criteria for evaluating the diagnostic reasoning process [2][12].

Group 2: Training and Evaluation
- DiagAgent training has two main phases: supervised fine-tuning on real clinical interaction data, followed by reinforcement learning in the DiagGym environment to strengthen decision-making [19][15].
- Experimental results indicate that DiagAgent significantly outperforms advanced models such as DeepSeek and Claude-4 in multi-step diagnostic decision-making [12][25].
- Evaluation metrics include diagnostic accuracy, quality of examination recommendations, and diagnostic efficiency; DiagAgent shows a 44.03% improvement in recommendation hit rate and a 9.34% increase in final diagnosis accuracy compared to other models [25][28].

Group 3: Research Value and Future Prospects
- The research brings AI diagnostics closer to real clinical workflows by moving from static question answering to dynamic strategy learning, enabling agents to actively gather evidence and make assessments [36][41].
- Future expansions may integrate treatment plans and prognostic evaluations into the virtual environment, aiming toward a comprehensive diagnostic and treatment AI system [38][40].
- DiagGym can be extended with additional dimensions such as treatment feedback and cost/safety constraints, yielding a more holistic virtual clinical system [40][41].
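The environment-agent loop described above follows the familiar gym-style step interface: the agent requests examinations, the world model answers with generated results, and the episode ends when the agent commits to a diagnosis. A minimal sketch; all class and field names here are illustrative assumptions, not the released DiagGym API, and the lookup table stands in for the learned world model:

```python
class VirtualPatientEnv:
    """Toy stand-in for a diagnostic world model: examination
    results come from a per-case lookup table here, whereas the
    real system generates them with a learned model."""

    def __init__(self, case):
        self.case = case  # exam name -> simulated result, plus ground truth
        self.done = False

    def step(self, action):
        """Return (observation, reward, done) for one agent action."""
        if action.startswith("diagnose:"):
            # Committing a diagnosis terminates the episode.
            self.done = True
            correct = action == "diagnose:" + self.case["diagnosis"]
            return "episode over", 1.0 if correct else 0.0, True
        # Any other action is an examination request; its "result"
        # would be generated conditioned on the virtual patient.
        result = self.case.get(action, "no abnormality detected")
        return result, 0.0, False

# One rollout: gather evidence, then commit a final diagnosis.
case = {"blood_test": "elevated white cell count", "diagnosis": "infection"}
env = VirtualPatientEnv(case)
obs, reward, done = env.step("blood_test")
obs, reward, done = env.step("diagnose:infection")
```

The reward arriving only at the terminal diagnosis step is what makes this a multi-step decision problem suited to reinforcement learning rather than single-turn question answering.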
Tencent Youtu proposes Training-Free GRPO: reinforcement learning on DeepSeek-V3.2 for as little as $8
腾讯研究院· 2025-11-10 11:08
Core Insights
- The article discusses Training-Free GRPO, an approach that enables cost-effective reinforcement learning without modifying model parameters, in line with Richard Sutton's vision of intelligent agents learning from their own experience rather than solely from human data [4][8][28].

Cost and Efficiency
- Traditional reinforcement learning (RL) can cost around $10,000 to train a 32B model, while Training-Free GRPO reduces the cost of optimizing a 671B model to approximately $8 to $18 [25].
- These savings make reinforcement learning accessible to smaller teams and individual developers [28][25].

Methodology
- The Training-Free GRPO process involves four key steps:
  1. Multi-path exploration to generate various solution paths for a problem [14].
  2. Providing minimal sample rewards to guide the model's learning direction [15].
  3. Semantic advantage extraction through self-reflection on the different answers [16].
  4. Optimizing the experience library based on validated strategies [17][20].

Performance Improvement
- With only 100 training samples, Training-Free GRPO raises the Mean@32 score on the AIME leaderboard from 68.6 to 72.6 [19].
- In web search scenarios, the method achieved a 4.6% improvement in Pass@1 without updating model parameters [22][23].

Application Scenarios
- Training-Free GRPO is particularly suitable for long-tail niche applications, rapid-iteration scenarios, and teams with limited budgets, such as individual developers and small enterprises [26].

Conclusion
- Training-Free GRPO makes reinforcement learning feasible for a far broader range of developers and applications, democratizing access to advanced AI capabilities [28].
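The four steps above amount to replacing a gradient update with an update to a natural-language experience library that is prepended to every prompt. A minimal sketch under stated assumptions: llm() is a hypothetical stand-in for a call to a frozen model (the real method calls an API model such as DeepSeek-V3.2), and the "semantic advantage" step, a textual self-reflection in the paper, is reduced here to keeping the best-scored rollout:

```python
from dataclasses import dataclass, field

@dataclass
class ExperienceLibrary:
    """Lessons stored as text and injected into every prompt; this
    library, not the model weights, is what gets optimized."""
    lessons: list = field(default_factory=list)

    def as_context(self) -> str:
        return "\n".join("- " + lesson for lesson in self.lessons)

def llm(prompt: str) -> str:
    # Hypothetical stand-in for a frozen model; returns a canned
    # answer so the sketch runs offline.
    return "The answer is 42."

def training_free_grpo_step(question, reward_fn, lib, n_rollouts=4):
    # 1. Multi-path exploration: sample several solution paths,
    #    each conditioned on the current experience library.
    rollouts = [llm(lib.as_context() + "\n" + question)
                for _ in range(n_rollouts)]
    # 2. Minimal sample rewards: score each rollout.
    rewards = [reward_fn(r) for r in rollouts]
    # 3. Semantic advantage extraction (simplified): keep the
    #    best-scored path as the lesson to distill.
    best = rollouts[rewards.index(max(rewards))]
    # 4. Optimize the experience library instead of the parameters.
    lib.lessons.append("For questions like %r, a path that worked: %s"
                       % (question, best))
    return lib

lib = ExperienceLibrary()
training_free_grpo_step("What is 6 * 7?",
                        lambda r: 1.0 if "42" in r else 0.0, lib)
```

Because only the prompt context changes, the loop needs inference calls rather than GPU training, which is where the claimed two-orders-of-magnitude cost reduction comes from.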
The 8th GAIR Global Artificial Intelligence and Robotics Conference is about to open: crossing AI's long night to watch the stars shine together
雷峰网· 2025-11-10 10:05
Core Insights
- The GAIR Global Artificial Intelligence and Robotics Conference will take place December 12-13, 2025, in Shenzhen, focusing on advancements in AI and robotics [2][10].
- The conference will feature discussions on large models, embodied intelligence, the computational power transformation, reinforcement learning, and world models, showcasing the forefront of AI exploration [3][4].
- The event aims to bridge academia and industry, highlighting the importance of collaboration in advancing AI technologies and their real-world applications [4][9].

Group 1
- The conference will host top scholars from Europe, the United States, Japan, and China to explore the deep integration of AI with the physical world [4].
- The commercialization of AI is described as a challenging journey, with entrepreneurs and industry giants sharing their practical methodologies [4].
- The focus on computational power as a critical area for economic development will include insights into the market and policy dynamics around large-scale computing infrastructure [4].

Group 2
- GAIR has evolved since its inception in 2016, consistently attracting leading scientists and researchers, including Turing Award and Nobel Prize winners [5][7].
- The conference has marked significant milestones in the history of AI in China, such as the participation of influential female scientists and attendance by over 5,000 AI experts [7].
- The event serves as a platform connecting ideas and practice, fostering collaboration between generations of researchers and practitioners in the AI field [9].
Some information on the future development of Li Auto's VLA
自动驾驶之心· 2025-11-10 03:36
Core Viewpoint
- The article discusses the future of Li Auto's VLA (vision-language-action) architecture, emphasizing the development of a reinforcement learning closed loop by the end of 2025, which is expected to significantly enhance user experience and vehicle performance [2][3].

Short-term Outlook
- Li Auto aims to establish a reinforcement learning closed loop by the end of 2025, with noticeable improvements in vehicle performance and user perception expected by early 2026 [2].

Mid-term Outlook
- After strengthening the reinforcement learning closed loop, Li Auto anticipates surpassing Tesla in the Chinese market, owing to its advantages in closed-loop iteration [3].
- The transformation brought by VLA's reinforcement learning is seen as a significant business change, creating a true competitive moat that will take 1-2 years to fully build [3].

Long-term Outlook
- VLA is projected to achieve Level 4 autonomy, though new technologies are expected to emerge beyond it [4].
- Current safety restrictions mitigate risk while the system is designed to autonomously identify and address issues through data collection and training [4].

Key Insights on VLA
- Li Auto's leadership believes the intelligence required for driving is relatively low, and that after business process reforms, the computational needs on the vehicle will not be excessively high [5][6].
- The company is targeting roughly 1000 to 2000 TOPS of on-vehicle compute and a model on the order of 32 billion parameters in the cloud [6].

Organizational Adjustments
- Li Auto's autonomous driving department is restructuring to strengthen its business system rather than relying on individual talent, with a focus on an AI-oriented organization [12].
- The restructuring splits existing teams into specialized departments to improve efficiency and innovation [12].

Competitive Landscape
- Li Auto's approach to VLA has faced skepticism from competitors, but the company views this as validation of its strategy [14].
- The article highlights the importance of data quality and distribution for effective autonomous driving, emphasizing the need for human-like reasoning capabilities in such systems [18].

Strategic Focus
- The company is committed to delivering substantial functional upgrades and user-experience improvements on a quarterly basis [18].
- Li Auto's leadership stresses clear communication of company strategy to engage younger employees effectively [18].
Robot training: Beijing college students find a skillful new way to play
具身智能之心· 2025-11-10 00:02
Author | 量子位

College students really do know how to have fun (doge)! While surfing the web, people stumbled on a surprise: a college guy has found himself a robot teammate, and a rather clingy one at that. During his daytime supermarket shift it insists on tagging along, happily pulling the cart the moment the goods are loaded and running up and down the stairs. At noon, when he works part-time in the cafeteria, it volunteers to push the food cart and goes exactly where directed (a pat on the head tells it to stop). Even after the day's work is done, it joins him at the gym; having come all this way, why not work out together? You could honestly shoot a vlog from the robot's point of view, titled "A High-Energy Day in the Life of a Robot."

Joking aside, notice that all communication between the student and his robot partner happens through pats on the head and pulls on the body: no remote control and no voice commands. That is what makes this interesting. Most robots today are driven by external sensors (cameras, LiDAR, and so on) plus remote control, whereas this group of students proposed an entirely new approach: interacting with the outside world through proprioception alone. So it turns out their ...
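One common way to sense physical guidance through proprioception alone is a torque-residual check: if the torques measured at the joints deviate from what the robot's own dynamics model predicts for its current motion, something external (a pat, a pull) must be acting on it. A toy sketch of that idea; the function, values, and threshold are illustrative assumptions, not the team's actual method:

```python
def detect_external_contact(predicted_torques, measured_torques, threshold=2.0):
    """Compare per-joint measured torque against the torque the
    robot's internal dynamics model predicts; a large residual on
    a joint suggests an external force, such as a human pat or
    pull, is acting there."""
    residuals = [abs(m - p) for p, m in zip(predicted_torques, measured_torques)]
    return [r > threshold for r in residuals]

# Free motion: residuals stay small, so no contact is flagged.
print(detect_external_contact([1.0, 0.5, 0.2], [1.1, 0.4, 0.3]))
# → [False, False, False]
# A pull on joint 0 shows up as a large torque residual there.
print(detect_external_contact([1.0, 0.5, 0.2], [5.0, 0.4, 0.3]))
# → [True, False, False]
```

Mapping which joints are being pushed, and in which direction, to commands like "stop" or "follow" is then a matter of interpretation on top of this signal.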
Recruiting partners in 4D annotation and world models!
自动驾驶之心· 2025-11-08 16:03
Group 1
- The article emphasizes the increasing demand for corporate training and job counseling in the autonomous driving sector, highlighting the need for diverse training programs ranging from technology updates to industry development summaries [2].
- There is notable interest from individuals seeking guidance, particularly those struggling with resume enhancement and project experience [3].
- The company is actively seeking collaboration with professionals in the autonomous driving field to enhance training services, course development, and research guidance [4].

Group 2
- The company offers competitive compensation and access to extensive industry resources, focusing on areas such as autonomous driving product management, data annotation, world models, and reinforcement learning [5].
- The primary targets for training collaboration include enterprises, universities, and research institutions, as well as students and job seekers [6].
- Interested parties are encouraged to reach out via WeChat for further consultation [7].