Reinforcement Learning
Uber launches an ‘AV Labs' division to gather driving data for robotaxi partners
TechCrunch· 2026-01-27 13:00
Core Insights
- Uber is launching a new division called Uber AV Labs to provide data to its more than 20 autonomous vehicle partners, focusing on democratizing access to valuable real-world driving data [1][9]

Group 1: Uber's Strategy and Operations
- Uber is not returning to developing its own robotaxis but will collect data using its own vehicles equipped with sensors for partners like Waymo and Lucid Motors [2]
- The new AV Labs division currently operates a single vehicle, a Hyundai Ioniq 5, which is in the process of being equipped with the necessary sensors [10]
- Uber's VP of engineering stated that the lab aims to build a foundational data set before determining product-market fit, emphasizing the company's responsibility to accelerate the autonomous vehicle ecosystem [10]

Group 2: Data Collection and Value
- Real-world driving data is increasingly valuable for training self-driving systems as companies shift from rules-based operations to reinforcement learning [3]
- The physical limit of an autonomous vehicle company's fleet restricts data collection, making extensive real-world driving essential for addressing edge cases [5]
- Uber's approach to data collection is targeted, allowing deployment in specific cities based on partner needs, in contrast with Tesla's broader scale of data collection [13][14]

Group 3: Collaboration with Partners
- Partners will not receive raw data; instead, Uber will process the data to fit partners' needs, enhancing the semantic understanding available to driving software [11]
- Uber plans to run partner driving software in "shadow mode" to identify discrepancies and improve model training, aiming to make autonomous vehicles drive more like humans [12]
- Partners have expressed a strong desire for any helpful data, recognizing that Uber's data collection capabilities far exceed their own [15]
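The "shadow mode" idea described under Group 3 amounts to running a model's driving decisions alongside the human driver and logging wherever the two diverge beyond some tolerance. The sketch below shows that comparison in miniature; the `Action` fields, tolerances, and function names are illustrative assumptions, not Uber's actual interface.

```python
from dataclasses import dataclass

@dataclass
class Action:
    steering: float  # normalized steering command in [-1, 1] (assumed units)
    throttle: float  # normalized throttle command in [0, 1] (assumed units)

def shadow_discrepancy(human: Action, model: Action,
                       steer_tol: float = 0.1,
                       throttle_tol: float = 0.15) -> bool:
    """Return True when the shadow model's action diverges from the human
    driver's beyond the given tolerances -- i.e., a moment worth logging
    as training signal. Thresholds here are arbitrary placeholders."""
    return (abs(human.steering - model.steering) > steer_tol
            or abs(human.throttle - model.throttle) > throttle_tol)
```

In a real pipeline, flagged moments would be uploaded with their sensor context so the partner can analyze why its model disagreed with the human.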
Has Silicon Valley's "too much money" ruined AI?! Former OpenAI o1 lead speaks out: stop hyping Google; Q-Star was turned into a soap opera; seven years of high pressure nearly "drove him mad"!
Xin Lang Cai Jing· 2026-01-25 01:24
Source: AI前线 | Compiled by Tina. This is not resignation gossip, but the choice of someone who endured seven years of high pressure in an industry that turns technology into drama and research into a spectator sport. In the first month of 2026, when news broke that Jerry Tworek was leaving OpenAI, several OpenAI employees reacted on X with barely contained emotion: "I'm genuinely devastated," "This is so hard to take." The collective reaction suggested the news came too suddenly and weighed too heavily. Jerry is one of the most influential yet least publicly visible figures behind the modern AI wave. When he joined OpenAI in 2019, the company had only about 30 employees. He worked on many of its most important projects, including the reasoning methods later known as Q-Star and Strawberry, which ultimately evolved into the o1 reasoning model. After his departure, he explained his reasons in a Core Memory podcast interview: he wants to pursue risky foundational research, which is no longer possible at a company like OpenAI, where metrics such as user growth take priority. His view of ChatGPT ads captures the disconnect between research and commercialization: "That is a business strategy; I was responsible for training models." These remarks corroborate rumors of a growing rift between OpenAI's AI research and its product development. In Tworek's ...
Has Silicon Valley's "too much money" ruined AI?! Former OpenAI o1 lead speaks out: stop hyping Google; Q-Star was turned into a soap opera; seven years of high pressure nearly "drove him mad"!
AI前线· 2026-01-24 05:33
Core Viewpoint
- The departure of Jerry Tworek from OpenAI highlights the growing divide between AI research and commercialization, emphasizing the need for risk-taking in foundational research that is increasingly difficult in a competitive corporate environment [3][4][5]

Group 1: Departure and Industry Insights
- Jerry Tworek's exit from OpenAI was met with shock among employees, indicating his significant influence within the company [3][10]
- Tworek criticized the AI industry for a lack of innovation, stating that major companies are developing similar technologies, which pressures researchers to prioritize short-term gains over experimental breakthroughs [4][5]
- He pointed out that Google's success in catching up with OpenAI was due to OpenAI's own missteps, including slow action and failure to leverage its initial advantages [4][5]

Group 2: Organizational Challenges
- Tworek identified organizational rigidity as a barrier to innovation, where team structures limit cross-team research and collaboration [4][22]
- He expressed concern that the current state of the AI industry resembles a soap opera, where personnel moves and internal conflicts overshadow genuine research progress [6][7]

Group 3: Future Research Directions
- Tworek emphasized the importance of exploring new research paths rather than following the mainstream trajectory, advocating for more diversity in AI model development [30][31]
- He highlighted two underexplored areas: architectural innovation beyond the Transformer model and the integration of continual learning into AI systems [45][47]
- Tworek believes that significant advancements in AI will require a shift away from the current focus on scaling existing models toward more innovative approaches [26][28]

Group 4: AGI and Industry Evolution
- Tworek updated his perspective on the timeline for achieving AGI, acknowledging that while current models are powerful, they still lack essential capabilities like continuous learning and multimodal perception [49][50]
- He noted that the rapid evolution of AI technology and increasing investment in the field could lead to breakthroughs sooner than previously anticipated [51]
Why has reinforcement learning in autonomous driving not been successfully deployed?
自动驾驶之心· 2026-01-13 03:10
Core Viewpoint
- The article discusses the challenges and advancements in reinforcement learning (RL) for autonomous driving, emphasizing the need for a balanced reward system that improves both safety and efficiency in driving models [2][5]

Group 1: Challenges in Reinforcement Learning
- RL faces significant issues such as reward hacking, where tightening safety requirements can reduce efficiency, and vice versa [2]
- Achieving a comprehensive performance improvement with RL is challenging, and many companies are not performing adequately [2]
- The complexity of autonomous driving requires adherence to various driving rules, making RL-based optimization essential, especially in uncertain decision-making scenarios [2][5]

Group 2: Model Development and Talent Landscape
- Current industry leaders have developed a complete model-iteration approach that combines imitation learning, closed-loop RL, and rule-based planning [5]
- High barriers to entry in the autonomous driving sector have led to generous salaries, with top talent earning starting salaries of 1 million and above [6]
- Many candidates show a notable gap in practical experience, often lacking the system-level experience necessary for real-world applications [7]

Group 3: Course Offerings and Structure
- The article promotes a specialized course aimed at practical applications of end-to-end autonomous driving systems, highlighting the need for hands-on experience [8]
- The course covers chapters including an overview of end-to-end tasks, two-stage and one-stage algorithm frameworks, and the application of navigation information [13][14][15][16]
- It also addresses the integration of RL algorithms and trajectory optimization, emphasizing the value of combining imitation learning with RL for better performance [17][18]

Group 4: Practical Experience and Knowledge Requirements
- The final chapter of the course shares production experience, analyzing data, models, scenarios, and rules to enhance system capabilities [20]
- The course is designed for advanced learners with a foundational understanding of autonomous driving algorithms, reinforcement learning, and programming [21][22]
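The safety-versus-efficiency tension behind reward hacking (Group 1) is commonly handled with a weighted multi-term reward, where the safety penalty dominates so the efficiency term cannot encourage reckless progress. The terms, weights, and units below are an illustrative sketch under assumed conventions, not any company's production reward.

```python
def driving_reward(collision: bool, progress_m: float, jerk: float,
                   w_safety: float = 10.0, w_eff: float = 1.0,
                   w_comfort: float = 0.1) -> float:
    """Weighted multi-term driving reward (illustrative).

    A large safety penalty guards against the efficiency term rewarding
    reckless progress; a small comfort term penalizes harsh jerk.
    progress_m is meters advanced this step (assumed unit)."""
    safety = -w_safety if collision else 0.0
    efficiency = w_eff * progress_m
    comfort = -w_comfort * abs(jerk)
    return safety + efficiency + comfort
```

Tuning the weights is exactly where reward hacking bites: raise `w_safety` too far and the agent learns to stop making progress at all, which is the efficiency collapse the article describes.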
We are recruiting partners in these directions (world models / 4D annotation / RL)
自动驾驶之心· 2026-01-12 09:20
Core Viewpoint
- The autonomous driving industry has entered its second phase, requiring more dedicated individuals to address its challenges and pain points [2]

Group 1: Industry Direction
- The main focus areas include but are not limited to: autonomous driving product management, 4D annotation/data loop, world models, VLA, large models for autonomous driving, reinforcement learning, and end-to-end solutions [4]

Group 2: Job Description
- The positions primarily target training collaborations in autonomous driving, serving B-end clients (enterprises, universities, research institutes) and C-end clients (students, job seekers) through course development and original article creation [5]

Group 3: Contact Information
- For discussions regarding compensation and collaboration methods, interested parties are encouraged to add the WeChat contact wenyirumo for further communication [6]
Out of nowhere, DeepSeek expands the R1 paper to 86 pages: this is what "Open" really means
36Kr· 2026-01-09 03:12
The R1 paper has ballooned to 86 pages! DeepSeek proves to the world that open source can not only catch up with closed source, it can show closed source how it's done! The whole internet is stunned!

Two days ago, DeepSeek quietly updated the R1 paper, expanding it from the original 22 pages to 86.

The new paper demonstrates that reinforcement learning alone is enough to improve AI reasoning ability!

DeepSeek appears to be holding something big in reserve; some netizens even speculate that the pure reinforcement-learning approach may surface in R2.

This update turns the original paper into a technical report that the open-source community can fully reproduce.

Paper: https://arxiv.org/abs/2501.12948

The new content in DeepSeek-R1 is dense with substance:

| Benchmark (Metric) | Claude-3.5-Sonnet-1022 | GPT-4o-0513 | DeepSeek-V3 | OpenAI o1-mini | OpenAI o1-1217 | DeepSeek-R1 |
| --- | --- | --- | --- | --- | --- | --- |
| Architecture | - | - | MoE | - | - | MoE |
| # ...
Clearing the backlog: DeepSeek suddenly completes the R1 technical report, publicly detailing the training path for the first time
36Kr· 2026-01-09 03:12
Core Insights
- DeepSeek has released an updated version of its research paper on the R1 model, adding 64 pages of technical detail that significantly expand the original content [4][25]
- The new version emphasizes the implementation details of the R1 model, showcasing a systematic approach to its training process [4][6]

Summary by Sections

Paper Update
- The updated paper has grown from 22 pages to 86 pages, providing a comprehensive view of the R1 model's training and operational details [4][25]
- The new version includes a detailed breakdown of the training process, which is divided into four main steps: cold start, reasoning-oriented reinforcement learning (RL), rejection sampling and fine-tuning, and alignment-oriented RL [6][9]

Training Process
- The cold-start phase uses thousands of CoT (Chain of Thought) examples for supervised fine-tuning (SFT) [6]
- The reasoning-oriented RL phase enhances model capabilities while introducing a language-consistency reward to address mixed-language output [6]
- The rejection sampling and fine-tuning phase incorporates both reasoning and general data to improve the model's writing and reasoning abilities [6]
- The alignment-oriented RL phase refines the model's helpfulness and safety to align more closely with human preferences [6]

Safety Measures
- DeepSeek has implemented a risk-control system to enhance the safety of the R1 model, including a dataset of 106,000 prompts used to evaluate model responses against predefined safety criteria [9][10]
- The safety reward model employs point-wise training to distinguish safe from unsafe responses, with training hyperparameters aligned with those of the helpfulness reward model [9]
- The risk-control system operates through two main processes: potential-risk dialogue filtering and model-based risk review [9][10]

Performance Metrics
- The introduction of the risk-control system has significantly improved the model's safety performance, with R1 achieving benchmark scores comparable to leading models [14]
- DeepSeek has developed an internal safety evaluation dataset organized into four main categories and 28 subcategories, totaling 1,120 questions [19]

Team Stability
- The core contributors to the DeepSeek team have largely remained intact, with only five of more than 100 authors having left, indicating strong retention in a competitive AI industry [21][24]
- Notably, a previously departed author has returned to the team, highlighting a positive team dynamic compared to other companies in the sector [24]
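The language-consistency reward mentioned in the training process above can be proxied crudely by the fraction of characters written in the target script. The function below is a hypothetical sketch of that idea, not DeepSeek's published implementation, which scores full responses during RL training.

```python
def language_consistency_reward(text: str, target: str = "en") -> float:
    """Crude proxy for a language-consistency reward: the fraction of
    alphabetic characters that belong to the target script.

    For target="en" we count ASCII letters; otherwise we treat any
    non-ASCII letter (e.g., CJK) as the target script. Both choices are
    simplifying assumptions for illustration only."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    if target == "en":
        in_target = sum(c.isascii() for c in letters)
    else:
        in_target = sum(not c.isascii() for c in letters)
    return in_target / len(letters)
```

A reward like this, added to the task reward during RL, discourages the model from drifting between languages mid-chain-of-thought, which is the mixed-language issue the report describes.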
RL Environments and RL for Science: Data Foundries and Multi-Agent Architectures
2026-01-07 03:05
Summary of Key Points from the Conference Call

Industry Overview
- The conference call focuses on the scaling of Reinforcement Learning (RL) and its applications across various domains, including AI capabilities, coding environments, and data foundries [2][3][51]

Core Insights and Arguments
1. **Scaling RL as a Critical Path**: The scaling of RL is identified as essential for unlocking further AI capabilities, with significant performance gains attributed to increased RL compute [2][4]
2. **OpenAI's Model Performance**: OpenAI has demonstrated that improvements in model performance over the past 18 months were primarily driven by post-training and scaling up RL compute, using the same base model across various flagship models [4][6]
3. **Challenges in Scaling RL**: Scaling RL is difficult because models need a continuous stream of tasks to learn from, which is labor-intensive compared to pre-training on vast internet data [7]
4. **Task Aggregation**: Companies like Windsurf and Cursor have managed to create competitive models by aggregating tasks and data, even without lab-level resources [9]
5. **Utility and Capability Evaluation**: OpenAI's GDPval evaluation measures model improvements across 1,000+ tasks in 44 occupations, indicating a shift from measuring abstract intelligence to measuring real-world utility [10][14]
6. **Autonomous AI Development**: Companies like OpenAI and Anthropic are targeting the development of autonomous AI researchers by 2028 and 2027, respectively, indicating a trend toward models that can operate independently for longer periods [16]

Additional Important Content
1. **Outsourcing Data Tasks**: The need for significant data and task curation has led to outsourcing, with companies like Scale AI historically being major contractors but now absorbed by Meta [19][21]
2. **Emergence of New Companies**: Over 35 companies have emerged to provide RL environments across various domains, from website cloning to more sophisticated software environments [24][29]
3. **Demand for Coding Environments**: Demand for coding environments is high, with companies acquiring defunct startups for their GitHub repositories in order to build such environments [37][38]
4. **Expert Contractors**: Firms like Surge and Mercor are used to hire domain-specific experts for task creation, with Surge being a significant player with an estimated annual recurring revenue of around $1 billion [55]
5. **Chinese Market Dynamics**: Chinese VC firms are attempting to establish local data-foundry competitors to serve the ecosystem at lower cost, with most Chinese labs still in the early stages of scaling RL [58][59]

This summary encapsulates the key points discussed in the conference call, highlighting the advancements, challenges, and market dynamics within the RL and AI landscape.
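The "RL environments" these companies sell ultimately reduce to the standard reset/step contract that training stacks consume: an environment hands back an observation, accepts an action, and returns the next observation, a reward, and a done flag. The toy environment below shows that contract in plain Python; the task and all names are illustrative, modeled loosely on the Gym-style API rather than any vendor's product.

```python
class CountdownEnv:
    """Toy environment with a Gym-style reset/step contract.

    The task: reach state 0 from a starting count. Action 1 decrements
    the state; any other action wastes a step. Purely illustrative."""

    def __init__(self, start: int = 5):
        self.start = start
        self.state = start

    def reset(self) -> int:
        """Begin a new episode and return the initial observation."""
        self.state = self.start
        return self.state

    def step(self, action: int):
        """Apply an action; return (observation, reward, done)."""
        if action == 1:
            self.state -= 1
        done = self.state == 0
        # Small per-step cost encourages short episodes; +1 on success.
        reward = 1.0 if done else -0.1
        return self.state, reward, done
```

Vendors differentiate not on this interface but on what sits behind `step`: real codebases, browsers, or scientific simulators, plus the reward logic that grades the agent's work.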
Former OpenAI Chief Scientist Ilya Sutskever: the end of the scaling myth, a return to the era of research
36Kr· 2026-01-04 05:13
Core Insights
- The conversation emphasizes a shift from scaling AI models to a renewed focus on research and understanding the underlying principles of AI development [26][36]
- Ilya Sutskever expresses skepticism about the belief that simply increasing the scale of AI models will lead to transformative changes, suggesting that the industry may need to return to fundamental research [26][36]
- The discussion highlights fundamental flaws in current AI models, particularly their lack of generalization and the disconnect between evaluation metrics and real-world performance [37]

Group 1: AI Development and Research
- Ilya Sutskever's return to public discourse is significant, especially after his departure from OpenAI and the founding of Safe Superintelligence (SSI), which has raised $3 billion at a valuation of $32 billion [2][3]
- The AI research community reacted strongly to Sutskever's podcast appearance, indicating his influence and the weight of his views on AI development [3][4]
- The conversation opens with a philosophical observation about the current state of AI, likening it to science fiction becoming reality, and questioning the normalization of enormous investments in AI [5][6]

Group 2: Economic Impact and AI Models
- Sutskever discusses the puzzling lag between the impressive performance of AI models in evaluations and their economic impact, suggesting that current models may be overly focused on specific tasks [7][8]
- He offers two explanations: the narrow focus induced by reinforcement learning, and researchers' tendency to optimize for evaluation metrics rather than real-world applicability [10][12]
- An analogy of two competitive-programming students illustrates the difference between specialized training and broader learning, underscoring the limitations of current AI training methods [14][16]

Group 3: Emotional Intelligence and Decision-Making
- The role of emotions in human decision-making is explored, with Sutskever citing a case study that highlights the importance of emotional processing in effective decision-making [18][19]
- He posits that human emotional intelligence may serve as a value function, guiding decisions in a way that current AI models lack [21][22]
- The conversation raises fundamental questions about why humans generalize so much better than AI models, suggesting that understanding this difference is crucial for advancing AI [22][23]

Group 4: Future of AI and SSI's Direction
- Sutskever suggests the AI industry is at a crossroads, moving from an era of scaling to one of research, where the focus will shift back to experimentation and understanding [26][27]
- SSI's initial goal of developing superintelligence without market pressures may evolve, as Sutskever acknowledges the challenges of conceptualizing AGI [28][29]
- The discussion concludes with a reflection on the timeline for achieving superintelligence, with Sutskever estimating 5 to 20 years, more conservative than some industry predictions [33][34]
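The "value function" Sutskever invokes in Group 3 is a standard RL object: an estimate of long-run return that steers moment-to-moment decisions without re-simulating the future. A minimal TD(0) update, shown below as a generic textbook sketch (not anything from the interview), captures how such an estimate is learned from experience.

```python
def td0_update(V: dict, s, r: float, s_next,
               alpha: float = 0.1, gamma: float = 0.99) -> None:
    """One TD(0) update on a tabular value function V.

    Nudge V(s) toward the bootstrapped target r + gamma * V(s_next).
    Unvisited states default to a value of 0.0."""
    target = r + gamma * V.get(s_next, 0.0)
    V[s] = V.get(s, 0.0) + alpha * (target - V.get(s, 0.0))
```

Sutskever's analogy is that human emotion may play this role: a fast, learned signal of "how good is this situation" that guides behavior long before the final outcome is known.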
Engineers who have worked on autonomous driving are still in high demand in other fields
自动驾驶之心· 2025-12-31 00:31
Group 1
- The article highlights the competitive landscape of the autonomous driving industry, with technology, cost, and efficiency as the key areas of competition this year [1]
- The industry has seen many professionals transition to sectors like embodied AI and drones, while autonomous driving remains a mature AI field, keeping algorithm talent highly sought after [1][2]
- Major technological directions in autonomous driving have converged this year, including end-to-end systems, VLA, world models, and reinforcement learning, with many midstream companies tackling challenges like OCC and multi-sensor fusion perception [3]

Group 2
- Membership of the paid community focused on autonomous driving has officially surpassed 4,000, indicating growing interest in technology-route developments and job information [3]
- The company expresses gratitude to its supporters and announces various benefits and discounts for the new year, encouraging continued efforts in the year ahead [4]