Reinforcement Learning
Large models learn to scrub the video progress bar: Alibaba's new research moves video reasoning beyond guesswork to evidence-chain thinking
36Kr · 2026-01-29 09:29
Core Insights
- The research team from Alibaba's Future Life Lab highlights that how models are taught to "think" significantly influences their effectiveness on video reasoning tasks, in contrast with mathematical reasoning, where reinforcement learning (RL) alone yields substantial performance improvements [1][11]

Group 1: ReWatch Dataset
- The ReWatch dataset consists of 10,000 videos, 170,000 question-answer pairs, and 135,000 reasoning chains, addressing three main issues in existing training data: rough video descriptions, overly simplistic Q&A, and a heavy reliance on textual common sense rather than video content [2][4]
- Key features of the ReWatch dataset include high-fidelity temporal subtitles, high-difficulty video Q&A that requires detailed video content to answer, and video-grounded reasoning chains that simulate human-like review-and-confirm behavior [2][4]

Group 2: ReWatch-R1 Model
- The ReWatch-R1 model employs an SFT+RL paradigm with an innovative reward mechanism that emphasizes the reasoning process, rather than just the final answer [6][8]
- The process reward is calculated from observation and reasoning rewards, ensuring that the model learns to derive answers from accurate observations and effective reasoning actions [8]

Group 3: Experimental Results
- ReWatch-R1 achieved state-of-the-art (SOTA) performance across five mainstream video reasoning benchmarks, significantly outperforming all comparable open-source models and validating the proposed methodology [9]
- A critical insight from the experiments is that while supervised fine-tuning (SFT) alone does not surpass the direct-answering mode, the RL phase produces a remarkable performance leap for the "thinking mode," underscoring the necessity of explicit, evidence-based reasoning in complex video tasks [11]

Group 4: Conclusion
- The work on ReWatch-R1 contributes valuable insights and resources to the field of video understanding, addressing the core bottleneck of high-quality video reasoning data and successfully teaching models to engage in deep thinking grounded in video evidence [13]
Large models learn to scrub the video progress bar! Alibaba's new research moves video reasoning beyond guesswork to evidence-chain thinking | ICLR 2026
量子位· 2026-01-29 08:27
Core Insights
- The research team from Alibaba's Future Life Lab highlights that how models are taught to "think" significantly influences their effectiveness on video reasoning tasks [1]
- They propose a high-quality video reasoning dataset called ReWatch and a state-of-the-art model named ReWatch-R1, which can "rewatch" videos like humans to enhance its reasoning capabilities [1]

Group 1: ReWatch Dataset
- The ReWatch dataset consists of 10,000 videos, 170,000 question-answer pairs, and 135,000 reasoning chains, addressing three main issues in existing training data: rough video descriptions, overly simplistic Q&A, and a heavy reliance on textual common sense rather than video content [2][4]
- Key features of the ReWatch dataset include:
  1. High-fidelity temporal captions that provide detailed event descriptions with precise timestamps, forming a solid factual basis for complex reasoning [2]
  2. High-difficulty video Q&A that ensures questions depend on video details, preventing models from relying on guessing or common sense [2]
  3. Video-grounded reasoning chains that simulate the human behavior of "rewatching and confirming" through a multi-agent framework, ensuring reasoning steps are closely tied to video content [2]

Group 2: ReWatch-R1 Model
- The training of the ReWatch-R1 model employs an SFT+RL paradigm with an innovative reward mechanism that emphasizes the reasoning process [6]
- The core of the training method is the process reward mechanism (GRPO with O&R Reward), which supervises and rewards the model's intermediate reasoning steps rather than just the final answer [6][8]
- The process reward is calculated based on:
  1. Observation Reward, which evaluates the accuracy of the model's observations against the high-fidelity captions [8]
  2. Reasoning Reward, which assesses the effectiveness of the model's reasoning actions based solely on its own observations [8]

Group 3: Experimental Results and Insights
- ReWatch-R1 achieved state-of-the-art performance across five mainstream video reasoning benchmarks, significantly outperforming all comparable open-source models [9]
- A key insight from the research is that reinforcement learning (RL) is crucial for unlocking the "thinking" potential of models, producing a substantial performance leap in the reasoning mode over the direct-answering mode [11][12]
- The study emphasizes that explicit, step-by-step reasoning processes supported by evidence are vital for tackling complex video tasks, with RL being the key to fostering this capability [12][14]
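The Observation Reward and Reasoning Reward described above can be illustrated with a minimal sketch of how a process reward might be assembled from the two component scores plus the final-answer check. The weights and the scoring interface here are assumptions for illustration, not the paper's actual formulation:

```python
def process_reward(obs_score: float, reasoning_score: float,
                   answer_correct: bool, w_obs: float = 0.3,
                   w_reason: float = 0.3, w_ans: float = 0.4) -> float:
    """Toy O&R-style process reward (weights are illustrative).

    obs_score: agreement between the model's stated observations and the
        high-fidelity captions (e.g. as judged by a grader), in [0, 1].
    reasoning_score: whether the reasoning steps follow from the stated
        observations alone, in [0, 1].
    answer_correct: whether the final answer matched the reference.
    """
    return (w_obs * obs_score
            + w_reason * reasoning_score
            + w_ans * float(answer_correct))
```

A rollout with perfect observations and reasoning but a wrong final answer still earns partial credit, which is exactly the property that discourages answer-only reward hacking.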
Join this salon for a look at SGLang's frontier practices: ultra-long-context scaling, RL post-training frameworks, diffusion language models, and more
机器之心· 2026-01-29 08:12
Core Insights
- The article discusses the transition of artificial intelligence from a "chat" paradigm to an "actionable" intelligent-agent era, emphasizing the need for deep collaboration and experience sharing among developers in optimizing LLM systems [2]

Event Overview
- A Meetup organized by the SGLang community, Machine Heart (机器之心), and Zhangjiang Incubator will take place on February 6, focusing on LLM system optimization and practical implementation [2]
- The event will feature discussions on SGLang's technical roadmap, long-context expansion, RL post-training frameworks, and diffusion language model exploration [2]

Event Schedule
- 13:30-14:00: Registration
- 14:00-14:30: Keynote on the SGLang roadmap by Zhang Bozhou, core developer of SGLang [5]
- 14:30-15:00: Keynote on Omni-infer performance optimization by Zheng Jinhwan, core developer of Omni-infer [5]
- 15:00-15:30: Keynote on the slime RL scaling post-training framework by Xie Chengxing, Tsinghua University PhD student [5]
- 15:30-16:00: Keynote on SGLang CPP for long-context scaling by Cai Shangming, core developer of SGLang and Mooncake [5]

Guest Introductions
- Zhang Bozhou: Core developer of SGLang, focusing on open-source LLM support and optimization across different CUDA hardware [8]
- Zheng Jinhwan: Huawei technical expert and core contributor to Omni-infer, specializing in high-performance systems and inference optimization [9]
- Xie Chengxing: PhD student at Tsinghua University and core developer of the slime RL framework, with a focus on enhancing LLM reasoning and decision-making capabilities [10]
- Cai Shangming: Researcher at Alibaba Cloud, core contributor to SGLang and Mooncake, with expertise in high-performance inference systems and distributed machine learning [10]
- Li Zehuan: System engineer at Ant Group and core contributor to SGLang, focusing on AI infrastructure optimization [11]
Express | Former OpenAI VP of Research strikes out on his own: new lab raising $500 million to $1 billion
Z Potentials· 2026-01-29 05:35
A wave of so-called new AI labs continues to heat up, with these organizations aiming to push technical frontiers they believe incumbents like OpenAI will miss.

Tworek hopes to build models that need less data and fewer servers to train, the person familiar with the matter said. According to related materials, the company intends to achieve this by designing a new model architecture that moves beyond the Transformer, the foundation of today's mainstream models. Tworek also wants to merge the separate stages of model training into one unified process.

By exploring continual-learning techniques, Core Automation appears to be taking a path similar to Safe Superintelligence, the AI lab co-founded by former OpenAI chief scientist Ilya Sutskever, who has likewise said he is working on models that can keep learning from real-world deployment.

To be clear, mainstream AI developers such as OpenAI and Anthropic are also keenly interested in continual learning. Some AI researchers believe Transformer-based models can be adapted to exhibit continual-learning properties without a wholesale architectural redesign.

Core Automation, the AI startup recently founded by former senior OpenAI researcher Jerry Tworek, is seeking to raise $500 million to $1 billion, according to a person who has spoken with the company. ...
OpenAI's top reasoning researcher goes solo: building AI that "never stops learning," starting with a 7-billion-RMB raise
量子位· 2026-01-29 05:03
Core Viewpoint
- Jerry Tworek, a key figure in AI model reasoning, has founded a new company called Core Automation, focused on "continual learning" in AI models, and plans to raise $1 billion (approximately 7 billion RMB) for the venture [1][15][20]

Company Background
- Jerry Tworek played a crucial role in developing OpenAI's reasoning capabilities and has a strong theoretical and mathematical background, having completed a master's degree in mathematics at the University of Warsaw [4][6][9]
- Before joining OpenAI in 2019, he worked in quantitative research, which shaped his interest in reinforcement learning [7][9]

Focus on Continuous Learning
- The new company aims to address how models can continuously learn from new data and experience, rather than remaining static after deployment [12][15]
- Tworek believes current mainstream models are limited to a "train and deploy" approach that does not adapt to new situations encountered in real-world applications [12][22]

Implementation Strategy
- Core Automation plans to develop a new architecture that does not rely on Transformers and aims to integrate training into a continuous system, allowing models to learn while in operation [17][20]
- The goal is to enable AI models to learn from ongoing experience while retaining previously acquired knowledge [16][22]

Industry Context
- The continual-learning approach is gaining traction, with other companies and academic institutions exploring similar directions, such as Ilya Sutskever's SSI and new methodologies from Google Research [24][28]
- The industry consensus suggests that achieving Artificial General Intelligence (AGI) requires models with capabilities akin to biological systems, including continuous evolution and self-optimization, making continual learning a critical capability [23][24]

Future Outlook
- The ambition to raise $1 billion reflects high expectations for the potential of continual learning in AI, with industry experts predicting that 2026 could be a pivotal year for the field [31]
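The "learn from ongoing experience while retaining prior knowledge" goal described above is classically approximated with experience replay, where each update mixes fresh samples with replayed old ones so new data does not simply overwrite earlier skills. The sketch below is a generic toy illustration of that idea, not Core Automation's (unpublished) method; the class name, buffer size, and replay ratio are all assumptions:

```python
import random

class ReplayLearner:
    """Toy continual learner: mixes new samples with replayed old ones
    so updates on fresh data do not erase earlier experience."""

    def __init__(self, buffer_size: int = 1000, replay_ratio: float = 0.5):
        self.buffer: list = []          # memory of past samples
        self.buffer_size = buffer_size
        self.replay_ratio = replay_ratio

    def update(self, new_batch: list) -> list:
        """Return the mixed batch a model would actually train on."""
        k = int(len(new_batch) * self.replay_ratio)
        replayed = random.sample(self.buffer, min(k, len(self.buffer)))
        # store the new experience, evicting the oldest beyond capacity
        self.buffer.extend(new_batch)
        self.buffer = self.buffer[-self.buffer_size:]
        return new_batch + replayed
```

The design choice is the trade-off the article hints at: a larger replay ratio protects old knowledge at the cost of slower adaptation to the newest data.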
Moonshot AI's three co-founders respond to everything in a late-night AMA, answering 23 questions from global netizens in 3 hours, with Yang Zhilin teasing major gains for Kimi K3
36Kr · 2026-01-29 00:17
Core Insights
- The core discussion during the AMA focused on the advancements and future plans of the company, particularly regarding the Kimi K2.5 model and the upcoming Kimi K3 model [1][3][7]

Group 1: Company and AI Industry Insights
- The company held an AMA session on Reddit, where co-founders discussed various topics related to AI and the company's direction [1][3]
- The company emphasizes a shared value of "making things happen" rather than just focusing on superficial achievements [4][9]
- The current GPU count remains a disadvantage compared to competitors, but the exact computational requirements for achieving AGI are still uncertain [8][9]

Group 2: Kimi K2.5 Technical Details
- Kimi K2.5 is the company's most powerful model to date, showing strong performance in visual, programming, and general tasks, with a notable feature called "agent swarm" that can manage up to 100 sub-agents, improving task execution efficiency by up to 450% [4][7]
- The model's occasional self-reference as "Claude" is attributed to the upsampling of recent programming data during pre-training, rather than evidence of distillation from Claude [3][16]
- Kimi K2.5 has demonstrated superior performance in various benchmark tests compared to Claude [16][17]

Group 3: Future Plans for Kimi K3
- Kimi K3 will incorporate more architectural optimizations based on the Kimi Linear framework, with expectations of significant improvements, even if not a tenfold increase in performance [4][21]
- The company is exploring continual learning to enhance model autonomy and efficiency over time [21][24]
- Maintaining and improving creative writing and emotional understanding alongside programming skills remains a priority for the company [19][20]
Inside QCraft (轻舟智航)'s L2/L4 autonomous driving stack: one-stage end-to-end, VLA, and world models
自动驾驶之心· 2026-01-26 07:16
On the 21st, QCraft's first urban NOA solution built on a single Journey 6M chip officially launched on Li Auto's L-series smart refresh models. On the 23rd, QCraft held a launch event; here is a share of its technical portion.

Implementing one-stage end-to-end plus reinforcement learning on a single J6M is, frankly, genuinely impressive. Let's break down the overall network architecture together: the part above is a common OneModel architecture; what follows is where it differs. Safe RL (adding rule-based checks) is then used to further optimize the ego vehicle's trajectory. The architecture as a whole is not complicated; the hard part is realizing it within the J6M's 128 TOPS of compute. Right away, people asked 柱哥 whether it was real.

DiffusionDrive and Flow Matching are algorithms that multiple companies have already validated as production-ready. Two more algorithms worth recommending are Diffusion Planner and Flow Planner; Flow Planner is an improved version of Diffusion Planner, from Professor Zhan Xianyuan's team at Tsinghua AIR.

QCraft also showed demos of several difficult scenarios. The figure below shows a real L2 vehicle handling severely misaligned roads and unprotected left turns at complex intersections, both with good results. Severely misaligned roads test static-perception fundamentals, and not just road/lane ...
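Flow Matching, mentioned above as a production-validated planning algorithm, trains a velocity network to transport noise samples onto expert trajectories along simple interpolation paths. The sketch below shows only the regression target for that training, as a generic illustration of the technique, not QCraft's implementation; the linear-path choice is the common conditional-flow-matching assumption:

```python
def flow_matching_target(x0, x1, t):
    """Conditional flow matching with linear interpolation paths.

    x0: a noise sample; x1: an expert trajectory (same-length sequences);
    t: interpolation time in [0, 1].
    Returns the interpolated point x_t and the regression target for the
    velocity network v_theta(x_t, t), which on linear paths is x1 - x0.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target_velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, target_velocity
```

Training minimizes the squared error between the network's predicted velocity at (x_t, t) and this target; at inference, integrating the learned velocity field from noise yields a trajectory, typically in far fewer steps than diffusion sampling.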
In Depth | AI devours software, AI builds AI: 2026 predictions from Davos
Z Potentials· 2026-01-25 11:03
Core Concept
- The article discusses the emerging concept of the "Neural Spine," which represents a shift in how organizations perceive and integrate AI into their core operations, moving from AI as a tool to AI as the backbone of the organization [2]

Group 1: Defining AI-Native Companies
- Traditional companies focus on optimizing existing workflows with AI, while AI-native companies start from the premise of "what can we create with unlimited intelligence" [3]
- A company is considered AI-driven when three to five core workflows across its business lines are fully executed by AI, moving beyond simple AI applications [3]

Group 2: Measuring Organizational Efficiency
- A new metric, the Human-to-Agent Ratio, is proposed to measure organizational efficiency, highlighting that some companies operate with a small number of human employees supported by numerous AI agents [4]
- The trend of "Bring Your Own AI" (BYOAI) indicates that individuals are increasingly using AI tools in their work, enhancing productivity and resonating with organizational changes [4][5]

Group 3: The Transformation of Software
- The notion that "AI is consuming software" suggests a shift in which software becomes less visible, with AI enabling natural-language access to software functionality [8]
- The cost of AI capabilities has fallen dramatically, with the average cost of AI inference dropping 100-fold over the past year, giving rise to the concept of disposable software [9]

Group 4: Building Trust in AI
- Trust is a significant barrier to integrating AI into core business processes, with compliance and governance being major concerns for large enterprises [11]
- Establishing transparency in AI processes is essential for building trust, requiring AI to provide traceable reasoning and decision-making processes [12]

Group 5: Future Predictions for AI
- Predictions for the future include AI developing its own models and exhibiting continual-learning capabilities, which could revolutionize how AI is applied in business [13]
- Agent orchestration and understanding the dynamics of multi-agent systems will be critical as AI becomes more integrated into business processes [14]

Group 6: Unique Aspects of China's AI Ecosystem
- China's AI ecosystem is characterized by a focus on foundational research and efficiency-driven innovation, leveraging market scale and user openness [15]
No more reward hacking! HKUST and Kuaishou's Kling propose an efficient new paradigm for RL post-training of diffusion models
机器之心· 2026-01-25 02:35
Core Insights
- The article discusses the challenges of using reinforcement learning (RL) to fine-tune diffusion models like Stable Diffusion, particularly "reward hacking," which can degrade image quality [2][5]
- A new framework called GARDO (Gated and Adaptive Regularization with Diversity-aware Optimization) is introduced, which aims to prevent reward hacking while enhancing sample exploration and generation diversity [2][12]

Background and Motivation
- RL has shown promising results in visual tasks, but defining an ideal reward function is challenging, often forcing the use of proxy rewards that invite reward hacking [4][5]
- The article highlights further pitfalls of RL post-training, including low sample efficiency and exploration hindered by a static reference model [9][10]

GARDO Framework
- GARDO addresses reward hacking through three core mechanisms:
  1. A Gated KL Mechanism, which applies KL regularization only when the model generates samples in unreliable reward regions [14][15]
  2. An Adaptive Regularization Target, which updates the reference model to prevent optimization stagnation [17]
  3. Diversity-Aware Advantage Shaping, which encourages diversity among generated samples to avoid mode collapse [18][19]

Experimental Results
- GARDO has been tested on various base models (SD3.5-Medium, Flux.1-dev) and demonstrated significant advantages over baseline methods such as Flow-GRPO [20][21]
- The framework effectively prevents reward hacking while maintaining high image quality and sample efficiency, achieving better performance with fewer training steps [22][23]

Emergent Behavior
- GARDO can generate a higher number of objects in challenging tasks, indicating its potential to unlock new capabilities in visual generation [24][25]

Conclusion
- The work emphasizes that precise control matters more than strict constraints in visual generation with RL, making GARDO a valuable framework for researchers and developers looking to leverage RL in diffusion models [27]
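The gated-KL idea described above can be sketched in a few lines: the KL term that drags the policy back toward the reference model is switched on only for samples suspected of sitting in an unreliable reward region. The gating criterion below (a simple reward threshold) and the hyperparameters are placeholder assumptions for illustration; GARDO's actual reliability test is defined in the paper:

```python
def gated_kl_penalty(reward: float, kl_div: float,
                     reward_threshold: float, beta: float = 0.1) -> float:
    """Toy gated-KL regularizer (threshold and beta are illustrative).

    Apply the KL penalty only when the sample's proxy reward is
    suspiciously high, i.e. likely in an unreliable reward region;
    elsewhere the penalty is zero, so exploration is not dragged
    back toward the reference model.
    """
    if reward > reward_threshold:   # possible reward hacking: regularize
        return beta * kl_div
    return 0.0                      # trusted region: no KL drag
```

Compared with an always-on KL term, this gating preserves exploration where the proxy reward is trustworthy while still anchoring the policy where it is most likely to be exploited.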
The "curves" and "lane changes" behind AI racing's new world record
新浪财经 (Sina Finance) · 2026-01-24 05:10
Core Insights
- The AI racing team from Tsinghua University set a world record by completing the 10.77 km Tianmen Mountain course in 16 minutes 10.838 seconds, showcasing advances in AI-driven autonomous racing technology [1][3]

Group 1: Technical Challenges and Innovations
- The Tianmen Mountain course presents a "composite extreme" testing environment, with satellite signal interruptions, steep slopes, and numerous sharp turns requiring the AI to make precise decisions within milliseconds [3]
- The team developed a dynamic local-map loading algorithm to overcome the limits of traditional fully loaded 3D point-cloud maps, enabling real-time high-precision positioning [3][4]
- Data collection was enhanced through vehicle-cloud collaboration and a combination of virtual and real-world data, integrating factors such as corner-entry angles and road conditions into the AI model [3]

Group 2: Learning and Development Pathways
- Since 2018, the Tsinghua research team has focused on a new end-to-end autonomous driving approach centered on reinforcement learning, significantly reducing training costs compared with traditional methods that rely on vast amounts of real-vehicle data [4]
- The team introduced China's first fully neural-network-based end-to-end autonomous driving system, a significant technological breakthrough for the industry [4]

Group 3: Real-World Application and Future Directions
- The success at Tianmen Mountain serves as a critical test for autonomous technology, underscoring that AI algorithms must be validated in real and extreme scenarios to ensure their effectiveness and robustness [5]
- The perception-positioning fusion technology developed allows vehicles to achieve highly real-time, high-precision trajectory estimation, improving stability in critical situations [5]
- Despite rapid advances in autonomous driving technology, a notable gap remains between AI capabilities and human performance in extreme road conditions, leaving ample room for future research and innovation [5]
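The dynamic local-map loading described above can be sketched as tile-based streaming: the full 3D point-cloud map is split into a grid, and only the tiles near the vehicle's current pose are held in memory. The tile size, radius, and function interface below are assumptions for illustration, not the Tsinghua team's actual design:

```python
def tiles_to_load(x: float, y: float, tile_size: float = 100.0,
                  radius_tiles: int = 1) -> set:
    """Return grid indices of the map tiles within `radius_tiles` of the
    vehicle, so only a local window of the point-cloud map is resident.

    x, y: vehicle position in map coordinates (meters, illustrative).
    """
    cx, cy = int(x // tile_size), int(y // tile_size)
    return {(cx + dx, cy + dy)
            for dx in range(-radius_tiles, radius_tiles + 1)
            for dy in range(-radius_tiles, radius_tiles + 1)}
```

As the car moves, the loader diffs this set against the currently resident tiles, fetching new ones and evicting those that fall outside the window, which keeps memory bounded regardless of total course length.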