Reinforcement Learning
Zai GLM 4.6: What We Learned From 100 Million Open Source Downloads — Yuxuan Zhang, Z.ai
AI Engineer· 2025-11-20 14:14
Model Performance & Ranking
- GLM 4.6 is currently ranked No. 1 on the LMSYS Chatbot Arena, on par with GPT-4o and Claude 3.5 Sonnet [1]
- The GLM family of models has surpassed 100 million downloads [1]

Training & Architecture
- Z.ai used a single-stage reinforcement learning (RL) approach to train GLM 4.6 [1]
- Z.ai developed the SLIME RL framework to handle complex agent trajectories [1]
- GLM 4.6 was pre-trained on 15 trillion tokens [1]
- Z.ai filters the 15T-token corpus, moves to repo-level code contexts, and integrates agentic reasoning data [1]
- A token-weighted loss is used for coding data (see the sketch after this entry) [1]

Multimodal Capabilities
- GLM 4.5V features native-resolution processing to improve UI navigation and video understanding [1]

Deployment & Integration
- GLM models can be deployed with vLLM, SGLang, and Hugging Face [1]

Research & Development
- Z.ai is actively researching models such as GLM-4.5, GLM-4.5V, CogVideoX, and CogAgent [1]
- Z.ai is researching the agent capabilities of its models and integration with agent frameworks such as langchain-chatchat and chatpdf [1]
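The token-weighted loss mentioned above is worth making concrete. Below is a minimal PyTorch sketch of per-token loss weighting for code; the function name, shapes, and weighting rule are illustrative assumptions, not Z.ai's published implementation.

```python
import torch
import torch.nn.functional as F

def token_weighted_ce(logits, targets, weights):
    """Cross-entropy where each target token carries its own weight.

    logits:  (batch, seq, vocab) model outputs
    targets: (batch, seq) token ids
    weights: (batch, seq) per-token weights, e.g. up-weighting code
             tokens that matter for correctness over boilerplate.
    """
    # Per-token negative log-likelihood, no reduction yet.
    nll = F.cross_entropy(
        logits.transpose(1, 2),  # (batch, vocab, seq), the layout F.cross_entropy expects
        targets,
        reduction="none",
    )
    # Weighted mean, so high-weight tokens dominate the gradient.
    return (weights * nll).sum() / weights.sum()

# Toy usage with random data and a hypothetical weighting rule.
batch, seq, vocab = 2, 8, 100
logits = torch.randn(batch, seq, vocab, requires_grad=True)
targets = torch.randint(vocab, (batch, seq))
weights = torch.ones(batch, seq)
weights[:, 4:] = 2.0  # hypothetical: weight later tokens twice as much
loss = token_weighted_ce(logits, targets, weights)
loss.backward()
```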
Emergent Behavior in Autonomous Driving with Wayve CEO Alex Kendall
Sequoia Capital· 2025-11-18 17:01
Reasoning in the physical world can be really well expressed as a world model. In 2018, we put our very first world model approach on the road. It was a very small 100,000-parameter neural network that could simulate a 30x3 pixel image of a road in front of us. But we were able to use it as this internal simulator to train a model-based reinforcement learning algorithm. Fast forward to today and we've developed GAIA. It's a full generative world model that's able to simulate multiple cameras and sensors and v ...
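The pattern Kendall describes, rolling a policy out inside a learned simulator rather than the real world, is the core of model-based RL. A minimal sketch under assumed interfaces follows; TinyWorldModel, imagine_rollout, and every dimension are hypothetical stand-ins, not Wayve's architecture.

```python
import torch
import torch.nn as nn

class TinyWorldModel(nn.Module):
    """Stand-in for the small world model in the talk: predicts the next
    latent observation and a reward from the current state and action."""
    def __init__(self, obs_dim=64, act_dim=2):
        super().__init__()
        self.dynamics = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 128), nn.ReLU(),
            nn.Linear(128, obs_dim),
        )
        self.reward_head = nn.Linear(obs_dim, 1)

    def forward(self, obs, act):
        nxt = self.dynamics(torch.cat([obs, act], dim=-1))
        return nxt, self.reward_head(nxt)

def imagine_rollout(world_model, policy, obs, horizon=10):
    """Model-based RL core: roll the policy inside the learned simulator
    and return the imagined return to backpropagate through."""
    rewards = []
    for _ in range(horizon):
        act = policy(obs)
        obs, r = world_model(obs, act)
        rewards.append(r)
    return torch.stack(rewards).sum()

# Train the policy purely on imagined experience.
wm = TinyWorldModel()
policy = nn.Sequential(nn.Linear(64, 2), nn.Tanh())
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
obs = torch.randn(16, 64)                 # batch of latent observations
loss = -imagine_rollout(wm, policy, obs)  # maximize imagined return
opt.zero_grad(); loss.backward(); opt.step()
```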
RLinf Adds πRL: Online Reinforcement Learning Fine-Tuning for π0 and π0.5
机器之心· 2025-11-06 08:58
Core Insights
- The article discusses advancements in robotics, particularly the VLA models π0 and π0.5 developed by Physical Intelligence, which use flow matching to generate high-dimensional, smooth, continuous action sequences and demonstrate significant advantages in complex manipulation tasks [2][3]

Group 1: VLA Models and Challenges
- VLA models rely heavily on large-scale, high-quality human demonstration data, which is costly and time-consuming to collect and annotate [2]
- Reinforcement learning (RL) lets agents explore and iteratively improve through real interactions with the environment, reducing the dependency on extensive data and raising the performance ceiling of supervised fine-tuning (SFT) [2]

Group 2: πRL Framework
- A collaboration among institutions including Tsinghua University, Peking University, and CMU produced the πRL framework for online reinforcement learning fine-tuning of flow-matching VLA models [3]
- The πRL framework achieved average success rates of 97.6% for π0 and 98.3% for π0.5 on the LIBERO testing platform, validating the fine-tuning approach [3]

Group 3: Technical Innovations
- πRL introduces two technical routes, Flow-Noise and Flow-SDE, to address the challenge that the log-likelihood of output actions cannot be computed directly in flow-matching VLAs [8][10]
- Flow-Noise models the denoising process as a discrete Markov process, enabling direct computation of the joint probability density of the denoised action sequence (a sketch follows this list) [10]
- Flow-SDE couples the denoising process with environment interaction, constructing a two-layer Markov decision process (MDP) [20]

Group 4: Performance Improvements
- πRL demonstrated a success-rate increase of over 40% across 4,352 grasp-and-place task combinations, achieving final success rates exceeding 80% [3][24]
- On LIBERO, πRL improved the average success rate of π0 from 57.6% to 97.6% and of π0.5 from 77.1% to 98.3%, surpassing flow-matching VLAs trained on the full dataset [19]

Group 5: Generalization and Robustness
- πRL significantly enhances both models' generalization in new environments, as evidenced by domain-randomization tests [26]
- The framework also reduces the average number of steps needed to complete tasks, indicating improved efficiency over supervised fine-tuning [28]

Group 6: Future Directions
- Future work on πRL will include more benchmark tests, deeper analysis of out-of-distribution (OOD) generalization, and further exploration of critic design for improved stability [35][36]
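A minimal sketch of the Flow-Noise route as summarized above: injecting Gaussian noise into each Euler denoising step turns the deterministic flow into a discrete Markov chain whose joint log-likelihood is a sum of per-step Gaussian log-probs, which an RL objective can then use. The velocity_net interface, noise schedule, and all names are assumptions for illustration, not πRL's actual code.

```python
import math
import torch

def noisy_denoise_logprob(velocity_net, x_T, sigmas, dt):
    """Flow-Noise-style rollout: each Euler step gets additive Gaussian
    noise, so the denoising chain is a discrete Markov process and the
    joint log-likelihood of the sampled action is a sum of per-step
    Gaussian log-probs. Names and signatures are illustrative only."""
    x, logp = x_T, torch.zeros(x_T.shape[0])
    const = 0.5 * math.log(2 * math.pi)
    for k, sigma in enumerate(sigmas):
        t = torch.full(x.shape[:1], 1.0 - k * dt)    # current flow time
        mean = x + dt * velocity_net(x, t)           # deterministic Euler step
        x_next = mean + sigma * torch.randn_like(x)  # learnable noise scale
        # log N(x_next; mean, sigma^2 I), summed over action dimensions
        step_logp = -0.5 * ((x_next - mean) / sigma) ** 2 - torch.log(sigma) - const
        logp = logp + step_logp.sum(dim=-1)
        x = x_next
    return x, logp  # final action and its log-likelihood for the RL update

# Toy usage: a stand-in velocity field and 10 learnable noise scales.
vel = lambda x, t: -x
sigmas = torch.full((10,), 0.1, requires_grad=True)
x_T = torch.randn(8, 4)  # batch of 8 noise vectors, 4-dim actions
action, logp = noisy_denoise_logprob(vel, x_T, sigmas, dt=0.1)
# logp would enter a PPO-style ratio against the old policy's log-prob.
```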
X @Nick Szabo
Nick Szabo· 2025-11-06 05:37
RT TuringPost (@TheTuringPost): .@karpathy's nanochat is bigger than you think. He calls it a ramp, but it's actually a lab of its own – a miniature system where anyone can experiment. And most importantly, it's deeply connected to education, allowing us to understand machine intelligence through a tiny model: 1. What is nanochat and how can you use it? It's a miniature LM that costs anything from $100 (~4 hours on an 8xH100 node) to train and behaves like a small, curious creature. Karpathy described it as a “kind ...
AI Engineer Code Summit 2025: AIE/CODE Track
AI Engineer· 2025-11-03 21:03
AI Coding Agents & Tools
- Focus on building and improving AI coding agents, covering topics from agent reinforcement fine-tuning to proactive agents [1]
- Discussions of tools and platforms for AI-assisted coding, including code evaluation, world models for computation, and agent-ready codebases [1]
- Exploration of using AI to speed up code execution and address the software crisis [1]

Reinforcement Learning (RL) in Coding
- Research on efficient reinforcement learning and its application in coding environments at scale [1]
- Agent reinforcement fine-tuning [1]
- Building a fast frontier model with RL [1]

Future of Software Development
- Examination of the future of software development with AI, including continual system-prompt learning and the path toward AGI [1]
- Investment trends in the future of software development [1]
- The measurement gap between benchmarks and economics in AI capability [1]

Codebase Management & Problem Solving
- Strategies for solving hard problems in complex codebases [1]
- Making codebases "agent-ready" [1]
- The transition from code snippets to codebases in coding evaluations [1]
X @Demis Hassabis
Demis Hassabis· 2025-10-30 17:44
RT Tom Zahavy (@TZahavy): I am excited to share work we did in the Discovery team at @GoogleDeepMind using RL and generative models to discover creative chess puzzles 🔊♟️♟️ #neurips2025 🎨 While strong chess players intuitively recognize the beauty of a position, articulating the precise elements that constitute creativity remains elusive. To address this, we pre-trained generative models on public datasets and then applied reinforcement learning, using novel rewards designed for uniqueness, counter-intuitiven ...
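The tweet outlines the recipe, pre-train a generative model and then fine-tune with RL under rewards for qualities like uniqueness and counter-intuitiveness, without specifying the reward design. Below is a hypothetical sketch of how such axes might combine into one scalar reward; the axes, inputs, and weights are assumptions, not the paper's actual rewards.

```python
import torch

def composite_puzzle_reward(uniqueness, counterintuitiveness, validity,
                            w=(0.4, 0.4, 0.2)):
    """Combine several creativity scores for a generated puzzle into one
    scalar RL reward. Axes and weights are assumptions for illustration.

    uniqueness:           e.g. distance to the nearest puzzle in a reference set
    counterintuitiveness: e.g. disagreement between shallow and deep engine evals
    validity:             1.0 if the puzzle has a unique winning line, else 0.0
    """
    scores = torch.stack([uniqueness, counterintuitiveness, validity], dim=-1)
    return scores @ torch.tensor(w)  # one reward per puzzle in the batch

# Toy usage: 0.4*0.8 + 0.4*0.6 + 0.2*1.0 = 0.76
r = composite_puzzle_reward(torch.tensor([0.8]),
                            torch.tensor([0.6]),
                            torch.tensor([1.0]))
```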
X @Wu Blockchain
Wu Blockchain· 2025-10-30 12:03
Project Overview
- Avalon Labs released a whitepaper introducing an on-chain, AI-driven RWA marketplace and the concept of AI-MaaS [1]
- Avalon Labs launched a reinforcement learning model on BNB Chain [1]
- Avalon Labs introduced CRT, a new RWA tokenization standard that confers legal commercial rights [1]

Backing & Support
- The project is backed by YZi Labs and Framework Ventures [1]
The next ‘golden age’ of AI investment
Fortune· 2025-10-30 10:48
Core Insights
- The recent Fortune Global Forum in Riyadh highlighted discussions on the transformative impact of artificial intelligence across various industries, featuring prominent speakers from major companies [1]
- Anjney Midha of Andreessen Horowitz identified a new "golden age" of investment opportunities in AI, driven by the emergence of innovative frontier teams [2]
- Midha emphasized the significance of reasoning models in AI, which enhance problem-solving capabilities by mimicking logical reasoning and reflection [3]
- The potential of reinforcement learning to create multibillion-dollar companies was discussed, particularly when startups deeply understand industry-specific challenges [4]
- Despite concerns about a potential AI bubble, investment in the sector continues to surge, with significant funding levels reported [5]

Investment Trends
- Venture capital investment in generative AI is projected to exceed $73.6 billion in 2025, more than double the previous year; total investment in the AI ecosystem has reached $110.17 billion, an eightfold increase since 2019 [6]
- Major foundation-model providers are attracting substantial funding: OpenAI has secured $40 billion, Anthropic $13 billion, and Mistral AI €1.7 billion [7]

Industry Developments
- The Cyber 60 list, ranking promising cybersecurity startups, showcases new entrants developing tools to combat AI threats, alongside established companies expanding their customer bases [8]
Cursor Releases Its First Coding LLM: Code Generation at 250 Tokens/s, Reinforcement Learning + MoE Architecture
量子位· 2025-10-30 01:06
Core Insights
- Cursor has officially released its first in-house coding model, named Composer, as part of the Cursor 2.0 update [1][2]
- Composer is reported to complete complex tasks in as little as 30 seconds, a speed increase of 400% over competitors [3][12]

Model Features
- Cursor 2.0 includes a native browser tool that lets the model test, debug, and iterate on code autonomously until it reaches a correct result [4]
- Voice code generation lets users turn spoken ideas into code without typing [5]
- The interface has shifted from a file-centric to an agent-centric model, allowing multiple agents to run simultaneously without interference [6][7]

Performance Metrics
- Composer generates code at 250 tokens per second, roughly twice as fast as current leading models such as GPT-5 and Claude Sonnet 4.5 [19][20]
- The model demonstrates enhanced reasoning and task-generalization capabilities, comparable to mid-tier frontier models [21]

Training Methodology
- Composer's performance is attributed to reinforcement learning, which lets the model learn from real programming tasks rather than static datasets (see the sketch after this list) [22][26]
- During training the model works directly inside a complete codebase, using production-level tools to write, test, and debug code [27][28]

Practical Application
- Cursor 2.0 is designed as a practical AI system aligned with developers' daily workflows, enhancing its usability in real-world scenarios [35][36]
- The model has shown emergent behaviors, such as running unit tests and autonomously fixing code-format errors [31]

Transparency and Model Origin
- Concerns remain about the transparency of Composer's foundation, including whether it is based on pre-existing models or entirely self-trained [37][40]
- Cursor previously developed an internal model named Cheetah, which was used to test speed and system integration [42]
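The training methodology described, RL against real codebases with tool use and test feedback, can be sketched as an episode loop. Everything below is a schematic assumption (the agent interface, the action schema, the pytest reward); it is not Cursor's actual infrastructure.

```python
import subprocess

def run_tests(repo_dir):
    """Reward signal from the project's own test suite (assumes pytest)."""
    result = subprocess.run(["pytest", "-q"], cwd=repo_dir,
                            capture_output=True, text=True)
    return (1.0 if result.returncode == 0 else 0.0), result.stdout[-2000:]

def coding_episode(agent, repo_dir, task, max_steps=20):
    """Schematic RL episode in the spirit of the article: the policy edits
    files, runs the tests, and is rewarded when they pass. The `agent`
    object and its action schema are assumptions, not Cursor's API."""
    observation = f"Task: {task}"
    trajectory = []
    for _ in range(max_steps):
        # e.g. {"tool": "edit", "path": "src/x.py", "content": "..."}
        action = agent.act(observation)
        if action["tool"] == "edit":
            with open(f"{repo_dir}/{action['path']}", "w") as f:
                f.write(action["content"])
            observation = "edit applied"
            trajectory.append((action, 0.0))  # no reward for edits alone
        elif action["tool"] == "test":
            reward, observation = run_tests(repo_dir)
            trajectory.append((action, reward))
            if reward == 1.0:
                break  # tests pass: episode done
    return trajectory  # (action, reward) pairs for the policy-gradient update
```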
Grasp Anything from a Single Demonstration: Peking University Team's Breakthrough in General Grasping, Adaptable to Any Dexterous-Hand Embodiment
36Kr· 2025-10-29 08:55
Core Insights
- The article introduces DemoGrasp, a novel framework for robotic grasping that addresses challenges of traditional reinforcement learning (RL) methods, particularly high-dimensional action spaces and complex reward functions [1][4][6]

Group 1: Framework Overview
- DemoGrasp improves the efficiency of grasp learning by starting from a single successful demonstration trajectory and editing that trajectory to adapt to various objects and poses [4][8]
- The framework transforms the multi-step Markov decision process (MDP) into a single-step MDP based on trajectory editing, significantly improving learning efficiency and performance transfer to real robots (a sketch follows this list) [4][6]

Group 2: Learning Process
- Learning proceeds by editing the trajectory of a successful grasp to accommodate new objects, adjusting wrist and finger positions to fit unseen items [8][12]
- DemoGrasp trains the policy network in a simulation environment with thousands of parallel worlds, achieving over 90% success after 24 hours of training on a single RTX 4090 GPU [8][10]

Group 3: Performance Metrics
- On the DexGraspNet dataset, DemoGrasp outperformed existing methods, achieving a 92% visual-policy success rate with only a 1% generalization gap between training and test sets [10][13]
- The framework adapted across diverse robot embodiments, averaging an 84.6% success rate on 175 objects without retuning training hyperparameters [14][15]

Group 4: Real-World Application
- In real-world tests, DemoGrasp grasped 110 unseen objects, with success rates above 90% for regular-sized items and 70% for challenging flat and small objects [15][16]
- The framework supports complex grasping in cluttered environments, maintaining an 84% success rate on single-instance real-world grasps despite significant variations in lighting and object placement [16][17]
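The single-step-MDP-via-trajectory-editing idea summarized in Group 1 can be sketched compactly: the policy makes one decision (an edit to the demo trajectory) and receives one terminal reward. All dimensions, the env interface, and apply_edit are hypothetical, not DemoGrasp's published code.

```python
import torch
import torch.nn as nn

class TrajectoryEditPolicy(nn.Module):
    """One observation in, one trajectory edit out: the policy predicts a
    wrist-pose offset and finger-joint offsets applied to the single demo.
    All dimensions are illustrative assumptions."""
    def __init__(self, obs_dim=256, wrist_dim=6, finger_dim=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, wrist_dim + finger_dim),
        )
        self.wrist_dim = wrist_dim

    def forward(self, obs):
        edit = self.net(obs)
        return edit[..., :self.wrist_dim], edit[..., self.wrist_dim:]

def single_step_episode(policy, demo_traj, env):
    """The whole grasp is one MDP step: edit the demo trajectory, execute
    it, observe a single terminal reward. `env` and `apply_edit` are
    hypothetical interfaces."""
    obs = env.observe()                                   # scene features
    wrist_off, finger_off = policy(obs)
    edited = demo_traj.apply_edit(wrist_off, finger_off)  # shift demo to the new object
    return env.execute(edited)                            # 1.0 on a successful grasp
```

Framing the whole grasp as one decision is what makes the RL problem tractable here: there is no credit assignment across time steps, so a sparse success reward is enough.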