RL - filings, earnings calls, financial reports, news

自动驾驶之心· 2025-11-28 00:49

Group 1
- The article announces that live streams and content access are moving to the "Embodied Intelligence Heart Knowledge Planet" platform [2]
- It recalls the high-quality roundtable discussions previously held on topics such as robot hardware, data, and simulation [2]
- The current discussion focuses on VLA algorithms and how to deploy them together with reinforcement learning (RL) [3]

Group 2
- Key topics include the pain points of VLA architectures and models [6]
- Advances in whole-body motion control that improve robot performance [6]
- How to deploy VLA with RL effectively on real robots, including compute board selection and lightweight model design [6]

Group 3
- Guests include the Vice President of Algorithms at Digua Robotics, the Chief Researcher at Beijing Humanoid Robotics, and a partner at Yuanli Lingji [9][11][13]
- Also joining is a Tsinghua University PhD who will soon join the Tsinghua Institute of Advanced Studies as an assistant professor [15]

Group 4
- The article promotes the in-depth content available on the "Embodied Intelligence Heart" knowledge platform, including technical details, Q&A, and exclusive insights [18]
- It describes dexterous hand design as a key technology for closing the "hand-eye-brain" perception loop [18]
- It introduces the concept of the "Agent" and its significance in theory, academia, and engineering [18]
- It mentions Spec-VLA, a framework designed specifically to accelerate VLA inference [18]
- It also highlights recent CMU work on cross-embodiment world models that support few-shot robot learning [18]

Embodied intelligence

具身智能之心· 2025-11-24 10:02

Group 1
- The article announces that the community is recruiting instructors for courses and projects on VLA (Vision-Language-Action) models and RL (Reinforcement Learning) [1]
- The community seeks candidates whose research focuses on VLA and RL, preferably PhD holders or current doctoral students with publications at top academic conferences [2]
- For industry candidates, practical experience and hands-on debugging on real robots are desired [2]

Group 2
- The organization, "Embodied Intelligence," is identified as the first comprehensive technical exchange community in China focusing on VLA and RL, and it has gathered a large number of students in these fields [3]
- It offers instructors above-industry-average compensation along with abundant industry resources [4]
- Interested individuals are encouraged to add the specified WeChat contact for details [5]

具身智能之心· 2025-11-21 00:04

Group 1
- The article announces that the community is recruiting instructors for courses and projects on VLA (Vision-Language-Action) models and RL (Reinforcement Learning) [1]
- The community seeks candidates whose research focuses on VLA and RL, preferably PhD holders or current doctoral students with publications at top academic conferences [2]
- For industry candidates, practical experience and hands-on debugging on real robots are desired [2]

Group 2
- The organization, referred to as "Embodied Intelligence," is described as the first comprehensive technical exchange community in China, gathering a large number of people focused on VLA and RL [3]
- It offers instructors above-industry-average compensation along with abundant industry resources [4]
- Interested individuals are encouraged to add the specified WeChat contact for more information [5]

The essence of SFT is actually optimizing a lower bound of the RL objective...

自动驾驶之心· 2025-10-22 00:03

Core Insights
- The article shows that, under sparse rewards, the training objective of Supervised Fine-Tuning (SFT) is a loose lower bound of the Reinforcement Learning (RL) objective, and it introduces a bridge distribution that tightens this lower bound while keeping training stable [1][9][23].

Group 1: Relationship Between SFT and RL
- The article defines the training objective of RL policy gradient algorithms and links SFT to RL by deriving one objective from the other [4][3].
- SFT trains on a fixed set of labeled data, whereas RL samples online and optimizes the policy model according to reward values [5][9].
- The SFT optimization objective can be viewed as a lower bound of the RL objective, which explains why SFT training is effective to some degree [9][23].

Group 2: Importance Sampling and Adjustments
- Importance sampling is applied to the RL training objective to move from online sampling to offline sampling [6][11].
- A key finding is that the SFT lower bound may become looser as training progresses, so an adjustment is needed to tighten it [9][11].
- An auxiliary distribution is introduced to adjust the SFT training objective, giving a tighter lower bound while preserving training stability [11][12].

Group 3: Properties of iw-SFT
- The importance-weighted SFT (iw-SFT) formulation includes a freely adjustable weight coefficient, which is what allows the lower bound to be tightened [11][13].
- The choice of the auxiliary distribution is critical; it should be close to the reference distribution to ensure a tight lower bound while maintaining stability [13][14].
- Two methods for constraining the importance weights are proposed: clipping the weights and smoothing them to reduce variance [14][15].

Group 4: Practical Implications
- A multi-armed bandit example illustrates the advantage of iw-SFT: it can use information from negative samples to improve policy convergence [18][19][20].
- The overall conclusion stresses the value of understanding the relationship between SFT and RL and how such adjustments can improve training outcomes [23].
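The lower-bound relationship summarized above can be checked numerically on a toy bandit. The sketch below is an illustration written for this summary, not the article's code: it assumes non-negative rewards, a one-step (bandit) policy, and the elementary inequality x >= 1 + log x. The function names, the `clip` parameter, and the example distributions are all hypothetical.

```python
import numpy as np

def rl_objective(pi_theta, reward):
    """Expected reward J = sum_a pi_theta(a) * R(a)."""
    return float(np.sum(pi_theta * reward))

def sft_lower_bound(pi_theta, pi_ref, reward):
    """SFT-style bound: E_{a ~ pi_ref}[ R(a) * (1 + log(pi_theta(a) / pi_ref(a))) ].

    Follows from pi_theta / pi_ref >= 1 + log(pi_theta / pi_ref) when R >= 0.
    """
    return float(np.sum(pi_ref * reward * (1.0 + np.log(pi_theta / pi_ref))))

def iw_sft_lower_bound(pi_theta, pi_ref, q, reward, clip=None):
    """Importance-weighted bound using an auxiliary distribution q (assumed name).

    The weight w = q / pi_ref re-weights samples drawn from pi_ref. Optional
    clipping keeps w in [1/clip, clip]; the result is still a lower bound,
    because the inequality holds pointwise for any positive q.
    """
    w = q / pi_ref
    if clip is not None:
        w = np.clip(w, 1.0 / clip, clip)
    q_eff = pi_ref * w  # effective (possibly unnormalized) auxiliary weights
    return float(np.sum(q_eff * reward * (1.0 + np.log(pi_theta / q_eff))))

# Toy 4-armed bandit with a sparse reward on arm 0 (made-up numbers).
pi_ref = np.array([0.25, 0.25, 0.25, 0.25])
pi_theta = np.array([0.7, 0.1, 0.1, 0.1])
reward = np.array([1.0, 0.0, 0.0, 0.0])
q_mix = 0.5 * (pi_ref + pi_theta)

J = rl_objective(pi_theta, reward)
b_sft = sft_lower_bound(pi_theta, pi_ref, reward)
b_iw = iw_sft_lower_bound(pi_theta, pi_ref, q_mix, reward)
b_iw_clipped = iw_sft_lower_bound(pi_theta, pi_ref, q_mix, reward, clip=1.5)
print(J, b_sft, b_iw, b_iw_clipped)
```

In this toy case every bound stays below J, and moving q toward the current policy tightens the bound. How far q can safely move away from the reference distribution in practice is the stability question the summary raises.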

"Splitting" vs. "Unifying" Post-Training: Is a Grand Unification of SFT & RL the Right Answer?

机器之心· 2025-09-14 01:30

Group 1
- The article discusses the limitations of the traditional "SFT followed by RL" paradigm in post-training for AI models and suggests a unified approach that combines both methods [7][9][10]
- It highlights the role of post-training in aligning a model's capabilities with human values and preferences, and it addresses the "catastrophic forgetting" and overfitting problems associated with SFT [8][11][12]
- The emerging industry trend is to explore a unified post-training framework that leverages the strengths of both SFT and RL rather than treating them as separate stages [10][15][17]

Group 2
- The article evaluates the competitive landscape of AI hardware among major players such as Meta, OpenAI, Apple, and Google, and asks whether AI hardware will become a new essential or merely a passing trend [2]
- It raises questions about the user experience of AI hardware, such as whether it will truly replace traditional devices or simply serve as an additional feature [2][3]
- It explores the potential for new AI hardware form factors to integrate seamlessly into daily life, along with the implications for user interaction and technology adoption [2][3]

Group 3
- The article examines the role of generative AI in search and debates whether it will replace traditional search engines or act as a growth engine that expands user queries and intents [3]
- It discusses how multimodal interaction and conversational AI are redefining how users complete tasks, potentially increasing the value of advertising and commercial opportunities [3]
- Google's strategy of gradually integrating AI capabilities into its products, rather than waiting for the technology to mature fully, reflects a proactive approach to product development and market positioning [3]

Unified post-training

SFT

Artificial Intelligence

AI hardware

X @Sam Altman

Sam Altman· 2025-08-11 23:05

Performance Improvement
- Performance at the IOI improved from the 49th to the 98th percentile in one year [1]
- The improvement was achieved without training any specialized models [1]
- The same Reinforcement Learning (RL) approach was used as for everything else [1]

Mathematical Principles of Diffusion/VAE/RL

自动驾驶之心· 2025-07-29 00:52

Core Viewpoint
- The article discusses the principles and applications of Diffusion Models and Variational Autoencoders (VAE) in machine learning, focusing on their mathematical foundations and training methods.

Group 1: Diffusion Models
- The network's training objective is to fit the mean and variance of two Gaussian distributions in the denoising process [7]
- The KL divergence term is what pulls the network's predicted values toward the theoretical values in the denoising process [9]
- Predicting the unknown \(x_0\) is recast as iteratively predicting the unknown noise \(\epsilon\) [15]

Group 2: Variational Autoencoders (VAE)
- VAE assumes that the latent distribution is Gaussian, which is essential to its generative capability [19]
- VAE training becomes a combination of a reconstruction loss and a KL divergence constraint, which prevents the latent space from degenerating into a sharp distribution [26]
- Minimizing the KL loss corresponds to maximizing the Evidence Lower Bound (ELBO) [27]

Group 3: Reinforcement Learning (RL)
- The Markov Decision Process (MDP) framework is used, with states and actions arranged sequentially [35]
- The semantic representation is meant to approach an impulse (Dirac delta) distribution, while the generated representation is expected to follow a Gaussian distribution [36]
- Policy gradient methods are used so that the network learns the optimal action for a given state [42]
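The VAE objective described above (reconstruction loss plus a KL constraint) can be sketched with closed-form Gaussian KL terms. This is a minimal illustration under stated assumptions, not the article's code: it assumes diagonal Gaussians, a squared-error reconstruction term (a Gaussian likelihood with fixed variance, up to constants), and a hypothetical weighting parameter `beta`. All function names are made up for the example.

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar, axis=-1)

def gaussian_kl(mu1, var1, mu2, var2):
    """KL( N(mu1, var1) || N(mu2, var2) ) for diagonal Gaussians.

    The same closed form compares two Gaussians in a diffusion denoising
    step, where both distributions are Gaussian.
    """
    return 0.5 * np.sum(
        np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0, axis=-1
    )

def vae_loss(x, x_recon, mu, logvar, beta=1.0):
    """Negative ELBO up to constants: reconstruction error + beta * KL."""
    recon = np.sum((x - x_recon) ** 2, axis=-1)
    return recon + beta * kl_to_standard_normal(mu, logvar)

# Made-up batch of two samples with a 3-dimensional latent space.
x = np.array([[0.0, 1.0], [1.0, 0.0]])
x_recon = np.array([[0.1, 0.9], [0.8, 0.2]])
mu = np.array([[0.5, -0.5, 0.0], [0.0, 0.0, 0.0]])
logvar = np.array([[0.0, 0.0, -1.0], [0.0, 0.0, 0.0]])
print(vae_loss(x, x_recon, mu, logvar))
```

The KL term is zero only when the encoder output matches the standard normal prior, so it penalizes a latent distribution that collapses toward a sharp peak (a very negative `logvar`).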

Diffusion Model

VAE

Gaussian distribution

Markov decision process

Interview with Zhang Xiangyu: Multimodal Reasoning and Autonomous Learning Are the Next Two "GPT-4" Moments

海外独角兽· 2025-06-08 04:51

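This episode is an interview by Shixiang CEO Li Guangmi with Zhang Xiangyu, chief scientist of the large-model company StepFun (阶跃星辰). Zhang Xiangyu focuses on the multimodal field. He proposed the DreamLLM multimodal large-model framework, one of the industry's earliest architectures to unify image-text generation and understanding; building on this framework, StepFun released Step-1V, China's first natively multimodal large model with hundreds of billions of parameters. His academic influence is also considerable: his papers have been cited more than 370,000 times in total. The industry has long anticipated a multimodal model that unifies understanding and generation, but no such model has appeared to date. How can the multimodal field reach its GPT-4 moment? In this conversation, Xiangyu draws on his own research and practice in the multimodal field and shares, from a purely technical perspective, his fresh thinking on the field's key problems. In his view, although language models have advanced extremely fast, the difficulty of multimodal generation and understanding has been underestimated:
• Over the next 2-3 years, the multimodal field will see two GPT-4 moments: multimodal reasoning and autonomous learning;
• The technical essence of the o1 paradigm lies in eliciting Meta CoT chains of thought: the model is allowed to backtrack, retry, and choose different branches at key points, which turns the reasoning process from a single line into a graph-like structure.
Contents
01 Research thread: returning to large models
• Unified multimodal generation and understanding is hard to achieve because language has weak control over vision, image-text alignment is imprecise, ...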

Multimodal reasoning

Autonomous learning

next token prediction

o1 paradigm

Chain of thought

The First Half of Agent Development: How Environment, Tools, and Context Determine an Agent | 42章经

42章经· 2025-04-27 14:10

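In the wave led by AutoGPT in April 2023, Agents were more like toys: the demos were flashy, but their practical value was very limited. After two years of development, the current wave of Agents can genuinely solve problems in real work and everyday scenarios and deliver value.
Qu Kai: Agents are unquestionably the hottest trend right now. On this topic I have a few core questions that I keep thinking about, and I believe many people share them. So today we invited Wenfeng, who has long researched and built Agents, to discuss these questions. First, how exactly should we define an Agent?
Wenfeng: I think the best definition is Anthropic's: an Agent is a program that lets a model use tools based on feedback from the environment.
Qu Kai: What do you make of the recent Agent boom?
Wenfeng: This wave of Agents is very different from the past. The leap happened for two reasons. First, the underlying models have become far more capable; in particular, once combined with RL, models exemplified by o1 have also given Agents the ability to think at length. Second, there have been major breakthroughs on the engineering and product sides of Agents, chiefly that people now know much better how to build a suitable Context for an Agent so that it solves problems more effectively.
Qu Kai: How should we understand this Context?
Wenfeng: ...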