Reinforcement Learning - filings, earnings calls, financial reports, news - Reportify

Reinforcement Learning

VLA: Some Call It the "Strongest Solution," Others Say It "Can't Run"

36Kr · 2025-09-11 08:17

Core Viewpoint - The intelligent driving industry is at a critical juncture with the emergence of VLA (Vision-Language-Action) technology, leading to a division among key players regarding its potential and implementation [1][2][3].

Group 1: VLA Technology and Its Implications
- VLA is seen as a potential solution to the limitations of end-to-end systems in intelligent driving, which can only address about 90% of the challenges [6][10].
- The introduction of language as a bridge in the VLA model aims to enhance the system's understanding and decision-making capabilities, allowing for more complex and nuanced driving actions [12][14][18].
- VLA is believed to improve three key areas: understanding dynamic traffic signals, enabling natural voice interactions, and enhancing risk prediction capabilities [19][20][21].

Group 2: Challenges and Criticisms of VLA
- Despite the potential advantages, VLA faces significant challenges, including the need for substantial financial investment and the technical difficulties of aligning multimodal data [31][32].
- Critics argue that VLA may not be necessary for achieving higher levels of autonomous driving, with some suggesting it is more of a supplementary enhancement rather than a fundamental solution [35][36].
- The current limitations of existing intelligent driving chips hinder the effective deployment of VLA models, raising concerns about their practical application in real-world scenarios [31][32].

Group 3: Industry Perspectives and Strategies
- Companies like Li Auto, Yuanrong, and Xiaopeng are betting on VLA, emphasizing high investment and computational intensity to pursue its development [41][42].
- In contrast, players like Huawei and Horizon are focusing on structural solutions and world models, arguing that these approaches may offer more reliable paths to achieving advanced autonomous driving [43][46].
- The ongoing debate over VLA reflects broader strategic choices within the industry, with companies prioritizing different technological pathways based on their resources and market positioning [47].

VLA (Vision-Language-Action large model)

Turing Award Winner Richard Sutton: Humanity Will Usher In the "Fourth Great Era of the Universe"

21 Shi Ji Jing Ji Bao Dao (21st Century Business Herald) · 2025-09-11 05:45

Core Insights
- Richard Sutton, the 2024 Turing Award winner, emphasizes the inevitability of AI replacing human roles in the development process of humanity [1][2]
- Sutton introduces four realistic "predictive principles" regarding the future of AI, highlighting the need for decentralized collaboration and the importance of experience in learning [2][3]

Group 1: AI and Learning
- Sutton argues that current machine learning primarily focuses on transferring existing human knowledge to static AI, which lacks autonomous learning capabilities [1][2]
- He identifies the need for a new data source generated through direct interaction between intelligent agents and the world, marking the transition into an "experience era" [1][2]
- The core of intelligence lies in the ability to predict and control input signals based on experience, which is essential for the development of AI [2]

Group 2: Future of AI
- Sutton's four predictive principles include the lack of consensus on how the world operates, the potential for humans to understand and create intelligence through technology, the likelihood of superintelligent AI surpassing human intelligence, and the concentration of power and resources among the most intelligent agents [2][3]
- He posits that humanity is currently in the "replicator era" and is on the verge of entering the "design era," where AI will play a crucial role [3][4]
- Sutton encourages embracing AI as a necessary step in the evolution of the universe, advocating for courage and a spirit of adventure in facing its challenges [4]

Kimi Makes Another Big Open-Source Move: Middleware That Updates a Trillion Parameters in 20 Seconds Is Here

量子位· 2025-09-11 05:19

Core Viewpoint - The article discusses the introduction of a middleware called "checkpoint-engine" that enables the Kimi K2 model, which has one trillion parameters, to update its model weights in approximately 20 seconds across thousands of GPUs, marking a significant advancement in the efficiency of large language model training and inference [6][7].

Group 1: Middleware Functionality
- The checkpoint-engine is designed to facilitate the updating of model weights during the inference process of large language models [6].
- It allows for both simultaneous broadcasting of updated weights to all nodes and point-to-point dynamic updates [2][24].
- The middleware supports a pipeline approach for parameter updates, minimizing memory usage by updating parameters one at a time [19][20].

Group 2: System Architecture
- Kimi K2 employs a hybrid co-location architecture where the training and inference engines are deployed on the same set of nodes [8].
- During each reinforcement learning iteration, a centralized controller generates new training data using the inference engine and then instructs the training engine to update parameters based on this data [9].
- The system is optimized for high throughput, with each engine deeply optimized for performance [10].

Group 3: Parameter Update Process
- The training engine's parameters are unloaded to DRAM, allowing for quick activation of the training engine with minimal data transfer [12].
- The checkpoint engine manages parameter states by first obtaining local parameter copies from the training engine and then broadcasting the complete parameter set to all checkpoint nodes [16][17].
- The inference engine retrieves only the necessary parameter slices from the checkpoint engine, streamlining the update process [18].

Group 4: Performance Optimization
- The design sacrifices some data transfer efficiency for a simpler system architecture, which reduces the complexity of maintenance and testing [25][26].
- During the startup of the training engine, nodes selectively read parameters from disk to minimize expensive disk I/O operations [28].
- The checkpoint engine can independently restart in case of failures, enhancing system resilience [33].
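
The pipelined update summarized above (stage one parameter, let each inference rank copy only its slice, release the buffer, move on) can be sketched in a few lines. This is a toy, single-process illustration and not the checkpoint-engine API: `shard_for`, `pipelined_update`, and the dict-based "ranks" are hypothetical names, and plain Python lists stand in for GPU tensors.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def shard_for(values: List[float], rank: int, world_size: int) -> List[float]:
    """Return the contiguous slice of one parameter owned by an inference rank."""
    per_rank = len(values) // world_size
    return values[rank * per_rank:(rank + 1) * per_rank]


def pipelined_update(new_params: Params, inference_shards: List[Params]) -> int:
    """Update every inference rank one parameter at a time.

    Returns the peak number of elements staged at once, which is bounded by
    the largest single parameter rather than by the size of the whole model.
    """
    world_size = len(inference_shards)
    peak_staged = 0
    for name, values in new_params.items():
        staged = list(values)  # stand-in for broadcasting the full parameter
        peak_staged = max(peak_staged, len(staged))
        for rank, shard in enumerate(inference_shards):
            # Each rank keeps only the slice it actually needs.
            shard[name] = shard_for(staged, rank, world_size)
        del staged  # release the staging buffer before the next parameter
    return peak_staged


if __name__ == "__main__":
    new_params = {"w1": [1.0, 2.0, 3.0, 4.0], "w2": [5.0, 6.0]}
    shards: List[Params] = [{}, {}]  # two hypothetical inference ranks
    peak = pipelined_update(new_params, shards)
    print(shards[0])  # {'w1': [1.0, 2.0], 'w2': [5.0]}
    print(shards[1])  # {'w1': [3.0, 4.0], 'w2': [6.0]}
    print(peak)       # 4
```

The point of the sketch is the memory bound: only one parameter is staged at any moment, which is the trade-off the summary describes as updating parameters one at a time.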

Large language models

Artificial Intelligence

checkpoint-engine

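The Era of Interaction Scaling Arrives: Shanghai Innovation Institute, Fudan, and ByteDance Release AgentGym-RL, Powered by Ascend, Pioneering a New Paradigm for Agent Training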

机器之心· 2025-09-11 04:53

Core Insights
- The article emphasizes the transition of artificial intelligence from a "data-intensive" to an "experience-intensive" era, where true intelligence is derived from active exploration and experience accumulation in real environments [10][11][50].
- The introduction of the AgentGym-RL framework represents a significant advancement in training autonomous LLM agents for multi-turn decision-making, addressing the limitations of existing models that rely on single-turn tasks and lack diverse interaction mechanisms [12][50].

Group 1: Framework and Methodology
- AgentGym-RL is the first end-to-end framework for LLM agents that does not require supervised fine-tuning, supports interactive multi-turn training, and has been validated in various real-world scenarios [3][15].
- The framework integrates multiple environments and rich trajectory data, simplifying complex environment configurations into modular operations, thus facilitating effective experience-driven learning [13][19].
- The ScalingInter-RL method introduces a progressive interaction round expansion strategy, allowing agents to gradually adapt to environments and optimize their interaction patterns, balancing exploration and exploitation [4][23][25].

Group 2: Performance and Results
- The research team achieved remarkable results with a 7B parameter model, which demonstrated complex task handling skills such as understanding task objectives and planning multi-step operations after extensive interaction training [5][29].
- In various testing environments, the model not only surpassed large open-source models over 100B in size but also matched the performance of top commercial models like OpenAI o3 and Google Gemini 2.5 Pro [5][29].
- The ScalingInter-RL model achieved an overall accuracy of 26.00% in web navigation tasks, significantly outperforming GPT-4o's 16.00% and matching the performance of DeepSeek-R1-0528 and Gemini-2.5-Pro [29][30].

Group 3: Future Directions
- Future research will focus on upgrading general capabilities to enable agents to make efficient decisions in new environments and with unknown tools [51].
- The team aims to expand into more complex scenarios that closely resemble the physical world, such as robotic operations and real-world planning [52].
- There is an intention to explore multi-agent collaboration training models to unlock more complex group decision-making capabilities [52].
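
The progressive expansion of interaction rounds attributed to ScalingInter-RL can be pictured as a schedule that caps the number of turns per episode and raises the cap as training proceeds. The sketch below is illustrative only: the function name `max_turns_at`, the stepwise shape of the schedule, and every numeric default are assumptions, not the settings used by the authors.

```python
def max_turns_at(step: int,
                 start_turns: int = 5,
                 increment: int = 5,
                 steps_per_stage: int = 100,
                 turn_cap: int = 20) -> int:
    """Maximum interaction turns allowed per episode at a given training step.

    Early stages use a short horizon so the agent first masters basic skills;
    later stages lengthen the horizon, leaving more room for exploration.
    All defaults are hypothetical values chosen for illustration.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    stage = step // steps_per_stage
    return min(start_turns + stage * increment, turn_cap)


if __name__ == "__main__":
    for step in (0, 99, 100, 250, 1000):
        print(step, max_turns_at(step))
```

A training loop would call `max_turns_at(step)` before each rollout and truncate the episode at that many turns; the cap keeps late-stage episodes from growing without bound.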

Agent training

ScalingInter-RL

Turing Award Winner Richard Sutton: AI Is Entering the "Experience Era," with Greater Potential Than Ever Before

Bei Ke Cai Jing· 2025-09-11 04:47

Core Insights - Richard Sutton, the 2024 Turing Award winner, emphasized that the human data dividend is nearing its limit, and artificial intelligence is entering an "experience era" centered on continuous learning, which has the potential to exceed previous capabilities [1][2]

Group 1: AI and Learning
- Sutton stated that most current machine learning aims to transfer existing human knowledge to static AI, which lacks autonomous learning capabilities. He believes we are reaching the limits of human data, and existing methods cannot generate new knowledge, making continuous learning essential for intelligence [2]
- He defined "experience" as the interaction of observation, action, and reward, which is crucial for an intelligent agent's ability to predict and control its input signals. Experience is the core of all intelligence [2]

Group 2: Collaboration and Future Predictions
- Addressing fears about AI causing bias, unemployment, or even human extinction, Sutton argued that such fears are exaggerated and often fueled by those who profit from them. He highlighted that economic systems function best when individuals have different goals and abilities, similar to how decentralized collaboration among intelligent agents can lead to win-win outcomes [3]
- Sutton proposed four predictive principles for the future of AI:
  1. There is no consensus on how the world should operate, and no single view can dominate [3]
  2. Humanity will truly understand intelligence and create it through technology [3]
  3. Current human intelligence will soon be surpassed by superintelligent AI or enhanced humans [3]
  4. Power and resources will flow to the most intelligent agents [3]

Group 3: Historical Context and Future Outlook
- Sutton categorized the history of the universe into four eras: the particle era, the star era, the replicator era, and the design era. He believes humanity's uniqueness lies in pushing design to its limits, which is the goal pursued through AI today [4]
- He described AI as the inevitable next step in the evolution of the universe, urging society to embrace it with courage, pride, and a spirit of adventure [4]

Group 4: Event Overview
- The 2025 Inclusion Bund Conference, themed "Reshaping Innovative Growth," took place in Shanghai from September 10 to 13, featuring a main forum, over 40 open insight forums, global theme days, innovation stages, a technology exhibition, and various networking opportunities [4]

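Turing Award Winner Richard Sutton's Speech at the 2025 Inclusion Bund Conference: Experience Is the Core and Foundation of All Intelligence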

Yang Guang Wang· 2025-09-11 04:06

Core Insights - The 2025 Inclusion Bund Conference opened in Shanghai, featuring a keynote speech by Richard Sutton, the 2024 Turing Award winner and a pioneer in reinforcement learning [1]

Group 1: Machine Learning and AI
- Sutton emphasized that current machine learning primarily focuses on transferring existing human knowledge to static, non-autonomous AI, reaching the limits of human data [2]
- He introduced the concept of the "experience era," advocating for new data sources generated through direct interaction between intelligent agents and the world [2]
- Sutton defined "experience" as the interplay of observation, action, and reward, asserting that knowledge is derived from experience, which is fundamental to intelligence [2]

Group 2: Future of AI
- Sutton proposed four predictive principles regarding the future of AI:
  1. There is no consensus on how the world operates, and no single perspective can dominate [3]
  2. Humanity will truly understand intelligence and create it through technology [3]
  3. Current human intelligence will soon be surpassed by superintelligent AI or enhanced humans [3]
  4. Power and resources will gravitate towards the most intelligent agents [3]
- He categorized the history of the universe into four eras: particle, star, replicator, and design, asserting that humanity's unique ability to push design to its limits is crucial in the pursuit of AI [3]

Group 3: Embracing AI
- Sutton stated that artificial intelligence is the inevitable next step in the evolution of the universe, and it should be embraced with courage, pride, and a spirit of adventure [4]
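
Sutton's definition of experience as a stream of observation, action, and reward can be made concrete with a minimal agent-environment loop. The toy environment, the function names, the constant step size, and the optimistic initial values are all assumptions chosen for illustration; none of them come from the talk.

```python
from typing import List, Tuple


def observe(step: int) -> int:
    """Toy world: the observation simply alternates 0, 1, 0, 1, ..."""
    return step % 2


def reward_for(observation: int, action: int) -> float:
    """The world pays 1.0 when the action matches the observation."""
    return 1.0 if action == observation else 0.0


def run(num_steps: int = 20, step_size: float = 0.5) -> Tuple[List[List[float]], float]:
    """Learn reward predictions purely from observation/action/reward signals.

    q[observation][action] is the agent's predicted reward. Values start
    optimistically at 1.0, so a purely greedy agent abandons an action only
    after experience shows that it pays less than expected.
    """
    q = [[1.0, 1.0], [1.0, 1.0]]
    total_reward = 0.0
    for step in range(num_steps):
        obs = observe(step)
        # Greedy choice; max() returns the first action on ties.
        action = max((0, 1), key=lambda a: q[obs][a])
        reward = reward_for(obs, action)
        # Move the prediction toward the reward that was actually received.
        q[obs][action] += step_size * (reward - q[obs][action])
        total_reward += reward
    return q, total_reward


if __name__ == "__main__":
    q, total = run()
    print(q)      # [[1.0, 1.0], [0.5, 1.0]]
    print(total)  # 19.0
```

The agent is never told the rule linking observations to rewards. It makes one mistake on observation 1, lowers that prediction, and then controls its reward signal by choosing the matching action, which is the predict-and-control view of intelligence described in the summary.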

AI Strides into the "Experience Era"

Hua Er Jie Jian Wen· 2025-09-11 03:50

Group 1
- The AI industry is transitioning into an "experience era," where continuous learning is essential for intelligence, moving beyond the limitations of human data [2]
- Richard Sutton emphasizes that knowledge is derived from experience, which involves observation, action, and reward, and that the intelligence of an agent depends on its ability to predict and control input signals [2]
- Two technologies, continual learning and meta-learning, are necessary to unlock the full potential of AI in this new experience era [2]

Group 2
- Concerns about AI leading to bias, unemployment, or even human extinction are exaggerated and fueled by certain organizations and individuals profiting from such fears [3]
- Sutton argues that decentralized collaboration among agents with different goals can lead to mutual benefits, highlighting human cooperation as a unique strength [3]
- He presents four predictive principles regarding the future of AI, including the lack of consensus on how the world should operate and the potential for superintelligent AI to surpass human intelligence [3]

Group 3
- Sutton categorizes the history of the universe into four eras: particle, star, replicator, and design, asserting that humanity's unique ability to push design to its limits is crucial in the current pursuit of AI [4]
- He believes that AI is an inevitable next step in the evolution of the universe, advocating for a courageous and adventurous approach to its development [5]

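"Father of Reinforcement Learning" Richard Sutton: The Human Data Dividend Is Nearing Its Limit, and AI Is Entering an "Experience Era" Centered on Continual Learning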

Zheng Quan Shi Bao· 2025-09-11 03:50

Core Insights - Richard Sutton, the 2024 Turing Award winner, emphasizes that the human data dividend is nearing its limit, and artificial intelligence is entering an "experience era" centered on continuous learning, which has the potential to exceed previous capabilities [1][2]

Group 1: Experience Era
- Sutton defines "experience" as the signals of observation, action, and reward that are exchanged between agents and the world, asserting that knowledge derives from experience and that the intelligence of an agent depends on its ability to predict and control its input signals [2]
- The transition to the experience era is driven by reinforcement learning, but fully unlocking its potential requires two currently immature technologies: continual learning and meta-learning [2]

Group 2: Collaboration and AI
- Addressing concerns about AI leading to bias, unemployment, or even human extinction, Sutton argues that fears surrounding artificial intelligence are exaggerated, and that decentralized collaboration among different agents can lead to mutually beneficial outcomes [2]
- He highlights that humanity's greatest strength lies in collaboration, which has been the foundation of economic, market, and governmental successes [2]

Group 3: Future of AI
- Sutton posits that the replacement of human roles by AI is inevitable, with humans acting as catalysts and pioneers for the "design era," which he categorizes as the fourth era in the evolution of the universe, following the particle, star, and replicator eras [2][3]
- He encourages embracing the evolution of artificial intelligence with courage, pride, and a spirit of adventure [3]

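"Father of Reinforcement Learning" Richard Sutton: The Human Data Dividend Is Nearing Its Limit, and AI Is Entering an "Experience Era" Centered on Continual Learning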

Zheng Quan Shi Bao Wang· 2025-09-11 03:26

Core Insights - Richard Sutton, the 2024 Turing Award winner, emphasizes that the human data dividend is nearing its limits, and artificial intelligence is entering an "experience era" centered on continuous learning, which has the potential to exceed previous capabilities [1][2]

Group 1: Experience Era
- Sutton defines "experience" as the interaction of observation, action, and reward, which are signals exchanged between agents and the world [2]
- The current machine learning methods are reaching their limits in generating new knowledge, making them unsuitable for continuous learning, which is crucial for intelligence [1][2]

Group 2: Technological Advancements
- To fully unlock the potential of AI in the experience era, two currently immature technologies are needed: continual learning and meta-learning [2]
- Sutton believes that the collaboration between decentralized agents can lead to win-win outcomes, countering fears about AI causing bias, unemployment, or even human extinction [2]

Group 3: Human-AI Collaboration
- Sutton argues that human collaboration is the greatest success, and AI's role will be to enhance this collaboration, which is fundamental to economic, market, and governmental successes [2]
- He posits that AI's replacement of human roles is inevitable, with humans acting as catalysts in ushering in a new "design era" in the evolution of the universe [2]

Group 4: Future Perspective
- Sutton views artificial intelligence as a necessary next step in the evolution of the universe, advocating for a courageous and adventurous approach to its development [3]

Latest from Westlake University! ARFM: Combining the Strengths of VLA Imitation Learning and Reinforcement Learning

具身智能之心· 2025-09-11 02:07

Core Viewpoint - The article discusses the limitations of current vision-language-action (VLA) models in complex tasks and introduces the Adaptive Reinforcement Flow Matching (ARFM) method to enhance their performance by integrating reinforcement learning (RL) capabilities with flow matching advantages [1][2][4].

Summary by Sections

Current Status of VLA Models - VLA models based on flow matching have shown excellent performance in general robotic manipulation tasks, validated by large-scale pre-trained systems like RT-1 and PaLM-E, but they struggle with action precision in complex downstream tasks due to reliance on imitation learning [4][5].

Existing Solutions and Limitations - Previous attempts to fine-tune VLA models using offline RL methods, such as ReinboT, have been limited in effectiveness due to the indirect guidance of action prediction, highlighting the need for more effective offline RL fine-tuning methods [4][5].

Main Contributions - The ARFM method is introduced as a novel offline RL post-training approach specifically designed for VLA flow models, addressing the challenges of data quality extraction and improving the efficiency of offline RL fine-tuning [6][7].

Methodological Innovation - ARFM incorporates an adaptive scaling factor in the loss function to balance the advantages of RL while controlling gradient variance, leading to improved generalization, robustness against disturbances, and few-shot learning capabilities [6][8].

Experimental Validation - Extensive experiments on the LIBERO simulation benchmark and the UR5 robotic arm platform demonstrate that ARFM outperforms existing methods in various aspects, including generalization ability, robustness to dynamic disturbances, and efficiency in few-shot learning [6][8][29].

Core Algorithm Design - The ARFM framework is built around energy-weighted loss to integrate RL signals and an adaptive mechanism to ensure training stability, effectively overcoming the limitations of traditional imitation learning and existing offline RL fine-tuning methods [8][11].

Experimental Setup - The experiments utilized the LIBERO benchmark platform, which includes four core task suites, and real-world scenarios with the UR5 robotic arm, focusing on various manipulation tasks under different conditions [29][30].

Key Experimental Results - ARFM demonstrated superior performance in multi-task learning, action perturbation robustness, few-shot learning efficiency, and continual learning capabilities compared to baseline models, confirming its practical value in real-world robotic applications [32][35][38].

Conclusion - The ARFM method effectively balances the retention of RL advantage signals and the control of flow loss gradient variance, leading to enhanced performance in VLA flow models across various tasks and conditions, showcasing its applicability in real-world scenarios [49][47].
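
The trade-off described above, keeping the RL advantage signal while limiting the variance of the weighted loss, can be sketched with a small example. This is not the ARFM formula: the exponential weighting `exp(alpha * advantage)`, the effective-sample-size rule used to choose `alpha`, and every name and number below are assumptions made purely for illustration.

```python
import math
from typing import List, Sequence


def energy_weights(advantages: Sequence[float], alpha: float) -> List[float]:
    """Normalized weights proportional to exp(alpha * advantage)."""
    top = max(advantages)  # subtract the max for numerical stability
    raw = [math.exp(alpha * (a - top)) for a in advantages]
    total = sum(raw)
    return [r / total for r in raw]


def effective_sample_size(weights: Sequence[float]) -> float:
    """1 / sum(w^2): n for uniform weights, about 1 if one sample dominates."""
    return 1.0 / sum(w * w for w in weights)


def adapt_alpha(advantages: Sequence[float],
                candidates: Sequence[float] = (4.0, 2.0, 1.0, 0.5, 0.0),
                min_ess_fraction: float = 0.5) -> float:
    """Pick the largest candidate alpha that keeps ESS >= fraction * n.

    A larger alpha preserves more of the advantage signal; the ESS floor is a
    hypothetical stand-in for controlling the variance of the loss gradient.
    """
    n = len(advantages)
    for alpha in sorted(candidates, reverse=True):
        weights = energy_weights(advantages, alpha)
        if effective_sample_size(weights) >= min_ess_fraction * n:
            return alpha
    return 0.0  # fall back to uniform weighting (plain imitation loss)


def weighted_flow_loss(per_sample_losses: Sequence[float],
                       advantages: Sequence[float],
                       alpha: float) -> float:
    """Weighted average of per-sample flow-matching losses."""
    weights = energy_weights(advantages, alpha)
    return sum(w * loss for w, loss in zip(weights, per_sample_losses))


if __name__ == "__main__":
    advantages = [0.0, 0.0, 0.0, 1.0]
    losses = [1.0, 2.0, 3.0, 4.0]  # placeholder per-sample losses
    alpha = adapt_alpha(advantages)
    print(alpha, weighted_flow_loss(losses, advantages, alpha))
```

With `alpha = 0` the loss reduces to the ordinary average, which corresponds to pure imitation learning. Raising `alpha` shifts weight toward high-advantage samples until the effective sample size becomes too small, which is one simple way to picture why a scaling factor would need to adapt to the batch.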

Adaptive Reinforcement Flow Matching (ARFM) method