Reinforcement Learning
How Does a Badminton Robot "See Clearly" and "Move Precisely"? (Innovation Roundup)
Ren Min Ri Bao· 2025-06-19 21:51
Group 1
- A new bipedal robot developed by the Swiss Federal Institute of Technology in Zurich can predict the trajectory of a badminton shuttlecock and adjust its position to return the shot to a human opponent, showcasing advanced perception and coordination capabilities [2][3]
- The robot's ability to track the shuttlecock relies on a perception noise model that quantifies how its own movements degrade target tracking, allowing it to adapt to motion blur and occlusions (see the sketch below) [3][4]
- The robot can sustain rallies of 10 consecutive hits with a success rate of nearly 100% for shots landing in the center of the court, coordinating its 18 joints through a unified control framework [3][4]

Group 2
- The robot's average reaction time, from detecting an opponent's hit to executing a swing, is approximately 0.35 seconds, leaving room for improvement in its perception and response capabilities [4]
- Future enhancements will integrate more sensors and optimize the visual algorithms, aiming to extend the robot's applications beyond sports to complex scenarios requiring rapid response and coordination [4]
- Bipedal robots are expected to gain traction in fields such as industrial applications, entertainment, home life, and elder care, as advances in AI and robotics lower production costs and expand functionality [5]
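The article does not detail the perception noise model, so the following is a minimal sketch of the general idea under stated assumptions: a tracker inflates its measurement noise as a function of the robot's own joint velocities, so fast ego-motion (and the resulting motion blur) makes the filter trust its prediction more than the camera measurement. The scaling constant, the scalar Kalman update, and all function names are illustrative, not the paper's actual formulation.

```python
import numpy as np

def measurement_noise(base_std, joint_velocities, k=0.05):
    """Illustrative perception-noise model: faster ego-motion implies blurrier
    images, hence a larger noise estimate on the measured shuttlecock position."""
    return base_std * (1.0 + k * np.linalg.norm(joint_velocities))

def kalman_update(x_pred, p_pred, z, r_std):
    """Scalar Kalman measurement update for one tracked coordinate."""
    r = r_std ** 2
    gain = p_pred / (p_pred + r)          # larger noise -> smaller gain
    x_new = x_pred + gain * (z - x_pred)  # blend prediction with measurement
    p_new = (1.0 - gain) * p_pred
    return x_new, p_new

# 18 joint velocities, matching the joint count mentioned above:
r_std = measurement_noise(0.02, joint_velocities=np.full(18, 2.0))
x, p = kalman_update(x_pred=1.50, p_pred=0.01, z=1.62, r_std=r_std)
print(f"fused position: {x:.3f} m, variance: {p:.5f}")
```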
A Large Model for Recommendation Has Arrived? OneRec Paper Explained: How End-to-End Training Wins on Both Quality and Cost
机器之心· 2025-06-19 09:30
Core Viewpoint
- The article discusses the transformation of recommendation systems through the integration of large language models (LLMs), highlighting Kuaishou's "OneRec" system, which aims to improve both the effectiveness and the efficiency of recommendation [2][35]

Group 1: Challenges in Traditional Recommendation Systems
- Traditional recommendation systems face significant challenges: low computational efficiency, conflicting optimization objectives, and an inability to leverage the latest AI advances [5]
- For instance, Kuaishou's SIM model reaches a Model FLOPs Utilization (MFU) of only 4.6%/11.2%, far below the 40%-50% achieved by LLMs [5][28]

Group 2: Introduction of OneRec
- OneRec is an end-to-end generative recommendation system that uses an Encoder-Decoder architecture to model user behavior and improve recommendation accuracy (a minimal sketch of this idea follows the summary) [6][11]
- The system demonstrates a tenfold increase in effective computational capacity and raises MFU to 23.7%/28.8%, cutting operating costs to just 10.6% of traditional methods [8][31]

Group 3: Performance Improvements
- OneRec delivers substantial gains in user engagement, with a 0.54%/1.24% increase in app usage duration and 0.05%/0.08% growth in the 7-day user lifetime (LT7) [33]
- In local life service scenarios, OneRec drove a 21.01% increase in GMV and an 18.58% rise in the number of purchasing users [34]

Group 4: Technical Innovations
- The system employs multi-modal fusion, integrating data such as video titles, tags, and user behavior to improve recommendation quality [14]
- OneRec's architecture enables significant computational optimizations, including a 92% reduction in the number of key operators, improving overall efficiency [27][28]

Group 5: Future Directions
- Kuaishou's technical team identifies areas for further improvement: stronger inference capabilities, a more integrated multi-modal architecture, and a reward system that better aligns with user preferences [38]
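The paper's architecture is only summarized above, so the sketch below shows the generic shape of a generative recommender rather than OneRec itself: an encoder consumes the user's behavior sequence, and a decoder autoregressively generates discrete tokens (e.g., semantic IDs) identifying the recommended item. All dimensions, the vocabulary size, and the class name are assumptions for illustration.

```python
# Generic encoder-decoder generative recommender (illustrative; not OneRec's
# actual architecture): encode a user's behavior sequence, decode item tokens.
import torch
import torch.nn as nn

class GenerativeRecommender(nn.Module):
    def __init__(self, vocab_size=8192, d_model=256, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True,
        )
        self.head = nn.Linear(d_model, vocab_size)  # next-token logits

    def forward(self, history_tokens, item_tokens):
        # history_tokens: (batch, hist_len) -- tokenized user behavior sequence
        # item_tokens:    (batch, tgt_len)  -- semantic-ID tokens of target items
        src = self.embed(history_tokens)
        tgt = self.embed(item_tokens)
        mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        out = self.transformer(src, tgt, tgt_mask=mask)
        return self.head(out)  # train with cross-entropy on shifted targets

model = GenerativeRecommender()
logits = model(torch.randint(0, 8192, (2, 20)), torch.randint(0, 8192, (2, 4)))
print(logits.shape)  # torch.Size([2, 4, 8192])
```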
From OpenAI Back to Tsinghua, Wu Yi Reveals His Path in Reinforcement Learning: "Picked at Random" and Joking About "My Past Self Who Didn't Understand Equity" | AGI Tech 50
AI科技大本营· 2025-06-19 01:41
Core Viewpoint
- The article traces the journey of Wu Yi, a prominent figure in AI, emphasizing his contributions to reinforcement learning and the development of open-source systems like AReaL, which aims to strengthen the reasoning capabilities of AI models [1][6][19]

Group 1: Wu Yi's Background and Career
- Wu Yi, born in 1992, excelled in computer science competitions and was mentored by renowned professors at Tsinghua University and UC Berkeley, leading to significant internships at Microsoft and Facebook [2][4]
- After completing his PhD at UC Berkeley, Wu joined OpenAI, contributing to notable projects including the "multi-agent hide-and-seek" experiment, which showed complex behaviors emerging from simple rules [4][5]
- In 2020, Wu returned to China to teach at Tsinghua University, focusing on bringing cutting-edge technology into education and research while exploring industrial applications [5][6]

Group 2: AReaL and Reinforcement Learning
- AReaL, developed in collaboration with Ant Group, is an open-source reinforcement learning framework designed to enhance reasoning models, providing efficient and reusable training solutions [6][19]
- The framework addresses the need for models to "think" before generating answers, a concept that has gained traction in recent AI developments [19][20]
- AReaL differs from traditional RLHF (Reinforcement Learning from Human Feedback) by aiming to improve a model's intelligence rather than merely align it with human expectations [21][22]

Group 3: Challenges in AI Development
- Wu Yi discusses the difficulties of entrepreneurship in the AI sector, emphasizing the critical nature of timing and the risks of missing key windows of opportunity [12][13]
- The growth in model sizes presents new challenges for reinforcement learning: modern models can have billions of parameters, requiring adaptations across training and inference [23][24]
- Data quality and system efficiency are more critical than algorithmic advances when training reinforcement learning models [30][32]

Group 4: Future Directions in AI
- Wu Yi is optimistic about breakthroughs in areas such as memory expression and personalization, which remain underexplored [40][41]
- Multi-agent systems are valuable but may not be essential for all tasks, since advances in single models could make multi-agent approaches unnecessary [42][43]
- The ongoing pursuit of scaling laws indicates that improving model performance will remain a focal point for researchers and developers [26][41]
[GF Securities Financial Engineering] Reinforcement Learning and Price Timing
广发金融工程研究· 2025-06-18 01:33
Core Viewpoint
- The article discusses the potential of Reinforcement Learning (RL) in quantitative investment, particularly for building timing strategies that maximize cumulative returns through trial-and-error learning [1][2]

Summary by Sections

1. Introduction to Reinforcement Learning
- Reinforcement Learning (RL) is a machine learning method in which a decision-making system learns the optimal action for each situation so as to maximize cumulative reward. It is particularly suited to environments with clear goals but no direct guidance on how to achieve them [6][12]

2. Timing Strategy
- The report centers on a Double Deep Q-Network (DDQN) model that takes 10-minute-frequency price and volume data as input. The model learns to emit buy/sell/hold signals at each time step so as to maximize end-of-period returns; in backtesting, a timing signal is produced every 10 minutes under a t+1 trading rule (a minimal DDQN sketch follows this summary) [2][3]

3. Empirical Analysis
- The strategy was tested on several liquid ETFs and stocks from January 1, 2023 to May 31, 2025. It generated 72, 30, 73, and 188 timing signals across the four assets, with average win rates of 52.8%, 53.3%, 54.8%, and 51.6%, respectively; cumulative returns outperformed the benchmark assets by 10.9%, 35.5%, 64.9%, and 37.8% [3][74][80]

4. Summary and Outlook
- Despite RL's impressive performance in other fields, challenges such as training stability remain in quantitative investment. Future reports will explore more RL algorithms in search of superior strategies [5]

5. Data Description
- The timing strategy was applied to the CSI 300 Index, CSI 500 Index, CSI 1000 Index, and one individual stock, via the liquid ETFs tracking those indices. Training data spanned January 1, 2014 to December 31, 2019, with separate validation and testing periods [74][75]

6. Performance Metrics
- Performance was measured by total return, annualized return, maximum drawdown, annualized volatility, Sharpe ratio, information ratio, and return-to-drawdown ratio, demonstrating the strategy's effectiveness relative to the benchmark assets [77][80]
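For readers unfamiliar with DDQN, here is a minimal sketch of the double-Q target computation the strategy relies on: the online network selects the next action and a periodically synced target network evaluates it, which reduces the overestimation bias of vanilla DQN. The three-action space (buy/sell/hold) mirrors the report's setup; the network sizes, feature count, and hyperparameters are assumptions, not the report's actual configuration.

```python
# Minimal DDQN target computation (illustrative, not the report's actual model).
# Actions 0/1/2 stand for buy/sell/hold on 10-minute bars.
import torch
import torch.nn as nn

n_features, n_actions, gamma = 16, 3, 0.99

q_online = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, n_actions))
q_target = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, n_actions))
q_target.load_state_dict(q_online.state_dict())  # periodically synced copy

def ddqn_loss(state, action, reward, next_state, done):
    q_sa = q_online(state).gather(1, action.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        best_next = q_online(next_state).argmax(dim=1, keepdim=True)   # online net selects
        q_next = q_target(next_state).gather(1, best_next).squeeze(1)  # target net evaluates
        target = reward + gamma * (1.0 - done) * q_next
    return nn.functional.mse_loss(q_sa, target)

# Toy batch of transitions:
batch = 32
loss = ddqn_loss(torch.randn(batch, n_features),
                 torch.randint(0, n_actions, (batch,)),
                 torch.randn(batch),
                 torch.randn(batch, n_features),
                 torch.zeros(batch))
loss.backward()
```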
MiniMax Open-Sources Its First Reasoning Model: 456B Parameters, Performance Surpassing DeepSeek-R1, Technical Report Released
36Ke· 2025-06-17 08:15
Core Insights
- MiniMax has launched the world's first open-source large-scale hybrid-architecture reasoning model, MiniMax-M1, accompanied by a five-day continuous release plan [2]

Model Specifications
- The M1 model has 456 billion parameters, activating 45.9 billion per token; it supports context inputs of 1 million tokens and the industry's longest reasoning output of 80,000 tokens, 8 times that of DeepSeek-R1 [4]
- Two versions of MiniMax-M1 were trained, with thinking budgets of 40k and 80k tokens [4]

Training and Cost
- Training used 512 H800 GPUs for three weeks, costing approximately $537,400 (around 3.859 million RMB), an order of magnitude below initial cost expectations [7]
- The M1 model is available for unlimited free use on the MiniMax app and web [7]

API Pricing Structure
- The API pricing for M1 is tiered by input length (a small cost calculator follows this summary):
  - 0-32k input: 0.8 RMB/million tokens input, 8 RMB/million tokens output
  - 32k-128k input: 1.2 RMB/million tokens input, 16 RMB/million tokens output
  - 128k-1M input: 2.4 RMB/million tokens input, 24 RMB/million tokens output [7][11]
- Compared to DeepSeek-R1, M1's first-tier input price is 80% and its output price 50% of DeepSeek-R1's, while its second-tier input price is 1.2 times higher [9]

Performance Evaluation
- MiniMax-M1 outperforms models such as DeepSeek-R1 and Qwen3-235B on complex software engineering, tool use, and long-context tasks [13][14]
- In the MRCR test, M1 performs slightly below Gemini 2.5 Pro but above the other models [13]
- On the SWE-bench Verified test set, M1-40k and M1-80k score slightly below DeepSeek-R1-0528 but above other open-source models [14]

Technical Innovations
- M1 combines a mixture-of-experts (MoE) architecture with a lightning attention mechanism, scaling efficiently to long inputs and complex tasks [16]
- The model was trained with large-scale reinforcement learning (RL) using a new CISPO algorithm, which improves performance by optimizing over clipped importance sampling weights [16][17]

Future Directions
- MiniMax emphasizes the need for "Language-Rich Mediator" agents to handle complex scenarios requiring dynamic resource allocation and multi-round reasoning [19]
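To make the tiered pricing concrete, here is a small calculator using exactly the prices quoted above. Whether the boundary values (32k, 128k) fall into the lower or upper tier is an assumption, since the article does not specify it.

```python
# Tiered M1 API cost calculator using the prices quoted above.
# Assumption: the tier is chosen by total input length, upper bounds inclusive.
TIERS = [  # (max input tokens, RMB per 1M input tokens, RMB per 1M output tokens)
    (32_000, 0.8, 8.0),
    (128_000, 1.2, 16.0),
    (1_000_000, 2.4, 24.0),
]

def m1_cost_rmb(input_tokens: int, output_tokens: int) -> float:
    for max_input, in_price, out_price in TIERS:
        if input_tokens <= max_input:
            return (input_tokens * in_price + output_tokens * out_price) / 1_000_000
    raise ValueError("input exceeds the 1M-token context limit")

# e.g. a 100k-token prompt with a 20k-token answer lands in the second tier:
print(f"{m1_cost_rmb(100_000, 20_000):.3f} RMB")  # 0.440 RMB
```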
New Models Open-Sourced on the Same Day, One for Reasoning, One for Coding: MiniMax and Moonshot AI Ramp Up the Competition
机器之心· 2025-06-17 03:22
Core Insights
- The article discusses the same-day launch of new AI models by domestic large-model makers, highlighting MiniMax-M1 and Kimi-Dev-72B as significant advances among open-source models [1][9]

Group 1: MiniMax-M1
- MiniMax-M1 is a long-context reasoning LLM that handles inputs of 1 million tokens and outputs of 80,000 tokens, making it one of the most capable models by context length [2][19]
- The model demonstrates strong interactive-application skills, such as building web applications and visualizing algorithms, with a focus on user-friendly UI components [5][8]
- MiniMax-M1 was trained with a novel reinforcement learning algorithm, CISPO, which optimizes clipped importance sampling weights rather than token updates and converges faster than previous methods (a hedged sketch of this idea follows the list) [20][23]
- Across benchmarks the model surpasses other open-weight models, particularly on software engineering and long-context tasks, scoring 56.0% on the SWE-bench Verified benchmark [29][25]

Group 2: Kimi-Dev-72B
- Kimi-Dev-72B is a powerful open-source programming model that set a new state-of-the-art (SOTA) score of 60.4% on SWE-bench Verified, showcasing its code-generation capabilities [10][37]
- The model pairs BugFixer and TestWriter roles in a collaborative mechanism, improving its ability to fix bugs and write tests effectively [40][45]
- Kimi-Dev-72B underwent extensive mid-training on high-quality real-world data, which significantly improved its practical error correction and unit testing [41][42]
- Its reinforcement learning stage uses an outcome-based reward mechanism: only code fixes that actually succeed are rewarded, aligning training with real-world development standards [43][44]
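The summary above gives only the high-level idea of CISPO, so the following is a hedged sketch of that idea rather than MiniMax's actual implementation: the importance-sampling ratio is clipped and detached, so every token keeps contributing a REINFORCE-style gradient instead of having its update zeroed out the way a PPO-style clipped objective would. The clip bounds, tensor shapes, and function name are illustrative assumptions.

```python
import torch

def cispo_style_loss(logp_new, logp_old, advantages, eps_low=0.2, eps_high=0.2):
    """Sketch of a clipped-importance-sampling policy loss.

    Unlike PPO's clipped objective, which drops the gradient of tokens whose
    ratio is clipped, here the IS weight itself is clipped and detached, so all
    tokens still contribute a REINFORCE-style gradient through logp_new.
    """
    ratio = torch.exp(logp_new - logp_old)                     # pi_new / pi_old per token
    clipped = torch.clamp(ratio, 1 - eps_low, 1 + eps_high).detach()
    return -(clipped * advantages * logp_new).mean()

# Toy usage with per-token log-probs from the old and new policies:
logp_old = torch.randn(4, 8)
logp_new = torch.randn(4, 8, requires_grad=True)
loss = cispo_style_loss(logp_new, logp_old, advantages=torch.randn(4, 8))
loss.backward()
```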
Performance on Par with DeepSeek-R1: MiniMax Trains a Reasoning Model for Just 3.8 Million RMB, the New Price-Performance King | Open Source
量子位· 2025-06-17 01:03
Mengchen, reporting from Aofei Temple
QbitAI | WeChat official account QbitAI

Another heavyweight has joined the ranks of domestic reasoning models: MiniMax has open-sourced MiniMax-M1, quickly sparking discussion.

How strong is this model? Straight to the numbers:

The MiniMax team reveals that the reinforcement learning training stage took just 3 weeks on 512 H800 GPUs, with compute rental costs of only $534,700 (about 3.839 million RMB).

Beyond that, MiniMax-M1 matches or surpasses several open-source models such as DeepSeek-R1 and Qwen3 across multiple benchmarks, and on complex tasks such as tool use and parts of software engineering it even surpasses OpenAI o3 and Claude 4 Opus.

How does MiniMax-M1 perform in practice? The official demo generates a small maze game from a one-sentence prompt:

"Create a maze generator and pathfinding visualization tool. Randomly generate a maze and visualize the A* algorithm solving it step by step. Use canvas and animation to make it visually appealing."

The model weights can already be downloaded from HuggingFace, and the technical report has been published alongside them.

M1 natively supports an input length of 1 million tokens, about 8 times that of DeepSeek R1. It also supports 80,000 output tokens, exceeding Gemini 2.5 Pro's 64,000 to become the longest output in the world.

When generating 100,000 tokens, inference compute needs only DeepSe ...
AI Will Be Trapped by Human Data
36Ke· 2025-06-16 12:34
Core Insights
- The article discusses the transition from the "human data era" to the "experience era" in artificial intelligence, emphasizing the need for AI to learn from first-hand experience rather than relying solely on human-generated data [2][5][10]
- Richard S. Sutton highlights the limitations of current AI models, which are built on second-hand experience, and advocates a new approach in which AI interacts with its environment to generate original data [6][7][11]

Group 1: Transition to Experience Era
- Current large language models are reaching the limits of human data, necessitating a shift to real-time interaction with environments to generate scalable original data [7][10]
- Sutton draws parallels between AI learning and human learning, suggesting that AI should learn through sensory experience the way infants and athletes do [6][8]
- The experience era will require AI to develop world models and memory systems that can be reused over time, improving sample efficiency through highly parallel interaction [3][6]

Group 2: Decentralized Cooperation vs. Centralized Control
- Sutton argues that decentralized cooperation is superior to centralized control, warning that imposing single goals on AI can stifle innovation [3][12]
- Diversity of goals among AI agents matters: a multi-objective ecosystem fosters innovation and resilience [3][12][13]
- Sutton posits that both human and AI prosperity rely on decentralized cooperation, which allows individual goals to coexist and promotes mutually beneficial interaction [12][14][16]

Group 3: Future of AI Development
- Developing fully intelligent agents will require advances in deep learning algorithms that enable continuous learning from experience [11][12]
- Sutton is optimistic about the future of AI, viewing the creation of superintelligent agents as a positive development for society, despite the long-term nature of the endeavor [10][11]
- The article concludes with a call for humans to leverage their own experience and observations to foster trust and cooperation in the development of AI [17]
DataCanvas Launches Intelligent Computing Cloud 2.0, Empowering Thousands of Industries
Jing Ji Wang· 2025-06-16 09:35
On June 16, DataCanvas (九章云极) officially released its next-generation full-stack intelligent computing cloud platform, Alaya NeW Cloud 2.0, and simultaneously launched the world's first reinforcement learning intelligent-computing service. Built on a deep fusion of a Serverless architecture with reinforcement learning technology, the platform breaks through the performance bottleneck of generating million-token-scale output within seconds, and aims to provide infrastructure-grade intelligent computing services to AI innovators and research institutions worldwide.

Alaya NeW Cloud 2.0 focuses on compute-intensive applications, offering a tightly integrated AI infrastructure (AI Infra) together with a low-barrier toolchain (Tools). Measured results show that the platform achieves unified scheduling of heterogeneous compute from ten-thousand-GPU up to hundred-thousand-GPU scale; inference optimization for MoE model architectures improves efficiency severalfold; users can orchestrate distributed workloads with a single line of code; and its original pricing model of precise metering and billing by actual resource consumption significantly lowers both cost and the barrier to entry (a hypothetical metering sketch follows).

Fang Lei, chairman of DataCanvas, said: "The structural shift from the mobile internet's 'bandwidth-style applications' to the AI era's 'compute-intensive applications' urgently calls for a new cloud architecture. Through the paradigm restructuring of 'tightly integrated high-density AI Infra + a low-barrier toolchain (Tools)', Alaya NeW Cloud 2.0 ...
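The article gives no implementation details for the consumption-based billing model, so the following is a purely hypothetical illustration of the idea: charges accrue from resources actually consumed (GPU-seconds, tokens generated) rather than from reserved capacity. All rates, field names, and the function name are invented for illustration.

```python
# Hypothetical consumption-based metering sketch (not DataCanvas's actual
# billing system): cost accrues from resources actually consumed rather than
# from reserved capacity. All rates and names are invented.
from dataclasses import dataclass

@dataclass
class UsageRecord:
    gpu_seconds: float      # GPU time actually consumed by the workload
    tokens_generated: int   # tokens produced during inference

def metered_cost(usage: UsageRecord,
                 rate_gpu_second: float = 0.01,    # currency units per GPU-second
                 rate_per_1k_tokens: float = 0.002) -> float:
    return (usage.gpu_seconds * rate_gpu_second
            + usage.tokens_generated / 1000 * rate_per_1k_tokens)

print(metered_cost(UsageRecord(gpu_seconds=120.0, tokens_generated=50_000)))
```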
AI Will Be Trapped by Human Data
腾讯研究院· 2025-06-16 09:26
Core Viewpoint
- The article discusses the transition from the "human data era" to the "experience era" in artificial intelligence, emphasizing the need for AI to learn from first-hand experience rather than relying solely on human-generated data [1][5][12]

Group 1: Transition to Experience Era
- AI models currently depend on second-hand experience, such as internet text and human annotations, which is losing value as high-quality human data is rapidly consumed [1][5]
- The marginal value of new data is declining, producing diminishing returns even as model scale grows, a phenomenon referred to as "scale barriers" [1][5]
- To overcome these limits, AI must interact with its environment to generate first-hand experience, much as infants learn through play or athletes make decisions on the field [1][5][8]

Group 2: Technical Characteristics of the Experience Era
- In the experience era, AI agents will need to operate continuously in real or high-fidelity simulated environments, using environmental feedback as the intrinsic reward signal rather than human preferences [2][5]
- Developing reusable world models and memory systems is crucial, along with significantly improving sample efficiency through highly parallel interaction [2][5]

Group 3: Philosophical and Governance Implications
- The article argues for the superiority of decentralized cooperation over centralized control, warning that imposing single objectives on AI mirrors historical attempts to control human behavior out of fear [2][5][18]
- A diverse ecosystem of goals fosters innovation and resilience, reducing the risks of single points of failure and rigidity in AI governance [2][5][18]

Group 4: Future Perspectives
- The evolution of AI is a long-term journey requiring decades of development; success hinges on stronger continuous-learning algorithms and an open, shared ecosystem [5][12]
- The article posits that creating superintelligent agents and fostering their collaboration with humans will ultimately benefit the world, calling for patience and preparation for this transformation [12]