Reinforcement Learning
Everything You Want Is in Your Dreams? Google's New World Model Trains Purely on "Imagination" and Learns to Mine Diamonds in Minecraft
机器之心· 2025-10-02 01:30
To solve complex tasks in embodied environments, agents need a deep understanding of the world and the ability to choose successful actions. World models offer a promising path to this goal: they learn to predict the future outcomes of potential actions from the perspective of an agent such as a robot or a video-game player. In this way, world models give agents a deep understanding of the world and the ability to select actions by planning or by reinforcement learning in imagination. Moreover, world models can in principle be learned from fixed datasets, allowing agents to be trained purely in imagination without online interaction. Optimizing behavior offline is valuable for many practical applications, such as robots in the physical world, where online interaction with an undertrained agent is often unsafe.

World-model agents such as Dreamer 3 are among the best-performing and most robust reinforcement learning algorithms to date in games and robotics. While fast and accurate in their specific narrow environments, their architectures lack the capacity to fit complex real-world distributions. Controllable video models such as Genie 3 have been trained on diverse real videos and games and achieve varied scene generation and simple interaction. These models are built on scalable architectures such as diffusion transformers. However, they still struggle to learn the precise physics of object interactions and game mechanics, which limits their usefulness for training successful agents. Moreover, they ...
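The "training purely in imagination" idea above can be illustrated with a toy sketch: a hand-written deterministic world model plus exhaustive planning over imagined rollouts. Everything here (the integer state, the reward shape, the class and function names) is an invented illustration, not Google's or Dreamer's actual architecture.

```python
class ToyWorldModel:
    """Toy deterministic world model: state is an integer, actions shift it."""
    def predict(self, state, action):
        next_state = state + action       # imagined transition
        reward = -abs(next_state)         # reward is highest when state reaches 0
        return next_state, reward

def plan_in_imagination(model, state, actions, horizon=3):
    """Pick the first action of the best imagined rollout (exhaustive search).

    No environment steps are taken: every transition is predicted by the model.
    """
    def best_return(s, depth):
        if depth == horizon:
            return 0.0
        best = float("-inf")
        for a in actions:
            ns, r = model.predict(s, a)
            best = max(best, r + best_return(ns, depth + 1))
        return best

    best_val, best_first = float("-inf"), None
    for a in actions:
        ns, r = model.predict(state, a)
        total = r + best_return(ns, 1)
        if total > best_val:
            best_val, best_first = total, a
    return best_first

model = ToyWorldModel()
print(plan_in_imagination(model, state=5, actions=[-1, 0, 1]))  # prints -1 (steps toward 0)
```

Real world-model agents replace the hand-written `predict` with a learned latent dynamics model and replace exhaustive search with policy learning over imagined trajectories, but the control flow is the same: act on predicted futures, not live interaction.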
Latest Interview with SemiAnalysis Founder Dylan: AI, Semiconductors, and U.S.-China
傅里叶的猫· 2025-10-01 14:43
Core Insights
- The article discusses insights from a podcast featuring Dylan Patel, founder of SemiAnalysis, focusing on the semiconductor industry and AI computing demand, particularly the collaboration between OpenAI and Nvidia [2][4][20].

OpenAI and Nvidia Collaboration
- OpenAI's partnership with Nvidia is not merely a financial arrangement but a strategic move to meet its substantial computing needs for model training and operation [4][5].
- OpenAI has 800 million users but generates only $1.5 to $2 billion in revenue, facing competition from trillion-dollar companies like Meta and Google [4][5].
- Nvidia's $10 billion investment in OpenAI aims to support the construction of a 10GW cluster, with Nvidia capturing a significant portion of GPU orders [5][6].

AI Industry Dynamics
- The AI industry is characterized by a race to build computing clusters, where the first to establish such infrastructure gains a competitive edge [7].
- The risk for OpenAI lies in its ability to convert its investments into sustainable revenue, especially given its $30 billion contract with Oracle [6][20].

Model Scaling and Returns
- Dylan argues against the notion of diminishing returns in model training, suggesting that large computational increases can still yield substantial performance improvements [8][9].
- The current state of AI development is likened to a "high school" level of capability, with room to grow toward "college graduate" levels [9].

Tokenomics and Inference Demand
- The concept of "tokenomics" is introduced, emphasizing the economic value of AI outputs relative to their computational cost [10][11].
- OpenAI faces the challenge of maximizing its computing capacity while managing inference demand that doubles roughly every two months [10][11].

Reinforcement Learning and Memory Mechanisms
- Reinforcement learning is highlighted as a critical area for AI development, where models learn through iterative interaction with their environment [12][13].
- Improved memory mechanisms are needed in AI models, with a focus on optimizing long-context processing [12].

Hardware, Power, and Supply Chain Issues
- AI data centers currently consume 3-4% of U.S. electricity, putting significant pressure on the power grid as AI infrastructure grows rapidly [14][15].
- The industry faces labor shortages and supply chain challenges, particularly in building new data centers and power-generation facilities [17].

U.S.-China AI Stack Differences and Geopolitical Risks
- Dylan emphasizes that without AI the U.S. risks losing its global dominance, while China is making long-term investments across sectors, including semiconductors [18][19].

Company Perspectives
- OpenAI is viewed positively but criticized for a scattered focus across applications, which may dilute its execution capability [20][21].
- Anthropic is seen as a strong competitor due to its concentrated effort in software development, particularly the coding market [21].
- AMD is recognized for competitive pricing but lacks revolutionary breakthroughs compared to Nvidia [22].
- xAI's potential is acknowledged, but concerns are raised about its business model and funding [23].
- Oracle is positioned as a low-risk player benefiting from its established cloud business, in contrast to OpenAI's high-stakes approach [24].
- Meta is viewed as having a comprehensive strategy with significant potential, while Google is seen as having made a notable turnaround in its AI strategy [25][26].
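A back-of-the-envelope sketch of the "tokenomics" framing above: compare revenue per token to serving cost, and compound the reported two-month demand doubling. All dollar figures here are invented for illustration; only the doubling period comes from the interview summary.

```python
def token_margin(price_per_m_tokens, cost_per_m_tokens):
    """Gross margin on a million output tokens (fraction of revenue kept)."""
    return (price_per_m_tokens - cost_per_m_tokens) / price_per_m_tokens

def demand_after(months, doubling_period_months=2):
    """Demand multiplier if inference demand doubles every `doubling_period_months`."""
    return 2 ** (months / doubling_period_months)

# Illustrative numbers only: $10 revenue vs $4 serving cost per million tokens.
print(f"margin: {token_margin(10.0, 4.0):.0%}")        # prints "margin: 60%"
print(f"demand in a year: {demand_after(12):.0f}x")    # prints "demand in a year: 64x"
```

The compounding is the point: at a two-month doubling period, a fixed cluster's capacity is outgrown 64-fold within a year, which is why the interview frames cluster build-out as the race.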
New Synthesis Framework Hits SOTA: Reinforcement Learning as the Engine, Task Synthesis as the Fuel, from Ant Group and HKU
量子位· 2025-10-01 03:03
Core Insights
- The article discusses the launch of PromptCoT 2.0 by Ant Group and the University of Hong Kong, focusing on task synthesis as a key direction in the second half of the large-model era [1][5]
- The team emphasizes task synthesis and reinforcement learning as foundational technologies for advancing large models and intelligent agents [6][7]

Summary by Sections

Introduction to PromptCoT 2.0
- PromptCoT 2.0 represents a comprehensive upgrade of the PromptCoT framework, which was initially introduced a year ago [4][16]
- The framework aims to enhance the capabilities of large models through task synthesis, particularly for complex real-world problems [5][9]

Importance of Task Synthesis
- Task synthesis is viewed as a critical area that includes problem synthesis, answer synthesis, environment synthesis, and evaluation synthesis [9]
- Without a sufficient amount of high-quality task data, reinforcement learning cannot be effectively utilized [9]

Framework and Methodology
- The team developed a general and powerful problem-synthesis framework, broken down into concept extraction, logic generation, and problem-generation model training [10][13]
- PromptCoT 2.0 introduces an Expectation-Maximization (EM) loop to iteratively optimize the reasoning chain, producing more challenging and diverse problems [15][23]

Performance and Data Upgrades
- PromptCoT 2.0 shows significant performance improvements, allowing strong reasoning models to achieve new state-of-the-art results [17]
- The framework has generated 4.77 million synthetic problems, which exhibit higher difficulty and greater differentiation than existing datasets [19][20]

Future Directions
- The team plans to explore agentic environment synthesis, multi-modal task synthesis, and self-rewarding mechanisms to further enhance large models [27][28]
- The integration of self-rewarding and game-theoretic approaches is seen as a potential avenue for improving model performance [29]
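The three-stage pipeline above (concept extraction, then a latent reasoning chain, then problem generation) with EM-style refinement can be sketched schematically. Every function below is a trivial stand-in for a model call, and all names are invented; this is not PromptCoT 2.0's actual implementation.

```python
import random

random.seed(0)

def extract_concepts(corpus):
    """Stand-in for concept extraction: pick longer words from seed documents."""
    return sorted({w for doc in corpus for w in doc.split() if len(w) > 4})

def generate_rationale(concepts):
    """Stand-in for the latent reasoning chain linking sampled concepts."""
    a, b = random.sample(concepts, k=2)
    return f"combine {a} with {b}"

def generate_problem(rationale):
    """Stand-in for the problem-generation model conditioned on the rationale."""
    return f"Exercise: {rationale}."

def score(problem):
    """Stand-in difficulty scorer used to select among candidates."""
    return len(problem)

def em_synthesis(corpus, rounds=3, candidates=4):
    """EM-style loop: sample candidate rationales (E-step), keep the problem
    that scores best (M-step), repeat."""
    concepts = extract_concepts(corpus)
    best_problem, best_score = None, -1
    for _ in range(rounds):
        for rationale in [generate_rationale(concepts) for _ in range(candidates)]:
            problem = generate_problem(rationale)
            if score(problem) > best_score:
                best_problem, best_score = problem, score(problem)
    return best_problem

corpus = ["prime factorization of integers", "graph coloring with constraints"]
print(em_synthesis(corpus))
```

In the real framework the stand-ins are learned models and the selection signal feeds back into training the rationale generator, but the alternating sample-then-select structure is the shape of the EM loop the summary describes.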
Fudan, Tongji, CUHK and Others Release a Comprehensive Survey of Reinforcement Learning Across the Full LLM Lifecycle
机器之心· 2025-09-30 23:49
Core Insights
- The article discusses significant advancements in reinforcement learning (RL) techniques that enhance the capabilities of large language models (LLMs), particularly in understanding human intent and following user instructions [2][3]
- A comprehensive survey titled "Reinforcement Learning Meets Large Language Models," conducted by researchers from top institutions, summarizes the role of RL throughout the entire lifecycle of LLMs [2][3]

Summary by Sections

Overview of Reinforcement Learning in LLMs
- The survey details the application strategies of RL across the stages of LLMs, including pre-training, alignment fine-tuning, and reinforced reasoning [3][6]
- It organizes existing datasets, evaluation benchmarks, and mainstream open-source tools and training frameworks relevant to RL fine-tuning, providing a clear reference for future research [3][6]

Lifecycle of LLMs
- The survey systematically covers the complete application lifecycle of RL in LLMs, detailing the objectives, methods, and challenges at each stage from pre-training to reinforcement [11][12]
- A classification overview of RL methods in LLMs is presented, highlighting the interconnections between stages [5][6]

Focus on Verifiable Rewards
- The survey emphasizes Reinforcement Learning with Verifiable Rewards (RLVR), summarizing its applications in enhancing reasoning stability and accuracy in LLMs [7][9]
- It discusses how RLVR optimizes the reasoning process and improves the model's adaptability to complex tasks through automatically verifiable reward mechanisms [7][9]

Key Contributions
- The survey makes three main contributions: a comprehensive lifecycle overview of RL applications in LLMs, a focus on advanced RLVR techniques, and the integration of key research resources for experiments and evaluations [9][11]
- It provides valuable references for researchers interested in exploring RL in the context of LLMs [11][12]

Challenges and Future Directions
- Despite significant progress, large-scale RL for LLMs still faces scalability and training-stability challenges; it remains computationally intensive and often unstable [12][13]
- Reward design and credit assignment, particularly in long-horizon reasoning, pose difficulties for model learning [12][13]
- Standardized datasets and evaluation benchmarks are needed to facilitate comparison and validation of RL fine-tuning methods [12][13]
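The defining property of RLVR, as summarized above, is that the reward is a deterministic check rather than a learned scorer. A minimal sketch for a math-answer task; the answer-extraction rule and function name are illustrative assumptions, not from the survey:

```python
import re

def verifiable_reward(model_output: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the last number in the output matches the ground truth.

    No learned reward model is involved; the check is a deterministic rule,
    which is what makes the reward 'verifiable' and immune to reward hacking
    of a learned scorer.
    """
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output)
    return 1.0 if numbers and numbers[-1] == ground_truth else 0.0

# A chain of thought ending in the right answer earns reward 1.0.
print(verifiable_reward("17 + 25 = 42, so the answer is 42", "42"))  # prints 1.0
print(verifiable_reward("I believe the answer is 41", "42"))         # prints 0.0
```

In practice the verifier varies by domain (exact-match for math, unit tests for code), but the RL loop only ever sees the scalar this kind of function returns.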
ChatGPT Architect Just Released New Research
量子位· 2025-09-30 12:22
Core Insights
- The article discusses the latest research from Thinking Machines on LoRA, an efficient fine-tuning method, co-authored by John Schulman, a co-founder of OpenAI [1][3][27].

Group 1: Research Findings
- The paper, "LoRA Without Regret," explores the conditions under which LoRA can match the efficiency of full fine-tuning (FullFT) and provides a simplified approach that reduces the difficulty of hyperparameter tuning [3][7].
- Current large models often have trillions of parameters and are trained on vast datasets, but downstream tasks typically require only small, domain-specific datasets [6].
- LoRA, a parameter-efficient fine-tuning method, captures fine-tuning information through low-rank matrices, and the research confirms that with attention to key details LoRA can match FullFT's performance [7][12].

Group 2: Performance Comparisons
- The optimal learning rate for LoRA is found to be ten times that of FullFT, letting it compete effectively in fine-tuning scenarios with small to medium datasets [9][12].
- Experiments with Llama 3 and Qwen3 models on specific datasets showed that high-rank LoRA's learning curves closely track FullFT, with loss decreasing logarithmically during training for both [10][11].
- In mathematical reasoning tasks, even at rank 1, LoRA's performance remains comparable to FullFT, highlighting its efficiency in absorbing information during training [13][14].

Group 3: Application Insights
- The research emphasizes that applying LoRA to all layers of a model, rather than only the attention layers, is crucial for maximizing performance [15][19].
- Previous studies often limited LoRA to the attention matrices, but this research shows that broader application yields significant performance improvements [16][19].
- The findings suggest that the gradient is dominated by the layers with the most parameters, so full-layer coverage is needed for LoRA to approach FullFT performance [21].

Group 4: Hyperparameter Tuning
- The team proposes a simplified approach that reduces the complexity of tuning LoRA's hyperparameters, finding that the optimal learning rate consistently follows a specific pattern [22][25].
- Of four potential hyperparameters, two are deemed redundant, so users can focus on the "initial update scale" and "steps of deviation from the initial state" [25][26].
- This simplification roughly halves LoRA's tuning difficulty, making it more accessible [26].
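The low-rank mechanism discussed above fits in a few lines of NumPy. The shapes and the delta-W = (alpha/r) * B * A scaling follow the commonly used LoRA formulation; the specific dimensions and initialization scale here are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, r, alpha = 6, 4, 2, 16    # rank r is much smaller than the layer dims

W = rng.normal(size=(d_out, d_in))     # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01  # trainable, small random init
B = np.zeros((d_out, r))               # trainable, zero init => delta starts at 0

def lora_forward(x):
    """y = W x + (alpha / r) * B A x  -- only A and B receive gradients."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
# With B initialized to zero, LoRA's output equals the frozen base layer's output,
# so fine-tuning starts exactly from the pretrained behavior.
assert np.allclose(lora_forward(x), W @ x)

# Trainable parameters: r*(d_in + d_out) for LoRA vs d_in*d_out for full fine-tuning.
print(r * (d_in + d_out), "vs", d_out * d_in)  # prints "20 vs 24"
```

At realistic dimensions (thousands per side) the gap is dramatic rather than marginal, which is the entire appeal: the parameter count grows linearly in the layer width instead of quadratically.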
Yin Qi's Thousand-Li Road in Intelligent Driving: Romance Is Fine, Just Don't Be Reckless
Guan Cha Zhe Wang· 2025-09-30 09:49
Core Insights
- Chongqing has welcomed a new local intelligent-driving supplier, Qianli Technology, which aims to establish itself in the smart-driving sector and has drawn significant attention from local government and industry leaders [1][3][6].

Group 1: Company Overview
- Qianli Technology held a brand launch event on September 28, unveiling its brand identity and future plans, with strong local government support for its initiatives [3][6].
- The company has ambitious goals; its CEO stated a desire to capture a significant share of the intelligent-driving market [3][6].
- Qianli Technology's "Afari Plan" envisions a platform-level AI ecosystem integrating AI, vehicles, and robotics, expanding into both household and industrial AI applications [3][7].

Group 2: Strategic Partnerships and Investments
- Qianli Technology has attracted a 1.3 billion RMB investment from Mercedes-Benz, a significant step in its international expansion [6][22].
- The company is targeting overseas automakers, having previously sought partnerships in Germany, indicating a strategic focus on global markets [6][22].

Group 3: Technological Development
- Qianli Technology is developing intelligent-driving algorithms ranging from L2 to L4, with plans to release L3 by the end of 2025 and L4 by mid-2026 [9][12].
- The company is also working on a new generation of intelligent cockpit systems and aims to stand up a comprehensive Robotaxi service within the next 18 months, targeting deployment in more than 10 cities globally [9][12].
- The company emphasizes a pragmatic approach to technology, focusing on high "model content" in its intelligent-driving solutions, with a goal of raising that metric significantly in the coming months [14][18].

Group 4: Market Position and Competition
- The intelligent-driving market is increasingly competitive, with Qianli Technology positioning itself as a strong contender despite being a newer player [6][22].
- The company recognizes the importance of both AI model development and engineering capability, suggesting a dual focus on innovation and practical application [22][27].
- With L2+ intelligent-driving systems now in more than 50% of new car sales in China, substantial market potential remains for suppliers to explore [27].
Renowned Roboticist: The Future of Humanoid Robots Is to Not Look Human
36Ke· 2025-09-30 08:43
Group 1
- The article discusses the challenges humanoid robots face in achieving dexterity despite significant investment from venture capital firms and large tech companies [2][3][5]
- Humanoid robots are designed to mimic human body structure and perform tasks in human environments, with the goal of versatile robots capable of handling various jobs [5][6]
- Companies like Tesla and Figure are optimistic about the economic potential of humanoid robots, with predictions of trillions in revenue, but the timeline for achieving human-level dexterity remains uncertain [6][7]

Group 2
- The history of humanoid-robot development spans more than six decades, with significant contributions from various researchers and institutions, including early models from Waseda University and Honda [8][9]
- Despite advances, no humanoid robot has demonstrated dexterity comparable to human capability, and existing designs have not been successfully applied in practical industrial settings [20][21]
- The article emphasizes the importance of tactile feedback and dexterity in humanoid robots, arguing that current training methods relying on visual data are insufficient to reach the desired level of skill [23][24][44]

Group 3
- The article critiques reliance on "learning from demonstration," highlighting the limitations of current approaches that incorporate no tactile or force feedback [23][24][25]
- Companies like Figure and Tesla are shifting toward training humanoid robots on first-person videos of humans performing tasks, betting on the effectiveness of visual learning [26][27]
- The article concludes that achieving true dexterity in humanoid robots will require a deeper understanding of tactile perception and the integration of such feedback into training methodologies [44][45]
Core Viewpoint
- Despite significant investment from venture capital firms and large tech companies, humanoid robots still struggle to achieve the dexterity essential for performing tasks in human environments [2][3][4].

Group 1: Historical Context of Humanoid Robots
- The concept of humanoid robots has been explored for more than 65 years; early developments include a computer-controlled robotic arm capable of stacking blocks in 1961 [3].
- The evolution of humanoid robots has seen contributions from various institutions, including WABOT-1 from Waseda University in the 1970s and Honda's ASIMO in 2000 [11][12].

Group 2: Current State and Future Predictions
- Humanoid robots are currently in the early stages of development, with Gartner indicating they have not yet reached peak hype [4].
- Companies like Tesla and Figure are optimistic about the economic potential of humanoid robots, predicting trillions in revenue [9][10].

Group 3: Challenges in Dexterity
- Achieving human-level dexterity in humanoid robots remains a significant challenge, as current robotic hands lack the finesse and adaptability needed for a wide range of tasks [23][24].
- Existing training methods often rely on visual demonstrations, which fail to capture the tactile feedback necessary for dexterous manipulation [27][28].

Group 4: Learning Approaches
- The industry has shifted toward end-to-end learning, where robots learn by observing human actions, but this approach is limited by the lack of tactile feedback and precision [30][31].
- Successful applications of end-to-end learning in other fields, such as speech recognition and image labeling, highlight the importance of pre-processing and human-like structure in achieving effective learning outcomes [49][50].

Group 5: Importance of Tactile Feedback
- Human dexterity relies heavily on rich tactile feedback, which current humanoid robots lack, making human-like manipulation difficult to replicate [51][52].
- The complexity of human touch perception and the coordination of multiple body parts in dexterous tasks further complicate development [52].
DeepSeek's New Model Cuts Prices: Optimized Inference Efficiency, API Prices Down Over 50%
YOUNG财经 漾财经· 2025-09-30 06:25
Core Insights
- DeepSeek has launched the new DeepSeek-V3.2-Exp model, which cuts API costs by more than 50% [2][3][4]

Group 1: Model Release and Features
- DeepSeek-V3.2-Exp is an experimental version built on the previous V3.1-Terminus, introducing the DeepSeek Sparse Attention mechanism to improve training and inference efficiency on long texts [3][4]
- Despite the new sparse attention mechanism, the model maintains performance comparable to V3.1-Terminus across various public evaluation datasets [4]

Group 2: Cost Reduction and Pricing
- The new model substantially reduces service costs, with API prices dropping by more than 50%: input with cache hits falls from 0.5 yuan to 0.2 yuan per million tokens, cache misses from 4 yuan to 2 yuan per million tokens, and output from 12 yuan to 3 yuan per million tokens [4]

Group 3: Research and Development
- Development of DeepSeek-V3.2-Exp involved designing new GPU operators and using the TileLang programming language for rapid prototyping, supporting deeper exploration of model capabilities [4]
- DeepSeek's research on the DeepSeek-R1 model, which incentivizes reasoning capabilities in large language models through reinforcement learning, was featured on the cover of the journal Nature [7]
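The per-category price changes above can be turned into a concrete bill with a quick sketch. The prices are the ones stated in the announcement (yuan per million tokens); the usage mix is an invented example, and savings will skew higher for output-heavy workloads since output saw the largest cut.

```python
# DeepSeek API prices, yuan per million tokens, before and after V3.2-Exp
# (figures as stated in the announcement).
OLD = {"input_cache_hit": 0.5, "input_cache_miss": 4.0, "output": 12.0}
NEW = {"input_cache_hit": 0.2, "input_cache_miss": 2.0, "output": 3.0}

def cost(prices, m_tokens):
    """Total cost in yuan for a usage mix given in millions of tokens per category."""
    return sum(prices[k] * m_tokens[k] for k in m_tokens)

# Illustrative mix: 10M cache-hit input, 5M cache-miss input, 8M output tokens.
mix = {"input_cache_hit": 10, "input_cache_miss": 5, "output": 8}
old_cost, new_cost = cost(OLD, mix), cost(NEW, mix)
print(f"{old_cost} -> {new_cost} yuan ({1 - new_cost / old_cost:.0%} cheaper)")
# prints "121.0 -> 36.0 yuan (70% cheaper)"
```

The 4x drop in output pricing dominates: for this mix the overall saving is about 70%, well above the headline "over 50%".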
Li Auto Might Release an i6 Sales Report, or Might Not
理想TOP2· 2025-09-30 05:01
Core Viewpoint
- The company is likely to release an i6 sales report, but there is a significant chance it may not; the probability leans slightly toward release based on recent industry developments [1][3].

Group 1: Company Strategy and Market Position
- The company aims to attract readers who value the analytical content of its insights rather than those seeking non-public information [4].
- The company's actual operating strategy is driven by the principle of challenging growth limits, which may lead to changes in its product definitions and market approach over time [4].
- The definition of "family car" is broadening, moving away from the previous narrow focus on vehicles suited to transporting children under 12 [4].

Group 2: Product Expectations and Market Dynamics
- The i6 is expected to perform significantly better than the L6 on the data, but direct comparisons may not be appropriate given differing market conditions and expectations [5].
- The company is inclined not to release order or large-order reports, mainly because of its direct-sales model and high level of honesty, which limit the room for presenting inflated figures [4].
- If the i6's data proves exceptionally strong, the company may release it to capitalize on the positive market response [4].