Reinforcement Learning (RL)
Thinking the Same Way as Ilya: Star AI Researcher Leaves Musk's xAI to Build AI That Can Empathize
Sou Hu Cai Jing· 2025-11-26 10:48
Core Insights
- A new AI startup, Humans&, is seeking to raise $1 billion at a target valuation of $4 billion; it was founded by Eric Zelikman, a former researcher at xAI [2][12]
- Zelikman aims to develop AI models that learn user behavior and empathize with users, addressing the limitations of current reinforcement learning paradigms [2][17]
- The startup's mission is to create AI that better understands human goals and emotions, moving beyond traditional task-oriented models [12][20]
Company Overview
- Humans& was co-founded by Eric Zelikman, who previously worked at xAI and contributed to the development of significant AI models [4][6]
- The company is currently recruiting technical staff, offering competitive salaries starting at $350,000 annually [18]
Technology and Innovation
- Zelikman developed the STaR algorithm, which enhances language models' reasoning capabilities, and his work has been recognized at top AI conferences [11][12]
- The focus of Humans& is on creating AI that can collaborate with humans and understand diverse human aspirations and values [17][20]
Market Context
- The AI industry is shifting toward models that possess not only high intelligence but also emotional intelligence, reflecting growing demand for more human-like interactions [20]
Ilya Speaks Out: The Scaling Era Is Over, and He Admits He No Longer "Feels the AGI"
36Kr· 2025-11-26 06:54
Ilya's blockbuster interview is out! An hour and a half, 20,000 words in full, and he delivers a startling claim: the era of scaling is over, and we are moving into the era of research.

The scaling era is over! When Ilya said this calmly on camera, the entire AI community held its breath — we have left the era of scaling and are heading into the era of research. In this in-depth interview with well-known host Dwarkesh Patel, Ilya laid bare some of the most uncomfortable truths in current AI research: not just pre-training — even the Scaling Law path itself has been handed a "suspended sentence" by him: it can still be followed, but it will never lead to AGI. He also pointed out that however strong today's models are, their generalization ability falls far short of what their parameter counts and benchmark scores suggest, and even further short of humans. Most strikingly, Ilya has formed a clear view of the missing technical piece, but is choosing not to disclose further details for now. Over the hour-and-a-half conversation, Ilya also discussed SSI's strategy, the keys to improving model generalization, and the future path to AGI.

Key highlights at a glance:
- The current technical route lacks staying power — models keep improving, but they cannot reach AGI
- We still do not know how to build a truly viable system architecture
- The core bottleneck is generalization, and models lag far behind humans here
- Even training a model on every competitive programming problem would not give it genuine problem-solving intuition
- Evaluation scores look dazzling, but real-world performance lags behind — R ...
A Conversation with Chen Kaijie: Building Your Personal Agent — and Your "High-EQ Agent" | NEXTA Innovation Night Talk
36Kr· 2025-11-19 07:33
Core Insights
- The article discusses the evolution of AI from a "scaling law" approach to an "era of experience," emphasizing the need for AI to learn from real user interactions rather than relying only on large datasets [1][5][6]
- Macaron AI, founded by Chen Kaijie, aims to create a "Personal Agent" that understands users' needs and emotions, moving beyond traditional chatbots [1][2]
Group 1: Transition from Scaling Law to Experience Era
- The AI industry is shifting from relying solely on increasing parameters and data to focusing on learning from real user experiences [1][6]
- The "Chinchilla Law" indicates that as model parameters increase, the required data also increases; because the available data is limited, model intelligence hits a bottleneck [4][6]
- The future competitiveness of intelligent systems will depend on their ability to learn continuously from real experiences rather than just pre-trained data [6][7]
Group 2: Reinforcement Learning and Real Feedback
- Reinforcement learning (RL) is central to this new approach: real interactions provide high-quality data that encodes causal relationships [2][7]
- The success of the AI code assistant Cursor illustrates how analyzing user feedback on code suggestions can enhance model performance [2][8]
- A robust "Reward Model" evaluates user satisfaction and guides the AI in improving its responses, making the learning process more effective (a toy sketch of this loop follows the summary) [9][10]
Group 3: Macaron AI's Unique Features
- Macaron AI has created over 100,000 personalized "mini-apps" for various life scenarios, positioning itself as a private, dedicated assistant [3][11]
- Macaron AI's memory system is integrated into the model itself, allowing it to learn and adapt from user feedback rather than relying on traditional keyword search [2][11]
- The use of Ant Group's open-source Text Diffusion technology lets the model generate and modify content quickly, contributing to a better user experience [12]
Group 4: Future of Personal Agents
- The vision for personal agents includes managing many aspects of daily life, such as scheduling, travel, and shopping, potentially replacing many existing applications [16]
- The integration of mini-apps and memory functions is a long-term goal, aiming for a seamless user experience [15]
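The reward-model loop described above lends itself to a small illustration. The following is a minimal sketch, not Macaron AI's or Cursor's actual pipeline: it assumes accepted/rejected response pairs logged from users, toy handcrafted features, and a Bradley-Terry preference objective trained by plain gradient ascent.

```python
# A minimal Bradley-Terry reward model trained on accepted/rejected pairs.
# Features, data, and the linear scorer are toy stand-ins for illustration.
import numpy as np

def features(response: str) -> np.ndarray:
    # Toy features: normalized length, question marks, exclamation marks.
    return np.array([len(response) / 100.0,
                     float(response.count("?")),
                     float(response.count("!"))])

def train_reward_model(pairs, lr: float = 0.1, steps: int = 500) -> np.ndarray:
    """pairs: (accepted, rejected) response pairs from user interaction logs."""
    w = np.zeros(3)
    for _ in range(steps):
        for good, bad in pairs:
            # Bradley-Terry: P(good preferred) = sigmoid(r(good) - r(bad)).
            diff = features(good) @ w - features(bad) @ w
            p = 1.0 / (1.0 + np.exp(-diff))
            # Gradient ascent on the log-likelihood of the observed preference.
            w += lr * (1.0 - p) * (features(good) - features(bad))
    return w

# Logged interactions: first response accepted, second rejected.
logs = [("Here is the fix: use a context manager.", "idk!!"),
        ("Try adding an index on user_id.", "maybe??")]
w = train_reward_model(logs)
print("reward for a new reply:", features("Use a prepared statement.") @ w)
```

In a production system the linear scorer would be replaced by a fine-tuned model head, and the pairs would come from large-scale interaction logs rather than two hand-written examples.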
Tsinghua Team: A New Baseline for 1.5B Models! Reaching Top-Tier Performance with the "Dumbest" RL Recipe
机器之心· 2025-11-12 23:51
Core Insights
- The article presents a groundbreaking approach to reinforcement learning (RL) that achieves state-of-the-art (SOTA) performance using a simple, single-stage training method with fixed hyperparameters, with a roughly 50% reduction in compute [4][14][15]
- The findings suggest that a well-scaled, simple baseline can be more powerful than previously thought, challenging the complexity often associated with advanced RL techniques [4][15][27]
Background and Context
- The research is set against the backdrop of a "technical arms race" in training small models with RL, with various methods evolving rapidly over a few months [6]
- Early approaches included hyperparameter tuning, multi-stage progressive training, and curriculum learning, leading to increasingly complex training pipelines [6][8]
Methodology
- The JustRL approach emphasizes simplicity: standard GRPO without modifications (a sketch of the group-relative advantage follows this summary), a single continuous training phase, and fixed hyperparameters [11]
- The training data consists of ordinary math problem sets without offline difficulty screening or data augmentation, and the recipe proves effective across different base models [11][14]
Performance Metrics
- JustRL-DeepSeek-1.5B achieved an average accuracy of 54.87% across nine benchmarks, outperforming ProRL-V2, which used a nine-stage training pipeline [14]
- JustRL-Nemotron-1.5B reached an average accuracy of 64.32%, slightly surpassing QuestA while using significantly fewer tokens [14][15]
Training Dynamics
- Training of JustRL-DeepSeek-1.5B was notably stable: key metrics such as policy entropy and average reward showed healthy fluctuation without typical failure modes like exploration collapse or premature convergence [17][19]
- Training ran on 32 A800-80GB GPUs for approximately 15 days, highlighting the reduced engineering complexity and computational overhead compared to multi-stage methods [15]
Key Discoveries
- The research revealed that adding certain "optimizations" could actually degrade performance, indicating that not every seemingly beneficial technique is necessary [21][24]
- The findings emphasize the importance of establishing a clear, simple baseline before the value of complex techniques can be assessed accurately [27]
Philosophical Implications
- The article closes with a reflection on the value of simplicity in technology: adequately scaled, simpler methods are often sufficient [26][27][28]
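For readers unfamiliar with GRPO, the recipe's core is easy to state: sample a group of responses per prompt, score each with a verifiable reward, and standardize the rewards within the group to get advantages. Below is a minimal sketch of that computation under the usual formulation — an assumption about the standard algorithm, not code from the JustRL release.

```python
# A minimal sketch of the group-relative advantage at the heart of standard
# GRPO, as commonly formulated; not code from the JustRL release.
import numpy as np

def grpo_advantages(group_rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Standardize rewards within a group of G responses to one prompt."""
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + eps)

# Example: 8 sampled solutions to one math problem, scored by a verifiable
# checker (1 if the final answer is correct, else 0).
rewards = np.array([1, 0, 0, 1, 1, 0, 0, 0], dtype=float)
print(grpo_advantages(rewards))  # correct samples get positive advantage
```

Each response is then reinforced in proportion to its group-relative advantage under a clipped policy-gradient objective; with fixed hyperparameters and a single phase, this loop is essentially the whole recipe.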
ICCV 2025 Highlight | UnrealZoo: A Large-Scale Embodied Simulation Platform
机器之心· 2025-11-11 17:11
Core Insights
- UnrealZoo is a high-fidelity virtual environment platform designed to advance research in embodied AI, providing over 100 diverse and realistic 3D scenes for a wide range of research needs [2][5][9]
- The platform received a Highlight Award at ICCV 2025, indicating its significance in the field [2]
Group 1: Platform Features
- UnrealZoo includes more than 100 high-quality, realistic scenes ranging from indoor settings to urban landscapes and natural environments, supporting a wide range of research applications [5][13]
- The platform features 66 customizable embodied entities, including humans, animals, vehicles, and drones, which can interact with both the environment and other agents [5][24]
- It provides an easy-to-use Python interface (a hypothetical usage sketch follows this summary) and tools for data collection, environment enhancement, and distributed training, with optimized rendering and communication efficiency [7][15][42]
Group 2: Research Implications
- The platform addresses the limitations of existing simulators by offering diverse, high-fidelity environments that improve the adaptability and generalization of embodied agents in complex, dynamic settings [8][9]
- Experiments conducted with UnrealZoo demonstrate the importance of environmental diversity for the generalization and robustness of agents, particularly in navigation and social-interaction tasks [64][55]
- The research highlights the challenges that current reinforcement learning and vision-language-model-based agents face in open-world scenarios, underscoring the need for further development in these areas [8][64]
Group 3: Future Directions
- Future work will focus on expanding the variety of scenes, entities, and interaction tasks within UnrealZoo to further support embodied AI in real-world scenarios [64]
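To give a flavor of what an "easy-to-use Python interface" for such a platform looks like, here is a minimal gym-style episode loop. The environment id, and the assumption that UnrealZoo follows the classic Gym reset/step API, are illustrative only — consult the project's documentation for its actual interface and registered environments.

```python
# A hypothetical episode loop, assuming a classic Gym-style reset/step API.
# The environment id below is made up for illustration; see the UnrealZoo
# docs for the platform's actual interface and registered environments.
import gym

env = gym.make("UnrealTrack-Demo-v0")  # hypothetical environment id
obs = env.reset()
done, episode_return = False, 0.0
while not done:
    action = env.action_space.sample()  # random placeholder policy
    obs, reward, done, info = env.step(action)
    episode_return += reward
env.close()
print("episode return:", episode_return)
```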
Better Than Nano Banana at Chinese Text and Fine-Grained Control! 兔展 (Rabbitpre) & Peking University's UniWorld-V2 Sets a New SOTA
量子位· 2025-11-05 05:39
Core Viewpoint
- The article introduces UniWorld-V2, a new image-editing model that excels at fine detail and at understanding Chinese-language instructions, outperforming previous models such as Nano Banana [1][4][6]
Group 1: Model Features
- UniWorld-V2 demonstrates superior fine-grained control in image editing, achieving results that surpass SFT-based models [11]
- The model can accurately interpret complex Chinese characters and phrases, and renders artistic fonts proficiently [11]
- Users can specify editing regions with bounding boxes, enabling precise operations such as moving objects out of a designated area [14]
- The model effectively understands commands such as "re-light the scene," integrating objects naturally into the environment with coherent light and shadow [15]
Group 2: Technical Innovations
- The core innovation behind UniWorld-V2 is the UniWorld-R1 framework, which applies reinforcement learning (RL) to image editing [18]
- UniWorld-R1 is the first unified RL-based architecture for this setting, using Diffusion Negative-aware Finetuning (DiffusionNFT) for efficient training without likelihood estimation [19]
- The framework employs a multimodal large language model (MLLM) as the reward model, aligning the editor with human intent through implicit feedback (a toy judging loop is sketched after this summary) [19]
Group 3: Performance Metrics
- In benchmark tests, UniWorld-V2 scored 7.83 on GEdit-Bench, surpassing GPT-Image-1 (7.53) and Gemini 2.0 (6.32) [24]
- The model also led ImgEdit with a score of 4.49, outperforming all known models [24]
- The method significantly improved foundational models: FLUX.1-Kontext's score rose from 3.71 to 4.02, and Qwen-Image-Edit's from 4.35 to 4.48 [25]
Group 4: Generalization and User Preference
- UniWorld-R1 showed strong generalization, improving FLUX.1-Kontext's GEdit-Bench score from 6.00 to 6.74 [26]
- User-preference studies found that participants favored UniWorld-FLUX.1-Kontext for instruction alignment and editing ability, while the official model held a slight edge in image quality [27]
Group 5: Historical Context
- UniWorld-V2 builds on the earlier UniWorld-V1, the first unified understanding-and-generation model, released three months ahead of notable models such as Google's Nano Banana [29]
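The MLLM-as-reward-model idea can be sketched in a few lines. Everything below is a toy stand-in, not the UniWorld-R1 implementation: the "judge" here scores a caption of the edited image by word overlap with the instruction, where a real system would query a multimodal model, and scores are standardized within the candidate group before driving the RL update.

```python
# A toy stand-in for the MLLM-as-reward-model loop; not UniWorld-R1 code.
# A real judge would query a multimodal model with the instruction and the
# edited image; here word overlap against an image caption plays that role.
import numpy as np

def toy_mllm_judge(instruction: str, caption: str) -> float:
    # Hypothetical judge: count instruction words that appear in the caption.
    return float(len(set(instruction.lower().split()) &
                     set(caption.lower().split())))

def edit_rewards(instruction: str, candidate_captions) -> np.ndarray:
    # Standardize judge scores within the candidate group before the RL step,
    # so the editor is pushed toward edits the judge prefers.
    scores = np.array([toy_mllm_judge(instruction, c)
                       for c in candidate_captions])
    return (scores - scores.mean()) / (scores.std() + 1e-6)

captions = ["a red car at night relit scene", "a red car", "a blue car at noon"]
print(edit_rewards("relight the scene at night", captions))
```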
VLA Is Red-Hot — This One Survey Is All You Need
36Kr· 2025-10-31 08:22
Core Insights
- The article provides a comprehensive overview of the emerging field of Vision-Language-Action (VLA) models, highlighting the field's rapid growth and its significance for AI and robotics [1][5]
Summary by Sections
VLA Overview
- VLA submissions have risen dramatically, from single digits to 164 — roughly 18-fold growth [5]
- A model qualifies as VLA if it uses a backbone pre-trained on large-scale vision-language data, emphasizing language understanding, visual generalization, and task transfer [5][6]
Key Trends in VLA
- Trend 1: Efficient architecture paradigms. Discrete diffusion models are emerging as a new paradigm, generating action sequences in parallel for better efficiency and tighter integration of reasoning and action [7][10]
- Trend 2: Embodied Chain-of-Thought (ECoT). ECoT has models generate intermediate reasoning steps before acting, improving planning and interpretability, though it depends heavily on high-quality annotated data [11][12]
- Trend 3: Action tokenizers. An action tokenizer converts continuous robot actions into discrete tokens a VLM can process, bridging the robot's action space and the VLM's token space (a binning sketch follows this summary) [14][16]
- Trend 4: Reinforcement learning (RL). RL is being reintroduced to fine-tune VLA policies, addressing the limits of imitation learning in extreme scenarios, with notable recent successes [17][18]
- Trend 5: Efficiency optimization. Efforts are underway to reduce the hardware requirements of VLA models, making the field more accessible to smaller research labs [19]
- Trend 6: Video prediction for physical intuition. Video generation models bring an inherent grasp of temporal dynamics and physical laws, strengthening robot control [20][23]
- Trend 7: Realistic evaluation benchmarks. New evaluation frameworks aim to overcome the limitations of existing benchmarks and measure meaningful generalization [24][26]
Challenges and Future Directions
- The article highlights a "performance ceiling" in mainstream simulation evaluations: high scores do not necessarily translate into real-world capability [30]
- Two areas needing more attention are data quality and in-context learning, which could prove pivotal for advancing VLA research [31]
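The action-tokenizer idea from Trend 3 is simple enough to sketch directly: map each continuous action dimension into one of a fixed number of bins, so a language model can emit actions as tokens and a decoder can map them back to controls. The 256-bin count and the normalized [-1, 1] range below are common choices (e.g., RT-2-style binning) but are assumptions here, not specifics from the survey.

```python
# A minimal uniform-binning action tokenizer: continuous actions become
# discrete tokens a language model can emit, then decode back to controls.
import numpy as np

N_BINS = 256
LOW, HIGH = -1.0, 1.0  # assumed per-dimension normalized action range

def tokenize(action: np.ndarray) -> np.ndarray:
    clipped = np.clip(action, LOW, HIGH)
    return np.round((clipped - LOW) / (HIGH - LOW) * (N_BINS - 1)).astype(int)

def detokenize(tokens: np.ndarray) -> np.ndarray:
    return tokens / (N_BINS - 1) * (HIGH - LOW) + LOW

a = np.array([0.03, -0.41, 0.99])  # e.g., end-effector deltas
print(tokenize(a))                  # -> [131  75 254]
print(detokenize(tokenize(a)))      # round-trips within one bin width
```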
Nvidia May Put $1 Billion into This AI Coding Company, Plus a Case Study of AI Driving 100% Month-over-Month Growth in E-commerce Transactions
投资实习所· 2025-10-31 05:21
Core Viewpoint
- Poolside, founded by former GitHub CTO Jason Warner, aims to reach AGI through software development and positions OpenAI as its primary competitor — it is a foundation-model company, not merely an AI coding product [1][2]
Funding and Valuation
- In October of last year, Poolside closed a $500 million round with Nvidia participating, at a valuation of roughly $3 billion, to fund this larger vision [2]
Product Positioning
- Poolside's initial product is a generative AI programming platform that automates and augments software development, targeting enterprise clients with strict data security and privacy requirements, such as government and defense [2]
Vision for AGI
- By mid-2025, Poolside had publicly framed its broader vision of reaching AGI through software development, acknowledging the limits of merely scaling language models and emphasizing reinforcement learning (RL) as a key pathway [6]
Reinforcement Learning as a Key Component
- Poolside argues that RL is crucial because it lets models learn from new experience and real-world interaction, overcoming the limits of traditional large language models (LLMs) trained only on static text [7]
Software Engineering and AGI
- The company sees software engineering as a representative domain for general intelligence: it provides a rich environment for RL and a verifiable reward mechanism (a toy test-based reward is sketched after this summary). Building AGI, they argue, is about extracting human experience from existing, limited data rather than feeding ever more text into ever larger neural networks [11]
Energy System Analogy
- Poolside likens its AGI pathway to an energy system: "fusion reactors" extract energy from existing data, while "wind turbines" use RL to harvest fresh data generated through learning and exploration [11]
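What makes software engineering a "verifiable reward mechanism" is that candidate code can be executed against tests. Below is a toy version of that idea — the task format and pass-fraction reward are illustrative assumptions, not Poolside's actual pipeline.

```python
# A toy verifiable reward for code: run a candidate against unit tests
# and reward the pass fraction. Illustrative only.
def test_reward(candidate_fn, tests) -> float:
    passed = 0
    for args, expected in tests:
        try:
            if candidate_fn(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing candidate earns nothing for that test
    return passed / len(tests)

candidate = lambda x: x * 2                      # model-produced solution
tests = [((1,), 2), ((3,), 6), ((0,), 1)]        # last test is unsatisfiable
print("reward:", test_reward(candidate, tests))  # 0.666...
```

Because the reward comes from execution rather than human labels, it scales with the number of tasks and cannot be gamed by fluent-but-wrong text, which is the property the article credits for making this domain attractive for RL.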
VLA Is Red-Hot — This One Survey Is All You Need
量子位· 2025-10-31 04:09
Core Insights
- The article discusses the rapid growth and significance of the Vision-Language-Action (VLA) field, highlighting its potential to enable robots to understand human language, perceive the world, and perform tasks effectively [5][6]
Definition and Standards
- VLA models must use a backbone pre-trained on large-scale vision-language data to qualify as VLA, emphasizing language understanding, visual generalization, and task transfer [7][8]
- Models that merely combine separate visual and text encoders are classified as "multimodal policies," while Large Behavior Models (LBMs) refer to policies trained on extensive robot demonstration data [10][12]
Trends in VLA
- Trend 1: Efficient architecture paradigms. Discrete diffusion models generate action sequences in parallel, improving efficiency and performance [14][16]
- Trend 2: Embodied Chain-of-Thought (ECoT). ECoT has robots generate intermediate reasoning steps before executing actions, improving planning and interpretability (a toy parsing sketch follows this summary) [17][18][20]
- Trend 3: Action tokenization. Continuous robot actions are converted into discrete tokens that VLMs can process, improving efficiency and integrating reasoning with action [21][24]
- Trend 4: Reinforcement learning (RL). RL returns as a fine-tuning tool for VLA policies, addressing the limits of imitation learning in extreme scenarios [25][26]
- Trend 5: Efficiency optimization. Optimizing VLA models reduces cost and hardware requirements, opening the field to smaller research labs [27][28]
- Trend 6: Video prediction for physical intuition. Video generation models bring an inherent grasp of temporal dynamics and physical laws, strengthening robot control [29][35]
- Trend 7: Realistic evaluation benchmarks. New evaluation methods counter saturation in existing benchmarks, focusing on future-frame prediction and action-generation ability [36][39]
- Trend 8: Cross-modal learning. Architectural innovation is essential for universal robot policies that operate across different action spaces [40][42]
Challenges and Future Directions
- The article highlights a "performance ceiling" in mainstream simulation evaluations: high scores do not necessarily translate into real-world capability [43][44]
- Two areas needing more attention are data quality and in-context learning, which could prove pivotal for breakthroughs in VLA research [48][49]
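The ECoT pattern in Trend 2 amounts to a structured output: the policy emits a free-form reasoning segment, then action tokens, and only the action part is executed. The "ACTION:" delimiter and the command strings below are an assumed convention for illustration, not taken from any system cited in the survey.

```python
# A toy illustration of the ECoT split: reasoning first, then actions;
# only the action part is parsed out for execution.
def parse_ecot(output: str):
    reasoning, _, action = output.partition("ACTION:")
    return reasoning.strip(), action.strip()

model_output = ("The mug is left of the plate, so the gripper should move "
                "left before closing. ACTION: move_left 0.1; close_gripper")
reasoning, action = parse_ecot(model_output)
print("plan:   ", reasoning)
print("execute:", action)
```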
A New 76-Page Survey of Agentic AI
自动驾驶之心· 2025-10-28 00:03
Core Insights
- The article traces the evolution of Agentic AI from pipeline-based systems to model-native paradigms, in which reasoning, memory, and action capabilities are internalized within the model itself [1][44]
- Reinforcement learning (RL) is the driving force turning static models into adaptive, goal-oriented agents that learn from interaction with their environment [1][44]
Background
- Generative AI's rapid advances have centered on reactive outputs, lacking long-term reasoning and environmental interaction; the shift toward Agentic AI emphasizes three core capabilities: planning, tool usage, and memory [3]
- Early systems relied on pipeline paradigms in which these capabilities were externally orchestrated, yielding passive models that struggled in unexpected scenarios; the model-native paradigm instead integrates these capabilities directly into model parameters, enabling proactive decision-making [3][6]
Reinforcement Learning for LLMs
- The scarcity of process-level data and vulnerability to out-of-distribution scenarios necessitate outcome-driven RL to internalize planning and other capabilities, moving beyond prompt-induced behaviors [6][7]
- RL offers advantages over supervised fine-tuning (SFT) by enabling dynamic exploration and relative value learning, transforming models from passive imitators into active explorers [8][9]
Unified Paradigm and Algorithm Evolution
- Early RLHF methods excelled at single-turn alignment but struggled with long-horizon, multi-turn, sparse-reward settings; newer outcome-driven RL methods such as GRPO and DAPO improve training stability and efficiency [12]
- Algorithms evolve by leveraging foundation-model priors while refining capabilities through interaction and rewards in task environments [12]
Core Capabilities: Planning
- The pipeline paradigm treats planning as automated reasoning and action-sequence search, which is limited in flexibility and stability on complex tasks [14][15]
- The model-native paradigm integrates planning directly into model parameters, improving flexibility and robustness in open environments [15][18]
Core Capabilities: Tool Usage
- Early systems embedded models in fixed nodes, lacking flexibility; the model-native transition internalizes decisions about tool usage, forming a multi-objective decision problem [21][22]
- Challenges remain in credit assignment and environmental noise, which can destabilize training; modular training approaches aim to isolate execution noise and improve sample efficiency [22]
Core Capabilities: Memory
- Memory has evolved from external modules into an integral part of task execution, emphasizing action-oriented evidence governance [27][30]
- Short-term memory uses techniques such as sliding windows and retrieval-augmented generation (RAG), while long-term memory relies on external stores and parameter-level internalization (a toy sliding-window sketch follows this summary) [30]
Future Directions
- The trajectory of Agentic AI points to deeper integration between models and their environments, moving from systems designed to use intelligence to systems that grow intelligence through experience and collaboration [44]
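The short-term memory techniques named above can be sketched concretely: keep a sliding window of recent turns, evict older turns to an archive, and retrieve from the archive when building context (a stand-in for RAG). The word-overlap scoring below is a toy assumption; real systems would use embedding similarity and a proper vector store.

```python
# A toy sliding-window memory with naive retrieval over evicted turns.
from collections import deque

class AgentMemory:
    def __init__(self, window: int = 2):
        self.recent = deque(maxlen=window)  # short-term sliding window
        self.archive = []                   # evicted turns -> long-term store

    def add(self, turn: str) -> None:
        if len(self.recent) == self.recent.maxlen:
            self.archive.append(self.recent[0])  # about to be evicted
        self.recent.append(turn)

    def context(self, query: str, k: int = 2) -> list:
        # Retrieve the k archived turns sharing the most words with the query.
        q = set(query.lower().split())
        ranked = sorted(self.archive,
                        key=lambda t: len(q & set(t.lower().split())),
                        reverse=True)
        return ranked[:k] + list(self.recent)

mem = AgentMemory(window=2)
for turn in ["user likes hiking", "book a flight to Oslo",
             "prefers aisle seats", "asked about vegan meals"]:
    mem.add(turn)
print(mem.context("any seat preference for the flight"))
```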