Reinforcement Learning (RL)
After Reading 40 Papers on VLA+RL...
具身智能之心· 2025-11-28 00:04
Core Insights
- The article discusses the shift in research trends toward incorporating Reinforcement Learning (RL) into Vision-Language-Action (VLA) models, moving beyond Supervised Fine-Tuning (SFT) to enhance performance and adaptability [1][2].

Group 1: RL Methodologies
- RL methodologies are categorized as online RL, offline RL, iterative RL, and inference-time improvement, though the author stresses that how well these methods work matters more than how they are classified [1].
- Real-world applicability of RL is crucial, with safety and efficiency the key concerns during data collection and model deployment [2].

Group 2: Task Performance and Challenges
- Current RL implementations show promising single-task results; for example, Pi-star-0.6 needs around 1,000 trajectories for complex tasks such as folding clothes [3].
- A significant challenge remains in enabling RL to handle multiple tasks effectively, so that tasks reinforce rather than degrade one another [3].

Group 3: Reward Functions and Research Directions
- Whether reward or value functions must be learned is debated; reduced variance during optimization is a key benefit, though the need may diminish as pre-trained VLA models improve [4][5].
- Identified research directions focus on sparse rewards, the scale of policy networks, and the multi-task capabilities of RL [5].

Group 4: Literature and Keywords
- A list of relevant literature and keywords is provided for further exploration, indicating a rich field of study at the intersection of RL and VLA [6].
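To make the "inference-time improvement" category above concrete, here is a minimal best-of-N sketch: sample several candidate actions from a frozen policy and execute the one a learned value function scores highest. The policy and value function below are toy scalar stand-ins, not taken from any of the surveyed papers.

```python
import random

random.seed(0)  # deterministic toy run

def best_of_n(policy_sample, value_fn, obs, n=8):
    """Inference-time improvement: draw n candidate actions from a
    frozen policy, keep the one the value function prefers."""
    candidates = [policy_sample(obs) for _ in range(n)]
    return max(candidates, key=lambda a: value_fn(obs, a))

# Toy stand-ins: the "policy" proposes noisy scalar actions around 0.5;
# the "value function" prefers actions near an ideal target of 0.7.
policy_sample = lambda obs: random.gauss(0.5, 0.2)
value_fn = lambda obs, a: -abs(a - 0.7)

action = best_of_n(policy_sample, value_fn, obs=None, n=64)
```

With more samples, the selected action concentrates around what the value function prefers, which is one reason the review's point about reduced variance from a learned value function matters.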
On the Same Page as Ilya: AI Heavyweight Leaves Musk's xAI to Build AI That "Empathizes"
Sohu Caijing · 2025-11-26 10:48
Core Insights
- A new AI startup, Humans&, is seeking to raise $1 billion at a target valuation of $4 billion; it was founded by Eric Zelikman, a former researcher at xAI [2][12].
- Zelikman aims to develop AI models that learn user behavior and empathize with users, addressing the limitations of current reinforcement learning paradigms [2][17].
- The startup's mission is to create AI that better understands human goals and emotions, moving beyond traditional task-oriented models [12][20].

Company Overview
- Humans& was co-founded by Eric Zelikman, who previously worked at xAI and contributed to the development of significant AI models [4][6].
- The company is currently recruiting technical staff, offering competitive salaries starting at $350,000 annually [18].

Technology and Innovation
- Zelikman developed the STaR algorithm, which enhances language models' reasoning capabilities, and has been recognized at top AI conferences [11][12].
- Humans& focuses on creating AI that can collaborate with humans and understand diverse human aspirations and values [17][20].

Market Context
- The AI industry is shifting toward models that possess emotional as well as raw intelligence, reflecting a growing demand for more human-like interactions [20].
Ilya Speaks Out: The Scaling Era Is Over, and He No Longer "Feels the AGI"
36Kr · 2025-11-26 06:54
Core Insights
- The era of Scaling has ended, and the industry is transitioning into a Research Era [1][3][14].
- Current AI models, despite their improvements, lack the generalization capabilities necessary for achieving Artificial General Intelligence (AGI) [3][5][8].
- The disconnect between AI model performance on benchmarks and in real-world applications is a significant issue [5][6][8].

Summary by Sections

Transition from Scaling to Research Era
- Ilya Sutskever emphasizes that the AI community is moving from a focus on scaling models to a renewed emphasis on research and innovation [1][3][14].
- The Scaling Era, characterized by ever more data, parameters, and computational power, has reached its limits, necessitating a shift in approach [12][14][15].

Limitations of Current AI Models
- Despite advancements, current models exhibit poor generalization compared to human intelligence, failing to develop true problem-solving intuition [3][5][8].
- Reinforcement Learning (RL) training often over-optimizes for specific benchmarks at the expense of the model's overall performance [3][5][6].

Importance of Human-Like Learning
- Ilya argues that human learning is driven by an intrinsic "value function" that AI currently lacks, leading to less effective decision-making [10][11][12].
- Incorporating human-like judgment and intuition is highlighted as essential for future advances [15][18].

Future of AI and AGI
- Predictions suggest that Superintelligent AI (ASI) could emerge within 5 to 20 years, but its development must be approached cautiously [19][51].
- The concept of AGI is redefined around continuous learning rather than a static state of intelligence [28][30][51].

Role of Research and Innovation
- The industry is expected to see a resurgence of smaller, innovative projects that can lead to significant breakthroughs, moving away from the trend of ever-larger models [16][18].
- Ilya suggests that the next major paradigm shift may come from seemingly modest experiments rather than grand scaling efforts [18][19].

Collaboration and Safety in AI Development
- As AI capabilities grow, collaboration among companies and regulatory bodies will become increasingly important to ensure safety and ethical considerations [43][44].
- A robustly aligned AI that cares for sentient life is emphasized as the preferable direction for future development [48][49].
In Conversation with Chen Kaijie: Building Your Personal Agent, and Your "High-EQ Agent" | NEXTA Innovation Night Talk
36Kr · 2025-11-19 07:33
Core Insights
- The article discusses the evolution of AI from a "scaling law" approach to an "era of experience," emphasizing the need for AI to learn from real user interactions rather than relying only on large datasets [1][5][6].
- Macaron AI, founded by Chen Kaijie, aims to create a "Personal Agent" that understands users' needs and emotions, moving beyond traditional chatbots [1][2].

Group 1: Transition from Scaling Law to Experience Era
- The AI industry is shifting from relying solely on increasing parameters and data to learning from real user experience [1][6].
- The "Chinchilla Law" indicates that as model parameters increase, the required data also increases, but available data is limited, creating a bottleneck in model intelligence [4][6].
- The future competitiveness of intelligent systems will depend on their ability to learn continuously from real experience rather than only from pre-training data [6][7].

Group 2: Reinforcement Learning and Real Feedback
- Reinforcement learning (RL) is central to this approach: real interactions provide high-quality data that carries causal relationships [2][7].
- The success of the AI code assistant Cursor illustrates how analyzing user feedback on code suggestions can enhance model performance [2][8].
- A robust "Reward Model" evaluates user satisfaction and guides the AI in improving its responses, making the learning process more effective [9][10].

Group 3: Macaron AI's Unique Features
- Macaron AI has created over 100,000 personalized "mini-apps" for various life scenarios, positioning itself as a private, dedicated assistant [3][11].
- Its memory system is integrated into the model, learning and adapting from user feedback rather than relying on traditional keyword search [2][11].
- Ant Group's open-source Text Diffusion technology lets the model generate and modify content quickly, contributing to a better user experience [12].

Group 4: Future of Personal Agents
- The vision for personal agents includes managing many aspects of daily life, such as scheduling, travel, and shopping, potentially replacing many existing applications [16].
- Integrating mini-apps and memory functions is a long-term goal aimed at a seamless user experience [15].
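The "Reward Model" idea described above can be sketched as a scalar scorer over implicit user-feedback signals that then ranks candidate replies. Everything below (feature names, weights, candidates) is illustrative, not Macaron AI's actual implementation; real systems learn such weights from logged interactions.

```python
# Hypothetical feedback features and hand-set weights for demonstration.
FEATURE_WEIGHTS = {
    "accepted_suggestion": 1.0,   # user kept the suggestion
    "edited_heavily": -0.5,       # user rewrote most of it
    "follow_up_thanks": 0.3,      # positive follow-up message
}

def reward(feedback: dict) -> float:
    """Scalar reward model: weighted sum of observed feedback signals."""
    return sum(FEATURE_WEIGHTS[k] * v for k, v in feedback.items())

def pick_reply(candidates):
    """Show the candidate whose predicted feedback scores highest."""
    return max(candidates, key=lambda c: reward(c["predicted_feedback"]))

candidates = [
    {"text": "terse answer",
     "predicted_feedback": {"accepted_suggestion": 0.4, "edited_heavily": 0.8}},
    {"text": "tailored answer",
     "predicted_feedback": {"accepted_suggestion": 0.9, "follow_up_thanks": 0.6}},
]
best = pick_reply(candidates)
```

The point is the causal structure: each feature records what the user actually did in response to the agent's output, which is exactly the kind of signal static pre-training data lacks.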
Tsinghua Team: A New Baseline for 1.5B Models! Top-Tier Performance from the "Dumbest" RL Recipe
机器之心· 2025-11-12 23:51
Core Insights
- The article presents a groundbreaking approach to reinforcement learning (RL) that achieves state-of-the-art (SOTA) performance with a simple, single-stage training method and fixed hyperparameters, cutting compute by 50% [4][14][15].
- The findings suggest that a well-scaled, simple baseline can be more powerful than previously thought, challenging the complexity often associated with advanced RL techniques [4][15][27].

Background and Context
- The research is set against a "technical arms race" in RL training of small models, with methods evolving rapidly over a few months [6].
- Earlier approaches relied on hyperparameter tuning, multi-stage progressive training, and curriculum learning, producing increasingly complex training pipelines [6][8].

Methodology
- The JustRL approach emphasizes simplicity: standard GRPO without modifications, a single continuous training phase, and fixed hyperparameters [11].
- The training data consists of ordinary math problem sets without offline difficulty screening or data augmentation, and the recipe works across different model baselines [11][14].

Performance Metrics
- JustRL-DeepSeek-1.5B achieved an average accuracy of 54.87% across nine benchmarks, outperforming ProRL-V2, which used a nine-stage training pipeline [14].
- JustRL-Nemotron-1.5B reached an average accuracy of 64.32%, slightly surpassing QuestA while using significantly fewer tokens [14][15].

Training Dynamics
- Training of JustRL-DeepSeek-1.5B was notably stable: key metrics such as policy entropy and average reward fluctuated healthily, without typical issues like exploration collapse or premature convergence [17][19].
- Training ran on 32 A800-80GB GPUs for approximately 15 days, with far less engineering complexity and computational overhead than multi-stage methods [15].

Key Discoveries
- Adding certain "optimizations" could actually worsen performance, indicating that not every seemingly beneficial technique is necessary [21][24].
- The findings underscore the importance of a clear, simple baseline for accurately assessing the value of complex RL training techniques [27].

Philosophical Implications
- The article concludes with a reflection on the value of simplicity in technology: adequately scaled, simpler methods often yield sufficient results [26][27][28].
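The "standard GRPO" at the heart of JustRL centers on a group-relative advantage: several responses to the same prompt are rewarded (here, 1 for a correct final answer, 0 otherwise, a verifiable sparse reward) and each response's advantage is its reward standardized within the group. A minimal sketch of that computation:

```python
import statistics

def grpo_advantages(rewards):
    """Core of GRPO: for a group of responses to the same prompt, the
    advantage of each response is its reward standardized against the
    group mean and standard deviation."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard: all-equal group
    return [(r - mu) / sigma for r in rewards]

# 8 sampled solutions to one math problem, rewarded 1 if the final
# answer is correct and 0 otherwise.
rewards = [1, 0, 0, 1, 0, 0, 0, 0]
advs = grpo_advantages(rewards)
```

Correct responses get positive advantages and incorrect ones negative, with no separately learned value network, which is what keeps the recipe so simple.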
ICCV 2025 Highlight | UnrealZoo: A Large-Scale Embodied Simulation Platform
机器之心· 2025-11-11 17:11
Core Insights
- UnrealZoo is a high-fidelity virtual environment platform designed to advance embodied AI research, providing over 100 diverse, realistic 3D scenes for a wide range of research needs [2][5][9].
- The platform received a Highlight Award at ICCV 2025, indicating its significance in the field [2].

Group 1: Platform Features
- UnrealZoo includes more than 100 high-quality, realistic scenes, from indoor settings to urban landscapes and natural environments, supporting a wide range of research applications [5][13].
- It offers 66 customizable embodied entities, including humans, animals, vehicles, and drones, which can interact with both the environment and other agents [5][24].
- An easy-to-use Python interface and tools for data collection, environment enhancement, and distributed training optimize rendering and communication efficiency [7][15][42].

Group 2: Research Implications
- The platform addresses the limitations of existing simulators by offering diverse, high-fidelity environments that improve the adaptability and generalization of embodied agents in complex, dynamic settings [8][9].
- Experiments with UnrealZoo demonstrate the importance of environmental diversity for the generalization and robustness of agents, particularly in navigation and social-interaction tasks [64][55].
- The research highlights the difficulties current reinforcement learning and vision-language-model-based agents face in open-world scenarios, underscoring the need for further development in these areas [8][64].

Group 3: Future Directions
- Future work will expand the variety of scenes, entities, and interaction tasks within UnrealZoo to further support embodied AI in real-world scenarios [64].
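Platforms like this typically expose a gym-style reset/step loop through their Python interface. The sketch below shows that interaction pattern with a toy stand-in environment; `ToyEnv` is not the actual UnrealZoo API, and real environment IDs and observation formats should be taken from the project's documentation.

```python
class ToyEnv:
    """Stand-in environment: the agent must walk forward to x = 10."""
    def reset(self):
        self.x = 0
        return self.x  # observation

    def step(self, action):
        self.x += action                  # action: +1 forward, -1 back
        done = self.x >= 10
        reward = 1.0 if done else -0.01   # sparse goal + step penalty
        return self.x, reward, done

def rollout(env, policy, max_steps=100):
    """Standard episode loop: reset, then step until done or timeout."""
    obs, total, done = env.reset(), 0.0, False
    for _ in range(max_steps):
        obs, r, done = env.step(policy(obs))
        total += r
        if done:
            break
    return total, done

total, done = rollout(ToyEnv(), policy=lambda obs: 1)
```

Distributed training then amounts to running many such rollout loops in parallel across scenes, which is where the platform's rendering and communication optimizations matter.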
Better at Chinese and Detail Control than Nano Banana! 兔展 & Peking University's UniWorld-V2 Sets a New SOTA
量子位· 2025-11-05 05:39
Core Viewpoint
- The article introduces UniWorld-V2, a new image editing model that excels at fine detail and at understanding Chinese-language instructions, outperforming previous models like Nano Banana [1][4][6].

Group 1: Model Features
- UniWorld-V2 demonstrates superior fine-grained control in image editing, achieving results that surpass those of SFT models [11].
- The model accurately interprets complex Chinese characters and phrases, showcasing its proficiency in rendering artistic fonts [11].
- Users can specify editing regions with bounding boxes, enabling precise operations such as moving objects out of a designated area [14].
- It effectively understands commands such as "re-light the scene," integrating objects naturally into the environment with coherent light and shadow [15].

Group 2: Technical Innovations
- The core innovation behind UniWorld-V2 is the UniWorld-R1 framework, which applies reinforcement learning (RL) strategies to image editing [18].
- UniWorld-R1 is the first unified architecture based on RL for this setting, using Diffusion Negative-aware Finetuning (DiffusionNFT) for efficient training without likelihood estimation [19].
- The framework employs a multimodal large language model (MLLM) as a reward model, aligning the editor with human intent through implicit feedback [19].

Group 3: Performance Metrics
- On GEdit-Bench, UniWorld-V2 scored 7.83, surpassing GPT-Image-1 (7.53) and Gemini 2.0 (6.32) [24].
- It also led ImgEdit with a score of 4.49, outperforming all known models [24].
- The method significantly improved the base models as well: FLUX.1-Kontext rose from 3.71 to 4.02, and Qwen-Image-Edit from 4.35 to 4.48 [25].

Group 4: Generalization and User Preference
- UniWorld-R1 demonstrated strong generalization, lifting FLUX.1-Kontext from 6.00 to 6.74 on GEdit-Bench [26].
- User-preference studies favored UniWorld-FLUX.1-Kontext for its superior instruction alignment and editing capabilities, despite a slight edge in image quality for the official model [27].

Group 5: Historical Context
- UniWorld-V2 builds on the earlier UniWorld-V1, the first unified understanding-and-generation model, released three months ahead of notable models like Google's Nano Banana [29].
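One common way to use an MLLM as a reward model, in the spirit of the implicit feedback described above, is to ask the judge whether the edit follows the instruction and take the probability it assigns to "yes" as a scalar reward. The judge below is a hypothetical string-matching stub standing in for a real multimodal model, not the UniWorld-R1 reward model itself:

```python
import math

def mllm_yes_logit(instruction, edited_image):
    # Toy judge: higher logit when the edit description mentions the
    # instructed change (a real MLLM would inspect the pixels).
    return 2.0 if instruction in edited_image else -2.0

def edit_reward(instruction, edited_image):
    """Sigmoid of the judge's 'yes' logit -> scalar reward in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-mllm_yes_logit(instruction, edited_image)))

r_good = edit_reward("remove the car", "street scene, remove the car applied")
r_bad = edit_reward("remove the car", "street scene, unchanged")
```

Because the reward is a smooth scalar rather than a hard pass/fail label, it can drive policy-gradient-style fine-tuning of the editor without likelihood estimation on the judge's side.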
The Red-Hot VLA Field: This One Survey Is All You Need
36Kr · 2025-10-31 08:22
Core Insights
- The article provides a comprehensive overview of the emerging Vision-Language-Action (VLA) field, highlighting its rapid growth and significance in AI and robotics [1][5].

Summary by Sections

VLA Overview
- Submissions of VLA models have risen dramatically, from single digits to 164, an 18-fold increase [5].
- A model qualifies as VLA if it uses a backbone pre-trained on large-scale vision-language data, emphasizing language understanding, visual generalization, and task transfer [5][6].

Key Trends in VLA
- **Trend 1: Efficient Architecture Paradigm.** Discrete diffusion models are emerging as a new paradigm, enabling parallel generation of action sequences and integrating reasoning with actions [7][10].
- **Trend 2: Embodied Chain-of-Thought (ECoT).** ECoT generates intermediate reasoning steps before actions, improving planning and interpretability, though it relies heavily on high-quality annotated data [11][12].
- **Trend 3: Action Tokenizer.** The action tokenizer converts continuous robot actions into discrete tokens that VLMs can process, bridging the gap between the robot's actions and the VLM's representations [14][16].
- **Trend 4: Reinforcement Learning (RL).** RL is reintroduced to fine-tune VLA policies, addressing the limitations of imitation learning in extreme scenarios, with notable recent successes [17][18].
- **Trend 5: Efficiency Optimization.** Efforts to reduce the hardware requirements of VLA models are making the field more accessible to smaller research labs [19].
- **Trend 6: Video Prediction for Physical Intuition.** Video generation models carry an inherent understanding of temporal dynamics and physical laws, enhancing robot control [20][23].
- **Trend 7: Realistic Evaluation Benchmarks.** New evaluation frameworks aim to overcome the limitations of existing benchmarks, focusing on meaningful generalization capabilities [24][26].

Challenges and Future Directions
- The article highlights a "performance ceiling" in mainstream simulation evaluations: high scores do not necessarily translate into real-world capability [30].
- Two areas needing more attention are data quality and in-context learning, which could be pivotal for advancing VLA research [31].
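The action tokenizer in Trend 3 can be as simple as uniform binning: quantize each (normalized) action dimension into one of N discrete tokens the VLM can emit, then decode a token back to its bin center. Real tokenizers are learned and more elaborate; this sketch only shows the interface, with the action range as an assumption:

```python
N_BINS = 256
LOW, HIGH = -1.0, 1.0  # assumed normalized range for one action dimension

def tokenize(a: float) -> int:
    """Map a continuous action value to a discrete bin index."""
    a = min(max(a, LOW), HIGH)
    t = int((a - LOW) / (HIGH - LOW) * N_BINS)
    return min(t, N_BINS - 1)  # map a == HIGH into the last bin

def detokenize(t: int) -> float:
    """Map a bin index back to the center of its bin."""
    width = (HIGH - LOW) / N_BINS
    return LOW + (t + 0.5) * width

a = 0.37
round_trip = detokenize(tokenize(a))
```

The round-trip error is bounded by half a bin width, which is the resolution/vocabulary-size trade-off any action tokenizer has to make.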
Nvidia May Invest $1 Billion in This AI Coding Company, Plus a Case Study of AI Driving 100% Monthly Growth in E-commerce Transactions
投资实习所· 2025-10-31 05:21
Core Viewpoint
- Poolside, founded by former GitHub CTO Jason Warner, aims to achieve AGI through software development and positions OpenAI as its primary competitor, indicating that it is a foundation-model company, not merely an AI coding product [1][2].

Funding and Valuation
- In October of last year, Poolside secured $500 million in a new funding round, with Nvidia participating, at a valuation of approximately $3 billion; the funding is aimed at realizing a larger vision [2].

Product Positioning
- Poolside's initial product focus is a generative AI programming platform that automates and enhances software development, targeting enterprise clients with strict data-security and privacy requirements, such as government and defense [2].

Vision for AGI
- By mid-2025, Poolside had publicly announced its broader vision of achieving AGI through software development, recognizing the limits of merely scaling language models and emphasizing reinforcement learning (RL) as a key pathway [6].

Reinforcement Learning as a Key Component
- Poolside believes RL is crucial because it lets models learn from new experience and real-world interaction, overcoming the limitations of traditional large language models (LLMs) trained only on static text data [7].

Software Engineering and AGI
- The company views software engineering as a representative domain for general intelligence, offering a rich environment for RL and a verifiable reward mechanism. Building AGI, they argue, is about extracting human experience from existing, limited data rather than feeding ever more text into ever larger neural networks [11].

Energy System Analogy
- Poolside likens its AGI pathway to an "energy system": "fusion reactors" extract energy from existing data, while "wind turbines" use RL to gather fresh data generated through learning and exploration [11].
The Red-Hot VLA Field: This One Survey Is All You Need
量子位· 2025-10-31 04:09
Core Insights
- The article discusses the rapid growth and significance of the Vision-Language-Action (VLA) field, highlighting its potential to let robots understand human language, perceive the world, and perform tasks effectively [5][6].

Definition and Standards
- A model must use a backbone pre-trained on large-scale vision-language data to qualify as VLA, emphasizing language understanding, visual generalization, and task transfer [7][8].
- Models that merely combine separate vision and text encoders are classified as "Multimodal Policies," while Large Behavior Models (LBMs) denote policies trained on extensive robot demonstration data [10][12].

Trends in VLA
- **Trend 1: Efficient Architecture Paradigms.** Discrete diffusion models enable parallel generation of action sequences, improving efficiency and performance [14][16].
- **Trend 2: Embodied Chain-of-Thought (ECoT).** ECoT enhances robot intelligence by generating intermediate reasoning steps before executing actions, improving planning and interpretability [17][18][20].
- **Trend 3: Action Tokenization.** Continuous robot actions are converted into discrete tokens that VLMs can process, enhancing efficiency and integrating reasoning with actions [21][24].
- **Trend 4: Reinforcement Learning (RL).** RL is reintroduced as a fine-tuning tool for VLA policies, addressing the limitations of imitation learning in extreme scenarios [25][26].
- **Trend 5: Efficiency Optimization.** Optimization efforts aim to reduce costs and hardware requirements, making the field more accessible to smaller research labs [27][28].
- **Trend 6: Video Prediction for Physical Intuition.** Video generation models carry an inherent understanding of temporal dynamics and physical laws, enhancing robot control [29][35].
- **Trend 7: Realistic Evaluation Benchmarks.** New evaluation methods counter saturation in existing benchmarks, focusing on future-frame prediction and action generation [36][39].
- **Trend 8: Cross-Modal Learning.** Architectural innovation is essential for universal robot policies that operate across different action spaces [40][42].

Challenges and Future Directions
- The article highlights a "performance ceiling" in mainstream simulation evaluations: high scores do not necessarily translate into real-world capability [43][44].
- Two areas needing more attention are data quality and in-context learning, which could be pivotal for breakthroughs in VLA research [48][49].
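The parallel generation in Trend 1 can be illustrated with a toy masked-decoding loop in the style of discrete diffusion: start from a fully masked action sequence and commit several high-confidence positions per round, rather than emitting one token at a time. The confidence scorer below is a stub, and a real model would predict tokens instead of revealing a fixed target; only the decoding schedule is the point.

```python
import random

MASK = -1  # placeholder token for not-yet-generated positions

def confidence(seq, i):
    # Stub scorer: confidence grows as neighbors of position i commit.
    committed = sum(seq[j] != MASK for j in (i - 1, i + 1) if 0 <= j < len(seq))
    return committed + random.random()  # random tie-break

def parallel_decode(target, rounds=4):
    """Fill in `target` a few positions per round, most confident first."""
    seq = [MASK] * len(target)
    per_round = max(1, len(seq) // rounds)
    while MASK in seq:
        masked = [i for i, t in enumerate(seq) if t == MASK]
        masked.sort(key=lambda i: confidence(seq, i), reverse=True)
        for i in masked[:per_round]:   # commit several positions at once
            seq[i] = target[i]
    return seq

target = [3, 1, 4, 1, 5, 9, 2, 6]
decoded = parallel_decode(target)
```

Compared with left-to-right decoding, the number of decoding rounds scales with `rounds` rather than sequence length, which is where the efficiency claim for action chunks comes from.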