Reinforcement Learning
The missing piece for unified understanding and generation? Tencent releases X-Omni: reinforcement learning revives discrete autoregressive generation and renders long-text images with ease
机器之心· 2025-08-10 04:31
Core Insights
- The article discusses advancements in image generation technology, focusing on the X-Omni model developed by Tencent's team, which significantly enhances the quality of autoregressive image generation through reinforcement learning [2][4][5].

Group 1: Model Development
- The X-Omni model uses reinforcement learning to improve the aesthetic quality of generated images and their adherence to complex instructions, showing superior performance in rendering long texts [5][6].
- The model architecture is based on discrete tokens and employs a diffusion decoder to generate images, allowing a unified approach to visual understanding and generation [6][11].

Group 2: Reinforcement Learning Approach
- The reinforcement learning process incorporates a comprehensive reward model that evaluates image generation quality along multiple dimensions, including human aesthetic preference and text-image semantic alignment (a minimal sketch of such a composite reward follows this summary) [9][12].
- The GRPO reinforcement learning method enhances the model's image generation capabilities, demonstrating that RL optimization surpasses traditional supervised fine-tuning [8][19].

Group 3: Performance Evaluation
- X-Omni outperforms existing models across benchmarks, achieving high scores in both text rendering and instruction following, with text-rendering scores of 0.901 in English and 0.895 in Chinese [13][14].
- In instruction-following assessments, X-Omni achieved an overall score of 87.65, indicating its effectiveness in understanding and executing complex prompts [14].

Group 4: Unique Findings
- Unlike traditional autoregressive models that rely heavily on classifier-free guidance (CFG) to improve generation quality, X-Omni produces high-quality images without CFG, demonstrating tight integration between the visual and language generation mechanisms [17].
- The research highlights the advantages of reinforcement learning in image generation, providing more comprehensive and efficient optimization signals than conventional methods [19].
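The summary describes a reward model that scores generated images along several dimensions and feeds the combined signal into GRPO-style optimization. The sketch below shows one plausible way to wire a composite reward to group-relative advantages; the scorer functions, weights, and names are hypothetical illustrations, not the X-Omni implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def composite_reward(image, prompt, scorers, weights):
    """Weighted sum of per-dimension scores in [0, 1].

    `scorers` maps a dimension name to a callable(image, prompt) -> float.
    Every scorer here is a hypothetical stand-in (e.g. a learned aesthetic
    predictor, a CLIP-style alignment model, an OCR check on rendered text).
    """
    return sum(weights[name] * fn(image, prompt) for name, fn in scorers.items())

def grpo_advantages(rewards):
    """Group-relative advantages: rewards for one prompt's group of sampled
    images are normalized against the group mean and std, as GRPO does."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# Dummy scorers that just draw plausible scores; real ones would inspect the image.
scorers = {
    "aesthetic": lambda img, p: float(rng.uniform(0.4, 1.0)),
    "alignment": lambda img, p: float(rng.uniform(0.4, 1.0)),
    "ocr_text":  lambda img, p: float(rng.uniform(0.4, 1.0)),
}
weights = {"aesthetic": 0.3, "alignment": 0.4, "ocr_text": 0.3}

group = [f"image_{i}" for i in range(4)]  # four images sampled for one prompt
rewards = [composite_reward(img, "a poster with a long slogan", scorers, weights)
           for img in group]
print(grpo_advantages(rewards))  # advantages used to weight the policy update
```

The weighting between aesthetics, alignment, and text accuracy is a design choice; the article only states that multiple dimensions are combined.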
Two-stage SOTA! HKUST's FiM: rethinking trajectory prediction from a planning perspective
自动驾驶之心· 2025-08-09 16:03
Core Insights
- The article presents a novel approach to trajectory prediction in autonomous driving, emphasizing a "First Reasoning, Then Forecasting" strategy that integrates intention reasoning to enhance prediction accuracy and reliability [2][4][48].

Group 1: Methodology
- The proposed method introduces an intention reasoner based on a query-centric Inverse Reinforcement Learning (IRL) framework, which captures the behavior and intentions of traffic participants in a compact representation [2][6][48].
- A bidirectional selective state-space model (Bi-Mamba) is developed to improve trajectory decoding, effectively capturing the sequential dependencies of trajectory states (a rough sketch of the bidirectional decoding pattern follows this summary) [7][9][48].
- The framework uses a grid-level graph to represent the driving context, allowing efficient modeling of participant behavior and intentions [5][6][20].

Group 2: Experimental Results
- Extensive experiments on large datasets such as Argoverse and nuScenes show that the method significantly enhances prediction confidence and achieves competitive performance against state-of-the-art models [9][34][38].
- On the Argoverse 1 dataset, the proposed method (FiM) outperformed several strong baselines on key metrics such as Brier score and minFDE6, indicating robust predictive capability [34][35].
- Results on Argoverse 2 further validate the intention reasoning strategy, showing that longer-term intention supervision improves prediction reliability [36][37].

Group 3: Challenges and Innovations
- The article highlights the inherent difficulty of modeling intentions in complex driving scenarios and advocates the use of large reasoning models (LRMs) to enhance intention inference [5][6][12].
- A dense occupancy grid map (OGM) prediction head is introduced to model future interactions among participants, improving overall prediction performance [7][25][41].
- The study emphasizes the importance of intention reasoning in motion prediction, establishing a promising baseline for future research in trajectory prediction [48].
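The Bi-Mamba decoder described above processes the trajectory sequence in both directions so each future step is conditioned on context from both sides. The PyTorch sketch below illustrates only that bidirectional fusion pattern; it substitutes a GRU for the selective state-space (Mamba) block to stay self-contained, and all layer sizes and names are assumptions rather than the FiM architecture.

```python
import torch
import torch.nn as nn

class BiDirectionalTrajectoryDecoder(nn.Module):
    """Decode future trajectory states with a forward and a backward pass.

    A GRU stands in for the selective state-space (Mamba) block so the
    sketch runs without extra dependencies; the fusion pattern is the point.
    """

    def __init__(self, d_model=128, horizon=30, out_dim=2):
        super().__init__()
        self.fwd = nn.GRU(d_model, d_model, batch_first=True)   # forward scan
        self.bwd = nn.GRU(d_model, d_model, batch_first=True)   # backward scan
        self.fuse = nn.Linear(2 * d_model, d_model)
        self.head = nn.Linear(d_model, out_dim)                 # (x, y) per step
        self.horizon = horizon

    def forward(self, intent_query):
        # intent_query: [B, d_model] embedding produced by the intention reasoner.
        B, d = intent_query.shape
        tokens = intent_query.unsqueeze(1).expand(B, self.horizon, d)  # seed each step
        h_fwd, _ = self.fwd(tokens)                       # left-to-right context
        h_bwd, _ = self.bwd(torch.flip(tokens, dims=[1]))
        h_bwd = torch.flip(h_bwd, dims=[1])               # re-align backward context
        h = self.fuse(torch.cat([h_fwd, h_bwd], dim=-1))
        return self.head(h)                               # [B, horizon, 2] waypoints

decoder = BiDirectionalTrajectoryDecoder()
waypoints = decoder(torch.randn(8, 128))
print(waypoints.shape)  # torch.Size([8, 30, 2])
```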
The largest high-quality post-training dataset for scientific reasoning to date is open-sourced, quickly turning Qwen3 and other models into "scientists"
量子位· 2025-08-09 07:01
Core Viewpoint
- The release of MegaScience, a large-scale open-source dataset for scientific reasoning, aims to improve the training and evaluation of general artificial intelligence systems in scientific domains, addressing the shortage of high-quality training data for scientific reasoning tasks [1][9][15].

Group 1: Dataset Overview
- MegaScience consists of approximately 1.25 million question-answer pairs across disciplines including biology, chemistry, computer science, economics, mathematics, medicine, and physics [1][15].
- The dataset was downloaded over 4,600 times within a week of release and ranked fourth on the HuggingFace Datasets Trending list, indicating strong interest from academic and industrial research communities [7].

Group 2: Performance and Evaluation
- Models trained on MegaScience significantly outperform the corresponding official Instruct models on scientific reasoning tasks, demonstrating the dataset's effectiveness [3][16].
- The dataset scales well: performance gains become more pronounced as the size of the base model increases [3][16].

Group 3: Challenges Addressed
- Existing scientific reasoning datasets suffer from unreliable benchmark evaluations, inadequate decontamination, low-quality reference answers, and superficial knowledge distillation [10][11][13].
- MegaScience addresses these challenges through a systematic approach, including a comprehensive scientific reasoning evaluation framework and rigorous data decontamination (a generic decontamination sketch follows this summary) [13][15].

Group 4: Data Construction Process
- Construction involved collecting data from multiple public datasets, applying deduplication and decontamination strategies, and using several data selection techniques to ensure high-quality outputs [27][28][30].
- The TextbookReasoning dataset, a component of MegaScience, was created with a fully automated pipeline that extracted and refined question-answer pairs from approximately 120,000 university-level textbooks [14][19][20].

Group 5: Evaluation Framework
- The evaluation framework includes 15 representative benchmark tasks designed to comprehensively assess the scientific reasoning capabilities of language models [37][39].
- It optimizes the answer extraction process to improve the accuracy of evaluation results and ensure fair comparison between models [39][41].

Group 6: Future Prospects
- Future research may explore combining reinforcement learning with MegaScience to further enhance scientific reasoning, leveraging the dataset's high-quality reference answers [47][48].
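The decontamination step mentioned above removes training questions that overlap with evaluation benchmarks. A common generic technique is word n-gram overlap matching; the sketch below shows that approach under assumed thresholds and is not the MegaScience pipeline itself.

```python
def ngrams(text, n=10):
    """Lower-cased word n-grams of a question string."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def decontaminate(train_questions, benchmark_questions, n=10):
    """Drop training questions sharing any word n-gram with a benchmark item.

    The n-gram length and exact-match criterion are illustrative choices; real
    pipelines often add fuzzy or embedding-based matching on top.
    """
    banned = set()
    for q in benchmark_questions:
        banned |= ngrams(q, n)
    return [q for q in train_questions if not (ngrams(q, n) & banned)]

train = ["What is the integral of x squared over the interval zero to one "
         "expressed as a fraction in lowest terms?"]
bench = ["Compute the integral of x squared over the interval zero to one "
         "expressed as a fraction in lowest terms."]
print(len(decontaminate(train, bench)))  # 0: the overlapping question is removed
```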
How substantive is Li Auto's VLA? An analysis and predictions of its key iteration directions
理想TOP2· 2025-08-09 06:18
Core Viewpoint
- The article emphasizes the innovative capabilities of Li Auto's VLA (Vision-Language-Action) model and its potential to significantly advance autonomous driving through tight integration of AI software and hardware, led by the company's founder, Li Xiang [2][3][4].

Group 1: Innovation and Technology
- Li Auto's VLA represents significant innovation at the MoE (Mixture of Experts) level, with a focus on original architecture and execution, drawing on contributions from across the AI community [2].
- The integration of AI software with hardware has reached an industry-leading level, with a clear distinction between the rapid iteration of software and the slower evolution of hardware [3].
- The core of Li Auto's VLA is based on reinforcement learning, which enables a more effective learning process than traditional imitation learning and enhances the vehicle's decision-making capabilities [9][10].

Group 2: Leadership and Vision
- Li Xiang plays a crucial role in the development of Li Auto's autonomous driving technology, similar to Elon Musk's influence at Tesla, keeping the company adaptable to industry changes and resource allocation [4][5].
- Li Xiang's ability to make key judgments about resource distribution and AI learning is vital for the company's long-term success and efficient use of resources [4].

Group 3: Future Directions and Predictions
- Key iteration directions for Li Auto's VLA include improving the speed, quality, and cost-effectiveness of simulation data, which is essential for reinforcement learning [8][12].
- The company aims to maximize the potential of existing vehicle hardware for autonomous driving while exploring new chip technologies to boost computational capability [13].
- Future advances may involve online learning architectures that allow real-time weight updates, significantly improving the model's adaptability and its understanding of the physical world [13].
A conversation with Qianxun Intelligent's Gao Yang: scientist founders aren't very "reliable", but entrepreneurship is like a game
36Kr· 2025-08-08 01:49
Core Viewpoint
- The article discusses the emergence of embodied intelligence in robotics, emphasizing the importance of building integrated hardware-and-software solutions, akin to Apple's approach, rather than a fragmented one like Android's [5][6].

Group 1: Company Overview
- Qianxun Intelligent, co-founded by Gao Yang and Han Fengtao, has raised over 1 billion RMB within 19 months, with investors including Huawei Hubble, JD.com, and CATL [4].
- Gao Yang, a former assistant professor at Tsinghua University, moved from academia to entrepreneurship, highlighting the challenges and lessons of that transition [5][12].

Group 2: Market Insights
- The robotics market is highly competitive, with established companies focusing on hardware while neglecting software, which Gao Yang believes is crucial for long-term success [9].
- The rise of embodied intelligence is seen as inevitable, driven by advances in AI such as ChatGPT, which have shifted perceptions of what AI can do [8].

Group 3: Technical Perspectives
- Integrating hardware and software is deemed essential in the early stage of robotics development, as seen in historical examples like IBM's approach to personal computers [6][7].
- Gao Yang emphasizes the importance of algorithms and data in evaluating robotic systems, noting that models must be able to handle complex tasks rather than only simple ones [28][29].

Group 4: Future Outlook
- Robots capable of performing complex tasks, referred to as "Robot GPT-3.5", are expected to significantly expand functionality in everyday scenarios [32].
- The current focus on large-scale data collection in robotics may be less valuable than assumed because robot form factors are evolving rapidly, suggesting the need for more effective pre-training methods [41][42].
ByteDance & MAP reshape the optimization priorities of large-model reasoning algorithms: reinforcement learning should focus on efficient exploration to raise the ceiling of LLMs
量子位· 2025-08-07 10:13
Core Viewpoint
- The article discusses the limitations of traditional reinforcement learning (RL) frameworks for large language models (LLMs), particularly premature convergence, which leads to a lack of exploration and diversity in generated outputs [1][2].

Group 1: Introduction to FR3E
- The FR3E framework, inspired by the idea of "First Return, Then Explore", addresses the exploration challenge in RL by balancing exploitation and exploration [2][4].
- This structured exploration framework was developed by a collaborative team from ByteDance, MAP, and the University of Manchester [2][5].

Group 2: Algorithm Framework
- The FR3E algorithm consists of two phases: First Return and Entropy-Eliciting Explore [10][14].
- In the First Return phase, the model performs multiple rollouts per prompt, exploring candidate solutions and collecting trajectories and reward signals [12].
- The Entropy-Eliciting Explore phase uses a dynamic advantage modulation mechanism to fine-tune learning signals based on the marginal improvement in estimated value from one state to the next (a simplified sketch follows this summary) [16][18].

Group 3: Data Construction
- The team uses a mixed-difficulty strategy for data construction: low-difficulty data stabilizes training while high-difficulty data challenges the model's reasoning capabilities [23].

Group 4: Experimental Results
- FR3E was evaluated on several authoritative mathematical reasoning benchmarks, including GSM8K, Math500, and others, across various model sizes [24].
- FR3E outperformed the strong baseline GRPO++ across multiple benchmarks, demonstrating superior generalization and reasoning capabilities [25][28].
- Notably, FR3E exhibited prolonged exploration behavior, with slower entropy decay and longer response lengths, overcoming the "stagnation" seen with traditional methods [26][27].

Group 5: Conclusion
- FR3E offers an innovative and efficient structured exploration paradigm that directly addresses the core bottleneck of insufficient exploration in LLM RL training [28].
- Its principle of "structured feedback + adaptive adjustment" shows promising scalability for future RL training of large models [29].
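The idea of modulating the learning signal by the marginal improvement in value between intermediate states can be pictured with a small numeric example. The sketch below computes per-segment advantages from value estimates at a few checkpoints along one rollout; the scaling rule, names, and temperature are assumptions, not the published FR3E algorithm.

```python
import numpy as np

def segment_advantages(values, final_reward, temperature=1.0):
    """Advantage per trajectory segment from value estimates at checkpoints.

    `values[i]` is the estimated chance of eventually solving the task from
    checkpoint i; the marginal improvement values[i+1] - values[i] scales how
    strongly tokens in segment i are reinforced. The final segment is credited
    with the gap between the terminal reward and the last estimate.
    """
    v = np.asarray(values, dtype=np.float64)
    deltas = np.append(np.diff(v), final_reward - v[-1])
    # Soften extreme deltas so a single lucky segment does not dominate the update.
    return np.tanh(deltas / temperature)

# Checkpoint values along one rollout, then a successful terminal reward of 1.
print(segment_advantages([0.2, 0.25, 0.6, 0.55], final_reward=1.0))
```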
Reinforcement learning + MCP = a killer combination? An open-source framework teaches AI to master tools in MCP to solve tasks, with measured results surpassing GPT!
量子位· 2025-08-07 10:13
Core Viewpoint
- The article introduces OpenPipe's new open-source reinforcement learning framework, MCP·RL, which lets agents autonomously discover tools, generate tasks, and learn optimal strategies through closed-loop feedback without extensive manual configuration [2][14][23].

Group 1: MCP·RL Overview
- MCP·RL enables an agent to connect automatically to an MCP server, discover the available tools, and generate training tasks from the tool information [18].
- The framework achieves state-of-the-art (SOTA) performance on two-thirds of benchmark tests, demonstrating its effectiveness [4][21].
- Unlike traditional approaches that require extensive setup, MCP·RL lets the model learn from experience without data annotation or custom MCP interfaces [23][24].

Group 2: Learning Process
- Training proceeds in four steps: discovering tools, generating tasks, learning how to use the tools, and testing the resulting strategies (a hypothetical outline of this loop follows this summary) [18][19].
- The framework emphasizes a "learning by doing" approach in which agents learn through practical experience rather than predefined configurations [7][14].
- The shift from humans using MCP to AI using MCP marks a significant change in how agents interact with tools [20].

Group 3: Practical Applications
- MCP·RL is designed to work with any MCP server out of the box, making it versatile across applications [23].
- The Agent Reinforcement Trainer (ART) component of MCP·RL allows real-world training and evaluation of agent strategies, improving reliability [24][25].
- Earlier tests of ART on the Qwen 2.5-14B model showed superior performance on email retrieval tasks, achieving SOTA results [26].
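The four-step loop described above can be pictured as a small training skeleton. Everything below is a hypothetical outline built from stub functions; it is not the OpenPipe MCP·RL or ART API, and the function names and signatures are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Tool:
    name: str
    description: str

def discover_tools(server_url: str) -> list[Tool]:
    """Stub: list the tools an MCP server exposes (name + description)."""
    return [Tool("search_email", "Search a mailbox"), Tool("send_email", "Send a message")]

def generate_tasks(tools: list[Tool], n: int) -> list[str]:
    """Stub: synthesize training tasks from tool descriptions, e.g. with an LLM."""
    return [f"Task {i}: exercise {tools[i % len(tools)].name}" for i in range(n)]

def rollout(policy, task: str, tools: list[Tool]) -> tuple[list, float]:
    """Stub: let the policy call tools to solve the task; return trajectory and reward."""
    return [], 0.0

def update(policy, batch):
    """Stub: one policy-gradient update from scored rollouts."""
    return policy

def train(policy, server_url: str, epochs: int = 3, tasks_per_epoch: int = 8):
    tools = discover_tools(server_url)              # step 1: discover tools
    tasks = generate_tasks(tools, tasks_per_epoch)  # step 2: generate tasks
    for _ in range(epochs):                         # step 3: learn by doing
        batch = [rollout(policy, t, tools) for t in tasks]
        policy = update(policy, batch)
    return policy                                   # step 4: evaluate on held-out tasks

train(policy=None, server_url="http://localhost:8000/mcp")
```

The point of the skeleton is that no dataset or custom interface appears anywhere: tasks are derived from the server's own tool descriptions and rewards come from executing them.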
Does DeepSeek's GRPO cause model collapse? A look at Qwen3's new paradigm, GSPO
机器之心· 2025-08-07 09:42
Core Viewpoint
- The article traces the evolution of reinforcement learning techniques in the post-training phase of large language models (LLMs), highlighting Group Sequence Policy Optimization (GSPO) as a solution to the instability associated with Group Relative Policy Optimization (GRPO) [2][10][31].

Group 1: Training Phases and Techniques
- Training of large language models typically consists of two phases, pre-training and post-training; the latter focuses on improving the model's understanding and execution of human instructions [1].
- Post-training employs reinforcement learning; early methods such as Reinforcement Learning from Human Feedback (RLHF) were time-consuming and costly because they relied on human annotators [2][3].

Group 2: Innovations and Comparisons
- DeepSeek introduced an automated approach to RLHF, significantly reducing cost and improving efficiency by letting the model learn from reward signals rather than manual evaluation [2].
- The DeepSeek team proposed the Group Relative Policy Optimization (GRPO) algorithm, which they argue is more effective than the Proximal Policy Optimization (PPO) used by OpenAI for ChatGPT [3][5].

Group 3: Issues with GRPO
- The Qwen team identified serious stability issues with GRPO, stemming from its reliance on token-level importance sampling, which can produce high variance and unstable training [10][11][12].
- The instability arises from applying importance sampling weights at the token level, where variance accumulates over long sequences and compounds the training difficulty [15][16][17].

Group 4: Introduction of GSPO
- To address these issues, the Qwen team proposed Group Sequence Policy Optimization (GSPO), which uses sequence-level importance sampling to improve training stability (a minimal sketch of the two ratio definitions follows this summary) [10][22][31].
- GSPO's design avoids the variance accumulation of token-level sampling, improving training efficiency and stability [23][24].

Group 5: Experimental Evidence and Advantages
- Experiments showed that GSPO outperformed GRPO across tasks, with better scalability and training efficiency [20][30].
- The Qwen team also noted that GSPO simplifies training of Mixture-of-Experts (MoE) models by removing the need for auxiliary strategies such as Routing Replay, which GRPO required for stable convergence [25][27][30].
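The core difference described above is where the importance ratio is taken: GRPO weights each token by its own probability ratio under the new versus old policy, while GSPO uses one length-normalized ratio per response. The PyTorch sketch below contrasts the two quantities; clipping and the surrounding loss are omitted, and the tensor names are assumptions.

```python
import torch

def token_level_ratios(logp_new, logp_old):
    """GRPO-style: one importance ratio per token; variance can accumulate
    over long sequences because every token contributes its own ratio."""
    return torch.exp(logp_new - logp_old)            # [B, T]

def sequence_level_ratio(logp_new, logp_old, mask):
    """GSPO-style: a single length-normalized ratio per sequence,
    exp(mean_t(log pi_new - log pi_old)), applied to the whole response."""
    diff = (logp_new - logp_old) * mask
    lengths = mask.sum(dim=1).clamp(min=1)
    return torch.exp(diff.sum(dim=1) / lengths)      # [B]

B, T = 2, 5
logp_old = torch.randn(B, T) - 2.0                   # per-token log-probs, old policy
logp_new = logp_old + 0.1 * torch.randn(B, T)        # per-token log-probs, new policy
mask = torch.ones(B, T)                              # 1 for response tokens, 0 for padding
print(token_level_ratios(logp_new, logp_old).shape)  # per-token weights
print(sequence_level_ratio(logp_new, logp_old, mask))  # one weight per response
```

Because the sequence-level ratio averages the per-token log-ratios before exponentiating, a few noisy tokens cannot blow up the weight of an otherwise ordinary response, which is the stability argument summarized above.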
The Embodied Intelligence Heart technology exchange group has been established!
具身智能之心· 2025-08-07 02:38
Group 1
- The newly established Embodied Intelligence Heart Technology Exchange Group focuses on a range of advanced topics, including VLA, VLN, remote operation, Diffusion Policy, reinforcement learning, VLA+RL, sim2real, multimodal large models, simulation, motion control, target navigation, mapping and localization, and navigation [1]
- Interested individuals can add the assistant's WeChat AIDriver005 to join the community [2]
- To speed up the joining process, include a note with your institution/school, name, and research direction [3]
Success rate up 57%, the latest in VLA+RL! CO-RFT: efficient fine-tuning of VLA models (Beihang, Tsinghua, et al.)
具身智能之心· 2025-08-07 00:03
Core Insights
- The article presents Chunked RL, a new reinforcement learning framework designed for fine-tuning Vision-Language-Action (VLA) models, which show great potential for real-world robotic control [4][8].
- The proposed CO-RFT algorithm significantly improves on traditional supervised fine-tuning, achieving a 57% higher success rate and a 22.3% reduction in cycle time in real-world environments [4][29].

Section Summaries

Introduction
- VLA models integrate perception and language understanding for embodied control, showing promise for developing general strategies for real-world robotic control [6].
- The main challenge in fine-tuning VLA models is the dependence on the quality and quantity of task-specific data, which limits generalization to out-of-distribution (OOD) scenarios [6][7].

Methodology
- Chunked RL incorporates action chunking to improve sample efficiency and stability, making it particularly well suited to VLA models (a simplified sketch of a chunk-level TD target follows this summary) [8][12].
- The CO-RFT algorithm has two phases: imitation learning to initialize the backbone network and policy, followed by offline RL with action chunking to optimize the pre-trained policy [16][18].

Experimental Analysis
- Experiments were conducted on a robotic platform with six dexterous manipulation tasks, comparing CO-RFT against traditional methods [20][23].
- CO-RFT significantly outperforms supervised fine-tuning (SFT), with a 57% higher success rate and a 22.3% lower average cycle time across tasks [29][30].

Position Generalization
- CO-RFT shows strong position generalization, reaching a 44.3% success rate at previously unseen locations and outperforming SFT by 38% in OOD scenarios [4][29].

Importance of Data Diversity
- Data diversity is crucial to CO-RFT's performance: models trained on diverse datasets generalize significantly better than those trained on fixed datasets [32][33].
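Action chunking means the policy emits a short sequence (chunk) of low-level actions per decision step, and the RL update treats the chunk as a single macro-action. The sketch below shows a chunk-level TD target under that framing; the horizon, discounting, and tensor shapes are illustrative assumptions rather than the CO-RFT implementation.

```python
import torch

def chunk_td_target(rewards, next_q, gamma=0.99):
    """TD target for a chunk of H primitive actions treated as one macro-action.

    rewards: [B, H] per-step rewards collected while executing the chunk.
    next_q:  [B]    critic estimate at the state reached after the chunk ends.
    The chunk's return is the discounted sum of its H rewards plus the
    bootstrapped value discounted by gamma**H.
    """
    B, H = rewards.shape
    discounts = gamma ** torch.arange(H, dtype=rewards.dtype)
    chunk_return = (rewards * discounts).sum(dim=1)
    return chunk_return + (gamma ** H) * next_q

rewards = torch.tensor([[0.0, 0.0, 1.0], [0.1, 0.0, 0.0]])  # H = 3 steps per chunk
next_q = torch.tensor([0.5, 0.2])
print(chunk_td_target(rewards, next_q))
```

Bootstrapping only at chunk boundaries is one common motivation for chunking: it shortens the effective horizon the critic must reason over, which is consistent with the sample-efficiency argument summarized above.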