Surveying 300+ Works: How VLA Is Applied and Implemented Across Different Scenarios
具身智能之心· 2025-09-25 04:00
Core Insights
- The article discusses the emergence of Vision Language Action (VLA) models, marking a shift in robotics from traditional strategy-based control to a more generalized robotic technology paradigm, enabling active decision-making in complex environments [2][5][20]
- It emphasizes the integration of large language models (LLMs) and vision-language models (VLMs) to enhance robotic operations, providing greater flexibility and precision in task execution [6][12]
- The survey outlines a clear classification system for VLA methods, categorizing them into autoregressive, diffusion, reinforcement learning, hybrid, and specialized methods, while also addressing the unique contributions and challenges within each category [7][10][22]

Group 1: VLA Model Overview
- VLA models represent a significant advancement in robotics, allowing for the unification of perception, language understanding, and executable control within a single modeling framework [15][20]
- The article categorizes VLA methods into five paradigms: autoregressive, diffusion, reinforcement learning, hybrid, and specialized, detailing their design motivations and core strategies [10][22][23]
- The integration of LLMs into VLA systems transforms them from passive input parsers into semantic intermediaries, enhancing their ability to handle long and complex tasks [29][30]

Group 2: Applications and Challenges
- VLA models have practical applications across various robotic forms, including robotic arms, quadrupeds, humanoid robots, and autonomous vehicles, showcasing their deployment in diverse scenarios [8][20]
- The article identifies key challenges in the VLA field, such as data limitations, reasoning speed, and safety concerns, which need to be addressed to accelerate the development of VLA models and general robotic technology [8][19][20]
- The reliance on high-quality datasets and simulation platforms is crucial for the effective training and evaluation of VLA models, addressing issues of data scarcity and real-world testing risks [16][19]

Group 3: Future Directions
- The survey outlines future research directions for VLA, including addressing data limitations, enhancing reasoning speed, and improving safety measures to facilitate the advancement of general embodied intelligence [8][20][21]
- It highlights the importance of developing scalable and efficient VLA models that can adapt to various tasks and environments, emphasizing the need for ongoing innovation in this rapidly evolving field [20][39]
- The article concludes by underscoring the potential of VLA models to bridge the gap between perception, understanding, and action, positioning them as a key frontier in embodied artificial intelligence [20][21][39]
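Of the five paradigms the survey names, the autoregressive one is the most mechanically concrete: actions are discretized into tokens and emitted one step at a time, each conditioned on what came before. A minimal illustrative sketch in PyTorch — the model, dimensions, and greedy decoding loop are all hypothetical, not any surveyed system:

```python
import torch
import torch.nn as nn

class TinyARPolicy(nn.Module):
    """Toy autoregressive action head: emits discretized action tokens
    one step at a time, conditioned on a fused vision-language feature."""
    def __init__(self, dim=64, n_bins=256, horizon=8):
        super().__init__()
        self.horizon = horizon
        self.embed = nn.Embedding(n_bins, dim)
        self.cell = nn.GRUCell(dim, dim)
        self.head = nn.Linear(dim, n_bins)

    @torch.no_grad()
    def decode(self, context):
        """context: (1, dim) fused vision+language embedding."""
        h = context
        tok = torch.zeros(1, dtype=torch.long)   # start token
        actions = []
        for _ in range(self.horizon):            # strictly sequential
            h = self.cell(self.embed(tok), h)
            tok = self.head(h).argmax(dim=-1)    # greedy decoding
            actions.append(tok.item())
        return actions

policy = TinyARPolicy()
ctx = torch.randn(1, 64)
acts = policy.decode(ctx)
```

The strictly sequential loop is also the paradigm's weakness, which is what the diffusion and hybrid paradigms in the taxonomy try to address.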
In-Depth Survey | 300+ Papers Explained: How Pure Vision Pushes VLA to the Summit of Autonomous Driving and Embodied Intelligence!
自动驾驶之心· 2025-09-24 23:33
Core Insights
- The emergence of Vision Language Action (VLA) models signifies a paradigm shift in robotics from traditional strategy-based control to general-purpose robotic technology, transforming Vision Language Models (VLMs) from passive sequence generators into active agents capable of executing operations and making decisions in complex, dynamic environments [1][5][11]

Summary by Sections

Introduction
- Robotics has historically relied on pre-programmed instructions and control strategies for task execution, primarily in simple, repetitive tasks [5]
- Recent advancements in AI and deep learning have enabled the integration of perception, detection, tracking, and localization technologies, leading to the development of embodied intelligence and autonomous driving [5]
- Current robots often operate as "isolated agents," lacking effective interaction with humans and external environments, prompting researchers to explore the integration of Large Language Models (LLMs) and VLMs for more precise and flexible robotic operations [5][6]

Background
- The development of VLA models marks a significant step toward general embodied intelligence, unifying visual perception, language understanding, and executable control within a single modeling framework [11][16]
- The evolution of VLA models is supported by breakthroughs in single-modal foundational models across computer vision, natural language processing, and reinforcement learning [13][16]

VLA Models Overview
- VLA models have developed rapidly due to advancements in multi-modal representation learning, generative modeling, and reinforcement learning [24]
- The core design of VLA models includes the integration of visual encoding, LLM reasoning, and decision-making frameworks, aiming to bridge the gap between perception, understanding, and action [23][24]

VLA Methodologies
- VLA methods are categorized into five paradigms: autoregressive, diffusion models, reinforcement learning, hybrid methods, and specialized approaches, each with distinct design motivations and core strategies [6][24]
- Autoregressive models focus on the sequential generation of actions based on historical context and task instructions, demonstrating scalability and robustness [26][28]

Applications and Resources
- VLA models are applicable in various robotic domains, including robotic arms, quadrupedal robots, humanoid robots, and wheeled robots (autonomous vehicles) [7]
- The development of VLA models relies heavily on high-quality datasets and simulation platforms to address challenges related to data scarcity and the high risks of real-world testing [17][21]

Challenges and Future Directions
- Key challenges in the VLA field include data limitations, reasoning speed, and safety concerns, which need to be addressed to accelerate the development of VLA models and general robotic technologies [7][18]
- Future research directions are outlined to enhance the capabilities of VLA models, focusing on improving data diversity, enhancing reasoning mechanisms, and ensuring safety in real-world applications [7][18]

Conclusion
- The review emphasizes the need for a clear classification system for pure VLA methods, highlighting the significant features and innovations of each category, and providing insights into the resources necessary for training and evaluating VLA models [9][24]
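The diffusion paradigm listed alongside the autoregressive one generates an action by starting from Gaussian noise and iteratively denoising it toward a valid action vector. A toy DDPM-style sampler — the schedule, action dimension, and stand-in noise predictor are illustrative assumptions, not any surveyed model:

```python
import numpy as np

def toy_diffusion_policy(denoise_fn, action_dim=7, steps=50, seed=0):
    """Toy diffusion-style action sampler: start from Gaussian noise and
    iteratively denoise toward an action vector (DDPM-like update with a
    toy schedule; `denoise_fn` stands in for a learned noise predictor)."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    a = rng.standard_normal(action_dim)          # a_T ~ N(0, I)
    for t in reversed(range(steps)):
        eps_hat = denoise_fn(a, t)               # predicted noise at step t
        a = (a - betas[t] / np.sqrt(1 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:                                # add noise except at the last step
            a += np.sqrt(betas[t]) * rng.standard_normal(action_dim)
    return a

# stand-in predictor: pretends a fraction of the current sample is noise
sample = toy_diffusion_policy(lambda a, t: a * 0.1)
```

Unlike the autoregressive loop, every coordinate of the action is refined jointly at each step, which is the source of the paradigm's appeal for continuous control.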
Planning to Recruit Several Experts to Co-Build the Platform (4D Annotation / World Models / VLA and Related Directions)
自动驾驶之心· 2025-09-23 23:32
Core Viewpoint
- The article discusses the recruitment of business partners for the autonomous driving sector, emphasizing the need for expertise in various advanced technologies and offering attractive incentives for potential candidates [2][3][5]

Group 1: Recruitment Details
- The company plans to recruit 10 outstanding partners for autonomous driving-related course development, paper guidance, and hardware research [2]
- Candidates with expertise in areas such as large models, multimodal models, diffusion models, and 3D object detection are particularly welcome [3]
- Preferred qualifications include a master's degree or higher from universities ranked within the QS200, with priority given to candidates who have published in top conferences [4]

Group 2: Incentives and Opportunities
- The company offers resource sharing related to autonomous driving, including job recommendations, PhD opportunities, and study-abroad guidance [5]
- Attractive cash incentives and opportunities for collaboration on entrepreneurial projects are part of the recruitment package [5]
Nearly 5x Faster! Peking University and ByteDance Propose BranchGRPO, Reshaping Diffusion Model Alignment with Tree-Structured Branching + Pruning
机器之心· 2025-09-22 07:26
Core Insights
- The article introduces BranchGRPO, a novel tree-structured reinforcement learning method developed by Peking University and ByteDance, which addresses the challenges of efficient sampling and stable optimization in human preference alignment for diffusion and flow matching models [2][9]

Group 1: Research Background and Challenges
- Diffusion and flow matching models have become mainstream in visual generation due to their high fidelity, diversity, and controllability, but they often fail to align with human intentions, leading to results that deviate from aesthetic, semantic, or temporal consistency [5]
- Reinforcement Learning from Human Feedback (RLHF) has been introduced to directly optimize generative models so that outputs better align with human preferences [6]
- The existing Group Relative Policy Optimization (GRPO) method shows good stability and scalability in image and video generation but faces two fundamental bottlenecks: inefficiency due to sequential rollout, and sparse rewards that ignore critical signals in intermediate states [8]

Group 2: BranchGRPO Methodology
- BranchGRPO restructures the sampling process from a single path into a tree structure, allowing for efficient exploration and reducing redundancy in sampling [11][14]
- The method incorporates branching, reward fusion, and pruning mechanisms to enhance both speed and stability, achieving significant improvements in training efficiency and reward attribution [13][14]
- In image alignment tests, BranchGRPO demonstrated a speedup of up to 4.7x over DanceGRPO, with iteration times dropping from 698 seconds to as low as 148 seconds [15]

Group 3: Performance Metrics
- In image alignment (HPDv2.1), BranchGRPO achieved a score of 0.369, surpassing DanceGRPO's 0.360, while also achieving the highest image reward of 1.319 [15][17]
- For video generation (WanX-1.3B), BranchGRPO produced clearer and more stable video frames than previous models, with iteration times reduced from approximately 20 minutes to about 8 minutes, effectively doubling training efficiency [18][19]

Group 4: Experimental Findings
- Ablation studies indicate that moderate branching correlation and early dense splits accelerate reward improvement, while path-weighted reward fusion stabilizes training [23]
- Sample diversity remains intact with MMD² ≈ 0.019, nearly identical to sequential sampling [24]
- BranchGRPO's efficiency allows branch sizes to be scaled easily without performance degradation, with iteration times significantly reduced even at larger sample sizes [27]

Group 5: Conclusion and Future Outlook
- BranchGRPO innovatively combines efficiency and stability, transforming reward signals from a single endpoint into a continuous feedback mechanism, leading to comprehensive improvements in speed, stability, and alignment effectiveness [30]
- Future developments may include adaptive splitting and pruning strategies, potentially establishing BranchGRPO as a core method for RLHF in diffusion and flow models, further enhancing human preference alignment [30]
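The branching-plus-pruning idea can be made concrete with a toy rollout tree: trajectories share a common prefix, fork at chosen depths, and low-reward branches are discarded so the remaining compute is reused. This is a sketch of the general mechanism under assumed parameters, not the paper's actual algorithm:

```python
import random

def branch_rollout(step_fn, reward_fn, depth=6, branch_at=(1, 3), width=2, keep=2, seed=0):
    """Toy tree-structured rollout in the spirit of BranchGRPO (a sketch,
    not the paper's method): trajectories share a prefix, fork at chosen
    depths, and low-reward branches are pruned so compute is reused."""
    random.seed(seed)
    paths = [[step_fn(None)]]                    # single shared root state
    for d in range(1, depth):
        nxt = []
        for p in paths:
            forks = width if d in branch_at else 1
            for _ in range(forks):               # fork: reuse prefix p
                nxt.append(p + [step_fn(p[-1])])
        # prune: keep only the `keep` highest intermediate-reward paths
        nxt.sort(key=lambda p: reward_fn(p[-1]), reverse=True)
        paths = nxt[:keep]
    # fuse: average leaf rewards as the group baseline
    rewards = [reward_fn(p[-1]) for p in paths]
    return paths, sum(rewards) / len(rewards)

paths, baseline = branch_rollout(
    step_fn=lambda s: (s or 0) + random.random(),  # toy state transition
    reward_fn=lambda s: -abs(s - 3.0),             # prefer states near 3
)
```

Forking only at a few depths while pruning everywhere mirrors the article's finding that early dense splits help while total sample count stays bounded.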
Planning to Recruit Several Experts to Co-Build the Platform (World Models / VLA and Related Directions)
自动驾驶之心· 2025-09-21 06:59
Group 1
- The article announces the recruitment of 10 partners for the autonomous driving sector, focusing on course development, paper guidance, and hardware research [2]
- The recruitment targets individuals with expertise in advanced technologies such as large models, multimodal models, and 3D object detection [3]
- Candidates from QS200 universities with a master's degree or higher are preferred, especially those with significant conference contributions [4]

Group 2
- The compensation package includes resource sharing for job seeking, PhD recommendations, and study-abroad opportunities, along with substantial cash incentives [5]
- The company encourages potential partners to reach out via WeChat for collaboration inquiries, specifying the need to mention their organization or company [6]
Yan Junchi's Team at Shanghai Jiao Tong University: A Roundup of Hardcore Results in Top Conferences and Journals Over the Past Year
自动驾驶之心· 2025-09-18 23:33
Core Insights
- The article discusses the groundbreaking research conducted by Professor Yan Junchi's team at Shanghai Jiao Tong University, focusing on advancements in AI, robotics, and autonomous driving [2][32]
- The team's recent publications in top conferences such as CVPR, ICLR, and NeurIPS highlight key trends in AI research, emphasizing the integration of theory and practice, the transformative impact of AI on traditional scientific computing, and the development of more robust, efficient, and autonomous intelligent systems [32]

Group 1: Recent Research Highlights
- The paper "Grounding and Enhancing Grid-based Models for Neural Fields" introduces a systematic theoretical framework for grid-based neural field models, leading to the development of the MulFAGrid model, which achieves superior performance in various tasks [4][5]
- The CR2PQ method addresses the challenge of cross-view pixel correspondence in dense visual representation learning, demonstrating significant performance improvements over previous methods [6][7]
- The BTBS-LNS method effectively tackles the limitations of policy learning in large neighborhood search for mixed-integer programming (MIP), showing competitive performance against commercial solvers like Gurobi [8][10][11]

Group 2: Performance Metrics
- The MulFAGrid model achieved a PSNR of 56.19 in 2D image fitting tasks and an IoU of 0.9995 in 3D signed distance field reconstruction tasks, outperforming previous grid-based models [5]
- The CR2PQ method demonstrated improvements of 10.4% mAP^bb and 7.9% mAP^mk over state-of-the-art methods after only 40 pre-training epochs [7]
- The BTBS-LNS method outperformed Gurobi, delivering a 10% better primal gap in benchmark tests within a 300-second cutoff time [11]

Group 3: Future Trends in AI Research
- The research indicates a shift toward deeper integration of theoretical foundations with practical applications in AI, suggesting a future where AI technologies are more robust and capable of real-world deployment [32]
- These advancements are expected to lead to smarter robots, more powerful design tools, and more efficient business solutions in the near future [32]
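PSNR, the image-fitting metric quoted for MulFAGrid above, is a standard quantity worth pinning down: 10 · log10(MAX² / MSE), in decibels, with higher being better. A minimal implementation:

```python
import numpy as np

def psnr(ref, test, max_val=1.0):
    """Peak signal-to-noise ratio in dB (higher is better;
    identical images give infinite PSNR)."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

a = np.zeros((8, 8))
b = np.full((8, 8), 0.01)    # uniform error of 0.01 per pixel
print(round(psnr(a, b), 1))  # → 40.0
```

A PSNR of 56.19 therefore corresponds to a mean squared error on the order of 2.4e-6 for signals in [0, 1] — an extremely tight fit.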
Foundation Models for Autonomous Driving Should Be Capability-Oriented, Not Confined to the Methods Themselves
自动驾驶之心· 2025-09-16 23:33
Core Insights
- The article discusses the transformative impact of foundational models on autonomous driving perception, shifting from task-specific deep learning models to versatile architectures trained on vast and diverse datasets [2][4]
- It introduces a new classification framework focusing on four core capabilities essential for robust performance in dynamic driving environments: general knowledge, spatial understanding, multi-sensor robustness, and temporal reasoning [2][5]

Group 1: Introduction and Background
- Autonomous driving perception is crucial for enabling vehicles to interpret their surroundings in real time, involving key tasks such as object detection, semantic segmentation, and tracking [3]
- Traditional models, designed for specific tasks, exhibit limited scalability and poor generalization, particularly in "long-tail scenarios" where rare but critical events occur [3][4]

Group 2: Foundational Models
- Foundational models, developed through self-supervised or unsupervised learning strategies, leverage large-scale datasets to learn general representations applicable across various downstream tasks [4][5]
- These models offer significant advantages in autonomous driving due to their inherent generalization capabilities, efficient transfer learning, and reduced reliance on labeled datasets [4][5]

Group 3: Key Capabilities
- The four key dimensions for designing foundational models tailored to autonomous driving perception are:
  1. General Knowledge: the ability to adapt to a wide range of driving scenarios, including rare situations [5][6]
  2. Spatial Understanding: deep comprehension of 3D spatial structures and relationships [5][6]
  3. Multi-Sensor Robustness: maintaining high performance under varying environmental conditions and sensor failures [5][6]
  4. Temporal Reasoning: capturing temporal dependencies and predicting future states of the environment [6]

Group 4: Integration and Challenges
- The article outlines three mechanisms for integrating foundational models into autonomous driving technology stacks: feature-level distillation, pseudo-label supervision, and direct integration [37][40]
- It highlights the challenges of deploying these models, including the need for effective domain adaptation, addressing hallucination risks, and ensuring efficiency in real-time applications [58][61]

Group 5: Future Directions
- The article emphasizes the importance of advancing research on foundational models to enhance their safety and effectiveness in autonomous driving systems, addressing current limitations and exploring new methodologies [2][5][58]
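Of the three integration mechanisms, feature-level distillation is the easiest to sketch: a learnable projection maps the driving stack's features into the frozen foundation model's feature space, and a similarity loss pulls them together. All shapes and the choice of a cosine objective here are illustrative assumptions, not the article's specification:

```python
import torch
import torch.nn.functional as F

def distill_loss(student_feat, teacher_feat, proj):
    """Toy feature-level distillation step: project student features into
    the (frozen) foundation model's feature space and match them with a
    cosine objective."""
    s = proj(student_feat)                       # (B, N, D_teacher)
    t = teacher_feat.detach()                    # teacher stays frozen
    return 1.0 - F.cosine_similarity(s, t, dim=-1).mean()

proj = torch.nn.Linear(128, 256)                 # learnable projection
student = torch.randn(2, 100, 128)   # e.g. BEV features from the driving stack
teacher = torch.randn(2, 100, 256)   # e.g. frozen foundation-model features
loss = distill_loss(student, teacher, proj)
```

Detaching the teacher is the design choice that distinguishes distillation from direct integration: only the student and the projection receive gradients.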
Breaking Through the AGI Fog, Ant Group Sees a New Signpost
雷峰网· 2025-09-16 10:20
Core Viewpoint
- The article discusses the current state of large language models (LLMs) and the challenges they face in achieving Artificial General Intelligence (AGI), emphasizing the need for new paradigms beyond existing autoregressive (AR) models [4][10][18]

Group 1: Current Challenges in AI Models
- Ilya Sutskever, a prominent AI researcher, warns that data extraction has reached its limits, hindering progress toward AGI [2][4]
- Existing LLMs often exhibit significant performance discrepancies, with some capable of outperforming human experts while others struggle with basic tasks [13][15]
- The autoregressive model's limitations include a lack of bidirectional modeling and an inability to correct errors during generation, leading to fundamental misunderstandings in tasks like translation and medical diagnosis [26][27][18]

Group 2: New Directions in AI Research
- Elon Musk proposes a "purified data" approach of rewriting human knowledge as a potential pathway to AGI [5]
- Researchers are exploring multimodal approaches, with experts like Fei-Fei Li emphasizing the importance of visual understanding as a cornerstone of intelligence [8]
- A new paradigm, the diffusion model, is being advanced by young scholars; it contrasts with the traditional autoregressive approach by allowing parallel decoding and iterative correction [12][28]

Group 3: Development of LLaDA-MoE
- The LLaDA-MoE model, based on diffusion theory, was announced as a significant advancement in the field, showcasing a new approach to language modeling [12][66]
- LLaDA-MoE has a total parameter count of 7 billion, with 1.4 billion activated parameters, and has been trained on approximately 20 terabytes of data, demonstrating its scalability and stability [66][67]
- The model's performance in benchmark tests indicates that it can compete with existing autoregressive models, suggesting a viable alternative path for future AI development [67][71]

Group 4: Future Prospects and Community Involvement
- The development of LLaDA-MoE represents a milestone in the exploration of diffusion models, with plans for further scaling and improvement [72][74]
- The team emphasizes the importance of community collaboration in advancing diffusion model research, similar to the development of autoregressive models [74][79]
- Ant Group's commitment to investing in AGI research reflects a strategic shift toward exploring innovative and potentially high-risk areas of AI [79]
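The parallel decoding and iterative correction attributed to the diffusion paradigm can be sketched as a masked-prediction loop: start fully masked, predict every position at once, commit the most confident predictions, and re-mask the rest for another round. This toy loop illustrates the control flow only — the stand-in predictor is random, not LLaDA-MoE:

```python
import numpy as np

def masked_diffusion_decode(predict_fn, length=12, steps=4, seed=0):
    """Toy sketch of a diffusion-LM decoding loop: start fully masked,
    predict all masked positions in parallel, keep the most confident
    predictions, and re-mask the rest for the next round."""
    rng = np.random.default_rng(seed)
    MASK = -1
    seq = np.full(length, MASK)
    for step in range(steps):
        tokens, conf = predict_fn(seq, rng)       # parallel predictions
        masked = seq == MASK
        # unmask a fraction of the remaining positions, highest confidence first
        k = int(np.ceil(masked.sum() / (steps - step)))
        order = np.argsort(-np.where(masked, conf, -np.inf))
        seq[order[:k]] = tokens[order[:k]]
    return seq

def fake_model(seq, rng):
    """Stand-in predictor: random tokens with random confidences."""
    return rng.integers(0, 100, size=seq.size), rng.random(seq.size)

out = masked_diffusion_decode(fake_model)
```

Because every masked position is re-predicted each round, early mistakes can in principle be revised before they are committed — the "iterative correction" the article contrasts with left-to-right generation.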
Paper Deep Dive: HKUST's PLUTO, the First Planner to Surpass Rule-Based Methods!
自动驾驶之心· 2025-09-15 23:33
Core Viewpoint
- The article discusses the development and features of the PLUTO model within the end-to-end autonomous driving domain, emphasizing its unique two-stage architecture and its direct encoding of structured perception outputs for downstream control tasks [1][2]

Summary by Sections

Overview of PLUTO
- PLUTO is characterized by three main losses: a regression loss, a classification loss, and an imitation learning loss, which collectively drive the model's performance [7]
- Additional auxiliary losses are incorporated to aid model convergence [9]

Course Introduction
- The article introduces a new course titled "End-to-End and VLA Autonomous Driving," developed in collaboration with top algorithm experts from leading domestic manufacturers, aimed at addressing the challenges learners face in this rapidly evolving field [12][15]

Learning Challenges
- The course addresses the difficulties learners face due to the fast pace of technological development and the fragmented nature of knowledge across domains, which make it hard for beginners to grasp the necessary concepts [13]

Course Features
- The course is designed to provide quick entry into the field, build a framework for research capabilities, and combine theory with practical applications [15][16][17]

Course Outline
- The course consists of several chapters covering topics such as the history and evolution of end-to-end algorithms, background knowledge on various technologies, and detailed discussions of both one-stage and two-stage end-to-end methods [20][21][22][29]

Practical Application
- The course includes practical assignments, such as RLHF fine-tuning, allowing students to apply their theoretical knowledge in real-world scenarios [31]

Instructor Background
- The instructor, Jason, has a strong academic and practical background in cutting-edge algorithms related to end-to-end and large models, contributing to the course's credibility [32]

Target Audience and Expected Outcomes
- The course is aimed at individuals with a foundational understanding of autonomous driving and related technologies, with the goal of elevating their skills to the level of an end-to-end autonomous driving algorithm engineer within a year [36]
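The three main losses attributed to PLUTO above combine in the usual weighted-sum fashion. A toy version — the weights, tensor shapes, and the specific loss functions are illustrative guesses for exposition, not the paper's actual formulation:

```python
import torch
import torch.nn.functional as F

def pluto_style_loss(traj_pred, traj_gt, mode_logits, mode_gt,
                     policy_action, expert_action,
                     w_reg=1.0, w_cls=0.5, w_imi=1.0):
    """Toy combination of regression + classification + imitation losses
    (weights and loss choices are illustrative, not PLUTO's)."""
    reg = F.smooth_l1_loss(traj_pred, traj_gt)    # trajectory regression
    cls = F.cross_entropy(mode_logits, mode_gt)   # best-mode classification
    imi = F.mse_loss(policy_action, expert_action)  # imitation of expert
    return w_reg * reg + w_cls * cls + w_imi * imi

loss = pluto_style_loss(
    traj_pred=torch.randn(4, 20, 2), traj_gt=torch.randn(4, 20, 2),
    mode_logits=torch.randn(4, 6), mode_gt=torch.randint(0, 6, (4,)),
    policy_action=torch.randn(4, 3), expert_action=torch.randn(4, 3),
)
```

The auxiliary convergence losses the article mentions would simply be additional weighted terms in the same sum.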
Tencent Hunyuan Upgrades Its AI Image Fine-Tuning Paradigm: Optimizing Over the Entire Diffusion Trajectory, with Human Evaluation Scores Up 300%
量子位· 2025-09-15 03:59
Core Viewpoint
- The article discusses advancements in AI image generation, focusing on the introduction of two key methods, Direct-Align and Semantic Relative Preference Optimization (SRPO), which significantly enhance the quality and aesthetic appeal of generated images [5][14]

Group 1: Current Challenges in Diffusion Models
- Existing diffusion models face two main issues: limited optimization steps leading to "reward hacking," and the need for offline adjustments to the reward model to achieve good aesthetic results [4][8]
- The optimization process is constrained to the last few steps of the diffusion process due to the high cost of gradient computation [8]

Group 2: Direct-Align Method
- The Direct-Align method allows recovery of the original image from any time step by pre-injecting noise, thus avoiding the limitation of optimizing only in the later steps [5][10]
- This enables the model to recover clear images from high-noise states, addressing the gradient explosion problem during backpropagation through early time steps [11]
- Experiments show that even at just 5% denoising progress, Direct-Align can recover a rough structure of the image [11][19]

Group 3: Semantic Relative Preference Optimization (SRPO)
- SRPO redefines rewards as text-conditioned signals, allowing online adjustments without additional data by using positive and negative prompt words [14][16]
- The method enhances the model's ability to generate images with improved realism and aesthetic quality, achieving approximately 3.7x and 3.1x improvements, respectively [16]
- SRPO allows flexible style adjustments, such as brightness and cartoon-style conversion, based on the frequency of control words in the training set [16]

Group 4: Experimental Results
- Comprehensive experiments on the FLUX.1-dev model demonstrate that SRPO outperforms other methods such as ReFL, DRaFT, and DanceGRPO across multiple evaluation metrics [17]
- In human evaluations, the excellence rate for realism increased from 8.2% to 38.9%, and for aesthetic quality from 9.8% to 40.5%, after SRPO training [17][18]
- Notably, a mere 10 minutes of SRPO training allowed FLUX.1-dev to surpass the latest open-source version, FLUX.1.Krea, on the HPDv2 benchmark [19]
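The key observation behind Direct-Align, as described above, is that noise the trainer injects itself is known, so the clean image is recoverable in closed form from any point on the diffusion trajectory rather than only near the end. A small sketch with illustrative interpolation coefficients (the `alpha_bar` parameterization is a common diffusion convention, assumed here, not taken from the paper):

```python
import numpy as np

def recover_x0(x_t, injected_noise, alpha_bar):
    """Closed-form recovery of the clean image when the injected noise is
    known: invert x_t = sqrt(ab)*x0 + sqrt(1-ab)*eps for x0."""
    return (x_t - np.sqrt(1 - alpha_bar) * injected_noise) / np.sqrt(alpha_bar)

rng = np.random.default_rng(0)
x0 = rng.random((8, 8))                 # "clean image"
eps = rng.standard_normal((8, 8))       # noise we inject ourselves
alpha_bar = 0.05                        # deep into the noisy end of the trajectory
x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1 - alpha_bar) * eps
restored = recover_x0(x_t, eps, alpha_bar)
```

Because the recovery is algebraic rather than a long denoising chain, gradients can flow to any time step without the cost and instability that confine ordinary reward fine-tuning to the last few steps.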