自动驾驶之心
Diffusion Models Took a Ten-Year Detour! Kaiming He's Major New Work JiT: A Return to the True Essence of "Denoising"
自动驾驶之心· 2025-12-01 00:04
Core Viewpoint
- The article discusses the limitations of current diffusion models in denoising tasks and introduces a simplified architecture called JiT (Just image Transformers) that predicts clean images directly rather than noise, improving performance in high-dimensional pixel spaces [10][18][34].

Group 1: Diffusion Models and Noise Prediction
- Traditional diffusion models are designed to predict the noise, or the amount of mixed noise, which is a fundamentally different task from predicting clean images [6][7].
- The authors argue that the essence of denoising should be to let the network predict clean data instead of noise, simplifying the task and improving model performance [18][19].

Group 2: JiT Architecture
- JiT is a minimalist framework that operates directly on pixel patches without relying on latent spaces, tokenizers, or additional loss functions, making it more efficient [10][25][34].
- The architecture demonstrates that even with high-dimensional patches (up to 3072 dimensions), the model maintains stable training and performance by focusing on predicting clean images [23][30].

Group 3: Experimental Results
- In experiments on ImageNet at various resolutions, JiT models achieved impressive FID scores, with JiT-G/16 reaching 1.82, comparable to complex models that utilize latent spaces [30][31].
- The model's performance remained stable even at higher resolutions (1024×1024), showcasing its ability to handle high-dimensional data without increased computational cost [32][34].

Group 4: Implications for Future Research
- The JiT framework suggests a potential shift in generative modeling, emphasizing the value of working directly in pixel space for applications in embodied intelligence and scientific computing [34].
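The core idea, training the network to output the clean image rather than the noise, can be sketched in a few lines. This is a minimal NumPy illustration under an assumed linear interpolation schedule; `predict` is a stand-in for the actual JiT Transformer, and the paper's exact schedule and loss weighting may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def x_prediction_loss(x_clean, predict):
    """Denoising loss where the regression target is the clean data itself.

    Assumes a linear interpolation schedule x_t = (1 - t) * x + t * eps;
    `predict` stands in for the JiT Transformer over pixel patches.
    """
    eps = rng.standard_normal(x_clean.shape)       # Gaussian noise
    t = rng.random((x_clean.shape[0], 1))          # random timestep in [0, 1]
    x_t = (1 - t) * x_clean + t * eps              # noisy input to the network
    return np.mean((predict(x_t) - x_clean) ** 2)  # regress the clean image

x = rng.standard_normal((8, 3072))  # batch of high-dimensional pixel patches
```

An oracle that outputs the clean batch drives this loss to zero, while an identity "denoiser" that returns its noisy input does not: the supervision signal here is the image, not the noise.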
The Post-training Paradigm for Large Models Has Already Changed...
自动驾驶之心· 2025-12-01 00:04
Core Insights
- The article discusses the evolution of post-training paradigms in large models, particularly the shift from SFT+RLHF to a new two-stage approach of RL Scaling followed by RL Alignment, which may enhance reasoning capabilities and model performance [3][4][5].

Summary by Sections

Post-Training Paradigm Shift
- The traditional two-stage post-training recipe of SFT+RLHF has been widely adopted since the release of GPT-3.5, providing a foundation for rapid convergence and instruction-following capabilities [3].
- The new paradigm suggests that large reasoning models may transition to a two-stage approach of RL Scaling and RL Alignment, focusing on enhancing self-reflection and reasoning abilities without needing a convergence foundation first [4].

Advantages of the New Approach
- RL Scaling can improve model performance on verifiable tasks such as math and coding, while RL Alignment adjusts the model to follow human instructions and produce readable output [4].
- The new method potentially mitigates the reward-hacking issues of traditional post-training, allowing greater freedom in token search and enhancing reasoning capabilities [5].

Opportunities and Challenges
- The shift to RL Scaling opens questions of how to use data without clear answers and how to balance task difficulty to optimize learning [7].
- There are safety concerns: the enhanced capabilities from RL Scaling may let harmful reasoning emerge, raising questions about whether the RL Alignment phase can reliably ensure safety [6][7].

Generalization and Transferability
- The performance improvements seen on math and coding tasks generalize to other task types, indicating broader applicability of the new capabilities [5].
- Despite these advances, many users still prefer models like GPT-4o that excel at understanding intent and following instructions, underscoring the importance of communication quality and efficiency in practice [7].
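What makes a task "verifiable" for RL Scaling is that its reward can be computed programmatically instead of by a learned reward model. A minimal sketch of such a reward function, assuming the common math-benchmark convention of wrapping the final answer in `\boxed{...}` (the function name and convention are illustrative, not tied to any specific framework):

```python
import re

def verifiable_reward(completion: str, reference_answer: str) -> float:
    """Binary reward for RL Scaling on tasks with checkable answers.

    Unlike RLHF's learned reward model, this signal is exact and cannot be
    gamed by stylistic tricks, which is one reason this stage is thought to
    be less prone to reward hacking.
    """
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is None:
        return 0.0  # no parseable final answer: no reward
    return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0
```

A correct boxed answer earns reward 1.0; anything else, including a missing answer, earns 0.0, leaving the model free to search over intermediate reasoning tokens.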
The Best Time to Move into Embodied AI Was Yesterday; the Next Best Is Now...
自动驾驶之心· 2025-12-01 00:04
Core Insights
- The article surveys the current state of the embodied intelligence industry, focusing on key modules such as industry content, embodiment forms, algorithms, and deployment solutions [1].
- It highlights companies and laboratories developing embodied brains and embodiments, providing reference points for assessing and advancing the field [1].

Industry Content
- Several platforms are recommended for research purposes, including the SO-100, Openarm, and XLerobot series, all capable of running a range of algorithms [2][4].
- Openarm is noted for its dual-arm task framework, suited to tasks like folding clothes and pick-and-place operations, although it lacks mobility [4].
- XLerobot has limited mobility but is suitable for entry-level research and personal development [6].
- Other development platforms are cost-prohibitive, requiring significant financial investment [8].

Algorithm Development
- The article outlines algorithmic directions including VLA (training and training-free methods, VLA+RL, VLA+world models, lightweight VLA, and deployment), VLN (temporal language, goal navigation, point navigation), control (reinforcement learning, MPC, WBC), simulation (general, real), and tactile perception [9].

Deployment Solutions
- Most deployments currently rely on cloud-based inference, with edge-side solutions based on Sol's framework gradually being implemented [9].
- Companies like Xiaopeng have completed VLM/VLA deployments on self-developed chips, while deployments on platforms with less than 100T of compute remain rare [9].

Community and Resources
- The community provides a platform for sharing technical routes, live discussions, and job opportunities, aiming to cultivate industry talent [11][17].
- A comprehensive technical stack and learning routes for beginners have been organized to ease entry into the field [13].
- Valuable industry systems and project proposals are provided for those already engaged in related research [15].

Networking and Job Opportunities
- The community runs a job-referral mechanism with various embodied-AI companies, connecting members with potential employers [17].
- Members can access exclusive learning videos and documents, enhancing the educational experience [21].

Research and Development Resources
- The community has compiled over 40 open-source projects, 60 embodied-intelligence datasets, and multiple technical learning routes for newcomers and advanced learners alike [18].
- A summary of domestic and international embodied-intelligence companies across sectors including education, logistics, and healthcare is provided [23].
Another New Work from NVIDIA! MPA: A Model-Based Closed-Loop End-to-End Policy Adaptation Framework (CMU, Stanford, et al.)
自动驾驶之心· 2025-12-01 00:04
Core Insights
- The article presents Model-Based Policy Adaptation (MPA), a framework for improving the robustness and safety of end-to-end (E2E) autonomous driving agents under closed-loop evaluation [2][6][41].
- MPA addresses cascading errors and limited generalization in closed-loop evaluation by using a model-based approach to adapt pre-trained E2E driving agents [2][6].

Summary by Sections

Background
- E2E autonomous driving models have made significant progress by integrating perception, prediction, and planning into a unified learning framework, but they suffer performance degradation in closed-loop environments due to cumulative errors and distribution shift [3][6].
- The gap between offline training objectives and online behavior highlights the need for better closed-loop performance evaluation [5][9].

MPA Framework
- MPA bridges the performance gap by generating counterfactual data with a high-fidelity 3D Gaussian Splatting (3DGS) simulation engine, letting the agent experience diverse scenarios beyond the original dataset [7][14].
- The framework comprises a diffusion-based policy adapter and a multi-step Q-value model that refine the agent's predictions and estimate long-term rewards [7][21].

Experimental Results
- MPA was validated on the nuScenes benchmark, showing significant improvements in both in-domain and out-of-domain scenarios, particularly in safety-critical situations [11][33].
- MPA outperforms baseline models, achieving higher scores on key metrics such as route completion (RC) and HDScore [33][36].

Contributions
- The article lists three main contributions:
  1. Analysis of the root causes of closed-loop performance decline and of the fidelity of 3DGS simulation [11][41].
  2. A systematic counterfactual data-generation process and the training of the MPA framework [11][43].
  3. Demonstration of MPA's effectiveness in enhancing E2E driving agents across diverse scenarios [41][43].

Limitations and Future Work
- MPA relies on the assumption that 3DGS provides reliable rendering under constrained trajectory deviations, which may not hold in all cases [44].
- Future work will expand the dataset, integrate online reinforcement learning, and strengthen the framework's robustness across diverse driving conditions [44][46].
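The multi-step Q-value model scores behavior by long-horizon reward rather than a single step. A generic way to construct such a target is the discounted n-step return, sketched below as a plain function; this is a standard RL construction used for illustration, not necessarily the paper's exact formulation.

```python
def n_step_return(rewards, gamma=0.99, bootstrap_value=0.0):
    """Discounted n-step return: r_0 + g*r_1 + ... + g^(n-1)*r_{n-1} + g^n * V(s_n).

    `rewards` is the per-step reward sequence along a rolled-out trajectory;
    `bootstrap_value` is an estimate of the value at the final state, so the
    target accounts for consequences beyond the simulated horizon.
    """
    g = bootstrap_value
    for r in reversed(rewards):   # fold backwards from the horizon
        g = r + gamma * g
    return g
```

With `gamma=1.0` this is just the undiscounted sum of rewards; smaller `gamma` down-weights distant consequences, which is how a multi-step value model trades off immediate versus long-term safety.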
The Forgotten SenseTime Jueying
自动驾驶之心· 2025-11-30 02:02
Core Viewpoint
- The article discusses the challenges facing SenseTime's intelligent-vehicle subsidiary Jueying (绝影, branded SenseAuto) as it seeks external financing amid a tightening market environment [4][5][20].

Group 1: Market Dynamics
- The autonomous driving battlefield is entering a critical phase, with events such as another player's announcement of a 3.6 billion financing round signaling a narrowing financing environment [5].
- Jueying sits outside the final competition circle as an atypical player, struggling to secure its position in a market dominated by tech giants and established automakers [6][10].

Group 2: Company Positioning
- Jueying is one of three company types in the autonomous driving landscape: those incubated by tech giants, those backed by automakers, and those founded by star entrepreneurs [6].
- The company has struggled to become a core platform provider; its product lines have been significantly reduced, with focus narrowed to specific platforms such as Horizon and NVIDIA [24][26].

Group 3: Talent and Management Issues
- The company has experienced significant leadership turnover, undermining its ability to sustain long production cycles [27].
- Frequent management changes have created a disconnect between the algorithm and engineering teams, hindering the transition from theoretical models to practical applications [31].

Group 4: Customer Relationships
- Jueying's customer base consists mainly of second-tier suppliers; its largest client, Neta Auto, faces operational difficulties that jeopardize future orders [28][29].
- The company has tried to win clients with innovative delivery models but risks marginalization as competitors lock in their partnerships [29].

Group 5: Financial Viability and Future Outlook
- The company has struggled with profitability, with most revenue coming from low-margin products rather than high-value autonomous driving solutions [31].
- Despite its challenges, Jueying retains potential value in areas such as AI infrastructure and multimodal interaction, though it has fallen behind in the autonomous driving sector [32][33].
Direct Resume Referral! Pony.ai Is Hiring Multimodal Large Model Interns
自动驾驶之心· 2025-11-30 02:02
Core Viewpoint
- The article announces Pony.ai's intern recruitment, emphasizing skills in perception algorithm development and optimization for the autonomous driving industry [2][6].

Group 1: Responsibilities
- The role involves building perception capabilities driven by scene descriptions and natural-language instructions on top of Vision-Language Models (VLM), with the goal of deploying these models in real-world scenarios [2].
- Responsibilities also include developing and optimizing perception algorithms based on camera, LiDAR, and multi-modal fusion, covering object detection, semantic/instance segmentation, object tracking, and 3D reconstruction [6].

Group 2: Qualifications
- Candidates with internship experience in the autonomous driving industry, or those able to commit to a 6-month internship, receive preference [3].
- A bachelor's degree or higher in computer science or a related field is required, along with proficiency in deep learning and computer vision algorithms [6].
- Familiarity with CNN-based image detection, tracking, and recognition pipelines, plus strong C/C++ or Python programming skills, is essential [6].
- Candidates with strong research records, such as first-author papers at CCF-A conferences or journals, are prioritized [6].
- Knowledge of deep learning frameworks such as PyTorch and experience with parallel computing or CUDA programming are a plus [6].
Course Starting Soon! We Made a 3DGS Learning Roadmap for Beginners...
自动驾驶之心· 2025-11-30 02:02
Core Insights
- The article highlights the rapid iteration of 3DGS (3D Gaussian Splatting) technology, from static reconstruction (3DGS) to dynamic reconstruction (4DGS) and surface reconstruction (2DGS) [1].
- A new course, "3DGS Theory and Algorithm Practical Tutorial," offers a structured learning roadmap for people entering the field, covering essential theory and practical coding skills [1].

Course Overview
- The course introduces foundational computer graphics concepts, including implicit and explicit representations of 3D space, rendering pipelines, ray tracing, and radiance field rendering [5].
- It covers common development tools such as SuperSplat and COLMAP, along with the mainstream algorithm framework gsplat [5].

Chapter Summaries
- Chapter 1: Background Knowledge. An overview of 3DGS, starting with basic computer graphics concepts and the tools needed for model training [5].
- Chapter 2: Principles and Algorithms. The core principles and pseudocode of 3DGS, covering dynamic reconstruction, surface reconstruction, and ray tracing, with hands-on work in NVIDIA's open-source 3DGRUT framework [6].
- Chapter 3: Autonomous Driving 3DGS. Key works in the field, such as Street Gaussians and OmniRe, with practical applications built in DriveStudio [7].
- Chapter 4: Important Research Directions. Significant research areas in 3DGS, including COLMAP extensions and depth estimation, and their relevance to both industry and academia [8].
- Chapter 5: Feed-Forward 3DGS. The rise of feed-forward 3DGS, its development and algorithmic principles, and recent works such as AnySplat and WorldSplat [9].
- Chapter 6: Q&A Discussion. Online discussions where participants raise industry needs, pain points, and open questions for deeper engagement with the instructors [10].

Target Audience and Learning Outcomes
- The course targets learners with a foundation in computer graphics, visual reconstruction, and Python/PyTorch programming who want to deepen their knowledge and skills in 3DGS [14].
- Participants gain comprehensive theoretical knowledge plus practical experience with 3DGS algorithm development and frameworks, preparing them for a range of career opportunities in the field [14].
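At the heart of every variant the course covers (3DGS, 2DGS, 4DGS) is the same per-pixel operation: depth-sorted Gaussian splats composited front to back. A simplified scalar sketch of that blend; the real rasterizer evaluates anisotropic 2D Gaussians per image tile on the GPU, so treat this as a conceptual illustration only.

```python
import numpy as np

def composite(colors, alphas):
    """Front-to-back alpha compositing of depth-sorted splats along one ray:
    C = sum_i c_i * a_i * prod_{j<i} (1 - a_j).

    `colors` are RGB triples sorted near-to-far; `alphas` are the per-splat
    opacities after evaluating each Gaussian at this pixel.
    """
    out = np.zeros(3)
    transmittance = 1.0                     # fraction of light still unblocked
    for c, a in zip(colors, alphas):
        out += transmittance * a * np.asarray(c, dtype=float)
        transmittance *= (1.0 - a)
        if transmittance < 1e-4:            # early termination, as in the 3DGS paper
            break
    return out
```

A fully opaque splat in front completely determines the pixel color, while a half-transparent one contributes half its color and lets the rest of the sorted list show through.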
QCraft's Latest GuideFlow: A New End-to-End Trajectory Planning Approach
自动驾驶之心· 2025-11-30 02:02
Core Insights
- The article presents GuideFlow, a new planning framework that addresses trajectory-generation challenges in end-to-end autonomous driving by incorporating explicit constraints and strengthening the model's optimization capabilities [3][11][49].
- GuideFlow integrates multiple conditional signals to guide the generation process, improving the robustness and safety of autonomous driving systems [11][49].

Summary by Sections

Background Review
- End-to-end autonomous driving (E2E-AD) has emerged as an attractive alternative to traditional modular pipelines, enabling unified training from data [9].
- Recent work has shifted from single-modal to multi-modal trajectory generation to better capture the inherent uncertainty of real driving scenarios [9][10].

GuideFlow Framework
- GuideFlow explicitly models the flow-matching process to alleviate mode collapse and flexibly integrates multiple guiding signals [3][11].
- The framework combines flow matching with Energy-Based Model (EBM) training to strengthen the model's ability to satisfy physical constraints [3][11].

Experimental Results
- GuideFlow achieved state-of-the-art (SOTA) results on multiple benchmark datasets, notably an Extended PDM Score (EPDMS) of 43.0 on the challenging NavSim dataset [3][34][37].
- Its collision rate was notably low, averaging 0.07% on the nuScenes dataset, demonstrating its safety capabilities [40][41].

Contributions and Innovations
- The article highlights three core strategies within GuideFlow: velocity-field constraints, flow-state constraints, and EBM flow optimization, which together enhance trajectory feasibility and safety [11][28][31].
- A driving-aggressiveness score enables dynamic adjustment of trajectory style at inference time, further refining the model's adaptability [33][49].

Conclusion
- GuideFlow represents a significant advance in trajectory planning for autonomous driving, embedding safety constraints directly into the generation process and performing robustly across datasets [49].
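The base objective GuideFlow builds on, flow matching, fits a velocity field along a probability path from noise to data. A minimal NumPy sketch with a linear path; the constraint strategies and EBM optimization described in the article sit on top of an objective like this, and the shapes, schedule, and network are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(trajectories, velocity_net):
    """Conditional flow matching with a linear probability path.

    x_t = (1 - t) * noise + t * data, so the target velocity along the path
    is simply (data - noise). `velocity_net(x_t, t)` stands in for the
    trajectory-generation network; `trajectories` is a (batch, dim) array of
    flattened planned trajectories.
    """
    noise = rng.standard_normal(trajectories.shape)
    t = rng.random((trajectories.shape[0], 1))      # random time in [0, 1]
    x_t = (1 - t) * noise + t * trajectories        # point on the linear path
    target_v = trajectories - noise                 # constant velocity of that path
    return np.mean((velocity_net(x_t, t) - target_v) ** 2)
```

At sampling time, integrating the learned velocity field from t = 0 to t = 1 transports noise samples to trajectory samples; guidance signals can then steer that integration, which is where GuideFlow's explicit constraints enter.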
Class Starts Tomorrow! What Does End-to-End Mass Production Actually Involve? We've Prepared a Hands-On Course...
自动驾驶之心· 2025-11-29 02:06
Core Viewpoint
- The article emphasizes the importance of end-to-end mass production in the automotive industry, highlighting the scarcity of qualified talent and the need for comprehensive training programs covering the field's many challenges [1][3].

Course Overview
- The course covers the essential algorithms of end-to-end production, including single-stage and two-stage frameworks, reinforcement learning applications, and trajectory optimization [3][9].
- It aims to deliver practical experience and insight into production challenges, focusing on real-world applications with expert guidance [3][16].

Course Structure
- Chapter 1 introduces the end-to-end task landscape, discussing the integration of perception and control algorithms and the importance of efficient data handling [9].
- Chapter 2 focuses on the two-stage end-to-end algorithm framework, explaining its modeling and information-transfer processes [10].
- Chapter 3 covers the single-stage end-to-end algorithm framework, emphasizing its advantages in information transmission and performance [11].
- Chapter 4 discusses the application of navigation information in autonomous driving, detailing navigation-map formats and encoding methods [12].
- Chapter 5 introduces reinforcement learning algorithms, highlighting why they are needed to complement imitation learning for better generalization [13].
- Chapter 6 is a hands-on project on trajectory-output optimization, combining imitation and reinforcement learning techniques [14].
- Chapter 7 presents fallback strategies for trajectory planning, focusing on smoothing algorithms that make the output more reliable [15].
- Chapter 8 shares production experience from multiple perspectives, offering strategies for optimizing system capabilities [16].

Target Audience
- The course targets advanced learners with a foundation in autonomous driving algorithms, reinforcement learning, and programming [17][18].
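To give a flavor of the fallback smoothing covered in Chapter 7, here is a deliberately simple moving-average smoother over planned waypoints. Production stacks typically use spline- or QP-based smoothing, so this is an illustrative stand-in; endpoints are pinned so the start pose and goal are preserved.

```python
def smooth_trajectory(waypoints, window=3):
    """Moving-average smoothing of planned (x, y) waypoints.

    Each interior point is replaced by the mean of its neighbors within
    `window`; the first and last waypoints are kept fixed so the smoothed
    trajectory still starts at the ego pose and ends at the goal.
    """
    n = len(waypoints)
    out = [waypoints[0]]
    for i in range(1, n - 1):
        lo, hi = max(0, i - window // 2), min(n, i + window // 2 + 1)
        xs = [p[0] for p in waypoints[lo:hi]]
        ys = [p[1] for p in waypoints[lo:hi]]
        out.append((sum(xs) / len(xs), sum(ys) / len(ys)))
    out.append(waypoints[-1])
    return out
```

A straight trajectory passes through unchanged, while a jerky zigzag is pulled toward its local mean, which is exactly the reliability property a fallback layer wants before handing the trajectory to control.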
Qwen3-VL Multimodal Model, Illustrated
自动驾驶之心· 2025-11-29 02:06
Core Insights
- The article walks through the Qwen3-VL model, a vision-language model (VLM) that takes both text and images as input, focusing on its architecture and implementation details [3][4].

Group 1: Model Overview
- Qwen3-VL is an autoregressive model designed for multimodal inputs, specifically text and images [3].
- Its codebase comprises configuration files, modeling files, and processing files for images and videos [5][6].

Group 2: Source Code Analysis
- The Qwen3-VL source is organized into several classes, including Qwen3VLVisionMLP, Qwen3VLVisionPatchEmbed, and Qwen3VLForConditionalGeneration, each serving a specific function within the model [6][12].
- The Qwen3VLProcessor class converts input images into pixel values, reusing the Qwen2-VL image processor for this task [7][10].

Group 3: Image Processing
- The image-processing pipeline resizes, normalizes, and prepares images, ultimately returning the pixel values that serve as model input [8][9].
- Images are processed in batches, grouped by size for efficient resizing and normalization [9].

Group 4: Model Execution Flow
- The Qwen3VLForConditionalGeneration class is the model's entry point, where input pixel values and text input IDs are processed to generate outputs [15][16].
- Its forward method integrates image and text features, embedding the image tokens into the input sequence [21][22].

Group 5: Vision Encoder
- The Qwen3-VL vision encoder is custom-built rather than reused from an existing model like CLIP, and uses a 3D convolution to convert images into hidden states [35][37].
- The encoder combines attention mechanisms with position encoding to strengthen its handling of visual data [40][41].
Group 6: Final Outputs
- The model's final output combines the processed image and text features, which are forwarded to the language model for further processing [33][34].
- The architecture integrates visual and textual data, enabling the model to generate coherent outputs from multimodal inputs [44].
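The 3D-convolution patch embedding mentioned above is easy to demystify: with stride equal to kernel size, a Conv3d is equivalent to slicing the frame stack into non-overlapping spatio-temporal patches and applying one linear projection per patch. A NumPy sketch of that equivalence; the shapes here (temporal window 2, patch size 14, hidden dim 64) are illustrative assumptions, not Qwen3-VL's actual configuration.

```python
import numpy as np

def patch_embed_3d(frames, weight, t=2, p=14):
    """Non-overlapping 3D patchification + linear projection.

    Equivalent to Conv3d(kernel=stride=(t, p, p)): every (t, C, p, p)
    spatio-temporal block becomes one token for the language model.
    frames: (T, C, H, W) array; weight: (t*C*p*p, D) projection matrix.
    """
    T, C, H, W = frames.shape
    tokens = []
    for ti in range(0, T, t):                     # step over frame pairs
        for hi in range(0, H, p):                 # step over patch rows
            for wi in range(0, W, p):             # step over patch columns
                patch = frames[ti:ti + t, :, hi:hi + p, wi:wi + p]
                tokens.append(patch.reshape(-1) @ weight)
    return np.stack(tokens)                       # (num_patches, D) hidden states

rng = np.random.default_rng(0)
frames = rng.standard_normal((2, 3, 28, 28))        # 2 RGB frames of 28x28
weight = rng.standard_normal((2 * 3 * 14 * 14, 64)) # flattened-patch projection
hidden = patch_embed_3d(frames, weight)             # 2x2 spatial grid -> 4 tokens
```

A single still image is handled by repeating it to fill the temporal window, which is why the same patch embed serves both image and video input.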