自动驾驶之心
Face to Face with the Big Names! Stanford's 2025 CS336 Course Fully Public: Building Large Models from Scratch
自动驾驶之心· 2025-06-24 11:47
Core Viewpoint
- The article discusses the launch of Stanford University's CS336 course "Language Models from Scratch," which aims to provide a comprehensive understanding of language models through practical development and implementation [5][7].

Course Overview
- The course focuses on the foundational aspects of language models, which are essential for modern natural language processing (NLP) applications. It emphasizes the importance of understanding language models for scientists and engineers in the fields of AI and ML [5][7].
- The course is structured into five major modules: Foundations, Systems, Extensions, Data, and Alignment & Reinforcement Learning [7].

Course Requirements
- Students are expected to have proficiency in Python, as most assignments will require extensive coding. The course will provide minimal scaffolding, resulting in a higher volume of code written by students compared to other AI courses [7].
- A background in deep learning and system optimization is necessary, particularly familiarity with PyTorch and basic system concepts like the memory hierarchy [7].
- Foundational knowledge in calculus, linear algebra, probability, and statistics is required, along with a basic understanding of machine learning principles [7].

Assignments
- The course includes several assignments that cover various aspects of language model development, such as implementing a BPE tokenizer, training models on specific datasets, and optimizing performance on GPUs [8].
- Assignments are designed to simulate real-world challenges, including data processing and model alignment, with a focus on practical application and hands-on experience [8].

Course Schedule
- The course is structured with a detailed schedule that outlines topics, materials, and deadlines for assignments, ensuring a systematic approach to learning [9].
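The BPE tokenizer assignment mentioned above can be illustrated with a minimal sketch of the merge-learning loop. The toy corpus and function names below are illustrative, not taken from the course starter code:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a {symbol-tuple: count} corpus."""
    pairs = Counter()
    for word, freq in words.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    a, b = pair
    for word, freq in words.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and word[i] == a and word[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

def learn_bpe(words, num_merges):
    """Learn `num_merges` BPE merge rules, most frequent pair first."""
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges

# Toy word-frequency corpus: "lower" x5, "lowest" x2, "newer" x6.
corpus = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6}
merges = learn_bpe(corpus, 3)  # first merge is ('w', 'e'), the most frequent pair
```

A real assignment-grade tokenizer additionally needs byte-level pre-tokenization and an encode/decode path, but the greedy merge loop above is the core of the algorithm.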
Huawei Car BU Is Hiring (End-to-End / Perception Models / Model Optimization, etc.)! Many Openings
自动驾驶之心· 2025-06-24 07:21
Core Viewpoint
- The article emphasizes the rapid evolution and commercialization of autonomous driving technologies, highlighting the importance of community engagement and knowledge sharing in this field [9][14][19].

Group 1: Job Opportunities and Community Engagement
- Huawei is actively recruiting for various positions in its autonomous driving division, including roles focused on end-to-end model algorithms, perception models, and efficiency optimization [1][2].
- The "Autonomous Driving Heart Knowledge Planet" serves as a platform for technical exchange, targeting students and professionals in the autonomous driving and AI sectors, and has established connections with numerous industry companies for job referrals [7][14][15].

Group 2: Technological Trends and Future Directions
- The article outlines that by 2025, the focus will be on advanced technologies such as visual large language models (VLM), end-to-end trajectory prediction, and 3D generative simulations, indicating a shift towards more integrated and intelligent systems in autonomous driving [9][22].
- The community has developed over 30 learning pathways covering various subfields of autonomous driving, including perception, mapping, and AI model deployment, which are crucial for industry professionals [19][21].

Group 3: Educational Resources and Content
- The knowledge platform offers exclusive rights to members, including access to academic advancements, professional Q&A sessions, and discounts on courses, fostering a comprehensive learning environment [17][19].
- Regular webinars featuring experts from top conferences and companies are organized to discuss practical applications and research in autonomous driving, enhancing the learning experience for participants [21][22].
SwitchVLA: A Lightweight VLA Model for Real-Time Dynamic Task Switching, No Extra Data Collection Required
自动驾驶之心· 2025-06-24 02:54
Core Viewpoint
- The article introduces SwitchVLA, a lightweight and data-efficient method for dynamic task perception and decision-making that addresses the challenge of task switching in multi-task VLA models and achieves superior performance compared to existing methods [3][22].

Group 1: Introduction
- Current mainstream multi-task VLA models struggle with "task switching": their ability to adapt to a new task mid-execution is limited [3][5].
- SwitchVLA employs an execution-aware mechanism and a lightweight network architecture to enable task switching without any additional data collection [3][10].

Group 2: Background
- Multi-task VLA training typically collects data independently for each task, which makes seamless transitions between tasks difficult [5].
- Existing SOTA VLA methods cannot handle task switching effectively, underscoring the need for improved solutions [5][10].

Group 3: Methodology
- SwitchVLA addresses two core problems: representing task switching without extra data collection, and training an end-to-end imitation learning model that judges autonomously based on current conditions [10][12].
- The model improves the task-switching representation by concatenating the previous task, the current task, and the previous task's stage, sharpening its perception of task transitions [12][13].
- A simplified training process divides each task into three stages (before contact, during contact, and after contact), enabling effective task switching without additional data [15][16].

Group 4: Experimental Results
- Experiments demonstrate that SwitchVLA outperforms existing methods in task-switching scenarios while maintaining comparable performance in single-task settings [20][22].
- An analysis of task-switching failures shows that the proposed method effectively mitigates the common failure causes [20].

Group 5: Conclusion and Future Directions
- SwitchVLA is positioned as a significant advance in dynamic task management, with plans for further iterations and deployment on humanoid robots for applications in flexible industrial production and personalized commercial services [22][23].
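The switching representation described in Group 3 (conditioning on the previous task, the current task, and the previous task's stage) can be sketched in a few lines. The task names, stage names, and one-hot encoding below are illustrative assumptions, not details from the paper:

```python
import numpy as np

# Hypothetical task set and the three contact stages named in the summary.
TASKS = ["pick_cup", "place_plate", "wipe_table"]
STAGES = ["before_contact", "during_contact", "after_contact"]

def one_hot(index, size):
    v = np.zeros(size, dtype=np.float32)
    v[index] = 1.0
    return v

def switch_condition(prev_task, curr_task, prev_stage):
    """Concatenate (previous task, current task, previous task's stage)
    into one conditioning vector for the policy."""
    return np.concatenate([
        one_hot(TASKS.index(prev_task), len(TASKS)),
        one_hot(TASKS.index(curr_task), len(TASKS)),
        one_hot(STAGES.index(prev_stage), len(STAGES)),
    ])

# A mid-execution switch: the robot was picking a cup, is now asked to wipe
# the table, and the previous task was interrupted during contact.
cond = switch_condition("pick_cup", "wipe_table", "during_contact")
```

The point of the concatenation is that the policy sees not only what it should do now but also what it was doing and how far it had gotten, which is what lets it roll back or transition cleanly.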
End-to-End Series! SparseDrive: End-to-End Autonomous Driving Based on Sparse Scene Representation
自动驾驶之心· 2025-06-23 11:34
Core Viewpoint
- The article discusses the limitations of existing end-to-end methods in autonomous driving, particularly the computational intensity of BEV paradigms and the inefficiency of sequential prediction-then-planning approaches. It proposes a new sparse paradigm that allows prediction and planning tasks to run in parallel [2][5].

Group 1: SparseDrive Methodology
- SparseDrive adopts the core ideas of Horizon's earlier Sparse series, focusing on sparse scene representation for autonomous driving [3].
- The proposed method exploits the similarities between motion prediction and planning and introduces a hierarchical planning selection strategy [5].
- The architecture includes features such as symmetric sparse perception and a parallel motion planner [5].

Group 2: Training and Performance
- The training loss for SparseDrive is defined as a combination of detection, mapping, motion, planning, and depth losses [9].
- Performance comparisons show that SparseDrive-S achieves a mean Average Precision (mAP) of 0.418, while SparseDrive-B reaches 0.496, outperforming other methods such as UniAD [11].
- In motion prediction and planning, SparseDrive-S and SparseDrive-B demonstrate significant improvements in metrics such as minADE and minFDE compared to traditional methods [18].

Group 3: Efficiency Comparison
- SparseDrive exhibits superior training and inference efficiency, requiring only 15.2 GB of GPU memory and achieving 9.0 FPS during inference, compared to UniAD's 50.0 GB and 1.8 FPS [20].
- The method's reduced computational requirements make it more accessible for real-time applications in autonomous driving [20].

Group 4: Course and Learning Opportunities
- The article promotes a course focused on end-to-end autonomous driving algorithms, covering foundational knowledge, practical implementations, and various algorithmic approaches [29][41].
- The course aims to equip participants with the skills necessary to understand and implement end-to-end solutions in the autonomous driving industry [54][56].
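The combined training objective named in Group 2 can be sketched as a weighted sum of the five per-task losses. The weights and loss values below are placeholders; the paper's actual coefficients are not given in the summary:

```python
def sparsedrive_total_loss(losses, weights=None):
    """Weighted sum of the five task losses named in the summary
    (detection, mapping, motion, planning, depth). Per-term weights
    are hypothetical; the paper defines its own."""
    terms = ["detection", "mapping", "motion", "planning", "depth"]
    weights = weights or {t: 1.0 for t in terms}
    return sum(weights[t] * losses[t] for t in terms)

# Illustrative per-task loss values from one imaginary training step.
total = sparsedrive_total_loss(
    {"detection": 0.8, "mapping": 0.5, "motion": 0.3, "planning": 0.2, "depth": 0.1}
)
```

Structuring the objective this way keeps each head independently tunable, which matters because the perception, prediction, and planning branches run in parallel rather than sequentially.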
SJTU & KargoBot's FastDrive! Structured Labels Make End-to-End Large Models Faster and Stronger
自动驾驶之心· 2025-06-23 11:34
Core Viewpoint
- The integration of human-like reasoning capabilities into end-to-end autonomous driving systems is a cutting-edge research area, with a focus on vision-language models (VLMs) [1].

Group 1: Structured Dataset and Model
- A structured dataset called NuScenes-S has been introduced, which focuses on key elements closely related to driving decisions, eliminating redundant information and improving reasoning efficiency [4][5].
- The FastDrive model, with 0.9 billion parameters, mimics human reasoning strategies and effectively aligns with end-to-end autonomous driving frameworks [4][5].

Group 2: Dataset Description
- The NuScenes-S dataset provides a comprehensive view of driving scenarios, addressing issues often overlooked in existing datasets. It includes key elements such as weather, traffic conditions, driving areas, traffic lights, traffic signs, road conditions, lane markings, and time [7][8].
- The dataset construction involved annotating scene information using both GPT and human input, refining the results through comparison and optimization [9].

Group 3: FastDrive Algorithm Model
- The FastDrive model follows the "ViT-Adapter-LLM" architecture, utilizing a Vision Transformer for visual feature extraction and a token-packing module to enhance inference speed [18][19].
- The model employs a large language model (LLM) to generate scene descriptions, identify key objects, predict future states, and make driving decisions in a reasoning-chain manner [19].

Group 4: Experimental Results
- Experiments conducted on the NuScenes-S dataset, which contains 102,000 question-answer pairs, demonstrated that FastDrive achieved competitive performance in scene understanding tasks [21].
- The performance metrics for FastDrive showed strong results in perception, prediction, and decision-making tasks, outperforming other models [25].
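The summary only says the token-packing module shortens the visual token sequence to speed up LLM inference; one plausible reading is pooling groups of adjacent ViT tokens. This is a guess for illustration, with hypothetical group size and dimensions:

```python
import numpy as np

def pack_tokens(tokens, pack=4):
    """Average-pool groups of `pack` adjacent visual tokens, shrinking the
    sequence the LLM must attend over by a factor of `pack`. This is one
    possible interpretation of a token-packing module, not FastDrive's
    documented design."""
    n, d = tokens.shape
    assert n % pack == 0, "sequence length must be divisible by the group size"
    return tokens.reshape(n // pack, pack, d).mean(axis=1)

vit_tokens = np.random.randn(256, 768)    # hypothetical ViT patch-token output
packed = pack_tokens(vit_tokens, pack=4)  # sequence shrinks 256 -> 64
```

Since LLM attention cost grows with sequence length, cutting the visual token count is a direct lever on inference speed, which matches the stated motivation.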
A New ADAS Paradigm! BIT & Tsinghua's MMTL-UniAD: A Unified SOTA Framework for Multimodal and Multi-Task Learning (CVPR'25)
自动驾驶之心· 2025-06-23 11:34
Core Insights
- The article presents MMTL-UniAD, a unified framework for multimodal and multi-task learning in assistive driving perception, which aims to enhance the performance of advanced driver-assistance systems (ADAS) by simultaneously recognizing driver behavior, emotions, traffic environment, and vehicle actions [1][5][26].

Group 1: Introduction and Background
- Advanced driver-assistance systems (ADAS) have significantly improved driving safety over the past decade, yet approximately 1.35 million people die in traffic accidents annually, with over 65% of these incidents linked to abnormal driver psychological or physiological states [3].
- Current research often focuses on single tasks, such as driver behavior or emotion recognition, neglecting the inherent connections between these tasks, which limits the potential for cross-task learning [4][3].

Group 2: Framework and Methodology
- MMTL-UniAD employs a multimodal approach to achieve synchronized recognition of driver behavior, emotions, traffic environment, and vehicle actions, addressing the challenge of negative transfer in multi-task learning [5][26].
- The framework incorporates two core components, a multi-axis region attention network (MARNet) and a dual-branch multimodal embedding module, which together extract task-shared and task-specific features [5][26].

Group 3: Experimental Results
- MMTL-UniAD outperforms existing state-of-the-art methods across multiple tasks, achieving performance improvements of 4.10% to 12.09% on the mAcc metric on the AIDE dataset [18][26].
- The framework demonstrates superior accuracy in driver behavior recognition and vehicle behavior recognition, with increases of 4.64% and 3.62%, respectively [18][26].

Group 4: Ablation Studies
- Ablation experiments indicate that joint training of driver-state tasks and traffic-environment tasks enhances feature sharing, significantly improving task recognition accuracy [22][26].
- The results confirm that the interdependence of tasks in MMTL-UniAD contributes to overall performance and generalization capabilities [22][26].
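The task-shared vs. task-specific split can be pictured as a shared trunk feeding four task heads. This is a deliberate simplification: MMTL-UniAD's actual MARNet and dual-branch embedding modules are far more elaborate, and all dimensions and label counts below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature sizes and per-task class counts.
D_IN, D_SHARED = 128, 64
TASKS = {"driver_behavior": 5, "driver_emotion": 4,
         "traffic_context": 6, "vehicle_action": 7}

W_shared = rng.standard_normal((D_IN, D_SHARED)) * 0.02   # task-shared trunk
heads = {t: rng.standard_normal((D_SHARED, k)) * 0.02     # task-specific heads
         for t, k in TASKS.items()}

def forward(x):
    """One shared representation, four task-specific logit heads."""
    h = np.tanh(x @ W_shared)                    # features all tasks reuse
    return {t: h @ W for t, W in heads.items()}  # per-task predictions

out = forward(rng.standard_normal((2, D_IN)))    # batch of 2 samples
```

The shared trunk is exactly where negative transfer can arise (one task's gradients degrading another's features), which is the problem the paper's attention-based modules are designed to mitigate.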
Fresh Interview Notes, Just Finished an NVIDIA TRT LLM Interview
自动驾驶之心· 2025-06-23 11:34
Core Insights
- The article discusses a recent interview experience with Nvidia for a position related to LLM inference acceleration, highlighting the rigorous interview process and technical discussions involved [1].

Group 1: Interview Process
- The interview consisted of four rounds, each lasting one hour, for a total of four hours, indicating a thorough evaluation process by Nvidia [1].
- The first interviewer focused on the candidate's research work, particularly on speculative decoding, and included a coding challenge that the candidate struggled with due to lack of practice [1].
- The second interviewer was familiar with the candidate's research, engaging in a deeper discussion about speculative decoding and posing a string-related coding problem [1].

Group 2: Technical Discussions
- The third interviewer, a female group leader, discussed the development directions of speculative decoding in high-batch scenarios and asked questions about transformer structures, specifically the dimensions of Q and K [1].
- The fourth interviewer, the only one to turn on the camera, approached the discussion from a systems perspective, providing valuable insights and confirming understanding during the presentation [1].

Group 3: Internship Details
- The internship location options include Shanghai, Beijing, or remote work, with a focus on inference optimization rather than purely research-oriented tasks [1].
- The expected internship salary ranges from 8,000 to 10,000 yuan, reflecting the competitive nature of positions in the tech industry [1].
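The Q/K dimension question from the third round is the classic scaled dot-product attention shape check: Q and K must share the head dimension so their product yields a (sequence x sequence) score matrix. A minimal sketch with illustrative dimensions:

```python
import numpy as np

# batch, heads, sequence length, per-head dimension (all illustrative)
B, H, T, d_k = 2, 8, 16, 64

Q = np.random.randn(B, H, T, d_k)
K = np.random.randn(B, H, T, d_k)
V = np.random.randn(B, H, T, d_k)

# Q @ K^T contracts over d_k, so Q and K must agree on the last axis.
scores = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(d_k)    # (B, H, T, T)

# Numerically stable softmax over the key axis.
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)

out = weights @ V                                      # (B, H, T, d_k)
```

The follow-up interviewers often probe is that V's head dimension may differ from d_k, since V is only contracted against the (T, T) weight matrix, not against Q or K.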
Why Should a Single Paper Consume an Entire Graduate Career?
自动驾驶之心· 2025-06-23 08:03
Core Viewpoint
- The article emphasizes the challenges faced by students in publishing academic papers in cutting-edge fields like autonomous driving, embodied intelligence, and robotics, and introduces a comprehensive tutoring service to assist them in overcoming these obstacles [2][3].

Group 1: Company Overview
- The company is described as the largest AI technology self-media platform in China, focusing on autonomous driving, embodied intelligence, and robotics, with a deep understanding of the challenges and opportunities in these interdisciplinary fields [3].
- The tutoring service has a team of over 300 dedicated instructors from globally recognized universities, with a high manuscript submission success rate of 96% over the past three years [3].

Group 2: Services Offered
- The company provides a full range of tutoring services, including assistance with topic selection, experimental design, model optimization, and writing for undergraduate, master's, and doctoral students [4][12].
- Specific areas of tutoring include large models, end-to-end autonomous driving, multi-sensor fusion, and various advanced topics in AI and robotics [5][11].

Group 3: Tutoring Approach
- The tutoring service emphasizes personalized, one-on-one guidance tailored to the specific research interests and backgrounds of students, avoiding a one-size-fits-all approach [9].
- The instructors possess extensive experience in publishing in top-tier conferences and journals, ensuring familiarity with the review processes and preferences [8].

Group 4: Problem-Solving Capabilities
- The service aims to address common challenges faced by students, such as finding innovative research topics, conducting literature reviews, designing experiments, and improving writing quality [10][12].
- The company also focuses on enhancing the likelihood of paper acceptance by providing strategic advice on journal and conference selection, as well as effective responses to reviewer comments [12][15].
A Deep-Yet-Accessible Walkthrough of the Core Fundamentals of LoRA (Low-Rank Adaptation)
自动驾驶之心· 2025-06-22 14:09
Efficient fine-tuning of large models has become a focal point for the industry. Whether for general-purpose large models or autonomous-driving large models, how to turn them into specialized models for different domains through lightweight fine-tuning is a hot topic of discussion, so today let's go through LoRA together.

Background: large companies and research institutions in the industry have enough resources to develop large models, but for ordinary small companies or individuals, developing their own large model is nearly impossible. A single training run of a model like ChatGPT costs over ten million dollars, and even for DeepSeek-V3 a single training run costs more than five million dollars. Making full use of open-source large models and fine-tuning them efficiently on domain tasks has therefore become an urgent problem for both academia and industry, and this is where LoRA comes in.

The idea behind LoRA is simple: add a bypass alongside the original PLM (Pre-trained Language Model) that performs a down-projection followed by an up-projection, approximating the model's so-called intrinsic rank.

This down-projection relies on low-rank decomposition, so let's briefly review low-rank decomposition.

So what are LoRA's training procedure and its advantages? During training, the PLM's parameters are frozen, and only the down-projection matrix A ...
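The bypass described above (down-project with A, up-project with B, keep W frozen) can be sketched in a few lines. The dimensions are illustrative; following the LoRA paper, B starts at zero so the adapted model initially matches the frozen PLM exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 512, 512, 8                   # frozen weight is d x k; rank r << min(d, k)

W = rng.standard_normal((d, k))         # pretrained weight, frozen
A = rng.standard_normal((r, k)) * 0.01  # down-projection, trainable
B = np.zeros((d, r))                    # up-projection, trainable, zero-initialized

def lora_forward(x):
    """h = W x + B (A x): the frozen path plus the low-rank bypass."""
    return W @ x + B @ (A @ x)

x = rng.standard_normal(k)
# With B = 0 the bypass contributes nothing, so the output equals W @ x.
```

Note the parameter saving: training A and B touches r * (d + k) = 8,192 values here, versus d * k = 262,144 for full fine-tuning, and at deployment the learned update B @ A can be folded back into W with no inference overhead.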
Reinforcement Learning for Large Models: Is DPO Still No Match for PPO?
自动驾驶之心· 2025-06-22 14:09
Core Insights
- The article discusses the theoretical and experimental shortcomings of DPO (Direct Preference Optimization) compared to PPO (Proximal Policy Optimization), highlighting that while DPO appears to lead on open-source benchmarks, top closed-source models like GPT-4 and Claude use PPO [1][2].

DPO's Deficiencies
- DPO encounters issues similar to reward hacking: even without an explicit reward model, it can produce solutions that do not align with human preferences [2].
- The theoretical analysis shows that, given true reward signals, the set of policies reachable by PPO is a proper subset of those reachable by DPO, meaning DPO can also land on solutions that deviate far from the reference policy [3].

Experimental Findings
- Experiments reveal that DPO can assign higher probabilities to data points not covered by the preference dataset, leading to unexpected behaviors, while PPO optimizes effectively under KL constraints [6].
- DPO's performance can be improved by reducing distribution drift through methods like SafeSFT, but it still does not surpass PPO [8].

Performance Metrics
- Benchmark results consistently show that PPO outperforms both iterative DPO and plain DPO across tasks, particularly in programming competitions [10].
- Specific metrics indicate that models using PPO achieve significantly higher pass rates than those using DPO, with PPO models reaching up to 44.4% on the pass@5 metric, while DPO models struggle to achieve meaningful results [11][12].

Conclusion
- The findings suggest that while DPO has theoretical merits, its practical value on high-stakes tasks like programming remains limited compared to PPO, which continues to set new performance standards [13].
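For reference, the standard DPO objective being critiqued here scores a preference pair by the policy-vs-reference log-probability margin. A minimal sketch with scalar sequence log-probabilities (the numeric values are illustrative):

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO objective: -log sigmoid(beta * margin), where the margin
    is the policy's log-ratio gap over the reference model between the
    chosen and rejected responses."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the policy equals the reference, the margin is 0 and the loss is log 2.
loss = dpo_loss(-10.0, -12.0, ref_chosen=-10.0, ref_rejected=-12.0)
```

The article's core complaint is visible in this form: the loss only constrains the *gap* between chosen and rejected responses, so probability mass can drift onto sequences the preference dataset never covered, which is exactly the out-of-distribution behavior PPO's explicit KL penalty suppresses.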