Diffusion Models
Six Months in the Making! The Small-Class Course on End-to-End and VLA Autonomous Driving Is Here (Single-Stage / Two-Stage / Diffusion Models / VLA, and More)
自动驾驶之心· 2025-07-09 12:02
Core Viewpoint
- End-to-End Autonomous Driving is the core algorithm for the next generation of intelligent driving mass production, marking a significant shift in the industry towards more integrated and efficient systems [1][3].

Group 1: End-to-End Autonomous Driving Overview
- End-to-End Autonomous Driving can be categorized into single-stage and two-stage approaches, with the former directly modeling vehicle planning and control from sensor data, thus avoiding the error accumulation seen in modular methods (a toy sketch of the single-stage formulation follows this summary) [1][4].
- The emergence of UniAD has initiated a new wave of competition in the autonomous driving sector, with various algorithms rapidly developing in response to its success [1][3].

Group 2: Challenges in Learning and Development
- The rapid advancement of the technology has made previous educational resources outdated, creating a need for updated learning paths that encompass multi-modal large models, BEV perception, reinforcement learning, and more [3][5].
- Beginners face significant challenges due to the fragmented nature of knowledge across various fields, making it difficult to extract frameworks and understand development trends [3][6].

Group 3: Course Structure and Content
- The course on End-to-End and VLA Autonomous Driving aims to address these challenges by providing a structured learning path that includes practical applications and theoretical foundations [5][7].
- The curriculum covers the history and evolution of End-to-End algorithms, the background knowledge necessary for understanding current technologies, and practical applications of various models [8][9].

Group 4: Key Technologies and Innovations
- The course highlights significant advancements in two-stage and single-stage End-to-End methods, including notable algorithms like PLUTO and DiffusionDrive, which represent the forefront of research in the field [4][10][12].
- The integration of vision-language-action (VLA) models into End-to-End systems is emphasized as a critical area of development, with companies actively exploring new-generation mass production solutions [13][14].

Group 5: Expected Outcomes and Skills Development
- Upon completion of the course, participants are expected to reach a level equivalent to one year of experience as an End-to-End Autonomous Driving algorithm engineer, mastering various methodologies and key technologies [22][23].
- The course aims to equip participants with the ability to apply learned concepts to real-world projects, enhancing their employability in the autonomous driving sector [22][23].
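To make the single-stage formulation above concrete, here is a minimal, hypothetical PyTorch sketch of a planner that maps a camera frame directly to future ego waypoints, with no intermediate perception or prediction modules whose errors could accumulate. The architecture, module names, and dimensions are illustrative assumptions only, not UniAD, PLUTO, or any course material.

```python
# Minimal single-stage end-to-end planner sketch (illustrative only).
# Camera images in, future ego trajectory out, with no hand-off to a
# separate perception module where errors could accumulate.
import torch
import torch.nn as nn

class SingleStagePlanner(nn.Module):
    def __init__(self, horizon: int = 6):
        super().__init__()
        self.horizon = horizon
        # Small CNN stands in for a BEV/perception backbone.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Head regresses (x, y) waypoints for `horizon` future steps.
        self.head = nn.Linear(64, horizon * 2)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.encoder(images)                        # (B, 64)
        return self.head(feats).view(-1, self.horizon, 2)   # (B, T, 2)

planner = SingleStagePlanner()
waypoints = planner(torch.randn(1, 3, 128, 128))  # dummy camera frame
print(waypoints.shape)                            # torch.Size([1, 6, 2])
```

In a two-stage pipeline, by contrast, a separately trained perception model would hand explicit detections or occupancy to a downstream planner, and each hand-off is a place for errors to compound.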
The "Huangpu Military Academy" of Autonomous Driving: A Place Relentlessly Devoted to Technology~
自动驾驶之心· 2025-07-06 12:30
Core Viewpoint
- The article discusses the transition of autonomous driving technology from Level 2/3 (assisted driving) to Level 4/5 (fully autonomous driving), highlighting the challenges and opportunities in the industry as well as the evolving skill requirements for professionals in the field [2].

Industry Trends
- The shift towards high-level autonomous driving is creating a competitive landscape in which traditional sensor-based approaches, such as LiDAR, are being challenged by cost-effective vision-based solutions like Tesla's [2].
- Demand for skills in reinforcement learning and advanced perception algorithms is increasing, creating a sense of urgency among professionals to upgrade their capabilities [2].

Talent Market Dynamics
- The article notes growing anxiety among seasoned professionals as they face the need to adapt to new technologies and methodologies, while newcomers struggle with the overwhelming number of career paths available in the autonomous driving sector [2].
- Falling LiDAR costs, exemplified by Hesai Technology's price drop to $200 and BYD's 70% price reduction, indicate a market shift that requires continuous learning and adaptation from industry professionals [2].

Community and Learning Resources
- The establishment of the "Autonomous Driving Heart Knowledge Planet" aims to create a comprehensive learning community for professionals, offering resources and networking opportunities to help individuals navigate the rapidly changing landscape of autonomous driving technology [7].
- The community has attracted nearly 4,000 members and over 100 industry experts, providing a platform for knowledge sharing and career advancement [7].

Technical Focus Areas
- The article outlines several key technical areas within autonomous driving, including end-to-end driving systems, perception algorithms, and the integration of AI models for improved performance [10][11].
- It emphasizes the importance of understanding subfields such as multi-sensor fusion, high-definition mapping, and AI model deployment, which are critical for the development of autonomous driving technologies [7].
Why Are Hundreds of Thousands of People Watching a Sparkling Water Ad? It Turns Out the Whole Thing Was Generated by Veo 3
机器之心· 2025-07-06 06:06
Core Viewpoint
- The article discusses the emergence of AI-generated content, focusing on a viral advertisement created by the team "Too Short for Modeling," which showcases the advances in AI video generation enabled by the Veo 3 model [2][3][4].

Group 1: AI Video Generation
- The team's advertisement has garnered over 300,000 views on social media, highlighting the growing interest in AI-generated content [2].
- The Veo 3 model introduces a new audio-visual synchronization feature, significantly lowering the barriers to video creation and enhancing the practicality of AI in this field [4].
- The advertisement demonstrates impressive character consistency, smoothly transitioning through 10 scenes within a minute while maintaining a high degree of stylistic uniformity [7].

Group 2: Technical Insights
- The creators achieved this consistency through "hyper-specific prompting," which involves giving the model detailed, context-rich instructions that minimize its creative freedom (a sketch of this technique follows this summary) [9][10].
- Despite these advances, AI-generated videos still suffer from issues such as character appearance drift and object distortion, attributed to limitations of the underlying technology and training data [8][14].
- The article outlines several reasons for these inconsistencies, including the model's reliance on probability rather than true understanding, the difficulty of maintaining coherence across frames, and the quality of training data [19][14].

Group 3: Creative Potential of AI
- The article emphasizes AI's potential as a "creative catalyst," suggesting it can inspire innovative ideas such as creating parallel universes for favorite movies [17][22].
- It encourages exploring AI's capabilities across creative domains, including website development and concept film production [24][25].
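As a rough illustration of the hyper-specific prompting technique mentioned in Group 2, the snippet below assembles constrained scene prompts from fixed character and style blocks; the wording, names, and function are invented for illustration and do not reflect the team's actual prompts or any documented Veo 3 interface.

```python
# Hyper-specific prompting sketch: pin down character, style, camera, and
# lighting in every scene prompt so the model has little room to drift.
CHARACTER = (
    "a woman in her 30s, shoulder-length black hair, silver hoop earrings, "
    "mint-green blazer over a white t-shirt"
)
STYLE = "bright commercial look, pastel palette, soft studio lighting, 35mm lens"

def scene_prompt(action: str, setting: str) -> str:
    # Repeating the full character/style description in every scene is the
    # trick that keeps appearance consistent across independently generated clips.
    return (
        f"{CHARACTER}, {action}, in {setting}. "
        f"{STYLE}. Static camera, eye level, no cuts."
    )

prompts = [
    scene_prompt("opens a can of sparkling water", "a sunlit kitchen"),
    scene_prompt("raises the can toward the camera and smiles", "the same kitchen"),
]
for p in prompts:
    print(p, end="\n\n")
```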
I Had Decided to Move into Embodied Intelligence, but Now I'm Having Second Thoughts...
自动驾驶之心· 2025-07-05 09:12
Core Insights
- The article discusses the evolving landscape of embodied intelligence, highlighting its transition from a period of hype to a more measured phase as the industry recognizes that the technology has not yet reached the productivity stage [2].

Group 1: Industry Trends
- Embodied intelligence has gained significant attention over the past few years, but the industry now recognizes that it is still in the early stages of development [2].
- There is growing demand for skills in multi-sensor fusion and robotics, particularly in areas like SLAM and ROS, which are crucial for working on embodied intelligence [3][4].
- Many companies in the robotics sector are developing rapidly, with numerous startups receiving substantial funding, indicating a positive outlook for the industry in the coming years [3][4].

Group 2: Job Market and Skills Development
- The job market for algorithm positions is competitive, with a focus on cutting-edge technologies such as end-to-end models, VLA, and reinforcement learning [3].
- Candidates with a background in robotics and a solid grasp of the latest technologies are likely to find opportunities, especially as traditional robotics remains a primary product line [4].
- The article encourages individuals to strengthen their technical skills in robotics and embodied intelligence to remain competitive in the job market [3][4].

Group 3: Community and Resources
- The article promotes a community platform that offers resources for learning about autonomous driving and embodied intelligence, including video courses and job postings [5].
- The community aims to gather professionals and students interested in intelligent driving and embodied intelligence, fostering collaboration and knowledge sharing [5].
- The platform provides access to the latest industry trends, technical discussions, and job opportunities, making it a valuable resource for those looking to enter or advance in the field [5].
ICCV 2025 | Reducing Spatiotemporal Redundancy in Diffusion Models: SJTU's EEdit Achieves Training-Free Image Editing Acceleration
机器之心· 2025-07-05 02:46
Core Viewpoint
- The article discusses the latest research from Professor Zhang Linfeng's team at Shanghai Jiao Tong University, introducing EEdit, a novel framework that improves image-editing efficiency by addressing spatial and temporal redundancy in diffusion models, achieving a speedup of over 2.4x compared to previous methods [1][6][8].

Summary by Sections

Research Motivation
- The authors identified significant spatial and temporal redundancy in diffusion-model-based image editing, which leads to unnecessary computation, particularly in non-edited areas [12][14].
- The study highlights that the inversion process incurs higher temporal redundancy, suggesting that reducing redundant time steps can significantly accelerate editing [14].

Method Overview
- EEdit employs a training-free caching acceleration framework that reuses output features to compress the inversion time steps and controls the refresh frequency of marked regions through region score rewards (a simplified sketch of the caching idea follows this summary) [15][17].
- The framework adapts to various input types for editing tasks, including reference images, prompt-based editing, and drag-region guidance [10][15].

Key Features of EEdit
- EEdit achieves over 2.4x faster inference than the unaccelerated version and up to 10x speedup compared to other image-editing methods [8][9].
- The framework eliminates the computational waste caused by spatial and temporal redundancy, optimizing the editing process without compromising quality [9][10].
- EEdit supports multiple input guidance types, enhancing its versatility in image-editing tasks [10].

Experimental Results
- EEdit was evaluated on several benchmarks, demonstrating superior efficiency and quality compared to existing methods [26][27].
- EEdit outperformed other methods on PSNR, LPIPS, SSIM, and CLIP metrics, showcasing its competitive edge in both speed and quality [27][28].
- The spatial locality caching algorithm (SLoC) used in EEdit proved more effective than other caching methods, achieving better acceleration and foreground preservation [29].
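The caching idea in the Method Overview can be sketched in a few lines: tokens inside the edited region accumulate score rewards faster and are recomputed often, while background tokens keep reusing cached features. The class and scoring rule below are simplifying assumptions for illustration, not the released SLoC implementation.

```python
# Simplified region-aware feature cache in the spirit of EEdit's SLoC
# (illustrative; thresholds, scoring, and names are assumptions).
import numpy as np

class RegionFeatureCache:
    def __init__(self, num_tokens: int, dim: int, refresh_threshold: float = 0.5):
        self.features = np.zeros((num_tokens, dim))  # cached token features
        self.scores = np.zeros(num_tokens)           # accumulated region scores
        self.threshold = refresh_threshold

    def step(self, compute_fn, edit_mask: np.ndarray) -> np.ndarray:
        # Tokens inside the edit region earn a larger score reward, so they
        # cross the refresh threshold sooner and are recomputed more often;
        # background tokens mostly reuse their cached features.
        self.scores += np.where(edit_mask, 1.0, 0.1)
        stale = self.scores >= self.threshold
        if stale.any():
            self.features[stale] = compute_fn(np.where(stale)[0])
            self.scores[stale] = 0.0                 # reset refreshed tokens
        return self.features

# Toy usage: 16 tokens, 8-dim features, tokens 4-7 are being edited.
cache = RegionFeatureCache(num_tokens=16, dim=8)
mask = np.zeros(16, dtype=bool)
mask[4:8] = True
feats = cache.step(lambda idx: np.random.randn(len(idx), 8), mask)
```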
Physicists Draw on Biology to Uncover the Source of AI Creativity: It Turns Out to Stem from a "Technical Flaw"
量子位· 2025-07-04 04:40
Core Viewpoint
- The creativity exhibited by AI, particularly in diffusion models, is hypothesized to be a consequence of the model architecture itself, rather than a flaw or limitation [1][3][19].

Group 1: Background and Hypothesis
- AI systems, especially diffusion models like DALL·E and Stable Diffusion, are trained to reproduce their training data but often produce novel images instead [3][4].
- Researchers have been puzzled by the apparent creativity of these models, questioning how they generate new samples rather than merely memorizing data [8][6].
- Physicists Mason Kamb and Surya Ganguli hypothesize that the denoising process in diffusion models may lose information, akin to assembling a puzzle without its instructions [8][9].

Group 2: Mechanisms of Creativity
- The study draws parallels between self-assembly processes in biological systems and the functioning of diffusion models, focusing on local interactions and symmetry [11][14].
- Locality and equivariance in diffusion models are seen as both limitations and sources of creativity, since they force the model to work on small pixel neighborhoods without access to the complete picture (a generic illustration of these two properties follows this summary) [15][19].
- The researchers built a system called the Equivariant Local Score Machine (ELS) to validate their hypothesis; it matched the outputs of trained diffusion models with 90% accuracy [18][19].

Group 3: Implications and Further Questions
- The findings suggest that the creativity of diffusion models may be an emergent property of their operational dynamics rather than a separate, higher-level phenomenon [19][21].
- Questions remain about the creativity of other AI systems, such as large language models, which do not rely on the same mechanisms of locality and equivariance [21][22].
- The research suggests that both human and AI creativity may stem from an incomplete understanding of the world, leading to novel and valuable outputs [21][22].
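The locality and equivariance properties central to the Kamb-Ganguli hypothesis are easy to demonstrate with any convolution-like operation: each output pixel depends only on a small neighborhood (locality), and translating the input translates the output identically (equivariance). The check below is a generic NumPy illustration of those two properties, not the authors' ELS machine.

```python
# Translation equivariance of a local operation: denoising a shifted image
# gives the same result as shifting the denoised image.
import numpy as np

def local_denoise(x: np.ndarray) -> np.ndarray:
    # Locality: each output pixel is an average over its 3x3 periodic
    # neighborhood; no pixel ever "sees" the whole image at once.
    out = np.zeros_like(x)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out += np.roll(x, (dy, dx), axis=(0, 1))
    return out / 9.0

rng = np.random.default_rng(0)
image = rng.random((32, 32))
shifted = np.roll(image, (3, 7), axis=(0, 1))

# Equivariance check: should print True.
print(np.allclose(local_denoise(shifted),
                  np.roll(local_denoise(image), (3, 7), axis=(0, 1))))
```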
Wherever You Draw, It Moves! ByteDance Releases ATI, a "Magic Paintbrush" for Video Generation, Now Open Source!
机器之心· 2025-07-02 10:40
Core Viewpoint
- The article discusses ATI, a new controllable video generation framework from ByteDance that lets users create dynamic videos by drawing trajectories on static images, transforming user input into explicit control signals for object and camera movements [2][4].

Group 1: Introduction to ATI
- Angtian Wang, a researcher at ByteDance focusing on video generation and 3D vision, highlights the advances in video generation driven by diffusion models and transformer architectures [1].
- Current mainstream methods face a significant bottleneck in providing effective and intuitive motion control to users, limiting creative expression and practical application [2].

Group 2: Methodology of ATI
- ATI accepts two basic inputs: a static image and a set of user-drawn trajectories, which can take any shape, including lines and curves [6].
- The Gaussian Motion Injector encodes these trajectories into motion vectors in latent space, guiding the video generation process frame by frame (a simplified sketch of this encoding follows this summary) [6][14].
- The model uses Gaussian weights so that it can "see" the drawn trajectories and relate them to the generated video [8][14].

Group 3: Features and Capabilities
- Users can draw trajectories for key actions like running or jumping, with ATI accurately sampling and encoding joint movements to generate natural motion sequences [19].
- ATI can handle up to 8 independent trajectories simultaneously, keeping object identities distinct during complex interactions [21].
- The system supports synchronized camera movements, enabling users to create dynamic videos with cinematic techniques like panning and tilting [23][25].

Group 4: Performance and Applications
- ATI demonstrates strong cross-domain generalization, supporting artistic styles ranging from realistic film to cartoon and watercolor rendering [28].
- Users can create non-realistic motion effects, such as flying or stretching, opening creative possibilities for sci-fi or fantasy scenes [29].
- The high-precision model based on Wan2.1-I2V-14B can generate videos comparable to real footage, while a lightweight version is available for real-time interaction in resource-constrained environments [30].

Group 5: Open Source and Community
- The Wan2.1-I2V-14B model version of ATI has been open-sourced on Hugging Face, enabling high-quality, controllable video generation for researchers and developers [32].
- Community support is growing, with tools like ComfyUI-WanVideoWrapper available to optimize model performance on consumer-grade GPUs [32].
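The Gaussian Motion Injector described in Group 2 can be approximated as follows: each user-drawn trajectory point is splatted onto a latent grid with a Gaussian weight, yielding a soft per-frame motion field for the generator to condition on. This is a simplified assumption of the idea, with an invented function name, not ByteDance's released ATI code.

```python
# Gaussian splatting of a user trajectory into per-frame motion fields
# (simplified illustration of a "Gaussian Motion Injector"-style encoding).
import numpy as np

def gaussian_motion_field(trajectory, h, w, sigma=3.0):
    """trajectory: (T, 2) array of (x, y) points, one per frame.
    Returns (T-1, h, w, 2): frame-to-frame displacement vectors, weighted
    by a Gaussian centered on the current trajectory point."""
    ys, xs = np.mgrid[0:h, 0:w]
    fields = []
    for t in range(len(trajectory) - 1):
        cx, cy = trajectory[t]
        dxy = trajectory[t + 1] - trajectory[t]   # motion at this frame
        weight = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma**2))
        fields.append(weight[..., None] * dxy)    # (h, w, 2)
    return np.stack(fields)

# Toy usage: a diagonal stroke across a 64x64 latent grid over 5 frames.
traj = np.array([[10.0, 10.0], [20, 18], [30, 26], [40, 34], [50, 42]])
field = gaussian_motion_field(traj, h=64, w=64)
print(field.shape)  # (4, 64, 64, 2)
```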
Free Dinner! Meet 机器之心 at the ICML 2025 Talent Dinner in Canada
机器之心· 2025-07-01 09:34
Core Viewpoint
- The AI field continues to develop rapidly in 2025, with significant breakthroughs in image and video generation, particularly through diffusion models that raise image synthesis quality and enable synchronized audio generation in video content [1][2].

Group 1: AI Technology Advancements
- Diffusion models have driven unprecedented improvements in image synthesis quality, enhancing resolution, style control, and semantic understanding [2].
- Video generation technology has evolved, exemplified by Google's Veo 3, which achieves native audio synchronization, a significant advance in video generation capabilities [2].

Group 2: Academic Collaboration and Events
- ICML, a leading academic conference in the AI field, will take place from July 13 to July 19, 2025, in Vancouver, Canada, showcasing top research achievements [4].
- The "Yunfan · ICML 2025 AI Talent Meetup" is organized to facilitate informal discussions among professionals, focusing on cutting-edge technologies and talent dialogue [5][7].

Group 3: Event Details
- The meetup will feature talks by young scholars, talent showcases, interactive experiences, institutional presentations, and networking dinners, aimed at fostering discussion of key issues in technology and application [7][8].
- The event is scheduled for July 15, 2025, from 16:00 to 20:30, with a capacity of 200 participants [8].
Joint Release from UofT, UBC, MIT, Fudan, and Others: A Comprehensive Survey of Diffusion-Model-Driven Anomaly Detection and Generation
机器之心· 2025-06-30 23:48
Diffusion Models (DMs) have shown great potential in recent years, achieving remarkable progress in many tasks across computer vision and natural language processing, while anomaly detection (AD), a key research task in artificial intelligence, plays an important role in many real-world scenarios such as industrial manufacturing, financial risk control, and medical diagnosis. Recently, researchers from the University of Toronto, the University of British Columbia, MIT, the University of Sydney, Cardiff University, Fudan University, and other well-known institutions jointly completed a long-form survey titled "Anomaly Detection and Generation with Diffusion Models: A Survey", the first to focus on the application of DMs to anomaly detection and generation. The survey systematically reviews the latest progress in image, video, time-series, tabular, and multimodal anomaly detection, provides a comprehensive taxonomy from the diffusion-model perspective, and, drawing on current trends in generative AI, looks ahead to future directions and opportunities, aiming to guide researchers and practitioners in the field.

Paper title: Anomaly Detection and Generation with Diffusion Models: A Survey

Paper link: https://arxiv.org/pdf/2506.09368 ...
ICML 2025 Spotlight | A New Theoretical Framework Unlocks Guided Generation for Flow Matching Models
机器之心· 2025-06-28 02:54
Core Viewpoint
- The article introduces a novel energy-guidance theoretical framework for flow matching models, filling the gap in energy-guidance algorithms for this model class and proposing several practical algorithms suited to different tasks [2][3][27].

Summary by Sections

Research Background
- Energy guidance is a crucial technique in the application of generative models: ideally, it shifts the distribution of generated samples to align with a given energy function while still adhering to the training distribution [7][9].
- Existing energy-guidance algorithms focus primarily on diffusion models, which differ fundamentally from flow matching models, so a general energy-guidance theoretical framework for flow matching is needed [9].

Method Overview
- The authors derive a general flow matching energy-guidance vector field from the foundational definitions of flow matching models, leading to three categories of practical, training-free energy-guidance algorithms (a rough, simplified sketch of the guidance idea follows this summary) [11][12].
- The guidance vector field is designed to steer the original vector field towards regions of lower energy [12].

Experimental Results
- Experiments on synthetic data, offline reinforcement learning, and image linear inverse problems demonstrate the effectiveness of the proposed algorithms [20][22].
- On synthetic datasets, the Monte Carlo sampling-based guidance algorithm came closest to the ground-truth distribution, validating the correctness of the flow matching guidance framework [21].
- In offline reinforcement learning tasks, Monte Carlo sampling guidance performed best, owing to the need for stable guidance samples across time steps [23].
- For image inverse problems, Gaussian approximation guidance and GDM performed best, while Monte Carlo sampling struggled due to high dimensionality [25].

Conclusion
- The work fills a significant gap in energy-guidance algorithms for flow matching models, providing a new theoretical framework and several practical algorithms, along with theoretical analysis and experimental comparisons to guide real-world applications [27].
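As a rough, hedged illustration of the guidance idea (a simplified reading, not the paper's exact derivation or any of its three algorithm families), one simple way to steer an Euler-integrated flow matching sampler toward low-energy regions is to subtract a scaled energy gradient from the learned vector field at every step; all names and the guidance coefficient below are assumptions.

```python
# Energy-guided Euler sampling for a flow matching model (illustrative).
# v_fn(x, t) is the learned vector field; energy_fn(x) is the scalar energy
# we want to be low at the samples. Guidance term: a scaled -grad E(x).
import numpy as np

def grad_energy(energy_fn, x, eps=1e-4):
    # Finite-difference gradient of the energy (fine for this toy 2D case).
    g = np.zeros_like(x)
    for i in range(x.shape[-1]):
        dx = np.zeros_like(x)
        dx[..., i] = eps
        g[..., i] = (energy_fn(x + dx) - energy_fn(x - dx)) / (2 * eps)
    return g

def guided_sample(v_fn, energy_fn, x0, steps=100, guidance=1.0):
    x, dt = x0.copy(), 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = v_fn(x, t) - guidance * grad_energy(energy_fn, x)  # guided field
        x = x + dt * v                                         # Euler step
    return x

# Toy: a vector field that contracts toward the origin, plus an energy
# that penalizes samples whose first coordinate is negative.
v_fn = lambda x, t: -x
energy_fn = lambda x: np.maximum(0.0, -x[..., 0])
samples = guided_sample(v_fn, energy_fn, np.random.randn(256, 2))
```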