Diffusion Model

The 2025 autumn recruitment season has begun, and I've been feeling a bit lost lately...
自动驾驶之心· 2025-07-08 07:53
Core Viewpoint
- The article discusses current trends and opportunities in autonomous driving and embodied intelligence, emphasizing that job seekers in these fields need strong technical skills and familiarity with cutting-edge technologies [3][4].

Group 1: Job Market Insights
- The job market for autonomous driving and embodied intelligence is competitive, with high demand for candidates who have strong backgrounds and technical skills [2][3].
- Companies increasingly look for expertise in advanced areas such as end-to-end models, vision-language models (VLMs), and reinforcement learning [3][4].
- Talent in traditional robotics is saturated, yet many robotics startups are growing rapidly and attracting significant funding [3][4].

Group 2: Learning and Development
- The article encourages readers to strengthen their technical skills, particularly in SLAM (Simultaneous Localization and Mapping) and ROS (Robot Operating System), both central to robotics and embodied intelligence [3][4].
- A community platform is mentioned that offers video courses, hardware learning materials, and job information, aiming to build a large professional network in intelligent driving and embodied intelligence [5].

Group 3: Technical Trends
- The article highlights four major technical directions in the industry: vision-language models, world models, diffusion models, and end-to-end autonomous driving [8].
- It links to resources and papers on these technologies, reflecting a focus on the latest advancements and applications in the field [9][10].
A master's student from a non-985/211 university, feeling a bit lost in this year's job search...
自动驾驶之心· 2025-06-30 05:51
Core Viewpoint
- The article emphasizes the importance of advanced skills and knowledge in autonomous driving and embodied intelligence, highlighting the industry's demand for candidates with strong backgrounds.

Group 1: Industry Trends
- Demand for talent in autonomous driving and embodied intelligence is rising, with a focus on cutting-edge technologies such as SLAM, ROS, and large models [3][4].
- Many companies are moving from traditional methods to more advanced techniques, shifting the skill sets required of job seekers [3][4].
- While talent is saturated in some areas, the growth of robotics startups presents new opportunities for learning and development [3][4].

Group 2: Learning and Development
- The article encourages readers to strengthen their technical skills in robotics and embodied intelligence, seen as the forefront of technology [3][4].
- It notes the resources and community support available for learning, including courses, hardware materials, and job information on platforms such as Knowledge Planet [5][6].
- The community aims to build a comprehensive ecosystem for knowledge sharing and recruitment in intelligent driving and embodied intelligence [5][6].

Group 3: Technical Directions
- The article outlines four major technical directions in the industry: large vision-language models, world models, diffusion models, and end-to-end autonomous driving [7].
- It stresses keeping up with the latest research and developments in these areas, linking to resources and papers for further exploration [8][9].
100+ autonomous driving datasets: you should at least know these 5, right?
自动驾驶之心· 2025-06-22 01:35
Core Viewpoint
- The article emphasizes the growing importance of autonomous driving technology and highlights the availability of over 100 high-quality datasets for developers and researchers in the field. It introduces five key datasets covering tasks from perception to visual odometry, providing valuable resources for both beginners and experienced engineers [2].

Dataset Summaries
1. KITTI Dataset
- KITTI is one of the most classic and widely used benchmark datasets in autonomous driving. It was collected in Karlsruhe, Germany, with high-precision sensors including stereo color/grayscale cameras, a Velodyne 3D LiDAR, and GPS/IMU. The dataset includes annotations for stereo vision, optical flow, visual odometry, and 3D object detection and tracking, making it a standard benchmark for vehicle vision algorithms [3].
2. nuScenes Dataset
- nuScenes is a large-scale multi-sensor dataset released by Motional, covering 1,000 continuous driving scenes in Boston and Singapore and totaling roughly 15 hours of data. It carries a full sensor suite: six cameras, five millimeter-wave radars, one top-mounted LiDAR, and IMU/GPS. The dataset provides around 1.4 million high-resolution camera images and 390,000 LiDAR sweeps, annotated with 3D bounding boxes for 23 object categories, making it well suited to research on complex urban road scenarios (a minimal loading sketch follows this list) [5][7].
3. Waymo Open Dataset
- The Waymo Open Dataset, released by Waymo (an Alphabet company), is one of the largest open data resources for autonomous driving. It has two main parts: a perception dataset with 2,030 scenes of high-resolution camera and LiDAR data, and a motion dataset with 103,354 vehicle trajectories and corresponding 3D map information. This extensive multi-sensor dataset spans different times of day, weather conditions, and urban environments, serving as a benchmark for object detection, tracking, and trajectory prediction research [10][12].
4. PathTrack Dataset
- PathTrack focuses on person tracking, containing over 15,000 trajectories across 720 sequences. Its annotation pipeline re-trains an existing person matching network, significantly reducing the classification error rate. The dataset is suitable for 2D/3D object detection, tracking, and trajectory prediction tasks [13][14][15].
5. ApolloScape Dataset
- ApolloScape, released by Baidu Apollo, is a massive autonomous driving dataset notable for its volume and annotation accuracy. It reportedly exceeds comparable datasets in size by more than ten times, containing hundreds of thousands of high-resolution images with pixel-level semantic segmentation annotations. ApolloScape defines 26 semantic categories and covers complex road scenarios, making it applicable to perception, map construction, and simulation training [17][19].
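As a concrete example of working with one of these datasets, the sketch below loads the nuScenes mini split with the official nuscenes-devkit Python package and walks the annotations of the first scene. It is a minimal sketch: the `dataroot` path is a placeholder, and it assumes the v1.0-mini archive has already been downloaded and extracted there.

```python
# pip install nuscenes-devkit
# Minimal nuScenes walkthrough; assumes v1.0-mini is extracted under `dataroot`.
from nuscenes.nuscenes import NuScenes

nusc = NuScenes(version='v1.0-mini', dataroot='/data/sets/nuscenes', verbose=True)

scene = nusc.scene[0]
token = scene['first_sample_token']
while token:
    sample = nusc.get('sample', token)
    # sample['data'] maps sensor channels ('CAM_FRONT', 'LIDAR_TOP', ...) to
    # sensor records; sample['anns'] lists the 3D box annotation tokens.
    for ann_token in sample['anns']:
        ann = nusc.get('sample_annotation', ann_token)
        print(ann['category_name'], ann['size'])  # box size (width, length, height)
    token = sample['next']  # empty string after the last sample in the scene
```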
Over 1000x less data and a first-class video model trained for $500: City University of Hong Kong and Huawei's Pusa is here
机器之心· 2025-06-19 02:28
Core Viewpoint
- The article discusses advances in video generation brought by the Frame-aware Video Diffusion Model (FVDM) and its practical application in the Pusa project, which significantly reduces training costs and enhances video generation capabilities [2][3][37].

Group 1: FVDM and the Pusa Project
- FVDM introduces a vectorized timestep variable (VTV) that gives each frame an independent temporal evolution path, addressing the limitations of the traditional scalar timestep in video generation (a toy sketch of the idea follows this list) [2][18].
- The Pusa project, developed in collaboration with Huawei's Hong Kong Research Institute, serves as a direct application and validation of FVDM, exploring a low-cost way to fine-tune large-scale pre-trained video models [3][37].
- Pusa achieves better results than the official Wan I2V model while cutting training costs by over 200x (from at least $100,000 to $500) and data requirements by over 2,500x [5][37].

Group 2: Technical Innovations
- Pusa applies non-destructive fine-tuning to pre-trained models such as Wan-T2V 14B, enabling effective video generation without compromising the base model's capabilities [5][29].
- FVDM's probabilistic timestep sampling training strategy (PTSS) accelerates convergence and improves performance over the original model [30][31].
- The VTV mechanism supports diverse video generation tasks by giving different frames distinct noise perturbation controls, enabling more fine-grained generation [35][36].

Group 3: Community Engagement and Future Prospects
- The complete codebase, training datasets, and training code for Pusa have been open-sourced to encourage community contributions and collaboration, with the goal of improving performance and exploring new possibilities in video generation [17][37].
- The article argues that Pusa could lead the video generation field into a new era of low cost and high flexibility [36][37].
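The key difference between FVDM's vectorized timestep and a conventional scalar timestep can be shown in a toy training step. This is a hypothetical illustration based only on the description above, not the Pusa code: `denoiser` is a placeholder network and the cosine noise schedule is an assumed choice.

```python
# Toy illustration of a vectorized timestep variable (VTV): each frame of a
# clip gets its own diffusion timestep rather than one scalar shared by all.
# Hypothetical sketch; `denoiser` and the schedule are placeholders.
import torch
import torch.nn.functional as F

def vtv_training_step(denoiser, video, num_steps=1000):
    b, f = video.shape[:2]                     # video: (batch, frames, C, H, W)
    # Scalar-timestep training would draw t of shape (b,); VTV draws (b, f),
    # so every frame can sit at a different point on its noise trajectory.
    t = torch.randint(0, num_steps, (b, f), device=video.device)
    alpha_bar = torch.cos(0.5 * torch.pi * t / num_steps) ** 2   # toy schedule
    a = alpha_bar.view(b, f, 1, 1, 1)
    noise = torch.randn_like(video)
    noisy = a.sqrt() * video + (1 - a).sqrt() * noise            # per-frame noising
    pred = denoiser(noisy, t)                  # conditioned on the timestep vector
    return F.mse_loss(pred, noise)
```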
Challenging next token prediction: is the Diffusion LLM up to the task?
机器之心· 2025-06-08 02:11
Group 1
- The article examines the potential of Diffusion LLMs, particularly Gemini Diffusion, as a significant breakthrough in AI that challenges traditional autoregressive models [3][4][5].
- Gemini Diffusion demonstrates high generation efficiency, achieving an average sampling speed of 1,479 TPS and up to 2,000 TPS on coding tasks, outperforming Gemini 2.0 Flash-Lite by 4-5x [4][6].
- The diffusion architecture's parallel generation mechanism enables efficient processing and could reduce computational costs relative to autoregressive models (see the sketch after this list) [6][7].

Group 2
- Mary Meeker emphasizes that AI is developing faster than the internet era did, highlighting the cost disparity between AI model training and inference [1][2].
- The article suggests that the rise of open-source models in China may affect the global supply chain, indicating a shift in the industry's competitive dynamics [1][2].
- As AI inference costs decline, the balance between computational investment and commercial returns becomes crucial for enterprises [1][2].
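To make the parallel-generation contrast concrete, the sketch below shows a generic decoding loop for a masked (discrete) diffusion language model: every position starts as a mask token and is refined over a fixed number of parallel denoising rounds, whereas autoregressive decoding emits one token per forward pass. This is a generic illustration in the MaskGIT style, not Gemini Diffusion's actual algorithm; `model`, `mask_id`, and the confidence-based commit rule are assumptions.

```python
# Generic masked-diffusion decoding loop (illustrative; not Gemini Diffusion).
# Each round predicts all masked positions in parallel and commits the most
# confident ones, so a handful of rounds replaces seq_len sequential passes.
import torch

@torch.no_grad()
def diffusion_decode(model, seq_len, mask_id, num_rounds=8):
    tokens = torch.full((1, seq_len), mask_id)       # start fully masked
    for r in range(num_rounds):
        logits = model(tokens)                       # (1, seq_len, vocab_size)
        conf, pred = logits.softmax(-1).max(-1)      # per-position confidence
        still_masked = tokens == mask_id
        if not still_masked.any():
            break
        # Commit a growing fraction of the remaining masked positions.
        k = max(1, int(still_masked.sum() * (r + 1) / num_rounds))
        conf = conf.masked_fill(~still_masked, -1.0) # only masked slots compete
        idx = conf.topk(k, dim=-1).indices
        tokens[0, idx[0]] = pred[0, idx[0]]
    return tokens
```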
Three top AI technologists share a rare joint stage to discuss the AI industry's biggest "Rashomon"
36Ke· 2025-05-28 11:59
Core Insights
- The AI industry is engaged in a significant debate over the effectiveness of pre-training versus first principles, with notable figures such as OpenAI co-founder Ilya Sutskever suggesting that pre-training has reached its limits [1][2].
- A shift from consensus-driven approaches toward exploring non-consensus methods is evident, as companies and researchers seek innovative solutions in AI [6][7].

Group 1: Industry Trends
- The AI landscape is transitioning from a focus on pre-training to alternative methodologies, with companies such as Sand.AI and NLP LAB leading the application of multi-modal architectures to language and video models [3][4].
- New models such as Dream 7B demonstrate the potential of applying diffusion models to language tasks, outperforming larger models such as DeepSeek V3 [3][4].
- The consensus around pre-training is being challenged, though some experts argue it is not over yet, as untapped data remains that could enhance model performance [38][39].

Group 2: Company Perspectives
- Alibaba's Qwen team, led by Lin Junyang, has faced criticism for being conservative, yet it emphasizes that extensive experimentation has produced valuable insights and ultimately reaffirmed the effectiveness of the Transformer architecture [5][15].
- Exploration of Mixture of Experts (MoE) models continues, with the team recognizing their potential for scaling while grappling with training stability (a routing sketch follows this list) [16][20].
- The industry is increasingly focused on optimizing model efficiency and effectiveness, particularly the balance between model size and performance [19][22].

Group 3: Technical Innovations
- The integration of different model architectures, such as diffusion models for language generation, reflects a broader trend of innovation in AI [3][4].
- Training models on long sequences, and the effective optimization strategies this demands, remain critical areas of focus for researchers [21][22].
- Future breakthroughs may come from leveraging increased computational power to revisit previously unviable techniques, suggesting a cycle of innovation driven by hardware advances [40][41].
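The MoE design discussed above scales parameter count without scaling per-token compute by routing each token to a small subset of expert networks. Below is a minimal top-k routing sketch of that generic pattern; it is purely illustrative and unrelated to Qwen's actual implementation.

```python
# Minimal top-k MoE routing (generic illustration; not Qwen's implementation).
# A gating layer scores the experts per token and only the top-k experts run,
# so total parameters grow with num_experts while per-token FLOPs stay flat.
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    def __init__(self, dim, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                        # x: (num_tokens, dim)
        weights, idx = self.gate(x).topk(self.k, dim=-1)
        weights = weights.softmax(-1)            # normalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):               # plain loops for clarity, not speed
            for e, expert in enumerate(self.experts):
                sel = idx[:, slot] == e          # tokens routed to expert e
                if sel.any():
                    out[sel] += weights[sel, slot, None] * expert(x[sel])
        return out
```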
ICML 2025 Spotlight | Examining image adversarial perturbations via Fourier decomposition, with code open-sourced
机器之心· 2025-05-18 04:25
Core Viewpoint
- The article presents a novel approach to adversarial purification in computer vision that works in the frequency domain, effectively separating adversarial perturbations from clean images while preserving semantic information [5][21].

Research Background
- Adversarial samples pose significant challenges to the safety and robustness of computer vision models, motivating purification techniques that restore the original clean image [5].
- Existing purification methods fall into training-based and diffusion-model-based approaches; the latter offer stronger generalization without requiring extensive training data [5][6].

Motivation and Theoretical Analysis
- The key to successful adversarial purification is eliminating the perturbation while retaining the image's semantic information [9].
- Current strategies that add noise to mask adversarial perturbations often excessively damage the original image's semantic content [9].
- Using Fourier decomposition, the study analyzes how adversarial perturbations are distributed, finding that they predominantly affect high-frequency components, while low-frequency components are more robust [9][12].

Methodology
- A filter is constructed to retain the low-frequency amplitude-spectrum components, which are less affected by adversarial perturbations, allowing these components to be replaced with those of the original image (a schematic of this step follows below) [14][15].
- Because the phase spectrum is influenced by adversarial perturbations across all frequencies, a projection method is used to maintain the integrity of the phase information [16][17].

Experimental Results
- The proposed method improves both standard and robust accuracy over state-of-the-art (SOTA) methods on datasets such as CIFAR10 and ImageNet [18][19].
- Visualizations show that the purified images closely resemble the original clean images, confirming the approach's effectiveness [20].

Conclusion
- While significant progress has been made in preserving semantic information and removing adversarial perturbations, more effective image decomposition methods and deeper theoretical explanations remain directions for future work [21].
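The low-frequency amplitude swap at the heart of the method can be illustrated in a few lines of NumPy. This is a schematic reconstruction from the description above, not the open-sourced code: it transplants the low-frequency amplitude of a reference image into a purified estimate while keeping the estimate's phase, and the circular cutoff `radius` is an assumed hyperparameter.

```python
# Schematic low-frequency amplitude swap (reconstructed from the paper's
# description; not the released code). Low-frequency amplitudes are robust
# to adversarial noise, so they are carried over from the reference image,
# while the purified estimate keeps its own phase spectrum.
import numpy as np

def swap_low_freq_amplitude(purified, reference, radius=0.1):
    # purified, reference: 2D grayscale arrays of identical shape
    F_pur = np.fft.fftshift(np.fft.fft2(purified))
    F_ref = np.fft.fftshift(np.fft.fft2(reference))
    amp, phase = np.abs(F_pur), np.angle(F_pur)

    h, w = purified.shape
    yy, xx = np.ogrid[:h, :w]
    low = np.hypot(yy - h / 2, xx - w / 2) <= radius * min(h, w)  # low-freq disk

    amp[low] = np.abs(F_ref)[low]             # swap low-frequency amplitude only
    F_new = amp * np.exp(1j * phase)          # recombine with the original phase
    return np.fft.ifft2(np.fft.ifftshift(F_new)).real
```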
CVPR 2025 Oral | DiffFNO: Fourier neural operators boost diffusion, opening a new chapter in arbitrary-scale super-resolution
机器之心· 2025-05-04 04:57
Core Viewpoint
- The article presents DiffFNO, a novel method that augments diffusion models with neural operators to achieve high-quality, efficient super-resolution (SR) at any continuous scaling factor, addressing the limitations of traditional fixed-scale models [2][4].

Group 1: Methodology Overview
- DiffFNO consists of three main components: a Weighted Fourier Neural Operator (WFNO), a Gated Fusion Mechanism, and an Adaptive ODE Solver, which together improve the quality and efficiency of image reconstruction [2][5].
- The WFNO captures global information through frequency-domain convolution and amplifies high-frequency components with learnable frequency weights, yielding a PSNR improvement of roughly 0.3-0.5 dB on high-magnification tasks (a spectral-convolution sketch follows below) [10].
- The Gated Fusion Mechanism integrates a lightweight attention operator (AttnNO) to capture local spatial features, allowing a flexible combination of spectral and spatial information [12][13].

Group 2: Adaptive ODE Solver
- The Adaptive ODE Solver recasts the diffusion model's reverse process from a stochastic SDE into a deterministic ODE, cutting the denoising steps required from over a thousand to about thirty and greatly accelerating inference [15].
- This preserves image quality while roughly halving inference time, from 266 ms to about 141 ms, and performs even better at larger scaling factors [15].

Group 3: Experimental Validation
- DiffFNO outperforms various state-of-the-art (SOTA) methods by 2-4 dB in PSNR across multiple benchmark datasets, excelling especially at high magnifications such as ×8 and ×12 [17][20].
- The method retains the complete Fourier spectrum, balancing overall image structure against local detail, and uses learnable frequency weights to dynamically adjust the influence of different frequency bands [18].

Group 4: Conclusion
- DiffFNO offers a new way to reconcile the trade-off between high precision and low computational cost in super-resolution, making it suitable for fields requiring high image quality, such as medical imaging, exploration, and gaming [22].
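The WFNO's core operation, a Fourier-domain convolution whose retained modes carry learnable weights, follows the standard Fourier neural operator layer. The sketch below is a generic FNO-style spectral convolution with an added per-mode weight, written to illustrate the idea rather than reproduce DiffFNO's released architecture; for brevity it keeps only one corner of the spectrum.

```python
# Generic FNO-style spectral convolution with learnable per-mode weights,
# illustrating the WFNO idea (not the released DiffFNO code).
import torch
import torch.nn as nn

class WeightedSpectralConv2d(nn.Module):
    def __init__(self, channels, modes=16):
        super().__init__()
        self.modes = modes
        scale = 1.0 / channels
        # Complex channel-mixing weights over the retained low-order modes.
        self.w = nn.Parameter(
            scale * torch.randn(channels, channels, modes, modes, dtype=torch.cfloat)
        )
        # Learnable per-mode scaling: training can re-amplify high frequencies.
        self.freq_weight = nn.Parameter(torch.ones(modes, modes))

    def forward(self, x):                       # x: (batch, channels, h, w)
        b, c, h, w = x.shape
        xf = torch.fft.rfft2(x)                 # complex, (b, c, h, w // 2 + 1)
        m = self.modes
        out = torch.zeros_like(xf)
        # Mix channels mode-by-mode and rescale by the learnable weights
        # (only one corner of the spectrum is kept here, for brevity).
        out[:, :, :m, :m] = torch.einsum(
            'bixy,ioxy->boxy', xf[:, :, :m, :m], self.w
        ) * self.freq_weight
        return torch.fft.irfft2(out, s=(h, w))
```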