ViT First Author Raves: This Chinese Open-Source "PS Model" Beats Nano Banana
量子位· 2025-12-29 04:32
In other words, individual image elements can now be edited with fine-grained precision. Mengyao, reporting from Aofeisi | QbitAI (official account QbitAI). This one is just too good: it soundly beats ChatGPT and Nano Banana! Lucas Beyer, a core author of ViT and a member of Meta's superintelligence team, has just published three back-to-back posts praising Qwen-Image-Layered, the open-source model Tongyi Qianwen released not long ago. In his view, this is what image generation should look like. He also admitted that he had wanted to work on this model direction himself, but was too busy to ever get around to it. To be honest, Qwen-Image-Layered really is something: it gives us genuine Photoshop-level freedom to take an image apart. The model's core capability is curing the "one flat image decides everything" problem. It can decompose an ordinary image into multiple separate RGBA layers carrying transparency information, making image assets editable in a real sense. The concept sounds abstract, so let's look at examples. After seeing the model's results, netizens couldn't help exclaiming: it feels like an open-source Photoshop, amazing! So what exactly makes this model, which Lucas Beyer praised repeatedly, so strong? Let's take a look. Images can now be split apart layer by layer, just like in PS. If Nano Banana's skill points ...
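The article does not show any compositing math, but the reason RGBA-separated layers stay editable is standard alpha compositing: each layer can be moved or recolored on its own, and the stack is re-merged with the Porter-Duff "over" operator. A minimal pure-Python sketch of that operator on single pixels (toy values only, not tied to Qwen-Image-Layered's actual output format):

```python
def over(top, bottom):
    """Porter-Duff 'over' for one RGBA pixel pair; channels are floats in [0, 1]."""
    tr, tg, tb, ta = top
    br, bg, bb, ba = bottom
    a = ta + ba * (1.0 - ta)          # combined alpha
    if a == 0:
        return (0.0, 0.0, 0.0, 0.0)   # fully transparent result
    r = (tr * ta + br * ba * (1.0 - ta)) / a
    g = (tg * ta + bg * ba * (1.0 - ta)) / a
    b = (tb * ta + bb * ba * (1.0 - ta)) / a
    return (r, g, b, a)

def composite(layers):
    """Flatten a bottom-to-top stack of RGBA pixels into a single pixel."""
    out = (0.0, 0.0, 0.0, 0.0)
    for px in layers:
        out = over(px, out)
    return out

# Opaque white background under a half-transparent red layer:
result = composite([(1.0, 1.0, 1.0, 1.0), (1.0, 0.0, 0.0, 0.5)])
# result blends toward red: (1.0, 0.5, 0.5, 1.0)
```

Because each decomposed layer keeps its own alpha channel, any layer can be edited independently and the image re-flattened without touching the others, which is exactly the "Photoshop-level" editability the article describes.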
The Market Is Punishing End-to-End Algorithm Engineers Who Only Know Theory......
自动驾驶之心· 2025-12-29 01:07
Core Insights
- The article discusses the current challenges in the automotive industry regarding the recruitment of algorithm talent for end-to-end production roles, highlighting a gap between the skills of candidates and the high salary expectations for these positions [1]
- A new course titled "End-to-End Practical Class for Mass Production" has been designed to address this gap, focusing on essential algorithms and practical applications in autonomous driving [1]

Course Overview
- The course is structured into eight chapters, covering various aspects of end-to-end algorithms, including the integration of perception tasks and learning-based control algorithms [6]
- It emphasizes the importance of understanding both one-stage and two-stage end-to-end frameworks, with practical examples and real-world applications [7][8]
- Key algorithms discussed include reinforcement learning, trajectory optimization, and spatial-temporal planning, which are crucial for the mass production of autonomous driving systems [10][12]

Target Audience
- The course is aimed at advanced learners with a foundational understanding of autonomous driving technologies, including familiarity with algorithms such as reinforcement learning and diffusion models [14][16]
- It is designed to be accessible even to those with weaker foundations, as the instructor will provide guidance to help participants quickly get up to speed [14]

Course Logistics
- The course will commence on November 30 and is expected to last for three months, featuring offline video lectures and online Q&A sessions [14][17]
- Participants are required to have a GPU (a 4090-class card or better is recommended), along with a basic understanding of Python and PyTorch [16]
Just Made a World-Model Learning Roadmap for Beginners......
自动驾驶之心· 2025-12-25 03:24
Core Viewpoint
- The article discusses the distinction between world models and end-to-end models in autonomous driving, clarifying that world models are not a specific technology but rather a category of models with certain capabilities. It emphasizes the industry trend toward using world models for closed-loop simulation to address the high costs associated with corner cases in autonomous driving [2]

Course Overview
- The course on world models in autonomous driving is structured into six chapters, covering the introduction, background knowledge, discussions on general world models, video generation-based models, OCC-based models, and job-related insights in the industry [5][6][7][8][9]

Chapter Summaries
- **Chapter 1: Introduction to World Models**: outlines the relationship between world models and end-to-end autonomous driving, discussing the development history and current applications of world models, as well as various streams such as pure simulation, simulation plus planning, and generating sensor inputs [5]
- **Chapter 2: Background Knowledge**: covers foundational knowledge related to world models, including scene representation, Transformer technology, and BEV perception, which are crucial for understanding subsequent chapters [6]
- **Chapter 3: General World Models**: focuses on popular general world models like Marble from Li Fei-Fei's team and Genie 3 from DeepMind, discussing their core technologies and design philosophies [7]
- **Chapter 4: Video Generation-Based World Models**: delves into video generation algorithms, starting with GAIA-1 & GAIA-2 and extending to recent works like UniScene and OpenDWM, highlighting both classic and cutting-edge advancements in this area [8]
- **Chapter 5: OCC-Based World Models**: concentrates on OCC generation algorithms, discussing three major papers and a practical project, and emphasizing the potential for these methods to extend into vehicle trajectory planning [9]
- **Chapter 6: World Model Job Topics**: shares practical insights from the instructor's experience, addressing industry applications, pain points, and interview preparation for positions related to world models [9]

Learning Outcomes
- The course aims to provide a comprehensive understanding of world models in autonomous driving, equipping participants with knowledge comparable to one year of experience as a world-model algorithm engineer [10]
Class Starts Next Week! We Designed a Learning Roadmap for Autonomous-Driving World Models....
自动驾驶之心· 2025-12-24 09:22
Core Viewpoint
- The article discusses the distinction between world models and end-to-end models in autonomous driving, emphasizing that world models are a means to achieve end-to-end autonomous driving rather than a specific technology [2]

Summary by Sections
- **Chapter 1: Introduction to World Models**: provides an overview of the relationship between world models and end-to-end autonomous driving, covering the development history and current applications of world models. It introduces various types of world models, including pure simulation, simulation plus planning, and those generating sensor inputs and perception results, along with their industry applications and relevant datasets [5]
- **Chapter 2: Background Knowledge of World Models**: focuses on the foundational knowledge necessary for understanding world models, starting with scene representation and expanding to technologies like Transformer and BEV perception. It highlights key technical terms frequently encountered in job interviews related to world models [6][11]
- **Chapter 3: Discussion on General World Models**: centers on general world models and recent popular works in autonomous driving, including models from Li Fei-Fei's team (Marble), DeepMind (Genie 3), and Meta (JEPA). It also discusses the widely discussed VLA + world model algorithms and Tesla's latest world-model simulator shared at ICCV [7]
- **Chapter 4: Video Generation-Based World Models**: focuses on video generation algorithms, currently the most researched in both academia and industry. It covers classic works like GAIA-1 & GAIA-2 from Wayve and recent advancements such as UniScene and OpenDWM, providing a comprehensive view of the field's progress [8]
- **Chapter 5: OCC-Based World Models**: discusses OCC generation algorithms, explaining three major papers and a practical project. These methods can be readily extended to vehicle trajectory planning, contributing to end-to-end solutions [9]
- **Chapter 6: World Model Job Topics**: shares practical insights from the instructor's years of experience, addressing the application of world models in the industry, existing pain points, and how to prepare for related job interviews, focusing on what companies prioritize [10]

Course Outcomes
- The course aims to advance understanding of end-to-end autonomous driving, equipping participants with knowledge of world-model technologies, including video generation and OCC generation methods, and preparing them for roles in the autonomous driving industry [10][13]
Watching Tencent's Ad Algorithm Competition on Site, I Wanted to Join the Company Myself
量子位· 2025-12-24 05:14
Core Insights
- The article discusses Tencent's algorithm competition, highlighting its significance in attracting talent and providing practical experience with cutting-edge AI technologies [1][28][43]

Group 1: Competition Overview
- The competition offered substantial rewards, including a total prize pool of 3.8 million yuan, with the champion receiving 2 million yuan and all participants gaining access to valuable resources like computing power [32][34]
- The competition attracted over 8,400 students and 2,800 teams from nearly 30 countries, showcasing its global reach and influence [34]

Group 2: Technical Focus
- The competition's theme, "full-modal generative recommendation," addresses advanced challenges in advertising and recommendation systems, emphasizing the integration of various data types such as text, images, and videos [5][11]
- Participants faced real-world challenges, including data noise, alignment issues, and the need for efficient modeling of user behavior over long sequences [13][41]

Group 3: Talent Acquisition Strategy
- Tencent's approach to the competition serves as a recruitment strategy, allowing the company to identify and engage with top talent in a practical setting rather than through traditional recruitment methods [39][42]
- The competition's structure inherently filters candidates, ensuring that only those capable of handling complex data and modeling challenges progress to the final stages [40][41]

Group 4: Industry Context
- The competition reflects Tencent's established AI technology framework, which has been validated through real business applications, indicating the company's commitment to innovation and talent development [29][30]
- The article notes the competitive landscape for AI talent, with companies like Tencent offering attractive employment packages and support programs to attract young professionals [44][46]
Layout Control + Identity Consistency: Zhejiang University Proposes ContextGen, a New SOTA for Layout-Anchored Multi-Instance Generation
机器之心· 2025-12-20 04:45
Core Insights
- The article discusses advancements in image generation, focusing on the core challenges of Multi-Instance Image Generation (MIG): layout control and identity preservation [2][5]

Group 1: ContextGen Framework
- ContextGen is introduced as a new framework based on Diffusion Transformer (DiT), aimed at addressing the challenges of layout control and identity preservation in MIG tasks [5][6]
- The framework employs a dual-core mechanism that operates on a unified context token sequence, enhancing both layout and identity fidelity [8][10]

Group 2: Mechanisms of ContextGen
- The Contextual Layout Anchoring (CLA) mechanism focuses on global context guidance, utilizing user-designed or model-generated layout images to ensure precise global layout control and initial identity information [10]
- The Identity Consistency Injection (ICA) mechanism injects identity information from high-fidelity reference images into the corresponding target locations, ensuring consistency across multiple instances [12]

Group 3: Data Foundation
- The IMIG-100K dataset is introduced as the first large-scale, finely annotated dataset designed for image-guided multi-instance generation tasks, providing multiple difficulty levels and detailed layout and identity annotations [14]

Group 4: Performance Optimization
- ContextGen incorporates a reinforcement learning phase based on preference optimization (DPO) to encourage creativity and diversity in generated images, moving beyond rigid replication of layout content [17]

Group 5: Experimental Validation
- ContextGen demonstrates superior performance in quantitative and qualitative evaluations, surpassing all open-source models and matching closed-source commercial models in identity consistency [21][25]
- On the LAMICBench++ benchmark, ContextGen achieved an average score improvement of +1.3% over existing open-source models, showcasing its capabilities in complex multi-instance scenarios [21]

Group 6: User Interaction
- A user-friendly front-end interface is included in the project, allowing users to upload reference images, add new materials via text, and design layouts through drag-and-drop [32]

Group 7: Future Outlook
- The ReLER team plans to further optimize the model architecture and explore diverse user-interaction methods to meet broader application needs, emphasizing the importance of understanding user intent and multimodal references [36]
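The summary mentions a DPO-based preference-optimization phase but gives no formulas. For reference, the standard pairwise DPO objective, which image-generation variants adapt (the exact form ContextGen uses is not stated in the article), can be sketched with stdlib Python:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Pairwise DPO loss for one preference pair:
    -log sigmoid(beta * ((logp_c - ref_c) - (logp_r - ref_r))),
    where ref_* are log-probs under a frozen reference model."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# With no preference signal (all log-probs equal) the loss sits at log(2);
# raising the chosen sample's log-prob relative to the reference drives it down.
baseline = dpo_loss(0.0, 0.0, 0.0, 0.0)
improved = dpo_loss(2.0, 0.0, 0.0, 0.0)
```

The `beta` coefficient controls how hard the policy is pushed away from the reference model; small values keep generations close to the supervised starting point, which matches the article's stated goal of adding diversity without abandoning layout fidelity.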
Seven Projects Worth Referencing for End-to-End Production
自动驾驶之心· 2025-12-19 00:05
Core Viewpoint
- The article emphasizes the importance of end-to-end production in autonomous driving technology, highlighting the need for practical experience with a range of algorithms and applications to address real-world challenges in the industry [2][7]

Course Overview
- The course is designed to provide in-depth knowledge of end-to-end production techniques, focusing on key algorithms such as one-stage and two-stage frameworks, reinforcement learning, and trajectory optimization [2][4]
- It includes practical projects that cover the entire process from theory to application, ensuring participants gain hands-on experience [2][12]

Instructor Background
- The instructor, Wang Lu, is a top-tier algorithm expert with a strong academic background and extensive experience developing and deploying advanced algorithms for autonomous driving [3]

Course Structure
- The course consists of eight chapters, each focusing on a different aspect of end-to-end algorithms:
  1. Overview of end-to-end tasks and integration of perception and control systems [7]
  2. Two-stage end-to-end algorithm frameworks and their advantages [8]
  3. One-stage end-to-end algorithms with a focus on performance [9]
  4. Application of navigation information in autonomous driving [10]
  5. Introduction to reinforcement learning algorithms and training strategies [11]
  6. Optimization of trajectory outputs using various algorithms [12]
  7. Post-processing strategies for ensuring reliable outputs [13]
  8. Sharing of production experience and strategies for real-world applications [14]

Target Audience
- The course is aimed at advanced learners with a foundational understanding of autonomous driving algorithms, including familiarity with reinforcement learning and diffusion models [15][17]
The Whole Internet Breaks Down: AI's "Finger Problem" Failures Drive Humans Crazy, as Six Fingers Expose a Fatal Flaw in Transformers
36Kr· 2025-12-15 12:39
Core Insights
- The recent "finger counting problem" highlights a significant flaw in AI models, particularly those based on the Transformer architecture, which struggle with visual reasoning and with understanding discrete structures [1][50]

Group 1: AI Performance Issues
- AI models such as Nano Banana Pro and GPT-5.2 consistently fail to count the fingers on a six-fingered hand, defaulting to the assumption of five fingers due to their training-data bias [2][6][9]
- The inability of AI to recognize the six fingers is attributed to its reliance on basic shapes and familiar associations rather than precise visual recognition [21][32]

Group 2: Limitations of the Transformer Architecture
- The Transformer architecture's parallel computation, while beneficial for speed, hinders the model's ability to perform multi-step logical reasoning, leading to mechanical and fragmented thinking [37][39]
- AI's lack of a coherent thought process when faced with anomalies, such as a six-fingered hand, results in a failure to reassess and adjust its responses [39][46]

Group 3: Need for Advanced Models
- To address the shortcomings revealed by the finger-counting problem, the article calls for more advanced architectures and more diverse training data that can enhance AI's understanding of complex visual details [50]
- Current models' reliance on strong statistical priors from training data limits their ability to understand and generate precise structures, indicating a need for hybrid modeling approaches that combine different AI techniques [45][50]
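The "strong statistical prior" argument can be made concrete with a toy two-hypothesis Bayes update (illustrative numbers only; this is not how the models actually compute anything): if the training data makes five-fingered hands overwhelmingly likely a priori, even visual evidence that favors six fingers at 4:1 odds barely moves the posterior.

```python
def posterior_six(prior_six, p_obs_given_six, p_obs_given_five):
    """P(six fingers | observation) under a two-hypothesis Bayes rule."""
    num = prior_six * p_obs_given_six
    return num / (num + (1.0 - prior_six) * p_obs_given_five)

# Prior heavily favors five fingers; the observation favors six at 4:1.
p = posterior_six(prior_six=0.01, p_obs_given_six=0.8, p_obs_given_five=0.2)
# p stays below 4%, so "five fingers" still wins despite the evidence.
```

The toy shows why more diverse training data matters: only by weakening the prior (or sharpening the evidence term, i.e. more precise visual recognition) does the anomalous six-fingered hand become the most probable answer.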
A Year On, DiffusionDrive Upgrades to v2 and Sets a New Record!
自动驾驶之心· 2025-12-11 03:35
Core Insights
- The article discusses the upgrade of DiffusionDrive to version 2, highlighting its advances in end-to-end autonomous-driving trajectory planning through the integration of reinforcement learning to address the challenge of generating trajectories that are both diverse and consistently high quality [1][3][10]

Background Review
- The shift toward end-to-end autonomous driving (E2E-AD) has emerged as traditional tasks like 3D object detection and motion prediction have matured. Early methods faced modeling limitations, often generating a single trajectory with no alternatives in complex driving scenarios [5][10]
- Previous diffusion models applied to trajectory generation struggled with mode collapse, yielding little diversity in generated behaviors. DiffusionDrive introduced a Gaussian Mixture Model (GMM) to define prior distributions for the initial noise, promoting diverse behavior generation [5][13]

Methodology
- DiffusionDriveV2 introduces a novel framework that utilizes reinforcement learning to overcome the limitations of imitation learning, which previously forced a trade-off between diversity and sustained high quality in trajectory generation [10][12]
- The framework incorporates intra-anchor GRPO and inter-anchor truncated GRPO to keep advantage estimation within a specific driving intention, preventing mode collapse by avoiding inappropriate comparisons between different intentions [9][12][28]
- The method employs scale-adaptive multiplicative noise to enhance exploration while maintaining trajectory smoothness, addressing the inherent scale inconsistency between the proximal and distal segments of a trajectory [24][39]

Experimental Results
- Evaluations on the NAVSIM v1 and NAVSIM v2 datasets show that DiffusionDriveV2 achieves state-of-the-art performance, with a PDMS score of 91.2 on NAVSIM v1 and 85.5 on NAVSIM v2, significantly outperforming previous models [10][33]
- The results indicate that DiffusionDriveV2 effectively balances trajectory diversity and sustained quality, achieving optimal performance in closed-loop evaluations [38][39]

Conclusion
- The article concludes that DiffusionDriveV2 successfully addresses the inherent challenges of imitation learning in trajectory generation, achieving an optimal trade-off between planning quality and diversity through innovative reinforcement learning techniques [47]
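The summary names intra-anchor GRPO but does not spell out the computation. The core idea of group-relative advantage estimation, normalizing each rollout's reward only against rollouts that share the same anchor (driving intention), can be sketched as follows (function and variable names are illustrative, not taken from the paper; the inter-anchor truncation step is omitted):

```python
def intra_anchor_advantages(rewards_by_anchor, eps=1e-8):
    """Group-relative advantages: standardize each rollout's reward against the
    mean/std of rollouts sharing the same anchor, so trajectories for different
    intentions (e.g. turn left vs. keep lane) are never compared to each other."""
    advantages = {}
    for anchor, rewards in rewards_by_anchor.items():
        mean = sum(rewards) / len(rewards)
        std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
        advantages[anchor] = [(r - mean) / (std + eps) for r in rewards]
    return advantages

# Two rollouts per anchor; each anchor is normalized independently.
adv = intra_anchor_advantages({"keep_lane": [0.9, 0.7], "turn_left": [0.4, 0.2]})
```

Because standardization happens per anchor, a low-reward but valid "turn left" rollout can still receive a positive advantage within its own group instead of being crushed by higher absolute rewards from another intention, which is the mode-collapse failure the article says this design prevents.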
Learn Anytime! The End-to-End and VLA Autonomous Driving Small-Group Course Has Officially Concluded
自动驾驶之心· 2025-12-09 19:00
Core Viewpoint
- 2023 marked the arrival of end-to-end approaches, with 2024 expected to be a significant year for end-to-end production in the automotive industry, as leading new players and manufacturers have already achieved end-to-end production [1][3]

Group 1: End-to-End Production Development
- The automotive industry has two main paradigms, single-stage and two-stage, with UniAD as a representative of the single-stage approach that models vehicle trajectories directly from sensor inputs [1]
- Since last year, single-stage end-to-end development has advanced rapidly, producing derivatives such as perception-based, world-model-based, diffusion-model-based, and VLA-based single-stage methods [3][5]
- Major players in the autonomous driving sector, including both solution providers and car manufacturers, are focusing on in-house research and production of end-to-end autonomous driving technologies [3]

Group 2: Course Overview
- A course titled "End-to-End and VLA Autonomous Driving" has been launched, teaching cutting-edge algorithms in both single-stage and two-stage end-to-end approaches, with a focus on the latest developments in industry and academia [5][14]
- The course is structured into several chapters, starting with an introduction to end-to-end algorithms, followed by background knowledge on technologies such as VLA, diffusion models, and reinforcement learning [8][9]
- The second chapter is highlighted as containing the technical keywords most frequently asked about in job interviews over the next two years [9]

Group 3: Technical Focus Areas
- The course covers the subfields of single-stage end-to-end methods, including perception-based (UniAD), world-model-based, diffusion-model-based, and the currently popular VLA-based approaches [10][12]
- The curriculum includes practical assignments, such as RLHF fine-tuning, giving students hands-on experience building and experimenting with pre-training and reinforcement learning modules [11][12]
- The course emphasizes understanding BEV perception, multi-modal large models, and the latest advances in diffusion models, which are crucial for the future of autonomous driving [12][16]