Flow Matching
ICLR 2026 | When video defies representation: UCSD, HKUST, and collaborators propose FlowRVS, recasting visual perception with generative flow matching
机器之心· 2026-03-03 09:08
The first author, Wang Zanyi, completed his undergraduate degree at Xi'an Jiaotong University and is now a first-year master's student in the ECE department at UC San Diego (UCSD). His research focuses on video understanding and generative modeling. This work was done during an internship at the State Grid SGIT AI Lab.

For a long time, computer vision has been fixated on "representation": we design ever more intricate encoders in an attempt to compress the dynamic world into a set of feature vectors. Yet video, as a high-dimensional projection of reality, is so high in entropy and so dynamically complex that these attempts to "freeze" it into a representation fall short. In referring video object segmentation (RVOS) in particular, the traditional "locate first, then segment" paradigm runs into an information-collapse bottleneck: once features are compressed, fine-grained spatio-temporal correspondences disintegrate. What if we change the approach? Instead of clinging to "compression" and "representation", can we use a generative model's deep grasp of physical regularities to "replay" the process and gain a decisive edge? At the newly announced ICLR 2026, a research team from SGIT AI Lab, UCSD, HKUST, and other institutions answered yes. Their FlowRVS breaks out of the conventional "frozen backbone feature extraction + separate decoder prediction" pipeline. Rather than treating a large model merely as a feature ...
An intuitive understanding of the Flow Matching generative algorithm
自动驾驶之心· 2025-12-17 00:03
Core Viewpoint
- The article explains the Flow Matching algorithm, a generative model that produces samples resembling a target dataset, without resorting to heavy mathematical machinery or derivations [3][4][12].

Algorithm Principle
- Flow Matching is a generative model that learns to produce samples close to a given target set [3][4].
- The algorithm learns a direction of movement from a source point to a target point, which guides the generation process [14][16].

Training and Inference
- During training, the model samples points along the straight line from source to target and averages the slopes of multiple source-target connections to determine the direction of movement [17].
- At inference, the model starts from a noise point and iteratively moves toward the target, collapsing into a specific state as it approaches [17][18].

Code Implementation
- The accompanying code demonstrates a simple implementation of Flow Matching, including the generation of random input points and the prediction of slopes with a neural network [18][19].
- The model uses a vector field to predict the direction and speed of movement toward the target distribution [19][20].

Advanced Applications
- The article adapts Flow Matching to conditional generation, producing samples from specific prompts or conditions [24][30].
- An MNIST example generates handwritten digits with Flow Matching, showcasing its versatility across generative tasks [30][32].

Conclusion
- Flow Matching offers a more efficient alternative to diffusion models, with applications spanning image generation and autonomous driving [12][43].
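The train/infer loop summarized above can be sketched in a few lines. This is a minimal illustration, not the article's actual code: a single 2-D target point stands in for the dataset, and an analytic velocity field stands in for the trained network.

```python
import numpy as np

rng = np.random.default_rng(0)
target = np.array([2.0, -1.0])           # hypothetical 2-D "dataset" of one point

# Training view: a point on the straight line from noise to target,
# labeled with the line's constant slope (what a network would regress).
def training_example():
    x0 = rng.standard_normal(2)          # source (noise) point
    t = rng.uniform()
    xt = (1 - t) * x0 + t * target       # point along the connection
    return xt, t, target - x0            # regression target: the slope

# Inference: Euler-integrate a velocity field from noise toward the target.
# Here the field is analytic; in practice the trained network replaces it.
def generate(steps=100):
    x = rng.standard_normal(2)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = (target - x) / (1.0 - t)     # velocity toward the point target
        x = x + dt * v
    return x

print(generate())  # converges to target
```

With a point target, the last Euler step lands exactly on the target, which mirrors the article's description of the sample "collapsing into a specific state" near the end of integration.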
The diffusion planner, fully upgraded! Tsinghua's Flow Planner: a game-enhanced algorithm built on flow matching (NeurIPS'25)
自动驾驶之心· 2025-10-15 23:33
Core Insights
- The article presents Flow Planner, a new autonomous driving decision-making framework that improves on the existing Diffusion Planner by effectively modeling advanced interactive behaviors in high-density traffic scenarios [1][4][22].

Group 1: Background and Challenges
- A core challenge in autonomous driving planning is achieving safe, reliable, human-like decision-making in dense and diverse traffic environments [3].
- Traditional rule-based methods lack generalization in dynamic traffic games, while learning-based methods struggle with scarce high-quality training data and the need to model game behaviors effectively [6][8].

Group 2: Innovations of Flow Planner
- Flow Planner introduces three key innovations: fine-grained trajectory tokenization, interaction-enhanced spatiotemporal fusion, and classifier-free guidance for trajectory generation [4][23].
- Fine-grained trajectory tokenization divides trajectories into overlapping segments, improving coherence and diversity in planning [8].
- The interaction-enhanced spatiotemporal fusion mechanism captures spatial interactions and temporal consistency among traffic participants [9][13].
- Classifier-free guidance allows flexible adjustment of the model's sampling distribution at inference, enriching the generated driving behaviors and strategies [10].

Group 3: Experimental Results
- Flow Planner achieved state-of-the-art (SOTA) performance on the nuPlan benchmark, surpassing 90 points on Val14 without relying on any rule-based priors or post-processing modules [11][14].
- On the newly proposed interPlan benchmark, Flow Planner significantly outperformed baseline methods, demonstrating superior response strategies in high-density traffic and pedestrian-crossing scenarios [15][20].

Group 4: Conclusion
- Through these modeling innovations, Flow Planner significantly improves decision-making in complex traffic interactions and shows strong adaptability across scenarios [22][23].
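The classifier-free guidance mentioned for Flow Planner is, in its standard form, a linear blend of conditional and unconditional predictions. A minimal sketch, assuming the usual guidance formula and hypothetical velocity arrays (not Flow Planner's actual implementation):

```python
import numpy as np

def cfg_velocity(v_cond, v_uncond, w):
    """Classifier-free guidance blend: w = 0 gives the unconditional
    field, w = 1 the conditional one, and w > 1 amplifies the condition."""
    return v_uncond + w * (v_cond - v_uncond)

# Hypothetical predicted velocities for one trajectory waypoint
v_cond = np.array([1.0, 0.5])    # conditioned on goal/scene context
v_uncond = np.array([0.2, 0.1])  # condition dropped
print(cfg_velocity(v_cond, v_uncond, 1.0))  # recovers the conditional field
print(cfg_velocity(v_cond, v_uncond, 2.0))  # extrapolates past it
```

Sweeping `w` at inference is what allows "flexible adjustment of the sampling distribution" without retraining: the two predictions come from one network trained with the condition randomly dropped.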
Diffusion models: a one-stop overview!
自动驾驶之心· 2025-09-13 16:04
Core Viewpoint
- The article walks through the mathematical principles behind diffusion models, emphasizing the role of noise in the sampling process and how it yields diverse, realistic images. The key takeaway: diffusion models use Langevin sampling to transition from one probability distribution to another, with noise as an essential component rather than a mere side effect [10][11][26].

Summary by Sections
- Section 1: Basic Concepts of Diffusion Models. Introduces the foundational concepts, focusing on velocity vector fields that define ordinary differential equations (ODEs) and the mathematical representation of these fields through trajectories [4].
- Section 2: Langevin Sampling. Langevin sampling is a crucial method for approximating transitions between distributions; adding noise lets the sampler explore the probability space and prevents convergence to local maxima [10][11][14][26].
- Section 3: Role of Noise. Noise is a necessary ingredient of diffusion: without it, sampling would only climb to local maxima, limiting the diversity of generated outputs [26][28][31].
- Section 4: Comparison with GANs. Unlike Generative Adversarial Networks (GANs), diffusion models delegate the task of diversity to noise, which alleviates issues such as mode collapse [37].
- Section 5: Training and Implementation. Training uses score matching and kernel density estimation (KDE) to learn the underlying data distribution; the steps include generating noisy samples and computing gradients for optimization [64][65].
- Section 6: Flow Matching Techniques. Flow matching optimizes the sampling process by minimizing the distance between the learned velocity field and the field induced by the true data distribution; the article also discusses the equivalence of flow matching and optimal transport strategies [76][86].
- Section 7: Mean Flow and Rectified Flow. These advanced techniques within the flow-matching framework improve sampling efficiency and stability during generation [100][106].
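Section 2's Langevin sampling idea, gradient steps on the log-density plus injected noise, can be illustrated on a 1-D Gaussian whose score is known in closed form. The target N(2, 1), step size, and chain counts below are arbitrary choices for the sketch, not anything from the article:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 1.0

def score(x):
    # Score of N(mu, sigma^2): the gradient of its log-density
    return -(x - mu) / sigma**2

def langevin_sample(n_chains=5000, n_steps=2000, eps=0.01):
    x = 3.0 * rng.standard_normal(n_chains)   # start far from the mode
    for _ in range(n_steps):
        noise = rng.standard_normal(n_chains)
        # half-step up the score, plus injected exploration noise
        x = x + 0.5 * eps * score(x) + np.sqrt(eps) * noise
    return x

samples = langevin_sample()
print(abs(samples.mean() - mu) < 0.1)  # chains settle around mu
```

Dropping the `np.sqrt(eps) * noise` term turns this into plain gradient ascent on the log-density: every chain collapses onto the mode at `mu`, which is exactly the loss of diversity Section 3 attributes to removing noise.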
Reinforcement learning, VLA, Flow Matching, and robot control algorithms, viewed through method paradigms and application scenarios
具身智能之心· 2025-08-19 01:54
Core Viewpoint
- The article surveys recent advances in reinforcement learning (RL) and its applications in robotics, focusing on VLA (Vision-Language Action) models and diffusion policies and their potential to handle complex tasks that traditional RL struggles with [2][4][35].

Method Paradigms
- Traditional RL and imitation learning combined with Sim2Real techniques remain the foundational approaches in robotics [3].
- VLA models differ fundamentally from traditional RL: they use the training-data distribution to describe task processes and goals, enabling the execution of more complex tasks [4][35].
- Diffusion Policy is a novel approach that uses diffusion models to generate continuous action sequences, demonstrating superior capability on complex tasks compared with traditional RL methods [4][5].

Application Scenarios
- Applications fall into two main types: basic motion control for humanoid and quadruped robots, and complex or long-horizon manipulation tasks [22][23].
- Basic motion control still relies primarily on RL and Sim2Real, and current implementations have yet to achieve motion as fluid as that of humans or animals [22].
- For complex tasks, architectures typically pair a pre-trained Vision Transformer (ViT) encoder with a large language model (LLM), using diffusion or flow matching for action output [23][25].

Challenges and Future Directions
- Key open challenges include better simulation environments, effective domain randomization, and the integration of external goal conditions [35].
- Human intention is central to task definition, and current models struggle to learn complex tasks without extensive human demonstration data [35][40].
- Future work may bring multi-modal input prediction of task goals and the integration of brain-machine interfaces to enhance human-robot interaction [35].
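The typical complex-task architecture described above (pretrained ViT encoder + LLM, with a diffusion or flow-matching head emitting an action chunk) can be sketched with stub components. Every function here is a hypothetical stand-in to show the data flow, not any real model's API:

```python
import numpy as np

rng = np.random.default_rng(0)

def vit_encode(image):                 # stand-in for a pretrained ViT encoder
    return image.mean(axis=(0, 1))     # image -> feature vector

def llm_fuse(vision_feat, instruction_ids):   # stand-in for LLM fusion
    return np.concatenate([vision_feat, instruction_ids.astype(float)])

def velocity_head(x, t, context):      # stand-in for a flow-matching action head
    # A trained network would predict the velocity from (x, t, context);
    # here we pull toward a context-derived target so integration is defined.
    target = 0.1 * np.resize(context, x.shape)
    return (target - x) / (1.0 - t + 1e-8)

def predict_action_chunk(image, instruction_ids, horizon=8, dim=7, steps=20):
    ctx = llm_fuse(vit_encode(image), instruction_ids)
    x = rng.standard_normal((horizon, dim))   # start the chunk from noise
    dt = 1.0 / steps
    for i in range(steps):                    # Euler-integrate to actions
        x = x + dt * velocity_head(x, i * dt, ctx)
    return x                                  # (horizon, action_dim) chunk

actions = predict_action_chunk(rng.random((224, 224, 3)), np.arange(4))
print(actions.shape)  # (8, 7)
```

The key structural point is that the generative head outputs a whole action chunk conditioned on fused vision-language context, rather than a single step as in classic RL policies.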
A shake-up in AI image generation! Flow-matching architecture upends tradition: one model accepts both text and image inputs
量子位· 2025-05-30 05:01
Core Viewpoint
- The article covers the breakthrough AI model FLUX.1 Kontext, which uses a flow-matching architecture to accept both text and image inputs, enabling advanced in-context generation and editing [2][3].

Group 1: Model Features
- FLUX.1 Kontext comes in two versions: a professional version for rapid iteration and a high-end version with improved prompt adherence and consistency [7].
- The model has four key features: character consistency across scenes, localized editing, style reference for generating new scenes, and minimal interaction latency [11].

Group 2: Performance Comparison
- Tests by the third-party platform Replicate show FLUX.1 Kontext outperforming OpenAI's 4o model on quality and cost-effectiveness, with better color accuracy [12].

Group 3: Editing Techniques
- In image editing, preserving character identity is crucial regardless of how large the change is [15].
- Complex changes, such as adding characters or altering backgrounds, are best described in multiple steps for optimal results [18].
- Style-transfer tasks benefit from referencing specific art styles or artists [19].

Group 4: Text Editing Capabilities
- The model supports adding, deleting, and modifying text on images, with specific guidelines for preserving readability and layout [22][25].
- Clear instructions about which elements to retain are essential for effective text editing [25].

Group 5: User Guidance
- Detailed, specific descriptions yield better editing results, underscoring the importance of clear instructions [20][37].
- The article summarizes effective prompt techniques for FLUX.1 Kontext, highlighting precise language and structured editing steps [34][37].
Z Tech | A conversation with the team behind CV luminary Kaiming He's new work, three post-05 MIT undergraduates: does Diffusion really need noise conditioning?
Z Potentials· 2025-02-27 04:09
Core Viewpoint
- Recent research led by renowned scholar Kaiming He and three MIT undergraduates challenges the conventional understanding of noise conditioning in denoising models, suggesting it may not be essential for model performance [1][3].

Group 1: Research Findings
- Removing noise conditioning from many mainstream denoising models causes only a modest degradation in performance [4].
- The newly designed unconditional model, uEDM, reaches a near-state-of-the-art FID of 2.23 on CIFAR-10, only slightly behind the best noise-conditioned model, EDM, at 1.97 [2][6].
- A theoretical framework and experimental results validate the stability of mainstream denoising models when noise conditioning is removed, indicating that the traditional technique is not strictly necessary in practice [3][5].

Group 2: Implications and Future Directions
- The findings open avenues for reducing model computational complexity and inspire new model designs that forgo noise conditioning [3].
- An upcoming live lecture will discuss generative models and potential development directions, including a Q&A session with the authors [2].