Generative Modeling
ICLR 2026 | When Video Is Hard to Represent: UCSD, HKUST, and Other Institutions Jointly Propose FlowRVS, Reshaping the Visual Perception Paradigm with Generative Flow Matching
机器之心· 2026-03-03 09:08
Core Insights
- The article discusses the limitations of traditional representation methods in video understanding, particularly in referring video object segmentation (RVOS), where the "locate then segment" paradigm suffers from information collapse [2][3]
- A new approach called FlowRVS is introduced, which leverages generative models to recast segmentation as a flow matching process, improving performance and marking a paradigm shift in visual perception [3][11]

Group 1: Traditional Methods and Their Limitations
- Traditional models compress video features into a small set of vectors, which often discards fine-grained spatiotemporal relationships [2]
- Attempting to map high-dimensional video features to binary masks in a single step performed poorly, highlighting the difficulty of bridging large information gaps [8][10]
- The failure of the "noise-to-mask" approach demonstrated the importance of retaining high-entropy spatial and textural details from the video [10]

Group 2: FlowRVS and Its Innovations
- FlowRVS shifts the objective from absolute mask prediction to predicting relative changes in video features, yielding a significant performance increase [11]
- The model establishes a Video-to-Mask Flow paradigm, learning a deterministic trajectory that transports high-dimensional features smoothly into target masks, achieving a state-of-the-art score of 60.6 [11][21]
- Boundary Bias Sampling (BBS) lets the model focus on the critical starting point of the flow, yielding a 10-point performance boost [16][17]

Group 3: Performance Metrics and Results
- FlowRVS achieved a new state-of-the-art score of 51.1 J&F on the MeViS benchmark, remaining effective even against larger models [21]
- The model exhibited strong zero-shot capability, scoring 73.3 on the unseen Ref-DAVIS17 dataset, showcasing its generalization power [21]
- FlowRVS remains stable over long sequences, effectively addressing trajectory drift, a significant advantage over traditional models [23]

Group 4: Theoretical Implications of Flow Matching
- FlowRVS exemplifies the generality of flow matching theory, showing that optimal transport paths can be established between different probability distributions across modalities [26]
- Its success suggests a future where detection, segmentation, and generation tasks are unified under a single elegant ODE framework, breaking down the barriers between modalities [26]
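The video-to-mask flow described above can be sketched as a conditional flow-matching objective plus a deterministic Euler sampler. This is a minimal illustration, not the paper's exact formulation: the `model` interface, tensor shapes, and the straight-line interpolation path are all assumptions.

```python
import torch

def flow_matching_loss(model, video_feat, mask, t=None):
    """Conditional flow-matching loss along a straight video-to-mask path.

    `model(x_t, t)` is assumed to predict a velocity field; the training
    target is the constant velocity of the linear interpolant.
    """
    b = video_feat.shape[0]
    if t is None:
        t = torch.rand(b, 1)                    # uniform time samples in [0, 1]
    x_t = (1 - t) * video_feat + t * mask       # straight-line interpolant
    v_target = mask - video_feat                # constant velocity along the path
    v_pred = model(x_t, t)
    return ((v_pred - v_target) ** 2).mean()

@torch.no_grad()
def sample_mask(model, video_feat, steps=10):
    """Integrate the learned ODE from video features toward the mask
    with fixed-step Euler, giving a deterministic trajectory."""
    x = video_feat.clone()
    for i in range(steps):
        t = torch.full((x.shape[0], 1), i / steps)
        x = x + model(x, t) / steps
    return x
```

Boundary Bias Sampling would replace the uniform `torch.rand` draw with a distribution concentrated near `t = 0`, the critical starting point of the flow; the exact schedule is not specified here.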
Closed-Loop Collision Rate Plunges 50%! DistillDrive: A New End-to-End Scheme with Heterogeneous Multi-Modal Distillation
自动驾驶之心· 2025-08-11 23:33
Core Insights
- The article discusses the development of DistillDrive, an end-to-end autonomous driving model that reduces collision rates by 50% and improves closed-loop performance by 3 percentage points over baseline models [2][7]

Group 1: Model Overview
- DistillDrive uses a knowledge distillation framework to enhance multi-modal motion feature learning, addressing existing models' over-reliance on ego-vehicle status [2][6]
- A structured scene representation serves as the teacher model, leveraging diverse planning instances for multi-objective learning [2][6]
- Reinforcement learning optimizes the mapping from states to decisions, while generative modeling constructs planning-oriented instances [2][6]

Group 2: Experimental Validation
- The model was validated on the nuScenes and NAVSIM datasets, demonstrating a 50% reduction in collision rates and a 3-point improvement in performance metrics [7][37]
- The nuScenes dataset consists of 1,000 driving scenes, while NAVSIM adds high-quality annotations and complex scenarios to strengthen perception [33][36]

Group 3: Performance Metrics
- DistillDrive outperformed existing models, achieving lower collision rates and reduced L2 error compared to SparseDrive, indicating the effectiveness of diversified imitation learning [37][38]
- The teacher model's superior performance confirms the effectiveness of reinforcement learning for optimizing the state space [37][39]

Group 4: Future Directions
- Future work aims to integrate world models with language models to further enhance planning performance, and to employ more effective reinforcement learning methods [54][55]
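The teacher-to-student supervision at the core of such a distillation framework can be sketched as a soft-label loss. This is a generic knowledge-distillation objective, not DistillDrive's published loss; the logit shapes and temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-label distillation: the teacher's temperature-softened
    distribution supervises the student's predictions."""
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_student = F.log_softmax(student_logits / t, dim=-1)
    # KL(teacher || student), scaled by t^2 to keep gradients
    # comparable across temperatures
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * (t * t)
```

In a planning setting, the logits would score candidate planning instances; a higher temperature exposes more of the teacher's ranking over near-miss alternatives, which is what makes the soft labels informative.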
A Roundup of Work Combining LLMs with Reinforcement Learning and World Models in Embodied AI
具身智能之心· 2025-07-29 06:15
Core Viewpoint
- The article surveys recent advances in embodied intelligence, focusing on the integration of large language models (LLMs) with reinforcement learning and world models, and highlights several notable research papers from 2024 [2][3]

Group 1: UniSim
- UniSim aims to learn a general real-world interactive simulator through generative modeling, showing that natural datasets provide diverse advantages for learning simulators [3]
- Integrating varied datasets enables simulation of both high-level commands and low-level controls, supporting zero-shot application in real-world scenarios [3]

Group 2: Robust Agents
- The Google DeepMind study argues that causal reasoning is essential for robust and general AI, concluding that agents capable of satisfying regret bounds must learn approximate causal models [5]
- This finding has significant implications for transfer learning and causal inference [5]

Group 3: MAMBA
- MAMBA introduces an efficient world-model approach for meta-reinforcement learning, addressing the sample-efficiency issues prevalent in current methods [8]
- The framework improves sample efficiency by up to 15x in high-dimensional tasks [8]

Group 4: EMMA
- EMMA uses LLMs trained in text-based worlds to guide the training of visual world agents, enhancing their ability to interact with dynamic environments [10]
- The approach improves success rates by 20%-70% across diverse tasks compared to existing VLM agents [10]

Group 5: Text2Reward
- The Text2Reward framework automates the generation of dense reward functions with LLMs, addressing the difficulty of reward-function design in reinforcement learning [13][14]
- The method outperforms baselines on 13 of 17 tasks and achieves over 94% success on new motion behaviors [14]

Group 6: Online Continual Learning
- The research proposes two frameworks for continual learning in interactive instruction-following agents, emphasizing incremental learning as agents explore their environments [17][18]
- A confidence-aware moving-average mechanism updates parameters without relying on task-boundary information [18]

Group 7: AMAGO
- AMAGO is a scalable in-context reinforcement learning framework addressing challenges in generalization, long-term memory, and meta-learning [21]
- The framework enables parallel training of long-sequence transformers, enhancing scalability and performance in complex tasks [21]

Group 8: PDDL-based Planning
- The study presents a novel paradigm for task planning with pre-trained LLMs, focusing on building explicit world models through PDDL [22][23]
- The framework reduces the need for human intervention by letting LLMs translate between PDDL and natural language, facilitating efficient model correction [23]
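The confidence-aware moving average from the continual-learning work above can be sketched as an exponential moving average whose momentum depends on a confidence score. This is an illustrative sketch only: the momentum schedule, the dict-of-floats parameter representation, and the function name are assumptions, not the paper's mechanism.

```python
def confidence_aware_ema(old_params, new_params, confidence, base_momentum=0.999):
    """Blend old and new parameters with a confidence-dependent momentum.

    Low confidence keeps the momentum near `base_momentum` (parameters
    barely move); high confidence lowers it so the new parameters
    contribute more. No task-boundary information is required,
    which suits online continual learning.
    """
    # Hypothetical schedule: interpolate momentum between base_momentum
    # (confidence 0) and 0.5 (confidence 1).
    m = base_momentum * (1.0 - confidence) + 0.5 * confidence
    return {k: m * old_params[k] + (1.0 - m) * new_params[k] for k in old_params}
```

The design choice this illustrates: because the update rate is driven by a per-step confidence signal rather than by explicit task boundaries, the agent can keep learning incrementally as it explores, without being told when one task ends and the next begins.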