Cross-Modal Fusion
SIGIR 2025 | A new paradigm for video retrieval! BUPT, Peking University, and collaborators propose AV-NAS, the first audio-visual hashing search architecture that lets Mamba and Transformer automatically "team up"
AI前线· 2026-01-05 08:33
Author | Chen Yong

In large-scale video retrieval, traditional methods tend to "favor vision and neglect audio," and their network architectures are mostly designed through experience and manual trial and error, making it difficult to achieve efficient storage and fast retrieval at the same time. Is there an approach that can automatically find the optimal architecture while fully exploiting the value of multiple modalities?

Recently, a research team from BUPT and Peking University proposed AV-NAS, the first work to introduce neural architecture search (NAS) into multimodal video hashing, building a unified search space that covers both Transformer and Mamba. The method not only lets the model automatically discover the optimal cross-modal fusion mechanism (Cross-Mamba), but also reveals an instructive finding: for audio temporal modeling, a seemingly simple "CNN + FFN" structure actually outperforms more complex Transformer designs.

Paper title: AV-NAS: Audio-Visual Multi-Level Semantic Neural Architecture Search for Video Hashing
Paper link: https://dl.acm.org/doi/10.1145/3726302.3729899
Code link: https://github.com/iFamilyi/AV-NAS

To date, AV-NAS has been accepted by SIGIR 2 ...
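To make the idea of a unified search space more concrete, below is a minimal, DARTS-style sketch in PyTorch that mixes candidate sequence-modeling blocks — a Transformer encoder layer, a gated-convolution stand-in for a Mamba-style block, and a CNN + FFN block — with learnable architecture weights. The candidate definitions, the gated-conv stand-in, and all dimensions are illustrative assumptions; this is not the operator set or search algorithm actually used in AV-NAS.

```python
# Minimal DARTS-style sketch of searching over candidate sequence blocks.
# All block definitions below are illustrative assumptions, not AV-NAS's operators.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CNNFFNBlock(nn.Module):
    """Candidate op: 1-D convolution followed by a feed-forward network."""
    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                                   # x: (batch, seq, dim)
        y = self.conv(x.transpose(1, 2)).transpose(1, 2)    # temporal mixing
        return self.norm(x + self.ffn(y))


class GatedConvBlock(nn.Module):
    """Hypothetical stand-in for a state-space (Mamba-style) block: gated depthwise conv."""
    def __init__(self, dim: int, kernel_size: int = 4):
        super().__init__()
        self.in_proj = nn.Linear(dim, 2 * dim)
        self.dwconv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size - 1, groups=dim)
        self.out_proj = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        u, gate = self.in_proj(x).chunk(2, dim=-1)
        u = self.dwconv(u.transpose(1, 2))[..., : x.size(1)].transpose(1, 2)
        return self.norm(x + self.out_proj(u * torch.sigmoid(gate)))


class MixedBlock(nn.Module):
    """Weighted mixture of candidate ops; architecture logits are learned jointly."""
    def __init__(self, dim: int, nhead: int = 4):
        super().__init__()
        self.candidates = nn.ModuleList([
            nn.TransformerEncoderLayer(dim, nhead, batch_first=True),
            GatedConvBlock(dim),    # Mamba-style stand-in
            CNNFFNBlock(dim),       # the "simple but strong" audio candidate
        ])
        self.alpha = nn.Parameter(torch.zeros(len(self.candidates)))  # architecture logits

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.candidates))


if __name__ == "__main__":
    block = MixedBlock(dim=64)
    tokens = torch.randn(2, 32, 64)                # (batch, seq, dim) audio or visual tokens
    print(block(tokens).shape, F.softmax(block.alpha, 0))
```

After search converges, one would typically keep only the candidate with the largest architecture weight in each block and retrain the resulting discrete architecture from scratch.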
Being-H0: A VLA Model That Learns Dexterous Manipulation from Large-Scale Human Videos
具身智能之心· 2025-07-23 08:45
Core Insights
- The article discusses advances in vision-language-action (VLA) models and the challenges the robotics field faces in complex dexterous manipulation, largely due to data limitations [3][4].

Group 1: Research Background and Motivation
- Large language models and multimodal models have made significant progress, but robotics still lacks a transformative "ChatGPT moment" [3].
- Existing VLAs struggle with dexterous tasks because they rely on synthetic data or limited teleoperation demonstrations, which are especially scarce for fine manipulation due to high hardware costs [3].
- Human videos contain rich real-world manipulation data, but learning from them raises challenges such as data heterogeneity, hand-motion quantization, cross-modal reasoning, and transfer to robot control [3].

Group 2: Core Methodology
- The article introduces Physical Instruction Tuning, a paradigm with three phases — pre-training, physical space alignment, and post-training — for transferring human hand-movement knowledge to robotic manipulation [4].

Group 3: Pre-training Phase
- Pre-training treats the human hand as an ideal manipulator and robotic hands as simplified versions of it, training a foundational VLA on large-scale human videos [6].
- Inputs include visual information, language instructions, and parameterized hand motions; the model is optimized to map vision and language to motion [6][8].

Group 4: Physical Space Alignment
- Physical space alignment counteracts the interference caused by differing camera parameters and coordinate systems through weak-perspective-projection alignment and motion-distribution balancing [10][12].
- The model adapts to a specific robot by projecting the robot's proprioceptive state into the model's embedding space and generating executable actions through learnable query tokens [13].

Group 5: Key Technologies
- The article discusses motion tokenization and cross-modal fusion, emphasizing the need to retain fine motion precision while discretizing continuous movements [14][17].
- Hand movements are decomposed into wrist and finger motions, each tokenized separately, with reconstruction accuracy ensured by a combination of loss functions [18]; a minimal tokenizer sketch follows this summary.

Group 6: Dataset and Experimental Results
- The UniHand dataset, comprising over 440,000 task trajectories and 1.3 billion frames, supports large-scale pre-training and covers diverse tasks and data sources [21].
- Experimental results show that Being-H0 outperforms baseline models on hand-motion generation and translation tasks, with better spatial accuracy and semantic alignment [22][25].

Group 7: Long-Sequence Motion Generation
- The model effectively generates long motion sequences (2-10 seconds) using soft-format decoding, which helps maintain trajectory stability [26].

Group 8: Real Robot Operation Experiments
- In practical grasp-and-place tasks, Being-H0 achieves significantly higher success rates than baseline models: 65% on unseen toys and 60% in cluttered scenes [28].
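The sketch below illustrates the "split hand motion into wrist and finger streams and tokenize each separately" idea from Group 5, using a plain vector-quantization codebook per stream trained with a reconstruction loss. The stream dimensions (a 9-D wrist pose and 45-D MANO-like finger parameters) and the simple VQ-VAE objective are illustrative assumptions, not Being-H0's actual tokenizer design.

```python
# Minimal sketch: per-stream VQ tokenization of hand motion (wrist vs. fingers).
# Dimensions and the plain VQ-VAE loss are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class StreamTokenizer(nn.Module):
    """Encode one motion stream, snap to the nearest codebook entry, and decode."""
    def __init__(self, in_dim: int, code_dim: int = 64, num_codes: int = 512):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, code_dim), nn.GELU(), nn.Linear(code_dim, code_dim))
        self.decoder = nn.Sequential(nn.Linear(code_dim, code_dim), nn.GELU(), nn.Linear(code_dim, in_dim))
        self.codebook = nn.Embedding(num_codes, code_dim)

    def forward(self, x):                                     # x: (batch, frames, in_dim)
        z = self.encoder(x)
        # squared Euclidean distance to every codebook entry: (batch, frames, num_codes)
        dists = (z.pow(2).sum(-1, keepdim=True)
                 - 2 * z @ self.codebook.weight.t()
                 + self.codebook.weight.pow(2).sum(-1))
        ids = dists.argmin(dim=-1)                            # discrete motion tokens
        q = self.codebook(ids)
        q_st = z + (q - z).detach()                           # straight-through estimator
        recon = self.decoder(q_st)
        # reconstruction + codebook + commitment terms (standard VQ-VAE objective)
        loss = (F.mse_loss(recon, x)
                + F.mse_loss(q, z.detach())
                + 0.25 * F.mse_loss(z, q.detach()))
        return ids, recon, loss


class HandMotionTokenizer(nn.Module):
    """Split a hand-pose sequence into wrist and finger streams and tokenize each."""
    def __init__(self, wrist_dim: int = 9, finger_dim: int = 45):
        super().__init__()
        self.wrist_dim = wrist_dim
        self.wrist = StreamTokenizer(wrist_dim)
        self.fingers = StreamTokenizer(finger_dim)

    def forward(self, hand_pose):                             # (batch, frames, wrist_dim + finger_dim)
        wrist, fingers = hand_pose.split(
            [self.wrist_dim, hand_pose.size(-1) - self.wrist_dim], dim=-1)
        w_ids, w_recon, w_loss = self.wrist(wrist)
        f_ids, f_recon, f_loss = self.fingers(fingers)
        return (w_ids, f_ids), torch.cat([w_recon, f_recon], dim=-1), w_loss + f_loss


if __name__ == "__main__":
    tokenizer = HandMotionTokenizer()
    poses = torch.randn(4, 30, 54)                            # 30 frames of 54-D hand parameters
    (wrist_ids, finger_ids), recon, loss = tokenizer(poses)
    print(wrist_ids.shape, finger_ids.shape, loss.item())
```

Keeping separate codebooks per stream lets coarse wrist trajectories and fine finger articulation be quantized at different granularities, which is one plausible way to preserve fine motion precision while still producing discrete tokens for the VLA to predict.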