Zero-shot generalization
Unifying visual modalities! HKUST team releases a video generation model to accelerate real-world understanding
具身智能之心· 2025-12-17 00:05
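Editor: 量子位

It not only "understands" an object's color and texture, but also "comprehends" depth maps, human poses, and motion trajectories. A unified multimodal, multi-task video generation model has arrived.

A research team from HKUST, CUHK, Tsinghua University, and Kuaishou's Kling (快手可灵) recently proposed a new visual framework: UnityVideo.

Beyond higher generation quality, it achieves zero-shot generalization: it produces plausible results even for objects or scenes it has never seen.

By jointly training on multiple visual modalities (such as depth maps, optical flow, skeletons, and segmentation masks), the model learns more about the regularities of the physical world, and the videos it generates are more realistic and more controllable.

From large text models to large vision models

Looking back at the development of large language models (LLMs), an interesting pattern emerges: models such as GPT and Claude owe much of their strong generalization and reasoning ability to unified training on multiple text sub-modalities, including natural language, code, and mathematical expressions.

This unified training across sub-modalities lets a model transfer knowledge between domains, and striking reasoning abilities emerge as a result.

So, does the vision domain also have the same ...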
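Hunyuan3D open-sources an end-to-end panoramic depth estimator: code and curated panoramic data are now available, with an online demo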
量子位· 2025-10-14 04:08
Core Insights
- The article covers DA, a new end-to-end panoramic depth estimator from Tencent's Hunyuan3D team. It targets two problems: the scarcity of panoramic data and zero-shot generalization [2][8].

Group 1: Background and Challenges
- Panoramic images provide a 360°×180° immersive view, which is essential for applications such as AR/VR and 3D scene reconstruction [5][6].
- Existing depth estimation methods for panoramic images are limited by the scarcity of panoramic depth data and by the spherical distortion inherent in panoramic images [10][12].
- The team therefore set out to expand the available panoramic data and build a robust data foundation for DA [8].

Group 2: Data Augmentation Engine
- The team built a data management engine that converts high-quality perspective depth data into panoramic data, substantially increasing the quantity and diversity of panoramic samples [11][14].
- About 543K panoramic samples were created this way, growing the total from about 63K to about 607K samples and easing the data scarcity problem [14].

Group 3: Model Architecture and Training
- The SphereViT architecture was introduced to mitigate the effects of spherical distortion by letting the model attend to the spherical geometry of panoramic images [16][17].
- Training combines a distance loss for global accuracy with a normal loss for local surface smoothness [18].

Group 4: Experimental Results
- DA achieved state-of-the-art (SOTA) performance, improving AbsRel by 38% on average over the strongest zero-shot methods [23][24].
- In qualitative comparisons, DA produced more accurate geometric predictions than UniK3D; DA was trained on about 21 times as much panoramic data [27].
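For readers unfamiliar with the AbsRel figure quoted in the experimental results, the sketch below shows the standard definition of absolute relative error for depth maps. The summary does not say how the 38% average improvement was computed, so `relative_reduction` is only an assumed reading (a relative drop in error), and all numbers in the example are made up for illustration.

```python
def abs_rel(pred, gt, min_depth=1e-6):
    """Mean of |pred - gt| / gt over pixels whose ground truth is valid."""
    # Pixels with non-positive ground truth are treated as invalid and skipped.
    errors = [abs(p - g) / g for p, g in zip(pred, gt) if g > min_depth]
    if not errors:
        raise ValueError("no valid ground-truth depths")
    return sum(errors) / len(errors)


def relative_reduction(baseline_error, new_error):
    """Fraction by which an error metric drops relative to a baseline.

    Assumption: this is one plausible way to read "38% improvement";
    the article summary does not define it.
    """
    return (baseline_error - new_error) / baseline_error


# Toy values, made up for illustration; they are not results from the article.
gt = [1.0, 2.0, 4.0, 0.0]      # 0.0 marks a pixel with no ground truth
pred = [1.1, 1.8, 4.4, 3.0]
score = abs_rel(pred, gt)       # each valid pixel is off by 10% of its depth
gain = relative_reduction(0.200, 0.124)
```

Here `score` comes out at about 0.1, because every valid pixel is off by 10% of its true depth, and a hypothetical drop in error from 0.200 to 0.124 corresponds to a relative reduction of about 0.38. Lower AbsRel is better.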
Group 5: Application Scenarios
- DA's strong zero-shot generalization supports a wide range of 3D reconstruction applications, such as panoramic multi-view reconstruction [28].
- The model can reconstruct globally aligned 3D point clouds from panoramic images of different rooms in a house or apartment, keeping the geometry spatially consistent across multiple panoramic views [29].
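As a rough illustration of how a panoramic depth prediction becomes a point cloud, the sketch below back-projects an equirectangular distance map onto 3D points. It is not the team's pipeline: the function name `panorama_to_points`, the axis convention (y up, z pointing at the image center), and the treatment of non-positive values as invalid are all assumptions, and it handles a single panorama only, without the cross-view alignment the article mentions.

```python
import math


def panorama_to_points(distance):
    """Back-project an H x W equirectangular distance map to 3D points.

    distance[v][u] is the distance along the viewing ray of pixel (u, v).
    Assumed convention: y is up, z points at the image center.
    """
    height = len(distance)
    width = len(distance[0])
    points = []
    for v in range(height):
        # Latitude runs from +pi/2 at the top row to -pi/2 at the bottom row.
        lat = math.pi / 2 - (v + 0.5) / height * math.pi
        for u in range(width):
            d = distance[v][u]
            if d <= 0:
                continue  # assumed marker for pixels without a prediction
            # Longitude runs from -pi at the left edge to +pi at the right.
            lon = (u + 0.5) / width * 2 * math.pi - math.pi
            x = d * math.cos(lat) * math.sin(lon)
            y = d * math.sin(lat)
            z = d * math.cos(lat) * math.cos(lon)
            points.append((x, y, z))
    return points


# Toy 2 x 4 panorama: every ray hits a surface 2 units away, one pixel invalid.
toy = [[2.0, 2.0, 2.0, 2.0],
       [2.0, 0.0, 2.0, 2.0]]
cloud = panorama_to_points(toy)
```

Because each direction vector has unit length, every returned point lies exactly `d` units from the camera center. Merging clouds from several rooms would additionally require the relative pose of each panorama, which this sketch leaves out.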