Generative Models

Can two images reconstruct a 3D space? Tsinghua & NTU use generative models to unlock a new paradigm for spatial intelligence
量子位· 2025-07-09 01:18
Core Viewpoint
- LangScene-X is a generative framework that builds generalizable 3D language-embedded scenes from sparse views alone. It needs far fewer input images than traditional methods such as NeRF, which typically require more than 20 views [2][5].

Group 1: Challenges in 3D Language Scene Generation
- Current 3D language scene generation faces three core challenges. The first is the conflict between existing methods' dependence on dense views and the sparse inputs available in practice: with only 2-3 images, results show severe 3D structure artifacts and semantic distortion [5].
- The second is disconnected cross-modal information and a lack of 3D consistency: existing models process appearance, geometry, and semantics independently, which leads to semantic misalignment [6].
- The third is the high-dimensional compression of language features together with a generalization bottleneck, which hinders practical use: existing methods show a significant drop in accuracy when switching scenes [7].

Group 2: Solutions Offered by LangScene-X
- LangScene-X uses the TriMap video diffusion model for unified multimodal generation from sparse inputs, significantly reducing RGB and normal consistency errors and improving semantic mask boundary accuracy [8].
- The Language Quantization Compressor (LQC) takes a new approach to high-dimensional feature compression: it maps high-dimensional CLIP features to 3D discrete indices with minimal reconstruction error, which improves cross-scene transferability [9][10].
- A progressive training strategy lets the model generate RGB images, normal maps, and semantic segmentation maps together, improving the efficiency of 3D reconstruction [14].
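The summary does not describe how LQC works internally. As a rough illustration of the general idea of mapping continuous features to discrete indices, here is a minimal sketch of plain nearest-neighbor vector quantization. It is not LangScene-X's compressor: the function names, the toy 4-dimensional features, and the hand-written codebook are all assumptions made for the example (real CLIP features are much higher-dimensional, and a real codebook would be learned).

```python
import math

def nearest_index(feature, codebook):
    """Return the index of the codebook entry closest to `feature` (squared L2 distance)."""
    best_i, best_d = 0, math.inf
    for i, code in enumerate(codebook):
        d = sum((f - c) ** 2 for f, c in zip(feature, code))
        if d < best_d:
            best_i, best_d = i, d
    return best_i

def quantize(features, codebook):
    """Map each feature vector to one discrete codebook index."""
    return [nearest_index(f, codebook) for f in features]

def reconstruct(indices, codebook):
    """Look the indices back up to get approximate feature vectors."""
    return [codebook[i] for i in indices]

def mse(a, b):
    """Mean squared error over all vector components."""
    n = sum(len(x) for x in a)
    return sum((p - q) ** 2 for x, y in zip(a, b) for p, q in zip(x, y)) / n

# Hypothetical toy data: a 3-entry codebook and three noisy features.
codebook = [[1.0, 0.0, 0.0, 0.0],
            [0.0, 1.0, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0]]
features = [[0.9, 0.1, 0.0, 0.0],
            [0.1, 0.8, 0.1, 0.0],
            [0.0, 0.0, 1.1, 0.1]]

indices = quantize(features, codebook)
error = mse(features, reconstruct(indices, codebook))
print(indices, error)
```

Storing one small integer per feature instead of a full vector is what makes this kind of scheme compact; the reconstruction error measures what the compression loses.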
Group 3: Spatial Intelligence and Performance Metrics
- LangScene-X enhances spatial intelligence by accurately aligning text prompts with 3D scene surfaces, so that natural language queries can identify objects within 3D environments [15].
- Empirical results show that LangScene-X achieves an overall mean accuracy (mAcc) of 80.85% and a mean intersection over union (mIoU) of 50.52% on the LERF-OVS dataset, significantly outperforming existing methods [16].
- These capabilities position the model as a potential core driver for applications in VR scene construction, human-computer interaction, and foundational technologies for autonomous driving and embodied intelligence [18].
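For readers unfamiliar with the two metrics quoted above, the sketch below shows one common way to compute them for binary segmentation masks. The summary does not spell out the exact evaluation protocol used on LERF-OVS, so treating mAcc as mean per-query pixel accuracy is an assumption here, and the masks are made-up toy data.

```python
def iou(pred, gt):
    """Intersection over union of two binary masks given as nested lists of 0/1."""
    inter = sum(p & g for rp, rg in zip(pred, gt) for p, g in zip(rp, rg))
    union = sum(p | g for rp, rg in zip(pred, gt) for p, g in zip(rp, rg))
    # Two empty masks agree perfectly; treat that case as IoU = 1.
    return inter / union if union else 1.0

def pixel_accuracy(pred, gt):
    """Fraction of pixels where the prediction matches the ground truth."""
    pairs = [(p, g) for rp, rg in zip(pred, gt) for p, g in zip(rp, rg)]
    return sum(p == g for p, g in pairs) / len(pairs)

def mean(values):
    return sum(values) / len(values)

# Hypothetical (prediction, ground truth) mask pairs, one per text query.
queries = [
    ([[1, 1, 0], [0, 1, 0]], [[1, 0, 0], [0, 1, 1]]),
    ([[0, 1], [1, 1]],       [[0, 1], [1, 1]]),
]

miou = mean([iou(p, g) for p, g in queries])
macc = mean([pixel_accuracy(p, g) for p, g in queries])
print(f"mIoU={miou:.4f} mAcc={macc:.4f}")  # mIoU=0.7500 mAcc=0.8333
```

IoU penalizes both missed and spurious pixels relative to the object's size, which is why mIoU is usually well below accuracy, as in the reported 50.52% versus 80.85%.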
Results are out! The latest ICCV 2025 roundup (autonomous driving / embodied AI / 3D vision / LLM / CV, etc.)
自动驾驶之心· 2025-06-28 13:34
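ICCV is a lively one this year! The ICCV 2025 acceptance results are out and accepted papers are being released one after another. 自动驾驶之心 has put together a roundup of some of the accepted work.
Note: some of these works were previously presented in the 自动驾驶之心 Knowledge Planet (知识星球) community. Scan the QR code to join our autonomous driving community and get the latest updates first.
- Epona: Autoregressive Diffusion World Model for Autonomous Driving
- SynthDrive: Scalable Real2Sim2Real Sensor Simulation Pipeline for High-Fidelity Asset Generation and Driving Data Synthesis
- [Video + analysis] DriveArena: A Controllable Generative Simulation Platform for Autonomous Driving
- Boost 3D Reconstruction using Diffusion-based Intrinsic Estimation
- StableDepth: Scene-Consistent and Scale-Invariant Monocu ...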
An incomplete ICCV 2025 roundup (embodied AI / autonomous driving / 3D vision / LLM / CV, etc.)
具身智能之心· 2025-06-27 09:41
- [Video + analysis] DriveArena: A Controllable Generative Simulation Platform for Autonomous Driving
- Boost 3D Reconstruction using Diffusion-based Intrinsic Estimation
- Epona: Autoregressive Diffusion World Model for Autonomous Driving
- SynthDrive: Scalable Real2Sim2Real Sensor Simulation Pipeline for High-Fidelity Asset Generation and Driving Data Synthesis
- StableDepth: Scene-Consistent and Scale-Invariant Monocular Depth
- CoopTrack: Exploring End-to-End Learning for Efficient Cooperative Sequential Perception
- U-ViLAR: Uncertai ...
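After a year in the making, has Apple finally surpassed Qwen 2.5 at the same parameter count? Apple Intelligence can be integrated with three lines of code, and Apple reveals how it handles inference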
AI前线· 2025-06-10 10:05
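Compiled by | 华卫, 核子可乐
At this year's WWDC developer conference, Apple introduced a new generation of language foundation models built to enhance Apple Intelligence features. The newly optimized foundation models run efficiently on Apple silicon and include a compact model with about 3B parameters and a server-based mixture-of-experts model, the latter a new architecture tailored to Apple's Private Cloud Compute.
Both foundation models belong to the family of generative models Apple has built to support users. The models have improved tool use and reasoning, understand image and text inputs, are faster and more efficient, and support 15 languages as well as the intelligent features integrated across the platform.
According to Apple, it improved the efficiency of both models by developing new model architectures. The on-device model is split into two blocks at a 5:3 depth ratio. All key-value (KV) caches in block 2 are shared directly with the cache generated by the final layer of block 1, which cuts KV cache memory usage by 37.5% and significantly improves time-to-first-token.
Apple also introduced a Parallel-Track Mixture-of-Experts (PT-MoE) design, a new architecture for the server-side model. The model consists of several smaller Transformers ("tracks") that process tokens independently, and only at each track block's ...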
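The memory saving from the 5:3 depth split described above follows from simple arithmetic. The sketch below works it out under one simplifying assumption that is not stated in the article: every layer's KV cache is the same size, so the saving equals block 2's share of the total depth. The function name and the 16-layer example are hypothetical.

```python
from fractions import Fraction

def kv_cache_saving(block1_depth, block2_depth):
    """Fraction of KV-cache memory saved when block 2 keeps no cache of its own.

    Assumes every layer's KV cache has the same size, so the saving is
    simply block 2's share of the total layer count.
    """
    return Fraction(block2_depth, block1_depth + block2_depth)

saving = kv_cache_saving(5, 3)
print(saving, float(saving))  # 3/8 0.375

# Hypothetical 16-layer model split 10:6 (the same 5:3 ratio):
layers_with_own_cache = 10
layers_total = 16
print(1 - Fraction(layers_with_own_cache, layers_total))  # 3/8
```

Under this assumption a 5:3 split saves 3/8 of the cache, that is 37.5%, regardless of the absolute number of layers.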
A single md file earns more than 400 stars: this survey comprehensively analyzes 3D scene generation across four paradigms
机器之心· 2025-06-10 08:41
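In the race to build key technologies such as artificial general intelligence, world models, and embodied intelligence, one capability is becoming increasingly central: high-quality 3D scene generation. Over the past three years, research in this field has grown exponentially, with the number of papers nearly doubling each year, reflecting its key role in multimodal understanding, robotics, autonomous driving, and virtual reality systems.
Technical routes: a comprehensive analysis of four generation paradigms
Early work on 3D scene generation relied mainly on procedural generation. Since 2021, with the rise of generative models (especially diffusion models) and the introduction of new 3D representations such as NeRF and 3D Gaussians, the field has entered a period of explosive growth. Methods have become increasingly diverse and scene modeling capabilities have kept improving, which has also driven the rapid rise in the number of papers. This trend highlights the urgent need for a systematic review and comprehensive evaluation of the field.
Paper title: 3D Scene Generation: A Survey
Paper link: https://arxiv.org/abs/2505.05474
Curated list: https://github.com/hzxie/Awesome-3D-Scene-Generation
In this survey, the research team builds a systematic technical taxonomy that divides existing 3D scene generation methods into four mainstream paradigms, each reviewed in depth through representative works. These four ...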
Can people really fall in love with ChatGPT? After trying to "date" an AI for a week, I noticed something was off
Hu Xiu· 2025-05-11 07:02
Group 1
- The article discusses the growing phenomenon of human-AI relationships, highlighting cases where individuals have developed emotional connections with AI, leading to significant life decisions such as divorce and marriage to AI [2][35][41].
- It mentions that some users have become so immersed in their interactions with AI that they perceive it as a friend or partner, which raises concerns about the implications for real-life relationships and mental health [6][41][49].
- The article emphasizes the need for users to be aware of the potential for dependency on AI, especially for those with underlying psychological issues, and suggests that AI should not replace human interaction [42][57].

Group 2
- The text outlines various strategies for users to enhance their interactions with AI, such as customizing prompts and understanding the AI's response patterns to create a more engaging experience [9][31][44].
- It highlights the importance of treating AI as a conversational partner rather than just a tool, which can lead to deeper self-reflection and personal insights for users [32][41].
- The article also points out the limitations of AI, noting that while it can provide immediate feedback and companionship, it lacks true emotional understanding and memory retention, which can lead to disillusionment [55][56].