Multimodal Control
Snapchat Proposes Canvas-to-Image: A Single Canvas Integrating ID, Pose, and Layout
机器之心· 2025-12-09 03:17
Canvas-to-Image is a new framework for compositional image creation. It does away with the traditional pipeline of scattered, separate controls, consolidating heterogeneous control signals, such as identity reference images, spatial layouts, and pose sketches, into a single canvas. Whatever the user places or draws on that canvas is interpreted directly by the model as a generation instruction, streamlining control over the image generation process.
Authors: Yusuf Dalva, Guocheng Gordon Qian*, Maya Goldenberg, Tsai-Shien Chen, Kfir Aberman, Sergey Tulyakov, Pinar Yanardag, Kuan-Chieh Jackson Wang
Corresponding author: Guocheng Gordon Qian
Affiliations: ¹Snap Inc. ²UC Merced ³Virginia Tech
Paper title: Canvas-to-Image: Compositional Image Generation with Multimodal Controls
Project page: https://snap-research.github.io/canvas-to-image/
arXiv: arxiv.org/abs/2511.216 ...
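As a rough illustration of the single-canvas idea, the sketch below models a canvas onto which typed control elements (an identity reference, a pose sketch, a layout box) are placed at chosen coordinates. All names here (`Canvas`, `place`, the element kind tags) are hypothetical for illustration; the actual system consumes the rendered canvas pixels, not a symbolic element list.

```python
from dataclasses import dataclass, field

@dataclass
class CanvasElement:
    kind: str      # control type tag, e.g. "identity", "pose", "layout" (hypothetical)
    box: tuple     # (x0, y0, x1, y1) placement rectangle on the canvas
    payload: str   # reference image path, sketch path, or a text label

@dataclass
class Canvas:
    width: int
    height: int
    elements: list = field(default_factory=list)

    def place(self, kind, box, payload):
        # Validate that the element lies inside the canvas, then record it.
        x0, y0, x1, y1 = box
        assert 0 <= x0 < x1 <= self.width and 0 <= y0 < y1 <= self.height
        self.elements.append(CanvasElement(kind, box, payload))
        return self

    def to_prompt(self):
        # A real model would read the rendered canvas directly; here we just
        # serialize the element list to show what one canvas can carry.
        return [(e.kind, e.box, e.payload) for e in self.elements]

canvas = Canvas(1024, 1024)
canvas.place("identity", (32, 32, 288, 288), "face_ref.png")
canvas.place("pose", (320, 64, 704, 960), "pose_sketch.png")
canvas.place("layout", (720, 600, 1000, 1000), "dog")
```

The point of the single-canvas design is visible even in this toy: all three control types travel in one object, so the user never juggles separate conditioning channels.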
Tencent Hunyuan3D-Omni: A 3D ControlNet Breakthrough in Multimodal Control for High-Precision 3D Asset Generation
机器之心· 2025-09-29 06:55
Core Viewpoint
- The article discusses Tencent's launch of Hunyuan3D-Omni, a unified multimodal controllable 3D generation framework that addresses the limitations of existing image-input-reliant methods, enhancing the precision and versatility of 3D asset creation across industries [2][5][31].
Background and Challenges
- The growing scale of 3D data has driven generative models built on native 3D representations such as point clouds and voxels; Hunyuan3D 2.1 combines a 3D Variational Autoencoder (VAE) with a Latent Diffusion Model (LDM) for efficient 3D model generation [5].
- Existing methods suffer from geometric inaccuracies caused by single-view image inputs, difficulty exercising fine control over object proportions and details, and limited ability to adapt to multimodal inputs [6][7].
Core Innovations of Hunyuan3D-Omni
- Hunyuan3D-Omni introduces two key innovations: a lightweight unified control encoder that handles multiple control conditions, and a progressive difficulty-aware training strategy that improves the robustness of multimodal integration [9][10].
- The framework supports up to four types of control signals, significantly improving the controllability and quality of generated results [9].
Key Implementation Methods
- The system accepts four control signals [11][14]:
  1. Skeleton, for character motion control
  2. Bounding box, for adjusting object proportions
  3. Point cloud, for providing a geometric structure prior
  4. Voxel, for sparse geometric hints
Experimental Results
- With skeleton control, the model generates high-quality character geometry aligned with target poses while preserving geometric detail across varied input styles [18][19].
- Bounding-box control effectively adjusts object proportions and enables intelligent geometric reconstruction, as evidenced by successful generation of complex structures [23][25].
- Point-cloud inputs significantly mitigate the geometric ambiguities inherent in single-view images, ensuring accurate alignment with real-world structures [25][27].
- Voxel conditions enhance the model's ability to reconstruct detailed geometric features, improving overall generation quality [27][28].
Conclusion
- Hunyuan3D-Omni is a lightweight, multimodal, controllable 3D generation framework that integrates diverse geometric and control signals without compromising the base model's capabilities, paving the way for future advances in multimodal 3D generation [31].
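The unified control encoder described above can be caricatured as a router: each of the four signal types is tokenized by its own small adapter, and the resulting tokens are concatenated into one sequence that a single shared encoder consumes. The sketch below is a minimal pure-Python illustration under that assumption; the tokenizer names, token format, and tag strings are all hypothetical, not the paper's implementation (which works with learned embeddings, not tuples).

```python
# Hypothetical sketch: route any subset of the four Hunyuan3D-Omni control
# signals into one shared token sequence. Shapes and names are illustrative.

def tokenize_skeleton(joints):
    # One token per joint position (x, y, z).
    return [("skeleton", tuple(j)) for j in joints]

def tokenize_bbox(box):
    # A single token encoding the target proportions.
    return [("bbox", tuple(box))]

def tokenize_point_cloud(points):
    # One token per sampled surface point.
    return [("points", tuple(p)) for p in points]

def tokenize_voxels(occupied):
    # One token per occupied voxel cell (a sparse geometric hint).
    return [("voxel", tuple(v)) for v in occupied]

TOKENIZERS = {
    "skeleton": tokenize_skeleton,
    "bbox": tokenize_bbox,
    "point_cloud": tokenize_point_cloud,
    "voxel": tokenize_voxels,
}

def encode_controls(controls):
    """Merge whichever control signals are present into one tagged sequence,
    so a single downstream encoder can attend over all of them at once."""
    tokens = []
    for kind, payload in controls.items():
        tokens.extend(TOKENIZERS[kind](payload))
    return tokens

# Example: condition on a bounding box plus a tiny point cloud.
seq = encode_controls({
    "bbox": (1.0, 0.5, 2.0),                # width : height : depth ratios
    "point_cloud": [(0, 0, 0), (1, 0, 0)],
})
```

Tagging each token with its signal type is one simple way a single encoder could stay "unified" while still distinguishing a skeleton joint from a voxel hint; it also makes any subset of the four signals optional, matching the framework's mix-and-match controllability.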