Snapchat Proposes Canvas-to-Image: One Canvas Integrating ID, Pose, and Layout

Core Viewpoint
- Canvas-to-Image is a new framework for compositional image generation that integrates various control signals into a single canvas, simplifying the image generation process by allowing users to provide multiple types of control information simultaneously [2][9][31]

Group 1: Traditional Control Limitations
- Traditional image generation methods utilize independent input paths for identity reference, pose sketches, and layout boxes, leading to a fragmented and lengthy process [7][8]
- Users are unable to overlay multiple control signals in the same area of an image, which restricts the complexity of scene construction [8][9]

Group 2: Canvas-to-Image Methodology
- The Canvas-to-Image framework consolidates all control signals onto a single canvas, allowing the model to interpret and execute them within the same pixel space [9][10]
- The multi-task canvas serves as both the user interface and the model's input, enabling the integration of heterogeneous visual symbols and their spatial relationships [14]

Group 3: Training and Inference Process
- During training, the model learns from cross-frame image sets, which introduces significant variations in pose, lighting, and expression, preventing it from relying on simple copy mechanisms [15]
- In the inference phase, users can flexibly combine multiple control modalities on the same canvas, allowing for complex scene generation without switching between different modules [16]

Group 4: Experimental Results
- Canvas-to-Image can simultaneously handle identity, pose, and layout box controls, outperforming baseline methods that often fail under similar conditions [18]
- The model maintains spatial and semantic relationships between characters and objects, generating scenes with natural interactions and coherence even under complex control settings [20][21]

Group 5: Conclusion
- The core value of Canvas-to-Image lies in its ability to visualize multi-modal generation controls, making complex scene construction intuitive through direct manipulation on the canvas [31]
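
The single-canvas idea described under Group 2 can be illustrated with a toy sketch: identity crops, pose skeletons, and layout boxes are all rasterized into one RGB pixel grid, so a downstream model would receive a single image rather than three separate inputs. This is not the authors' implementation. Every name here (`compose_canvas`, `paste_patch`, `draw_line`, `draw_box`), the color choices, and the pure-Python pixel grid are assumptions made for illustration only.

```python
from typing import List, Sequence, Tuple

RGB = Tuple[int, int, int]
Canvas = List[List[RGB]]
Point = Tuple[int, int]

WHITE: RGB = (255, 255, 255)
POSE_COLOR: RGB = (0, 0, 255)   # hypothetical color for pose bones
BOX_COLOR: RGB = (0, 160, 0)    # hypothetical color for layout boxes


def put(canvas: Canvas, x: int, y: int, color: RGB) -> None:
    """Set one pixel, silently ignoring coordinates outside the canvas."""
    if 0 <= y < len(canvas) and 0 <= x < len(canvas[0]):
        canvas[y][x] = color


def paste_patch(canvas: Canvas, patch: Canvas, left: int, top: int) -> None:
    """Paste an identity reference crop with its top-left corner at (left, top)."""
    for dy, row in enumerate(patch):
        for dx, pixel in enumerate(row):
            put(canvas, left + dx, top + dy, pixel)


def draw_line(canvas: Canvas, p0: Point, p1: Point, color: RGB) -> None:
    """Draw one pose bone between two joints using Bresenham's algorithm."""
    (x0, y0), (x1, y1) = p0, p1
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    while True:
        put(canvas, x0, y0, color)
        if x0 == x1 and y0 == y1:
            break
        e2 = 2 * err
        if e2 >= dy:
            err += dy
            x0 += sx
        if e2 <= dx:
            err += dx
            y0 += sy


def draw_box(canvas: Canvas, box: Tuple[int, int, int, int], color: RGB) -> None:
    """Draw a layout box outline; box is (left, top, right, bottom), inclusive."""
    left, top, right, bottom = box
    for x in range(left, right + 1):
        put(canvas, x, top, color)
        put(canvas, x, bottom, color)
    for y in range(top, bottom + 1):
        put(canvas, left, y, color)
        put(canvas, right, y, color)


def compose_canvas(
    width: int,
    height: int,
    identities: Sequence[Tuple[Canvas, Point]],
    poses: Sequence[Tuple[Sequence[Point], Sequence[Tuple[int, int]]]],
    boxes: Sequence[Tuple[int, int, int, int]],
) -> Canvas:
    """Rasterize all three control types into one shared pixel grid."""
    canvas: Canvas = [[WHITE for _ in range(width)] for _ in range(height)]
    for patch, (left, top) in identities:
        paste_patch(canvas, patch, left, top)
    for joints, bones in poses:
        for a, b in bones:
            draw_line(canvas, joints[a], joints[b], POSE_COLOR)
    for box in boxes:
        draw_box(canvas, box, BOX_COLOR)
    return canvas


# Usage: one 2x2 identity crop, one single-bone skeleton, one layout box.
red_patch: Canvas = [[(255, 0, 0)] * 2 for _ in range(2)]
canvas = compose_canvas(
    16, 16,
    identities=[(red_patch, (1, 1))],
    poses=[([(5, 5), (5, 9)], [(0, 1)])],
    boxes=[(10, 10, 14, 14)],
)
```

Because every control lands in the same grid, signals can overlap in the same region, which is the property the summary says separate input paths lack. In this sketch, later draws simply overwrite earlier ones; how the actual model resolves overlapping signals is not specified in the summary.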