New Work from Saining Xie's Team: Precise 3D Scene Control Without Prompts

Core Viewpoint - The article covers BlenderFusion, a framework from Saining Xie's team that combines a graphics tool (Blender) with diffusion models to give precise, flexible control over visual composition without relying on text prompts [6][9].

Group 1: BlenderFusion Framework
- BlenderFusion lets users control the position, rotation, and scale of objects in generated images using keyboard or mouse input [2][4].
- The framework runs a new three-step pipeline: separating objects from the scene, editing them in 3D in Blender, and generating a high-quality image with a diffusion model [10][9].

Group 2: Step-by-Step Process
- The first step is object-centric layering: objects are separated from the original scene, and their 3D information is inferred with existing vision models such as Segment Anything Model (SAM) and Depth Pro [13][14].
- The second step is Blender-grounded editing: the separated objects can be edited in detail, and the camera controlled, inside Blender [18].
- The final step is generative compositing: a dual-stream diffusion compositor improves the visual quality of the rendered scene while maintaining global consistency [23][22].

Group 3: Techniques and Results
- Two training techniques are introduced: source masking, which helps the model learn to restore complete images from conditional information, and simulated object jittering, which improves the model's ability to decouple camera movement from object movement [24].
- BlenderFusion maintains spatial relationships and visual coherence in complex scene edits, including single-image editing and multi-image scene recomposition [25][29].

Group 4: User Experience and Implications
- The framework gives creators greater freedom and control, letting them manipulate visual elements without being constrained by text prompts [33].
- The process from object layering to high-fidelity generation makes AI image synthesis more intuitive and flexible, akin to building with blocks [35].
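
To make the first two steps more concrete, here is a minimal sketch of the geometry they imply: lifting an object's masked pixels into 3D using per-pixel depth, applying a scale, rotation, and translation, and projecting the result back into the image. This is not BlenderFusion's code. In the framework the edit itself happens inside Blender, and the mask and depth come from models such as SAM and Depth Pro. The pinhole camera model, the `Intrinsics` class, every function name, and the numbers in the demo are assumptions made purely for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Intrinsics:
    """Pinhole camera parameters in pixels (hypothetical values in the demo)."""
    fx: float
    fy: float
    cx: float
    cy: float


def unproject(u, v, depth, k):
    """Lift pixel (u, v) with a depth value to a camera-space 3D point."""
    return ((u - k.cx) * depth / k.fx, (v - k.cy) * depth / k.fy, depth)


def project(point, k):
    """Project a camera-space 3D point back to pixel coordinates."""
    x, y, z = point
    return (k.fx * x / z + k.cx, k.fy * y / z + k.cy)


def edit_point(point, center, scale=1.0, yaw_deg=0.0, translate=(0.0, 0.0, 0.0)):
    """Scale and rotate a point about the object center (Y axis), then translate."""
    a = math.radians(yaw_deg)
    dx, dy, dz = ((point[i] - center[i]) * scale for i in range(3))
    rx = math.cos(a) * dx + math.sin(a) * dz
    rz = -math.sin(a) * dx + math.cos(a) * dz
    return (center[0] + rx + translate[0],
            center[1] + dy + translate[1],
            center[2] + rz + translate[2])


def edit_object(pixels, depth, k, **edit):
    """Move every masked pixel of one object; returns {old_pixel: new_pixel}."""
    points = [unproject(u, v, depth[(u, v)], k) for u, v in pixels]
    center = tuple(sum(p[i] for p in points) / len(points) for i in range(3))
    return {px: project(edit_point(p, center, **edit), k)
            for px, p in zip(pixels, points)}


if __name__ == "__main__":
    k = Intrinsics(fx=500.0, fy=500.0, cx=320.0, cy=240.0)
    mask = [(320, 240), (330, 240)]             # stand-in for a segmentation mask
    depth = {(320, 240): 2.0, (330, 240): 2.0}  # stand-in for estimated depth
    moved = edit_object(mask, depth, k, translate=(0.5, 0.0, 0.0))
    for old, new in moved.items():
        print(old, "->", new)
```

A purely geometric edit like this leaves holes where the object used to be and ignores lighting and occlusion. In the pipeline described above, closing that gap is the role of the generative compositing step.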