Core Viewpoint

The article covers ByteDance's launch of OmniHuman-1.5, a new virtual-human generation framework that lets virtual humans "think" and express emotion, moving beyond simple imitation toward more complex interaction [2][39].

Group 1: Technological Advancements
- OmniHuman-1.5 introduces a dual-system framework inspired by Daniel Kahneman's "dual-system theory," allowing virtual humans to combine deliberate reasoning with emotional expression [4][13].
- The model demonstrates logical-reasoning ability: it can interpret instructions and execute complex actions in a coherent sequence [6][7].
- It can handle long videos and multi-character interactions, producing diverse expressions and movements rather than monotonous, repetitive motion [8].

Group 2: Framework Components
- The framework consists of two main components: System 1, which handles reactive rendering, and System 2, which is responsible for deliberate planning (see the pipeline sketch after this summary) [14][18].
- System 2 uses a multimodal large language model (MLLM) to generate a coherent action plan from inputs across modalities [17].
- System 1 employs a purpose-built multimodal diffusion model (MMDiT) to synthesize the final video, fusing the high-level plan with low-level audio signals [18][27].

Group 3: Innovations and Solutions
- The "pseudo last frame" concept lets the model maintain identity consistency while still enabling diverse actions, balancing fixed identity against dynamic range (one possible reading is sketched below) [25][20].
- A "two-stage warm-up" training strategy mitigates modality conflicts by ensuring that each branch of the model retains its strengths during training (schematic below) [28][34].
- Ablation studies validate the architecture, showing that both the reasoning and execution components contribute to output quality [35][36].

Group 4: Performance Metrics
- OmniHuman-1.5 outperforms previous models across various metrics, with marked gains in logical coherence and semantic consistency [36][37].
- The model's ability to "think" and express emotion has been quantitatively validated, indicating a leap from purely reactive behavior to more sophisticated interaction [37][39].
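To make the Group 2 pipeline concrete, here is a minimal structural sketch of how a System 2 planner could feed a System 1 renderer. All class and method names (`ActionPlan`, `System2Planner`, `System1Renderer`) are illustrative assumptions, not ByteDance's API.

```python
# Minimal sketch of the dual-system pipeline described in the article.
# NOTE: all names below are illustrative, not OmniHuman-1.5's real API.
from dataclasses import dataclass

@dataclass
class ActionPlan:
    """High-level, ordered action description produced by System 2."""
    steps: list[str]   # e.g. ["look up", "raise cup", "smile"]
    emotion: str       # overall emotional tone inferred from the inputs

class System2Planner:
    """Deliberate planning: an MLLM reasons over image, audio, and text
    jointly to produce a coherent action plan (stubbed here)."""
    def plan(self, image, audio, text_prompt) -> ActionPlan:
        ...

class System1Renderer:
    """Reactive rendering: a multimodal diffusion model (MMDiT) that
    fuses the high-level plan with low-level audio cues such as lip
    sync and rhythm (stubbed here)."""
    def render(self, image, audio, plan: ActionPlan):
        ...

def generate_video(image, audio, text_prompt):
    # System 2 is the slow, deliberate path; System 1 is fast and reactive.
    plan = System2Planner().plan(image, audio, text_prompt)
    return System1Renderer().render(image, audio, plan)
```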
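The summary does not explain the mechanism behind the "pseudo last frame," so the following is only one plausible reading: the identity reference image is injected as an extra frame token whose temporal position lies outside the generated window, so it anchors appearance without pinning pose. The function and tensor layout are assumptions for illustration.

```python
import torch

def build_conditioning(ref_image_latent: torch.Tensor,
                       video_latents: torch.Tensor,
                       pseudo_pos: int):
    """Append the reference image as a 'pseudo last frame'.

    ref_image_latent: (C, H, W) latent of the identity reference image.
    video_latents:    (T, C, H, W) latents of the frames being denoised.
    pseudo_pos:       temporal index assigned to the reference frame,
                      deliberately placed outside the generated window
                      (e.g. T + k) so it constrains identity, not motion.

    NOTE: assumed mechanism; the paper's actual conditioning may differ.
    """
    frames = torch.cat([video_latents, ref_image_latent.unsqueeze(0)], dim=0)
    positions = torch.cat([
        torch.arange(video_latents.shape[0]),   # real frames: 0 .. T-1
        torch.tensor([pseudo_pos]),             # reference: out-of-window
    ])
    return frames, positions
```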
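Likewise, a schematic of the "two-stage warm-up" idea, under the assumption that it means warming up each conditioning branch in isolation before joint training so that no single modality drowns out the others. The helper names and schedule are hypothetical.

```python
import torch.nn as nn

def set_trainable(module: nn.Module, flag: bool) -> None:
    """Freeze or unfreeze every parameter in a module."""
    for p in module.parameters():
        p.requires_grad = flag

def two_stage_warmup(model, branch_loaders, joint_loader, train_one_epoch):
    """Schematic two-stage warm-up (assumed interpretation).

    branch_loaders: {name: (branch_module, dataloader)} per modality.
    train_one_epoch: caller-supplied training loop.
    """
    # Stage 1: each branch learns alone while the rest stays frozen,
    # so each modality branch retains its own strengths.
    for name, (branch, loader) in branch_loaders.items():
        set_trainable(model, False)
        set_trainable(branch, True)
        train_one_epoch(model, loader)

    # Stage 2: unfreeze everything and train all branches jointly.
    set_trainable(model, True)
    train_one_epoch(model, joint_loader)
```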
It doesn't just move its lips, it can also "think"! ByteDance releases OmniHuman-1.5, giving virtual humans a logical soul
机器之心·2025-09-05 07:12