Alibaba Tongyi Open-Sources the First CoT Audio Model, Nailing Audio-Visual Sync
量子位 (QbitAI) · 2025-07-01 03:51
Core Viewpoint
- The article covers advances in AI audio generation, focusing on Alibaba's ThinkSound model. ThinkSound uses chain-of-thought (CoT) reasoning to generate high-fidelity audio that stays synchronized with video content, addressing limitations of traditional audio generation methods [4][11].

Group 1: Technology and Features
- ThinkSound is an open-source audio generation model designed for video dubbing, allowing each frame to have a corresponding sound effect [4].
- The model uses CoT reasoning to analyze visual dynamics and infer acoustic properties, which improves audio-visual synchronization [9][10].
- Official evaluations show that ThinkSound outperforms six mainstream audio generation methods on the VGGSound dataset, with significant improvements in key metrics [6].

Group 2: Model Architecture
- ThinkSound operates through a three-stage reasoning process: foundational Foley CoT generation, interactive object-centric CoT generation, and instruction-based audio editing CoT generation [16][22].
- The first stage analyzes audio and video to construct a structured CoT that ensures temporal alignment for audio synthesis [18].
- The second stage lets users interactively select video elements for sound analysis, improving the model's ability to generate contextually relevant audio [20].
- The final stage lets users issue editing commands, which the model processes to modify the audio according to the instructions [23].

Group 3: Data and Training
- The model is trained on a specialized dataset called AudioCoT, which contains 2,531.8 hours of audio-visual pairs covering a diverse range of sound effects [31].
- The dataset is derived from various sources, including VGGSound and AudioSet, and is designed to deepen the model's understanding of auditory semantics [31].
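The three-stage process described under Group 2 can be pictured as a pipeline in which each stage appends its own reasoning step to a running draft. The sketch below is purely illustrative and is not ThinkSound's code or API: every name in it (`CoTStep`, `AudioDraft`, `foley_stage`, `object_stage`, `edit_stage`, `run_pipeline`) is hypothetical, and plain strings stand in for the model's actual reasoning and audio output. It assumes the second and third stages are optional and user-driven, as the summary suggests.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CoTStep:
    stage: str      # which stage produced this step: "foley", "object", or "edit"
    reasoning: str  # text placeholder for the chain-of-thought content


@dataclass
class AudioDraft:
    description: str  # text placeholder standing in for generated audio
    history: List[CoTStep] = field(default_factory=list)


def foley_stage(video_caption: str) -> AudioDraft:
    """Stage 1 (hypothetical): build a base soundscape from the whole clip."""
    step = CoTStep("foley", f"Order of events in clip: {video_caption}")
    return AudioDraft(f"soundscape for '{video_caption}'", [step])


def object_stage(draft: AudioDraft, selected_object: str) -> AudioDraft:
    """Stage 2 (hypothetical): refine around an element the user selected."""
    step = CoTStep("object", f"Focus on sound source: {selected_object}")
    return AudioDraft(
        f"{draft.description}, emphasizing {selected_object}",
        draft.history + [step],
    )


def edit_stage(draft: AudioDraft, instruction: str) -> AudioDraft:
    """Stage 3 (hypothetical): apply a natural-language editing instruction."""
    step = CoTStep("edit", f"Apply instruction: {instruction}")
    return AudioDraft(
        f"{draft.description}, edited to {instruction}",
        draft.history + [step],
    )


def run_pipeline(
    video_caption: str,
    selected_object: Optional[str] = None,
    instruction: Optional[str] = None,
) -> AudioDraft:
    """Run stage 1 always; run stages 2 and 3 only when the user supplies input."""
    draft = foley_stage(video_caption)
    if selected_object is not None:
        draft = object_stage(draft, selected_object)
    if instruction is not None:
        draft = edit_stage(draft, instruction)
    return draft


if __name__ == "__main__":
    result = run_pipeline(
        "an owl hoots, then flies off",
        selected_object="owl",
        instruction="add light wind",
    )
    for step in result.history:
        print(step.stage, "->", step.reasoning)
```

Keeping each stage as a function that returns a new draft makes the reasoning trail easy to inspect, which mirrors the idea of a structured CoT guiding each step; how the real model passes state between stages is not specified in the summary.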
Group 4: Performance and Results
- Integrating CoT reasoning significantly enhances the realism and quality of generated audio compared to traditional methods [35].
- The model's performance is validated through ablation studies, which confirm that CoT reasoning leads to better audio generation outcomes [34].

Group 5: Future Developments
- The Alibaba team plans to continue enhancing ThinkSound and aims to release corresponding APIs for broader accessibility [48].