ThinkSound
The entire Hugging Face leaderboard has been swept by Chinese AI models.
数字生命卡兹克 · 2025-07-31 01:06
Core Viewpoint - The article highlights a significant shift in the AI landscape: domestic Chinese models are being open-sourced at a rapid pace while overseas models grow more expensive and less accessible [3][4][54].

Group 1: Open-source Models
- Numerous Chinese companies have been actively open-sourcing their AI models, including MiniMax, Kimi, Qwen, and others [1].
- The top ten models on the Hugging Face leaderboard are all Chinese open-source models, with Zhipu's GLM-4.5 at the top and Qwen holding five positions [8][9] (a minimal loading sketch follows this summary).
- The article emphasizes the rapid development and release of numerous models over a short period, showcasing the strength of domestic open-source efforts [11][12].

Group 2: Recent Model Releases
- Tencent released the Hunyuan A13B model on June 27, featuring 80 billion total parameters and 13 billion active parameters [17][18].
- Baidu's ERNIE 4.5 was officially open-sourced on June 30, offering both pure LLM and multimodal variants [20].
- Alibaba's Tongyi launched the first CoT audio model, ThinkSound, on July 1, aimed at video dubbing [21].
- Zhipu introduced the GLM-4.1V-Thinking model on July 2, which received positive evaluations for its performance [23].
- Kunlun Wanwei released the Skywork-Reward-V2 series on July 4, comprising eight reward models with parameter counts ranging from 600 million to 8 billion [25][26].
- The MOSS-TTSD model was open-sourced by Qiu Xipeng's team on July 5, trained on a million hours of audio [27].
- Ant Group's KAG-Thinker model, focused on interactive reasoning, was released on July 8 [32].
- The Intern-S1 multimodal model was launched by the Shanghai AI Lab on July 26 [41].
- Qwen released a series of models throughout July, including Qwen3-235B and Qwen3-Coder, which achieved top rankings on the Hugging Face leaderboard [37][38][39].

Group 3: Industry Impact
- The article reflects on the transformation of the AI landscape over the past two years, noting that China has moved from follower to leader in open-source AI models [11][56].
- The ongoing trend of open-sourcing in China contrasts sharply with the increasing restrictions and pricing of models from overseas companies [54][55].
- The author concludes that this period marks the beginning of a new era for domestic AI models and the Chinese open-source community [56].
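Because the releases above ship open weights on Hugging Face, any of them can be pulled with the standard transformers API. Below is a minimal sketch; the specific model ID and prompt are illustrative placeholders, not taken from the article.

```python
# Minimal sketch: trying an open-weight Chinese model from Hugging Face.
# The model ID below is an illustrative placeholder -- swap in any of the
# open-source releases mentioned above (Qwen, GLM, Hunyuan, ERNIE, ...).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-0.6B"  # small open-weight variant, easy to run locally
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Explain what 'active parameters' means for an MoE model."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```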
Alibaba Tongyi open-sources the first CoT audio model, nailing audio-visual synchronization
量子位 (QbitAI) · 2025-07-01 03:51
Core Viewpoint - The article discusses advances in AI audio generation, specifically Alibaba's ThinkSound model, which uses chain-of-thought (CoT) reasoning to create high-fidelity audio synchronized with video content, addressing limitations of traditional audio generation methods [4][11].

Group 1: Technology and Features
- ThinkSound is an open-source audio generation model designed for video dubbing, allowing each frame to have a corresponding sound effect [4].
- The model incorporates CoT reasoning to analyze visual dynamics and infer acoustic properties, leading to improved audio-visual synchronization [9][10].
- Official evaluations show that ThinkSound outperforms six mainstream audio generation methods on the VGGSound dataset, achieving significant improvements in key metrics [6].

Group 2: Model Architecture
- ThinkSound operates through a three-stage reasoning process: foundational Foley CoT generation, interactive object-centric CoT generation, and instruction-based audio editing CoT generation [16][22] (a hypothetical sketch of this flow follows this summary).
- The first stage analyzes audio and video to construct a structured CoT that ensures temporally aligned audio synthesis [18].
- The second stage lets users interactively select video elements for sound analysis, enhancing the model's ability to generate contextually relevant audio [20].
- The final stage lets users issue editing commands, which the model processes to modify the audio according to the provided instructions [23].

Group 3: Data and Training
- The model is trained on a specialized dataset called AudioCoT, which includes over 2531.8 hours of audio-visual pairs covering a diverse range of sound effects [31].
- The dataset is derived from sources including VGGSound and AudioSet and is designed to deepen the model's understanding of auditory semantics [31].

Group 4: Performance and Results
- The article highlights that integrating CoT reasoning significantly enhances the realism and quality of generated audio compared to traditional methods [35].
- The model's performance is validated through ablation studies, confirming that CoT reasoning leads to better audio generation outcomes [34].

Group 5: Future Developments
- The Alibaba team plans to continue enhancing ThinkSound and aims to release corresponding APIs for broader accessibility [48].
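To make the three-stage flow concrete, here is a hypothetical orchestration sketch. ThinkSound's real interfaces are not described in the article, so every function and type below is an invented stand-in that only mirrors the stages named above: a scene-level Foley CoT, an object-centric refinement, and instruction-based editing.

```python
# Hypothetical sketch of the three-stage CoT dubbing flow described above.
# None of these names come from the ThinkSound codebase; they only illustrate
# how the stages compose: scene-level CoT -> object-level CoT -> edit CoT.
from dataclasses import dataclass, field

@dataclass
class AudioTrack:
    plan: str                                      # time-aligned sound plan (CoT text)
    edits: list[str] = field(default_factory=list)  # applied editing commands

def foley_cot(video: str) -> str:
    """Stage 1: analyze the whole clip and draft a time-aligned sound plan."""
    return f"[CoT] {video}: footsteps at 0-2s, door slam at 2.1s, rain throughout"

def object_cot(scene_plan: str, selected_object: str) -> str:
    """Stage 2: refine the plan around a user-selected on-screen object."""
    return f"{scene_plan}; focus: {selected_object} (closer perspective, louder)"

def edit_cot(track: AudioTrack, instruction: str) -> AudioTrack:
    """Stage 3: apply a natural-language editing command to the draft audio."""
    track.edits.append(instruction)
    return track

plan = foley_cot("street_scene.mp4")
plan = object_cot(plan, "bicycle bell")
track = edit_cot(AudioTrack(plan), "remove the rain, keep the bell")
print(track)
```

In the model the article describes, each stage would condition an audio-generation backbone on the CoT text; the stubs here just pass strings through to show how the three stages chain together.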