Core Viewpoint
- The article highlights the breakthrough capabilities of MOSS-Transcribe-Diarize, a model developed by MOSI AI for multi-speaker automatic speech recognition (ASR) that outperforms existing models such as GPT-4o and Gemini in complex audio environments [1][2][9].

Group 1: Model Capabilities
- MOSS-Transcribe-Diarize handles overlapping speech and chaotic dialogue effectively, with a significant improvement in transcription accuracy [1][5].
- The model supports a long context window of 128K tokens, allowing it to process audio inputs of up to 90 minutes in a single pass [1][9].
- It achieves state-of-the-art (SOTA) performance on the AISHELL-4, Podcast, and Movies benchmarks, and does particularly well under challenging audio conditions [2][16][19].

Group 2: Technical Innovations
- The model uses a unified end-to-end multimodal architecture that integrates speech recognition, speaker attribution, and timestamp prediction, addressing the classic Speaker-Attributed, Time-Stamped Transcription (SATS) problem [8][12].
- MOSS-Transcribe-Diarize is trained on a combination of real-world dialogue audio and synthetic data, which improves its robustness to overlapping speech and acoustic variation [13][14].
- The architecture directly outputs text with speaker labels and precise timestamps, and uses semantic information to improve accuracy [12][14].

Group 3: Competitive Advantage
- In benchmark tests, MOSS-Transcribe-Diarize significantly outperformed competitors such as GPT-4o and Gemini 3 Pro on Character Error Rate (CER) and concatenated minimum-permutation Character Error Rate (cpCER), particularly on long audio inputs [16][19].
- The model keeps speaker identities consistent across long dialogues, which reduces the performance degradation caused by speaker attribution errors [16].
- It demonstrates superior performance across real-world meetings, podcasts, and complex film dialogue, showing its versatility and effectiveness [19][21].

Group 4: Future Directions
- MOSI AI aims to continue advancing multimodal intelligence, focusing on enabling AI to understand complex real-world contexts and achieve natural, coherent, and reliable interactions [24].
- The company's strategic vision is to develop technologies that enhance real-time dialogue interaction and robust speech understanding, positioning itself as a leader in the AI field [24].
Beating GPT and Gemini: MOSI AI (模思智能), a startup team incubated by Fudan × Chuangzhi, launches a new speech model
机器之心 (Synced) · 2026-01-20 10:19