Speech Recognition
Hinton's Billionaire PhD Student
QbitAI· 2026-01-10 03:07
Yishui reporting from Aofeisi | QbitAI (WeChat official account: QbitAI). It all started with an old photograph: the group photo from the first Connectionist Summer School at CMU in 1986. Some have hailed it as AI's own "Solvay Conference," reckoning that nearly everyone working today in neural networks, computational neuroscience, or computational linguistics can find an intellectual forebear somewhere in the frame. One photo, one piece of history, one ever-greater figure: this is the Hinton "legend" that has recently been making the rounds again. Circled in the picture is Hinton himself, the inventor of deep learning and winner of both the Nobel Prize in Physics and the Turing Award; it was his persistence that ultimately brought neural networks their spring. Another familiar face is Turing Award winner Yann LeCun, whose later invention of the convolutional neural network ushered in the era of computer vision. (ps: LeCun puts this photo in his slides at every talk he gives; a true devoted fan.) Also in the frame are Stan Dehaene, Mitsuo Kawato, Jay McClelland, and a host of other figures who went on to reach the summits of cognitive science, neuroscience, and computing. In the 1980s these young people were still unknown, but decades later their influence dominates Silicon Valley and Wall Street. Indeed, the photo also includes a then-young PhD student, Peter Brown, Hinton's first doctoral student, who now serves at the top quantitative fund Renaissance ...
The Next Stop for Enterprise Communication: Convergence and Intelligence
Sou Hu Cai Jing· 2025-12-16 06:20
Core Insights
- The article emphasizes the transformation of communication systems from passive recording devices into intelligent partners that analyze customer interactions and provide actionable insights.

Group 1: Evolution of Communication Systems
- The first phase focused on ensuring reliable communication capabilities [17]
- The second phase aimed at creating user-friendly, feature-rich unified platforms [18]
- The current third phase addresses the value of communication systems, turning them into intelligent engines for market insight and customer understanding [19]

Group 2: Value of Voice Data
- Voice messages contain more than the words spoken; their emotional nuance and contextual information can reveal deeper customer sentiment [14]
- Silence and unaddressed issues also provide valuable signals about customer needs and product demand [15]
- Intelligent systems can deliver personalized service at scale by analyzing communication preferences across customer segments [16]

Group 3: Integration with Business Operations
- Integrating softphone systems with CRM, ERP, and data-analytics platforms turns customer profiles from basic records into rich, multidimensional portraits (a minimal integration sketch follows below) [9]
- Marketing strategies can shift from guesswork to data-driven decisions by analyzing common inquiries and customer feedback [10]
- Service responses can become proactive rather than reactive, prioritizing customer interactions based on emotional cues detected in voice messages [10]
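As a concrete illustration of that integration pattern, the sketch below pushes one finished call's summary into a CRM over a plain HTTP webhook. This is a minimal sketch only: the endpoint URL, JSON field names, and sentiment score are assumptions for illustration, not any specific vendor's API.

```python
# Minimal softphone -> CRM integration sketch. The webhook URL and JSON
# field names are illustrative assumptions, not a real vendor's API.
import json
import urllib.request

def push_call_record(crm_webhook_url: str, record: dict) -> int:
    """POST one finished-call summary to the CRM's intake webhook."""
    req = urllib.request.Request(
        crm_webhook_url,
        data=json.dumps(record).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status  # 2xx means the CRM accepted the record

# Transcript-derived fields are what turn a basic contact entry into a
# multidimensional profile; the sentiment cue can drive follow-up priority.
record = {
    "customer_id": "C-1042",                  # hypothetical identifier
    "duration_sec": 312,
    "summary": "asked about renewal pricing; tone frustrated",
    "sentiment": -0.4,                        # negative -> prioritize callback
}
# push_call_record("https://crm.example.com/webhooks/calls", record)
```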
Zhipu Officially Launches the "Zhipu AI Input Method," Aiming to Truly Deliver "the Model at Your Fingertips, Voice as Command"
IPO早知道· 2025-12-10 05:30
Core Viewpoint
- The article discusses the launch of the Zhipu AI Input Method, built on the GLM-ASR series of speech recognition models, which enables seamless voice interaction and aims to boost productivity by letting tasks be completed through voice commands rather than traditional typing [2][4].

Group 1: Product Launch and Features
- On December 10, Zhipu officially released and open-sourced the GLM-ASR series of speech recognition models, including GLM-ASR-2512, which reports a character error rate (CER) of only 0.0717, industry-leading performance for real-time speech-to-text conversion (a minimal CER computation is sketched after this list) [2][4].
- The Zhipu AI Input Method supports accurate voice-to-text transcription, translation, rewriting, and other intelligent operations, embodying the idea of "voice as command" [4][5].
- The input method integrates GLM model capabilities, letting users translate, expand, and polish text directly inside the input box without switching between applications [4][5].

Group 2: Targeted Features for Specific Users
- A dedicated Vibe Coding feature lets developers dictate code logic and comments by voice, improving productivity in coding tasks [5].
- The input method is optimized for public environments, better capturing soft speech and filtering background noise, addressing voice input in settings like open offices and libraries [6].

Group 3: Customization and User Experience
- Users can set different "persona" styles so the same sentence is phrased differently by context, such as formal wording for work reports or casual language for personal conversations [4].
- The input method supports importing custom vocabulary and project code names, making specialized terms easier to dictate [6].
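For reference, character error rate is the character-level edit distance between the hypothesis transcript and the reference, divided by the reference length, so a CER of 0.0717 means roughly 7 character errors per 100 reference characters. Below is a minimal, self-contained sketch of that computation; it is illustrative only, not Zhipu's evaluation code.

```python
# Minimal character error rate (CER) sketch -- illustrative, not
# GLM-ASR's evaluation code. CER = char-level Levenshtein distance
# between hypothesis and reference, divided by reference length.

def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance over characters (insert/delete/substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    return edit_distance(ref, hyp) / max(len(ref), 1)

# One substituted character out of six -> CER of about 0.167.
print(round(cer("今天天气不错", "今天天汽不错"), 3))
```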
Doubao Releases Speech Recognition Model 2.0 with Multimodal Visual Recognition and Support for 13 Foreign Languages
Feng Huang Wang· 2025-12-05 08:55
Core Viewpoint
- The article highlights the launch of Doubao-Seed-ASR-2.0 by Huoshan Engine, which significantly enhances speech recognition through advanced contextual understanding and multimodal visual recognition [1]

Group 1: Model Enhancements
- The 2.0 version features improved inference capabilities, allowing precise recognition through deep contextual understanding [1]
- Overall keyword recall has increased by 20%, a substantial improvement in recognition accuracy (see the recall sketch after this list) [1]

Group 2: Multimodal and Language Support
- The model supports multimodal visual recognition, understanding both audio and visual inputs to improve text recognition accuracy [1]
- It recognizes 13 foreign languages, including Japanese, Korean, German, and French, broadening its applicability in global markets [1]

Group 3: Specialized Recognition
- The model has been upgraded to better handle complex scenarios involving proper nouns, personal names, place names, brand names, and easily confused homophones [1]
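Keyword recall, the metric behind the 20% figure, is simply the share of reference keywords that actually appear in the transcript. The sketch below is an illustrative computation, not Doubao's evaluation pipeline; the keyword list and numbers are assumed.

```python
# Illustrative keyword-recall computation -- not Doubao's evaluation code.
# Recall = (reference keywords found in the transcript) / (all keywords).

def keyword_recall(transcript: str, keywords: list[str]) -> float:
    hits = sum(1 for kw in keywords if kw in transcript)
    return hits / max(len(keywords), 1)

transcript = "Doubao ASR 2.0 adds multimodal visual recognition"
print(keyword_recall(transcript, ["Doubao", "ASR", "multimodal"]))  # 1.0

# A "20% higher recall" claim is a relative comparison across versions,
# e.g. recall going from 0.60 to 0.72 on the same keyword test set.
```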
Huoshan Engine Releases Doubao Speech Recognition Model 2.0
Zhi Tong Cai Jing Wang· 2025-12-05 08:24
Core Insights
- The core point of the article is the launch of Doubao-Seed-ASR-2.0 by Huoshan Engine, which significantly enhances speech recognition through improved contextual understanding and multimodal visual recognition [1]

Group 1: Model Enhancements
- The new model delivers a 20% improvement in overall keyword recall through enhanced contextual understanding [1]
- It supports multimodal visual recognition, so the model not only "hears words" but also "sees images," improving text recognition accuracy with single- and multi-image inputs [1]
- The model accurately recognizes 13 foreign languages, including Japanese, Korean, German, and French [1]

Group 2: Technical Specifications
- The Doubao speech recognition model is built on the Seed mixture-of-experts (MoE) large language model architecture (a generic MoE routing sketch follows below), retaining the 1.0 version's high-performance 2-billion-parameter audio encoder [1]
- The upgrade focuses on optimizing recognition in complex scenarios involving proper nouns, personal names, place names, brand names, and easily confused homophones [1]
- Enhanced contextual reasoning enables multimodal information understanding and accurate mixed-language recognition [1]
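As background on the architecture mentioned above: a mixture-of-experts layer routes each token through a small, gate-selected subset of expert networks, so only a fraction of the total parameters is active per token. The numpy sketch below illustrates that generic top-k routing idea; the shapes, sizes, and names are all assumptions, and this is not the Seed model's implementation.

```python
# Generic top-k mixture-of-experts routing in numpy -- an illustration of
# the MoE idea, not the Seed architecture. All sizes are made up.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 8, 4, 2

# Each "expert" here is a single linear map; real experts are MLPs.
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]
gate_w = rng.normal(size=(d_model, n_experts))  # learned router weights

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route one token vector x through its top-k experts."""
    logits = x @ gate_w
    top = np.argsort(logits)[-top_k:]        # indices of the best experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                 # softmax over chosen experts
    # Skipped experts cost nothing, which is why MoE models keep large
    # total capacity while staying cheap at inference time.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

print(moe_layer(rng.normal(size=d_model)).shape)  # (8,)
```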
Doubao Releases Speech Recognition Model 2.0 with Multimodal Visual Recognition and Support for 13 Foreign Languages
Mei Ri Jing Ji Xin Wen· 2025-12-05 08:10
Core Viewpoint
- The article reports the official launch of Doubao-Seed-ASR-2.0, a speech recognition model by Huoshan Engine that enhances contextual understanding and recognition accuracy through advanced technology [1]

Group 1: Model Features
- The 2.0 version has improved inference capabilities, achieving a 20% increase in overall keyword recall [1]
- It supports multimodal visual recognition, allowing the model to understand both audio and visual inputs and thereby enhancing text recognition accuracy [1]
- The model recognizes 13 foreign languages, including Japanese, Korean, German, and French [1]

Group 2: Targeted Upgrades
- The model has been specifically upgraded to handle complex scenarios involving proper nouns, personal names, place names, brand names, and easily confused homophones [1]
Doubao Input Method Launches: After Two Days, I Don't Want to Type in WeChat Anymore
Xin Lang Cai Jing· 2025-11-24 16:25
Core Viewpoint
- Doubao Input Method, launched by ByteDance, aims to redefine the input experience with AI, and voice recognition is its standout capability.

Group 1: Product Features
- Doubao Input Method has a minimalist interface without intrusive ads, but its 139MB install size is relatively large and some features are missing; the reviewer likens it to an unfinished "bare-shell apartment" [1][2]
- Its core competitive advantage is voice typing, which significantly outperforms other input methods in user experience [2]
- Voice recognition accuracy for Mandarin, English, and Cantonese is exceptionally high, handling complex sentences and phrases [3][4]
- It even handles mixed-language input, such as Cantonese sentences containing English phrases, demonstrating its versatility [4]
- Voice input for mathematical formulas is supported, which is useful for students and educators [5]
- Technically, Doubao Input Method uses the Seed-ASR2.0 model, which reduces error rates by 10%-40% compared with previous models (an illustrative computation of relative reduction follows below) [6]
- An offline voice model of roughly 150MB is available, allowing use in areas with poor signal [6]

Group 2: User Experience
- Basic vocabulary coverage is on par with mainstream input methods, effectively recognizing internet slang and rare characters [9]
- AI capabilities extend the input method's functionality, for example directly answering queries like "Who is the author of Journey to the West?" [11]
- However, it currently supports only Android; iOS and PC versions are forthcoming, limiting cross-device use [11]
- Typing responsiveness may lag initially, but settings allow adjustments to improve speed [13]
- Features such as emoji search and sending are missing, and only basic keyboard layouts are supported [15]

Group 3: Future Considerations
- While the voice recognition is compelling, the reviewer recommends keeping Doubao Input Method as a secondary tool until foundational features such as iOS support and emoji functionality arrive [18]
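On the "10%-40% lower error rate" claim: such figures are normally relative reductions measured on the same test set. Below is a small illustrative computation with assumed numbers, not published Doubao results.

```python
# Relative error-rate reduction -- illustrative arithmetic with assumed
# numbers, not published Doubao benchmark results.

def relative_reduction(old_err: float, new_err: float) -> float:
    return (old_err - new_err) / old_err

# If the old model scored 8.0% CER and the new one 5.6% on the same set,
# that is a 30% relative reduction -- inside the quoted 10%-40% range.
print(round(relative_reduction(0.080, 0.056), 2))  # 0.3
```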
Doubao Input Method Officially Launches with Accurate Speech Recognition and Multi-Dialect Support
Xin Lang Ke Ji· 2025-11-24 09:00
Core Viewpoint
- Doubao Input Method has officially launched, offering both voice and keyboard input and enhancing the user experience through advanced speech recognition and semantic understanding [2][3].

Group 1: Product Features
- Doubao Input Method uses the same voice model as the Doubao App, improving speech recognition and semantic understanding; it supports multiple dialects, English, and mixed Chinese-English input, along with automatic error correction [2]
- Voice input accurately recognizes speech in complex environments, accommodating soft speech, rapid talking, and noisy surroundings, and recognition personalizes over time through user corrections [2]
- Keyboard input also features automatic error correction and intelligent suggestions for text, symbols, and emoji [2]

Group 2: Dialect Support
- Supported dialects currently include Cantonese, Sichuan, Shaanxi, Jianghuai, Jilu, Lanyin, and Jin dialects, with some dialects approaching Mandarin-level recognition accuracy, enhancing the experience for users from different regions [2]

Group 3: Availability
- Doubao Input Method is now officially available on Android app stores and will soon launch on the Apple App Store [3]
A ChatGPT Moment for Translation: Meta Releases a New Model That Learns Obscure Languages from a Few Examples
36Kr· 2025-11-11 12:12
Core Insights
- Meta has launched the Omnilingual ASR system, capable of recognizing over 1,600 languages and aiming to bridge the digital divide in language technology [1][2][5]
- The system covers 500 languages never before transcribed by any AI system, significantly expanding coverage compared to existing models like OpenAI's Whisper [5][7]
- Omnilingual ASR introduces a flexible in-context learning mechanism that lets the model pick up new languages from minimal examples, potentially extending coverage to over 5,400 languages (a conceptual sketch of this flow follows below) [10][11]

Language Coverage and Performance
- The system achieves a character error rate (CER) below 10% on 78% of the tested languages; that share rises to 95% among languages with more than 10 hours of training data [7][8]
- Languages are grouped into high-, medium-, and low-resource tiers, with 95% of high- and medium-resource languages achieving a CER below 10% [8]
- Even low-resource languages show promise, with 36% achieving a CER below 10% [8]

Open Source and Community Engagement
- Omnilingual ASR is fully open-sourced on GitHub, allowing researchers, developers, and organizations to use and modify the model without licensing restrictions [11][13]
- Meta has released a large multilingual speech dataset, including transcriptions for 350 underrepresented languages, to support community-driven language recognition [13][14]
- Development involved collaboration with local language organizations to gather diverse voice samples, ensuring cultural sensitivity and community involvement [15][16]

Technical Specifications
- The model family spans lightweight variants suitable for mobile devices up to models with 7 billion parameters for high accuracy [16][18]
- Training used over 4.3 million hours of audio across 1,239 languages, one of the largest and most diverse speech training datasets ever assembled [18]
- The system's design allows continuous growth and adaptation, enabling it to evolve alongside the diversity of human languages [18]
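To make the few-shot extension mechanism concrete, the sketch below shows the shape of the flow: a handful of paired (audio, transcript) examples in a previously unseen language is supplied as conditioning context at inference time, with no retraining. Every name here (PairedExample, MockModel, transcribe_with_context) is a hypothetical placeholder, not Meta's published Omnilingual ASR API.

```python
# Conceptual shape of few-shot, in-context ASR extension. All names are
# hypothetical placeholders, NOT Meta's Omnilingual ASR API.
from dataclasses import dataclass

@dataclass
class PairedExample:
    audio_path: str   # short clip in the previously unseen language
    transcript: str   # its human-written transcription

class MockModel:
    """Stand-in so the sketch runs; a real model would return text."""
    def transcribe(self, audio: str, context) -> str:
        return f"<transcript of {audio}, conditioned on {len(context)} examples>"

def transcribe_with_context(model, examples, target_audio: str) -> str:
    """Condition on paired examples at inference time -- no gradient
    updates, no retraining -- then transcribe the target clip. This
    zero-retraining route is what could scale coverage past 5,400
    languages."""
    context = [(e.audio_path, e.transcript) for e in examples]
    return model.transcribe(target_audio, context)

shots = [PairedExample("clip1.wav", "..."), PairedExample("clip2.wav", "...")]
print(transcribe_with_context(MockModel(), shots, "new_clip.wav"))
```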
X @taresky
taresky· 2025-09-26 16:47
I've tested the native input method's speech recognition in macOS 26, and it still gets crushed by the Soniox model. At best it's barely usable, nowhere near actually good, especially with mixed Chinese-English input. ...