CosyVoice

Sou Hu Cai Jing · 2025-07-15 13:37

Group 1
- Recent years have seen a small boom in artificial intelligence, with tools for voice recognition, meeting summarization, and interactive text models emerging alongside image generation technologies such as Midjourney and Stable Diffusion [1]
- There is a growing sentiment that these AI tools may not be as user-friendly as initially thought, a gap that can be analyzed through the basic unit of "information" [3]

Group 2
- Voice: humans can understand speech at approximately 150 to 200 words per minute, which the article equates to about 1,600 bits of information per minute [4]
- Images: a person can theoretically process about 189 MB of image information per minute, assuming one 1024x1024-pixel image is understood per second [6]
- Text: the average reading speed is estimated at 250 to 300 words per minute, giving an information flow of about 10,000 bits per minute [8][9]

Group 3
- Ranked by information transmission capacity, voice carries the least at about 1,600 bits per minute, text sits in the middle at about 10,000 bits per minute, and images carry the most at about 189 MB per minute [11]
- AI applications in voice recognition and generation have reached or exceeded human levels, with tools like CosyVoice and SenseVoice performing well [11]
- Text-based AI models, particularly since the advent of ChatGPT, are also approaching human-level performance, with models like Qwen2 achieving top-tier status [11]
- Image generation and recognition still lag behind, primarily because images carry far more information than voice or text [11]
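
The image figure in Group 2 can be reproduced with a short calculation. This is a minimal sketch under assumptions the summary does not state: each pixel is taken to be uncompressed 24-bit RGB (3 bytes), and 1 MB is taken to be 10^6 bytes. The voice and text values are copied from the article as given rather than derived, and the function and constant names are illustrative only.

```python
# Assumptions (not stated in the source): 3 bytes per pixel (24-bit RGB),
# no compression, and 1 MB = 10**6 bytes.
WIDTH = 1024
HEIGHT = 1024
BYTES_PER_PIXEL = 3
IMAGES_PER_MINUTE = 60  # one image understood per second, as in the article

# Figures quoted in the article, used as-is (not derived here).
VOICE_BITS_PER_MINUTE = 1_600
TEXT_BITS_PER_MINUTE = 10_000


def image_bytes_per_minute(width=WIDTH, height=HEIGHT,
                           bytes_per_pixel=BYTES_PER_PIXEL,
                           images_per_minute=IMAGES_PER_MINUTE):
    """Raw bytes of image data viewed per minute under the assumptions above."""
    return width * height * bytes_per_pixel * images_per_minute


def bits_per_minute_by_modality():
    """Put all three modalities on a common scale (bits per minute)."""
    return {
        "voice": VOICE_BITS_PER_MINUTE,
        "text": TEXT_BITS_PER_MINUTE,
        "image": image_bytes_per_minute() * 8,
    }


if __name__ == "__main__":
    megabytes = image_bytes_per_minute() / 10**6
    print(f"image: {megabytes:.1f} MB per minute")
    rates = bits_per_minute_by_modality()
    for name in sorted(rates, key=rates.get):
        print(f"{name}: {rates[name]:,} bits per minute")
```

Under these assumptions the image rate comes out at roughly 189 MB per minute, matching the article's figure, and the ordering of voice, text, and images holds by several orders of magnitude once all three are expressed in bits.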