Speech Large Models
iFlytek Debuts Versatile Voice Cloning Technology
Core Insights
- The article highlights the launch of iFlytek's "Voice Cloning" technology, built on the Spark Voice Model, which allows users to replicate any voice with high fidelity from a single recording [1]
- This breakthrough is expected to bring transformative changes to fields including digital humans, audiobooks, and content creation [1]

Group 1
- iFlytek has introduced a pioneering technology that enables high-fidelity voice replication from a single recording [1]
- The technology allows users to create voices in any style with a simple command [1]
- These advancements are anticipated to disrupt multiple industries, enhancing digital content and user interaction [1]
IEEE International Standard in the Speech Large Model Field, Led by State Grid Shandong Electric Power Research Institute, Approved for Initiation
Core Insights
- The IEEE international standard "Guide for a Data and Knowledge Processing Framework for Constructing Large Speech-Language Models" has been approved for establishment, marking a significant breakthrough for the State Grid Shandong Electric Power Research Institute in the field of speech large models [1][4]

Group 1: Industry Context
- Speech large models are widely applied in sectors such as smart automobiles, smart IoT devices, intelligent customer service, and smart education [3]
- The data used to construct speech large models differs significantly from traditional text data in storage format, annotation format, feature structure, and processing, leading to non-unified data formats, difficulties in cross-organizational sharing, lack of version management, high data security risks, and low processing efficiency [3]

Group 2: Standard Development Process
- Under the guidance of the Digital Department of the State Grid Shandong Electric Power Company, the institute leveraged its extensive experience in AI research and engineering practice to address these industry pain points and proactively applied for the IEEE international standard [4]
- The team conducted technical research and validation to ensure the standard is scientific and forward-looking, prepared the PAR (Project Authorization Request) and presentation materials, and reported on the standard's framework at the IEEE Knowledge Engineering Standardization Committee plenary meeting [4]
- The approved standard aims to establish a data processing and management framework for constructing speech large models, addressing non-unified storage specifications and formats, difficulties in cross-organizational sharing, lack of version management, high security risks, low query and processing efficiency, and high annotation costs [4] (a hypothetical sketch of such a unified record follows this summary)

Group 3: Future Plans
- The institute plans to collaborate with domestic and international partners to accelerate the standard's development, ensuring it aligns with international norms and reflects China's technological strengths in this field [6]
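The pain points the article lists (non-unified formats, version management, sharing, security) are easy to make concrete. Below is a minimal, purely hypothetical Python sketch of the kind of unified sample record such a framework might standardize; the `SpeechSample` class, field names, and labels are invented for illustration and are not taken from the approved standard, whose text is not public.

```python
# Hypothetical sketch only: illustrates a unified speech-sample record of the
# kind a data-processing framework for speech large models might standardize.
from dataclasses import dataclass
import hashlib
import json

@dataclass
class SpeechSample:
    """One speech training sample with unified storage and annotation metadata."""
    audio_uri: str          # where the waveform lives (file path, object store URI, ...)
    sample_rate_hz: int     # declared up front so consumers need not probe the file
    transcript: str         # reference text for the utterance
    annotation_schema: str  # which annotation format/version the labels follow
    security_label: str     # e.g. "public" / "org-internal" / "restricted"
    dataset_version: str    # enables version management across corpus releases
    checksum_sha256: str = ""  # content hash, for integrity checks and deduplication

    def finalize(self, audio_bytes: bytes) -> None:
        """Stamp the record with a content hash so shared copies can be verified."""
        self.checksum_sha256 = hashlib.sha256(audio_bytes).hexdigest()

    def to_json(self) -> str:
        """Serialize to one interchange format for cross-organizational sharing."""
        return json.dumps(self.__dict__, ensure_ascii=False)

# Example: registering one utterance (all values illustrative)
sample = SpeechSample(
    audio_uri="s3://corpus/utt_0001.wav",
    sample_rate_hz=16000,
    transcript="合闸操作已完成",
    annotation_schema="grid-asr-labels/v2",
    security_label="org-internal",
    dataset_version="2025.1",
)
sample.finalize(b"...raw wav bytes...")
print(sample.to_json())
```

A single declared schema like this is what makes the article's other goals (cross-organizational sharing, versioned releases, security labeling) tractable, since every consumer reads the same fields instead of probing heterogeneous files.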
Xiaomi Open-Sources Its First Native End-to-End Speech Large Model; Consumer Electronics ETF (561600) Rises Over 1.2%, Pushing for an Eighth Straight Gain
Sina Finance · 2025-09-19 02:16
Group 1
- Xiaomi has officially open-sourced its first native end-to-end speech model, Xiaomi-MiMo-Audio, which is built on an innovative pre-training architecture and over 100 million hours of training data, achieving ICL-based few-shot generalization in the speech domain for the first time [1]
- The improvement in Xiaomi's AI capabilities is expected to enhance the user experience of related consumer electronics products [1]
- As of September 19, 2025, the CSI Consumer Electronics Theme Index (931494) had risen 1.53%, with notable gains in constituent stocks such as Luxshare Precision (002475), up 4.97%, and Industrial Fulian (601138), up 5.95% [1]

Group 2
- As of August 29, 2025, the top ten weighted stocks in the CSI Consumer Electronics Theme Index (931494) include Cambricon (688256), Luxshare Precision (002475), and SMIC (688981), collectively accounting for 54.8% of the index [2]
- The CSI Consumer Electronics Theme Index comprises 50 listed companies involved in component production and complete-device brand design and manufacturing, reflecting the overall performance of consumer electronics-related securities [1][2]
- The Consumer Electronics ETF (561600) closely tracks the CSI Consumer Electronics Theme Index, which has recently logged an eight-day consecutive rise [1]
What Exactly Is a Large Model? Which Technical Fields Does It Cover? An In-Depth Primer for Beginners
自动驾驶之心· 2025-08-05 23:32
Core Insights
- The article provides a comprehensive overview of large language models (LLMs): their definitions, architectures, capabilities, and notable developments in the field [3][6][12]

Group 1: Definition and Characteristics of LLMs
- Large language models (LLMs) are deep learning models trained on vast amounts of text data, capable of understanding and generating natural language [3][6]
- Key features of modern LLMs include large-scale parameters (e.g., GPT-3 with 175 billion parameters), the Transformer architecture, pre-training followed by fine-tuning, and multi-task adaptability [6][12]

Group 2: LLM Development and Architecture
- The Transformer architecture, introduced by Google in 2017, is the foundational technology behind LLMs and consists of an encoder and a decoder [9]
- Encoder-only architectures such as BERT excel at text understanding, while decoder-only architectures such as GPT are optimized for text generation [10][11] (a minimal sketch follows this summary)

Group 3: Core Capabilities of LLMs
- LLMs can generate coherent text, assist with coding, answer factual questions, and perform multi-step reasoning [12][13]
- They also excel at text understanding and transformation tasks such as summarization and sentiment analysis [13]

Group 4: Notable LLMs and Their Features
- OpenAI's GPT series is a key player in LLM development, known for strong general capabilities and continuous innovation [15][16]
- Meta's Llama series emphasizes open-source development and multi-modal capabilities, with significant impact on the AI community [17][18]
- Alibaba's Qwen series focuses on comprehensively open-sourced models with strong support for Chinese and multilingual tasks [18]

Group 5: Visual Foundation Models
- Visual foundation models are essential for processing visual inputs, connecting visual data to LLMs [25]
- They use architectures such as Vision Transformers (ViT) and hybrids combining CNNs and Transformers for tasks including image classification and cross-modal understanding [26][27]

Group 6: Speech Large Models
- Speech large models are designed to handle a range of speech-related tasks and are trained on large-scale speech data [31]
- They primarily use Transformer architectures to capture long-range dependencies in speech, enabling tasks such as speech recognition and translation [32][36]

Group 7: Multi-Modal Large Models (MLLMs)
- Multi-modal large models can process and understand multiple data types, such as text, images, and audio, enabling complex interactions [39]
- Their architecture typically comprises pre-trained modal encoders, a large language model, and a modal decoder for generating outputs [40]

Group 8: Reasoning Large Models
- Reasoning large models enhance LLMs' reasoning through optimized prompting and external knowledge integration [43][44]
- They improve the accuracy and controllability of complex tasks without fundamentally altering the model structure [45]
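To make the encoder/decoder distinction concrete, here is a minimal PyTorch sketch of the decoder-only pattern the article attributes to GPT-style models: a stack of Transformer blocks restricted by a causal mask so each position attends only to itself and earlier tokens. The `TinyDecoderLM` name and all dimensions are illustrative assumptions, not taken from any model named above.

```python
# Minimal decoder-only (GPT-style) language model sketch; sizes are toy values.
import torch
import torch.nn as nn

class TinyDecoderLM(nn.Module):
    def __init__(self, vocab_size=32000, d_model=256, n_heads=4, n_layers=2, max_len=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)  # token embeddings
        self.pos_emb = nn.Embedding(max_len, d_model)     # learned positions
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True,
        )
        # An "encoder" layer plus a causal mask behaves as a decoder-only block.
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)     # next-token logits

    def forward(self, ids):  # ids: (batch, seq) of token indices
        seq = ids.size(1)
        x = self.tok_emb(ids) + self.pos_emb(torch.arange(seq, device=ids.device))
        # Upper-triangular -inf mask: position i cannot attend to positions > i.
        mask = torch.triu(
            torch.full((seq, seq), float("-inf"), device=ids.device), diagonal=1
        )
        return self.lm_head(self.blocks(x, mask=mask))

logits = TinyDecoderLM()(torch.randint(0, 32000, (1, 16)))
print(logits.shape)  # torch.Size([1, 16, 32000]): one next-token distribution per position
```

The same stack without the causal mask, trained with a bidirectional objective such as masked-token prediction, gives the encoder-only (BERT-style) pattern the article contrasts it with.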
Li Mu Updates on Bilibili! Teaching You to Hand-Build a Speech Large Model, with Fully Open-Source Code and an Online Demo
量子位 (QbitAI) · 2025-07-23 06:36
Core Insights
- The article discusses the return of Li Mu and his new audio model, Higgs Audio V2, which integrates text and speech processing capabilities [1][2]

Group 1: Model Capabilities
- Higgs Audio V2 handles a variety of speech tasks, including generating multilingual dialogues, automatic prosody adjustment, melody humming with cloned voices, and simultaneous generation of speech and background music [3][4]
- The model folds 10 million hours of speech data into the training of a large language model (LLM), enabling it to both understand and generate speech [4][6]

Group 2: Technical Implementation
- The model unifies traditional text and speech modeling, letting an LLM communicate in speech by converting speech tasks into a shared processing format [7][8]
- A unified discretized audio tokenizer was developed that preserves audio quality while capturing semantic and acoustic features at 25 frames per second [11][13] (an illustrative sketch follows this summary)
- Training data was sourced from various platforms, with roughly 90% filtered out for quality to arrive at the 10-million-hour corpus [14][15]

Group 3: Model Training and Architecture
- To strengthen the model's understanding and generation of sound, a secondary audio model, AudioVerse, was trained to analyze user speech input and supply contextual information to the main model [16]
- The final multimodal model can perform complex tasks, such as writing and singing a song with accompaniment, and can analyze scenes and characters from audio input [17][18]

Group 4: Performance Metrics
- In real-time voice chat, the model achieves low latency and can understand and express emotions, beating other models with win rates of 75.7% and 55.7% in the emotion and question categories, respectively [19]
- The model also excelled in traditional TTS benchmarks, achieving the best performance across evaluations [20]

Group 5: Accessibility and Community Engagement
- The model's code is publicly available on GitHub, along with an online demo platform for experimentation [23][31]
- The article encourages users, especially those interested in creating content such as virtual streamers, to try the model for voice cloning and related applications [25]
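The unified discretized tokenizer is the hinge of this design: at 25 frames per second, audio becomes a stream of discrete codes that an LLM can mix with ordinary text tokens in one sequence. The sketch below is not Higgs Audio V2's actual code; the codebook size, token offset, and sentinel ids are assumptions made only so the interleaving logic is runnable end to end.

```python
# Illustrative sketch of discrete audio tokenization and text/audio interleaving.
# Only the 25 frames/second rate comes from the article; everything else is assumed.
import numpy as np

FRAME_RATE_HZ = 25          # from the article: one tokenizer frame per 40 ms
CODEBOOK_SIZE = 1024        # assumed codebook size, for illustration only
AUDIO_TOKEN_OFFSET = 50000  # assumed: audio codes placed after the text vocabulary

def tokenize_audio(waveform: np.ndarray, sample_rate: int) -> list[int]:
    """Stand-in for a neural codec: emit one discrete code per 40 ms frame."""
    n_frames = int(len(waveform) / sample_rate * FRAME_RATE_HZ)
    # A real tokenizer would run an encoder + vector quantizer here; we fake
    # the codes so the sequence-building logic below is runnable.
    rng = np.random.default_rng(0)
    codes = rng.integers(0, CODEBOOK_SIZE, size=n_frames)
    return [AUDIO_TOKEN_OFFSET + int(c) for c in codes]

def build_sequence(text_tokens: list[int], audio_tokens: list[int]) -> list[int]:
    """Interleave modalities the way a speech-in/speech-out LLM consumes them."""
    BOA, EOA = 49998, 49999  # assumed begin/end-of-audio sentinel ids
    return text_tokens + [BOA] + audio_tokens + [EOA]

two_seconds = np.zeros(2 * 16000)            # 2 s of silence at 16 kHz
audio_ids = tokenize_audio(two_seconds, 16000)
print(len(audio_ids))                        # 50 tokens: 2 s * 25 frames/s
print(build_sequence([101, 102], audio_ids)[:5])
```

Once speech is just another token stream, the "unified processing format" the summary describes falls out naturally: recognition, generation, and mixed speech-plus-music tasks all become next-token prediction over one shared vocabulary.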