Speech Large Models
IEEE International Standard in the Speech Large Model Field, Led by State Grid Shandong Electric Power Research Institute, Approved for Initiation
Zhong Guo Neng Yuan Wang· 2025-09-22 08:07
Recently, the IEEE international standard proposed under the leadership of State Grid Shandong Electric Power Research Institute, "Guide for a Data and Knowledge Processing Framework for Constructing Large Speech-Language Models", was officially approved for initiation. This is a major breakthrough for the institute in international standardization for speech large model data and knowledge processing, and marks a further rise in its voice and influence in the speech large model field.

Speech large models are widely used in smart vehicles, intelligent IoT devices, intelligent customer service, smart education, and other fields. The data used to build them differs from traditional text data in storage form, annotation format, feature structure, and processing methods, which leads to inconsistent data formats, difficulty sharing across organizations, missing data version management, high data security risks, and low processing efficiency, all of which constrain the rapid iteration and performance optimization of speech large models.

With the guidance and support of the Digitalization Department of State Grid Shandong Electric Power Company, the institute drew on ...

Next, the institute will work with domestic and international partners to accelerate drafting of the standard, ensuring that its content both conforms to accepted international rules and fully reflects China's technical strengths in this field, and will strive for early publication to provide a reference for speech large model data and knowledge processing worldwide.
Xiaomi Open-Sources Its First Native End-to-End Speech Large Model; Consumer Electronics ETF (561600) Up Over 1.2%, Pushing for an Eight-Day Winning Streak
Xin Lang Cai Jing· 2025-09-19 02:16
Group 1
- Xiaomi has officially open-sourced its first native end-to-end speech large model, Xiaomi-MiMo-Audio, which is based on an innovative pre-training architecture and over one billion hours of training data, achieving ICL-based few-shot generalization in the speech domain for the first time [1]
- The improvement in Xiaomi's AI capabilities is expected to enhance the user experience of related consumer electronics products [1]
- As of September 19, 2025, the CSI Consumer Electronics Theme Index (931494) surged by 1.53%, with notable gains in component stocks such as Luxshare Precision (002475), up 4.97%, and Industrial Fulian (601138), up 5.95% [1]

Group 2
- As of August 29, 2025, the top ten weighted stocks in the CSI Consumer Electronics Theme Index (931494) include Cambricon (688256), Luxshare Precision (002475), and SMIC (688981), collectively accounting for 54.8% of the index [2]
- The CSI Consumer Electronics Theme Index consists of 50 listed companies involved in component production and complete-device brand design and manufacturing, reflecting the overall performance of consumer-electronics-related securities [1][2]
- The Consumer Electronics ETF (561600) closely tracks the CSI Consumer Electronics Theme Index, which has recently risen for eight consecutive trading days [1]
What Exactly Is a Large Model? Which Technical Areas Does It Span? An In-Depth Primer for Beginners!
自动驾驶之心· 2025-08-05 23:32
Core Insights
- The article provides a comprehensive overview of large language models (LLMs): their definitions, architectures, capabilities, and notable developments in the field [3][6][12]

Group 1: Definition and Characteristics of LLMs
- Large language models (LLMs) are deep learning models trained on vast amounts of text data, capable of understanding and generating natural language [3][6]
- Key features of modern LLMs include large-scale parameters (e.g., GPT-3 with 175 billion parameters), the Transformer architecture, pre-training followed by fine-tuning, and multi-task adaptability [6][12]

Group 2: LLM Development and Architecture
- The Transformer architecture, introduced by Google in 2017, is the foundational technology for LLMs, consisting of an encoder and a decoder [9]
- Encoder-only architectures, like BERT, excel at text-understanding tasks, while decoder-only architectures, such as GPT, are optimized for text generation (see the attention-mask sketch after this summary) [10][11]

Group 3: Core Capabilities of LLMs
- LLMs can generate coherent text, assist in coding, answer factual questions, and perform multi-step reasoning [12][13]
- They also excel at text understanding and conversion tasks, such as summarization and sentiment analysis [13]

Group 4: Notable LLMs and Their Features
- The GPT series by OpenAI is a key player in LLM development, known for strong general capabilities and continuous innovation [15][16]
- Meta's Llama series emphasizes open-source development and multi-modal capabilities, significantly influencing the AI community [17][18]
- Alibaba's Qwen series focuses on comprehensively open-sourced models with strong support for Chinese and multilingual tasks [18]

Group 5: Visual Foundation Models
- Visual foundation models are essential for processing visual inputs, connecting visual data to LLMs [25]
- They use architectures such as Vision Transformers (ViT) and hybrids combining CNNs with Transformers for tasks including image classification and cross-modal understanding [26][27]

Group 6: Speech Large Models
- Speech large models are designed to handle a variety of speech-related tasks, leveraging large-scale speech data for training [31]
- They primarily use Transformer architectures to capture long-range dependencies in speech, enabling tasks such as speech recognition and translation [32][36]

Group 7: Multi-Modal Large Models (MLLMs)
- Multi-modal large models can process and understand multiple types of data, such as text, images, and audio, enabling complex interactions [39]
- Their architecture typically comprises pre-trained modal encoders, a large language model backbone, and a modal decoder for generating outputs (a toy version is sketched below) [40]

Group 8: Reasoning Large Models
- Reasoning large models strengthen the reasoning capabilities of LLMs through optimized prompting and external knowledge integration [43][44]
- They focus on improving the accuracy and controllability of complex tasks without fundamentally altering the model structure [45]
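To make the Group 2 distinction concrete, here is a minimal PyTorch sketch (not code from any model discussed above) of the one mechanical difference between encoder-only and decoder-only Transformers: the attention mask. The dimensions and the use of `nn.MultiheadAttention` are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Illustrative sizes only; real models use far larger dimensions.
d_model, n_heads, seq_len = 64, 4, 8
x = torch.randn(1, seq_len, d_model)  # a batch of token embeddings

attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

# Encoder-style (BERT-like): every token attends to every other token,
# which suits understanding tasks.
enc_out, _ = attn(x, x, x)

# Decoder-style (GPT-like): a causal mask (True = blocked) hides future
# positions, which suits left-to-right generation.
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
dec_out, _ = attn(x, x, x, attn_mask=causal_mask)

print(enc_out.shape, dec_out.shape)  # torch.Size([1, 8, 64]) for both
```

Most of the remaining difference between the two families lies in the training objective (masked-token prediction vs. next-token prediction) rather than the layer structure; the mask is what separates understanding-oriented from generation-oriented architectures.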
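Group 7's description of a typical MLLM (pre-trained modal encoders feeding a language model backbone, with a decoder head producing outputs) can likewise be sketched as a toy pipeline. Every module and dimension below is a stand-in assumption, not the architecture of any specific model from the article.

```python
import torch
import torch.nn as nn

class ToyMultiModalModel(nn.Module):
    def __init__(self, d_image=512, d_model=768, vocab=32000):
        super().__init__()
        self.image_encoder = nn.Linear(d_image, d_model)  # stands in for a ViT
        self.projector = nn.Linear(d_model, d_model)      # aligns modalities
        self.backbone = nn.TransformerEncoder(            # stands in for the LLM
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.text_head = nn.Linear(d_model, vocab)        # modal decoder (text out)

    def forward(self, image_feats, text_embeds):
        # Encode and project the non-text modality into the LLM's token space.
        img_tokens = self.projector(self.image_encoder(image_feats))
        # Concatenate modalities into one token sequence for the backbone.
        fused = torch.cat([img_tokens, text_embeds], dim=1)
        return self.text_head(self.backbone(fused))

model = ToyMultiModalModel()
logits = model(torch.randn(1, 16, 512), torch.randn(1, 8, 768))
print(logits.shape)  # torch.Size([1, 24, 32000])
```

The design point the article makes is visible even at toy scale: the modal encoder and projector adapt new input types to the backbone, so the language model itself does not need to change.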
Li Mu Is Back on Bilibili! Build a Speech Large Model by Hand, with Fully Open-Source Code and an Online Demo
量子位· 2025-07-23 06:36
Core Insights
- The article covers the return of Li Mu and his new audio model, Higgs Audio V2, which integrates text and speech processing capabilities [1][2]

Group 1: Model Capabilities
- Higgs Audio V2 handles a range of speech tasks, including multilingual dialogue generation, automatic prosody adjustment, melody humming with cloned voices, and simultaneous generation of speech and background music [3][4]
- The model integrates 10 million hours of speech data into the training of a large language model (LLM), enabling it to both understand and generate speech [4][6]

Group 2: Technical Implementation
- The model bridges traditional text and speech models, letting the LLM communicate in speech by converting speech tasks into a unified processing format [7][8]
- A unified discretization audio tokenizer was developed that preserves audio quality while capturing semantic and acoustic features at 25 frames per second (see the sketch after this summary) [11][13]
- Training data was sourced from multiple platforms, with roughly 90% of the raw data filtered out for quality to assemble the 10-million-hour corpus [14][15]

Group 3: Model Training and Architecture
- To improve the model's understanding and generation of sound, a secondary audio model, AudioVerse, was trained to analyze user speech input and provide contextual information to the main model [16]
- The final multimodal model can perform complex tasks, such as writing and singing a song with accompanying music, and can analyze scenes and characters from audio input [17][18]

Group 4: Performance Metrics
- In real-time voice chat the model achieves low latency and can understand and express emotion, outperforming other models in the emotion and question categories by 75.7% and 55.7%, respectively [19]
- The model also led traditional TTS benchmark tests, achieving the best performance across multiple evaluations [20]

Group 5: Accessibility and Community Engagement
- The model's code is publicly available on GitHub, together with an online demo platform for experimentation [23][31]
- The article encourages users, especially those interested in content such as virtual streamers, to try the model for voice cloning and other applications [25]
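As a rough illustration of Group 2's unified discretization tokenizer, the sketch below shows what "audio at 25 frames per second" means mechanically: frame the waveform, embed each frame, and snap each embedding to the nearest entry of a learned codebook (vector quantization). The encoder, codebook size, and nearest-neighbor lookup here are assumptions for illustration; Higgs Audio V2's actual tokenizer is considerably more elaborate.

```python
import torch
import torch.nn as nn

sample_rate = 16_000
frame_rate = 25                        # tokens per second, as in the article
hop = sample_rate // frame_rate        # 640 samples per token

wave = torch.randn(1, sample_rate * 2)           # 2 seconds of fake audio
frames = wave.unfold(1, hop, hop)                # (1, 50, 640): 25 frames/s

encoder = nn.Linear(hop, 128)                    # stand-in acoustic encoder
codebook = nn.Embedding(1024, 128)               # 1024 discrete audio tokens

latents = encoder(frames)                        # (1, 50, 128)
# Nearest-codebook-entry lookup is the "discretization" step.
dists = torch.cdist(latents, codebook.weight.unsqueeze(0))  # (1, 50, 1024)
tokens = dists.argmin(dim=-1)                    # (1, 50) integer token ids

print(tokens.shape)  # 50 tokens for 2 seconds = 25 tokens per second
```

Once audio becomes a stream of integer ids, it can share a vocabulary with text tokens, which is presumably how the article's "unified processing format" lets one LLM both understand and generate speech.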