Speech Large Models

Xiaomi Open-Sources Its First Native End-to-End Speech Large Model; Consumer Electronics ETF (561600) Rises Over 1.2%, Eyeing an Eighth Straight Gain
Xin Lang Cai Jing· 2025-09-19 02:16
On September 19, Xiaomi officially open-sourced its first native end-to-end speech model, Xiaomi-MiMo-Audio. Built on an innovative pre-training architecture and over 100 million hours of training data, it is the first model in the speech domain to achieve ICL-based few-shot generalization, and clear "emergent" behavior was observed during pre-training. Improvements to Xiaomi's AI features are expected to enhance the user experience of related consumer electronics products.

As of 09:54 on September 19, 2025, the CSI Consumer Electronics Theme Index (931494) was up a strong 1.53%. Among its constituents, Montage Technology (688008) rose 7.81%, Foxconn Industrial Internet (601138) rose 5.95%, and Luxshare Precision (002475) rose 4.97%, with HGTECH (000988), Avary Holding (002938), and other stocks also gaining. The Consumer Electronics ETF (561600) rose 1.23%, eyeing an eighth straight gain, with a latest price of 1.24 yuan.

The Consumer Electronics ETF closely tracks the CSI Consumer Electronics Theme Index, which selects the securities of 50 listed companies involved in consumer electronics businesses such as component manufacturing and branded device design and production, to reflect the overall performance of listed consumer electronics companies.

As of August 29, 2025, the top ten weighted stocks of the CSI Consumer Electronics Theme Index (931494) were Cambricon (688256), Luxshare Precision (002475), SMIC (688981), Foxconn Industrial Internet (601138), 京 ...
What Exactly Is a Large Model? A Beginner-Friendly Deep Dive into Its Technical Domains!
自动驾驶之心· 2025-08-05 23:32
Core Insights
- The article provides a comprehensive overview of large language models (LLMs): their definitions, architectures, capabilities, and notable developments in the field [3][6][12].

Group 1: Definition and Characteristics of LLMs
- Large Language Models (LLMs) are deep learning models trained on vast amounts of text data, capable of understanding and generating natural language [3][6].
- Key features of modern LLMs include large-scale parameters (e.g., GPT-3 with 175 billion parameters), the Transformer architecture, pre-training followed by fine-tuning, and multi-task adaptability [6][12].

Group 2: LLM Development and Architecture
- The Transformer architecture, introduced by Google in 2017, is the foundational technology behind LLMs and consists of an encoder and a decoder [9].
- Encoder-only architectures, like BERT, excel at text-understanding tasks, while decoder-only architectures, such as GPT, are optimized for text generation [10][11]; a minimal sketch contrasting the two appears after this overview.

Group 3: Core Capabilities of LLMs
- LLMs can generate coherent text, assist with coding, answer factual questions, and perform multi-step reasoning [12][13].
- They also excel at text understanding and conversion tasks, such as summarization and sentiment analysis [13].

Group 4: Notable LLMs and Their Features
- OpenAI's GPT series is a key driver of LLM development, known for strong general capabilities and continuous innovation [15][16].
- Meta's Llama series emphasizes open-source development and multi-modal capabilities, and has had a significant impact on the AI community [17][18].
- Alibaba's Qwen series focuses on comprehensively open-sourced models with strong support for Chinese and multilingual tasks [18].

Group 5: Visual Foundation Models
- Visual Foundation Models are essential for processing visual inputs, bridging visual data and LLMs [25].
- They use architectures such as Vision Transformers (ViT) and hybrids of CNNs and Transformers for tasks including image classification and cross-modal understanding [26][27]; a toy patch-embedding sketch is also given below.

Group 6: Speech Large Models
- Speech large models are designed to handle a variety of speech-related tasks and are trained on large-scale speech data [31].
- They primarily use Transformer architectures to capture long-range dependencies in speech, enabling tasks such as speech recognition and translation [32][36].

Group 7: Multi-Modal Large Models (MLLMs)
- Multi-modal large models can process and understand multiple data types, such as text, images, and audio, enabling complex interactions [39].
- Their architecture typically comprises pre-trained modal encoders, a large language model, and a modal decoder for generating outputs [40].

Group 8: Reasoning Large Models
- Reasoning large models enhance the reasoning capabilities of LLMs through optimized prompting and the integration of external knowledge [43][44].
- They focus on improving the accuracy and controllability of complex tasks without fundamentally altering the model structure [45].
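To make the Group 2 distinction concrete, here is a minimal sketch assuming the Hugging Face transformers library and the public bert-base-uncased and gpt2 checkpoints (illustrative stand-ins, not models named by the article): the encoder-only model fills in a masked word, while the decoder-only model continues a prompt left to right.

```python
from transformers import pipeline

# Encoder-only (BERT-style): bidirectional context, suited to understanding
# tasks such as fill-in-the-blank, classification, and extraction.
fill = pipeline("fill-mask", model="bert-base-uncased")
print(fill("Large language models can [MASK] natural language.")[0]["token_str"])

# Decoder-only (GPT-style): left-to-right next-token prediction, suited to generation.
gen = pipeline("text-generation", model="gpt2")
print(gen("Large language models can", max_new_tokens=20)[0]["generated_text"])
```

The same split explains the usual deployment pattern: BERT-style encoders back search and classification systems, while GPT-style decoders back chat and writing assistants.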
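For Group 5, the toy sketch below illustrates the core ViT idea of turning an image into a token sequence a Transformer can consume. The 224-pixel image, 16-pixel patches, and 768-dimensional embeddings follow common ViT-Base conventions; the random projection is a stand-in for learned weights.

```python
import numpy as np

IMG, PATCH, DIM = 224, 16, 768          # common ViT-Base sizes (illustrative)
N_PATCHES = (IMG // PATCH) ** 2         # 14 x 14 = 196 patches

rng = np.random.default_rng(0)
image = rng.random((IMG, IMG, 3))       # fake RGB image in [0, 1]
W = rng.normal(size=(PATCH * PATCH * 3, DIM))  # stand-in for a learned projection

# Split into non-overlapping 16x16 patches, one flattened vector per patch.
patches = image.reshape(IMG // PATCH, PATCH, IMG // PATCH, PATCH, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(N_PATCHES, -1)

tokens = patches @ W                    # (196, 768) "visual tokens"
print(tokens.shape)
```

Once an image is a sequence of such tokens, it can be fed into the same attention stack as text, which is what enables the cross-modal understanding the article describes.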
Li Mu Posts a New Bilibili Video! Teaching You to Hand-Build a Speech Large Model, with Fully Open-Sourced Code and an Online Demo
量子位· 2025-07-23 06:36
Core Insights
- The article covers Li Mu's return to Bilibili and his new audio model, Higgs Audio V2, which integrates text and speech processing capabilities in a single model [1][2].

Group 1: Model Capabilities
- Higgs Audio V2 handles a variety of speech tasks, including generating multilingual dialogues, automatic prosody adjustment, humming melodies in cloned voices, and simultaneous generation of speech and background music [3][4].
- The model folds 10 million hours of speech data into the training of a large language model (LLM), enabling it to both understand and generate speech [4][6].

Group 2: Technical Implementation
- The model merges the traditionally separate text and speech pipelines, letting the LLM communicate in speech by casting speech tasks into a unified processing format [7][8].
- A unified discrete audio tokenizer was developed that preserves audio quality while capturing both semantic and acoustic features at 25 frames per second [11][13]; a toy sketch of this idea follows the overview.
- Training data was sourced from a variety of platforms, with roughly 90% of the raw data filtered out for quality to arrive at the 10 million hours actually used [14][15].

Group 3: Model Training and Architecture
- To strengthen the model's understanding and generation of sound, a secondary audio model, AudioVerse, was trained to analyze user speech input and supply contextual information to the main model [16].
- The final multimodal model can perform complex tasks, such as writing and singing a song with accompanying music, and can analyze scenes and speakers from audio input [17][18].

Group 4: Performance Metrics
- In real-time voice chat, the model achieves low latency and can understand and express emotion, beating competing models with win rates of 75.7% and 55.7% in the emotion and question categories, respectively [19].
- The model also excelled in traditional TTS benchmarks, achieving the best results across several evaluations [20].

Group 5: Accessibility and Community Engagement
- The model's code is publicly available on GitHub, together with an online demo platform where users can experiment [23][31].
- The article encourages users, especially those interested in creating content such as virtual streamers, to try the model for voice cloning and related applications [25].
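The article describes the tokenizer only at a high level: discrete tokens at 25 frames per second that capture both semantic and acoustic features. As a rough illustration of the general idea (not the actual Higgs Audio V2 design), here is a toy vector-quantization tokenizer; the frame projection and codebook are random placeholders for what would be learned components, and all sizes are hypothetical.

```python
import numpy as np

SAMPLE_RATE = 16_000                        # assumed input sample rate
FRAME_RATE = 25                             # 25 tokens/second, per the article
FRAME_LEN = SAMPLE_RATE // FRAME_RATE       # 640 samples per frame
CODEBOOK_SIZE = 1024                        # hypothetical codebook size
FEAT_DIM = 64                               # hypothetical feature dimension

rng = np.random.default_rng(0)
codebook = rng.normal(size=(CODEBOOK_SIZE, FEAT_DIM))   # stand-in for a learned codebook
projection = rng.normal(size=(FRAME_LEN, FEAT_DIM))     # stand-in for a learned encoder

def tokenize(waveform: np.ndarray) -> np.ndarray:
    """Map a mono waveform to a sequence of discrete token ids at 25 Hz."""
    n_frames = len(waveform) // FRAME_LEN
    frames = waveform[: n_frames * FRAME_LEN].reshape(n_frames, FRAME_LEN)
    feats = frames @ projection                          # (n_frames, FEAT_DIM)
    # Nearest-neighbor lookup against the codebook (vector quantization).
    dists = ((feats[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)                          # one token id per frame

one_second = rng.normal(size=SAMPLE_RATE)                # 1 s of fake audio
print(tokenize(one_second).shape)                        # -> (25,)
```

Once audio is reduced to a sequence of integer token ids, speech can share the same next-token-prediction interface as text, which is what allows a single LLM to both understand and generate it.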