Alibaba (BABA; 09988): Tongyi Qianwen Releases Qwen3-Omni, a Native Omni-Modal Large Model

Core Insights
- Alibaba's Tongyi Qianwen has officially launched Qwen3-Omni, a native omni-modal large model that processes text, image, audio, and video input and generates both text and natural speech output in real time [1]

Group 1: Model Features
- Qwen3-Omni is designed as a fully omni-modal model whose capability does not degrade in any individual modality [1]
- The architecture follows the Thinker-Talker framework: Thinker is responsible for text generation, and Talker generates streaming speech tokens [1]
- To achieve ultra-low latency in streaming generation, Talker autoregressively predicts a multi-codebook sequence and outputs the residual codebooks for the current frame at each step [1]

Group 2: Technical Implementation
- The Code2Wav module then synthesizes the waveform for each frame as it is produced, enabling frame-by-frame streaming generation [1]
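The Thinker, Talker, and Code2Wav data flow described above can be illustrated with a minimal toy sketch. This is not Qwen3-Omni code: the function names, the constants (NUM_CODEBOOKS, CODEBOOK_SIZE, SAMPLES_PER_FRAME), the one-frame-per-text-token pacing, and the arithmetic standing in for the neural networks are all assumptions made for illustration. It only shows how chaining generators lets each audio chunk be emitted as soon as its frame of codes exists, rather than after the whole utterance.

```python
import math
from typing import Iterator, List

# All constants are hypothetical, chosen only to keep the toy small.
NUM_CODEBOOKS = 4       # codes per frame: 1 base code + 3 residual codes
CODEBOOK_SIZE = 16      # number of entries in each codebook
SAMPLES_PER_FRAME = 8   # waveform samples rendered per frame


def thinker(prompt: str) -> Iterator[str]:
    """Toy Thinker: streams text tokens (here, simply the words of the prompt)."""
    for word in prompt.split():
        yield word


def talker(text_tokens: Iterator[str]) -> Iterator[List[int]]:
    """Toy Talker: yields one frame of codebook indices per text token.

    The base code depends on the previous frame's base code, which stands in
    for autoregressive prediction; the remaining codes stand in for the
    residual codebooks of the same frame.
    """
    prev = 0
    for tok in text_tokens:
        base = (prev + sum(tok.encode("utf-8"))) % CODEBOOK_SIZE
        residuals = [
            (base * (k + 2) + k) % CODEBOOK_SIZE
            for k in range(NUM_CODEBOOKS - 1)
        ]
        prev = base
        yield [base] + residuals


def code2wav(frame: List[int]) -> List[float]:
    """Toy Code2Wav: renders one frame of codes into a short waveform chunk."""
    freq = 1 + sum(frame)
    step = 2 * math.pi * freq / (SAMPLES_PER_FRAME * CODEBOOK_SIZE)
    return [math.sin(step * n) for n in range(SAMPLES_PER_FRAME)]


def stream_speech(prompt: str) -> Iterator[List[float]]:
    """Chains the three stages lazily, so audio is produced frame by frame."""
    for frame in talker(thinker(prompt)):
        yield code2wav(frame)


if __name__ == "__main__":
    for i, chunk in enumerate(stream_speech("hello omni world")):
        print(f"frame {i}: {len(chunk)} samples")
```

Because every stage is a generator, calling next() on stream_speech returns the first audio chunk after only one frame of codes has been computed, which is the property the frame-by-frame design is meant to provide.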