Core Insights
- Alibaba's Tongyi Qianwen team has officially launched Qwen3-Omni, a natively multimodal large model that seamlessly processes text, image, audio, and video inputs while generating text and natural speech output in real time [1]

Model Architecture
- Qwen3-Omni uses a Thinker-Talker architecture: the Thinker handles text generation, while the Talker focuses on streaming speech-token generation, receiving high-level semantic representations directly from the Thinker [1]
- To achieve ultra-low-latency streaming, the Talker predicts multi-codebook sequences autoregressively, with an MTP (multi-token prediction) module outputting the residual codebooks for the current frame at each decoding step [1]
- The Code2Wav component then synthesizes the corresponding waveform, enabling frame-by-frame streaming audio generation [1]
Alibaba's Tongyi Qianwen releases Qwen3-Omni, a natively multimodal large model
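The streaming pipeline described above can be sketched as a simple loop: at each frame the Talker autoregressively predicts a first-codebook token, an MTP-style module fills in the residual codebooks for that frame, and a Code2Wav stage immediately converts the frame's codes into a waveform chunk. This is a minimal toy sketch of the control flow only; the function names, codebook counts, and frame sizes below are illustrative assumptions, not Qwen3-Omni's actual implementation.

```python
NUM_CODEBOOKS = 4       # assumption: codebooks per audio frame
CODEBOOK_SIZE = 16      # assumption: toy token vocabulary size
SAMPLES_PER_FRAME = 80  # assumption: waveform samples per frame

def talker_step(history):
    """Stand-in for the Talker: autoregressively predict the
    first-codebook token for the next frame from prior tokens."""
    return (sum(history) + len(history)) % CODEBOOK_SIZE

def mtp_step(first_token):
    """Stand-in for the MTP module: emit the residual codebook
    tokens for the current frame, conditioned on the first token."""
    return [(first_token + k) % CODEBOOK_SIZE for k in range(1, NUM_CODEBOOKS)]

def code2wav(frame_codes):
    """Stand-in for Code2Wav: synthesize this frame's waveform chunk."""
    seed = sum(frame_codes)
    return [((seed * (i + 1)) % 7) / 7.0 for i in range(SAMPLES_PER_FRAME)]

def stream_speech(num_frames):
    """Frame-by-frame streaming: audio is emitted per decoding step
    instead of waiting for the whole utterance to finish."""
    history, waveform = [], []
    for _ in range(num_frames):
        first = talker_step(history)       # autoregressive first codebook
        residuals = mtp_step(first)        # residual codebooks for this frame
        history.append(first)
        waveform.extend(code2wav([first] + residuals))  # immediate synthesis
    return waveform

audio = stream_speech(5)
```

The key latency property the loop illustrates: after each decoding step a complete frame of audio is already available, so playback can begin after the first frame rather than after the full sequence.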