MICROSOFT-Large Models Understand Speech but Get Dumber? CUHK-Shenzhen and Microsoft Jointly Tackle the Intelligence Drop in Speech LLMs

Core Insights
- The article discusses a challenge faced by Speech LLMs known as the "Modality Reasoning Gap": a model's reasoning ability declines when its input switches from text to speech [3][8].
- TARS (Trajectory Alignment for Reasoning in Speech) is introduced as a new framework that uses reinforcement learning to align reasoning processes dynamically, overcoming the limitations of traditional methods [7][9].

Group 1: Challenges in Speech LLMs
- Speech LLMs show a significant drop in logical reasoning capability when processing audio inputs compared with text inputs [3][8].
- Previous attempts to bridge the reasoning gap have been inadequate. They focus either on input alignment or on output memorization, and neither addresses the deeper representation drift [8][9].

Group 2: TARS Framework Innovations
- TARS employs on-policy reinforcement learning to dynamically align the reasoning trajectories of speech and text, rather than forcing a static alignment [9][17].
- Key innovations of TARS include:
  - Representation Alignment, which directly addresses the internal representation drift [11].
  - Behavior Alignment, which introduces flexible alignment standards at the output stage [12].
  - Asymmetric rewards and modality-specific normalization, which optimize the training process for speech models [13][14].

Group 3: Experimental Results
- TARS achieved a 100% restoration of reasoning capability in speech models, with significant performance improvements on high-difficulty benchmarks [15][16].
- The performance metrics showed that TARS improved not only speech reasoning but also text reasoning accuracy, indicating a holistic improvement in model intelligence [16][17].

Group 4: Future Implications
- The article describes TARS as a paradigm shift in speech model research, and presents its results as evidence that on-policy reinforcement learning is superior to traditional off-policy methods for addressing modality alignment issues [17].
- TARS provides a viable pathway for researchers aiming to develop high-intelligence omni models capable of effective speech interaction [17].
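
To make the "asymmetric rewards and modality-specific normalization" item in Group 2 more concrete, the sketch below shows one plausible reading of the idea. It is not taken from the TARS paper: the function names, the `weight` parameter, and the choice to give the alignment bonus only to speech rollouts are all assumptions made for illustration. The underlying intuition is that speech rollouts usually score lower than text rollouts, so standardizing rewards within each modality keeps the text statistics from swamping the speech signal.

```python
from statistics import mean, pstdev


def asymmetric_reward(task_reward, alignment_score, modality, weight=0.5):
    """Hypothetical asymmetric reward (an assumption, not the paper's formula).

    Speech rollouts get a bonus for matching the text trajectory;
    text rollouts act as the reference and get the task reward only.
    """
    if modality == "speech":
        return task_reward + weight * alignment_score
    return task_reward


def normalize_by_modality(rewards, modalities, eps=1e-6):
    """Standardize each reward against the mean and std of its own modality."""
    groups = {}
    for reward, modality in zip(rewards, modalities):
        groups.setdefault(modality, []).append(reward)

    # Population statistics per modality (text and speech handled separately).
    stats = {m: (mean(v), pstdev(v)) for m, v in groups.items()}

    return [
        (reward - stats[modality][0]) / (stats[modality][1] + eps)
        for reward, modality in zip(rewards, modalities)
    ]


# Toy batch: text rollouts score higher in absolute terms than speech rollouts.
rewards = [1.0, 0.0, 0.2, 0.0]
modalities = ["text", "text", "speech", "speech"]
advantages = normalize_by_modality(rewards, modalities)
```

In this toy batch the best speech rollout (raw reward 0.2) receives roughly the same normalized advantage as the best text rollout (raw reward 1.0), whereas a single pooled normalization would rank it well below. Whether TARS normalizes in exactly this way is not stated in the summary above.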