Large Models Understand Speech Yet Get Dumber? CUHK-Shenzhen and Microsoft Jointly Tackle the Intelligence Drop in Speech LLMs
机器之心 (Jiqizhixin) · 2026-01-17 03:24

Core Insights
- The article examines why Speech Large Language Models (LLMs) lose logical reasoning ability when input shifts from text to speech, a phenomenon termed the "Modality Reasoning Gap" [2][3][10]
- Major labs including OpenAI, Google, and Meta are grappling with this issue: GPT-4o's accuracy, for example, drops from 92% on text-to-text tasks to 66% on speech-to-speech tasks [3]
- The article introduces TARS (Trajectory Alignment for Reasoning in Speech), a framework from The Chinese University of Hong Kong, Shenzhen and Microsoft that uses reinforcement learning to align the reasoning process under speech input with that under text input, restoring and even surpassing text-level reasoning [7][30]

Group 1: Challenges in Speech LLMs
- Introducing speech input causes a drastic decline in reasoning ability, with a 26-percentage-point drop in accuracy when switching from text to speech [3][10]
- Existing remedies, such as input alignment and output memorization, have proven inadequate because of the inherent differences between speech and text [11][12]
- The article highlights the "Multimodal Tax": including audio data erodes the model's pure reasoning capability [3]

Group 2: TARS Framework Innovations
- TARS uses on-policy reinforcement learning to dynamically align the reasoning trajectories of the speech and text branches, rather than relying on static memorization [12][30]
- Key innovations in TARS include:
  - Representation Alignment: the cosine similarity of hidden states between speech and text inputs is computed at each layer, providing a reward for staying aligned [15][16]
  - Behavior Alignment: instead of requiring exact token matches, TARS scores semantic consistency with external embedding models, allowing more flexible outputs [17][21]
  - Asymmetric Reward and Modality Normalization: the reward scheme incentivizes the speech branch to catch up with the text branch, and rewards are normalized per modality to ensure continuous improvement [22][23]

Group 3: Experimental Results and Impact
- TARS fully restores the reasoning capability of speech models, achieving significant performance improvements on challenging benchmarks [24][28]
- Speech-branch reasoning can not only match but exceed the text branch, with a modality reasoning recovery (MRR) of 100.45% reported in experiments [33]
- TARS outperforms existing state-of-the-art methods, establishing itself as a leading solution in the field of speech LLMs [33]
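The representation-alignment reward described above can be sketched as follows. This is a minimal illustration assuming mean-pooling over sequence positions before comparing directions; the paper's exact pooling and layer weighting may differ.

```python
import numpy as np

def representation_alignment_reward(speech_hidden, text_hidden):
    """Average per-layer cosine similarity between speech- and text-branch
    hidden states, used as an alignment reward.

    speech_hidden, text_hidden: lists of (seq_len, dim) arrays, one per layer.
    """
    layer_sims = []
    for s, t in zip(speech_hidden, text_hidden):
        # Mean-pool each layer over sequence positions, then compare directions.
        s_vec, t_vec = s.mean(axis=0), t.mean(axis=0)
        cos = np.dot(s_vec, t_vec) / (
            np.linalg.norm(s_vec) * np.linalg.norm(t_vec) + 1e-8
        )
        layer_sims.append(cos)
    # Reward is maximal (≈1.0) when every layer's speech representation
    # points in the same direction as its text counterpart.
    return float(np.mean(layer_sims))
```

Identical hidden states yield a reward of roughly 1.0; diverging internal representations pull the reward toward 0, which is what pushes the speech branch to "think" along the text branch's trajectory.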
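The behavior-alignment idea, scoring semantic agreement between branch outputs rather than exact token matches, can be sketched as below. TARS uses an external embedding model; the `bag_of_words_embed` helper here is a toy stand-in introduced only to keep the sketch self-contained and runnable.

```python
import numpy as np

def bag_of_words_embed(text, vocab):
    """Toy stand-in for an external sentence-embedding model."""
    vec = np.zeros(len(vocab))
    for tok in text.lower().split():
        if tok in vocab:
            vec[vocab[tok]] += 1.0
    return vec

def behavior_alignment_reward(speech_answer, text_answer, embed):
    """Reward semantic consistency between the two branches' outputs.

    Paraphrases score high even when the token sequences differ, which is
    the point of behavior alignment over exact-match supervision.
    """
    a, b = embed(speech_answer), embed(text_answer)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom > 0 else 0.0
```

With a real embedding model, "the answer is four" and "it equals four" would score near each other despite sharing few tokens, so the speech branch is free to phrase its reasoning differently as long as the meaning matches.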
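The asymmetric-reward and modality-normalization components can be sketched as follows. The exact formulation in the paper is not given here, so this is an assumed form: the text branch acts as a fixed anchor (asymmetry), and rewards are standardized within each modality's rollout group so relative progress keeps producing a learning signal as the absolute gap shrinks.

```python
import numpy as np

def asymmetric_gap_reward(speech_score, text_score):
    """Penalize the speech branch only while it lags the text branch.

    The asymmetry: the text branch is never pulled down toward speech;
    once the speech branch matches or exceeds it, the penalty vanishes.
    """
    gap = text_score - speech_score
    return -max(gap, 0.0)

def modality_normalized(rewards):
    """Standardize rewards within one modality's rollout group, so the
    gradient signal stays informative even as raw rewards converge."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)
```

For instance, with the article's GPT-4o numbers (speech 0.66 vs. text 0.92), the speech branch receives a penalty proportional to the 0.26 gap, which decays to zero as it catches up; per-modality normalization then keeps ranking rollouts against each other rather than against a shrinking absolute target.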
