Large models understand speech but get dumber? CUHK-Shenzhen and Microsoft jointly tackle the intelligence drop in Speech LLMs
Xin Lang Cai Jing· 2026-01-19 05:48
Core Insights
- The article discusses the challenges faced by Speech LLMs, particularly the "Modality Reasoning Gap," where the reasoning ability of models declines when switching from text to speech input [3][8].
- TARS (Trajectory Alignment for Reasoning in Speech) is introduced as a new framework that utilizes reinforcement learning to align reasoning processes dynamically, overcoming the limitations of traditional methods [7][9].

Group 1: Challenges in Speech LLMs
- Speech LLMs experience a significant drop in logical reasoning capabilities when processing audio inputs compared to text inputs [3][8].
- Previous attempts to bridge the reasoning gap have been inadequate, focusing either on input alignment or output memorization, which do not address the deeper representation drift [8][9].

Group 2: TARS Framework Innovations
- TARS employs on-policy reinforcement learning to dynamically align the reasoning trajectories of speech and text, rather than forcing a static alignment [9][17].
- Key innovations of TARS include:
  - Representation Alignment, which directly addresses the internal representation drift [11] (a minimal sketch follows this summary).
  - Behavior Alignment, introducing flexible alignment standards at the output stage [12].
  - Asymmetric rewards and modality-specific normalization to optimize the training process for speech models [13][14].

Group 3: Experimental Results
- TARS demonstrated a 100% restoration of reasoning capabilities in speech models, achieving significant performance improvements on high-difficulty benchmarks [15][16].
- The model's performance metrics showed that TARS not only improved speech reasoning but also enhanced text reasoning accuracy, indicating a holistic improvement in model intelligence [16][17].

Group 4: Future Implications
- The introduction of TARS marks a paradigm shift in speech model research, demonstrating that on-policy reinforcement learning outperforms traditional off-policy methods for addressing modality alignment issues [17].
- TARS provides a viable pathway for researchers aiming to develop high-intelligence omni models capable of effective speech interaction [17].
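For concreteness, here is a minimal sketch of the representation-alignment reward, under the reading (spelled out in the second summary below) that the hidden states under speech and text input are compared layer by layer via cosine similarity. The mean-pooling over time and the Hugging Face transformers-style hidden-state tuples are assumptions for illustration, not details confirmed by the article.

```python
# Sketch: per-layer cosine similarity between the hidden states of the
# speech-input and text-input forward passes of the same model.
# Assumes hidden states come as tuples of (batch, seq_len, dim) tensors,
# e.g. from a transformers model called with output_hidden_states=True.
import torch
import torch.nn.functional as F

def representation_alignment_reward(speech_hidden, text_hidden):
    """Average per-layer cosine similarity between speech and text branches.

    Sequence lengths differ across modalities, so each layer is mean-pooled
    over time before comparison.
    """
    sims = []
    for h_s, h_t in zip(speech_hidden, text_hidden):
        pooled_s = h_s.mean(dim=1)  # (batch, dim)
        pooled_t = h_t.mean(dim=1)  # (batch, dim)
        sims.append(F.cosine_similarity(pooled_s, pooled_t, dim=-1))
    # Higher reward when the speech branch's internal trajectory stays
    # close to the text branch's across all layers.
    return torch.stack(sims, dim=0).mean(dim=0)  # (batch,)
```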
Large models understand speech yet somehow get dumber? CUHK-Shenzhen and Microsoft jointly tackle the intelligence drop in Speech LLMs
Ji Qi Zhi Xin· 2026-01-17 03:24
Core Insights
- The article discusses the challenges faced by Speech Large Language Models (LLMs) in maintaining logical reasoning capabilities when transitioning from text to speech input, a phenomenon termed the "Modality Reasoning Gap" [2][3][10]
- Major tech companies like OpenAI, Google, and Meta are grappling with this issue, as evidenced by a drop in accuracy from 92% on text-to-text tasks to 66% on speech-to-speech tasks for models like GPT-4o [3]
- The article introduces TARS (Trajectory Alignment for Reasoning in Speech), a new framework developed by The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen) and Microsoft, which utilizes reinforcement learning to align the reasoning process under speech input with that under text input, effectively restoring and even surpassing text-level reasoning capabilities [7][30]

Group 1: Challenges in Speech LLMs
- The introduction of speech input leads to a drastic decline in reasoning ability, with a noted 26-percentage-point drop in accuracy when switching from text to speech [3][10]
- Existing methods to bridge this gap, such as input alignment and output memorization, have proven inadequate due to the inherent differences between speech and text [11][12]
- The article highlights the concept of a "Multimodal Tax," whereby the inclusion of audio data detracts from the model's pure reasoning capabilities [3]

Group 2: TARS Framework Innovations
- TARS employs a novel approach using on-policy reinforcement learning to dynamically align the reasoning trajectories of speech and text, rather than relying on static memorization [12][30]
- Key innovations in TARS include:
  - **Representation Alignment**: computing the cosine similarity of hidden states between speech and text inputs at each layer, and rewarding the model for keeping them aligned [15][16]
  - **Behavior Alignment**: instead of requiring exact token matches, TARS assesses semantic consistency using external embedding models, allowing for more flexible output (see the first sketch below) [17][21]
  - **Asymmetric Reward and Modality Normalization**: a reward scheme that incentivizes the speech branch to catch up with the text branch while normalizing rewards per modality to ensure continuous improvement (see the second sketch below) [22][23]

Group 3: Experimental Results and Impact
- TARS has demonstrated a 100% restoration of reasoning capabilities in speech models, achieving significant performance improvements on challenging benchmarks [24][28]
- The framework has shown that the reasoning ability of speech models can not only match but exceed that of text models, with a reasoning recovery rate of 100.45% in experiments, i.e., speech-input performance slightly exceeding the text-input baseline [33]
- TARS has outperformed existing state-of-the-art methods, establishing itself as a leading solution in the field of speech LLMs [33]
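To make the behavior-alignment idea concrete, here is a minimal sketch that scores the semantic consistency of a speech-branch answer against the text-branch answer with an off-the-shelf embedding model, rather than demanding exact token matches. The specific model (all-MiniLM-L6-v2 via sentence-transformers) is an illustrative stand-in; the article says only that external embedding models are used.

```python
# Sketch: semantic-consistency reward via an external embedding model.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed stand-in model

def behavior_alignment_reward(speech_answer: str, text_answer: str) -> float:
    """Cosine similarity of answer embeddings: rewards semantically
    equivalent outputs even when the surface tokens differ."""
    emb = embedder.encode([speech_answer, text_answer], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

# Example: paraphrases score high despite different wording, e.g.
# behavior_alignment_reward("The answer is 42.", "It equals 42.")
```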
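Likewise, a minimal sketch of the asymmetric reward with modality-specific normalization, in the spirit of group-baseline (GRPO-style) advantage whitening. The per-group whitening and the catch-up weight of 1.5 are assumptions; the article states only that the speech branch is incentivized to catch up with the text branch and that rewards are normalized per modality so both keep improving.

```python
# Sketch: modality-specific reward normalization plus an asymmetric
# weighting that pushes the speech branch toward the text branch.
import torch

def normalize_per_modality(rewards: torch.Tensor) -> torch.Tensor:
    """Whiten rewards within one modality's group of rollouts, so each
    modality keeps a usable learning signal regardless of its absolute
    reward level."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

def asymmetric_advantages(speech_rewards: torch.Tensor,
                          text_rewards: torch.Tensor,
                          catchup_weight: float = 1.5):
    """Compute per-modality advantages; the speech branch is up-weighted
    (the asymmetry) so it is pushed harder toward the stronger text branch
    instead of the text branch being dragged down."""
    speech_adv = catchup_weight * normalize_per_modality(speech_rewards)
    text_adv = normalize_per_modality(text_rewards)
    return speech_adv, text_adv
```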