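SALMONN Family of Audio-Visual Understanding Models Returns to the Top of the Leaderboards, with Advances in Reasoning Enhancement, High Frame Rate, and Text-Leakage-Free Evaluation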

Core Insights
- The SALMONN family has added several new models, including video-SALMONN 2/2+, video-SALMONN-o1, and F-16, strengthening its lead among open-source audio-visual understanding models [1][5][18].

Model Developments
- video-SALMONN 2+ targets high-quality, complete video descriptions and achieves state-of-the-art results in caption completeness and accuracy [3].
- The new models use advanced training techniques, such as MrDPO, a multi-round reinforcement learning method, to improve performance and reduce information loss [3][7].
- F-16 is designed for high-frame-rate video understanding, addressing the limitation that existing models sample video at low frame rates [18].

Performance Metrics
- video-SALMONN 2+ outperforms major closed-source models such as GPT-4o and Google Gemini 1.5 Pro on a range of benchmarks, indicating strong audio-visual integration [5][20].
- On the Video-MME benchmark, video-SALMONN 2+ reaches 79.7% accuracy, and it also performs strongly on multiple audio-visual understanding tasks [6][20].

Benchmark Innovations
- The AVUT benchmark was introduced to mitigate text leakage in audio-visual understanding tasks, so that scores more accurately reflect what a model actually understands from audio and video [21][23].
- The benchmark emphasizes the need for joint audio-visual understanding and exposes the limitations of models that rely on text alone [21][22].

Research and Development
- The research team has built a closed loop from model development to evaluation, improving the effectiveness and efficiency of work on the SALMONN series [23].
- The team is based at Tsinghua University's multimedia signal and intelligent information processing lab and focuses on multimodal large language models and brain health research [23].