AAAI 2026 | Can Video Large Language Models Really Be Trusted? A Comprehensive Evaluation of 23 Mainstream Models

Core Insights
- The article presents Trust-videoLLMs, a comprehensive evaluation benchmark for video large language models that addresses challenges in truthfulness, safety, fairness, robustness, and privacy [3][6][13].

Evaluation Framework
- Trust-videoLLMs is a systematic, multi-layered, and scalable evaluation system built around five core dimensions [6][9]:
  - Truthfulness: video description, temporal understanding, event reasoning, and hallucination suppression
  - Robustness: noise interference, temporal disturbance, adversarial attacks, and modality conflict
  - Safety: harmful content identification, harmful instruction rejection, deepfake detection, and jailbreak attack defense
  - Fairness: stereotype identification, occupational bias, and time sensitivity analysis
  - Privacy: privacy content recognition, celebrity privacy protection, and self-inference of privacy

Evaluation Tasks
- The evaluation tasks cover three main aspects; examples include contextual reasoning, temporal reasoning, video description, event understanding, and hallucination in videos [8][11].

Model Assessment
- The evaluation covers 23 mainstream video large language models, 5 commercial and 18 open-source, spanning a range of parameter scales and architectural designs [10][12].

Key Findings
- Model size does not guarantee stronger performance: larger models do not necessarily outperform smaller ones [16].
- Closed-source models such as Claude and Gemini1.5 demonstrate better safety, privacy protection, and multi-modal alignment than open-source models [17].
- Video context significantly affects safety: pairing a harmful text prompt with a relevant video makes the model more likely to generate harmful content [18].
- Fairness issues are prevalent: models show biases related to gender, age, and skin color. Closed-source models perform better, which the article attributes to data cleaning and ethical constraints [19].
- Privacy protection is a double-edged sword: stronger models are better at identifying private content but also risk inferring private information [20].

Open-source Tools and Data
- To promote the development of trustworthy video large language models, the team has open-sourced a large-scale dataset of 6,955 videos covering multiple scenes and tasks, along with a unified evaluation toolbox [24].
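
The five-dimension taxonomy listed under Evaluation Framework can be written down as a small data structure. The following is a minimal sketch, not the released toolbox: the dimension and sub-task names come from the summary above, while `TAXONOMY`, `dimension_scores`, and the choice to average sub-task scores within a dimension are assumptions made purely for illustration. The article does not say how Trust-videoLLMs aggregates scores.

```python
from statistics import mean

# Dimension -> sub-tasks, copied from the summary above.
TAXONOMY = {
    "truthfulness": ["video description", "temporal understanding",
                     "event reasoning", "hallucination suppression"],
    "robustness": ["noise interference", "temporal disturbance",
                   "adversarial attacks", "modality conflict"],
    "safety": ["harmful content identification", "harmful instruction rejection",
               "deepfake detection", "jailbreak attack defense"],
    "fairness": ["stereotype identification", "occupational bias",
                 "time sensitivity analysis"],
    "privacy": ["privacy content recognition", "celebrity privacy protection",
                "self-inference of privacy"],
}


def dimension_scores(task_scores):
    """Average per-task scores within each dimension.

    Hypothetical aggregation (plain mean), assumed for illustration only.
    Tasks without a score are skipped; a dimension with no scored
    tasks maps to None.
    """
    result = {}
    for dimension, tasks in TAXONOMY.items():
        scored = [task_scores[t] for t in tasks if t in task_scores]
        result[dimension] = mean(scored) if scored else None
    return result


if __name__ == "__main__":
    # Placeholder numbers, not results reported in the article.
    example = {"video description": 0.5, "temporal understanding": 1.0}
    print(dimension_scores(example))
```

Keeping the taxonomy as plain data makes it straightforward to add a sub-task or swap the aggregation rule without touching the rest of an evaluation script.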