Agent Research
Tackling Long-Document and Multi-Modal Challenges: Paper2Video Automates the Production of Academic Presentation Videos
机器之心· 2025-10-23 02:22
Core Insights
- The article examines the challenges of automating the generation of academic presentation videos and argues that a systematic benchmark and framework are needed to improve both efficiency and quality in this domain [4][43].

Group 1: Background and Challenges
- Academic presentation videos are crucial for research communication but remain labor-intensive to produce, often requiring hours of work for a few minutes of footage, which motivates automation [4].
- Existing natural-video generation models are inadequate for academic presentations, which pose unique challenges: complex inputs drawn from long documents and the need for synchronized multi-modal outputs [4][5].

Group 2: Paper2Video Benchmark
- The Paper2Video benchmark was built from 101 academic papers paired with their corresponding presentation videos, and defines four evaluation metrics: Meta Similarity, PresentArena, PresentQuiz, and IP Memory [7][10].
- The benchmark provides a reliable basis for evaluating generation and assessment with multi-modal, long-document inputs and outputs, laying the groundwork for automated academic video generation [10][11].

Group 3: Evaluation Metrics
- The four metrics assess the quality of academic presentation videos from three core perspectives: human-like preference, information transmission, and academic impact [13][16].
- Meta Similarity measures how closely generated content matches the human-designed version, while PresentArena evaluates visual quality against human preferences [16][31].

Group 4: PaperTalker Framework
- PaperTalker is introduced as the first multi-agent framework for generating academic presentation videos, handling multi-modal tasks with long-range dependencies [17][18].
- The framework consists of four key modules: Slide Builder, Subtitle Builder, Cursor Builder, and Talker Builder, enabling controlled, personalized, academically styled video generation [23][26].
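The four-module pipeline described above can be sketched as a simple staged function, where each module consumes the previous module's output. This is an illustrative sketch only: the function names, data shapes, and stub bodies are hypothetical stand-ins, not the authors' actual PaperTalker API.

```python
# Hypothetical sketch of PaperTalker's four-stage pipeline.
# All names and data shapes below are illustrative assumptions.

def slide_builder(paper_text: str) -> list[str]:
    # In the article, this stage turns a long paper into Beamer slides;
    # here we stub it by making one slide per non-empty line.
    return [f"Slide: {line}" for line in paper_text.splitlines() if line.strip()]

def subtitle_builder(slides: list[str]) -> list[str]:
    # One narration segment per slide.
    return [f"Narration for {s}" for s in slides]

def cursor_builder(slides: list[str], subtitles: list[str]) -> list[tuple[float, float]]:
    # Stand-in for cursor/pointer placement: one normalized (x, y)
    # position per slide.
    return [(0.5, 0.5) for _ in slides]

def talker_builder(subtitles: list[str]) -> str:
    # Stand-in for speech synthesis plus talking-head rendering;
    # returns a placeholder output path.
    return f"talker_clip_{len(subtitles)}_segments.mp4"

def build_presentation(paper_text: str):
    # Stages run in order; each consumes the upstream output,
    # mirroring the Slide -> Subtitle -> Cursor -> Talker dependency.
    slides = slide_builder(paper_text)
    subtitles = subtitle_builder(slides)
    cursors = cursor_builder(slides, subtitles)
    clip = talker_builder(subtitles)
    return slides, subtitles, cursors, clip
```

The staged structure is what makes the design modular: each builder can be swapped or improved independently as long as it respects the upstream output format.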
Group 5: Experimental Results
- PaperTalker outperformed competing methods on all four evaluation dimensions, showing closer similarity to human-made videos, better information coverage, and stronger academic memorability [32][41].
- The framework's efficiency stems from its modular design and its use of Beamer for slide generation, which markedly reduces token consumption and overall generation time [35][36].

Group 6: Contributions of Key Modules
- The Cursor Builder module substantially improves information localization and comprehension, as evidenced by higher accuracy on tasks involving visual cues [38].
- The Tree Search Visual Choice module plays a critical role in optimizing slide layout and design quality, underscoring its importance to the overall effectiveness of the generated videos [40][41].

Group 7: Conclusion
- The Paper2Video benchmark and the PaperTalker framework together provide a systematic approach to generating academic presentation videos, with experimental validation showing their advantages in information transmission, visual quality, and academic memorability [43].
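The Tree Search Visual Choice step can be pictured as generating several layout variants per slide and keeping the one a visual judge scores highest. The sketch below is a minimal, hypothetical illustration of that select-the-best pattern: the variant generator and the random scorer are stand-ins for the real layout renderer and the model-based judge the article describes.

```python
import random

# Hedged sketch of a tree-search-style visual choice over slide layouts.
# render_variants and visual_score are hypothetical stand-ins.

def render_variants(slide_src: str, n: int = 4) -> list[str]:
    # Stand-in: each variant represents the same slide under a
    # different layout parameterization.
    return [f"{slide_src} [layout-{i}]" for i in range(n)]

def visual_score(variant: str) -> float:
    # Stand-in for a visual judge rating layout quality; a real system
    # would query a vision-language model here.
    return random.random()

def choose_layout(slide_src: str) -> str:
    # Expand candidates, score each, and keep the best-scoring one.
    variants = render_variants(slide_src)
    return max(variants, key=visual_score)
```

Because each slide's candidates are scored independently, this selection step parallelizes naturally across slides, which is consistent with the efficiency gains the article attributes to the modular design.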