演示视频生成

Search documents
演讲生成黑科技,PresentAgent从文本到演讲视频
机器之心· 2025-07-18 08:18
Core Viewpoint - PresentAgent is introduced as a multimodal agent capable of transforming lengthy documents into narrated presentation videos, overcoming limitations of existing methods that primarily generate static slides or text summaries [1][9]. Group 1: System Overview - PresentAgent employs a modular process that includes systematic document segmentation, slide style planning and rendering, context-aware voice narration generation using large language models, and precise audio-visual alignment to create a complete video [3][5][19]. - The system takes various document types (e.g., web pages, PDFs) as input and outputs a presentation video that combines slides with synchronized narration [17][19]. Group 2: Evaluation Framework - PresentEval is introduced as a unified evaluation framework driven by visual-language models, assessing content fidelity, visual clarity, and audience comprehension [6][10]. - The evaluation is based on a carefully curated dataset of 30 document-presentation pairs, demonstrating that PresentAgent performs close to human levels across all evaluation metrics [7][12]. Group 3: Contributions - The paper presents a new task of "document-to-presentation video generation," aiming to automatically create structured slide videos with voice narration from various long texts [12]. - A high-quality benchmark dataset, Doc2Present Benchmark, is constructed to support the evaluation of document-to-presentation video generation [12]. - The modular design of PresentAgent allows for controllable, interpretable, and multimodal alignment, balancing high-quality generation with fine-grained evaluation [19][27]. Group 4: Experimental Results - The main experimental results indicate that most variants of PresentAgent achieve comparable or superior test accuracy to human benchmarks, with Claude-3.7-sonnet achieving the highest accuracy of 0.64 [22][25]. - Subjective quality assessments show that while human-made presentations still lead in overall video and audio ratings, some PresentAgent variants demonstrate competitive performance, particularly in content and visual appeal [26][27]. Group 5: Case Study - An example of a fully automated presentation video generated by PresentAgent illustrates the system's ability to identify structural segments and produce slides with conversational subtitles and synchronized voice, effectively conveying technical information [29].