PresentAgent

Cutting-edge presentation generation: PresentAgent, from text to presentation video
机器之心· 2025-07-18 08:18
Core Viewpoint
- PresentAgent is introduced as a multimodal agent that transforms long documents into narrated presentation videos, overcoming the limitations of existing methods, which mostly generate static slides or text summaries [1][9].

Group 1: System Overview
- PresentAgent uses a modular process: systematic document segmentation, slide style planning and rendering, context-aware voice narration generated with large language models, and precise audio-visual alignment to assemble a complete video [3][5][19].
- The system takes various document types (e.g., web pages, PDFs) as input and outputs a presentation video that combines slides with synchronized narration [17][19].

Group 2: Evaluation Framework
- PresentEval is introduced as a unified evaluation framework driven by vision-language models. It assesses content fidelity, visual clarity, and audience comprehension [6][10].
- The evaluation uses a carefully curated dataset of 30 document-presentation pairs and shows that PresentAgent performs close to human level on all evaluation metrics [7][12].

Group 3: Contributions
- The paper proposes a new task, "document-to-presentation video generation," which aims to automatically create structured slide videos with voice narration from various kinds of long text [12].
- A high-quality benchmark dataset, Doc2Present Benchmark, is constructed to support evaluation of this task [12].
- The modular design of PresentAgent supports controllable, interpretable generation with multimodal alignment, balancing high-quality generation with fine-grained evaluation [19][27].

Group 4: Experimental Results
- The main experimental results indicate that most PresentAgent variants achieve test accuracy comparable to or better than the human benchmark, with Claude-3.7-sonnet reaching the highest accuracy of 0.64 [22][25].
- Subjective quality assessments show that human-made presentations still lead in overall video and audio ratings, but some PresentAgent variants are competitive, particularly in content and visual appeal [26][27].

Group 5: Case Study
- An example of a fully automated presentation video generated by PresentAgent shows the system identifying structural segments and producing slides with conversational subtitles and synchronized voice, effectively conveying technical information [29].
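
The modular process described under System Overview (segment the document, plan slides, write narration) can be illustrated with a toy sketch. This is not PresentAgent's code: every name below (Segment, Slide, segment_document, plan_slide, build_deck) is hypothetical, segmentation is reduced to splitting on blank lines, and the narration step is a plain string template standing in for the large language model call that the real system would make.

```python
import re
from dataclasses import dataclass


@dataclass
class Segment:
    title: str
    text: str


@dataclass
class Slide:
    title: str
    bullets: list[str]
    narration: str


def segment_document(doc: str) -> list[Segment]:
    """Split on blank lines; the first line of each block is treated as its title."""
    segments = []
    for block in re.split(r"\n\s*\n", doc.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if lines:
            segments.append(Segment(title=lines[0], text=" ".join(lines[1:])))
    return segments


def plan_slide(seg: Segment, max_bullets: int = 3) -> Slide:
    """Turn one segment into a slide: leading sentences become bullets."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", seg.text) if s.strip()]
    bullets = sentences[:max_bullets]
    # Placeholder for the narration step (assumption): a real system would
    # prompt a language model here instead of filling a template.
    narration = f"Let's look at {seg.title}. " + " ".join(bullets)
    return Slide(title=seg.title, bullets=bullets, narration=narration)


def build_deck(doc: str) -> list[Slide]:
    """Run segmentation and slide planning over the whole document."""
    return [plan_slide(seg) for seg in segment_document(doc)]


if __name__ == "__main__":
    sample = (
        "Intro\nPresentAgent makes videos. It uses slides.\n\n"
        "Evaluation\nPresentEval scores outputs."
    )
    for slide in build_deck(sample):
        print(slide.title, slide.bullets)
```

Keeping each stage as a separate function mirrors the modularity the summary emphasizes: any one stage could be swapped out (for example, a model-based segmenter) without touching the others.
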
Documents become narrated presentation videos in seconds! Open-source agent nears human level on business reports and academic papers
量子位· 2025-07-11 04:00
Core Viewpoint
- PresentAgent is a multimodal AI agent that automatically converts structured or unstructured documents into video presentations with synchronized voiceover and slides, aiming to replicate human-like information delivery [1][3][22].

Group 1: Functionality and Process
- PresentAgent generates tightly synchronized visual content and spoken explanations, simulating human-style presentations for document types such as business reports, technical manuals, policy briefs, and academic papers [3][21].
- The system uses a modular generation framework: semantic chunking of the input document, layout-guided slide generation, rewriting of key information into spoken text, and synchronization of voice with slides to produce a coherent video presentation [11][20].
- The process runs in several steps: document processing, structured slide generation, synchronized subtitle creation, and voice synthesis. The final output is a presentation video that combines slides and voice [13][14].

Group 2: Evaluation and Performance
- The team evaluated the system on a test set of 30 human-made document/presentation-video pairs from various fields, using a dual-path evaluation strategy that assesses content understanding and quality with vision-language models [21][22].
- PresentAgent performed close to human level on all evaluation metrics, including content fidelity, visual clarity, and audience comprehension, showing its potential for turning static text into dynamic, accessible presentation formats [21][22].
- The results indicate that combining language models, visual layout generation, and multimodal synthesis can yield an explainable and scalable automated presentation generation system [23].
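
The voice-to-slide synchronization step can be pictured as building a timeline from the length of each slide's narration audio. The sketch below is an illustration under stated assumptions, not the system's actual method: the names Cue and build_timeline are hypothetical, the narration durations are taken as given (in seconds), and the fixed pause after each slide is an invented parameter.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    slide_index: int
    start: float  # seconds from the start of the video
    end: float    # seconds from the start of the video


def build_timeline(narration_seconds: list[float], pause: float = 0.5) -> list[Cue]:
    """Give each slide a time window that covers its narration plus a short pause.

    Each slide stays on screen exactly as long as its audio plays (plus the
    pause), so slide changes never cut into the voiceover.
    """
    cues = []
    t = 0.0
    for i, duration in enumerate(narration_seconds):
        if duration < 0:
            raise ValueError(f"negative narration length for slide {i}")
        end = t + duration + pause
        cues.append(Cue(slide_index=i, start=t, end=end))
        t = end  # the next slide starts where this one ends
    return cues


if __name__ == "__main__":
    for cue in build_timeline([2.0, 3.5]):
        print(cue)
```

A video assembler could then show slide i between start and end while playing the matching audio clip from start.
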