AI Q&A, answered by "filming" it for you! From Kuaishou Kling & City University of Hong Kong
KUAISHOU (HK:01024) QbitAI · 2025-11-22 03:07

Core Insights

- The article introduces a novel AI model called VANS, which answers questions by generating a video instead of a traditional text response, aiming to bridge the gap between understanding and execution in tasks [3][4][5].

Group 1: Concept and Motivation

- The motivation behind this research is to use video, which inherently conveys dynamic physical-world information that language struggles to describe accurately [5].
- Prior work on "next event prediction" has focused mainly on text-based answers; VANS proposes a new task paradigm in which the model responds with a generated video [8][9].

Group 2: Model Structure and Functionality

- VANS pairs a vision-language model (VLM) with a video diffusion model (VDM), optimized through a joint strategy called Joint-GRPO that strengthens collaboration between the two models [19][24].
- The workflow has two main steps: perception and reasoning, in which the input video is encoded and analyzed into a predicted text caption, followed by conditional generation, in which the model creates a video conditioned on that caption and the input video's visual features [20].

Group 3: Optimization Process

- Optimization proceeds in two phases: first the VLM is tuned to produce captions that are visually representable, then the VDM is refined so that the generated video aligns semantically with both the caption and the context of the input video [25][28].
- Joint-GRPO acts as a director, keeping the "thinker" (VLM) and the "artist" (VDM) in harmony and improving their outputs through mutual feedback [34][36].

Group 4: Applications and Impact

- VANS has two significant applications: procedural teaching, where it provides customized instructional videos based on user input, and multi-future prediction, which allows creative exploration of various hypothetical scenarios [37][41].
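Joint-GRPO builds on Group Relative Policy Optimization (GRPO), in which several candidate outputs are sampled, scored by a reward function, and each candidate's advantage is computed relative to its own group's statistics. A minimal sketch of that group-relative advantage step is below; the function name and example reward values are illustrative, not taken from the paper:

```python
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each sampled output's reward
    against the mean and standard deviation of its own group, so no
    separate value network (critic) is needed."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical example: four sampled captions scored by a joint reward
# (e.g. combining semantic match and visual consistency, in the spirit
# of Joint-GRPO's mutual feedback between VLM and VDM).
advs = group_relative_advantages([0.9, 0.7, 0.5, 0.3])
```

Above-average samples receive positive advantages and below-average ones negative, which is what lets the "thinker" and "artist" reinforce each other's better outputs without a learned critic.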
- The model shows superior benchmark performance, significantly outperforming existing models on metrics such as ROUGE-L and CLIP-T, indicating its effectiveness in both semantic fidelity and video quality [46][47].

Group 5: Experimental Results

- Comprehensive evaluations demonstrate that VANS excels at procedural teaching and future prediction, achieving nearly a threefold improvement in event prediction accuracy over the best existing models [44][46].
- Qualitative results highlight VANS's ability to accurately visualize fine-grained actions, showcasing its advanced semantic understanding and visual generation capabilities [50][53].

Conclusion

- The Video-as-Answer research represents a significant advance in video generation technology, moving beyond entertainment to practical applications and enabling more intuitive interaction with machines and knowledge [55][56].
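For reference on the ROUGE-L metric cited in the results: it scores how well a predicted caption matches a reference by the longest common subsequence (LCS) of their tokens, combined into an F-measure. A minimal self-contained sketch (the example sentences are hypothetical, not from the paper's benchmark):

```python
def lcs_length(a, b):
    """Longest common subsequence length via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_f1(reference, candidate, beta=1.2):
    """ROUGE-L F-score over token lists; beta > 1 weights recall higher."""
    lcs = lcs_length(reference, candidate)
    if lcs == 0:
        return 0.0
    p = lcs / len(candidate)   # precision: LCS share of the candidate
    r = lcs / len(reference)   # recall: LCS share of the reference
    return (1 + beta**2) * p * r / (r + beta**2 * p)

score = rouge_l_f1("pick up the cup".split(), "pick the cup up".split())
# Here LCS = "pick the cup" (3 tokens of 4), so precision = recall = 0.75.
```

CLIP-T, by contrast, measures text-video alignment in an embedding space, so the two metrics together cover the semantic fidelity and visual quality the article reports.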