DeepMind First to Propose CoF: Video Models Have Their Own Chain of Thought
量子位 (QbitAI) · 2025-09-28 03:39
Core Viewpoint
- DeepMind introduces the concept of Chain-of-Frames (CoF) for video models, a parallel to Chain-of-Thought (CoT) in language models, suggesting that machine vision is shifting toward general-purpose visual understanding [1][3][28].

Group 1: Introduction of CoF
- The CoF concept arises from the question of whether video generation models can achieve general-purpose capabilities similar to those of large language models (LLMs) without specialized training [6][7].
- The goal is to validate the hypothesis that video models can perform a variety of visual tasks using a single underlying logic based on vast amounts of data [7][8].

Group 2: Capabilities of Veo 3
- Veo 3 demonstrates four progressive capabilities:
  1. It can handle many classic visual tasks without specialized training, showcasing perceptual abilities [10][11].
  2. It can establish rules of the visual world, indicating modeling capabilities [13][14].
  3. It can perform creative modifications and simulations, reflecting manipulation abilities [16].
  4. It can achieve cross-temporal visual reasoning, embodying the CoF concept [18][21].

Group 3: Performance Analysis
- An analysis of 62 qualitative tasks and 7 quantitative tasks shows that Veo 3 can solve many tasks it was not specifically trained for, indicating general-purpose potential [23].
- Veo 3 performs significantly better than its predecessor, Veo 2, suggesting that video model capabilities are developing rapidly [24][25].

Group 4: Future Outlook
- DeepMind predicts that general-purpose models like Veo 3 will eventually replace specialized models in the video domain, similar to the evolution seen in LLMs [25][26].
- The cost of video generation is currently higher than that of specialized models, but it is expected to decrease over time, paralleling the trend observed in LLMs [25][26].