Zero-Shot Reasoning
Are Video Models Really Reasoning, or Just "Performing" Reasoning? CUHK and Others Ask: Is Chain-of-Frame Real?
机器之心 (Jiqizhixin) · 2025-11-18 18:19
Core Insights
- The article discusses advances in video generation models such as Veo and Sora, highlighting emerging capabilities that go beyond synthesis, particularly in reasoning and perception [2][26].
- It introduces Chain-of-Frame (CoF) as a visual analogue of Chain-of-Thought (CoT) in language models: the model works toward a solution by generating video frames in sequence [2][9].

Research Findings
- Researchers from CUHK and other institutions conducted a systematic study of the zero-shot reasoning potential of models such as Veo 3, which led to the MME-CoF benchmark covering 12 reasoning dimensions [2][18].
- The study found that Veo 3 handles simple spatial layouts and basic geometric transformations well but struggles in complex scenarios, which points to limitations in maintaining global consistency and understanding [13][15][23].

Evaluation Metrics
- The MME-CoF benchmark provides a standardized framework for assessing the reasoning capabilities of video models. It covers 12 dimensions and 59 tasks and focuses on turning abstract reasoning tasks into visual challenges [18][29].
- Most video generation models scored below 2 on a 0-4 scale, indicating a lack of robust reasoning capabilities [21][24].

Conclusions
- The research concludes that current models do not possess independent zero-shot reasoning abilities; they rely on patterns in the data rather than on logical deduction [26].
- Strong generation does not equate to strong reasoning: the models often produce visually plausible results that lack logical coherence [27][28].
- There is still room for future development. These models could serve as complementary components in a more comprehensive multimodal intelligence system [29].
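
To make the 0-4 scoring mentioned under Evaluation Metrics more concrete, the following is a minimal sketch of how per-dimension scores might be aggregated and compared against the scale midpoint of 2. The function name `summarize`, the dimension names, the sample scores, and the use of a plain mean of per-dimension means are all illustrative assumptions; the summary does not describe how MME-CoF actually aggregates its scores.

```python
from statistics import mean

SCALE_MAX = 4.0  # the summary reports scores on a 0-4 scale
MIDPOINT = 2.0   # "most models scored below 2"


def summarize(scores_by_dimension):
    """Average 0-4 scores per dimension, then average the dimension means.

    Hypothetical helper: the real benchmark's aggregation may differ.
    """
    per_dim = {}
    for dim, scores in scores_by_dimension.items():
        for s in scores:
            if not 0.0 <= s <= SCALE_MAX:
                raise ValueError(f"score {s} for {dim!r} is outside 0-4")
        per_dim[dim] = mean(scores)
    overall = mean(list(per_dim.values()))
    return per_dim, overall


# Made-up scores for three made-up dimensions (not benchmark results).
example = {
    "spatial_layout": [3.0, 2.0],
    "geometric_transform": [2.0, 3.0],
    "complex_scene": [1.0, 0.0],
}
per_dim, overall = summarize(example)
print(per_dim)
print("below midpoint:", overall < MIDPOINT)
```

Averaging the per-dimension means, rather than pooling every task score, keeps a dimension with many tasks from dominating the overall number; whether that matches the benchmark's own convention is not stated in the summary.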