Large models cannot truly understand video: GPT-4o scores only 36%, and a Nanyang Technological University team proposes a new benchmark
量子位· 2025-08-01 07:19
Core Viewpoint
- The development of Video Large Language Models (Video LLMs) raises the question of whether these models truly "understand" video content or merely perform advanced "pattern matching" [2][3].

Group 1: Introduction of the Video Thinking Test (Video-TT)
- Researchers from Nanyang Technological University proposed a new benchmark, the Video Thinking Test (Video-TT), to separate the ability to "see" from the ability to "think" [2][3].
- The primary goal of Video-TT is to accurately measure AI's true understanding of, and reasoning about, video content [3].

Group 2: Key Findings
- Humans significantly outperform state-of-the-art (SOTA) models at video understanding, reaching 84.3% accuracy versus roughly 50% for SOTA models [4][29].
- Open-source models are less robust than GPT-4o, one of the SOTA models [5].
- GPT-4o struggles to recognize ambiguous or unconventional content and has difficulty with multi-scene differentiation and world knowledge [5].

Group 3: Limitations of Existing Benchmarks
- Current video understanding benchmarks cannot distinguish whether a model's errors stem from not "seeing" enough key frames or from lacking genuine reasoning ability [9][10].
- The "frame sampling paradox" in long-video evaluation leaves it unclear whether a model answered incorrectly because of limited frame sampling or because of weak reasoning [12][13].
- Short-video evaluation creates a "ceiling illusion": models appear to perform at human level, misleadingly suggesting that short-video understanding is solved [15][16].

Group 4: Design Principles of Video-TT
- Video-TT emphasizes question complexity to stimulate "thinking," focusing on context, reasons, and scenarios rather than question types alone [17].
- The test covers two core dimensions of complexity, visual and narrative, each with four aspects [18][19].

Group 5: Evaluation Results
- The results reveal a significant gap between current SOTA models and humans in video reasoning capability [26][29].
- GPT-4o performs well below human level, with a correctness score of only 36.6% [30].
- Open-source models show promise on multiple-choice questions but struggle with open-ended ones, suggesting that existing benchmarks may overestimate model capabilities [31].

Group 6: Analysis of AI Errors
- The analysis identifies three core weaknesses in models such as GPT-4o: confusion about temporal and spatial relationships, lack of world knowledge, and failure to understand complex narratives [34][36].
- Models often misinterpret time and space, struggle with social and cultural context, and fail to connect narrative threads across scenes [38][40].
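The frame sampling paradox can be made concrete with a small sketch. This is illustrative logic only, not the paper's evaluation code; the video length, frame budget, and event span are assumed values:

```python
def sample_frame_indices(total_frames: int, budget: int) -> list[int]:
    """Uniformly sample `budget` frame indices from a video of `total_frames`."""
    if budget >= total_frames:
        return list(range(total_frames))
    step = total_frames / budget
    # Take the midpoint of each of the `budget` equal segments.
    return [int(step * i + step / 2) for i in range(budget)]

def event_is_visible(indices: list[int], event_start: int, event_end: int) -> bool:
    """Check whether any sampled frame falls inside the event's frame span."""
    return any(event_start <= i <= event_end for i in indices)

# A 30-minute video at 30 fps is 54,000 frames; sample a 32-frame budget.
indices = sample_frame_indices(54_000, 32)
# A 2-second key event (60 frames) can fall entirely between samples,
# so a wrong answer may reflect missing input rather than weak reasoning.
print(event_is_visible(indices, 10_000, 10_059))  # False
```

With a sparse budget, short events are routinely skipped, which is exactly why a wrong answer on a long video is ambiguous evidence about the model's reasoning.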
ICML 2025 | Latest advances in multimodal understanding and generation: HKUST and Snap Research release ThinkDiff, giving diffusion models a brain
机器之心· 2025-07-16 04:21
Core Viewpoint
- The article introduces ThinkDiff, a new method for multimodal understanding and generation that enables diffusion models to perform reasoning and creative tasks with minimal training data and compute [3][36].

Group 1: Introduction to ThinkDiff
- ThinkDiff is a collaboration between the Hong Kong University of Science and Technology and Snap Research, aimed at giving diffusion models reasoning capabilities with limited data [3].
- The method lets diffusion models understand the logical relationships between images and text prompts, leading to high-quality image generation [7].

Group 2: Algorithm Design
- ThinkDiff transfers the reasoning capabilities of large vision language models (VLMs) to diffusion models, combining the strengths of both for improved multimodal understanding [7].
- The architecture aligns VLM-generated tokens with the diffusion model's decoder, so the diffusion model inherits the VLM's reasoning abilities [15].

Group 3: Training Process
- Training includes a vision-language pretraining task that aligns the VLM with the LLM decoder, facilitating the transfer of multimodal reasoning capabilities [11][12].
- A masking strategy during training ensures the alignment network learns to recover semantics from incomplete multimodal information [15].

Group 4: Variants of ThinkDiff
- ThinkDiff has two variants: ThinkDiff-LVLM, which aligns large-scale VLMs with diffusion models, and ThinkDiff-CLIP, which aligns CLIP with diffusion models for stronger text-image composition [16].

Group 5: Experimental Results
- ThinkDiff-LVLM significantly outperforms existing methods on the CoBSAT benchmark, demonstrating high accuracy and quality in multimodal understanding and generation [18].
- Training is notably efficient: ThinkDiff-LVLM reaches its best results with only 5 hours of training on 4 A100 GPUs, far less than competing methods require [20][21].

Group 6: Comparison with Other Models
- ThinkDiff-LVLM is comparable to commercial models such as Gemini on everyday image reasoning and generation tasks [25].
- The method also extends to multimodal video generation by adapting the diffusion decoder to generate high-quality videos from input images and text [34].

Group 7: Conclusion
- ThinkDiff represents a significant advance in multimodal understanding and generation, providing a unified model that excels in both quantitative and qualitative evaluation, with value for both research and industrial applications [36].
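The alignment idea can be illustrated with a minimal sketch: project VLM token features into the diffusion decoder's conditioning space, masking some tokens during training so the aligner learns from incomplete input. The dimensions, the single linear map, and the masking ratio are assumptions for illustration, not ThinkDiff's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes for illustration (not ThinkDiff's real dimensions):
VLM_DIM, DIFF_DIM, SEQ_LEN = 64, 48, 10

# The aligner here is a single learned linear map from VLM token space
# to the diffusion decoder's conditioning space.
W = rng.normal(scale=0.02, size=(VLM_DIM, DIFF_DIM))

def align(vlm_tokens: np.ndarray) -> np.ndarray:
    """Project VLM-generated token features into the decoder's space."""
    return vlm_tokens @ W

def mask_tokens(tokens: np.ndarray, mask_ratio: float = 0.3) -> np.ndarray:
    """Randomly zero out a fraction of tokens so the aligner must recover
    semantics from incomplete multimodal input (the masking strategy)."""
    keep = rng.random(len(tokens)) >= mask_ratio
    return tokens * keep[:, None]

vlm_tokens = rng.normal(size=(SEQ_LEN, VLM_DIM))
conditioning = align(mask_tokens(vlm_tokens))
print(conditioning.shape)  # (10, 48)
```

In the real method the aligner is trained against the LLM decoder during vision-language pretraining; this sketch only shows the data flow.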
Trump's AI plan leaks on GitHub; netizens blast "governing" with AI code!
AI前线· 2025-06-16 07:37
Core Viewpoint
- The article discusses the recent leak of the AI.gov project code, part of the Trump administration's initiative to integrate AI into government operations, raising concerns about over-reliance on AI in the public sector and the potential risks associated with it [1][8][9].

Group 1: AI.gov Project Overview
- The AI.gov project aims to serve as a hub for government agencies to implement AI, led by Thomas Shedd, who has a background in software integration at Tesla [2][4].
- The project is set to launch officially on July 4, coinciding with Independence Day, and includes three main components: a chatbot, an integrated API for connecting to AI models, and a tool called "CONSOLE" for monitoring AI usage within agencies [4][5].

Group 2: Concerns and Criticism
- The leak has sparked public dissatisfaction with the government's heavy reliance on AI, with critics highlighting past failures of AI tools in government decision-making, such as the flawed AI tool used to evaluate contracts at the Department of Veterans Affairs [8][9][11].
- Experts warn of the potential for significant errors in AI-driven decisions, emphasizing that complex tasks should not be entrusted solely to AI systems [11][12].

Group 3: Broader Implications of AI in Government
- The Trump administration's approach to AI is more lenient than the Biden administration's, focusing on reducing regulatory oversight and promoting domestic AI companies [8][9].
- There are concerns about data security and the risks of centralizing sensitive information, which could magnify the damage from a data breach [12][13].
State-Of-The-Art Prompting For AI Agents
Y Combinator· 2025-05-30 14:00
Prompt Engineering & Metaprompting
- Metaprompting is emerging as a powerful tool, likened to coding in 1995 because the tools are still rapidly evolving [1]
- The best prompts often start by defining the LLM's role, detailing the task, and outlining a step-by-step plan, often using markdown-style formatting [1]
- Vertical AI agent companies are exploring how to balance flexibility for customer-specific logic with maintaining a general-purpose product, considering forking and merging prompts [1]
- An emerging architecture defines a system prompt (the company API), a developer prompt (customer-specific context), and a user prompt (end-user input) [1]
- Worked examples are crucial for improving output quality, and automating the extraction and ingestion of these examples from customer data is a valuable opportunity [2]
- Prompt folding allows a prompt to dynamically generate better versions of itself by feeding it examples where it failed [2]
- When LLMs lack sufficient information, provide an "escape hatch" to avoid hallucinations, either by letting them ask for more information or by including debug info in the response [2]

Evaluation & Model Personalities
- Evals are considered the "crown jewels" of AI companies, essential for understanding why a prompt was written a certain way and for improving it [3]
- Different LLMs exhibit distinct personalities; for example, Claude is considered more steerable, while Llama 4 requires more steering and prompting [5]
- When using LLMs to generate numerical scores, providing rubrics is best practice, though models interpret and apply rubrics with varying degrees of rigidity [5]

Founder Role & Forward Deployed Engineer
- Founders need to deeply understand their users and codify these insights into specific evals to ensure the software works for them [3]
- Founders should act as "forward deployed engineers," directly engaging with users to understand their needs and rapidly iterate on the product [4]
- The forward deployed engineer model, combined with AI, enables faster iteration and the closing of significant deals with large enterprises [5]
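The three-layer prompt architecture can be sketched as a simple message assembler. The role names, prompt text, and customer details below are hypothetical, and a separate "developer" role is only supported by some chat APIs (elsewhere the developer layer is folded into the system prompt):

```python
def build_messages(system_prompt: str, developer_prompt: str, user_input: str) -> list[dict]:
    """Assemble the three prompt layers into a chat-style message list."""
    return [
        {"role": "system", "content": system_prompt},        # company "API": role, rules, format
        {"role": "developer", "content": developer_prompt},  # customer-specific context and logic
        {"role": "user", "content": user_input},             # end-user input
    ]

system_prompt = (
    "You are a support triage agent.\n"
    "Plan step by step, then answer in markdown.\n"
    # The "escape hatch": ask rather than hallucinate.
    "If you lack enough information, say so and ask a clarifying question."
)
developer_prompt = "Customer: Acme Corp. Escalate all billing issues to a human."
messages = build_messages(system_prompt, developer_prompt, "My invoice looks wrong.")
print([m["role"] for m in messages])  # ['system', 'developer', 'user']
```

Keeping the layers separate is what allows forking per-customer logic without touching the general-purpose system prompt.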
A new way to play with cultural tourism! Master Zang teaches you to make miniature food-landscape promotional posters & videos
歸藏的AI工具箱· 2025-05-28 08:06
Core Viewpoint
- The article discusses the creative use of AI tools such as GPT-4o and Veo3 to generate visually appealing food-themed images and miniature scenes, highlighting their potential for tourism promotion and artistic expression [1][4][9].

Group 1: Image Generation Ideas
- One concept is a surreal keyboard in which each key is a miniature dessert, emphasizing vibrant colors and realistic textures [2][5].
- A newer idea combines food and cityscapes: miniature scenes built from foods representative of different cities, which could serve as promotional material [4][6].
- Veo3 can create time-lapse animations of culinary scenes, showcasing the gradual assembly of ingredients into a complete miniature landscape [6][7].

Group 2: Specific Scene Descriptions
- A detailed "Chengdu" themed scene features a hot pot and playful panda elements, with ingredients creatively arranged to form landscapes and rivers [5][8].
- The scene captures the essence of Chengdu's culinary culture with a playful, vibrant atmosphere, making it well suited to tourism marketing [5][8].

Group 3: Tools and Techniques
- The article mentions using Veo3 and a Gemini Pro membership for enhanced video creation, encouraging readers to experiment with these tools [9].
- It highlights Flow's ability to create seamless video transitions, while noting the higher costs associated with that option [6][9].
In depth | Jensen Huang at the Global Conference: AI factories are the next gigawatt-scale industrial revolution, and NVIDIA is building multiple AI factories costing $50-60 billion each
Z Potentials· 2025-05-13 02:44
Core Insights
- The article discusses the rise of AI factories as a new generation of infrastructure, expected to redefine various industries and create a multi-trillion-dollar economic impact [3][5][7]
- AI is seen as a revolutionary force that can automate tasks and expand the digital workforce, fundamentally changing the labor market and skill requirements [4][6][8]

Group 1: AI Factory Revolution
- AI is considered the next industrial revolution, with capabilities including perception, content generation, language translation, reasoning, and problem-solving [3]
- AI factories are being built at investments of approximately $50-60 billion each, and dozens of gigawatt-scale AI factories are expected to be constructed globally in the next decade [4][8]
- The AI factory industry is emerging as a new sector that will serve as foundational infrastructure for various industries, similar to previous generations of information and energy infrastructure [5][7]

Group 2: Impact on the Labor Market
- Advanced AI is expected to eliminate millions of jobs while simultaneously creating new ones, leading to a significant transformation of the workforce [6][7]
- AI can help bridge the technology gap, giving a broader population access to technology previously available only to a select few [8]
- AI is viewed as a way to raise global GDP by reintegrating millions of people into the labor market, addressing current labor shortages [7][8]

Group 3: Chip Industry and Long-term Strategy
- NVIDIA is positioned as a leader in AI infrastructure, with a focus on building a comprehensive ecosystem spanning chip design, system development, and software integration [13][14]
- The company emphasizes understanding customer needs to drive innovation and improve its technology architecture [17][18]
- Future demand for AI is expected to grow significantly in sectors such as healthcare, life sciences, and advanced manufacturing, with a shift toward robotic systems in factories [18][19]
Beyond DeepSeek? The hidden technology war the giants won't talk about
36Kr· 2025-04-29 00:15
Group 1: DeepSeek-R1 Model and MLA Technology
- The launch of DeepSeek-R1 represents a significant breakthrough for AI technology in China, delivering performance competitive with industry leaders such as OpenAI while requiring 30% fewer computational resources than comparable products [1][3]
- The team's multi-head latent attention (MLA) mechanism cuts memory usage by 50%, but this has also increased development complexity, extending the average development cycle by 25% in manual optimization scenarios [2][3]
- DeepSeek's distributed training framework and dynamic quantization technology have improved inference efficiency per unit of computing power by 40%, providing a case study in the co-evolution of algorithms and systems engineering [1][3]

Group 2: Challenges and Innovations in AI Infrastructure
- Traditional fixed architectures, especially GPU-based systems, struggle to adapt to the rapidly evolving demands of modern AI and high-performance computing, often requiring significant hardware modifications [6][7]
- The energy consumption of AI data centers is projected to rise dramatically, with future power demands expected to reach 600 kW per cabinet, in sharp contrast to the current capabilities of most enterprise data centers [7][8]
- The industry is shifting toward intelligent, software-defined hardware platforms that can integrate existing solutions while supporting future technological advances [6][8]

Group 3: Global AI Computing Power Trends
- Global AI computing spending has surged from 9% in 2016 to 18% in 2022, and is expected to exceed 25% by 2025, indicating a shift of computing power from infrastructure support to core national strategy [9][11]
- Intelligent computing capacity grew 94.4% year over year, from 232 EFLOPS in 2021 to 451 EFLOPS in 2022, surpassing traditional computing power for the first time [10][11]
- Competition for computing power is intensifying, with major players such as the US and China investing heavily in infrastructure to secure a competitive edge in AI [12][13]

Group 4: China's AI Computing Landscape
- China's AI computing demand is expected to exceed 280 EFLOPS by the end of 2024, with intelligent computing accounting for over 30%, driven by technological iteration and industrial upgrades [19][21]
- A shift from centralized computing pools to distributed computing networks is essential to meet growing demands for real-time, concurrent processing across applications [20][21]
- The evolution of China's computing industry is not merely about scale; it involves strategic breakthroughs in technology sovereignty, industrial security, and economic resilience [21]
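The memory saving behind MLA-style attention can be sketched as low-rank compression of the KV cache: instead of caching full keys and values per head, cache one shared latent per token and reconstruct K and V on the fly. All sizes here are illustrative assumptions, not DeepSeek's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only (not DeepSeek's real model dimensions):
D_MODEL, N_HEADS, D_HEAD, D_LATENT, SEQ = 256, 8, 32, 64, 16

# Down-projection to a shared latent, plus up-projections for K and V.
W_down = rng.normal(scale=0.02, size=(D_MODEL, D_LATENT))
W_up_k = rng.normal(scale=0.02, size=(D_LATENT, N_HEADS * D_HEAD))
W_up_v = rng.normal(scale=0.02, size=(D_LATENT, N_HEADS * D_HEAD))

hidden = rng.normal(size=(SEQ, D_MODEL))

# Standard attention caches K and V: 2 * SEQ * N_HEADS * D_HEAD values.
full_cache = 2 * SEQ * N_HEADS * D_HEAD
# MLA-style: cache only the shared latent, reconstructing K/V when needed.
latent_cache = hidden @ W_down          # shape (SEQ, D_LATENT)
k = latent_cache @ W_up_k               # reconstructed keys
v = latent_cache @ W_up_v               # reconstructed values
print(latent_cache.size / full_cache)   # 0.125: an 8x smaller cache here
```

The trade-off the article describes follows directly: the cache shrinks, but the extra projections and their interaction with kernels and parallelism add engineering complexity.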