Multimodal Agents

GPT-4o Can't Even Crack a CAPTCHA?? SOTA Models Succeed Only 40% of the Time
量子位· 2025-06-04 05:21
Core Viewpoint
- The article introduces Open CaptchaWorld, a research platform for evaluating how well multimodal agents solve CAPTCHA challenges, and highlights the wide gap between human performance and current state-of-the-art models [1][5][33].

Group 1: CAPTCHA Challenges for Multimodal Agents
- CAPTCHAs are a major bottleneck for deploying multimodal agents in real-world scenarios, particularly in high-value web applications such as e-commerce and ticketing [4][5].
- Existing benchmarks largely overlook CAPTCHAs, even though they are not edge cases but routine obstacles in practical tasks [4][5].
- Open CaptchaWorld systematically assesses agents' CAPTCHA-solving performance and provides a comprehensive evaluation framework [5][11].

Group 2: Performance Metrics and Findings
- Humans solve CAPTCHAs with an average success rate of 93.3%, while state-of-the-art multimodal models reach only 5%-40% [2][11].
- The platform covers 20 types of modern CAPTCHAs with 225 samples in total, spanning interaction types such as click sequences and image selection [9][11].
- A new evaluation metric, CAPTCHA Reasoning Depth, quantifies the cognitive complexity involved in solving each CAPTCHA, giving a more nuanced view of agent capabilities [11][19].

Group 3: Agent Behavior and Efficiency
- Many advanced agents solve problems inefficiently, over-complicating tasks and driving up error rates [22][24].
- OpenAI-o3 achieves the highest success rate at 40% but also incurs the highest operational cost, a clear trade-off between performance and cost [28][30].
- Models such as Gemini2.5-Pro and GPT-4.1 strike a better balance, reaching roughly 25% success at lower cost, which points to room for optimization in future model designs (see the sketch after this summary) [29][30].

Group 4: Implications for Future Research
- Open CaptchaWorld encourages researchers to confront CAPTCHA challenges directly, since clearing them is a prerequisite for real-world agent deployment [33].
- The findings also call for new CAPTCHA designs that keep pace with evolving agent capabilities to remain relevant and secure [34].
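To make the performance-cost trade-off from Group 3 concrete, here is a minimal sketch that ranks agents by successful solves per unit cost. The success rates come from the article; the per-run cost figures are hypothetical placeholders, since the article only describes costs qualitatively (OpenAI-o3 as the most expensive).

```python
# Sketch: comparing CAPTCHA-solving agents by success rate per unit cost.
# Success rates are from the article; the per-run costs are hypothetical
# placeholders used only to illustrate the trade-off computation.

agents = {
    # name: (success_rate, hypothetical_cost_usd_per_run)
    "OpenAI-o3":     (0.40, 1.00),
    "Gemini2.5-Pro": (0.25, 0.30),
    "GPT-4.1":       (0.25, 0.25),
}

def efficiency(success_rate: float, cost: float) -> float:
    """Successful solves per dollar spent (higher is better)."""
    return success_rate / cost

ranked = sorted(agents.items(), key=lambda kv: efficiency(*kv[1]), reverse=True)
for name, (rate, cost) in ranked:
    print(f"{name:14s} success={rate:.0%} cost=${cost:.2f} eff={efficiency(rate, cost):.2f}/$")
```

Under these placeholder costs the cheaper ~25% models come out ahead on efficiency even though o3 leads on raw success rate, which is the trade-off the article points to.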
365 Days of a Hundred Competing Models: Hugging Face's Year-in-Review Charts the VLM Capability Curve and Its Inflection Points | Jinqiu Select
锦秋集· 2025-05-16 15:42
Core Insights
- The article reviews the rapid evolution of vision language models (VLMs), highlighting the rise of smaller yet powerful multimodal architectures and advances in capabilities such as multimodal reasoning and long-video understanding [1][3].

Group 1: New Model Trends
- "Any-to-any" models can take in and produce multiple modalities (images, text, audio) by aligning them in a shared representation [5][6].
- New models such as Qwen 2.5 Omni and DeepSeek Janus-Pro-7B exemplify these advances, enabling seamless input and output across modalities [6][10].
- Small, high-performance models ("Smol Yet Capable") are gaining traction, enabling local deployment and lightweight applications [7][15].

Group 2: Reasoning Models
- Reasoning models are emerging in the VLM space and can tackle complex problems; notable examples include Qwen's QVQ-72B-preview and Moonshot AI's Kimi-VL-A3B-Thinking [11][12].
- These models are designed to handle long videos and diverse document types, showcasing advanced reasoning capabilities [14].

Group 3: Multimodal Safety Models
- Multimodal safety models filter inputs and outputs to block harmful content; Google's ShieldGemma 2 is a notable example [31][32].
- Meta's Llama Guard 4 is highlighted as a dense multimodal safety model that can filter the outputs of vision language models [34].

Group 4: Multimodal Retrieval-Augmented Generation (RAG)
- Multimodal RAG improves retrieval over complex documents by better integrating visual and textual data [35][38].
- Two main retriever architectures are introduced: DSE models, which compress a document page into a single embedding, and ColBERT-like models, which keep token- and patch-level embeddings and score relevance via late interaction (see the sketch after this summary) [42][44].

Group 5: Multimodal Intelligent Agents
- Vision-language-action (VLA) models that interact with physical environments are emerging, with π0 and GR00T N1 showcasing their capabilities [21][22].
- Recent agents such as ByteDance's UI-TARS-1.5 can navigate user interfaces and perform tasks in real time [47][54].

Group 6: Video Language Models
- Video understanding remains challenging; models such as Meta's LongVU and Qwen2.5VL demonstrate advanced capabilities in processing video frames and modeling temporal relationships [55][57].

Group 7: New Benchmarks
- New benchmarks such as MMT-Bench and MMMU-Pro aim to evaluate VLMs across a wide variety of multimodal tasks [66][67][68].
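To illustrate the ColBERT-like retrieval path mentioned in Group 4, here is a minimal sketch of late-interaction (MaxSim) scoring. It assumes query-token and document-patch embeddings have already been produced by a multimodal encoder (a ColPali-style model, for example); the encoder call is omitted and the array shapes are illustrative assumptions, not a specific model's output format.

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """Late-interaction (MaxSim) score as used by ColBERT-style retrievers.

    query_emb: (num_query_tokens, dim) L2-normalized query token embeddings.
    doc_emb:   (num_doc_patches, dim) L2-normalized document patch embeddings.
    Each query token is matched to its most similar document patch, and the
    per-token maxima are summed into a single relevance score.
    """
    sim = query_emb @ doc_emb.T           # (q_tokens, d_patches) cosine similarities
    return float(sim.max(axis=1).sum())   # best patch per query token, then sum

# Toy usage with random normalized vectors standing in for real encoder output.
rng = np.random.default_rng(0)
def normed(shape):
    x = rng.standard_normal(shape)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

query = normed((8, 128))                          # 8 query tokens, 128-dim
pages = [normed((1024, 128)) for _ in range(3)]   # 3 document pages, 1024 patches each
best = max(range(len(pages)), key=lambda i: maxsim_score(query, pages[i]))
print("most relevant page index:", best)
```

The contrast with DSE-style retrieval is that a DSE model would reduce each page to one vector and score with a single dot product, trading the fine-grained token-to-patch matching above for cheaper storage and faster search.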