GPT-4o Can't Even Solve CAPTCHAs? SOTA Models Reach Only a 40% Success Rate

Core Viewpoint - The article covers the launch of Open CaptchaWorld, a research platform for evaluating how well multi-modal agents solve CAPTCHA challenges, and highlights the large gap between human performance and that of current state-of-the-art models [1][5][33].
Group 1: CAPTCHA Challenges for Multi-modal Agents
- CAPTCHAs are a major bottleneck for deploying multi-modal agents in real-world scenarios, particularly in high-value web applications such as e-commerce and ticketing [4][5].
- Current benchmarks often overlook CAPTCHAs, even though they are common obstacles in practical tasks rather than exceptions [4][5].
- Open CaptchaWorld is designed to assess agents' CAPTCHA-solving performance systematically, providing a comprehensive evaluation framework [5][11].
Group 2: Performance Metrics and Findings
- Humans solve the CAPTCHAs with an average success rate of 93.3%, while state-of-the-art multi-modal models achieve success rates of only 5% to 40% [2][11].
- The platform includes 20 types of modern CAPTCHAs, 225 samples in total, covering interaction types such as click sequences and image selection [9][11].
- A new evaluation metric, CAPTCHA Reasoning Depth, quantifies the cognitive complexity involved in solving a CAPTCHA, giving a more nuanced picture of agent capabilities [11][19].
Group 3: Agent Behavior and Efficiency
- Many advanced agents solve problems inefficiently, often over-complicating tasks, which increases their error rates [22][24].
- OpenAI-o3 has the highest success rate at 40% but also incurs the highest operational cost, indicating a trade-off between performance and cost [28][30].
- Other models, such as Gemini2.5-Pro and GPT-4.1, strike a better balance between success rate (around 25%) and cost efficiency, suggesting a need for optimization in future model designs [29][30].
Group 4: Implications for Future Research
- The introduction of Open CaptchaWorld encourages researchers to confront CAPTCHA challenges directly, as overcoming these obstacles is essential for real-world deployment of agents [33].
- The findings highlight the need for new CAPTCHA designs that can adapt to the evolving capabilities of agents, ensuring ongoing relevance and security [34].
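
The performance-versus-cost trade-off noted under Group 3 can be made concrete with a small sketch. The Python below is an illustration only, not the platform's evaluation code: the `RunSummary` structure, the `pareto_front` helper, the agent names, and every solve count and cost figure are hypothetical placeholders. Only the 93.3% human baseline comes from the article.

```python
from dataclasses import dataclass

HUMAN_SUCCESS_RATE = 0.933  # human average reported in the article


@dataclass
class RunSummary:
    """Aggregate result of one agent on a CAPTCHA set (hypothetical structure)."""
    name: str
    solved: int
    attempted: int
    total_cost_usd: float

    @property
    def success_rate(self) -> float:
        return self.solved / self.attempted if self.attempted else 0.0

    @property
    def cost_per_solve(self) -> float:
        # Infinite when nothing was solved, so such runs sort last.
        return self.total_cost_usd / self.solved if self.solved else float("inf")

    @property
    def gap_to_human(self) -> float:
        return HUMAN_SUCCESS_RATE - self.success_rate


def pareto_front(runs):
    """Keep runs that no other run beats on both success rate and total cost."""
    front = []
    for r in runs:
        dominated = any(
            o.success_rate >= r.success_rate
            and o.total_cost_usd <= r.total_cost_usd
            and (o.success_rate > r.success_rate
                 or o.total_cost_usd < r.total_cost_usd)
            for o in runs
        )
        if not dominated:
            front.append(r)
    return front


# Placeholder numbers for illustration only; not results from the article.
runs = [
    RunSummary("agent_a", solved=8, attempted=20, total_cost_usd=10.0),
    RunSummary("agent_b", solved=5, attempted=20, total_cost_usd=2.0),
    RunSummary("agent_c", solved=4, attempted=20, total_cost_usd=5.0),
]
front = pareto_front(runs)

for r in runs:
    print(f"{r.name}: success={r.success_rate:.1%}, "
          f"cost/solve=${r.cost_per_solve:.2f}, "
          f"gap to human={r.gap_to_human:.1%}")
print("Pareto-optimal:", [r.name for r in front])
```

With these placeholder inputs, `agent_c` drops out because `agent_b` is both cheaper and more accurate, while the most accurate and the cheapest runs both stay on the front. That mirrors the kind of trade-off the article describes between the top-scoring model and the more cost-efficient ones.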