Multimodal Agents
GPT-4o Can't Even Solve CAPTCHAs? SOTA Models Hit Only a 40% Success Rate
量子位· 2025-06-04 05:21
Core Viewpoint - The article discusses the launch of Open CaptchaWorld, a research platform for evaluating how well multimodal agents solve CAPTCHA challenges, highlighting the significant gap between human performance and that of current state-of-the-art models [1][5][33]

Group 1: CAPTCHA Challenges for Multimodal Agents
- CAPTCHAs are a major bottleneck for deploying multimodal agents in real-world scenarios, particularly in high-value web applications such as e-commerce and ticketing [4][5]
- Current benchmarks often overlook CAPTCHA challenges, which are not edge cases but common obstacles in practical tasks [4][5]
- Open CaptchaWorld is designed to systematically assess agents' CAPTCHA-solving performance, providing a comprehensive evaluation framework [5][11]

Group 2: Performance Metrics and Findings
- Human success rates on CAPTCHAs average 93.3%, while state-of-the-art multimodal models achieve only 5%-40% [2][11]
- The platform includes 20 types of modern CAPTCHAs, with 225 samples in total, covering interaction types such as click sequences and image selection [9][11]
- A new evaluation metric, CAPTCHA Reasoning Depth, quantifies the cognitive complexity involved in solving a CAPTCHA, offering a more nuanced view of agent capabilities [11][19]

Group 3: Agent Behavior and Efficiency
- Many advanced agents exhibit inefficient problem-solving behavior, often over-complicating tasks and thereby raising error rates [22][24]
- OpenAI-o3 has the highest success rate at 40% but also incurs the highest operating costs, indicating a trade-off between performance and cost [28][30]
- Models such as Gemini2.5-Pro and GPT-4.1 strike a better balance between success rate (around 25%) and cost efficiency, suggesting room for optimization in future model designs [29][30]

Group 4: Implications for Future Research
- Open CaptchaWorld encourages researchers to confront CAPTCHA challenges directly, since overcoming these obstacles is essential for real-world deployment of agents [33]
- The findings highlight the need for new CAPTCHA designs that adapt to agents' evolving capabilities, ensuring ongoing relevance and security [34]
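The Reasoning Depth idea above can be illustrated with a small sketch. The article does not spell out Open CaptchaWorld's exact definition, so treat the values and the aggregation below as assumptions: depth is taken as the number of interaction steps an attempt required, reported alongside the raw success rate.

```python
# Hypothetical illustration of a "CAPTCHA Reasoning Depth"-style metric.
# The real benchmark's definition may differ; this only shows the shape
# of a depth-aware evaluation.
from statistics import mean

def depth_weighted_success(results):
    """results: list of (reasoning_depth, solved) pairs, one per attempt.
    Returns (raw success rate, mean depth of the puzzles that were solved)."""
    solved_depths = [depth for depth, solved in results if solved]
    rate = len(solved_depths) / len(results)
    avg_depth = mean(solved_depths) if solved_depths else 0.0
    return rate, avg_depth

attempts = [(2, True), (3, True), (5, False), (4, True), (6, False)]
rate, avg_depth = depth_weighted_success(attempts)
print(f"success rate: {rate:.0%}, mean depth of solved: {avg_depth:.1f}")
# → success rate: 60%, mean depth of solved: 3.0
```

Reporting depth next to success rate separates "fails on easy puzzles" from "fails only on deep multi-step ones", which is the nuance the metric is after.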
365 Days of a Hundred-Model Race: Hugging Face's Annual Review Reveals the VLM Capability Curve and Its Inflection Points | Jinqiu Select
锦秋集· 2025-05-16 15:42
Core Insights - The article discusses the rapid evolution of vision-language models (VLMs) and highlights the emergence of smaller yet powerful multimodal architectures, showcasing advances in capabilities such as multimodal reasoning and long-video understanding [1][3]

Group 1: New Model Trends
- The article introduces "any-to-any" models, which can take and produce multiple modalities (images, text, audio) by aligning the different modalities [5][6]
- New models such as Qwen 2.5 Omni and DeepSeek Janus-Pro-7B exemplify the latest advances in multimodal capability, enabling seamless input and output across modalities [6][10]
- The trend toward smaller, high-performing models ("smol yet capable") is gaining traction, favoring local deployment and lightweight applications [7][15]

Group 2: Reasoning Models
- Reasoning models capable of solving complex problems are emerging in the VLM space, with notable examples including Qwen's QVQ-72B-preview and Moonshot AI's Kimi-VL-A3B-Thinking [11][12]
- These models are designed to handle long videos and diverse document types, showcasing advanced reasoning capabilities [14]

Group 3: Multimodal Safety Models
- Multimodal safety models, which filter inputs and outputs to block harmful content, are increasingly needed; Google's ShieldGemma 2 is a notable example [31][32]
- Meta's Llama Guard 4 is highlighted as a dense multimodal safety model that can filter the outputs of vision-language models [34]

Group 4: Multimodal Retrieval-Augmented Generation (RAG)
- Multimodal RAG enhances retrieval over complex documents, allowing better integration of visual and textual data [35][38]
- Two main architectures for multimodal retrieval are introduced: DSE models and ColBERT-like models, each with a distinct approach to processing queries and returning relevant information [42][44]

Group 5: Multimodal Intelligent Agents
- The article highlights the emergence of vision-language-action (VLA) models that can interact with physical environments, with examples such as π0 and GR00T N1 showcasing their capabilities [21][22]
- Recent agent advances, such as ByteDance's UI-TARS-1.5, demonstrate the ability to navigate user interfaces and perform tasks in real time [47][54]

Group 6: Video Language Models
- The challenges of video understanding are addressed by models such as Meta's LongVU and Qwen2.5VL, which demonstrate advanced capabilities in processing video frames and understanding temporal relationships [55][57]

Group 7: New Benchmark Testing
- New benchmarks such as MMT-Bench and MMMU-Pro aim to evaluate VLMs across a wide variety of multimodal tasks [66][67][68]
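The ColBERT-like retrievers mentioned in Group 4 score documents by late interaction: each query embedding is matched against its best-scoring document embedding, and those maxima are summed (the MaxSim operator). A minimal sketch with toy 2-D vectors, whereas real models embed every token or image patch into high-dimensional space:

```python
# ColBERT-style late-interaction (MaxSim) scoring, sketched with toy
# 2-D embeddings. Real multimodal retrievers produce one vector per
# query token and per document token/patch.

def maxsim_score(query_vecs, doc_vecs):
    """Sum, over query vectors, of each one's best cosine match in the doc."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = sum(x * x for x in a) ** 0.5
        norm_b = sum(x * x for x in b) ** 0.5
        return dot / (norm_a * norm_b)
    return sum(max(cos(q, d) for d in doc_vecs) for q in query_vecs)

query = [(1.0, 0.0), (0.0, 1.0)]   # two query token embeddings
doc_a = [(0.9, 0.1), (0.1, 0.9)]   # patches that match both query tokens
doc_b = [(1.0, 0.0), (1.0, 0.1)]   # patches that match only the first token
print(maxsim_score(query, doc_a) > maxsim_score(query, doc_b))  # True
```

This is the key contrast with DSE-style (single-vector) retrievers: MaxSim keeps per-token granularity at ranking time instead of collapsing each document into one embedding before comparison.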
ByteDance Open-Sources Multimodal Agent UI-TARS-1.5, with a Focus on Strengthened High-Level Reasoning
Sou Hu Cai Jing· 2025-04-19 06:27
Core Insights - ByteDance's Seed Lab has officially released and open-sourced its next-generation multimodal intelligent agent UI-TARS-1.5, which significantly strengthens high-level reasoning compared with its predecessor [1][2]

Performance Metrics
- UI-TARS-1.5 outperforms previous models across benchmarks, scoring 42.5 on OSWorld (100 steps), 42.1 on Windows Agent Arena (50 steps), 84.8 on WebVoyager, and 75.8 on Online-Mind2web [2]
- The model also posts a notable improvement on Android World, scoring 64.2 compared with previous models [2]

Technological Advancements
- The model's capabilities rest on four technical pillars: enhanced visual perception, a System 2 reasoning mechanism, unified action modeling, and a self-evolving training paradigm [3]
- Enhanced visual perception lets the model understand interface elements in depth, providing a reliable information base for decision-making [3]
- The System 2 reasoning mechanism enables multi-step planning and decision-making, mimicking human deliberation [3]
- Unified action modeling improves action control and execution precision through a standardized action space [3]
- The self-evolving training paradigm lets the model learn from mistakes and adapt to complex task environments [3]

Practical Applications
- UI-TARS-1.5 functions as a practical "digital assistant," able to operate computers and systems, control browsers, and complete complex interactive tasks [4]
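The "unified action modeling" pillar can be pictured as normalizing every platform-specific operation into one schema before execution. UI-TARS-1.5's actual action format is not described in the article, so the field names and the `click(...)`/`type(...)` output syntax below are purely illustrative assumptions:

```python
# Hypothetical sketch of a unified action space for a GUI agent: model
# output strings are parsed into one normalized Action record regardless
# of target platform. Not UI-TARS-1.5's real format.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Action:
    kind: str                       # e.g. "click", "scroll", "type", "key"
    target: Optional[tuple] = None  # (x, y) screen coordinate, if any
    payload: str = ""               # text to type or key chord

def parse_action(line: str) -> Action:
    """Parse a model output like 'click(120,340)' or 'type(hello)'."""
    kind, _, rest = line.partition("(")
    args = rest.rstrip(")")
    if kind in ("click", "scroll"):
        x, y = (int(v) for v in args.split(","))
        return Action(kind, target=(x, y))
    return Action(kind, payload=args)

action = parse_action("click(120,340)")
print(action.kind, action.target)  # click (120, 340)
```

Funneling heterogeneous UI operations through one schema like this is what lets a single policy be trained and evaluated across OS, browser, and mobile environments.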
Wanlian Securities: Wanlian Morning Meeting - 20250311
Wanlian Securities· 2025-03-11 03:24
Core Insights
- The report indicates that consumption remains the primary driver of economic development, with policies aimed at boosting consumer confidence and spending power [10][12]
- The government plans to implement a "Special Action Plan to Boost Consumption," focusing on improving consumption capacity, increasing quality supply, and enhancing the consumption environment [10][12]

Market Review
- On Monday the A-share market fluctuated, with the Shanghai Composite Index closing down 0.19% at 3,366.16 points, the Shenzhen Component Index down 0.17%, and the ChiNext Index down 0.25% [2][7]
- Total A-share trading volume reached 1.51 trillion RMB, with over 3,000 stocks rising [2][7]
- Among sectors, coal and non-ferrous metals led the gains, while computers and media declined [2][7]
- The Hong Kong Hang Seng Index fell 1.85%, and the Hang Seng Tech Index dropped 2.52% [2][7]
- Internationally, all three major U.S. stock indices declined, with the Dow Jones down 2.08%, the S&P 500 down 2.70%, and the Nasdaq down 4.00% [2][7]

Important News
- The Guangdong provincial government issued policies to promote innovation in the artificial intelligence and robotics industries, focusing on key technology breakthroughs, nurturing quality enterprises, and expanding application scenarios [3][8]
- The China Academy of Information and Communications Technology has begun drafting technical standards for multimodal intelligent agents to accelerate their industrial application [3][8]

Investment Highlights
- Consumption policy will focus on two main areas: expanding the "trade-in" program and improving service quality [10][12]
- Trade-in subsidies will increase from 150 billion RMB to 300 billion RMB, with more categories made eligible [10][12]
- The report suggests the food and beverage sector, particularly the liquor industry, will face a heavier tax burden but may benefit from a shift toward direct sales to ease that pressure [12]
- The report highlights growth potential in the social services sector, particularly tourism and hospitality, driven by improved vacation policies and an expanding inbound tourism market [12]