Vision Language Models
AIs Can't Count Six Fingers, and It's Not That Simple
Hu Xiu· 2025-07-11 02:54
Core Viewpoint
- The article discusses the limitations of AI models in accurately interpreting images, highlighting that these models rely on memory and biases rather than true visual observation [19][20][48].

Group 1: AI Model Limitations
- All tested AI models, including Grok 4, OpenAI o3, and Gemini, consistently miscounted the number of fingers in an image, indicating a systemic issue in their underlying mechanisms [11][40].
- A recent paper titled "Vision Language Models are Biased" explains that large models do not genuinely "see" images but instead rely on prior knowledge and memory [14][19].
- The AI models demonstrated a strong tendency to adhere to preconceived notions, such as the belief that humans have five fingers, leading to incorrect outputs when faced with contradictory evidence [61][64].

Group 2: Experiment Findings
- Researchers conducted experiments in which AI models were shown altered images, such as an Adidas shoe with an extra stripe, yet all models incorrectly identified the number of stripes [39][40] (a minimal sketch of this kind of counting probe follows this summary).
- In another experiment, AI models struggled to accurately count legs on animals, achieving correct answers only 2 out of 100 times [45].
- The models' reliance on past experiences and biases resulted in significant inaccuracies, even when prompted to focus solely on the images [67].

Group 3: Implications for Real-World Applications
- The article raises concerns about the potential consequences of AI misjudgments in critical applications, such as quality control in manufacturing, where an AI might overlook defects due to its biases [72][76].
- Relying on AI for visual assessments in safety-critical scenarios, such as identifying tumors in medical imaging or assessing traffic situations, poses significant risks if the AI's biases lead to incorrect conclusions [77][78].
- The article emphasizes the need for human oversight in AI decision-making processes to mitigate the risks associated with AI's inherent biases and limitations [80][82].
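To make the counting experiments above concrete, here is a minimal sketch of a counted-attribute bias probe: ground-truth counts for deliberately altered images are compared against a model's answers. Everything here is illustrative; `query_vlm` is a hypothetical stand-in for whatever vision-language model is being tested, and its simulated behavior simply mimics a model that falls back on its prior instead of looking at the image.

```python
# Minimal sketch of a counted-attribute bias probe, in the spirit of the
# "Vision Language Models are Biased" experiments described above.
# `query_vlm` is a hypothetical stand-in for a real VLM API call.

from dataclasses import dataclass

@dataclass
class Probe:
    image_path: str        # altered image (e.g. a hand with six fingers)
    question: str          # counting question posed to the model
    canonical_count: int   # count the model "expects" from prior knowledge
    true_count: int        # count actually present in the altered image

def query_vlm(image_path: str, question: str, canonical_count: int) -> int:
    """Hypothetical VLM call. Here it simply simulates a model that falls
    back on its prior (the canonical count) instead of inspecting the image."""
    return canonical_count

def run_probes(probes: list[Probe]) -> float:
    """Fraction of probes where the answer matches the altered image."""
    correct = sum(
        query_vlm(p.image_path, p.question, p.canonical_count) == p.true_count
        for p in probes
    )
    return correct / len(probes)

if __name__ == "__main__":
    probes = [
        Probe("hand_six_fingers.png", "How many fingers does this hand have?", 5, 6),
        Probe("adidas_four_stripes.png", "How many stripes are on this shoe?", 3, 4),
        Probe("five_legged_dog.png", "How many legs does this animal have?", 4, 5),
    ]
    # A fully prior-driven model scores 0% on altered images.
    print(f"Accuracy on altered images: {run_probes(probes):.0%}")
```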
AIs Can't Count Six Fingers, and It's Not That Simple.
数字生命卡兹克· 2025-07-10 20:40
Core Viewpoint
- The article discusses the inherent biases in AI visual models, emphasizing that these models do not truly "see" images but rely on memory and preconceived notions, leading to significant errors in judgment [8][24][38].

Group 1: AI Model Limitations
- All tested AI models consistently miscounted the number of fingers in an image, with the majority asserting there were five fingers, despite the image showing six [5][12][17].
- A study titled "Vision Language Models are Biased" reveals that AI models often rely on past experiences and associations rather than actual visual analysis [6][8][18].
- The models' reliance on prior knowledge leads to a failure to recognize discrepancies in images, as they prioritize established beliefs over new visual information [24][28][36].

Group 2: Implications of AI Bias
- The article highlights the potential dangers of AI biases in critical applications, such as quality control in manufacturing, where AI might overlook defects due to their rarity in the training data [30][34].
- The consequences of these biases can be severe, potentially leading to catastrophic failures in real-world scenarios, such as automotive safety [33][35].
- The article calls for a cautious approach to relying on AI for visual judgments, stressing the importance of human oversight and verification [34][39].
Learning by Playing? Game-Code-Driven Data Synthesis Improves General Reasoning in Multimodal Large Models
机器之心· 2025-07-04 08:59
Core Insights
- The article presents a novel approach called Code2Logic, which utilizes game code to synthesize multimodal reasoning data, enhancing the reasoning capabilities of visual language models (VLMs) [47][48].
- The research indicates that training AI on game scenarios can significantly improve its performance in geometric and graphical reasoning tasks [1][24].

Data and Model
- The scarcity of high-quality multimodal reasoning data limits the advancement of VLMs' complex reasoning abilities, prompting the need for a cost-effective method to generate such data [4].
- The research team from Fudan University and ByteDance proposes leveraging game code to automatically synthesize visual reasoning data, capitalizing on the structured nature of games [12][13].

Methodology
- The Code2Logic method involves three core steps: generating game code using large language models (LLMs), designing question-answer templates from the game code, and constructing an automated data engine to generate Q&A instances [13][14][15] (a toy sketch of this pipeline appears after this summary).
- The GameQA dataset created through this method encompasses 30 games, 158 reasoning tasks, and 140,000 Q&A pairs, showcasing its scalability and diversity [18].

Training and Performance
- Training on GameQA data leads to significant performance improvements on both in-domain and out-of-domain tasks, demonstrating the generalization capabilities of models trained with this dataset [24][25].
- The study reveals that models trained with GameQA outperform those trained on traditional geometric reasoning datasets, indicating the cognitive diversity and reasoning complexity inherent in game data [28][29].

Scaling Effects
- The research identifies two scaling effects: increased game variety enhances out-of-domain generalization, and sample diversity correlates positively with generalization performance [37][38].
- These findings suggest that the diversity and scalability of GameQA contribute to stronger generalization in reasoning tasks [39].

Limitations and Challenges
- The analysis highlights key limitations in VLMs' reasoning capabilities, particularly in 3D spatial perception, pattern recognition, and strategic planning [42][45].
- The study emphasizes the need for further improvements in models' abilities to handle complex reasoning tasks effectively [46].
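As a rough illustration of the three-step pipeline above (game code, Q&A templates, automated data engine), the sketch below generates verifiable counting questions from a toy board game. The game, the single template, and all names are assumptions for illustration only; this is not the actual Code2Logic/GameQA implementation.

```python
# Illustrative sketch of the Code2Logic idea: reuse game code (step 1) plus
# question-answer templates (step 2) in an automated engine that emits Q&A
# instances with verifiable answers (step 3). The toy "game" and the single
# template below are assumptions for illustration, not the paper's pipeline.

import random

def random_board(size: int = 4, n_pieces: int = 5, seed: int = 0) -> list[list[str]]:
    """Game code: place pieces on an empty board (state generator)."""
    rng = random.Random(seed)
    board = [["." for _ in range(size)] for _ in range(size)]
    cells = [(r, c) for r in range(size) for c in range(size)]
    for r, c in rng.sample(cells, n_pieces):
        board[r][c] = "X"
    return board

def qa_from_template(board: list[list[str]], row: int) -> dict:
    """Q&A template: the answer is computed from the game state, so it is
    guaranteed correct and needs no manual annotation."""
    answer = sum(1 for cell in board[row] if cell == "X")
    rendering = "\n".join(" ".join(r) for r in board)
    return {
        "state": rendering,
        "question": f"How many pieces are in row {row} (0-indexed)?",
        "answer": str(answer),
    }

if __name__ == "__main__":
    # Data engine: sweep over seeds and template parameters to scale the dataset.
    dataset = [qa_from_template(random_board(seed=s), row=r)
               for s in range(3) for r in range(4)]
    print(f"Generated {len(dataset)} Q&A pairs; example:\n{dataset[0]}")
```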
What Exactly Is This Year's Red-Hot Goal-Oriented Navigation? What Are the Routes from Goal Search to Goal Arrival?
具身智能之心· 2025-06-26 14:19
Core Viewpoint
- Goal-Oriented Navigation empowers robots to autonomously complete navigation tasks based on goal descriptions, marking a significant shift from traditional visual language navigation systems [2][3].

Group 1: Technology Overview
- Embodied navigation is a core area of embodied intelligence, relying on three technical pillars: language understanding, environmental perception, and path planning [2].
- Goal-Oriented Navigation requires robots to explore and plan paths in unfamiliar 3D environments using only goal descriptions such as coordinates, images, or natural language [2].
- The technology has been industrialized in various verticals, including delivery, healthcare, and hospitality, enhancing service efficiency [3].

Group 2: Technological Evolution
- The evolution of Goal-Oriented Navigation can be categorized into three generations:
  - First Generation: end-to-end methods focused on reinforcement learning and imitation learning, achieving breakthroughs in Point Navigation and closed-set image navigation tasks [5].
  - Second Generation: modular methods that explicitly construct semantic maps, breaking the task into exploration and goal localization [5] (a schematic sketch of such a modular loop follows this summary).
  - Third Generation: integration of large language models (LLMs) and visual language models (VLMs) to enhance knowledge reasoning and open-vocabulary target matching [7].

Group 3: Challenges and Learning Path
- The complexity of embodied navigation, particularly Goal-Oriented Navigation, requires knowledge from multiple fields, making it challenging for newcomers to enter the domain [9].
- A new course has been developed to address these challenges, focusing on quick entry, building a research framework, and combining theory with practice [10][11][12].

Group 4: Course Structure
- The course covers the theoretical foundations and technical lineage of Goal-Oriented Navigation, including task definitions and evaluation benchmarks [15].
- It also delves into the Habitat simulation ecosystem, end-to-end navigation methodologies, modular navigation architectures, and LLM/VLM-driven navigation systems [16][18][20][22].
- A major project focuses on reproducing the VLFM algorithm and deploying it in real-world scenarios [24].
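For the "second generation" modular approach mentioned above, the following schematic sketch shows how mapping, exploration, and goal localization fit together in one loop. All components are toy stubs on a tiny grid world, standing in for real perception, semantic mapping, and path-planning modules; it is not drawn from VLFM or any other specific system in the article.

```python
# Schematic sketch of a modular object-goal navigation loop:
# observe -> update semantic map -> locate goal or explore -> move.

GOAL, FREE = "G", "."

def observe(world, pos):
    """Perception stub: reveal the 3x3 neighbourhood around the agent."""
    r, c = pos
    return {(r + dr, c + dc): world.get((r + dr, c + dc))
            for dr in (-1, 0, 1) for dc in (-1, 0, 1)}

def pick_next(semantic_map, pos):
    """Goal localization first; otherwise a crude exploration heuristic that
    heads toward the most recently observed free cell."""
    goals = [p for p, v in semantic_map.items() if v == GOAL]
    if goals:
        return goals[0], True
    frontier = [p for p, v in semantic_map.items() if v == FREE]
    return (frontier[-1] if frontier else pos), False

def navigate(world, start, max_steps=20):
    semantic_map, pos = {}, start
    for step in range(max_steps):
        for cell, label in observe(world, pos).items():
            if label is not None:
                semantic_map[cell] = label      # mapping: accumulate observations
        target, found = pick_next(semantic_map, pos)
        pos = target                            # path-planning stub: jump to the waypoint
        if found:
            return f"reached goal at {pos} after {step + 1} steps"
    return "goal not found"

if __name__ == "__main__":
    # A 1-D corridor with the goal object at the far end.
    world = {(0, c): FREE for c in range(6)}
    world[(0, 6)] = GOAL
    print(navigate(world, start=(0, 0)))
```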
New from Shanghai Jiao Tong University: DyNaVLM, a Zero-Shot, End-to-End Navigation Framework
具身智能之心· 2025-06-22 10:56
Core Viewpoint
- The article discusses the development of DyNaVLM, a zero-shot, end-to-end navigation framework that integrates vision-language models (VLMs) to enhance navigation capabilities in dynamic environments, overcoming limitations of traditional methods [4][5].

Group 1: Introduction and Optimization Goals
- Navigation is a fundamental capability of autonomous agents, requiring spatial reasoning, real-time decision-making, and adaptability to dynamic environments. Traditional methods face challenges in generalization and scalability due to their modular design [4].
- The advancement of VLMs offers new possibilities for navigation by integrating perception and reasoning within a single framework, although their application in embodied navigation is limited by spatial granularity and contextual reasoning capabilities [4].

Group 2: Core Innovations of DyNaVLM
- **Dynamic Action Space Construction**: DyNaVLM introduces a dynamic action space that allows robots to determine navigation goals based on visual information and language instructions, enhancing movement flexibility in complex environments [6].
- **Collaborative Graph Memory Mechanism**: Inspired by retrieval-augmented generation (RAG), this mechanism enhances memory management for better navigation performance [8].
- **Training-Free Deployment**: DyNaVLM can be deployed without task-specific fine-tuning, reducing deployment costs and improving generalization across different environments and tasks [8].

Group 3: System Architecture and Methodology
- **Problem Formalization**: The system takes inputs such as target descriptions and RGB-D observations to determine appropriate actions, maintaining a memory function to extract spatial features [11].
- **Memory Manager**: This component connects the VLM and the graph-structured memory, capturing spatial relationships and semantic object information [12].
- **Action Proposer and Selector**: The action proposer simplifies the continuous search space into discrete candidates, while the selector generates final navigation actions from the geometric candidates and contextual memory [14][15] (a rough structural sketch follows this summary).

Group 4: Experimental Evaluation
- **Simulation Environment Evaluation**: DyNaVLM achieved a success rate (SR) of 45.0% and Success weighted by Path Length (SPL) of 0.232 on the ObjectNav benchmark, outperforming previous VLM frameworks [19][22].
- **Real-World Evaluation**: DyNaVLM demonstrated superior performance in real-world settings, particularly in tasks requiring the identification of multiple targets, showcasing its robustness and efficiency in dynamic environments [27].
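The sketch below illustrates the proposer/selector split described above in the simplest possible terms: a proposer discretizes the space around the robot into candidate waypoints, and a selector picks one using a toy graph memory. The geometry, the memory structure, and the `vlm_select` placeholder are all assumptions for illustration and do not reflect DyNaVLM's actual implementation.

```python
# Rough structural sketch of an action proposer / selector pair with a toy
# graph memory. All names and logic are simplified placeholders.

import math
from dataclasses import dataclass, field

@dataclass
class GraphMemory:
    """Toy stand-in for the collaborative graph memory: object label -> position."""
    nodes: dict = field(default_factory=dict)

    def update(self, label: str, position: tuple):
        self.nodes[label] = position

def propose_candidates(robot_pos, free_directions, step=0.5):
    """Action proposer: turn the continuous space around the robot into a
    small discrete set of candidate waypoints."""
    x, y = robot_pos
    return [(x + step * math.cos(a), y + step * math.sin(a)) for a in free_directions]

def vlm_select(goal: str, candidates, memory: GraphMemory):
    """Action selector placeholder: a real system would prompt a VLM with the
    goal, the candidates, and retrieved memory. Here: move toward the goal
    object if the memory graph already contains it, else take the first candidate."""
    if goal in memory.nodes:
        gx, gy = memory.nodes[goal]
        return min(candidates, key=lambda c: (c[0] - gx) ** 2 + (c[1] - gy) ** 2)
    return candidates[0]

if __name__ == "__main__":
    memory = GraphMemory()
    memory.update("chair", (2.0, 1.0))
    candidates = propose_candidates((0.0, 0.0), free_directions=[0.0, math.pi / 2, math.pi])
    print("next waypoint:", vlm_select("chair", candidates, memory))
```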
万马科技20250612
2025-06-12 15:07
Summary
- 万马科技 entered the connected-vehicle (车联网) business by acquiring 有方科技; connected-vehicle revenue grew from RMB 50 million in 2021 to RMB 260 million in 2024, with profit also rising significantly, and the company has built a complete data closed-loop toolchain and an intelligent-driving computing center.
- Connected-vehicle penetration is roughly 80% in the domestic market but under 30% overseas. As intelligent driving drives up demand for data, both markets have substantial room to grow; Robotaxi in particular has higher requirements for real-time data monitoring and technology, with a marked increase in per-vehicle value.
- 优卡科技 provides two main solutions, the 蓝海 global vehicle-connectivity platform and a cloud-based autonomous-driving data closed loop, supporting 14 million vehicles; customers include Geely, SAIC, Dongfeng, and Li Auto, and it supports Robotaxi operators' business deployments worldwide.
- Robotaxi is regarded as the "jewel in the crown" of the connected-vehicle industry; Goldman Sachs forecasts an annualized growth rate of 96% for China's Robotaxi market. Regular operations are already running in Beijing, Wuhan, and Guangzhou, as well as Hong Kong and Dubai, and Tesla is about to launch a related service.
- Robotaxi operations place extremely high demands on network quality, covering operational safety, user interaction, compliance, autonomous-driving data collection, and operations and maintenance, and require high-definition maps, vehicle-road coordination, remote rescue of stuck vehicles, and support for massive data volumes.
...data monitoring demand is high, and the requirements on technology and data volume are also higher; in terms of per-vehicle value ...
CICC《秒懂研报》| Intelligent Driving: A New Era Leading the Transformation of Mobility
中金点睛· 2025-05-24 08:32
Group 1: Core Viewpoints
- The article discusses the rapid development and potential of intelligent driving technology, highlighting its transformative impact on urban mobility and the automotive industry [1][2][3].

Group 2: Technology Engine Behind Intelligent Driving
- The end-to-end architecture is a significant innovation in intelligent driving, reducing data annotation difficulty and optimizing data processing through unique algorithms, which enhances vehicle responsiveness to road conditions [2][3].
- The introduction of visual language models and cloud models improves the system's ability to handle complex scenarios, akin to equipping vehicles with sharper "eyes" [3].

Group 3: Current Development of Intelligent Driving
- The highway Navigation on Autopilot (NOA) feature is expected to be scaled up in 2024, becoming standard for intelligent driving vehicles priced above 200,000 yuan [5].
- The penetration rate of urban NOA is projected to reach 6.5% in 2024, driven by increased consumer acceptance and reduced costs, expanding its availability to more consumers [7].

Group 4: Business Model of Intelligent Driving
- L2++ intelligent driving software faces challenges in charging fees due to low consumer willingness to pay, leading most automakers to offer the systems as standard equipment in order to accumulate users and data [11].
- Some leading automakers are exploring buyout or subscription payment models, with promotional activities to attract customers [11][12].

Group 5: Benefits of Urban NOA
- Urban NOA is expected to drive sales of highly configured, high-margin models, as consumers are likely to prefer higher-end vehicles once the technology gains market acceptance [13][14].
- The overlap in technology requirements between Robotaxi and urban NOA is anticipated to enhance intelligent driving system capabilities, potentially leading to a shift toward mobility services by 2025 [15].

Group 6: Globalization of the Intelligent Driving Industry
- China's late start in intelligent driving is countered by rapid development, with domestic companies gaining advantages in technology and production experience, positioning them favorably in the global market [16].
- Collaborations between joint-venture automakers and domestic intelligent driving companies are expected to facilitate access to international projects and opportunities for global expansion [16][17].
The Race and Hidden Battle in Intelligent Assisted Driving: Self-Development vs. Partnership Camps as the Capability Gap Widens
Bei Ke Cai Jing· 2025-05-22 10:37
Core Insights
- The article discusses the advancements and competitive landscape of the assisted driving industry, highlighting various companies' self-developed systems and strategies [1][4].

Group 1: Company Developments
- Li Auto has launched its new-generation dual-system intelligent driving solution, focusing on upgrading driving capabilities and synchronizing updates for smart electric vehicles [3].
- NIO's intelligent assisted driving system has reportedly avoided over 3.5 million collision risks, accumulating a total driving mileage of approximately 4.94 billion kilometers as of May 15, 2025 [3].
- Chery's Hawk 500 has achieved widespread adoption of assisted driving features, with the Hawk 700 targeting mid-to-high-end models and the Hawk 900 positioned as a flagship [3].
- GAC Group's GSD intelligent driving assistance system has accumulated 5 million user driving scenarios and over 40 million kilometers of high-level autonomous driving data [3].

Group 2: Industry Trends
- BYD and XPeng are recognized as leaders in self-developed intelligent driving systems, with BYD's high-end system named "Tianshen Eye" [4].
- Bosch's China president has expressed skepticism about the self-development model, suggesting that mid-level intelligent driving should become standard and that costs could be better managed through supply-chain partnerships [4].
- Huawei is positioned as a top player in the intelligent driving system market, with plans for 10 brands from 7 automakers to adopt its solutions, potentially exceeding 500,000 vehicles [4][5].
- Huawei's collaboration models include component supply, Huawei Inside (HI) partnerships, and deep cooperation with automakers, with the latter being the most integrated approach [5].

Group 3: Strategic Partnerships
- SAIC Group has publicly stated its intention to maintain control over core technologies while also choosing to collaborate with Huawei [6].
- The partnerships with Huawei have led to increased sales for collaborating automakers, but questions remain about their ability to independently develop high-quality vehicles [6].
85x Faster: Apple Open-Sources FastVLM, a Vision-Language Model That Runs Directly on iPhone
机器之心· 2025-05-16 16:31
Core Viewpoint
- Apple has open-sourced FastVLM, an efficient vision-language model that can run directly on iPhones, significantly enhancing on-device visual understanding capabilities [2][6].

Group 1: Model Features and Performance
- FastVLM addresses both model size and speed, achieving an 85-fold speedup in time-to-first-token compared with traditional models [6].
- The model uses a new hybrid visual encoder, FastViTHD, which combines convolutional layers and transformer modules, reducing the number of visual tokens needed for image processing by 16x compared with a traditional ViT and 4x compared with FastViT [6][16] (the token-count arithmetic is sketched after this summary).
- FastVLM is available in three parameter sizes: 0.5B, 1.5B, and 7B, each with stage-2 and stage-3 fine-tuning weights [7].

Group 2: Technical Innovations
- The research emphasizes the importance of image resolution for VLM performance, particularly on text- and data-dense tasks, while also addressing the challenges of high-resolution image processing [12][13].
- FastViTHD is specifically designed to make VLMs efficient on high-resolution images, achieving significant improvements in accuracy and latency compared with existing methods [16][33].
- The FastViTHD encoder has five stages and a total parameter count of 125.1M, smaller than most mainstream ViT architectures while maintaining competitive performance [36][37].

Group 3: Efficiency and Optimization
- FastVLM demonstrates superior accuracy-latency trade-offs, outperforming previous models such as ViT and FastViT under various conditions [46][47].
- The model's design allows input resolution to be adjusted dynamically, optimizing performance for the specific task and hardware capabilities [48][49].
- FastVLM's performance surpasses traditional token-pruning methods, achieving lower visual token counts while maintaining higher accuracy [50][51].
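The token-count claims above follow from simple arithmetic: visual tokens scale roughly as (resolution / output stride) squared, so an encoder with a larger effective output stride hands far fewer tokens to the language model. The strides in the sketch below are illustrative assumptions chosen to reproduce the reported 16x and 4x ratios; they are not the exact FastViTHD configuration.

```python
# Back-of-the-envelope arithmetic for visual token counts at different
# effective output strides. Strides are illustrative assumptions only.

def visual_tokens(resolution: int, output_stride: int) -> int:
    """Tokens produced for a square image at the given effective output stride."""
    side = resolution // output_stride
    return side * side

if __name__ == "__main__":
    resolution = 1024  # a high-resolution input, where token count hurts most
    encoders = {
        "ViT-style (stride 16)": 16,
        "FastViT-style (stride 32)": 32,
        "FastViTHD-style (stride 64)": 64,
    }
    for name, stride in encoders.items():
        print(f"{name:28s} -> {visual_tokens(resolution, stride):5d} visual tokens")
    # 4096 vs 1024 vs 256 tokens: the 16x and 4x reductions cut the LLM's
    # pre-fill work, which is what drives the time-to-first-token speedup.
```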
365 Days of a Hundred Competing Models: Hugging Face's Annual Review Reveals the VLM Capability Curve and Its Inflection Points | Jinqiu Select
锦秋集· 2025-05-16 15:42
Core Insights
- The article discusses the rapid evolution of vision language models (VLMs) and highlights the emergence of smaller yet powerful multimodal architectures, showcasing advances in capabilities such as multimodal reasoning and long-video understanding [1][3].

Group 1: New Model Trends
- The article introduces the concept of "any-to-any" models, which can take and produce multiple modalities (images, text, audio) by aligning the different modalities [5][6].
- New models like Qwen 2.5 Omni and DeepSeek Janus-Pro-7B exemplify the latest advances in multimodal capabilities, enabling seamless input and output across modalities [6][10].
- The trend of smaller, high-performance models ("smol yet capable") is gaining traction, promoting local deployment and lightweight applications [7][15].

Group 2: Reasoning Models
- Reasoning models capable of solving complex problems are emerging in the VLM space, with notable examples including Qwen's QVQ-72B-preview and Moonshot AI's Kimi-VL-A3B-Thinking [11][12].
- These models are designed to handle long videos and various document types, showcasing their advanced reasoning capabilities [14].

Group 3: Multimodal Safety Models
- The need for multimodal safety models, which filter inputs and outputs to prevent harmful content, is emphasized, with Google's ShieldGemma 2 as a notable example [31][32].
- Meta's Llama Guard 4 is highlighted as a dense multimodal safety model that can filter the outputs of vision language models [34].

Group 4: Multimodal Retrieval-Augmented Generation (RAG)
- The development of multimodal RAG is discussed; it enhances retrieval over complex documents by better integrating visual and textual data [35][38].
- Two main architectures for multimodal retrieval are introduced: DSE models and ColBERT-like late-interaction models, each with a distinct approach to processing queries and returning relevant information [42][44] (a minimal scoring sketch follows this summary).

Group 5: Multimodal Intelligent Agents
- The article highlights the emergence of vision-language-action (VLA) models that can interact with physical environments, with examples like π0 and GR00T N1 showcasing their capabilities [21][22].
- Recent advances in intelligent agents, such as ByteDance's UI-TARS-1.5, demonstrate the ability to navigate user interfaces and perform tasks in real time [47][54].

Group 6: Video Language Models
- The challenges of video understanding are addressed, with models like Meta's LongVU and Qwen2.5VL demonstrating advanced capabilities in processing video frames and understanding temporal relationships [55][57].

Group 7: New Benchmarks
- The article discusses the emergence of new benchmarks such as MMT-Bench and MMMU-Pro, aimed at evaluating VLMs across a wide variety of multimodal tasks [66][67][68].
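As a minimal illustration of the "ColBERT-like" late-interaction retrieval mentioned above, the sketch below scores a query against document pages with MaxSim: each query-token embedding keeps only its best-matching page-patch embedding, and the maxima are summed. The tiny hand-written vectors are placeholders, not outputs of any real embedding model.

```python
# Minimal sketch of late-interaction (MaxSim) scoring for multimodal retrieval.
# Query-token vectors are matched against document-patch vectors; each query
# token contributes its best match, and the contributions are summed.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_vecs, doc_vecs):
    """Sum over query tokens of the best match among document patches."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

if __name__ == "__main__":
    query = [[1.0, 0.0], [0.0, 1.0]]               # two query-token embeddings
    docs = {
        "page_a": [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]],  # page with matching patches
        "page_b": [[0.1, 0.1], [0.2, 0.1], [0.0, 0.2]],  # mostly irrelevant page
    }
    ranked = sorted(docs, key=lambda name: maxsim_score(query, docs[name]), reverse=True)
    for name in ranked:
        print(name, round(maxsim_score(query, docs[name]), 2))
```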