Visual Foundation Models

ICCV 2025 Highlight | A New Paradigm for 3D Ground-Truth Generation: Automated Semantic Occupancy Annotation for Open Driving Scenes!
机器之心· 2025-08-29 00:15
Core Viewpoint - The article presents AutoOcc, an innovative framework for automatic open-ended 3D semantic occupancy annotation that surpasses existing methods without requiring human labeling, showing excellent generalization capability [5][11][26].

Summary by Sections

Introduction - AutoOcc is developed by the VDIG laboratory at Peking University, led by researchers Zhou Xiaoyu and Wang Yongtao, and has been recognized at top conferences and competitions in the computer vision field [2][4].

Problem Statement - Generating accurate and complete semantic occupancy annotations from raw sensor data at low cost remains a significant challenge in autonomous driving and embodied intelligence [5][8].

Methodology - AutoOcc uses a vision-language model (VLM) to produce semantic attention maps for scene description and to dynamically expand the semantic list, while a self-estimating optical flow module identifies and handles dynamic objects during temporal rendering (a hedged sketch of the attention-map idea follows this summary) [5][11][17].

Key Innovations - The framework introduces a 3D Gaussian representation (VL-GS) that models complete 3D geometry and semantics in driving scenes, demonstrating superior representation efficiency, accuracy, and perception capability [6][17].

Experimental Results - Extensive experiments show that AutoOcc outperforms existing automated 3D semantic occupancy annotation methods and exhibits remarkable zero-shot generalization across datasets [7][21][22].

Comparison with Existing Methods - AutoOcc is contrasted with traditional pipelines that rely on human labeling and extensive post-processing, highlighting its speed and open-ended semantic annotation capability [14][21].

Performance Metrics - The framework shows clear advantages in robustness and open semantic labeling, achieving state-of-the-art performance both on specific semantic categories and across datasets [20][21].

Efficiency Evaluation - AutoOcc notably reduces computational cost while improving annotation performance, balancing efficiency and flexibility without relying on human annotations [24][25].

Conclusion - The article concludes that AutoOcc represents a significant advance in automated open-semantic 3D occupancy annotation, integrating vision-language model guidance with differentiable 3D Gaussian techniques [26].
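The article does not publish AutoOcc's code, so the following is only a minimal sketch of the general idea behind VLM-derived semantic attention maps: per-patch image features are scored against text embeddings of open-vocabulary class prompts, yielding one coarse attention map per class. All tensors, shapes, and names here are illustrative placeholders, not AutoOcc's actual interface.

```python
import torch
import torch.nn.functional as F

# Illustrative stand-ins for a VLM's outputs; the article does not describe
# AutoOcc's encoders at this level of detail.
patch_feats = torch.randn(1, 14 * 14, 512)  # (batch, image patches, feature dim)
text_feats = torch.randn(3, 512)            # one embedding per class prompt, e.g. "car"

# Cosine similarity between every image patch and every class prompt
sim = F.normalize(patch_feats, dim=-1) @ F.normalize(text_feats, dim=-1).T
attn = sim.softmax(dim=-1)                  # per-patch distribution over classes

# Reshape into one coarse 14x14 attention map per class
attn_maps = attn.permute(0, 2, 1).reshape(1, 3, 14, 14)
print(attn_maps.shape)                      # torch.Size([1, 3, 14, 14])
```

Because the class prompts are just text embeddings, new categories can be appended at inference time, which is one plausible reading of how the "dynamically expanded semantic list" could work.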
What Exactly Is a Large Model? What Technical Areas Does It Cover? An In-Depth Primer for Beginners!
自动驾驶之心· 2025-08-05 23:32
Core Insights - The article provides a comprehensive overview of large language models (LLMs): their definitions, architectures, capabilities, and notable developments in the field [3][6][12].

Group 1: Definition and Characteristics of LLMs
- Large language models (LLMs) are deep learning models trained on vast amounts of text data, capable of understanding and generating natural language [3][6].
- Key features of modern LLMs include large-scale parameters (e.g., GPT-3 with 175 billion parameters), the Transformer architecture, pre-training followed by fine-tuning, and multi-task adaptability [6][12].

Group 2: LLM Development and Architecture
- The Transformer architecture, introduced by Google in 2017, is the foundational technology for LLMs, consisting of an encoder and a decoder [9].
- Encoder-only architectures, like BERT, excel at text understanding, while decoder-only architectures, such as GPT, are optimized for text generation (a minimal decoder-block sketch appears after this summary) [10][11].

Group 3: Core Capabilities of LLMs
- LLMs can generate coherent text, assist with coding, answer factual questions, and perform multi-step reasoning [12][13].
- They also excel at text understanding and conversion tasks such as summarization and sentiment analysis [13].

Group 4: Notable LLMs and Their Features
- OpenAI's GPT series is a key driver of LLM development, known for strong general capabilities and continuous innovation [15][16].
- Meta's Llama series emphasizes open-source development and multi-modal capabilities, significantly influencing the AI community [17][18].
- Alibaba's Qwen series focuses on fully open-source models with strong support for Chinese and multi-language tasks [18].

Group 5: Visual Foundation Models
- Visual foundation models are essential for processing visual inputs, connecting visual data to LLMs [25].
- They use architectures such as Vision Transformers (ViT), as well as hybrids of CNNs and Transformers, for tasks including image classification and cross-modal understanding [26][27].

Group 6: Speech Large Models
- Speech large models handle a range of speech-related tasks, leveraging large-scale speech data for training [31].
- They primarily use Transformer architectures to capture long-range dependencies in speech, enabling tasks like speech recognition and translation [32][36].

Group 7: Multi-Modal Large Models (MLLMs)
- Multi-modal large models can process and understand multiple data types, such as text, images, and audio, enabling complex interactions [39].
- Their architecture typically comprises pre-trained modal encoders, a large language model, and a modal decoder for generating outputs [40].

Group 8: Reasoning Large Models
- Reasoning large models enhance LLM reasoning through optimized prompting and external knowledge integration [43][44].
- They focus on improving the accuracy and controllability of complex tasks without fundamentally altering the model structure [45].
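To make the decoder-only idea concrete, here is a minimal sketch of a single decoder block in PyTorch. The causal mask is what distinguishes it from an encoder block: each token may attend only to earlier positions. Dimensions and class names are illustrative, not from the article.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One decoder-only Transformer block: masked self-attention + MLP."""
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Causal mask: True above the diagonal blocks attention to future tokens
        seq_len = x.size(1)
        mask = torch.triu(
            torch.ones(seq_len, seq_len, device=x.device), diagonal=1
        ).bool()
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out                      # residual connection
        x = x + self.mlp(self.norm2(x))       # position-wise feed-forward
        return x

x = torch.randn(2, 16, 512)  # (batch, sequence, embedding)
print(DecoderBlock()(x).shape)  # torch.Size([2, 16, 512])
```

A production LLM stacks dozens of such blocks and adds token embeddings plus an output projection; encoder-only models like BERT use the same block without the causal mask.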
Clever! One Classic Technique Lets a Domestic Visual Foundation Model Score Big
量子位· 2025-05-23 06:14
Core Viewpoint - The article highlights significant advances in domestic AI, focusing on the Glint-MVT model developed by GeLing Deep Vision, which outperforms international visual foundation models such as CLIP and OpenCLIP [1][2].

Performance Evaluation - Linear probing was used to assess the pretrained model's effectiveness: across 26 classification test sets, the domestic visual foundation model achieved an average accuracy 2.3% higher than OpenCLIP and 1.1% higher than CLIP (a minimal linear-probing sketch appears after this summary) [2].

Application Effectiveness - Glint-MVT excels at downstream tasks such as image understanding and segmentation, accurately segmenting complex images and identifying objects even when partially occluded [4][8][12].

Technical Innovations - Glint-MVT is a margin-based pretrained Vision Transformer (MVT) that introduces a Margin Softmax loss function, improving generalization by reducing the impact of data noise (a hedged loss sketch follows this summary) [13][26]. The model constructs virtual categories by clustering large datasets such as LAION-400M into one million virtual classes, improving data-scale efficiency (see the clustering sketch below) [28].

Model Variants - Glint-RefSeg, built on Glint-MVT, achieves state-of-the-art (SOTA) referring expression segmentation without extensive training data [14]. The MVT-VLM model demonstrates strong image understanding, accurately identifying details such as the color and number on athletes' jerseys [15][16].

Broader Applications - Glint-RefSeg also applies to video segmentation, maintaining accuracy in dynamic scenes, as demonstrated on a video of Bruno Mars [19][21]. The model's versatility extends to embodied-intelligence scenarios, where it answers contextual questions about object placement [22][25].

Company Development - GeLing Deep Vision has worked in computer vision since 2013, focusing on practical applications of AI that address industry pain points, as exemplified by Glint-MVT [36][37]. The company emphasizes balancing technical innovation with practical application rather than chasing purely academic metrics [38][39].

Community Engagement - GeLing Deep Vision takes an open-source approach while maintaining a focus on innovation, aiming to foster a collaborative ecosystem that encourages community contributions [40]. Its leadership, including the director of the algorithm research institute, stresses the value of youthful thinking and hands-on experience in driving technical progress [41][42].

Industry Perspective - The article argues that AI development is shifting from general exploration to specialized applications, with companies like GeLing Deep Vision playing a crucial role in this transition [44].
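Linear probing is the "classic technique" of the headline: freeze the pretrained backbone and train only a linear classifier on its features, so accuracy directly measures feature quality. Below is a minimal sketch; the toy `backbone` stands in for a pretrained ViT such as Glint-MVT and is purely a placeholder.

```python
import torch
import torch.nn as nn

# Hypothetical frozen backbone; a stand-in for a real pretrained encoder.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 768))
for p in backbone.parameters():
    p.requires_grad = False  # the encoder is never updated

num_classes = 10  # illustrative; the article evaluates 26 classification test sets
probe = nn.Linear(768, num_classes)  # the only trainable component
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, num_classes, (8,))

with torch.no_grad():        # features come from the frozen encoder
    feats = backbone(images)
optimizer.zero_grad()
loss = criterion(probe(feats), labels)
loss.backward()              # gradients flow only into the probe
optimizer.step()
```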
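The article names a "Margin Softmax" loss but does not give Glint-MVT's exact formulation. One common instantiation is the cosine-margin (CosFace-style) softmax sketched below, which subtracts a margin from the target-class logit so classes are pushed further apart; treat it as an assumption about the general family, not the model's actual loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MarginSoftmaxLoss(nn.Module):
    """Cosine-margin softmax (CosFace-style); one plausible form of
    'margin softmax' -- not necessarily Glint-MVT's exact formula."""
    def __init__(self, feat_dim: int, num_classes: int,
                 s: float = 30.0, m: float = 0.35):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.s, self.m = s, m  # scale and margin hyperparameters

    def forward(self, feats: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Cosine similarity between L2-normalized features and class centers
        cos = F.linear(F.normalize(feats), F.normalize(self.weight))
        # Subtract the margin m from the target-class logit only
        onehot = F.one_hot(labels, cos.size(1)).float()
        logits = self.s * (cos - self.m * onehot)
        return F.cross_entropy(logits, labels)

# num_classes scaled down from the article's one million virtual categories
loss_fn = MarginSoftmaxLoss(feat_dim=768, num_classes=1000)
print(loss_fn(torch.randn(8, 768), torch.randint(0, 1000, (8,))))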
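Finally, the virtual categories that feed such a loss can be produced by clustering image embeddings and using the cluster IDs as pseudo-labels. The sketch below uses mini-batch k-means on random stand-in embeddings; the article's pipeline clusters LAION-400M into one million categories, which is far beyond this toy scale.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Toy stand-in for image embeddings of a large corpus (LAION-400M in the article)
embeddings = np.random.randn(10_000, 768).astype(np.float32)

# Cluster into "virtual categories"; n_clusters is scaled down from one million
# so the example stays runnable on a laptop
kmeans = MiniBatchKMeans(n_clusters=100, batch_size=1024, n_init=3, random_state=0)
virtual_labels = kmeans.fit_predict(embeddings)

# Each cluster ID now acts as a class label for margin-softmax pretraining
print(virtual_labels.shape, virtual_labels.min(), virtual_labels.max())
```

Using cluster IDs as labels turns unlabeled web-scale data into a supervised classification problem, which is one way the article's claim of "improving data-scale efficiency" can be read.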