What exactly are large models, and which technical areas do they cover? An in-depth primer for beginners!
自动驾驶之心· 2025-08-05 23:32
Core Insights
- The article provides a comprehensive overview of large language models (LLMs), their definitions, architectures, capabilities, and notable developments in the field [3][6][12].

Group 1: Definition and Characteristics of LLMs
- Large language models (LLMs) are deep learning models trained on vast amounts of text data, capable of understanding and generating natural language [3][6].
- Key features of modern LLMs include large-scale parameters (e.g., GPT-3 with 175 billion parameters), the Transformer architecture, pre-training followed by fine-tuning, and multi-task adaptability [6][12].

Group 2: LLM Development and Architecture
- The Transformer architecture, introduced by Google in 2017, is the foundational technology behind LLMs and consists of an encoder and a decoder [9].
- Encoder-only architectures, like BERT, excel at text-understanding tasks, while decoder-only architectures, such as GPT, are optimized for text generation [10][11] (a minimal code sketch contrasting the two follows this summary).

Group 3: Core Capabilities of LLMs
- LLMs can generate coherent text, assist in coding, answer factual questions, and perform multi-step reasoning [12][13].
- They also excel at text understanding and conversion tasks, such as summarization and sentiment analysis [13].

Group 4: Notable LLMs and Their Features
- The GPT series by OpenAI is a key player in LLM development, known for strong general capabilities and continuous innovation [15][16].
- Meta's Llama series emphasizes open-source development and multi-modal capabilities, with significant impact on the AI community [17][18].
- Alibaba's Qwen series focuses on comprehensive open-source models with strong support for Chinese and multi-language tasks [18].

Group 5: Visual Foundation Models
- Visual foundation models are essential for processing visual inputs, enabling the connection between visual data and LLMs [25].
- They use architectures such as Vision Transformers (ViT) and hybrids combining CNNs and Transformers for tasks including image classification and cross-modal understanding [26][27].

Group 6: Speech Large Models
- Speech large models are designed to handle a variety of speech-related tasks and are trained on large-scale speech data [31].
- They primarily use Transformer architectures to capture long-range dependencies in speech, supporting tasks such as speech recognition and translation [32][36].

Group 7: Multi-Modal Large Models (MLLMs)
- Multi-modal large models can process and understand multiple types of data, such as text, images, and audio, enabling complex interactions [39].
- Their architecture typically pairs pre-trained modal encoders with a large language model and a modal decoder for generating outputs [40].

Group 8: Reasoning Large Models
- Reasoning large models enhance the reasoning capabilities of LLMs through optimized prompting and external knowledge integration [43][44].
- They focus on improving the accuracy and controllability of complex tasks without fundamentally altering the model structure [45].
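The encoder-only vs. decoder-only contrast summarized in Group 2 can be made concrete with a short sketch. This is illustrative only and not from the article: it assumes the Hugging Face transformers library is installed and uses the small public checkpoints bert-base-uncased and gpt2 as stand-ins for larger production LLMs.

```python
# Illustrative sketch of the two LLM families contrasted above (not from the article).
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

# Encoder-only (BERT-style): produces contextual embeddings for understanding tasks.
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
inputs = bert_tok("Large language models learn from text.", return_tensors="pt")
embeddings = bert(**inputs).last_hidden_state  # one vector per input token

# Decoder-only (GPT-style): autoregressively generates a continuation of a prompt.
gpt_tok = AutoTokenizer.from_pretrained("gpt2")
gpt = AutoModelForCausalLM.from_pretrained("gpt2")
prompt = gpt_tok("Large language models are", return_tensors="pt")
generated = gpt.generate(**prompt, max_new_tokens=20)
print(gpt_tok.decode(generated[0], skip_special_tokens=True))
```

The split mirrors the tasks listed in Groups 2 and 3: encoder outputs feed classification or retrieval heads, while decoder sampling yields free-form text.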
Clever! A traditional technique gives a domestic visual foundation model a big boost
量子位· 2025-05-23 06:14
Jin Lei, reporting from Aofei Temple. 量子位 (QbitAI) | WeChat official account QbitAI.

Let's just say it: in the area of visual foundation models, domestic AI has scored big with Glint-MVT, the latest result from 格灵深瞳 (DeepGlint).

First, the results.

Linear probing: put simply, linear probing is a small trick for testing how solid a pretrained model's fundamentals are. The recipe: replace the final part of the model with a simple linear layer while keeping everything else unchanged, then train only that new linear layer and judge how useful the previously learned features are from its performance (see the sketch after this excerpt). The test compared against CLIP and OpenCLIP on 26 classification test sets, and the domestic visual foundation model's average accuracy came out 2.3% higher than OpenCLIP and 1.1% higher than CLIP.

Now for the applications. If the visual foundation model is the base, then its downstream tasks, such as "image understanding + segment anything", show its effect more directly. For example, given the image below, we can ask the AI: can you provide a segmentation mask for the person touching the basketball in this image? Clearly, the difficulty is that the person holding the basketball is occluded by other people's hands and bodies, which makes segmentation much harder. Yet the domestic AI is unfazed and, in one shot, cuts out the requested person with extreme precision. Now consider an even more complex case: ...
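The linear probing recipe described above (freeze a pretrained backbone, swap its head for a fresh linear layer, train only that layer) can be sketched roughly as follows. This is an illustrative PyTorch sketch only: torchvision's ResNet-50 stands in for the backbone, and the 26-way head merely echoes the 26 test sets mentioned in the piece; Glint-MVT itself and its actual evaluation code are not assumed.

```python
import torch
import torch.nn as nn
from torchvision import models

# Illustrative linear probing: ResNet-50 stands in for any pretrained vision backbone.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Freeze every pretrained parameter so only the new head will learn.
for p in backbone.parameters():
    p.requires_grad = False
backbone.eval()  # keep BatchNorm statistics of the frozen part fixed

# Replace the final classification layer with a fresh linear layer.
num_classes = 26  # hypothetical label count for one downstream test set
backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)

# Only the linear head's parameters go to the optimizer.
optimizer = torch.optim.SGD(backbone.fc.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def probe_step(images, labels):
    """One training step; the frozen backbone supplies fixed features."""
    logits = backbone(images)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the backbone never updates, the head's accuracy is a direct read-out of how linearly separable the pretrained features already are, which is exactly what the benchmark against CLIP and OpenCLIP measures.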