Core Insights
- The article discusses advances in Multimodal Large Language Models (MLLMs) and introduces a new paradigm, Patch-as-Decodable Token (PaDT), to address the limitations of existing models on tasks requiring fine-grained spatial understanding [2][6].

Group 1: PaDT Overview
- PaDT takes a new approach: it divides an image into visual patches and lets the model directly generate the corresponding Visual Reference Tokens (VRTs) [3].
- It allows text tokens and visual tokens to alternate seamlessly at both the input and output stages, so the model describes image content as naturally as it describes text [4] (the first sketch after this summary illustrates such interleaved decoding).
- The model can point to image targets directly within generated sentences instead of guessing coordinates [5].

Group 2: Limitations of Traditional MLLMs
- Traditional MLLMs emit detection-box coordinates as strings, which leads to inconsistent formats, semantic disconnection, and weak image-text association [8].
- Because the output format varies, targets are hard to parse, and coordinate numbers can be split across separate tokens, breaking spatial continuity [8] (see the second sketch below).
- Coordinate tokens carry no inherent semantic meaning, which contributes to hallucination and repetition in generated outputs [8].

Group 3: PaDT Mechanism
- PaDT derives VRTs from the visual patch embeddings of the current input image, building a dynamic embedding table that unifies text and visual information [11] (the third sketch below shows one way such a table could be assembled).
- This design avoids the pitfalls of traditional methods that rely on a global visual codebook, which can confuse similar objects and generate patches that do not exist in the image [13].
- A lightweight PaDT Decoder, consisting of three bidirectional attention blocks, transforms the predicted VRTs into structured visual outputs such as bounding boxes and segmentation masks [15].

Group 4: Performance Metrics
- On the RefCOCO/+/g referring expression comprehension benchmark, PaDT Pro (3B) reached a remarkable average accuracy of 93.6, surpassing the 78B InternVL3 model's 91.4 [21][22].
- On COCO open-vocabulary detection, where traditional MLLMs typically score below 20 mean Average Precision (mAP), PaDT Pro (3B) raised mAP to 38.2, nearly doubling performance [21][24].
- On the Referring Image Captioning (RIC) task, the model also performed strongly, lifting the CIDEr-D score from 0.386 to 1.450 [24].

Group 5: Implications and Future Directions
- PaDT's success stems from its diagnosis of the visual-capability bottleneck in MLLMs: it natively aligns visual patches with generated tokens [31].
- The dynamic embedding mechanism binds VRTs tightly to the current image, preventing cross-image confusion [31].
- The model shows robust multitask ability, switching tasks through prompt changes alone and outperforming single-task models [33].
- PaDT marks a significant step toward true multimodal intelligence, enabling more natural interaction between modalities [35].
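To make Group 1's interleaving concrete, here is a minimal Python sketch of how a PaDT-style mixed output stream could be rendered. The split of the id space, the vocabulary size, the patch count, and the `render` helper are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of an interleaved PaDT-style output stream: token ids
# below TEXT_VOCAB_SIZE are ordinary text tokens; ids at or above it refer to
# Visual Reference Tokens (VRTs), i.e. patches of the *current* image.
# All names and sizes here are illustrative, not from the paper.

TEXT_VOCAB_SIZE = 32_000          # assumed text vocabulary size
NUM_PATCHES = 576                 # e.g. a 24x24 patch grid from the vision encoder

# toy inverse vocabulary for the text side (a real model uses its tokenizer)
id_to_text = {101: "the", 102: "dog", 103: "is", 104: "next", 105: "to", 106: "."}

def render(output_ids):
    """Turn a mixed stream of text ids and VRT ids into a readable string."""
    pieces = []
    for tok in output_ids:
        if tok < TEXT_VOCAB_SIZE:
            pieces.append(id_to_text.get(tok, "<unk>"))
        else:
            patch_idx = tok - TEXT_VOCAB_SIZE      # which image patch this VRT points at
            assert 0 <= patch_idx < NUM_PATCHES
            pieces.append(f"<vrt:{patch_idx}>")    # a grounded reference, not a coordinate
    return " ".join(pieces)

# "the dog <vrt:137> is next to <vrt:342> ."
print(render([101, 102, 32_137, 103, 104, 105, 32_342, 106]))
```

The point of the sketch is that a VRT id names a concrete patch of the input image, so the generated sentence carries its own grounding rather than a coordinate string to be parsed afterward.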
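The brittleness of coordinate-as-string outputs described in Group 2 can be shown with a toy example. The subword segmentation below is invented for illustration (real splits depend on the tokenizer's vocabulary), and the regex parser is a stand-in for any post-hoc extraction step.

```python
import re

# A subword tokenizer has no reserved tokens for multi-digit numbers, so a
# coordinate like 523 may be emitted as the pieces "5" + "23"; the model must
# get every piece right for the box to be valid. (Segmentation is made up.)
toy_subword_pieces = ["[", "5", "23", ",", " 18", "7", ",", " 64", "2", ",", " 30", "1", "]"]
print("".join(toy_subword_pieces))   # reassembles to "[523, 187, 642, 301]"

# The output format itself can also drift between samples, defeating parsers:
outputs = [
    "[523, 187, 642, 301]",          # bracketed list
    "(523,187),(642,301)",           # corner pairs
    "x1=523 y1=187 x2=642 y2=301",   # key=value style
]
box_pattern = re.compile(r"\[(\d+), (\d+), (\d+), (\d+)\]")
for text in outputs:
    match = box_pattern.search(text)
    print(text, "->", match.groups() if match else "PARSE FAILURE")
```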
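Group 3's dynamic embedding table and lightweight decoder suggest the following PyTorch sketch. Only the count of three bidirectional attention blocks comes from the article; the projection layer, dimensions, pooling, and box head are hypothetical stand-ins for whatever the paper actually uses.

```python
import torch
import torch.nn as nn

# A minimal sketch, assuming the dynamic table works roughly as described:
# per image, VRT embeddings are derived from that image's patch embeddings
# and appended to the text embedding table, so VRT ids are bound to the
# current image rather than to a global visual codebook.

TEXT_VOCAB, NUM_PATCHES, DIM = 32_000, 576, 1024

text_embed = nn.Embedding(TEXT_VOCAB, DIM)        # static text table
patch_proj = nn.Linear(DIM, DIM)                  # assumed patch-to-VRT projection

def dynamic_table(patch_feats):                   # patch_feats: (NUM_PATCHES, DIM)
    vrt_embeds = patch_proj(patch_feats)          # one VRT per patch of *this* image
    return torch.cat([text_embed.weight, vrt_embeds], dim=0)  # (TEXT_VOCAB+NUM_PATCHES, DIM)

class PaDTDecoderSketch(nn.Module):
    """Three bidirectional attention blocks turning emitted VRTs into one box."""
    def __init__(self, dim=DIM, blocks=3):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=blocks)  # no causal mask => bidirectional
        self.box_head = nn.Linear(dim, 4)          # (cx, cy, w, h), normalized

    def forward(self, vrt_embeds):                 # (batch, n_vrts, dim)
        fused = self.blocks(vrt_embeds)
        return self.box_head(fused.mean(dim=1)).sigmoid()  # pool VRTs -> one box

patch_feats = torch.randn(NUM_PATCHES, DIM)        # stand-in for vision-encoder output
table = dynamic_table(patch_feats)
emitted_vrt_ids = torch.tensor([TEXT_VOCAB + 137, TEXT_VOCAB + 342])  # VRTs from the stream
box = PaDTDecoderSketch()(table[emitted_vrt_ids].unsqueeze(0))
print(box.shape)  # torch.Size([1, 4])
```

Because the VRT rows of the table are recomputed from each input image, a VRT id can never refer to a patch of some other image, which is the cross-image-confusion property noted in Group 5.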
No more "guessing coordinates": Shuicheng Yan's team and collaborators release the PaDT multimodal large model, achieving true multimodal representation output
机器之心·2025-10-16 00:51