Element-Aware Alignment (EAL)
Cracking the structured long-document retrieval problem! A new framework ends models' "structural blindness"
量子位 (QbitAI) · 2025-09-25 11:42
Core Insights
- The article introduces SEAL (Structure and Element Aware Learning), a contrastive learning framework that improves how models understand long documents by combining structural awareness with element alignment [1][8].

Group 1: SEAL Framework Overview
- SEAL integrates a document's macro-level structure and its micro-level semantic elements into a single embedding space, significantly improving the ability of pre-trained language models to understand and represent structured data [3].
- The framework addresses two main challenges in long-document retrieval: making models aware of document hierarchy, and promoting precise alignment between user queries and specific document elements [18][25].

Group 2: Training Strategies
- The framework uses two complementary training strategies: Structure Aware Learning (SAL) and Element Aware Learning (EAL) [9].
- SAL targets the "skeleton" of a document. The model sees two versions of the same document, one with structural tags and one without, which encourages it to learn the structural function of each text segment [12][13].
- EAL strengthens the model's grasp of the semantic role of local elements. A masking mechanism hides part of the document, so the model must infer the document's overall relevance from incomplete information [14][15].

Group 3: Experimental Results
- Applying SEAL to the BGE-M3 model raised its MRR@10 from 73.96% to 77.84%, a gain of 3.88 percentage points in retrieval ranking quality [17][19].
- The gain means relevant results are ranked higher, and online A/B testing confirmed the improvement [20].

Group 4: Open Source Dataset
- The team released a new dataset, StructDocRetrieval, made up of long documents with structural annotations. Its documents are far longer than those in typical short-text datasets such as MS MARCO [21][22].
- The dataset is in HTML format and provides rich structural and semantic annotations, filling a gap in the field [23].

Group 5: Broader Implications
- SEAL's finer-grained understanding of structural information can give downstream tasks more reliable sources, for example by helping AI assistants locate answers in technical documentation accurately [25].
- The framework shows promise in specialized settings such as enterprise knowledge bases and legal technology [25].
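
To make the two training strategies in Group 2 more concrete, the sketch below builds the kinds of document views they describe: a tag-free copy of an HTML document (the SAL pairing) and a copy with some elements masked out (the EAL setting). It is only an illustration under stated assumptions, not the authors' implementation. The function names `strip_tags` and `mask_elements`, the `mask_prob` parameter, and the choice to mask whole top-level elements are all hypothetical, and the regex-based parsing assumes flat, non-nested HTML.

```python
import random
import re

# Matches any single HTML tag, e.g. <p> or </h1>.
TAG_RE = re.compile(r"<[^>]+>")
# Matches one flat element such as <p>text</p>; assumes no nesting of same-name tags.
ELEMENT_RE = re.compile(r"<(\w+)[^>]*>.*?</\1>", re.DOTALL)


def strip_tags(html: str) -> str:
    """SAL-style view (assumed): the same text with structural tags removed."""
    text = TAG_RE.sub(" ", html)
    return " ".join(text.split())  # collapse the whitespace left by removed tags


def mask_elements(html: str, mask_prob: float, rng: random.Random) -> str:
    """EAL-style view (assumed): drop each element with probability mask_prob.

    At least one element is always kept so the view is never empty.
    """
    elements = [m.group(0) for m in ELEMENT_RE.finditer(html)]
    kept = [e for e in elements if rng.random() >= mask_prob]
    if not kept and elements:
        kept = [rng.choice(elements)]
    return "".join(kept)


if __name__ == "__main__":
    doc = "<h1>Install</h1><p>Run the installer.</p><p>Then reboot.</p>"
    rng = random.Random(0)  # fixed seed so the demo is repeatable
    print("tagged view :", doc)
    print("plain view  :", strip_tags(doc))
    print("masked view :", mask_elements(doc, mask_prob=0.5, rng=rng))
```

In a contrastive setup, the tagged and plain views of one document would typically be treated as a positive pair, and a query would be trained to stay close to the masked view of its relevant document. The source does not give the exact loss or sampling details, so those are left out here.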
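
The MRR@10 figures in Group 3 follow the standard definition of mean reciprocal rank: for each query, take the reciprocal of the rank of the first relevant document within the top 10 results (or 0 if none appears), then average over queries. A minimal sketch of that computation follows; the function name `mrr_at_k` and the toy document IDs are illustrative and do not come from the article.

```python
from typing import Hashable, Sequence, Set


def mrr_at_k(
    rankings: Sequence[Sequence[Hashable]],
    relevant: Sequence[Set[Hashable]],
    k: int = 10,
) -> float:
    """Mean reciprocal rank over queries, looking only at the top-k results."""
    if len(rankings) != len(relevant):
        raise ValueError("rankings and relevant must have one entry per query")
    if not rankings:
        return 0.0
    total = 0.0
    for ranked, rel in zip(rankings, relevant):
        # Ranks are 1-based; only the first relevant hit counts.
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break
    return total / len(rankings)


if __name__ == "__main__":
    # Three toy queries: hit at rank 1, hit at rank 2, and no hit.
    rankings = [["d1", "d2"], ["d3", "d4"], ["d5"]]
    relevant = [{"d1"}, {"d4"}, {"d9"}]
    print(mrr_at_k(rankings, relevant, k=10))
```

Because only the first relevant hit counts, a higher MRR@10 reflects relevant documents moving closer to the top of the list, which is how the article interprets the reported gain.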