Amazon Web Services: Intelligent Document Translation in Practice with Large Models
Sohu Finance (Sou Hu Cai Jing) · 2025-07-16 09:32
Core Insights
- The presentation covers Amazon Web Services' (AWS) practical experience with intelligent document translation based on large models, focusing on two requirements: terminology accuracy and adherence to corporate writing style [1][21].
- The main challenges are keeping terminology accurate when translating with large language models and ensuring the output complies with corporate writing style [4][21].

Group 1: Terminology Accuracy
- Initially, AWS took the straightforward approach of placing a few hundred terms directly in the model's context, which achieved a 90% accuracy rate with 200 term pairs [5][21].
- Once the number of terms grew beyond 1,000, AWS implemented the Aho-Corasick (AC) algorithm for efficient in-memory key-value matching, addressing the limitations of context length and the attention mechanism [6][21].
- For larger datasets, AWS used OpenSearch Percolator, which supports term indexing and retrieval and handles fuzzy matching and special characters in terminology [6][18][21].

Group 2: Corporate Writing Style
- To meet corporate writing style requirements, AWS introduced the concept of a sample library, using historical translation documents to guide new translations [7][21].
- Instead of fine-tuning large models, which can be costly, AWS combined Retrieval Augmented Generation (RAG) with few-shot prompting to create a web knowledge base, a more cost-effective solution [8][21].
- The system integrates previous translations to keep the writing style consistent, which improves overall translation quality [8][21].

Group 3: Engineering Challenges
- AWS faced engineering challenges in translating PDF documents. Languages differ in information density, so translating from Chinese to English can expand the content by about 30% [13][21].
- Solutions included dynamic recursive algorithms to optimize rendering, and merging text blocks to prevent translation errors caused by block segmentation [13][21].
- The system architecture supports both offline and online processes, so users can upload terminology libraries and translate documents efficiently [10][12][21].

Group 4: Positive Feedback Loop
- The professional translation field exhibits a flywheel effect: accumulated internal data assets improve the translation process and can be applied to other areas such as AI proofreading and smart writing review [15][21].
- Because users can upload their own terminology and sample libraries, the system supports a continuous improvement cycle in translation quality and efficiency [15][21].
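
To make the in-memory term matching under Group 1 more concrete, here is a minimal sketch of an Aho-Corasick matcher in Python. It is not AWS's implementation: the function names `build_automaton` and `find_terms`, the sample glossary, and the choice to return every matching term (including terms nested inside longer ones) are all assumptions made for illustration. The idea it shows is that one pass over the source text finds every glossary term that occurs in it, so only those term pairs need to be passed to the model.

```python
from collections import deque


def build_automaton(glossary):
    """Build an Aho-Corasick automaton over the source-side terms (hypothetical helper)."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link for each state
    out = [[]]    # terms that end at each state
    for term in glossary:
        state = 0
        for ch in term:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(term)
    # Breadth-first pass: compute failure links and inherit outputs from them.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_terms(text, glossary):
    """Return the {source term: target term} pairs whose source term occurs in text."""
    goto, fail, out = build_automaton(glossary)
    state = 0
    hits = {}
    for ch in text:
        # Follow failure links until some state accepts this character.
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for term in out[state]:
            hits[term] = glossary[term]
    return hits


# Illustrative glossary; a real one would be uploaded by the user.
glossary = {
    "大模型": "large model",
    "术语库": "terminology library",
    "知识库": "knowledge base",
}
matched = find_terms("用户上传术语库后,大模型开始翻译。", glossary)
print(matched)
```

In a real pipeline the automaton would be built once per glossary and reused across documents; it is rebuilt on every call here only to keep the sketch short.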
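
The sample-library approach under Group 2, retrieving earlier translations and showing them to the model as examples, can be sketched in a similar way. The code below is an illustration under stated assumptions, not the system described in the talk: `retrieve_examples` and `build_prompt` are hypothetical names, the prompt wording is invented, and the standard-library `difflib.SequenceMatcher` stands in for whatever retrieval method (for example, vector search) a production RAG setup would use.

```python
from difflib import SequenceMatcher


def retrieve_examples(source, sample_library, k=2):
    """Return the k (source, translation) pairs whose source text is most similar.

    String similarity is a stand-in here; a real system would likely use
    embedding-based retrieval over the sample library.
    """
    ranked = sorted(
        sample_library,
        key=lambda pair: SequenceMatcher(None, source, pair[0]).ratio(),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(source, sample_library, k=2):
    """Assemble a few-shot prompt from the retrieved historical translations."""
    lines = [
        "Translate the source text into English.",
        "Follow the wording and style of the example translations.",
        "",
    ]
    for src, tgt in retrieve_examples(source, sample_library, k):
        lines += [f"Source: {src}", f"Translation: {tgt}", ""]
    lines += [f"Source: {source}", "Translation:"]
    return "\n".join(lines)


# Illustrative sample library built from earlier, approved translations.
sample_library = [
    ("请在控制台中创建存储桶。", "Create a bucket in the console."),
    ("本季度收入同比增长。", "Revenue grew year over year this quarter."),
    ("请在控制台中删除存储桶。", "Delete the bucket in the console."),
]
prompt = build_prompt("请在控制台中创建存储桶。", sample_library, k=2)
print(prompt)
```

Because the examples are retrieved at request time, adding newly approved translations to the library changes future output without retraining the model, which is the cost argument the summary makes against fine-tuning.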