No parameter tuning, no extra effort: Shanghai Jiao Tong University and Shanghai AI Lab unveil the "Memory Decoder" for seamless adaptation of any LLM
36Ke · 2025-08-26 09:17
Core Insights - The article discusses the challenges faced by large language models (LLMs) in specialized fields such as healthcare, finance, and law due to their lack of deep domain knowledge. It highlights the need for effective domain adaptation solutions to enhance LLM performance in these areas [2][4].

Group 1: Memory Decoder Innovation
- The Memory Decoder is introduced as a "plug-and-play" pre-trained memory module that adapts various LLMs without modifying their parameters, enabling efficient domain adaptation across different model sizes [2][4].
- Experimental results show that the Memory Decoder effectively adapts Qwen and Llama models to the biomedical, financial, and legal domains, achieving an average perplexity reduction of 6.17% [4][12].

Group 2: Architecture and Functionality
- During pre-training, the Memory Decoder learns to align its output distribution with the distribution produced by a non-parametric retriever, using a distribution alignment loss (see the sketch after this summary) [5][10].
- At inference, it processes the input in parallel with the base language model and generates domain-enhanced predictions without any additional retrieval cost [5][10].

Group 3: Performance Evaluation
- The Memory Decoder delivers significant improvements across GPT-2 model sizes on the WikiText-103 dataset: a single 124-million-parameter Memory Decoder enhances the entire GPT-2 series [11][12].
- On downstream tasks, the Memory Decoder maintains or improves performance across nine NLP tasks, showing that it strengthens domain adaptation while preserving general language capabilities [13][14].

Group 4: Cross-Model and Cross-Tokenizer Adaptation
- The Memory Decoder exhibits strong cross-model adaptability, improving Qwen and Qwen2.5 models regardless of their size [15][21].
- It also supports cross-tokenizer adaptation, transferring knowledge between different model architectures with minimal additional training [17][18].

Group 5: Knowledge-Intensive Reasoning Tasks
- On knowledge-intensive reasoning tasks, the Memory Decoder outperforms traditional retrieval-augmented generation (RAG) methods, improving the model's access to factual knowledge while maintaining its reasoning capabilities [19][20].

Group 6: Limitations and Future Directions
- Despite its advantages, the Memory Decoder has limitations, such as the computational overhead of building the key-value datastore during pre-training and the need for some parameter adjustment for cross-tokenizer adaptation [21][23].
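The two mechanisms summarized in Group 2 can be made concrete with a short sketch. The PyTorch-style code below is a minimal illustration under stated assumptions: during pre-training, a small memory decoder is trained to match a kNN-retrieval target distribution via a KL-divergence loss; at inference, its next-token distribution is interpolated with that of the frozen base LLM, so no datastore lookup is needed. The function names, the interpolation weight `lam`, and the HuggingFace-style `.logits` interface are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of the Memory Decoder idea as described above (not the paper's code).
import torch
import torch.nn.functional as F

def distribution_alignment_loss(mem_logits: torch.Tensor,
                                knn_target_probs: torch.Tensor) -> torch.Tensor:
    """Assumed form of the alignment loss: KL(kNN target || Memory Decoder).

    Both tensors are shaped (num_positions, vocab_size).
    """
    log_q = F.log_softmax(mem_logits, dim=-1)  # memory decoder's predicted distribution
    # F.kl_div expects log-probabilities as input and probabilities as target
    return F.kl_div(log_q, knn_target_probs, reduction="batchmean")

@torch.no_grad()
def domain_enhanced_next_token(base_model, memory_decoder, input_ids,
                               lam: float = 0.3) -> torch.Tensor:
    """Run both models in parallel and interpolate their next-token distributions."""
    p_base = F.softmax(base_model(input_ids).logits[:, -1, :], dim=-1)     # frozen general LLM
    p_mem = F.softmax(memory_decoder(input_ids).logits[:, -1, :], dim=-1)  # plug-in domain memory
    # No datastore lookup at inference time: the retrieved knowledge is baked into the small decoder
    return (1.0 - lam) * p_base + lam * p_mem
```

Because the base model's parameters are never touched, the same trained memory module can, in principle, sit beside any model that shares its vocabulary, which is consistent with the cross-model results summarized in Group 4.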
KIMI K2: The Most Forward-Looking Research! A New OnlineRL Paradigm, Another DeepSeek Moment for Large Models!
2025-07-19 14:02
KIMI K2: The Most Forward-Looking Research! A New OnlineRL Paradigm, Another DeepSeek Moment for Large Models! 20250718 Summary

Kimi K2 is the first domestic MoE model publicly reported to have one trillion parameters. Its architecture resembles DeepSeek V3, but with a finer-grained decomposition of experts; it adopts a clipping-based optimizer to mitigate gradient problems and implements partial online reinforcement learning. By blending data from multiple scenarios and using a reward model to select the best outputs, it produces high-quality synthetic data and pushes open-ended problem scenarios forward (a minimal sketch of this best-of-N selection appears at the end of this note).

What made GPT2 such a sensation was the marked capability gain from tool use (a 15% absolute improvement, an 80% relative improvement) and the fact that post-training compute consumption exceeded pre-training, signaling higher requirements for compute scale and scale-up architectures and prompting overseas players to build more large-node compute clusters.

The Kimi K2 model has drawn discussion overseas for its paradigm innovation and strong capabilities. Even the pre-training version, once reinforcement learning is completed, is expected to match or even surpass GPT-3 and may exceed next-generation models at home and abroad, lifting the supporting base software and hardware ecosystem and driving both short-chain and long-chain applications.

From an investment perspective, the second half of 2025 enters the phase where expectations are cashed in; attention should go to the projects that land fastest and to those with the greatest long-term incremental value. Overseas data show that cloud computing, supporting base software and hardware infrastructure, and implementation ...
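As a purely illustrative aid, here is a minimal Python sketch of the reward-model "best-of-N" selection loop the call summary alludes to: sample several candidate responses per prompt, score them with a reward model, and keep the top-scoring one as synthetic training data. `generate_candidates` and `reward_model` are hypothetical placeholders, not Kimi's actual pipeline.

```python
# Minimal best-of-N selection sketch; the two callables are hypothetical placeholders.
from typing import Callable, List, Tuple

def select_best_of_n(prompt: str,
                     generate_candidates: Callable[[str, int], List[str]],
                     reward_model: Callable[[str, str], float],
                     n: int = 8) -> Tuple[str, float]:
    """Return the candidate response with the highest reward score for this prompt."""
    candidates = generate_candidates(prompt, n)           # sample N responses from the policy
    scored = [(c, reward_model(prompt, c)) for c in candidates]
    best, score = max(scored, key=lambda pair: pair[1])   # keep the reward-model optimum
    return best, score

# The selected (prompt, best) pairs would then feed a synthetic-data pool for
# further post-training, which is the mechanism the call summary describes.
```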