Catastrophic Forgetting
In the LLM Context, Is "Continual Learning" the Optimal Solution to the "Memory" Problem?
机器之心· 2025-11-16 01:30
Group 1
- The article discusses "Nested Learning," a concept proposed by Google that aims to address memory management in LLMs (Large Language Models) and the challenge of catastrophic forgetting [5][6][8]
- Nested Learning frames training as a multi-layered optimization problem in which a model is viewed as a series of interconnected sub-problems, allowing new skills to be learned while avoiding the loss of previously acquired knowledge [6][7]
- The research introduces the Continuum Memory System (CMS), which treats memory as a set of modules updating at different frequencies, improving the model's ability to manage memory effectively (a minimal sketch of this multi-frequency idea follows below) [6][7]

Group 2
- The article highlights the importance of improving LLMs' memory capabilities to enable continual learning, allowing AI to retain contextual experiences, semantic knowledge, and procedural skills [8]
- A proposed three-layer memory architecture comprises Model Weights for general knowledge, a KV Cache for intermediate results, and Context for relevant background information, together enabling appropriate responses from the model [8]
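The multi-frequency idea behind CMS can be made concrete with a small sketch. The code below is an illustration only, not Google's implementation: the module sizes, update periods, and plain-SGD update rule are all assumptions. It simply shows several memory blocks sharing one input stream while each applies its accumulated updates at its own rate.

```python
# Illustrative sketch of multi-frequency memory modules (assumed design,
# not the actual CMS implementation from the Nested Learning paper).
import torch
import torch.nn as nn

class MultiFrequencyMemory(nn.Module):
    def __init__(self, dim: int, update_periods=(1, 8, 64)):
        super().__init__()
        # One linear "memory" block per timescale: fast, medium, slow.
        self.blocks = nn.ModuleList(nn.Linear(dim, dim) for _ in update_periods)
        self.update_periods = update_periods

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Every block contributes to the output on every step.
        return sum(block(x) for block in self.blocks)

    def apply_updates(self, step_idx: int, lr: float = 1e-3):
        # Each block applies its accumulated gradient only every `period`
        # steps, so slow blocks integrate long horizons and change rarely.
        with torch.no_grad():
            for block, period in zip(self.blocks, self.update_periods):
                if step_idx % period == 0:
                    for p in block.parameters():
                        if p.grad is not None:
                            p -= lr * p.grad
                            p.grad.zero_()

# Toy usage: gradients accumulate every step; updates land at each block's rate.
mem = MultiFrequencyMemory(dim=32)
for step in range(1, 257):
    x = torch.randn(4, 32)
    loss = (mem(x) - x).pow(2).mean()   # stand-in reconstruction objective
    loss.backward()
    mem.apply_updates(step)
```

Because the slow blocks accumulate gradients over many steps before applying them, they change rarely and can play the role of long-term memory, while the fast block tracks recent data.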
Breaking Through the LLM Forgetting Bottleneck: Google's "Nested Learning" Lets AI Evolve Continuously Like the Human Brain
机器之心· 2025-11-08 06:10
Core Insights
- Google has introduced a new machine-learning paradigm called Nested Learning, which allows models to continuously learn new skills without forgetting old ones, marking a significant step toward AI that evolves like the human brain [1][3][4]

Group 1: Nested Learning Concept
- Nested Learning treats a machine-learning model as a series of interconnected optimization sub-problems, enabling a more efficient learning system (a two-level toy of this view appears after this summary) [6][11]
- The approach bridges the gap between model architecture and optimization algorithms, arguing that they are fundamentally the same thing and can be organized into hierarchical optimization systems [7][16]
- The paradigm lets different components of a model update at different frequencies, improving the model's handling of long-term and short-term memory [15][20]

Group 2: Implementation and Architecture
- Based on Nested Learning principles, Google has built a self-modifying architecture called Hope, which outperforms existing models in language modeling and long-context memory management [8][24]
- Hope is an evolution of the Titans architecture, designed to carry out unbounded levels of in-context learning and to optimize its own memory through a self-referential process [24][26]

Group 3: Experimental Results
- Evaluations show that Hope achieves lower perplexity and higher accuracy than other architectures across a range of language-modeling and common-sense-reasoning tasks [27][30]
- Hope, Titans, and other architectures were also compared on long-context tasks, demonstrating the effectiveness of the Nested Learning framework [30]

Group 4: Future Implications
- Nested Learning provides a theoretical and practical foundation for closing the gap between current LLMs' limitations and the human brain's superior continual-learning ability, paving the way for self-improving AI [30]
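The claim that architecture and optimization are "fundamentally the same" is easiest to see in a two-level toy: an inner optimization problem adapts fast weights within each sequence, and an outer problem trains the slow initialization through those inner updates. The sketch below is a generic nested-optimization (MAML-style) toy under assumed shapes and losses, not the Hope architecture itself.

```python
# Two nested optimization levels: inner loop adapts fast weights per sequence,
# outer loop trains the slow weights through the inner updates. Shapes, losses,
# and learning rates are assumptions for illustration.
import torch

def inner_adapt(fast_w, seq, inner_lr=0.1, steps=3):
    # Inner problem: fit a fast associative memory to this one sequence
    # (predict each embedding from the previous one).
    for _ in range(steps):
        pred = seq[:-1] @ fast_w
        inner_loss = ((pred - seq[1:]) ** 2).mean()
        (grad,) = torch.autograd.grad(inner_loss, fast_w, create_graph=True)
        fast_w = fast_w - inner_lr * grad   # differentiable update step
    return fast_w

dim = 16
slow_w = (0.01 * torch.randn(dim, dim)).requires_grad_()  # slow "initialization"
outer_opt = torch.optim.Adam([slow_w], lr=1e-3)

for _ in range(100):                      # outer loop: slow timescale
    seq = torch.randn(32, dim)            # stand-in for one token sequence
    fast_w = inner_adapt(slow_w, seq)     # inner loop: fast timescale
    outer_loss = ((seq[:-1] @ fast_w - seq[1:]) ** 2).mean()
    outer_opt.zero_grad()
    outer_loss.backward()                 # gradients flow through inner updates
    outer_opt.step()
```

Here the "architecture" (a fast associative memory) and the "optimizer" (gradient steps on it) are literally the same computation viewed at two timescales, which is the intuition the paradigm formalizes.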
LLM Fine-Tuning Paradigm Overturned Again? New Research from UIUC and Amazon Suggests the SFT Catastrophic Forgetting Problem May Be Misunderstood
机器之心· 2025-10-21 03:43
Core Insights
- The article examines the impact of supervised fine-tuning (SFT) on the general capabilities of large language models (LLMs), finding that SFT does not necessarily cause a significant decline in general performance when a smaller learning rate is used [2][34]
- The research challenges the long-held belief that domain-specific fine-tuning inevitably causes catastrophic forgetting of general capabilities, arguing that the choice of training strategy plays the crucial role [2][34]

Experiment Details
- The study used two domain-specific datasets, MedCalc and ESCI, which represent scenarios where open-source LLMs typically perform poorly, making them well suited to domain-specific SFT [5]
- Several open-source LLMs were tested, including Qwen3-8B and Gemma3-4B, with the learning rate during SFT carefully controlled [6]

Findings
- Finding 1: A smaller learning rate (e.g., 1e-6) lets models retain strong target-domain performance while greatly reducing the decline in general capabilities [11]
- Finding 2: For classification tasks where the training objective includes only the final label, a wider range of learning rates achieves good performance, as seen on the ESCI dataset [12][14]

Theoretical Analysis
- The team's analysis shows that smaller learning rates effectively bound the decline in general performance, consistent with the experimental findings [17]
- It also shows that when the training target contains only final labels, the model encounters fewer "hard tokens," widening the range of acceptable learning rates [17]

Token Adaptive Loss Reweighting (TALR)
- TALR dynamically adjusts each token's loss weight based on its predicted probability, reducing the influence of hard tokens during training (see the sketch after this summary) [20]
- Token weights are updated in real time, so the model's own confidence guides the training process [21]

Experimental Results
- Among the strategies compared for mitigating catastrophic forgetting, TALR performed best, especially at higher learning rates, preserving domain gains while minimizing losses in general performance [26][27]

Conclusion and Future Directions
- The research underscores the continued importance of SFT for enhancing LLM capabilities; while smaller learning rates and TALR are effective, more robust strategies for the forgetting problem remain to be explored [34][35]
- Future work should focus on balancing domain-specific performance with general capabilities, particularly in specialized fields such as medicine, where retaining foundational knowledge is crucial [35]
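To make TALR's mechanism concrete, here is a minimal sketch of probability-based token reweighting. The specific weighting function, temperature, and normalization below are assumptions on my part, not the paper's exact formulation; the sketch only captures the stated idea that low-probability ("hard") tokens should contribute less to the SFT loss.

```python
# Sketch of token-adaptive loss reweighting in the spirit of TALR
# (assumed weighting scheme, not the paper's exact formula).
import torch
import torch.nn.functional as F

def talr_loss(logits, targets, temperature=1.0):
    # logits: (batch, seq, vocab); targets: (batch, seq) token ids
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    token_p = token_logp.exp()                    # per-token model confidence
    # Down-weight hard (low-probability) tokens: weights follow the detached
    # confidence, so the model is not forced into large updates on them.
    weights = token_p.detach() ** (1.0 / temperature)
    weights = weights / weights.sum().clamp_min(1e-8)
    return -(weights * token_logp).sum()

# Toy usage with random logits:
logits = torch.randn(2, 5, 100, requires_grad=True)
targets = torch.randint(0, 100, (2, 5))
loss = talr_loss(logits, targets)
loss.backward()
```

Because the weights are recomputed from the current forward pass, they update in real time as the model's confidence changes, matching the behavior described above.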
New from Princeton University! VLM2VLA: Fine-Tuning a VLM into a VLA While Avoiding Catastrophic Forgetting
具身智能之心· 2025-10-07 10:00
Core Insights
- The article addresses catastrophic forgetting when fine-tuning Visual Language Models (VLMs) into Vision-Language-Action (VLA) models for robotic control, tracing it to a mismatch between the pre-training and fine-tuning data distributions [2][4]

Group 1: Catastrophic Forgetting
- Catastrophic forgetting occurs when the model loses its original reasoning and multimodal-understanding capabilities during action-generation training [2]
- The root cause is the distribution mismatch between internet-scale pre-training data (mostly image-text pairs) and the low-dimensional action vectors used for robotic fine-tuning [2]

Group 2: VLM2VLA Approach
- VLM2VLA resolves the mismatch by converting low-dimensional actions into natural-language descriptions, aligning the fine-tuning data with the pre-training data [3][4]
- The method fine-tunes with low-rank adaptation (LoRA), minimizing modifications to the VLM backbone and thereby avoiding catastrophic forgetting [4]

Group 3: Hierarchical Action Representation
- The VLM2VLA framework decomposes action prediction into a three-level reasoning process, using natural-language descriptions at every level [6]
- High-level subtask prediction generates intermediate tasks from initial observations and the overall task instruction; mid-level motion planning produces spatially grounded movement descriptions; low-level action generation emits executable, language-annotated action sequences [6]

Group 4: Data Reconstruction Pipeline
- VLM2VLA uses Gemini 2.5 to automatically reconstruct raw robot-trajectory datasets into language-annotated datasets compatible with VLM pre-training formats [9]
- The reconstruction provides context, decomposes trajectories into subtasks, and standardizes the format to align with VLM data [9]

Group 5: Efficient Fine-Tuning Strategy
- The Gemma-3-12B-IT model is fine-tuned with LoRA on its linear layers, without changing the VLM architecture or requiring joint training on internet-scale data (a configuration sketch follows below) [12][13]
- Key training parameters: LoRA rank 16, learning rate 5e-5, effective batch size 8 [12][13]

Group 6: Experimental Validation
- Experiments address three core questions against baseline models: whether multimodal understanding is retained, whether robotic-manipulation performance is competitive, and whether knowledge generalizes to new scenarios [14][15]
- VLM2VLA is competitive on both in-distribution and out-of-distribution tasks, demonstrating its hierarchical-reasoning capabilities [17][19]

Group 7: Limitations and Future Directions
- The model still faces reasoning latency and needs larger robot language-annotated datasets to improve generalization [19]
- Future improvements may include optimized decoding strategies, expanded language annotation for dexterous actions, and validation capabilities integrated into the VLM itself [19][22]
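The reported recipe (LoRA rank 16, learning rate 5e-5, effective batch size 8, linear layers only) maps naturally onto the Hugging Face peft library. The sketch below is an assumed reconstruction, not the paper's code: the checkpoint id, loading class, lora_alpha value, and target-module names are typical choices I am supplying for illustration.

```python
# Assumed LoRA setup mirroring the reported hyperparameters; checkpoint id,
# loading class, lora_alpha, and target modules are illustrative guesses.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

BASE_MODEL = "google/gemma-3-12b-it"    # assumed checkpoint id for Gemma-3-12B-IT

model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)
lora_cfg = LoraConfig(
    r=16,                               # LoRA rank reported above
    lora_alpha=32,                      # scaling factor: an assumption, not reported
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # typical linear layers
    lora_dropout=0.0,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()
# The training loop (e.g., transformers.Trainer) would then use lr=5e-5 with
# gradient accumulation to reach the reported effective batch size of 8.
```

Restricting trainable parameters to low-rank adapters on the linear layers is what keeps the backbone close to its pre-trained state, which is the mechanism the paper credits for avoiding forgetting.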
IEEE TPAMI 2025 | Peking University Proposes a Distribution-Driven Lifelong Learning Paradigm, Using Structural Modeling to Solve Catastrophic Forgetting
机器之心· 2025-09-26 10:35
Core Viewpoint
- The article presents DKP++, a new framework for lifelong person re-identification (LReID) that addresses catastrophic forgetting in lifelong learning by strengthening retention of historical knowledge and improving cross-domain learning [2][3]

Research Background
- Person re-identification (ReID) aims to match images of the same individual across different camera views, locations, and times, with applications in surveillance, intelligent transportation, and urban safety management [3]
- The traditional ReID paradigm struggles with domain shift caused by varying data-collection conditions, limiting its adaptability in long-term dynamic environments [3]

Research Challenges
- The core challenge in lifelong ReID is catastrophic forgetting: after learning a new domain, the model's retrieval performance on old-domain data drops sharply [5]
- Existing mitigations, such as retaining historical samples or knowledge distillation, suffer from data-privacy risks, storage overhead, and reduced model flexibility [5]

Research Motivation
- DKP++ is motivated by distribution-aware prototype learning, which retains historical knowledge without storing historical samples, and by cross-domain distribution alignment, which helps the model learn new knowledge while exploiting historical information [8][10]

Method Design
- DKP++ is a distribution-aware knowledge aligning and prototyping framework with four components: (1) instance-level fine-grained modeling to capture local details of person instances; (2) distribution-aware prototype generation to build robust category-level prototypes that retain intra-class variation; (3) distribution alignment to bridge the feature-distribution gap between new and old data; (4) prototype-based knowledge transfer that guides learning with generated prototypes and labeled new data (a sketch of the prototype idea follows below) [14]

Experimental Analysis
- Experiments used two typical training-domain sequences and five widely used ReID datasets, evaluating knowledge retention and generalization [16]
- DKP++ improved average performance on seen domains by 5.2%-7% and overall generalization on unseen domains by 4.5%-7.7% over existing methods [17]
- As the number of learned domains grows, the model shows higher retention of historical knowledge and faster performance growth on unseen domains [20]

Technical Innovations
- DKP++ contributes distribution-prototype modeling and representation, plus sample-alignment-guided prototype knowledge transfer, to overcome the distribution gap between new- and old-domain data [23]

Future Outlook
- Suggested directions include distribution alignment with larger models, active-forgetting mechanisms to prune redundant knowledge, and multi-modal lifelong learning for perception in complex environments [23]
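The prototype idea, keeping a distribution per identity instead of raw samples, can be sketched in a few lines. Below, each class is summarized by a diagonal Gaussian and replayed by sampling pseudo-features; DKP++'s actual instance-level modeling and alignment losses are more sophisticated, so treat this as an illustration of the principle only, with all names and sizes assumed.

```python
# Sketch of distribution-aware prototypes: per-class Gaussian summaries
# replayed as pseudo-features (illustrative, not DKP++'s actual method).
import torch

def build_prototypes(features, labels):
    # features: (N, D) embeddings from the current domain; labels: (N,) ids
    protos = {}
    for c in labels.unique():
        feats_c = features[labels == c]
        mu = feats_c.mean(dim=0)
        var = feats_c.var(dim=0, unbiased=False) + 1e-6  # keep intra-class spread
        protos[int(c)] = (mu, var)
    return protos

def sample_pseudo_features(protos, n_per_class=8):
    # Replay old identities by sampling from their stored Gaussians instead of
    # keeping raw images -- no historical samples need to be retained.
    xs, ys = [], []
    for c, (mu, var) in protos.items():
        xs.append(mu + var.sqrt() * torch.randn(n_per_class, mu.numel()))
        ys.append(torch.full((n_per_class,), c, dtype=torch.long))
    return torch.cat(xs), torch.cat(ys)

# Toy usage: summarize one domain, then draw pseudo-features for rehearsal.
feats, labs = torch.randn(200, 64), torch.randint(0, 10, (200,))
protos = build_prototypes(feats, labs)
pseudo_x, pseudo_y = sample_pseudo_features(protos)
```

Storing only a mean and variance per identity also explains the claimed privacy and storage benefits: no person image ever needs to be kept.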
Humanistic Reflections on Machine Emotion and AI Companionship (6) | Qiu Dejun, Li Weinong: Beyond Memory - The Necessity and Implementation of Forgetting in Affective Computing
Xin Lang Cai Jing· 2025-07-17 02:25
Group 1
- 2024 has been called the "Year of Humanoid Robots," with predictions that emotional communication between humans and robots will become the norm in future intelligent societies [1]
- Machine emotions and AI companionship raise questions about their impact on human-machine interaction and relationships, as well as cultural and gender perspectives on these emotional connections [1]
- The discussions highlight the potential social impacts, technological risks, and ethical issues of human-robot emotional interaction, prompting interdisciplinary research [1]

Group 2
- The concept of machine emotion is defined and analyzed through emotional intelligence, human-machine emotion, and human-machine interaction, with the authors advocating a restrained approach to developing machine emotions [2]
- A life-centered theory of consciousness motivates a new route to endowing machines with emotional capability: simulating biological homeostasis to achieve autonomous adaptability [2]
- Ethical reflection on human-machine emotional interaction, particularly around AI "resurrection" technology, reveals risks such as emotional dependency and identity crises, calling for regulatory and cultural adjustments [2]

Group 3
- Philosophical discussions in affective computing often rest on idealized technical assumptions, overlooking how essential forgetting mechanisms are to realistic and ethical AI emotional systems [3][4]
- Current affective computing is limited by its dependence on data quality and by the superficiality of AI emotional expression, which fails to capture the complexity of human emotional experience [6]
- Forgetting mechanisms are essential for adaptable, authentic emotional AI, letting systems discard outdated emotional data [11][12]

Group 4
- The proposed phenomenology-inspired human-like forgetting neural model (PHFNM) integrates individual and collective forgetting in emotional AI systems, reflecting both natural decay and active forgetting [19][22]
- The model comprises three interconnected layers: a low-dimensional emotional-index layer for natural decay, a memory-encoding layer for dynamic reconstruction, and an active-forgetting layer for ethical regulation (a sketch of the two forgetting modes follows below) [23][24][25]
- PHFNM emphasizes balancing individual emotional memory with collective social interaction, keeping emotional AI systems relevant and ethically responsible [26][27]
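The essay's distinction between natural decay and active forgetting can be illustrated with a small memory store: entries fade exponentially on their own, and an external policy can suppress specific entries outright. The class below is a toy under assumed field names and a half-life decay law, not the PHFNM specification.

```python
# Toy emotional-memory store with passive decay and active forgetting
# (assumed structure; PHFNM's three layers are far richer than this).
import time

class EmotionalMemory:
    def __init__(self, half_life_s=3600.0):
        self.half_life_s = half_life_s
        self.entries = []

    def store(self, payload, weight=1.0):
        self.entries.append({"t": time.time(), "w": weight,
                             "payload": payload, "suppressed": False})

    def strength(self, entry, now=None):
        # Passive decay: exponential fall-off with a configurable half-life.
        if entry["suppressed"]:
            return 0.0
        now = time.time() if now is None else now
        return entry["w"] * 0.5 ** ((now - entry["t"]) / self.half_life_s)

    def actively_forget(self, predicate):
        # Active forgetting: an external (e.g., ethical) policy suppresses
        # matching entries outright, regardless of remaining strength.
        for e in self.entries:
            if predicate(e["payload"]):
                e["suppressed"] = True

    def recall(self, k=5, threshold=0.05):
        # Surface the k strongest memories that have not faded or been suppressed.
        live = [(self.strength(e), e["payload"]) for e in self.entries]
        live = [pair for pair in live if pair[0] > threshold]
        return sorted(live, key=lambda pair: pair[0], reverse=True)[:k]

# Toy usage: store interactions, then suppress anything flagged as sensitive.
mem = EmotionalMemory(half_life_s=60.0)
mem.store({"emotion": "joy", "topic": "music"})
mem.store({"emotion": "grief", "topic": "loss", "sensitive": True})
mem.actively_forget(lambda p: p.get("sensitive", False))
print(mem.recall())
```

Separating decay (automatic, time-driven) from suppression (deliberate, policy-driven) mirrors the essay's argument that ethical regulation of emotional memory cannot be left to fading alone.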