Zero Human Involvement in Gradient Updates! MIT's New Framework Lets AI Generate Its Own Fine-Tuning Data and Autonomously Update Its Weights
量子位 · 2025-10-14 04:08
Core Viewpoint
- The article introduces SEAL (Self-Adapting LLMs), a reinforcement learning framework from MIT that lets large language models autonomously update their weights and learn new knowledge without human intervention [1][4][6].

Group 1: SEAL Framework Overview
- SEAL uses a nested learning mechanism: an outer loop driven by reinforcement learning wrapped around an inner loop that performs the actual parameter updates [4][26] (a minimal sketch of the two loops follows this summary).
- The model generates its own fine-tuning data and self-update instructions, removing the dependence on externally supplied supervised data [6][25].

Group 2: Knowledge Incorporation Experiment
- Qwen2.5-7B was tested on the SQuAD dataset: the model generated training data from new paragraphs without ever seeing the corresponding questions [9][10].
- Fine-tuning with SEAL-generated data raised answer accuracy from 32.7% to 47.0%, beating fine-tuning on both the original paragraphs and GPT-4.1-generated data [14][15].
- On longer, multi-paragraph inputs, SEAL reached 58.2% accuracy, suggesting the approach generalizes to larger data-organization tasks [16].

Group 3: Few-Shot Learning Experiment
- LLaMA-3.2-1B-Instruct was evaluated on a subset of tasks from the ARC-AGI dataset [17][18].
- SEAL achieved a 72.5% success rate, far above the 0% of fixed few-shot prompts and the 20% of a random-sampling strategy [22][23].
- SEAL did not match the Oracle TTT upper bound of 100%, but it showed strong task adaptability through self-discovered learning paths [22].

Group 4: Mechanism of SEAL
- SEAL reads new information, rewrites it in its own words, and then runs gradient updates on the rewritten material, learning autonomously [25].
- For each input, the model emits a "self-edit": an instruction describing how to update itself, covering both the information to extract and the training parameters to use [28][29] (an illustrative self-edit appears in the second sketch below).
- The outer loop is optimized with ReSTEM, a non-standard reinforcement learning method built on filtered sampling and behavior cloning, which keeps only the self-edit strategies that actually improve downstream performance [33][36] (the third sketch below shows this filter-then-clone step).
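To make the nested structure concrete, here is a minimal sketch of SEAL's two loops. All helper names (generate_self_edit, apply_gradient_update, qa_accuracy, seal_step) are hypothetical stand-ins for real model and training calls, not names from the paper; the stubs only illustrate the control flow of outer-loop proposal, inner-loop update, and reward by downstream accuracy.

```python
# A minimal sketch of SEAL's nested loops; every helper is a stub for illustration.
import random

def generate_self_edit(context: str) -> dict:
    """Outer loop, step 1 (hypothetical): the model rewrites `context`
    into training sentences plus its own fine-tuning hyperparameters."""
    return {"sentences": [f"Restated: {context}"], "lr": 1e-5, "epochs": 3}

def apply_gradient_update(weights: dict, self_edit: dict) -> dict:
    """Inner loop (hypothetical): a supervised fine-tuning pass on the
    self-edit data; stubbed here as simple bookkeeping."""
    return {**weights, "updates": weights.get("updates", 0) + self_edit["epochs"]}

def qa_accuracy(weights: dict, questions: list[str]) -> float:
    """Reward signal (hypothetical): accuracy on held-out questions about
    the new context, measured after the inner update; stubbed as random."""
    return random.random()

def seal_step(weights, context, questions, n_candidates=4):
    # Sample several candidate self-edits, apply each via the inner loop,
    # and score by downstream accuracy; return the best candidate.
    scored = []
    for _ in range(n_candidates):
        edit = generate_self_edit(context)
        reward = qa_accuracy(apply_gradient_update(weights, edit), questions)
        scored.append((reward, edit))
    return max(scored, key=lambda t: t[0])

if __name__ == "__main__":
    reward, edit = seal_step({}, "Paris is the capital of France.",
                             ["What is the capital of France?"])
    print(f"best self-edit earned reward {reward:.2f}: {edit['sentences'][0]}")
```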
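The article does not pin the self-edit down to a single schema. As a hypothetical illustration only, a self-edit for a SQuAD-style paragraph might pair model-derived implications with the training parameters the model chooses for its own update; the field names below are invented for this sketch.

```python
# Hypothetical shape of one self-edit; field names are illustrative,
# not taken from the paper.
self_edit = {
    # Information extraction: implications the model derives from the passage.
    "training_data": [
        "The Apollo program was run by NASA.",
        "Apollo 11 landed on the Moon in 1969.",
    ],
    # Self-chosen optimization settings for the inner-loop gradient update.
    "hyperparameters": {"learning_rate": 1e-5, "epochs": 3, "lora_rank": 16},
}
```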
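Finally, the ReSTEM step described in Group 4 can be read as rejection sampling followed by supervised fine-tuning: sample self-edits, keep only those whose inner-loop update clears a reward baseline, then behavior-clone the survivors so the model proposes better edits next round. The sketch below is self-contained; its helpers (propose_edit, inner_update, reward_fn, sft) are hypothetical stubs.

```python
# ReSTEM as filter-then-clone, sketched with self-contained stubs.
import random

def propose_edit(context: str) -> str:
    return f"Restated: {context}"                 # stub self-edit generator

def inner_update(weights: dict, edit: str) -> dict:
    return {**weights, "last_edit": edit}         # stub gradient update

def reward_fn(weights: dict) -> float:
    return random.random()                        # stub downstream accuracy

def sft(weights: dict, examples: list) -> dict:
    # Behavior cloning: in the real method, maximize the likelihood of the
    # surviving self-edits; stubbed here as bookkeeping.
    return {**weights, "cloned": weights.get("cloned", 0) + len(examples)}

def restem_round(weights, contexts, n_samples=4, baseline=0.5):
    """One ReSTEM iteration: sample edits, keep only those whose inner
    update clears the reward baseline, then clone the survivors."""
    survivors = []
    for ctx in contexts:
        for _ in range(n_samples):
            edit = propose_edit(ctx)
            if reward_fn(inner_update(weights, edit)) > baseline:
                survivors.append((ctx, edit))     # filtered sampling
    return sft(weights, survivors)

if __name__ == "__main__":
    w = restem_round({}, ["Paris is the capital of France."])
    print(f"cloned {w['cloned']} surviving self-edits this round")
```

The design point the article emphasizes is that no gradient flows through the self-edit generator directly; filtering plus cloning is what steers the model toward self-edits that actually improve its post-update performance.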