Alibaba Tongyi Releases a New Parallel Computation Strategy: A 1.6B Model Matches 4.4B Performance, with Memory Consumption Down 95%
Alibaba (US:BABA) QbitAI · 2025-05-28 04:22

Core Viewpoint
- The article introduces PARSCALE, a new scaling law for large language models (LLMs) that enhances model capability by scaling parallel computation rather than parameters, without significantly increasing memory and time costs [1][4].

Group 1: Model Performance and Efficiency
- For a 1.6 billion parameter model, PARSCALE achieves performance close to that of a 4.4 billion parameter model while requiring only about 1/22 of the extra memory and 1/6 of the extra latency that an equivalent parameter increase would incur [2][18].
- On the GSM8K mathematical reasoning task, P=8 yields a 34% improvement over the baseline for a 1.8 billion parameter model, significantly surpassing the gains from parameter expansion [20].

Group 2: Technical Innovations
- The new paradigm is inspired by the dual-path inference mechanism of CFG (classifier-free guidance), which improves decision diversity and accuracy without adding model parameters [6][11].
- PARSCALE generalizes CFG's two fixed paths into P learnable parallel paths and fuses their outputs through dynamic aggregation, turning the amount of parallel computation into a scalable axis (a minimal sketch of this mechanism follows the summary) [15][29].

Group 3: Training Strategy
- Training proceeds in two phases: the first phase is conventional pre-training until convergence; the second phase freezes the main parameters and trains only the prefix embeddings and aggregation weights [23][24].
- The P=8 model's 34% GSM8K improvement shows that a small amount of data suffices to activate the parallel paths, cutting training costs by roughly 98% [25].

Group 4: Adaptability to Existing Models
- The research team applied continued pre-training and parameter-efficient fine-tuning (PEFT) to the Qwen-2.5-3B model, adjusting only the prefixes and aggregation weights [27].
- The PEFT setting yields a 15% improvement on code generation (HumanEval+), confirming that P can be adjusted dynamically while the main parameters stay frozen (see the freezing sketch after the mechanism sketch below) [28].
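The mechanism summarized in Group 2 (P learnable prefix paths sharing one backbone, fused by learned aggregation weights) can be illustrated with a minimal PyTorch sketch. This is a toy under stated assumptions, not the Qwen team's implementation: `ToyBackbone`, `ParScaleWrapper`, and all hyperparameters are hypothetical stand-ins, and the toy transformer omits causal masking for brevity.

```python
# Minimal sketch of the parallel-scaling idea described above: replicate the input
# into P streams, give each stream its own learnable prefix, run all streams through
# one shared backbone, and fuse the P output distributions with a learned aggregator.
# All class and parameter names here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyBackbone(nn.Module):
    """Stand-in for a decoder-only LM: embeddings -> transformer blocks -> logits."""

    def __init__(self, vocab_size=1000, d_model=64, n_layers=2, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)


class ParScaleWrapper(nn.Module):
    """Runs P prefix-conditioned copies of the same backbone and aggregates them."""

    def __init__(self, backbone, d_model=64, P=8, prefix_len=4):
        super().__init__()
        self.backbone, self.P = backbone, P
        # P distinct learnable prefixes: the only per-path parameters.
        self.prefixes = nn.Parameter(torch.randn(P, prefix_len, d_model) * 0.02)
        # Dynamic aggregation head: produces per-token weights over the P streams.
        self.agg = nn.Linear(d_model, 1)

    def forward(self, input_ids):                          # (B, T)
        B, T = input_ids.shape
        x = self.backbone.embed(input_ids)                 # (B, T, D)
        # Replicate the sequence P times and prepend a different prefix to each copy.
        x = x.unsqueeze(1).expand(B, self.P, T, -1)        # (B, P, T, D)
        pre = self.prefixes.unsqueeze(0).expand(B, -1, -1, -1)
        x = torch.cat([pre, x], dim=2)                     # (B, P, prefix+T, D)
        x = x.reshape(B * self.P, -1, x.size(-1))          # fold P into the batch dim
        h = self.backbone.blocks(x)[:, -T:, :]             # drop prefix positions
        logits = self.backbone.lm_head(h)                  # (B*P, T, V)
        logits = logits.reshape(B, self.P, T, -1)
        h = h.reshape(B, self.P, T, -1)
        # Learned softmax weights decide how much each path contributes per token.
        w = F.softmax(self.agg(h).squeeze(-1), dim=1)      # (B, P, T)
        return (w.unsqueeze(-1) * logits).sum(dim=1)       # (B, T, V)


if __name__ == "__main__":
    model = ParScaleWrapper(ToyBackbone(), P=8)
    out = model(torch.randint(0, 1000, (2, 16)))
    print(out.shape)                                       # torch.Size([2, 16, 1000])
```

Because the P streams are folded into the batch dimension, the extra cost is mostly parallel compute on the same weights, which is consistent with the article's claim that memory grows far more slowly than it would under parameter scaling.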
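Groups 3 and 4 both come down to the same mechanical step: freeze the backbone and optimize only the prefix embeddings and the aggregation head. A hedged sketch of that step, continuing the hypothetical `ParScaleWrapper` above (training loop and data omitted):

```python
# Sketch of the second training phase / PEFT adaptation as summarized above (an
# assumption about the exact recipe, reusing the illustrative classes from the
# previous snippet): freeze every backbone parameter so that only the P prefixes
# and the aggregation head receive gradients.
import torch

model = ParScaleWrapper(ToyBackbone(), P=8)

for p in model.backbone.parameters():      # phase 2 / PEFT: backbone stays fixed
    p.requires_grad_(False)

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)

print(sum(p.numel() for p in trainable), "trainable parameters out of",
      sum(p.numel() for p in model.parameters()))
```

Since the trainable set is only the prefixes plus one linear layer, this is why the article can claim that a comparatively small amount of data is enough to activate the parallel paths and that P can be changed without retraining the main model.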