ParScale

A passionate brainstorm in a PhD dorm reinvents the Scaling Law? Qwen and Zhejiang University team up on a new law that cuts inference memory by 95.5%!
AI前线 · 2025-05-21 10:04
Core Viewpoint
- Alibaba's research team, in collaboration with Zhejiang University, has proposed a new scaling law called the Parallel Scaling Law (ParScale), which enhances the capabilities of large models during training and inference by increasing parallel computation without adding model parameters, resulting in higher inference efficiency [1][3][19].

Summary by Sections

Introduction of ParScale
- ParScale allows more powerful models to be deployed in low-resource scenarios by reusing existing parameters to expand parallel computation, and it is applicable to any model structure, optimization process, data, or task [1][19].
- Compared with parameter scaling, the memory increase from ParScale is only 4.5%, while the latency increase is 16.7% [1][19].

Comparison with Traditional Scaling Methods
- Traditional scaling methods include parameter expansion and inference-time scaling, both of which carry significant resource demands [3][4].
- ParScale instead introduces multiple parallel streams during training and inference: a single input is converted into multiple inputs for forward propagation, and their outputs are then combined into a single output [5][10].

Implementation of ParScale
- The implementation involves three steps: diversifying the input transformations, processing the streams in parallel, and dynamically aggregating the outputs; a minimal code sketch of these steps appears at the end of this summary [13].
- A two-stage post-training strategy is employed to manage the increased training cost that comes with the number of parallel streams, significantly reducing overall training cost while maintaining the performance gains (see the budget sketch at the end of this summary) [12][14].

Performance Metrics
- As the number of parallel streams (P) increases, model performance improves across various benchmarks, particularly on tasks requiring strong reasoning abilities [15][16].
- For instance, with P increased to 8, the model showed a 4.3% improvement on coding tasks, a 7.3% improvement on math tasks, and a 10% improvement on the GSM8K benchmark [15].

Application and Future Prospects
- ParScale is particularly suitable for edge devices such as smartphones, cars, and robots, where memory resources are limited [17][19].
- The research team plans to explore ParScale's application to more model architectures and larger datasets, noting its potential to complement existing methods such as MoE architectures [19].
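
The snippet below is a minimal, illustrative sketch of the three-step forward pass described under Implementation of ParScale: diversify one input into P variants, run the P variants through the same shared backbone in parallel, and dynamically aggregate the P outputs. It assumes a PyTorch-style model; the names (ParScaleWrapper, stream_embed, aggregator, num_streams) are placeholders rather than the authors' code, and the additive per-stream offset merely stands in for whatever learnable input transformation the paper actually uses.

```python
# Sketch of the ParScale idea: P input transformations, P parallel forward passes
# through one shared backbone, and learned dynamic aggregation of the P outputs.
# All names below are illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn


class ParScaleWrapper(nn.Module):
    def __init__(self, backbone: nn.Module, hidden_dim: int, num_streams: int = 4):
        super().__init__()
        self.backbone = backbone              # shared parameters, reused by every stream
        self.num_streams = num_streams
        # Step 1: diversify the input -- here a distinct learnable offset per stream
        # (a stand-in for the paper's input transformations).
        self.stream_embed = nn.Parameter(torch.randn(num_streams, hidden_dim) * 0.02)
        # Step 3: dynamic aggregation -- predict a weight for each stream's output.
        self.aggregator = nn.Linear(hidden_dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, hidden_dim)
        batch, seq, dim = x.shape
        # Step 1: turn one input into P variants.
        variants = x.unsqueeze(0) + self.stream_embed[:, None, None, :]   # (P, batch, seq, dim)
        # Step 2: run the P variants through the shared backbone in one batched call,
        # so the extra cost is parallel compute, not extra parameters.
        flat = variants.reshape(self.num_streams * batch, seq, dim)
        outs = self.backbone(flat).reshape(self.num_streams, batch, seq, dim)
        # Step 3: dynamically aggregate the P outputs with learned softmax weights.
        weights = torch.softmax(self.aggregator(outs), dim=0)             # (P, batch, seq, 1)
        return (weights * outs).sum(dim=0)                                # (batch, seq, dim)


if __name__ == "__main__":
    # Toy backbone standing in for a transformer; shapes only, for illustration.
    backbone = nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 64))
    model = ParScaleWrapper(backbone, hidden_dim=64, num_streams=4)
    y = model(torch.randn(2, 16, 64))
    print(y.shape)  # torch.Size([2, 16, 64])
```

Batching the P streams into a single forward call is what keeps the extra cost on the compute side rather than the parameter side, which is consistent with the summary's claim that memory grows far more slowly under ParScale than under parameter scaling.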
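
And the following is a back-of-the-envelope sketch of why the two-stage post-training strategy keeps training affordable: only a small tail of the token budget is trained with P parallel streams, so the P-fold compute multiplier applies to a small fraction of tokens. The 2% split, the 1T-token budget, and the helper name two_stage_token_budget are illustrative assumptions, not figures reported in the article.

```python
# Hedged sketch: split the token budget so most tokens train at P=1 and only a small
# tail trains with P parallel streams. The numbers here are assumptions for illustration.
def two_stage_token_budget(total_tokens: int, parscale_fraction: float = 0.02, num_streams: int = 8):
    """Return (stage-1 tokens at P=1, stage-2 tokens at P=num_streams, cost ratio)."""
    stage2 = int(total_tokens * parscale_fraction)
    stage1 = total_tokens - stage2
    # Compare against the hypothetical cost of training every token with P streams.
    full_parscale_cost = total_tokens * num_streams
    two_stage_cost = stage1 * 1 + stage2 * num_streams
    return stage1, stage2, two_stage_cost / full_parscale_cost


if __name__ == "__main__":
    s1, s2, ratio = two_stage_token_budget(1_000_000_000_000)  # an assumed 1T-token budget
    print(f"stage 1: {s1:,} tokens, stage 2: {s2:,} tokens, "
          f"~{ratio:.1%} of the cost of full P-stream training")
```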