Core Insights
- The article discusses the limitations of traditional scaling methods for large models, emphasizing the need for new approaches that decouple parameter count from computational cost [2][4][19]
- It introduces the JTok and JTok-M modules, which use token-indexed parameters to increase model capacity without a significant increase in computation [3][5][10]
- The findings suggest that JTok-M delivers substantial performance improvements while reducing computational cost by approximately 35% [5][24][26]

Summary by Sections

Traditional Scaling Limitations
- Traditional scaling binds parameters to computational requirements, so both grow in lockstep as model size increases [2][19]
- The MoE (Mixture of Experts) approach, while promising, has drawbacks such as lower sample efficiency and increased memory and communication overhead [2][3]

Introduction of JTok and JTok-M
- JTok introduces a new scaling dimension by assigning modulation vectors to each token, increasing model capacity at negligible additional computational cost [3][10]
- JTok-M refines this with context-aware dynamic modulation, further improving performance while maintaining efficiency [14][16]

Performance and Efficiency Gains
- JTok-M shows significant performance improvements across a range of tasks, with notable accuracy gains on models from 650M to 61B parameters [5][39]
- The approach achieves similar or better performance than traditional models at lower computational cost [5][26][44]

Theoretical Framework and Validation
- The article presents a theoretical framework that integrates JTok-M into existing scaling laws, showing that it shifts the performance-computation curve downward [24][25]
- Empirical results confirm that JTok-M maintains stable performance gains across model sizes and training budgets, validating its scalability [26][29]

Practical Applications and Future Directions
- JTok and JTok-M have been tested on various downstream tasks, showing improvements in knowledge retention, reasoning, and mathematical problem solving [35][39]
- The innovations in JTok-M mark a significant step toward redefining scaling laws for large models, offering a sustainable path for future development [34][32]
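The claim that JTok-M "shifts the performance-computation curve downward" can be read through the conventional compute power law; the form below is the standard one, not an equation quoted from the article, and the coefficients are illustrative.

```latex
% Standard compute scaling law: loss L as a function of training compute C
L(C) = a \cdot C^{-b}

% "Shifting the curve downward" = a smaller coefficient at the same exponent:
L_{\mathrm{JTok\text{-}M}}(C) = a' \cdot C^{-b}, \qquad a' < a

% Equivalently, matching a baseline loss with less compute:
L_{\mathrm{JTok\text{-}M}}(C') = L(C)
  \;\Longrightarrow\;
  C' = \left(\tfrac{a'}{a}\right)^{1/b} C < C
```

Under this reading, the article's reported ~35% compute reduction at matched quality corresponds to $C' \approx 0.65\,C$.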
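The article does not publish JTok's internals, so the following is only a minimal illustrative sketch of the idea as described: "token-indexed parameters" are modeled as a lookup table of per-token modulation vectors applied elementwise to hidden states (adding parameters but almost no FLOPs), and JTok-M's "context-aware dynamic modulation" is modeled as a small context-conditioned adjustment. All names, shapes, and the context mechanism (`jtok`, `jtok_m`, `context_proj`, mean pooling) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model, seq_len = 100, 8, 4

# JTok (sketch): one learned modulation vector per token id, stored in a
# lookup table. Initialized near 1.0 so modulation starts as an identity.
# The table adds vocab_size * d_model parameters but lookups are ~free.
modulation_table = 1.0 + 0.01 * rng.standard_normal((vocab_size, d_model))

def jtok(hidden, token_ids):
    """Static token-indexed modulation: scale each position's hidden
    vector elementwise by the modulation vector of its token id."""
    return hidden * modulation_table[token_ids]

# JTok-M (hypothetical sketch): shift the static per-token vectors by a
# projection of a sequence-level context signal (here, the mean hidden
# state), making the modulation context-aware.
context_proj = 0.01 * rng.standard_normal((d_model, d_model))

def jtok_m(hidden, token_ids):
    """Context-aware dynamic modulation (illustrative form only)."""
    context = hidden.mean(axis=0) @ context_proj  # (d_model,)
    return hidden * (modulation_table[token_ids] + context)

hidden = rng.standard_normal((seq_len, d_model))
ids = np.array([3, 17, 3, 42])

out_static = jtok(hidden, ids)
out_dynamic = jtok_m(hidden, ids)
print(out_static.shape, out_dynamic.shape)  # (4, 8) (4, 8)
```

Note how repeated token ids (positions 0 and 2 both hold token 3) receive identical static modulation under `jtok`, whereas `jtok_m` lets the same token be modulated differently in different contexts.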
A third scaling path beyond Dense and MoE: SJTU proposes the JTok module, cutting compute by one third
机器之心 · 2026-03-02 15:16