Core Insights
- The article discusses Continuous Autoregressive Language Models (CALM), a new method proposed by Tencent WeChat AI and Tsinghua University that aims to improve the efficiency of large language models (LLMs) by predicting multiple tokens at once as a single continuous vector instead of one discrete token at a time [3][11][12].

Group 1: Efficiency Challenges of LLMs
- The efficiency problems of LLMs stem from their reliance on discrete token sequences for autoregressive prediction, which leads to high computational cost and low information density per token [8][10].
- Discrete tokens carry little information: a 32K vocabulary yields at most 15 bits per token, a direct bottleneck on efficiency (see the worked calculation after this summary) [10][11].
- Moving from discrete to continuous representations sharply reduces the number of generation steps, improving computational efficiency while maintaining performance [12][21].

Group 2: Implementation of CALM
- CALM employs a high-fidelity autoencoder to compress K tokens into a single continuous vector, achieving over 99.9% reconstruction accuracy (a structural sketch follows below) [11][21].
- The architecture adds a generative head that outputs the next continuous vector from the Transformer's hidden states, so each generation step is a single forward pass [24][25].
- For a more stable input signal, the predicted vector is first decoded back into discrete tokens before being fed into the next step (see the decoding-loop sketch below) [26].

Group 3: Performance Evaluation
- The Brier Score is introduced as a new evaluation metric; it can be estimated purely from model samples via Monte Carlo methods and applies to both traditional and new language models (an estimator sketch follows below) [29][32].
- Experimental results indicate that CALM models, such as CALM-M with 371M parameters, require significantly fewer training and inference FLOPs than comparable Transformer baselines while achieving comparable performance [37][38].

Group 4: Future Directions
- The article highlights potential research directions, including strengthening the autoencoder's semantic understanding, exploring more robust end-to-end architectures, and developing efficient sampling algorithms to reduce inference cost [43][45].
- A new scaling law incorporating the semantic bandwidth K is suggested as a macro-level research direction for further optimizing language-model efficiency [44].
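To make the information-density claim in Group 1 concrete, the arithmetic is simply that one token drawn from a 32K vocabulary carries at most log2(32,768) = 15 bits, no matter how much compute the forward pass that produced it consumed. The short Python below spells this out; the chunk size K = 4 is an illustrative assumption, not a value from the article:

```python
import math

# Upper bound on information per generation step for a discrete LM:
# log2(vocab_size) bits, regardless of per-step compute.
vocab_size = 32_768
bits_per_token = math.log2(vocab_size)
print(bits_per_token)        # 15.0

# Packing K tokens into one continuous vector multiplies the per-step
# semantic bandwidth by K (K = 4 here is an illustrative assumption).
K = 4
print(K * bits_per_token)    # 60.0
```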
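The compress-K-tokens-into-one-vector step from Group 2 can be pictured with a minimal PyTorch-style sketch. All module names, dimensions, and the simple linear encode/decode here are illustrative assumptions rather than the paper's actual architecture; only the interface (encode K tokens to one vector, decode one vector back to K tokens) follows the summary:

```python
import torch
import torch.nn as nn

class TokenChunkAutoencoder(nn.Module):
    """Compresses a chunk of K tokens into one continuous vector and
    reconstructs them. Names and sizes are illustrative assumptions;
    the paper's stated target is >99.9% reconstruction accuracy."""

    def __init__(self, vocab_size=32_768, d_model=512, d_latent=128, K=4):
        super().__init__()
        self.K = K
        self.embed = nn.Embedding(vocab_size, d_model)
        self.enc = nn.Linear(K * d_model, d_latent)  # K embeddings -> 1 vector
        self.dec = nn.Linear(d_latent, K * d_model)  # 1 vector -> K token slots
        self.unembed = nn.Linear(d_model, vocab_size)

    def encode(self, tokens):                 # tokens: (batch, K) int ids
        e = self.embed(tokens)                # (batch, K, d_model)
        return self.enc(e.flatten(1))         # (batch, d_latent)

    def decode(self, z):                      # z: (batch, d_latent)
        h = self.dec(z).view(-1, self.K, self.embed.embedding_dim)
        return self.unembed(h)                # (batch, K, vocab) logits

ae = TokenChunkAutoencoder()
chunk = torch.randint(0, 32_768, (2, 4))        # two chunks of K=4 tokens
recon = ae.decode(ae.encode(chunk)).argmax(-1)  # (2, 4) reconstructed ids
```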
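Group 2's two remaining pieces, single-step vector generation and decoding back to discrete tokens before the next step, combine into the decoding loop below. `transformer` and `gen_head` are hypothetical stand-ins for the backbone and the generative head, and greedy argmax decoding is a simplification; the loop only illustrates the control flow the summary describes:

```python
import torch

@torch.no_grad()
def calm_generate(transformer, gen_head, autoencoder, prompt_vecs, n_steps):
    """Illustrative CALM decoding loop: each step predicts ONE continuous
    vector standing for K tokens, so n_steps steps yield n_steps * K tokens.
    `transformer` and `gen_head` are hypothetical stand-ins."""
    vecs, out_tokens = list(prompt_vecs), []
    for _ in range(n_steps):
        h = transformer(torch.stack(vecs).unsqueeze(0))[:, -1]  # last hidden state
        z = gen_head(h)                            # next continuous vector
        tokens = autoencoder.decode(z).argmax(-1)  # back to K discrete tokens
        out_tokens.extend(tokens[0].tolist())
        # Feed back the re-encoded *decoded* tokens rather than the raw
        # predicted vector: a cleaner, more stable input signal [26].
        vecs.append(autoencoder.encode(tokens)[0])
    return out_tokens
```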
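On Group 3: the Brier score of a predicted distribution p against an observed outcome y* is BS = Σ_y (p(y) − 1[y = y*])² = Σ_y p(y)² − 2·p(y*) + 1, and each term is estimable from samples alone: for two independent draws x₁, x₂ from the model, E[1[x₁ = x₂]] = Σ_y p(y)² and E[1[x₁ = y*]] = p(y*). This is plausibly what makes the Monte Carlo estimation mentioned in the summary work; a minimal sketch under that assumption:

```python
import random

def brier_mc(sample_fn, y_true, n=10_000):
    """Unbiased Monte Carlo estimate of the Brier score
    sum_y p(y)^2 - 2*p(y_true) + 1, using only samples from the model
    (no access to its probabilities). sample_fn() draws one outcome."""
    collision = hit = 0
    for _ in range(n):
        x1, x2 = sample_fn(), sample_fn()  # two independent draws
        collision += (x1 == x2)            # expectation: sum_y p(y)^2
        hit += (x1 == y_true)              # expectation: p(y_true)
    return collision / n - 2 * hit / n + 1.0

# Toy check against the closed-form value for a known distribution.
p = {"a": 0.7, "b": 0.2, "c": 0.1}
draw = lambda: random.choices(list(p), weights=list(p.values()))[0]
exact = sum(q * q for q in p.values()) - 2 * p["a"] + 1  # y_true = "a"
print(exact, brier_mc(draw, "a"))  # both approximately 0.14
```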
WeChat and Tsinghua's continuous autoregressive model CALM: a new paradigm realizing the shift from "discrete tokens" to "continuous vectors"
机器之心 · 2025-11-07 06:02