Positional Embedding
Throw RoPE away and AI understands long context better: the Transformer authors' team open-sources a new pre-training method for large models
36Kr · 2026-01-13 11:01
Core Insights
- A new technique called DroPE, developed by a research team led by Llion Jones, one of the core authors of the Transformer architecture, addresses the challenge of long-text processing in large models [1][14]
- DroPE enables seamless zero-shot context expansion without expensive long-context training, requiring less than 1% of the pre-training budget for model recalibration [1][10]

Technology Overview
- DroPE can be seen as a method that discards positional embeddings to extend context, humorously dubbed "NoRoPE" by netizens [3]
- The technique uses RoPE (Rotary Position Embedding) as temporary scaffolding during pre-training to keep training stable and efficient, then discards the positional embeddings at inference time (see the sketch at the end of this summary) [8][5]

Performance Metrics
- Experiments on several models, including a 5M-parameter model, the SmolLM family (360M/1.7B), and Llama2-7B, showed that DroPE improved the average LongBench score of the base SmolLM by more than 10 times [10]
- In the needle-in-a-haystack (NIAH) evaluation, the DroPE model reached a recall rate of 74.92%, significantly surpassing traditional RoPE scaling methods [10]

Comparative Analysis
- Performance comparisons across methods indicate that DroPE outperforms other techniques on a range of tasks, achieving an average score of 30.52 on the LongBench benchmark [11]
- Even with only 0.5% of the pre-training budget used for recalibration, DroPE delivered strong results on long-context question answering and summarization tasks [11]

Company Background
- The team behind DroPE is from Sakana AI, co-founded by Llion Jones and former Google senior scientist David Ha, which previously drew attention for building the first "AI scientist" capable of producing complete academic papers [14][16]
- Sakana AI has also collaborated with MIT researchers on the Digital Red Queen algorithm, showcasing the potential of large language models in adversarial program evolution [18]
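To make the training/inference split described under Technology Overview concrete, here is a minimal sketch. It is not Sakana AI's released implementation; the function names, the rotate-half RoPE variant, and the `use_rope` flag are illustrative assumptions. It only shows the described recipe: attention that applies the rotary rotation during pre-training and runs position-free at inference.

```python
# Illustrative sketch only (not Sakana AI's code): attention where the RoPE
# rotation is applied during pre-training and skipped at inference.
import torch
import torch.nn.functional as F


def rope_rotate(x, base=10000.0):
    # x: (batch, heads, seq, head_dim); rotate channel pairs by position-dependent angles
    *_, seq, dim = x.shape
    half = dim // 2
    pos = torch.arange(seq, dtype=x.dtype, device=x.device)
    inv_freq = base ** (-torch.arange(half, dtype=x.dtype, device=x.device) / half)
    angles = pos[:, None] * inv_freq[None, :]            # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # rotate-half formulation of RoPE
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


def attention(q, k, v, use_rope: bool):
    # use_rope=True: rotate q/k with RoPE (pre-training phase, for stability).
    # use_rope=False: skip the rotation, i.e. run attention position-free.
    if use_rope:
        q, k = rope_rotate(q), rope_rotate(k)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)


if __name__ == "__main__":
    q = k = v = torch.randn(1, 4, 16, 64)                # (batch, heads, seq, head_dim)
    out_train = attention(q, k, v, use_rope=True)        # pre-training: RoPE as scaffolding
    out_infer = attention(q, k, v, use_rope=False)       # inference: positional embedding dropped
    print(out_train.shape, out_infer.shape)
```

The only difference between the two calls is whether the rotation is applied; the article's claim is that, after a brief recalibration costing under 1% of the pre-training budget, the position-free path generalizes zero-shot to contexts far longer than those seen in pre-training.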