RoPE
Throw RoPE Away and AI Understands Long Context Better: Transformer Author's Team Open-Sources a New Large-Model Pre-Training Method
36Kr · 2026-01-13 11:01
Core Insights
- A new technique called DroPE, developed by a research team led by Llion Jones, one of the core authors of the Transformer architecture, addresses the challenges of long-text processing in large models [1][14]
- DroPE enables seamless zero-shot context extension without expensive long-context training, requiring less than 1% of the pre-training budget for model recalibration [1][10]
Technology Overview
- DroPE can be seen as a method that discards positional embeddings to extend context, humorously dubbed "NoRoPE" by netizens [3]
- During pre-training, RoPE (Rotary Position Embedding) is used as a temporary training aid to ensure stability and efficiency; at inference time the positional embeddings are discarded (see the sketch after this summary) [8][5]
Performance Metrics
- Experiments on a 5M-parameter model, the SmolLM family (360M/1.7B), and Llama2-7B showed that DroPE improved the base SmolLM's average LongBench score by more than 10x [10]
- In the NIAH (needle-in-a-haystack) evaluation, the DroPE model reached a recall of 74.92%, significantly surpassing traditional RoPE scaling methods [10]
Comparative Analysis
- Across methods and tasks, DroPE outperformed other techniques, reaching an average LongBench score of 30.52 [11]
- Even with only 0.5% of the pre-training budget spent on recalibration, DroPE performed strongly on long-context question answering and summarization [11]
Company Background
- The team behind DroPE is from Sakana AI, co-founded by Llion Jones and former Google senior scientist David Ha, which gained attention for building the first AI scientist capable of producing complete academic papers [14][16]
- Sakana AI has also collaborated with MIT researchers on the Digital Red Queen algorithm, showcasing the potential of large language models in adversarial program evolution [18]
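The summary above describes the training-versus-inference split only in words: RoPE is applied while pre-training and then dropped when the model is served. The snippet below is a minimal PyTorch sketch of that idea, not Sakana AI's released implementation; the names (`apply_rope`, `attention`) and the `use_rope` switch are illustrative assumptions, and the brief recalibration step the article mentions is omitted.

```python
import torch

def apply_rope(x, positions, base=10000.0):
    # Standard rotary embedding: rotate channel pairs by an angle that
    # grows linearly with the token position.
    half = x.shape[-1] // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = positions[:, None].float() * freqs[None, :]   # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def attention(q, k, v, use_rope=True):
    # Single-head causal attention. use_rope=True mirrors the pre-training
    # setup described above; use_rope=False drops the positional rotation,
    # which is what the summary says DroPE does at inference time.
    seq_len = q.shape[0]
    if use_rope:
        pos = torch.arange(seq_len)
        q, k = apply_rope(q, pos), apply_rope(k, pos)
    scores = (q @ k.T) / q.shape[-1] ** 0.5
    mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
    scores = scores.masked_fill(mask, float("-inf"))
    return scores.softmax(dim=-1) @ v

# Toy usage: the same weights are run with RoPE during training and
# without it afterwards; the article's short recalibration pass is omitted.
q, k, v = (torch.randn(16, 64) for _ in range(3))
out_train = attention(q, k, v, use_rope=True)    # pre-training phase
out_infer = attention(q, k, v, use_rope=False)   # inference, no positional embedding
```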
Father of LSTM Leads Team to Build PoPE: Ending RoPE's Generalization Problem with a Polar-Coordinate Evolution of the Transformer
机器之心 · 2026-01-02 01:55
Core Viewpoint
- The article discusses Polar Coordinate Position Embedding (PoPE), a new approach that addresses the limitations of the existing Rotary Position Embedding (RoPE) method in Transformer architectures, particularly by decoupling content and positional information for improved model performance [1][2].
Group 1: RoPE Issues
- RoPE entangles content and position information, which can degrade model performance, especially on tasks that require matching on these factors independently [1][4].
- RoPE is the preferred way to incorporate positional information in many advanced models, but it struggles on tasks that require a clean separation of content and position [5][19].
Group 2: PoPE Solution
- PoPE removes the confusion between content and position, leading to significantly better performance on diagnostic tasks that index solely on either content or position [2][10].
- PoPE defines the attention score differently so that content and position are decoupled, improving learning efficiency (see the sketch after this summary) [12][13].
Group 3: Performance Comparison
- On indirect indexing tasks, PoPE achieved an average accuracy of 94.82% versus 11.16% for RoPE, demonstrating its superior ability to separate content from positional information [18][19].
- On music and genomic sequence modeling, PoPE outperformed RoPE with lower negative log-likelihood (NLL) across various datasets [20][22].
- On language modeling over the OpenWebText dataset, PoPE consistently achieved lower perplexity than RoPE across all model sizes [25][26].
Group 4: Generalization and Stability
- PoPE extrapolates well without fine-tuning or interpolation, and its performance remains stable as model size increases, unlike RoPE [31][32].
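The summary does not reproduce PoPE's attention formula, so the sketch below is only a toy contrast between an entangled and a decoupled score: `rope_score` shows how rotating queries and keys mixes content similarity with the relative offset, while `decoupled_score` and its `pos_weight` table are hypothetical names illustrating the separation the article credits to PoPE, not the paper's actual definition.

```python
import torch

def rope_score(q, k, m, n, base=10000.0):
    # RoPE logit between a query at position m and a key at position n:
    # both vectors are rotated before the dot product, so content
    # similarity and the relative offset (m - n) are entangled.
    half = q.shape[-1] // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    def rotate(x, pos):
        ang = pos * freqs
        x1, x2 = x[:half], x[half:]
        return torch.cat([x1 * ang.cos() - x2 * ang.sin(),
                          x1 * ang.sin() + x2 * ang.cos()])
    return rotate(q, m) @ rotate(k, n)

def decoupled_score(q, k, m, n, pos_weight):
    # Toy decoupled logit: one term depends only on content, a separate
    # additive term depends only on the relative position. Illustrates the
    # decoupling goal only; it is not the PoPE formulation.
    return q @ k + pos_weight[abs(m - n)]

q = torch.randn(64)
k = q.clone()                      # identical content, compared at two offsets
pos_weight = torch.randn(128)      # relative-position table (hypothetical)

# With RoPE the logit for identical content drifts as the offset changes.
print(rope_score(q, k, 0, 0).item(), rope_score(q, k, 0, 10).item())
# In the decoupled form the content term is constant; only the explicit
# position term moves with the offset.
print((q @ k).item(), pos_weight[0].item(), pos_weight[10].item())
```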