ByteDance's Seed team releases the looped language model Ouro, which "thinks" directly during pre-training, with Bengio among the authors
机器之心· 2025-11-04 03:45
Core Insights
- The article introduces Ouro, a new type of pre-trained model known as a Looped Language Model (LoopLM), developed by ByteDance's Seed team in collaboration with several institutions. Ouro builds reasoning capability directly into the pre-training phase rather than relying solely on post-training fine-tuning [1][6].

Group 1: Model Architecture and Design
- Ouro iterates computation in latent space, uses an entropy-regularization objective to learn a distribution over computation depth, and scales its training data to 7.7 trillion tokens, allowing reasoning capabilities to be learned directly during pre-training [1][6].
- The LoopLM architecture, inspired by the Universal Transformer, consists of a stack of N weight-shared layers that are applied multiple times within a single forward pass, enabling dynamic computation under a fixed parameter budget [10].
- The architecture includes an adaptive computation mechanism with a learned "exit gate" that lets the model terminate processing early on simpler inputs, conserving computational resources [10][15].

Group 2: Performance and Efficiency
- Ouro models with 1.4 billion and 2.6 billion parameters achieve performance comparable to standard Transformers with 4 billion and 8 billion parameters, a 2-3x improvement in parameter efficiency [6][8].
- On advanced reasoning benchmarks, the Ouro-Thinking series performs on par with or above larger baseline models across mathematical and scientific datasets [8].

Group 3: Training Process
- Training is multi-staged, consuming 7.7 trillion tokens in total: a general warm-up phase is followed by an initial stable training phase on 3 trillion tokens [12][13].
- Both parameter variants (1.4B and 2.6B) then undergo four further training stages: a second stable training phase, CT annealing, long-context training, and mid-training, culminating in a specialized reasoning supervised fine-tuning phase for the Ouro-Thinking models [13][15].
- Training stability was improved by reducing the number of loop steps from 8 to 4, balancing computational depth against stability [13].
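The looped forward pass with a learned exit gate described above can be sketched in a toy form. This is a minimal illustration under stated assumptions, not Ouro's actual implementation: `layer`, `exit_gate`, the scalar hidden state, and the 0.9 halt threshold are all invented for the sketch. Only the three ideas come from the article: reusing one set of weights across loop steps, halting early via a learned gate, and regularizing the entropy of the resulting halting distribution.

```python
import math

def layer(h, w):
    # Toy weight-shared "layer": a scalar transform standing in for a full
    # Transformer block; the same weight w is reused at every loop step.
    return math.tanh(w * h)

def exit_gate(h):
    # Toy learned exit gate: maps the hidden state to a halt probability.
    return 1.0 / (1.0 + math.exp(-h))

def looped_forward(x, w, max_steps=4, threshold=0.9):
    """Apply the same layer repeatedly (up to max_steps, as in Ouro's
    4-step loop); stop early once the cumulative halt probability from
    the exit gate crosses `threshold` (adaptive computation)."""
    h = x
    halted = 0.0      # cumulative probability of having already exited
    halt_probs = []   # per-step exit distribution (for the entropy term)
    step = 0
    for step in range(1, max_steps + 1):
        h = layer(h, w)                  # same parameters every iteration
        p_exit = exit_gate(h)
        halt_probs.append((1.0 - halted) * p_exit)
        halted += halt_probs[-1]
        if halted >= threshold:          # simple input -> exit early
            break
    halt_probs.append(max(0.0, 1.0 - halted))  # leftover mass, last step
    # Entropy of the halting distribution; an entropy-regularization
    # objective would encourage this distribution not to collapse.
    entropy = -sum(p * math.log(p) for p in halt_probs if p > 0)
    return h, step, entropy

h, steps_used, ent = looped_forward(x=0.5, w=2.0)
print(f"exited after {steps_used} steps, halting entropy {ent:.3f}")
```

The design point the sketch makes concrete: depth comes from reapplying one parameter set rather than stacking new layers, so compute can scale per-input while the parameter count stays fixed.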