Delethink
Compute Costs Plunge: The Markovian Thinker Arrives, Cutting LLM Reasoning Cost to Linear
36Kr · 2025-10-10 07:27
Core Insights
- The article discusses the effectiveness and high costs of using reinforcement learning to enhance reasoning capabilities in large language models (LLMs) [1]
- A new paradigm called the Markovian Thinker is introduced, which aims to limit the computational complexity associated with reasoning in LLMs by maintaining a fixed state size [4][20]

Group 1: Markovian Thinker Concept
- The core idea of the Markovian Thinker is to reconstruct the components of reinforcement learning so that the effective state size remains bounded regardless of the total thinking length [4]
- This approach allows longer reasoning processes to require only linear computational resources and constant memory, decoupling the duration of model thinking from the amount of context it must handle [4][20]

Group 2: Delethink Implementation
- Delethink is a reinforcement learning environment that organizes the reasoning process into fixed-size chunks, resetting context at the boundaries of these chunks [4][9]
- The implementation of Delethink results in linear scaling for both the generation and backpropagation phases, contrasting with the quadratic scaling seen in traditional LongCoT environments [6][15]

Group 3: Experimental Results
- Experiments show that even with an 8K chunk size, the DeepSeek R1-Distill 1.5B model trained with Delethink can reason up to 24K tokens, outperforming LongCoT-RL in mathematical benchmark tests [9][12]
- The model achieved 49% accuracy on a 96K token reasoning task with minimal additional training steps, demonstrating significant efficiency improvements [14][15]

Group 4: Implications for Future Models
- The success of the Markovian Thinker indicates that decoupling thinking length from context size could enable next-generation reasoning models to handle millions of tokens effectively [20]
- The findings suggest that non-quadratic complexity sequence architectures may greatly benefit reasoning models, as the thinking process can be effectively transformed into a Markovian style [20]
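The chunked process described above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `generate` is a hypothetical stand-in for a model call, and the `carry` parameter (a short tail of the previous chunk kept as the only state across a boundary) is an assumption about how the reset could work. The summary itself only states that context is reset at chunk boundaries.

```python
from typing import Callable, List

# Hypothetical interface (not from the article): given a prompt and a token
# budget, return up to `budget` newly generated tokens.
GenerateFn = Callable[[List[str], int], List[str]]


def chunked_think(
    generate: GenerateFn,
    query: List[str],
    chunk_size: int,
    carry: int,
    max_chunks: int,
    stop_token: str = "<eos>",
) -> List[str]:
    """Produce a long reasoning trace while every prompt stays bounded."""
    if not 0 < carry < chunk_size:
        raise ValueError("carry must satisfy 0 < carry < chunk_size")

    trace: List[str] = []  # full reasoning trace, kept only for the caller
    tail: List[str] = []   # assumed carryover: the only state crossing a boundary

    for _ in range(max_chunks):
        # Context reset: the prompt is rebuilt from the query and the tail,
        # so its length never exceeds len(query) + carry.
        prompt = query + tail
        budget = chunk_size - len(tail)
        new_tokens = generate(prompt, budget)
        trace.extend(new_tokens)
        if not new_tokens or stop_token in new_tokens:
            break
        tail = new_tokens[-carry:]

    return trace
```

Because the prompt is rebuilt from `query` and `tail` at every boundary, the context the model sees is capped at `len(query) + chunk_size` tokens however long `trace` grows, which is the property the summary attributes to the Markovian Thinker.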
Compute Costs Plunge! The Markovian Thinker Arrives, Cutting LLM Reasoning Cost to Linear
机器之心· 2025-10-10 06:36
Core Insights
- The article discusses the effectiveness and high costs associated with using reinforcement learning to enhance reasoning capabilities in large language models (LLMs) [1]
- A new paradigm called the Markovian Thinker is introduced, which aims to prevent quadratic growth in computational requirements by maintaining a fixed state size during reasoning [3][9]

Group 1: Markovian Thinker
- The Markovian Thinker redefines the structure of reinforcement learning to ensure that the effective state size remains bounded regardless of the total thinking length, leading to linear computational requirements [9][32]
- The Delethink framework exemplifies this approach by organizing the reasoning process into fixed-size chunks, resetting context at the boundaries of these chunks [10][12]

Group 2: Performance and Efficiency
- Experiments show that the Delethink framework allows models to think up to 24K tokens with significant performance improvements over traditional LongCoT methods, even achieving 49% accuracy on complex tasks with 96K tokens [20][23][26]
- The computational efficiency of Delethink is highlighted, requiring only 7 H100-months for training compared to 27 H100-months for LongCoT-RL at an average thinking length of 94K tokens [26]

Group 3: Implications for Future Models
- The success of the Markovian Thinker suggests that decoupling thinking length from context size could enable future reasoning models to handle millions of tokens effectively [32][33]
- The findings indicate that non-quadratic complexity architectures may significantly benefit reasoning models, allowing for more efficient processing of thought sequences [33]
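The quadratic-versus-linear claim above can be illustrated with a toy count of attention work. The sketch assumes a deliberately simplified cost model that is not taken from the article: each generated token attends once to every token in its current context (itself included), the query length and any carryover between chunks are ignored, and both function names are hypothetical.

```python
def longcot_attention_cost(n_tokens: int) -> int:
    """Token t attends to all t tokens so far: 1 + 2 + ... + n, quadratic in n."""
    return n_tokens * (n_tokens + 1) // 2


def chunked_attention_cost(n_tokens: int, chunk_size: int) -> int:
    """Context restarts every chunk_size tokens, so cost grows linearly in n."""
    full_chunks, remainder = divmod(n_tokens, chunk_size)
    per_chunk = chunk_size * (chunk_size + 1) // 2
    return full_chunks * per_chunk + remainder * (remainder + 1) // 2


if __name__ == "__main__":
    n = 96 * 1024  # thinking length in tokens
    c = 8 * 1024   # chunk size in tokens
    ratio = longcot_attention_cost(n) / chunked_attention_cost(n, c)
    print(f"toy attention-cost ratio at {n} tokens: {ratio:.2f}")
```

Under this model the ratio between the two costs is roughly `n_tokens / chunk_size`, so it keeps growing as thinking gets longer. The reported training figures (27 versus 7 H100-months, about a 3.9x gap) are smaller than the toy ratio, which is expected because real training includes costs that do not scale with context length; the sketch shows only the scaling trend and does not predict those numbers.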