Compute costs plunge! The Markovian Thinker is here, cutting LLM reasoning cost to linear

Core Insights
- Reinforcement learning is effective at improving the reasoning capabilities of large language models (LLMs), but it is computationally expensive [1]
- A new paradigm, the Markovian Thinker, keeps the state size fixed during reasoning so that compute no longer grows quadratically with thinking length [3][9]

Group 1: Markovian Thinker
- The Markovian Thinker restructures the reinforcement learning setup so that the effective state size stays bounded regardless of the total thinking length, which makes compute grow linearly with that length [9][32]
- The Delethink framework exemplifies this approach: it organizes the reasoning process into fixed-size chunks and resets the context at each chunk boundary [10][12]

Group 2: Performance and Efficiency
- Experiments show that Delethink lets models think for up to 24K tokens with significant performance improvements over traditional LongCoT methods; when thinking is extended to 96K tokens, it reaches 49% accuracy on complex tasks [20][23][26]
- Delethink is also cheaper to train: at an average thinking length of 94K tokens it requires only 7 H100-months, compared with 27 H100-months for LongCoT-RL [26]

Group 3: Implications for Future Models
- The success of the Markovian Thinker suggests that decoupling thinking length from context size could enable future reasoning models to think over millions of tokens [32][33]
- The findings indicate that architectures with non-quadratic complexity may be especially beneficial for reasoning models, allowing thought sequences to be processed more efficiently [33]
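
The chunked reasoning loop described above can be illustrated with a small sketch. This is not the Delethink implementation, only a toy model of the idea under stated assumptions: the names `chunked_think`, `toy_generate`, `carryover` and the `<eos>` marker are hypothetical, and the rule of carrying a short tail of the previous chunk across the boundary is an assumption made here so that the next chunk has something to continue from. The toy generator stands in for an LLM and decides what to emit purely from the context it is handed.

```python
from typing import Callable, List, Tuple

EOS = "<eos>"  # hypothetical end-of-thinking marker


def chunked_think(
    query: List[str],
    generate: Callable[[List[str], int], List[str]],
    chunk_size: int,
    carryover: int,
    max_chunks: int,
) -> Tuple[List[str], int]:
    """Run a chunked rollout; return (full trace, largest context ever used)."""
    if not 0 < carryover < chunk_size:
        raise ValueError("carryover must be positive and smaller than chunk_size")
    trace: List[str] = []
    carried: List[str] = []  # short tail kept across a chunk boundary (assumption)
    peak_context = 0
    for _ in range(max_chunks):
        # The context is rebuilt from scratch: query plus the carried tail only.
        context = query + carried
        budget = chunk_size - len(carried)
        new_tokens = generate(context, budget)[:budget]
        trace.extend(new_tokens)
        peak_context = max(peak_context, len(context) + len(new_tokens))
        if not new_tokens or new_tokens[-1] == EOS:
            break
        carried = new_tokens[-carryover:]
    return trace, peak_context


def toy_generate(context: List[str], budget: int, total_steps: int = 20) -> List[str]:
    """Hypothetical stand-in for an LLM: continues a 'step-i' count seen in context."""
    last = -1
    for token in reversed(context):
        if token.startswith("step-"):
            last = int(token.split("-")[1])
            break
    out: List[str] = []
    nxt = last + 1
    while len(out) < budget:
        if nxt >= total_steps:
            out.append(EOS)
            break
        out.append(f"step-{nxt}")
        nxt += 1
    return out
```

Calling `chunked_think(["q"], toy_generate, chunk_size=8, carryover=4, max_chunks=10)` produces a trace much longer than any single context, while the context never exceeds `len(query) + chunk_size` tokens. That bound is what makes the cost linear: each chunk costs at most a fixed amount of attention work, and the number of chunks grows in proportion to the total thinking length.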