Test-Time Scaling
Fudan, Peking University, and Meituan LongCat propose TDAR: breaking Block Diffusion's speed-accuracy paradox with "Think Coarse, Critic Fine"
机器之心· 2026-03-12 09:30
Core Insights
- Test-Time Scaling has become a key pathway to enhancing model inference capabilities, with Block Diffusion Language Models (BDLMs) emerging as strong competitors to traditional autoregressive models thanks to their parallel decoding abilities [2]
- Existing BDLMs face an efficiency-versus-effectiveness dilemma in long-chain reasoning: large-block decoding is fast but error-prone, while small-block decoding is accurate but forfeits the parallel advantage [2][12]
- A new framework called TDAR, proposed by Fudan University, Peking University, and Meituan's LongCat team, introduces the "Think Coarse, Critic Fine" (TCCF) paradigm and Bounded Adaptive Confidence Decoding (BACD) to break the trade-off between speed and accuracy [2][6]

Summary by Sections

TDAR Framework
- The TDAR framework comprises two core designs, Bounded Adaptive Confidence Decoding (BACD) and the TCCF paradigm, aimed at the efficiency and accuracy challenges of long-chain reasoning [6][11]

Bounded Adaptive Confidence Decoding (BACD)
- BACD dynamically adjusts the denoising threshold based on the average confidence of generated tokens, with upper and lower bounds that balance aggressive acceleration against conservative decoding during uncertain steps [9][20]

TCCF Paradigm
- The TCCF paradigm distinguishes the exploration and verification phases of long-chain reasoning, allowing coarse thinking during exploration and fine-grained checking during verification, thereby matching computational granularity to the task [11][15]

Experimental Results
- TDAR-8B-Thinking achieved superior performance across six mainstream reasoning benchmarks, surpassing the previous state-of-the-art model TraDo-8B by 3.4 percentage points, with decoding speed rising from 1.27 TPF to 2.97 TPF [13][16]
- With BACD enabled, speed further improved to 3.37 TPF and accuracy rose by 1.6 percentage points; the TCCF paradigm lifted accuracy on complex tasks from 36.3% to 42.9% while sustaining a high speed of 3.04 TPF [13][16]

Performance Analysis
- The research team conducted a multi-dimensional analysis of the sources of TDAR's performance, focusing on the impact of block size, decoding strategies, and the TCCF paradigm [17][18]
- BACD showed superior stability over traditional decoding methods, effectively avoiding failure modes such as model collapse and repetitive generation [19][20]
- The analysis identified a block-size sweet spot of 16 for the 8B model, balancing speed and quality through progressive training [23][26]

Conclusion and Future Outlook
- The introduction of TDAR marks a significant advance for BDLMs on complex reasoning tasks, allowing large block sizes while preserving both quality and speed [31][32]
- TDAR offers an efficient route to Test-Time Scaling for BDLMs and new insights for the design of future parallel reasoning models [32]
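The BACD idea described above (a denoising threshold driven by the mean confidence of generated tokens, clamped between bounds) can be sketched as follows. This is a minimal illustration, not the paper's actual formula: the function name, the linear adaptation rule, and all default values (`base_tau`, `lower`, `upper`) are assumptions, since the summary gives no equations.

```python
def bacd_threshold(confidences, base_tau=0.9, lower=0.7, upper=0.95):
    """Bounded Adaptive Confidence Decoding, illustrative sketch.

    Tokens whose confidence exceeds the returned threshold would be
    unmasked this step. High recent confidence lowers the threshold
    (more tokens accepted per step, i.e. faster); low confidence
    raises it toward the upper bound (decode conservatively).
    The update rule below is an assumption for illustration only.
    """
    mean_conf = sum(confidences) / len(confidences)
    tau = base_tau * (2.0 - mean_conf)  # illustrative adaptation rule
    # Clamp to [lower, upper]: the "bounded" part of BACD keeps the
    # schedule from becoming either too aggressive or too timid.
    return max(lower, min(upper, tau))
```

The bounds are the key design point reported in the summary: unbounded confidence-driven schedules can oscillate, whereas clamping keeps acceleration within a safe band during uncertain steps.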
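The "Think Coarse, Critic Fine" split can likewise be illustrated with a toy step-count calculation: exploratory reasoning segments decode in large parallel blocks, verification segments in small ones. The phase labels, block sizes (other than the reported sweet spot of 16 for the 8B model), and the one-step-per-block simplification are all assumptions for illustration.

```python
def tccf_decode_steps(segments, coarse=16, fine=4):
    """Toy TCCF accounting: total denoising steps for a sequence of
    (phase, num_tokens) segments, phase in {"explore", "verify"}.

    Exploration uses large blocks (few steps, high parallelism);
    verification uses small blocks (more steps, careful checking).
    Real BDLM decoding may take several denoising steps per block;
    here we count one step per block purely to show the trade-off.
    """
    steps = 0
    for phase, n_tokens in segments:
        block = coarse if phase == "explore" else fine
        steps += -(-n_tokens // block)  # ceiling division
    return steps
```

A 64-token exploratory stretch costs 4 steps at block size 16, while the same length verified at block size 4 would cost 16, which is the granularity trade-off the summary describes.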
Yang Zhilin and the Kimi team respond late at night: on all the controversies following K2 Thinking's explosive success
AI前线· 2025-11-11 06:42
Core Insights
- The article discusses the launch of Kimi K2 Thinking by Moonshot AI, highlighting its capabilities and innovations in the AI model landscape [2][27]
- Kimi K2 Thinking has achieved impressive results across global AI benchmarks, outperforming leading models such as GPT-5 and Claude 4.5 [10][12]

Group 1: Model Performance
- Kimi K2 Thinking excelled on benchmarks such as HLE and BrowseComp, surpassing GPT-5 and Claude 4.5 and showcasing advanced reasoning capabilities [10][12]
- On the AIME25 benchmark, Kimi K2 Thinking scored 99.1%, nearly matching GPT-5's 99.6% and outperforming DeepSeek V3.2 [12]
- Its coding performance was notable, with scores of 61.1%, 71.3%, and 47.1% on various coding benchmarks, demonstrating capability in software development [32]

Group 2: Innovations and Features
- Kimi K2 Thinking incorporates a novel KDA (Kimi Delta Attention) mechanism, which enhances long-context consistency and reduces memory usage [15][39]
- The model is designed as an "Agent" capable of autonomous planning and execution, performing 200-300 tool calls without human intervention [28][29]
- The architecture allows a significant increase in reasoning depth and efficiency, balancing speed and accuracy on complex tasks [41]

Group 3: Future Developments
- The team is working on a visual language model (VL) and plans improvements based on user feedback on the model's performance [18][20]
- Kimi K3 is anticipated to build on Kimi K2's innovations, with the KDA mechanism likely retained in future iterations [15][18]
- The company aims to address the "slop problem" in language generation, focusing on enhancing emotional expression and reducing overly sanitized outputs [25]
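The summary states that K2 Thinking operates as an agent chaining 200-300 tool calls without human intervention. A generic plan-act loop of that shape can be sketched as below; this is not Kimi's implementation, and `model_step`, the action tuple format, and the call budget are all illustrative assumptions.

```python
def agent_loop(model_step, tools, max_calls=300):
    """Minimal agent-loop sketch (illustrative, not Kimi's code).

    model_step(history) -> ("call", tool_name, kwargs) to invoke a
    tool, or ("final", answer) to stop. Each tool result is appended
    to the history the model sees on the next step, allowing long
    autonomous chains bounded only by the call budget.
    """
    history = []
    for _ in range(max_calls):
        action = model_step(history)
        if action[0] == "final":
            return action[1]
        _, name, kwargs = action
        result = tools[name](**kwargs)
        history.append((name, kwargs, result))
    return None  # call budget exhausted without a final answer
```

The interesting engineering claim in the article is not the loop itself but sustaining coherence across hundreds of iterations of it, which is where the long-context work (e.g. KDA) comes in.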