Large Model Reasoning Capabilities

Supervised learning isn't dead: train on one question for five hours and take off! Chinese scholars' new method unlocks large-model reasoning at 20x training efficiency
量子位· 2025-08-04 07:00
Core Viewpoint
- The article discusses the breakthrough of One-Shot Critique Fine-Tuning (One-Shot CFT) in enhancing the reasoning capabilities of large language models (LLMs) with minimal data and computational resources, outperforming traditional reinforcement learning (RL) methods and small-scale supervised fine-tuning (SFT) approaches [1][3][14]

Group 1: One-Shot CFT Methodology
- One-Shot CFT is a new method that allows models to learn reasoning by analyzing the quality of answers rather than merely imitating them, thus providing a deeper learning signal [3][12]
- The process involves selecting a representative task, generating multiple answers using various models, and then having a more powerful model critique these answers; the critiques serve as the supervision signal for training (see the sketch after this summary) [4][5]
- The entire training process requires only one question, multiple answers, and critiques, taking approximately 5 GPU hours, significantly less than RL methods [5][14]

Group 2: Performance and Results
- In experiments, Qwen2.5-Math-7B achieved a 15% accuracy increase after One-Shot CFT fine-tuning on a single question, surpassing both RL and full supervised fine-tuning models that used tens of thousands of training samples [9][10]
- The method demonstrated strong performance across various mathematical and logical reasoning tasks, with accuracy improvements ranging from 10% to 16% in specific sub-tasks [10][11]
- One-Shot CFT showed stability and reproducibility across different tasks and model configurations, indicating its robustness [11][13]

Group 3: Advantages of One-Shot CFT
- The method emphasizes critical learning, allowing models to understand why answers are correct or incorrect, which enhances the depth of learning compared to traditional SFT [12]
- It introduces multi-perspective inputs by generating multiple answers and critiques for a single task, closely mimicking human learning processes [12]
- The training signals from critiques are highly generalizable, reducing the risk of overfitting and allowing for easier transfer to new tasks [12]

Group 4: Accessibility and Practical Implications
- One-Shot CFT's low computational cost makes it accessible for individual researchers, resource-limited labs, and startups, providing a cost-effective solution for enhancing reasoning capabilities [14][15]
- The entire process is open-source, including training scripts, model parameters, and datasets, which significantly lowers the barrier for replication and experimentation [17]

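The data construction described in Group 1 can be summarized in a short sketch. The following is a minimal illustration of the One-Shot CFT pipeline, assuming hypothetical callables for the answer-generating models and the stronger critic model; it is not the authors' released implementation.

```python
# Minimal sketch of the One-Shot CFT data construction described above.
# `solver_models` and `critic` are hypothetical callables standing in for
# LLM inference; this is an illustration, not the authors' released code.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class CritiqueExample:
    question: str          # the single training question
    candidate_answer: str  # one sampled solution attempt
    critique: str          # supervision target: judge the attempt and explain why

def build_one_shot_cft_dataset(
    question: str,
    solver_models: List[Callable[[str], str]],  # diverse models that produce answers
    critic: Callable[[str, str], str],          # stronger model that critiques an answer
    samples_per_model: int = 4,
) -> List[CritiqueExample]:
    """Turn one question into many (question, answer, critique) training triples."""
    dataset: List[CritiqueExample] = []
    for solve in solver_models:
        for _ in range(samples_per_model):
            answer = solve(question)
            critique = critic(question, answer)
            dataset.append(CritiqueExample(question, answer, critique))
    return dataset

# The triples are then used for standard supervised fine-tuning, where the
# model learns to produce the critique given the question and a candidate answer.
```

Because the supervised stage runs on only these critique triples rather than tens of thousands of samples, the small compute budget reported above (roughly 5 GPU hours) is plausible under these assumptions.
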
Tencent Research Institute AI Digest, 2025-07-03
腾讯研究院· 2025-07-02 15:52
Group 1
- Cursor's developer Anysphere has poached two key figures, Boris Cherny and Cat Wu, from Claude Code, despite the close partnership between the two companies [1]
- Anthropic's annual revenue has reached $4 billion with a valuation of $61.5 billion, and its Claude model is regarded as the best programming model [1]
- Anysphere's revenue has doubled within three months to an annualized $500 million, with a valuation of $9.9 billion, intensifying competition in the AI programming market [1]

Group 2
- Zhipu AI has released the open-source GLM-4.1V-Thinking visual reasoning model, which surpasses a 72B model with roughly 8x its parameter count in 18 authoritative evaluations [2]
- The model architecture integrates a ViT visual encoder, an MLP adapter, and a GLM language decoder, with 2D-RoPE and 3D-RoPE positional encodings enhancing its processing capabilities [2]
- The training process consists of four stages: multi-modal pre-training, long-context continued training, supervised fine-tuning, and curriculum-sampling reinforcement learning, significantly improving logical reasoning abilities [2]

Group 3
- Sakana AI has introduced the Adaptive Branching Monte Carlo Tree Search (AB-MCTS) algorithm, enhancing large-model reasoning through a flexible search that can both widen the tree with new candidates and deepen it by refining existing ones [3]
- The Multi-LLM AB-MCTS system lets multiple frontier models (Gemini 2.5 Pro, o4-mini, DeepSeek-R1-0528) collaborate, achieving a roughly 30% performance improvement on the ARC-AGI-2 benchmark [3]
- The algorithm dynamically selects the best-suited model for each problem, enabling collective intelligence to surpass the limits of any individual model; the underlying framework, TreeQuest, has been open-sourced for user applications [3]

Group 4
- HeyGen has launched a "product placement" feature that generates realistic promotional videos from just an uploaded avatar and product images, with a demo of an Elon Musk avatar promoting Labubu as a notable example [4]
- Founded by two Tongji University alumni, HeyGen is valued at $500 million with annual revenue nearing $80 million and expected to surpass $100 million [5]
- Compared to competitors such as Topview, HeyGen excels in the naturalness of avatar expressions and lip-sync accuracy, offering unlimited short-video production for a monthly fee of $29 [5]

Group 5
- Baidu has undergone its most significant self-reinvention in nearly a decade, upgrading its search box into an AI smart box that supports ultra-long text input while retaining the traditional search mode [6]
- The new "Bai Kan" feature changes how search results are displayed, prioritizing the most useful rich-media content such as video explanations and intelligent summaries [6]
- Search has evolved from simple information retrieval to task completion: users can obtain ratings, locations, and travel plans directly, and even book a taxi or purchase packages with one click [6]

Group 6
- Microsoft has released the MAI-DxO medical AI system, which reaches a diagnostic accuracy of 85.5%, roughly four times that of physicians with about 10 years of experience [7]
- MAI-DxO simulates a real medical team's sequential diagnostic process through collaboration among five virtual doctor roles [7]
- The system offers five diagnostic modes for different scenarios and introduces SDBench, a professional sequential-diagnosis benchmark featuring 304 challenging diagnostic cases [7]

Group 7
- Baidu has launched its self-developed multi-modal generative large model MuseSteamer and the "Hui Xiang" platform, supporting high-quality video generation at resolutions from 720p to 1080p and setting a new record on the VBench-I2V video generation leaderboard [8]
- The model is available in four versions: Lite (fast 720p), Turbo (720p with excellent character motion), Pro (1080p cinematic quality), and Voice (automatically generates sound effects and dialogue), catering to different creative needs [8]
- Key technical highlights include precise understanding of Chinese semantics, a structured video description language, cinematic dynamic aesthetics, and integrated audio-video generation, already applied in advertising creative work and short-drama production [8]

Group 8
- Cloudflare has introduced the experimental "Pay Per Crawl" feature, allowing websites to set permissions, fees, or blocks for AI crawlers and granting content creators bargaining power over their content [10]
- Data indicate a stark disparity between AI crawlers and traditional search engines: Google returns one click for every 6-7 crawls, while OpenAI requires about 1,500 crawls and Anthropic about 73,300 crawls per click, disrupting the existing ecosystem balance [10]
- The feature implements fee control through HTTP 402 status codes and digital-signature authentication, is currently in beta, and could shift content creators from "advertising monetization" to "content licensing monetization" (a minimal illustrative sketch follows after this summary) [10]

Group 9
- Chai Discovery, backed by OpenAI, has launched the Chai-2 multi-modal generative model, achieving a 16% hit rate in de novo antibody design, an improvement of over 100x compared with previous state-of-the-art methods [11]
- Chai-2 identified effective antibodies for 26 of 52 test targets (50%) within a single 24-well plate (≤20 designs) and can generate sequences in various formats, including scFv antibodies, VHH domains, and mini-binders [11]
- The model uses a controllable, model-driven design framework that shrinks development cycles from months to about two weeks, achieved a 68% wet-lab success rate in miniprotein design, and could unlock drug-discovery capabilities beyond traditional technologies [11]

Group 10
- The New Yorker argues that AI teaches humans to write "good" articles while causing truly good writing to disappear [12]
- The article notes that AI is reshaping culture with an "averaging" logic, leading to standardized writing and a loss of individuality; MIT experiments showed significantly reduced brain-activity levels in students who wrote with ChatGPT [12]
- Research points to cultural homogenization: Cornell University experiments confirmed that AI-assisted writing by users in India and the US converges toward a "Western paradigm," with frequent references to pizza and Christmas [12]

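To make the HTTP 402 mechanism mentioned in Group 8 concrete, here is a minimal sketch of an origin server that gates known AI crawler user agents behind a payment requirement, using only Python's standard library. The user-agent list, header names, and price are illustrative assumptions; Cloudflare's actual Pay Per Crawl flow relies on signed crawler identities negotiated at its edge, not on origin code like this.

```python
# Illustrative sketch of gating AI crawlers with HTTP 402 "Payment Required".
# User-agent matching, header names, and the price are assumptions for
# demonstration only; the real Pay Per Crawl mechanism also involves
# cryptographic signatures from registered crawlers.

from http.server import BaseHTTPRequestHandler, HTTPServer

AI_CRAWLER_AGENTS = ("GPTBot", "ClaudeBot", "PerplexityBot")  # examples only
PRICE_PER_CRAWL_USD = "0.01"  # hypothetical price

class PayPerCrawlHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        user_agent = self.headers.get("User-Agent", "")
        payment_proof = self.headers.get("X-Crawler-Payment-Token")  # hypothetical header
        if any(bot in user_agent for bot in AI_CRAWLER_AGENTS) and not payment_proof:
            # Signal that access to this content requires payment.
            self.send_response(402)
            self.send_header("Content-Type", "text/plain")
            self.send_header("X-Crawl-Price-USD", PRICE_PER_CRAWL_USD)  # hypothetical
            self.end_headers()
            self.wfile.write(b"Payment required for automated crawling.\n")
            return
        # Human visitors and paying crawlers get the normal page.
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(b"<html><body>Regular article content.</body></html>")

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8402), PayPerCrawlHandler).serve_forever()
```

A paying crawler would retry with whatever payment proof the scheme defines, represented here by the hypothetical X-Crawler-Payment-Token header.
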
o3-pro draws a crowd by solving a high-difficulty word puzzle; former OpenAI employee mocks Apple: if this doesn't count as reasoning, what does?
量子位· 2025-06-13 02:25
Core Viewpoint
- OpenAI's latest reasoning model, o3-pro, demonstrates strong reasoning capabilities but has mixed performance across evaluations, indicating a need for context and specific prompts to maximize its potential [1][2][3][4]

Evaluation Results
- o3-pro achieved a correct answer in 4 minutes and 25 seconds during a reasoning test, showcasing its ability to process complex queries [2]
- In official evaluations, o3-pro surpassed previous models like o3 and o1-pro, becoming the best coding model from OpenAI [8]
- However, in the LiveBench ranking, o3-pro showed only a slight advantage over o3, with a score difference of 0.07, and it lagged behind o3 in agentic coding scores (31.67 vs. 36.67) [11]

Contextual Performance
- o3-pro excels in short-context scenarios, showing improvement over o3, but struggles with long-context processing, scoring 65.6 compared to Gemini 2.5 Pro's 90.6 in 192k-context tests [15][16]
- The model's performance is highly dependent on the background information provided, as noted by user experiences [24][40]

User Insights
- Bindu Reddy, a former executive at Amazon and Google, pointed out that o3-pro lacks proficiency in tool usage and agent capabilities [12]
- Ben Hylak, a former engineer at Apple and SpaceX, emphasized that o3-pro's effectiveness increases significantly when it is treated as a report generator rather than a chat model, requiring ample context for optimal results (see the usage sketch after this summary) [22][24][26]

Comparison with Other Models
- Ben Hylak found o3-pro's outputs to be superior to those of Claude Opus and Gemini 2.5 Pro, highlighting its unique value in practical applications [39]
- The model's ability to understand its environment and accurately describe tool usage has improved, making it a better coordinator in tasks [30][31]

Conclusion
- The evaluation of o3-pro reveals that while it has advanced reasoning capabilities, its performance is contingent on the context and prompts provided, necessitating a strategic approach to maximize its utility in various applications [40][41]

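Following Hylak's "report generator" observation, the sketch below shows that usage pattern with the OpenAI Python SDK: gather all relevant background into one context-rich request and ask for a structured report instead of chatting turn by turn. The model identifier, file names, and prompt wording are assumptions for illustration, not a prescribed o3-pro workflow.

```python
# Sketch of the "report generator" usage pattern for a long-context reasoning
# model: pack the full background into a single request and request a
# structured report. Model name, file paths, and prompt wording are
# illustrative assumptions.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def build_context() -> str:
    """Concatenate everything the model should know: goals, docs, prior notes."""
    sections = {
        "Company goals": open("goals.md").read(),         # hypothetical files
        "Meeting notes": open("meeting_notes.md").read(),
        "Previous plans": open("old_plans.md").read(),
    }
    return "\n\n".join(f"## {title}\n{body}" for title, body in sections.items())

response = client.chat.completions.create(
    model="o3-pro",  # assumed identifier; substitute whichever reasoning model and endpoint you use
    messages=[
        {"role": "system",
         "content": "You are a planning analyst. Produce a structured report "
                    "with sections: Situation, Options, Recommendation."},
        {"role": "user",
         "content": build_context() + "\n\nTask: recommend a product roadmap "
                    "for the next quarter, citing the context above."},
    ],
)
print(response.choices[0].message.content)
```

The point of the pattern is the single context-heavy request and the report-style output format, not the specific API call shown here.
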
DeepSeek R1's hallucination rate drops by up to 50%; users call for an R2 model
Di Yi Cai Jing· 2025-05-29 14:10
Core Insights
- The updated R1 model from DeepSeek has significantly improved its capabilities, particularly by reducing the "hallucination" rate, which previously stood at around 21% [1][4]

Model Performance
- The new R1 model has achieved top-tier performance in various benchmark tests, surpassing all domestic models and nearing international leaders such as o3 and Gemini-2.5-Pro [4]
- The hallucination rate has been reduced by approximately 45%-50% in tasks such as rewriting, summarization, and reading comprehension; relative to the previous roughly 21% baseline, that implies a post-update rate of about 10%-12%, providing more accurate and reliable results [4][18]
- In the AIME 2025 test, the model's accuracy on complex reasoning tasks improved from 70% to 87.5% [18]

Model Features and Capabilities
- The updated R1 model can generate longer and more structured pieces of writing, including essays, novels, and prose, while aligning more closely with human writing styles [18]
- The model's coding capabilities have also seen significant enhancements, performing nearly on par with OpenAI's o3-high model in code-testing environments [18]
- The new model has 685 billion parameters and supports a 128K context length in the open-source version [19]

Future Developments
- There is considerable anticipation in the industry for the next-generation R2 model, with users openly expressing eagerness for its release [19]
- DeepSeek has not commented on speculation regarding the R2 model, but competition in the foundation-model space remains intense [19]

Large models can't handle Sudoku?! Transformer author's startup publishes leaderboard: o3 Mini High gets only 2.9% on "variant Sudoku"
量子位· 2025-05-28 04:22
Core Insights
- The article examines how AI models perform at solving Sudoku puzzles, revealing that overall accuracy is only 15%, with the best model achieving just 2.9% accuracy on the 9x9 puzzles [1][25]

Group 1: AI Model Performance
- Sakana AI introduced a new benchmark called Sudoku-Bench, which tests AI models on Sudoku puzzles ranging from 4x4 to 9x9 [1][6]
- The leaderboard shows that the top-performing model, o3 Mini High, solved 14% of puzzles, while other models such as Gemini 2.5 Pro and Qwen 3 235B A22B achieved 11% and 8% respectively [2][22]
- Even the most advanced models struggled, with many failing to place a single correct number in some puzzles [21][25]

Group 2: Challenges Faced by AI Models
- A significant issue is the "memory dependence" of large models: they rely on memorized solutions rather than logical reasoning [7][8]
- Models often fail to adapt to new rules or unseen patterns, leading to ineffective problem-solving strategies [9][10]
- Traditional Sudoku puzzles may be too simple for these models, since they can pattern-match memorized grids instead of exercising genuine problem-solving [10]

Group 3: Innovative Testing Approach
- Sudoku-Bench includes "variant Sudoku" puzzles that require multi-step reasoning and cannot be solved through memory alone, making them well suited for testing AI reasoning capabilities (a classical solver sketch follows after this summary for contrast) [11][12]
- The benchmark features both traditional and modern Sudoku problems at varying difficulty levels [15][16]

Group 4: Company Background
- Sakana AI was founded in July 2023 by former Google researchers Llion Jones and David Ha, focusing on generative AI models [24]
- The company has previously released AI models capable of generating academic papers and reviewing AI-generated content [26][29]

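For contrast with the memorization failure mode described above, the sketch below shows the explicit constraint checking and backtracking that even a small 4x4 Sudoku demands. It is purely illustrative and is not part of Sakana AI's Sudoku-Bench harness; variant puzzles layer additional rules on top of the row, column, and box constraints enforced here.

```python
# Classical backtracking solver for a 4x4 Sudoku (2x2 boxes), shown only to
# illustrate explicit constraint reasoning; Sudoku-Bench evaluates LLMs, and
# its variant puzzles add rules beyond the basic constraints used here.

N, BOX = 4, 2

def is_valid(grid, row, col, value):
    """Check the row, column, and box constraints for placing `value`."""
    if any(grid[row][c] == value for c in range(N)):
        return False
    if any(grid[r][col] == value for r in range(N)):
        return False
    br, bc = (row // BOX) * BOX, (col // BOX) * BOX
    return all(grid[br + r][bc + c] != value
               for r in range(BOX) for c in range(BOX))

def solve(grid):
    """Fill empty cells (0) by depth-first search with backtracking."""
    for row in range(N):
        for col in range(N):
            if grid[row][col] == 0:
                for value in range(1, N + 1):
                    if is_valid(grid, row, col, value):
                        grid[row][col] = value
                        if solve(grid):
                            return True
                        grid[row][col] = 0  # undo and try the next candidate
                return False  # no value fits this cell: backtrack
    return True  # no empty cells remain

puzzle = [
    [1, 0, 0, 0],
    [0, 0, 3, 0],
    [0, 4, 0, 0],
    [0, 0, 0, 2],
]
if solve(puzzle):
    for row in puzzle:
        print(row)
```

Every placement here is justified by checking constraints, which is the kind of step-by-step rule application that memorized solution patterns cannot replace on unseen or rule-modified puzzles.
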
Tsinghua top students vs. AI on the gaokao's toughest final problems: who will win?
第一财经· 2025-05-27 15:21
Core Viewpoint
- The article discusses the significant advances in AI's reasoning capabilities in an educational context, illustrated by a competition between top students from Tsinghua University and AI models on the most challenging exam questions [1][2]

Group 1: AI Advancements
- The DeepSeek-R1 model has driven a breakthrough in reasoning capabilities, showing high adaptability in educational scenarios and improving the quality of guidance and Q&A [2]
- In a recent test, AI scored 697 out of 750 on the new-format gaokao (China's national college entrance exam), a level competitive with admission to top-tier universities [2]
- The mathematical performance of AI models has been a particular focus, with OpenAI's o3-mini demonstrating superior reasoning capabilities on the FrontierMath benchmark [3]

Group 2: Educational Impact
- AI's ability to solve high-difficulty math problems has garnered attention, especially in the context of the gaokao, whose difficulty is widely recognized [2]
- AI's contribution to the online education market is projected to grow from 7% to 16% between 2023 and 2027, highlighting the growing integration of AI in educational settings [3]