Large Model Reasoning Capabilities

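Stock index futures expected to fluctuate with an upward bias; gold, silver, and PVC futures to fluctuate with an upward bias; crude oil and natural rubber futures to fluctuate with a downward bias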
Guo Tai Jun An Qi Huo· 2025-09-19 02:17
Report Industry Investment Rating
No relevant content provided.

Core View of the Report
Using macro-fundamental analysis together with technical tools such as golden-section lines, horizontal lines, and moving averages, the report forecasts the day's price trend for the main futures contracts. It expects stock index, gold, silver, and PVC futures to fluctuate with an upward bias, and crude oil and natural rubber futures to fluctuate with a downward bias. Ten-year and thirty-year treasury bond futures are expected to fluctuate in a wide range, and other futures to either consolidate in a range or fluctuate in a wide range [1][2][3][4][5].

Summary by Related Catalogs

1. Futures Market Forecast
- **Stock Index Futures**: On September 19, stock index futures are expected to fluctuate with an upward bias. IF2512: resistance at 4495 and 4550 points, support at 4413 and 4382 points. IH2512: resistance at 2955 and 2981 points, support at 2873 and 2855 points. IC2512: resistance at 7150 and 7200 points, support at 6881 and 6785 points. IM2512: resistance at 7316 and 7460 points, support at 7131 and 7086 points [2].
- **Treasury Bond Futures**: On September 19, the ten-year treasury bond futures main contract T2512 is likely to fluctuate in a wide range, with resistance at 108.17 and 108.32 yuan and support at 107.97 and 107.91 yuan. The thirty-year main contract TL2512 is also likely to fluctuate in a wide range, with resistance at 116.1 and 116.5 yuan and support at 115.2 and 115.0 yuan [3][35][40].
- **Precious Metal Futures**: On September 19, the gold futures main contract AU2512 is likely to fluctuate with an upward bias, with resistance at 834.0 and 838.1 yuan/gram and support at 824.9 and 820.0 yuan/gram. The silver futures main contract AG2512 is also likely to fluctuate with an upward bias, with resistance at 9944 and 9999 yuan/kilogram and support at 9835 and 9799 yuan/kilogram [3][43][47].
- **Base Metal Futures**: On September 19, the copper futures main contract CU2511 is likely to consolidate in a range, with support at 79400 and 79200 yuan/ton and resistance at 79900 and 80000 yuan/ton. The aluminum futures main contract AL2511 is likely to consolidate in a range, with support at 20700 and 20650 yuan/ton and resistance at 20920 and 21000 yuan/ton. The alumina futures main contract AO2601 is likely to consolidate in a range, with resistance at 2960 and 2989 yuan/ton and support at 2919 and 2900 yuan/ton [3][50][55].
- **Energy and Chemical Futures**: On September 19, the crude oil futures main contract SC2511 is likely to fluctuate with a downward bias, with support at 485 and 480 yuan/barrel and resistance at 497 and 500 yuan/barrel. The PVC futures main contract V2601 is likely to fluctuate with an upward bias and challenge resistance at 4975 and 5000 yuan/ton, with support at 4923 and 4891 yuan/ton. The natural rubber futures main contract RU2601 is likely to fluctuate with a downward bias and test support at 15330 and 15210 yuan/ton, with resistance at 15670 and 15750 yuan/ton [4][5][92][97][99].
- **Building Materials and Steel Futures**: On September 19, the rebar futures main contract RB2601 is likely to consolidate in a range, with support at 3123 and 3101 yuan/ton and resistance at 3166 and 3180 yuan/ton. The hot-rolled coil futures main contract HC2601 is likely to fluctuate with a downward bias, with support at 3335 and 3314 yuan/ton and resistance at 3370 and 3388 yuan/ton. The iron ore futures main contract I2601 is likely to fluctuate in a wide range, with resistance at 809 and 815 yuan/ton and support at 796 and 789 yuan/ton. The coking coal futures main contract JM2601 is likely to fluctuate in a wide range, with support at 1188 and 1166 yuan/ton and resistance at 1223 and 1238 yuan/ton. The glass futures main contract FG601 is likely to fluctuate with a downward bias, with support at 1194 and 1176 yuan/ton and resistance at 1212 and 1226 yuan/ton. The soda ash futures main contract SA601 is likely to consolidate in a range, with resistance at 1318 and 1325 yuan/ton and support at 1297 and 1289 yuan/ton [3][4][67][72][74].
- **Lithium Carbonate Futures**: On September 19, the lithium carbonate futures main contract LC2511 is likely to fluctuate in a wide range, with resistance at 74100 and 75100 yuan/ton and support at 72000 and 70300 yuan/ton [62].

2. Macro-Information and Trading Tips
- **Trade-related**: The Chinese Ministry of Commerce stated its position on the TikTok issue and on the EU's anti-subsidy duties on Chinese electric vehicles. It also addressed the anti-dumping investigation into EU pork products [6][7].
- **Science and Technology Investment**: During the "14th Five-Year Plan" period, China's R&D investment increased, with total R&D investment in 2024 exceeding 3.6 trillion yuan, a 48% increase from 2020. R&D investment intensity reached 2.68%, exceeding the average level of EU countries. The DeepSeek-R1 reasoning model research paper appeared on the cover of "Nature", marking the highest recognition of China's AI technology by the international scientific community [7][8].
- **Business and Economy**: The "2025 China's Top 500 Service Enterprises" list was released, with the total operating income of the listed enterprises in 2024 reaching 51.1 trillion yuan. Beijing and Shanghai announced the upper and lower limits of social security contribution bases for 2025. The Shanghai government plans to support high-growth enterprises, offering rewards of up to 100,000 yuan for gazelle enterprises and up to 200,000 yuan for unicorn enterprises [7].
- **International Cooperation and Investment**: The US and the UK signed a science and technology cooperation agreement. BP plans to invest over 3.6 billion pounds in the US annually for the next five years, and CoreWeave will invest 1.5 billion pounds in the UK [8].
- **Employment and Unemployment**: Initial jobless claims in the US last week dropped to 231,000, the largest decline in nearly four years. Continued claims, however, remained above 1.9 million, indicating some pressure in the labor market [9].
- **US Government Fund**: The US government is promoting a $5 billion mineral investment fund [9].
- **UK Central Bank Policy**: The Bank of England held its interest rate at 4% and reduced the scale of quantitative tightening over the next 12 months from 100 billion pounds to 70 billion pounds [9].

3. Commodity Futures-Related Information
- **Iron Ore Index**: The Iron Ore Working Committee of the China Iron and Steel Association arranged the launch of the import iron ore port spot price index [9].
- **International Precious Metal Futures**: On September 18, international precious metal futures generally closed lower. COMEX gold futures fell 1.07% to $3678.2 per ounce, and COMEX silver futures fell 0.12% to $42.1 per ounce [10].
- **International Crude Oil Futures**: On September 18, international oil prices fell slightly. The US crude oil main contract fell 0.61% to $63.31 per barrel, and the Brent crude oil main contract fell 0.73% to $66.97 per barrel [10].
- **London Base Metals**: On September 18, most London base metals fell. LME tin fell 1.73% to $33750 per ton, LME zinc fell 1.04% to $2913 per ton, LME copper fell 0.50% to $9946 per ton, LME nickel fell 0.45% to $15335 per ton, LME lead fell 0.42% to $2004 per ton, and LME aluminum rose 0.82% to $2705 per ton [10].
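As a reading aid for the support and resistance levels listed in the forecast, the sketch below classifies a price against two support and two resistance levels. The function name, the labels, and the sample prices are illustrative assumptions and do not come from the report; only the IF2512 levels are taken from the summary above.

```python
def classify_price(price, supports, resistances):
    """Describe where a price sits relative to support and resistance levels.

    The labels are an illustrative convention, not the report's terminology.
    """
    if price > max(resistances):
        return "above all resistance"
    if price < min(supports):
        return "below all support"
    if price >= min(resistances):
        return "testing resistance"
    if price <= max(supports):
        return "testing support"
    return "inside range"


# IF2512 levels as quoted in the summary; the prices below are hypothetical.
IF2512_SUPPORTS = [4413, 4382]
IF2512_RESISTANCES = [4495, 4550]

if __name__ == "__main__":
    for price in (4300, 4400, 4450, 4500, 4600):
        label = classify_price(price, IF2512_SUPPORTS, IF2512_RESISTANCES)
        print(price, label)
```

A price between the nearer support (4413) and the nearer resistance (4495) is reported as inside the range; anything between the two resistance levels counts as testing resistance.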
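Supervised learning is not dead: train on one problem for five hours and take off! Chinese scholars' new method unlocks large model reasoning with 20x training efficiency
QbitAI (量子位) · 2025-08-04 07:00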
Core Viewpoint
The article presents One-Shot Critique Fine-Tuning (One-Shot CFT), a method that improves the reasoning capabilities of large language models (LLMs) with minimal data and compute, outperforming traditional reinforcement learning (RL) methods and small-scale supervised fine-tuning (SFT) approaches [1][3][14].

Group 1: One-Shot CFT Methodology
- One-Shot CFT lets a model learn reasoning by analyzing the quality of answers rather than merely imitating them, which provides a deeper learning signal [3][12].
- The process selects a representative task, generates multiple answers to it with various models, and then has a more powerful model critique those answers. The critiques serve as the supervision signal for training [4][5].
- The entire training process requires only one question, multiple answers, and their critiques, and takes approximately 5 GPU hours, significantly less than RL methods [5][14].

Group 2: Performance and Results
- In experiments, Qwen2.5-Math-7B gained 15% in accuracy after One-Shot CFT fine-tuning on a single question, surpassing both RL models and fully supervised fine-tuned models that used tens of thousands of training samples [9][10].
- The method performed strongly across various mathematical and logical reasoning tasks, with accuracy improvements of 10% to 16% on specific sub-tasks [10][11].
- One-Shot CFT was stable and reproducible across different tasks and model configurations, indicating robustness [11][13].

Group 3: Advantages of One-Shot CFT
- The method emphasizes critical learning: the model learns why answers are correct or incorrect, which deepens learning compared with traditional SFT [12].
- It introduces multi-perspective inputs by generating multiple answers and critiques for a single task, closely mimicking how humans learn [12].
- The training signals from critiques are highly generalizable, reducing the risk of overfitting and easing transfer to new tasks [12].

Group 4: Accessibility and Practical Implications
- One-Shot CFT's low computational cost makes it accessible to individual researchers, resource-limited labs, and startups as a cost-effective way to enhance reasoning capabilities [14][15].
- The entire process is open-source, including training scripts, model parameters, and datasets, which significantly lowers the barrier to replication and experimentation [17].
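The data-construction step described in Group 1 (one question, several candidate answers, one critique per answer) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' released code: the class name, function name, and prompt template are hypothetical, and in practice the candidate answers and critiques would come from model calls that are not shown here.

```python
from dataclasses import dataclass


@dataclass
class CritiqueExample:
    """One supervised training pair: the model reads `prompt`, learns to emit `target`."""
    prompt: str
    target: str


def build_cft_examples(question, candidate_answers, critiques):
    """Pair each candidate answer to a single question with its critique.

    Assumption: critiques[i] was written by a stronger model about
    candidate_answers[i]. The prompt wording is illustrative only.
    """
    if len(candidate_answers) != len(critiques):
        raise ValueError("each candidate answer needs exactly one critique")
    examples = []
    for answer, critique in zip(candidate_answers, critiques):
        prompt = (
            f"Question:\n{question}\n\n"
            f"Candidate solution:\n{answer}\n\n"
            "Critique this solution and state whether it is correct."
        )
        examples.append(CritiqueExample(prompt=prompt, target=critique))
    return examples


if __name__ == "__main__":
    demo = build_cft_examples(
        "What is 12 * 13?",
        ["12 * 13 = 156", "12 * 13 = 146"],
        ["Correct: 12 * 13 = 120 + 36 = 156.", "Incorrect: 120 + 36 is 156, not 146."],
    )
    for example in demo:
        print(example.prompt)
        print("->", example.target)
```

The point of the structure is that the training target is the critique, not the answer itself, so a single question yields as many training pairs as there are candidate answers.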
Tencent Research Institute AI Express 20250703
Tencent Research Institute (腾讯研究院) · 2025-07-02 15:52
Group 1
- Cursor's developer Anysphere has poached two key figures, Boris Cherny and Cat Wu, from the Claude Code team, despite the close partnership between the two companies [1]
- Anthropic's annual revenue has reached $4 billion at a valuation of $61.5 billion, and its Claude model is regarded as the best programming model [1]
- Anysphere's annualized revenue has doubled within three months to $500 million, at a valuation of $9.9 billion, intensifying competition in the AI programming market [1]

Group 2
- Zhipu has released the open-source GLM-4.1V-Thinking visual reasoning model, which surpasses a 72B model with 8x its parameter count on 18 authoritative evaluations [2]
- The model architecture combines a ViT visual encoder, an MLP adapter, and a GLM language decoder, with 2D-RoPE and 3D-RoPE positional encodings to enhance processing capabilities [2]
- Training consists of four stages: multi-modal pre-training, long-context continued training, supervised fine-tuning, and curriculum-sampling reinforcement learning, which significantly improves logical reasoning [2]

Group 3
- Sakana AI has introduced the Adaptive Branching Monte Carlo Tree Search (AB-MCTS) algorithm, which enhances large model reasoning through flexible search in two directions [3]
- The Multi-LLM AB-MCTS system lets multiple frontier models (Gemini 2.5 Pro, o4-mini, DeepSeek-R1-0528) collaborate, achieving a 30% performance improvement on the ARC-AGI-2 benchmark [3]
- The algorithm dynamically selects the best model for each problem, so that collective intelligence can surpass the limits of individual models. The underlying framework, TreeQuest, has been open-sourced [3]

Group 4
- HeyGen has launched a "product placement" feature that generates realistic promotional videos from an uploaded character avatar and product images, with Elon Musk promoting Labubu as a notable example [4]
- Founded by two Tongji University alumni, HeyGen is valued at $500 million, with annual revenue nearing $80 million and expected to surpass $100 million [5]
- Compared with competitors such as Topview, HeyGen excels in the naturalness of model expressions and in lip-sync accuracy, and offers unlimited short video production for $29 per month [5]

Group 5
- Baidu has made its most significant overhaul in nearly a decade, upgrading its search box to an AI smart box that supports ultra-long text while retaining the traditional search mode [6]
- The new "Bai Kan" feature changes how search results are displayed, prioritizing the most useful rich media content such as video explanations and intelligent summaries [6]
- Search has evolved from simple information retrieval to task delivery: users can obtain ratings, locations, and travel plans directly, and can even book a taxi or purchase a package with one click [6]

Group 6
- Microsoft has released the MAI-DxO medical AI system, which reports an accuracy of 85.5%, about four times that of a professional doctor with 10 years of experience [7]
- MAI-DxO simulates a real medical team's sequential diagnostic process through collaboration among five virtual doctor roles [7]
- The system offers five diagnostic modes for different scenarios, and comes with a professional sequential-diagnosis benchmark, SDBench, featuring 304 challenging diagnostic cases [7]

Group 7
- Baidu has launched its self-developed multi-modal generative large model MuseSteamer and the "Hui Xiang" platform, supporting high-quality video generation at 720p to 1080p and setting a new record on the VBench-I2V video generation leaderboard [8]
- The model comes in four versions: Lite (720p, fast), Turbo (720p, excellent character motion), Pro (1080p, cinematic quality), and Voice (automatically generates sound effects and dialogue), catering to different creative needs [8]
- Key technical highlights include precise understanding of Chinese semantics, a structured video description language, cinematic motion aesthetics, and integrated audio-video generation. The model is already applied in advertising creative work and short drama production [8]

Group 8
- Cloudflare has introduced the experimental "Pay Per Crawl" feature, which lets websites allow, charge, or block AI crawlers, giving content creators bargaining power over their content [10]
- Data shows a wide gap between AI crawlers and traditional search engines: Google returns one click for every 6-7 crawls, while OpenAI requires 1,500 crawls and Anthropic 73,300 crawls per click, disrupting the existing ecological balance [10]
- The feature implements fee control through HTTP 402 status codes and digital signature authentication. It is currently in beta and could shift internet content monetization from "advertising monetization" to "content licensing monetization" [10]

Group 9
- Chai Discovery, backed by OpenAI, has launched the Chai-2 multi-modal generative model, which achieves a 16% hit rate in de novo antibody design, more than 100 times better than the previous SOTA [11]
- Chai-2 found effective antibodies for 26 of 52 test targets (50%) within a single 24-well plate (≤20 designs), and can generate several formats, including scFv antibodies, VHH domains, and mini-binders [11]
- The model uses a controllable, model-driven framework that shortens the development cycle from months to two weeks, and achieves a 68% success rate in wet lab experiments for miniprotein design, potentially unlocking drug development capabilities beyond traditional technologies [11]

Group 10
- The New Yorker argues that AI teaches humans to write "good" articles but causes truly good articles to disappear [12]
- The article says AI is reconstructing culture with an "average" logic, leading to standardization and a loss of uniqueness in writing. MIT experiments showed significantly reduced brain activity among students who used ChatGPT for writing [12]
- Research indicates that AI leads to cultural homogenization: Cornell University experiments found that the AI-assisted writing styles of users from India and the US converge toward a "Western paradigm," with common references to pizza and Christmas [12]
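The Pay Per Crawl item in Group 8 describes three outcomes for an AI crawler (allow, charge, block) enforced with HTTP 402 and signature checks. The sketch below illustrates that decision logic only. It is not Cloudflare's API: the `CrawlerRule` type, the policy dictionary, the crawler names, and the use of 403 for blocked or unverified crawlers are all assumptions made for illustration.

```python
from dataclasses import dataclass
from http import HTTPStatus


@dataclass
class CrawlerRule:
    """Hypothetical per-crawler rule: action is 'allow', 'charge', or 'block'."""
    action: str
    price: float = 0.0


def decide_response(policy, crawler, signature_valid, offered_price=0.0):
    """Return the HTTP status a site might send to a crawler request.

    Assumptions (not from the source): unknown crawlers are blocked, an
    invalid signature is treated like a block, and a crawler that offers
    at least the configured price is served.
    """
    rule = policy.get(crawler, CrawlerRule("block"))
    if rule.action == "block" or not signature_valid:
        return HTTPStatus.FORBIDDEN
    if rule.action == "allow":
        return HTTPStatus.OK
    if offered_price >= rule.price:
        return HTTPStatus.OK
    return HTTPStatus.PAYMENT_REQUIRED  # HTTP 402


if __name__ == "__main__":
    policy = {
        "search-bot": CrawlerRule("allow"),
        "ai-bot": CrawlerRule("charge", price=0.01),
    }
    print(decide_response(policy, "search-bot", True))
    print(decide_response(policy, "ai-bot", True))
    print(decide_response(policy, "ai-bot", True, offered_price=0.01))
    print(decide_response(policy, "unknown-bot", True))
```

Keeping the rule per crawler mirrors the summary's point that a site owner chooses between permission, a fee, and a block for each AI crawler rather than applying one global setting.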
o3-pro draws a crowd by solving a hard word puzzle; former OpenAI employee mocks Apple: if this is not reasoning, what is?
QbitAI (量子位) · 2025-06-13 02:25
Core Viewpoint
OpenAI's latest reasoning model, o3-pro, demonstrates strong reasoning capabilities but mixed performance across evaluations, and it needs context and specific prompts to reach its potential [1][2][3][4].

Evaluation Results
- o3-pro reached a correct answer in 4 minutes and 25 seconds during a reasoning test, showing its ability to process complex queries [2].
- In official evaluations, o3-pro surpassed previous models such as o3 and o1-pro, becoming OpenAI's best coding model [8].
- In the LiveBench ranking, however, o3-pro showed only a slight advantage over o3, with a score difference of 0.07, and it lagged behind o3 in agentic coding (31.67 vs 36.67) [11].

Contextual Performance
- o3-pro excels in short-context scenarios, improving on o3, but struggles with long-context processing, scoring 65.6 against Gemini 2.5 Pro's 90.6 in 192k-context tests [15][16].
- The model's performance depends heavily on the background information provided, as user experiences show [24][40].

User Insights
- Bindu Reddy, a former executive at Amazon and Google, pointed out that o3-pro lacks proficiency in tool usage and agent capabilities [12].
- Ben Hylak, a former engineer at Apple and SpaceX, emphasized that o3-pro becomes far more effective when treated as a report generator rather than a chat model, and that it requires ample context for optimal results [22][24][26].

Comparison with Other Models
- Ben Hylak found o3-pro's outputs superior to those of Claude Opus and Gemini 2.5 Pro, highlighting its value in practical applications [39].
- The model's ability to understand its environment and accurately describe tool usage has improved, making it a better coordinator in tasks [30][31].

Conclusion
The evaluation of o3-pro shows that, while it has advanced reasoning capabilities, its performance is contingent on the context and prompts provided, so getting the most out of it requires a deliberate approach [40][41].
DeepSeek R1 hallucination rate cut by up to 50%; users call for the R2 model
Di Yi Cai Jing· 2025-05-29 14:10
Core Insights
The updated R1 model from DeepSeek has significantly improved its capabilities, particularly by reducing its "hallucination" rate, which previously stood at around 21% [1][4].

Model Performance
- The new R1 model has achieved top-tier performance in various benchmark tests, surpassing all domestic models and nearing international leaders such as o3 and Gemini-2.5-Pro [4].
- The hallucination rate has been reduced by approximately 45%-50% in tasks such as rewriting, summarization, and reading comprehension, providing more accurate and reliable results [4][18].
- In the AIME 2025 test, the model's accuracy on complex reasoning tasks improved from 70% to 87.5% [18].

Model Features and Capabilities
- The updated R1 model can generate longer and more structured writing, including essays, novels, and prose, while aligning more closely with human writing styles [18].
- The model's coding capabilities have also improved significantly, performing nearly on par with OpenAI's o3-high model in code testing environments [18].
- The new model has 685 billion parameters and supports a context length of 128K in the open-source version [19].

Future Developments
- The industry is anticipating the next-generation R2 model, and users have voiced their eagerness for its release [19].
- DeepSeek has not commented on speculation about the R2 model, but competition in the foundation model space remains intense [19].
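The percentages above mix relative and absolute changes, so a short worked calculation may help. It assumes the 45%-50% reduction is relative and applies to the roughly 21% baseline; the article does not state this explicitly, and the function names are illustrative.

```python
def rate_after_reduction(baseline, relative_reduction):
    """Rate that remains after cutting `baseline` by a relative fraction."""
    return baseline * (1.0 - relative_reduction)


def relative_gain(before, after):
    """Relative improvement of `after` over `before`."""
    return (after - before) / before


# "Around 21%" per the article; treating it as the baseline is an assumption.
BASELINE_HALLUCINATION = 0.21

if __name__ == "__main__":
    for cut in (0.45, 0.50):
        remaining = rate_after_reduction(BASELINE_HALLUCINATION, cut)
        print(f"{cut:.0%} relative cut leaves {remaining:.2%}")
    # AIME 2025 accuracy quoted in the article: 70% before, 87.5% after.
    print(f"AIME absolute gain: {(0.875 - 0.70) * 100:.1f} points")
    print(f"AIME relative gain: {relative_gain(0.70, 0.875):.0%}")
```

Under that assumption the hallucination rate would land at roughly 10.5% to 11.6%, while the AIME change is 17.5 percentage points, or a 25% relative improvement.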
Large models can't play Sudoku?! Startup founded by a Transformer author publishes a leaderboard: o3 Mini High scores just 2.9% on "variant Sudoku"
QbitAI (量子位) · 2025-05-28 04:22
Core Insights
The article discusses how AI models perform at solving Sudoku puzzles, reporting that overall accuracy is only 15% and that the best model achieves just 2.9% accuracy on 9x9 puzzles [1][25].

Group 1: AI Model Performance
- Sakana AI introduced a new benchmark, Sudoku-Bench, which tests AI models on Sudoku puzzles ranging from 4x4 to 9x9 [1][6].
- The leaderboard shows that the top-performing model, o3 Mini High, solved 14% of puzzles, while Gemini 2.5 Pro and Qwen 3 235B A22B solved 11% and 8% respectively [2][22].
- Even the most advanced models struggled, with many failing to place a single correct number in the puzzles [21][25].

Group 2: Challenges Faced by AI Models
- A significant issue is the "memory dependence" of large models: they rely on memorized solutions rather than logical reasoning [7][8].
- Models often fail to adapt to new rules or unseen patterns, leading to ineffective problem-solving strategies [9][10].
- Traditional Sudoku puzzles may be too simple a test for these models, which tend to memorize patterns instead of developing creative problem-solving skills [10].

Group 3: Innovative Testing Approach
- Sudoku-Bench includes "variant Sudoku" puzzles that require multi-step reasoning and cannot be solved from memory alone, making them well suited to testing AI reasoning [11][12].
- The benchmark features both traditional and modern Sudoku problems at varying difficulty levels [15][16].

Group 4: Company Background
- Sakana AI was founded in July 2023 by former Google researchers Llion Jones and David Ha, and focuses on generative AI models [24].
- The company has previously released AI models capable of generating academic papers and reviewing AI-generated content [26][29].
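For readers unfamiliar with what "solving" a puzzle means here, the sketch below checks a completed grid against the classic Sudoku rules only. It is not Sudoku-Bench's evaluation code, it does not cover the extra constraints of variant Sudoku, and it assumes square boxes, so it handles 4x4 and 9x9 grids but rejects 6x6.

```python
import math


def is_valid_solution(grid):
    """Check classic Sudoku rules on a filled n x n grid with square boxes.

    Every row, column, and box must contain each of 1..n exactly once.
    Grids whose size is not a perfect square (for example 6x6) return False.
    """
    n = len(grid)
    box = math.isqrt(n)
    if box * box != n or any(len(row) != n for row in grid):
        return False
    expected = set(range(1, n + 1))
    rows = [set(row) for row in grid]
    cols = [set(col) for col in zip(*grid)]
    boxes = [
        {grid[r][c] for r in range(br, br + box) for c in range(bc, bc + box)}
        for br in range(0, n, box)
        for bc in range(0, n, box)
    ]
    return all(group == expected for group in rows + cols + boxes)


if __name__ == "__main__":
    solved_4x4 = [
        [1, 2, 3, 4],
        [3, 4, 1, 2],
        [2, 1, 4, 3],
        [4, 3, 2, 1],
    ]
    print(is_valid_solution(solved_4x4))
```

Checking a finished grid is cheap; the difficulty the benchmark measures lies in reaching such a grid through multi-step deduction under unfamiliar rules.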
Tsinghua top students vs. AI on the gaokao's final, hardest questions: who wins?
Di Yi Cai Jing (第一财经) · 2025-05-27 15:21
Core Viewpoint
The article discusses significant advances in AI reasoning, particularly in education, as demonstrated by a competition between top students from Tsinghua University and AI models on challenging exam questions [1][2].

Group 1: AI Advancements
- The AI model DeepSeek-R1 has led to a breakthrough in reasoning capabilities, showing high adaptability in educational scenarios and improving the quality of guidance and Q&A [2].
- In a recent test, AI scored 697 out of 750 on a new high school exam, a level comparable to what top-tier universities require [2].
- AI performance in mathematics has been a focus, with OpenAI's o3-mini demonstrating superior reasoning on the FrontierMath benchmark [3].

Group 2: Educational Impact
- AI's ability to solve high-difficulty math problems has drawn attention, especially in the context of national exams, which are widely recognized for their difficulty [2].
- AI's contribution to the online education market is projected to rise from 7% to 16% between 2023 and 2027, reflecting the growing integration of AI in education [3].