Late-Night Bombshell! Claude Sonnet 4.5 Goes Live with 30 Hours of Autonomous Coding; Netizen Test: A Single Call Refactored a Codebase, Adding 3,000 Lines of Code That Failed to Run
AI科技大本营· 2025-09-30 10:24
Core Viewpoint
- The article discusses the release of Claude Sonnet 4.5 by Anthropic, highlighting its advancements in coding capabilities and safety features, positioning it as a leading AI model in the market [1][3][10].

Group 1: Model Performance
- Claude Sonnet 4.5 has shown significant improvements in coding tasks, achieving over 30 hours of sustained focus in complex multi-step tasks, compared to approximately 7 hours for Opus 4 [3].
- In the OSWorld evaluation, Sonnet 4.5 scored 61.4%, a notable increase from Sonnet 4's 42.2% [6].
- The model outperformed competitors like GPT-5 and Gemini 2.5 Pro in various tests, including Agentic coding and terminal coding [7].

Group 2: Safety and Alignment
- Claude Sonnet 4.5 is touted as the most "aligned" model to date, having undergone extensive safety training to mitigate risks associated with AI-generated code [10].
- The model received a low score in automated behavior audits, indicating a lower risk of misalignment behaviors such as deception and power-seeking [11].
- It adheres to AI Safety Level 3 (ASL-3) standards, incorporating classifiers to filter dangerous inputs and outputs, particularly in sensitive areas like CBRN [13].

Group 3: Developer Tools and Features
- Anthropic has introduced several updates to Claude Code, including a native VS Code plugin for real-time code modification tracking [15].
- The new checkpoint feature allows developers to automatically save code states before modifications, enabling easy rollback to previous versions [21].
- The Claude Agent SDK has been launched, allowing developers to create custom agent experiences and manage long tasks effectively [19].

Group 4: Market Context and Competition
- The article notes a competitive landscape with other AI models like DeepSeek V3.2 also making significant advancements, including a 50% reduction in API costs [36].
- There is an ongoing trend of rapid innovation in AI tools, with companies like OpenAI planning new product releases to stay competitive [34].
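The checkpoint feature described above reduces to snapshotting project state before each model edit so any change can be undone. A minimal sketch of that idea in Python; all names here are hypothetical, and this is purely illustrative, not Claude Code's actual (non-public) implementation:

```python
class CheckpointStore:
    """Toy checkpoint store: snapshot file contents before an edit and
    roll back to any saved state. Illustrative only -- not the actual
    Claude Code implementation, whose internals are not public."""

    def __init__(self):
        self._snapshots = []  # list of (label, {path: content})

    def save(self, label, files):
        # Copy the mapping so later edits to `files` don't mutate the snapshot.
        self._snapshots.append((label, dict(files)))
        return len(self._snapshots) - 1  # checkpoint id

    def rollback(self, checkpoint_id):
        _label, files = self._snapshots[checkpoint_id]
        return dict(files)

store = CheckpointStore()
files = {"main.py": "print('v1')"}
cp = store.save("before-refactor", files)
files["main.py"] = "print('v2')"   # the model rewrites the file
files = store.rollback(cp)         # the developer rejects the change
print(files["main.py"])            # print('v1')
```

The essential design choice is copying state at save time, so the snapshot is immune to later in-place edits.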
Late-Night Bombshell: Claude Sonnet 4.5 Goes Live with 30 Hours of Autonomous Coding; Netizen Test: A Single Call Refactored a Codebase, Adding 3,000 Lines of Code That Failed to Run
36Ke· 2025-09-30 08:43
Core Insights
- Anthropic has launched the Claude Sonnet 4.5, claiming it to be the "best coding model in the world" with significant improvements over its predecessor, Opus 4 [1][2].

Performance Enhancements
- Claude Sonnet 4.5 can autonomously run for over 30 hours on complex multi-step tasks, a substantial increase from the 7 hours of Opus 4 [2].
- In the OSWorld evaluation, Sonnet 4.5 achieved a score of 61.4%, up from Sonnet 4's 42.2%, indicating a marked improvement in computer operation capabilities [4].
- The model outperformed competitors like GPT-5 and Gemini 2.5 Pro in various tests, including Agentic Coding and Agentic Tool Use [6][7].

Safety and Alignment
- Claude Sonnet 4.5 is touted as the most "aligned" model to date, having undergone extensive safety training to mitigate issues like "hallucination" and "deception" [9][10].
- It has received an AI Safety Level 3 (ASL-3) rating, equipped with protective measures against dangerous inputs and outputs, particularly in sensitive areas like CBRN [12].

Developer Tools and Features
- The update includes a native VS Code plugin for Claude Code, allowing real-time code modification tracking and inline diffs [13].
- A new checkpoint feature enables developers to save code states automatically, facilitating easier exploration and iteration during complex tasks [18].
- The Claude API has been enhanced with context editing and memory tools, enabling the handling of longer and more complex tasks [20].

Market Response and Competition
- Developers have expressed surprise at the capabilities of Claude Sonnet 4.5, with reports of it autonomously generating complete projects [21][22].
- The competitive landscape is intensifying, with other companies like DeepSeek also releasing new models that significantly reduce inference costs [29][32].
Claude Sonnet 4.5 Is Here! Capable of Coding Continuously for 30+ Hours and 11,000 Lines of Code
机器之心· 2025-09-30 00:27
Core Insights
- The article discusses the recent advancements in AI models, particularly the release of Claude Sonnet 4.5 by Anthropic, which is positioned as a leading model in various benchmarks and applications [1][4][5].

Model Performance
- Claude Sonnet 4.5 achieved significant performance improvements in various benchmarks, including:
  - 77.2% in Agentic coding [2]
  - 82.0% in SWE-bench Verified [2]
  - 61.4% in OSWorld for computer use, up from 42.2% in the previous version [11]
- The model shows enhanced capabilities in reasoning and mathematics, with a perfect score of 100% in high school math competitions [12][13].

Developer Tools and Features
- Anthropic introduced the Claude Agent SDK, allowing developers to create their own intelligent agents [4][35].
- New features include checkpoint functionality for saving progress, a revamped terminal interface, and native VS Code extensions [8][4].

Safety and Alignment
- Claude Sonnet 4.5 is noted for being the most aligned model to human values, with improvements in reducing undesirable behaviors such as flattery and deception [27][5].
- The model is released under AI safety level 3 (ASL-3), incorporating classifiers to detect potentially dangerous inputs and outputs [32].

User Experience and Applications
- Early user experiences indicate that Claude Sonnet 4.5 performs exceptionally well in specialized fields such as finance, law, and STEM [13][21].
- The "Imagine with Claude" feature allows real-time software generation without pre-defined functions, showcasing the model's adaptability [36][38].
HLE "Humanity's Last Exam" Tops 60 Points for the First Time: Eigen-1, Built on DeepSeek V3.1, Clearly Leads Grok 4 and GPT-5
36Ke· 2025-09-28 12:05
Core Insights
- Eigen-1 multi-agent system has achieved a historic breakthrough with Pass@1 accuracy of 48.3% and Pass@5 accuracy of 61.74% on the HLE Bio/Chem Gold test set, surpassing competitors like Google Gemini 2.5 Pro and OpenAI GPT-5 [1][6][27]
- The success is attributed to three innovative mechanisms: Monitor-based RAG, Hierarchical Solution Refinement (HSR), and Quality-Aware Iterative Reasoning (QAIR) [2][5][12]

Technical Innovations
- **Monitor-based RAG**: This mechanism eliminates the "tool tax" associated with traditional retrieval-augmented generation systems by continuously monitoring reasoning flow and seamlessly integrating retrieved knowledge, resulting in a 53.5% reduction in token consumption and a 43.7% decrease in workflow iterations [8][10]
- **Hierarchical Solution Refinement (HSR)**: HSR introduces a hierarchical collaboration model that allows stronger solutions to absorb valuable insights from weaker ones, enhancing the overall quality of the output [12][15]
- **Quality-Aware Iterative Reasoning (QAIR)**: This mechanism adapts the depth of iterations based on the quality of answers, ensuring efficient resource utilization by focusing on low-quality candidates for further exploration [15][18]

Performance Metrics
- Eigen-1's performance metrics demonstrate its superiority across various benchmarks, achieving Pass@1 of 48.3% and Pass@5 of 61.74% on HLE Bio/Chem Gold, and significantly higher scores on SuperGPQA Hard and TRQA [17]
- The model's accuracy improved from 25.3% to 48.3% through the integration of various components, showcasing the effectiveness of the innovative mechanisms [20][21]

Insights on Error Patterns
- Analysis reveals that 92.78% of errors stem from reasoning process issues, indicating that the core challenge lies in integrating knowledge with reasoning rather than mere knowledge retrieval [18]

Implications for AI in Science
- The breakthrough signifies a new paradigm for AI-assisted scientific research, suggesting that AI can effectively understand and reason through complex human knowledge, thus accelerating the research process [27]
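The Pass@1/Pass@5 figures above follow the standard pass@k convention from model evaluation. A minimal sketch of the usual unbiased estimator (Chen et al., 2021), assuming n attempts per problem of which c are correct; this is the conventional formula, not code from the Eigen-1 paper:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of
    k samples drawn (without replacement) from n attempts, of which c
    are correct, is a correct one."""
    if n - c < k:
        # Fewer incorrect attempts than k draws: a correct one is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 5 attempts and 2 correct, pass@1 is just the raw accuracy 2/5:
print(round(pass_at_k(5, 2, 1), 3))  # 0.4
# pass@5 over those same 5 attempts is certain to hit a correct one:
print(pass_at_k(5, 2, 5))  # 1.0
```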
HLE "Humanity's Last Exam" Tops 60 Points for the First Time! Eigen-1, Built on DeepSeek V3.1, Clearly Leads Grok 4 and GPT-5
量子位· 2025-09-28 11:54
Core Insights
- The article highlights a significant breakthrough in AI capabilities with the Eigen-1 multi-agent system achieving a Pass@1 accuracy of 48.3% and Pass@5 accuracy of 61.74% on the HLE Bio/Chem Gold test set, surpassing major competitors like Google Gemini 2.5 Pro and OpenAI GPT-5 [1][5][39].

Technical Innovations
- The success of Eigen-1 is attributed to three innovative mechanisms: Monitor-based RAG, Hierarchical Solution Refinement (HSR), and Quality-Aware Iterative Reasoning (QAIR) [3][15][20].
- Monitor-based RAG reduces the "tool tax" associated with traditional retrieval-augmented generation systems, leading to a 53.5% reduction in token consumption and a 43.7% decrease in workflow iterations while maintaining higher accuracy [11][12][37].
- HSR introduces a hierarchical collaboration model that allows stronger solutions to absorb valuable insights from weaker ones, enhancing the overall problem-solving process [15][18].
- QAIR optimizes the iterative reasoning process by adjusting the depth of exploration based on the quality of answers, ensuring efficient resource utilization [20][21].

Performance Metrics
- Eigen-1's performance metrics indicate a significant lead over competitors, with Pass@1 and Pass@5 scores of 48.3% and 61.74% respectively in HLE Bio/Chem Gold, and also strong performances in SuperGPQA Hard and TRQA tasks [27][22].
- The article provides a comparative table showcasing the performance of various models, highlighting Eigen-1's superior results [22].

Insights on Error Patterns
- Analysis reveals that 92.78% of errors stem from reasoning process issues, indicating that the core challenge lies in seamlessly integrating knowledge with reasoning rather than mere knowledge retrieval [24][25].
- The article notes that execution and understanding errors are relatively low, suggesting that models have matured in instruction comprehension [26].

Component Contribution Analysis
- The team conducted ablation studies to quantify the contributions of each component, demonstrating that the baseline system achieved only 25.3% accuracy without external knowledge, while the full system reached 48.3% accuracy with efficient token usage [29][31].

Implications for AI in Science
- The breakthrough signifies a new paradigm for AI-assisted scientific research, suggesting that AI can become a powerful ally for scientists in tackling complex problems [39][40].
- The research team plans to continue optimizing the architecture and exploring applications in other scientific fields, indicating a commitment to advancing AI capabilities in research workflows [42].
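As a rough illustration of the quality-aware iteration idea, here is a minimal sketch: only candidates scoring below a quality threshold get further refinement rounds. The scorer and refinement step are stubs, and none of these names or parameters come from the Eigen-1 paper:

```python
def quality(answer: str) -> float:
    # Stub scorer for illustration; a real system would use a judge
    # model or a domain rubric. Here, longer answers simply score higher.
    return min(1.0, len(answer) / 20)

def refine(answer: str) -> str:
    # Stub refinement standing in for another reasoning pass.
    return answer + "."

def qair_loop(candidates, threshold=0.8, max_rounds=10):
    """Quality-aware iteration, sketched: candidates below the quality
    threshold receive further refinement rounds, so compute is spent
    where answers are weakest; strong candidates are left untouched."""
    for _ in range(max_rounds):
        weak = [c for c in candidates if quality(c) < threshold]
        if not weak:
            break  # every candidate already meets the quality bar
        candidates = [refine(c) if quality(c) < threshold else c
                      for c in candidates]
    return max(candidates, key=quality)

best = qair_loop(["draft answer", "a fully worked-out answer"])
print(best)  # the higher-quality candidate wins
```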
OpenAI Studies Large Models' Contribution to GDP: Three Industries Where AI Can Already Replace Humans, and OpenAI Admits It Trails Claude
机器之心· 2025-09-27 06:13
Core Viewpoint
- The article discusses the introduction of GDPval, a new evaluation method by OpenAI that assesses AI model performance on economically valuable real-world tasks, indicating that AI is nearing human-level performance in various industries [1][3][22].

Group 1: Evaluation Methodology
- GDPval uses GDP as a key economic indicator and extracts tasks from critical occupations in the top nine industries contributing to the GDP [3][16].
- The evaluation includes 1,320 professional tasks, with a golden open-source subset of 220 tasks, designed and reviewed by experienced professionals [18][22].
- Tasks are based on real work outcomes, ensuring the evaluation's realism and diversity compared to other benchmarks [18][19].

Group 2: Model Performance
- The evaluation results show that leading models like Claude Opus 4.1 and GPT-5 are approaching or matching the quality of human experts in various tasks [4][9].
- Claude Opus 4.1 excels in aesthetic tasks, while GPT-5 performs better in accuracy-related tasks [9][10].
- Performance improvements have been significant, with task completion speed being approximately 100 times faster and costs being 100 times lower than human experts [13].

Group 3: Industry Impact
- AI has reached or surpassed human-level capabilities in sectors such as government, retail, and wholesale [7].
- The early results from GDPval suggest that AI can complete some repetitive tasks faster and at a lower cost than human experts, potentially transforming the job market [21].
- OpenAI aims to democratize access to these tools, enabling workers to adapt to changes and fostering economic growth through AI integration [21].

Group 4: Future Developments
- OpenAI plans to expand GDPval to include more occupations, industries, and task types, enhancing interactivity and addressing more ambiguous tasks [22].
- The ongoing improvements in the evaluation method indicate a commitment to better measure the progress of diverse knowledge work [22].
Who Is the Strongest "AI Worker"? OpenAI Ran the Test Itself, and the Winner Wasn't Its Own Model
量子位· 2025-09-26 04:56
Xifeng from Aofeisi, QbitAI | WeChat official account QbitAI

OpenAI has published new research in which it ends up praising Claude.

The company proposes a new benchmark called GDPval to measure how AI models perform on real-world tasks with economic value. OpenAI has also open-sourced a high-quality subset of 220 tasks and provides a public automated grading service.

Specifically, GDPval covers 44 occupations across the 9 industries that contribute most to US GDP; together these occupations generate a combined $3 trillion in annual revenue. The tasks were designed from representative work by industry experts with an average of 14 years of experience.

Professional graders compared the outputs of mainstream models against the deliverables of human experts.

In the final results, Claude Opus 4.1 was the best-performing model, with 47.6% of its outputs judged comparable to human expert work. GPT-5, at 38.8%, still trails Claude and came in second; GPT-4o won or tied against humans only 12.4% of the time.

Having missed the top spot, OpenAI offered a caveat: different models have different strengths. Claude Opus 4.1 stands out mainly on aesthetics, while GPT-5 is stronger on accuracy.

OpenAI also noted that the pace of progress is just as striking: its frontier models nearly doubled their win rate in a single year.

Netizens ...
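The 47.6% and 12.4% figures above are "win or tie" rates from pairwise grading against human experts. A minimal sketch of that metric; the judgment labels are assumed, and the real GDPval grading protocol is more involved:

```python
from collections import Counter

def win_or_tie_rate(judgments):
    """GDPval-style headline metric, sketched: the fraction of tasks
    where graders rated the model's deliverable better than ("win") or
    as good as ("tie") the human expert's. One label per task assumed."""
    counts = Counter(judgments)
    return (counts["win"] + counts["tie"]) / len(judgments)

# Hypothetical grader labels for five tasks:
sample = ["win", "tie", "loss", "loss", "win"]
print(win_or_tie_rate(sample))  # 0.6
```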
From the Application Layer to the Data Layer: Google Strikes on Three Fronts in a Multi-Dimensional AI War
36Ke· 2025-09-25 10:00
Core Insights
- Google has launched a comprehensive AI strategy that targets both consumer and business markets, indicating a deeper strategic approach than previously perceived [1][7]

Group 1: Consumer Market Strategy
- Google introduced the AI Plus subscription at approximately $5 per month in over 40 countries, strategically targeting the "sensitive $20 price range" to penetrate markets with lower purchasing power [2]
- The pricing strategy is designed to test the limits of payment capability, aiming for user habit formation with minimal entry barriers [2]
- Gemini 2.5 Pro's capabilities, such as processing 45-minute videos or 8-hour audio, differentiate it from competitors like ChatGPT Go, positioning it as a "multimodal productivity suite" [2]

Group 2: Vertical Market Penetration
- Google launched the Mixboard tool, which allows users to create mood boards in a fraction of the time compared to competitors like Pinterest, enhancing user experience through advanced image processing capabilities [4]
- The Mixboard tool integrates seamlessly with Google Shopping, creating a closed-loop ecosystem that enhances user engagement and retention [4]

Group 3: Business Infrastructure Development
- The release of the Data Commons MCP Server addresses the issue of AI hallucinations by providing high-quality structured data to AI systems, effectively creating a "fact library" [5]
- Google aims to establish itself as a trusted data infrastructure provider in the AI era by promoting "data democratization" and setting open standards through initiatives like the MCP [5]
- By offering tools like Gemini CLI and Colab notebooks, Google is strategically locking developers into its data ecosystem, solidifying its position as a rule-maker in the industry [5]

Group 4: Competitive Landscape
- Google's multi-faceted approach outlines a clear competitive landscape in AI, focusing on consumer base expansion, vertical tool penetration, and establishing industry authority through data infrastructure [7]
- The strategy aims to transform AI from a mere tool into an "ecosystem operating system," embedding users, creators, and developers deeply within Google's AI network [7]
- Competitors like OpenAI and Pinterest may have limited time to adapt and find differentiation points in response to Google's aggressive strategy [7]
Alibaba Open-Sources Its Flagship Qwen3-VL Models, in Two Versions
Di Yi Cai Jing· 2025-09-25 06:08
According to the official Tongyi Qianwen (Qwen) WeChat account, Alibaba has released the newly upgraded Qwen3-VL series, the most powerful visual-understanding models in the Qwen family to date. The first model open-sourced in the series is its flagship, Qwen3-VL-235B-A22B, available in both Instruct and Thinking versions. The Instruct version reportedly matches or exceeds Gemini 2.5 Pro on several mainstream visual-perception benchmarks, while the Thinking version achieves SOTA results on many multimodal reasoning benchmarks. (Source: Di Yi Cai Jing)
In Just a Few Minutes, AI Easily Passed the CFA Level III Exam
华尔街见闻· 2025-09-25 04:09
Core Insights
- Recent research indicates that multiple AI models can pass the prestigious CFA Level III exam in just a few minutes, a feat that typically requires humans several years and around 1,000 hours of study [1][3].

Group 1: AI Model Performance
- A total of 23 large language models were tested, with leading models like o4-mini, Gemini 2.5 Pro, and Claude Opus successfully passing the CFA Level III mock exam [1][4].
- The Gemini 2.5 Pro model achieved the highest overall score of 2.10, while also scoring 3.44 in essay evaluations, making it the top performer [5][10].
- The KIMI K2 model excelled in multiple-choice questions with an accuracy rate of 78.3%, outperforming Google's Gemini 2.5 Pro and GPT-5 [6][10].

Group 2: Technological Advancements
- The research highlights that AI models have overcome previous barriers, particularly in the essay section of the CFA Level III exam, which was a significant challenge for AI two years ago [3][4].
- The use of "chain-of-thought prompting" techniques has enabled these advanced reasoning models to effectively tackle complex financial problems [2][4].

Group 3: Evaluation Metrics
- The study employed three prompting strategies: zero-shot, self-consistency, and self-discovery, with self-consistency yielding the best performance score of 73.4% [9].
- In terms of cost efficiency, the Llama 3.1 8B Instant model received a score of 5468, while the Palmyra Fin model achieved the fastest average response time of 0.3 seconds [9][10].

Group 4: Limitations of AI
- Despite the impressive performance of AI in standardized testing, industry experts caution that AI cannot fully replace human financial professionals due to limitations in understanding context and intent [10].
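The self-consistency strategy mentioned above samples several independent chain-of-thought answers and majority-votes the final answer (Wang et al., 2022). A minimal sketch, with the model call replaced by a stub:

```python
import random
from collections import Counter

def self_consistency(sample_fn, n_samples=15):
    """Self-consistency, sketched: draw several independent
    chain-of-thought samples and return the majority-vote final answer.
    sample_fn stands in for one model call returning a final answer."""
    answers = [sample_fn() for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# Stub "model": a noisy solver that picks the right option 70% of the time.
def noisy_solver():
    return "B" if random.random() < 0.7 else "C"

random.seed(0)
print(self_consistency(noisy_solver))  # "B" with this seed
```

Majority voting over independent samples is why the technique helps on exam-style questions: occasional reasoning slips are outvoted by the answer the model reaches most often.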