GPT-5.2 indeed pulls ahead of Google's Gemini 3 Pro! Core contributions from Peking University School of Mathematical Sciences alumni
量子位· 2025-12-12 01:00
Core Insights
- OpenAI has released GPT-5.2, which significantly enhances capabilities across practical fields, including spreadsheet creation, presentation design, coding, and understanding lengthy documents [1][2][3]
- The model shows a marked improvement in visual understanding, accurately identifying more components on circuit boards [4]
- GPT-5.2 achieved a new state-of-the-art score of 90.5% on the ARC-AGI-1 test, with per-task cost dropping from $4,500 to $11.64, roughly a 390-fold efficiency gain over the past year [12][13]

Performance Enhancements
- GPT-5.2 demonstrates a 71% win rate against human experts on GDPval tests, completing tasks that typically take humans 4-8 hours in a fraction of the time [18][19]
- On investment banking tasks, GPT-5.2 Thinking improved its score from 59.1% to 68.4%, a 9.3 percentage point gain [21]
- The model's coding capabilities have also improved, reaching 80% on SWE-bench Verified and 55.6% on the more challenging SWE-Bench Pro [25][26]

Visual and Contextual Understanding
- The model shows a 50% reduction in error rate when interpreting figures in scientific papers and has improved spatial awareness of elements in images [34][36]
- GPT-5.2 Thinking is the first model to achieve near-100% accuracy on a 256k-context-length task, showcasing its ability to handle long documents effectively [30]

Tool Utilization and Scientific Applications
- Tool invocation capabilities have reached new heights, with GPT-5.2 scoring 98.7% on multi-turn interactions in telecom scenarios [40]
- In scientific assessments, GPT-5.2 Pro scored 93.2% on GPQA Diamond, indicating its suitability for assisting researchers [45]

Team and Development Insights
- OpenAI's recent advances are attributed to a new wave of talent, many with strong mathematical backgrounds, who joined the company in 2024 [57][58][59]
Break It 'Til You Make It: Building the Self-Improving Stack for AI Agents - Aparna Dhinakaran
AI Engineer· 2025-06-10 17:30
Agent Evaluation Challenges
- Building agents is difficult, requiring iteration at the prompt, model, and tool call definition levels [2][3]
- Systematically tracking the performance of new prompts against previous ones is challenging [4]
- Including product managers or other team members in the iterative evaluation process is difficult [5]
- Identifying bottlenecks in applications and pinpointing the specific sub-agents or tool calls that produce poor responses is hard [6]

Evaluation Components
- Agent evaluation should include evaluation at the tool call level, checking whether the right tool was called and whether the correct arguments were passed [7][11] (a minimal sketch of these checks appears after this summary)
- Trajectory evaluation is important for determining whether tool calls are executed in the correct order across a series of steps [7][20]
- Multi-turn conversation evaluation is necessary to assess consistency of tone and context retention across multiple interactions [8][22][23]
- Improving evaluation prompts is crucial, since the evals used to identify failure cases are the foundation for improving the agent [8][27]

Arize Product Features
- Arize offers a product for tracing and evaluating agent performance, letting teams ask questions about application performance and suggest improvements [12][13]
- The product provides a high-level view of the different paths an agent can take, helping pinpoint performance bottlenecks [14][15]
- Users can drill down into specific traces to evaluate tool call correctness and argument alignment [17][18]
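To make the tool-call-level and trajectory checks described above concrete, here is a minimal Python sketch. It assumes an agent trace is simply a list of tool-call records; every name in it (ToolCall, eval_tool_call, eval_trajectory, and the example tools search_docs and draft_reply) is an illustrative placeholder, not part of the Arize product or any specific framework.

```python
from dataclasses import dataclass

@dataclass
class ToolCall:
    """One step in an agent trace: which tool was invoked and with what arguments.
    This is a hypothetical record format, not an Arize data model."""
    name: str
    arguments: dict

def eval_tool_call(actual: ToolCall, expected: ToolCall) -> dict:
    """Tool-call-level evaluation: was the right tool called, with the right arguments?"""
    return {
        "correct_tool": actual.name == expected.name,
        "correct_args": actual.arguments == expected.arguments,
    }

def eval_trajectory(trace: list[ToolCall], expected_order: list[str]) -> bool:
    """Trajectory evaluation: do the expected tool names appear in the trace in order?
    Extra calls in between are tolerated; the expected ones must occur as a subsequence."""
    names = iter(call.name for call in trace)
    return all(step in names for step in expected_order)

# Example trace from a hypothetical support agent.
trace = [
    ToolCall("search_docs", {"query": "refund policy"}),
    ToolCall("draft_reply", {"tone": "friendly"}),
]

print(eval_tool_call(trace[0], ToolCall("search_docs", {"query": "refund policy"})))
# {'correct_tool': True, 'correct_args': True}
print(eval_trajectory(trace, ["search_docs", "draft_reply"]))
# True
```

Multi-turn conversation evaluation (tone consistency, context retention) would typically sit on top of this, scoring whole conversations rather than individual tool calls, often with an LLM-as-judge prompt of the kind the talk suggests iterating on.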