Musk Reposts New Benchmark from ByteDance Seed & Columbia Business School: LLMs Doing Finance Can Get Even a Stock Price Lookup Wrong
Sohu Finance (Sou Hu Cai Jing) · 2025-09-21 02:34
Core Insights
- The article covers the launch of FinSearchComp, an open-source financial search and reasoning benchmark developed by ByteDance's Seed team with Columbia Business School to evaluate how well AI models perform financial analysis tasks [1][3][5]

Evaluation Results
- On the global dataset, the best-performing model, Grok 4 (web), reached 68.9% accuracy, 6.1 percentage points behind human experts. On the Greater China dataset, Doubao (web) led with 53.3% accuracy, about 35 percentage points below the human experts' 88.3% [1][11]

Task Design
- FinSearchComp defines three progressively harder task categories that mirror a financial analyst's daily work:
  1. Time-sensitive data fetching: retrieving real-time data such as stock prices [7]
  2. Simple historical lookup: retrieving a fact at a fixed point in time [7]
  3. Complex historical investigation: aggregating and analyzing data across multiple periods [7]

Data Reliability
- The benchmark's quality is backed by ByteDance's Xpert platform, which supplies expert knowledge and high-quality AI training data. The project involved 70 financial experts, and answers were cross-validated against official sources and professional financial databases [9][10]

Importance of Search Capability
- The evaluation underlined how much search matters: models with web search performed significantly better across tasks, while models without search scored zero on time-sensitive tasks. Accurate financial analysis therefore depends on access to real-time data [12][11]

Industry Implications
- The findings suggest that AI can assist with financial data retrieval but still has considerable room for improvement. The article calls for a comprehensive evaluation system for financial AI, akin to a "driving license" for AI products, to establish reliability before such products can fully replace human analysts [13]
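
The percentage-point gaps quoted in the evaluation results follow from simple subtraction. The short Python sketch below reproduces that arithmetic using only the figures given in the summary. The variable names are illustrative, and the global human-expert accuracy is derived from the stated 6.1-point gap; it is not a number quoted in the source.

```python
# Accuracy figures (in percent) as quoted in the summary.
grok4_web_global = 68.9          # best model on the global dataset
gap_to_humans_global = 6.1       # stated gap to human experts, in percentage points
doubao_web_greater_china = 53.3  # best model on the Greater China dataset
human_greater_china = 88.3       # human experts on the Greater China dataset

# Derived (not quoted): human-expert accuracy implied on the global dataset.
human_global = round(grok4_web_global + gap_to_humans_global, 1)

# Gap between human experts and the best model on the Greater China dataset.
gap_greater_china = round(human_greater_china - doubao_web_greater_china, 1)

print(f"Implied human accuracy (global): {human_global}%")
print(f"Gap on Greater China dataset: {gap_greater_china} percentage points")
```

Rounding to one decimal place matches the precision of the quoted figures and avoids floating-point noise in the printed values.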