$1 in Tokens Leverages $4,800 in Returns! AI Takes On a Million-Dollar Benchmark, and the Most Profitable Agent Has Emerged
红杉汇· 2026-03-11 00:04
Core Insights
- The article discusses the $OneMillion-Bench, which evaluates AI on tasks equivalent to $1 million of human expert work, finding that current AI can deliver approximately $480,000 worth of work for a token cost of about $100 [2][14].

Evaluation Framework
- The $OneMillion-Bench is developed by xbench in collaboration with top institutions and experts, featuring a dual-track assessment system, AGI Tracking and Profession Aligned, which probes both the model's intelligence limits and its practical utility in real business workflows [2][3].
- The benchmark includes 400 high-difficulty tasks (200 in English and 200 in Chinese) across five major fields: finance, law, healthcare, natural sciences, and industry, designed to reflect real-world expert tasks [5][9].

Economic Value Calculation
- Each task's economic value is determined by the time taken by senior experts multiplied by their hourly wage, with the total value across all tasks exceeding $1 million, providing a more tangible measure of AI's deliverable value [7][8].
- The breakdown by domain shows that finance, law, and healthcare tasks contribute most of the total value, with finance tasks alone valued at approximately $296,432 in the Chinese track and $183,726 in the global track [8].

Key Design Features
- The benchmark emphasizes high-value tasks that reflect real-world economic value, breaking each task into 15-35 assessment points for a total of over 7,000 points [9].
- A non-symmetric negative scoring mechanism prevents models from gaming the system, ensuring that logical structure and content quality are prioritized in evaluations [10].
- The tasks are categorized into 92 subfields, with separate assessments for Chinese and global contexts, improving the relevance and accuracy of the evaluations [11].
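The per-task economic value formula above (senior-expert completion time multiplied by hourly wage, summed over all tasks) can be sketched in a few lines of Python. The task list and wages below are purely illustrative, not data from the benchmark:

```python
# Minimal sketch of the $OneMillion-Bench valuation rule: each task's
# economic value = expert completion time (hours) x expert hourly wage.
# The tasks below are hypothetical examples, not benchmark entries.

def task_value(hours: float, hourly_wage: float) -> float:
    """Economic value of one task: senior-expert time times hourly wage."""
    return hours * hourly_wage

# Illustrative tasks: (field, expert hours, hourly wage in USD)
tasks = [
    ("finance", 40, 150.0),
    ("law", 25, 200.0),
    ("healthcare", 30, 180.0),
]

# Summing task values gives the benchmark's total; the real 400-task
# pool totals over $1 million.
total = sum(task_value(h, w) for _, h, w in tasks)
print(total)  # 16400.0 (illustrative figure only)
```

The same rule scaled to 400 tasks, with each task worth thousands of dollars of expert time, is what pushes the benchmark's total past $1 million.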
Performance Insights
- The top-performing models have a pass rate exceeding 40%, indicating that while AI can deliver substantial value, it still has a way to go before being fully reliable for complex tasks [14][16].
- While average scores suggest competence, the pass rate (the share of tasks scoring 70% or above) is a more accurate measure of AI's readiness for real-world applications [16][17].
- Complex reasoning remains a challenge, with models often lacking the depth and detail required for high-stakes tasks, particularly in fields like software engineering and healthcare [19].

Future Directions
- The $OneMillion-Bench aims to further assess AI's capabilities in high-barrier expert tasks, helping AI transition from a basic efficiency tool to a digital employee capable of working alongside top human experts [20].
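The gap described above between a model's average score and its pass rate can be made concrete. The 70% delivery threshold comes from the article; the score values below are illustrative:

```python
# Average score vs. pass rate: a task "passes" only if its score reaches
# the 70% delivery threshold. Scores here are hypothetical, on a 0-100 scale.

def pass_rate(scores, threshold=70):
    """Fraction of tasks scoring at or above the delivery threshold."""
    passed = [s for s in scores if s >= threshold]
    return len(passed) / len(scores)

scores = [85, 72, 55, 60, 90, 40, 75, 68, 30, 80]

avg = sum(scores) / len(scores)
print(avg)               # 65.5 -- looks "competent" on average
print(pass_rate(scores)) # 0.5  -- but only half the tasks are deliverable
```

A model can accumulate partial credit on nearly every task and still fail to deliver most of them, which is why the article treats pass rate as the more honest readiness metric.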
$1 in Tokens Leverages $4,800 in Returns! AI Takes On a Million-Dollar Benchmark, and the Most Profitable Agent Has Emerged
机器之心· 2026-03-10 01:32
Core Insights
- The article discusses the economic value of AI in completing expert-level tasks, estimating that AI can deliver $480,000 worth of work against a task pool valued at $1 million, at a cost of only $100 in tokens [1][17].
- The $OneMillion-Bench was developed to measure AI's performance in terms of economic value, using real-world expert tasks across various fields [4][11].

Group 1: $OneMillion-Bench Overview
- $OneMillion-Bench includes 400 challenging tasks across five major fields: finance, law, healthcare, natural sciences, and industry, with tasks designed to reflect real-world scenarios [4][8].
- The benchmark aims to combine high economic value with strong differentiation while allowing automated assessment of AI models [4][11].
- Each task is assigned an economic value based on the time and hourly wage of expert completion, totaling over $1 million [8][9].

Group 2: Task Design and Evaluation
- The benchmark incorporates a diverse range of real-world expert tasks, each designed to assess specific decision-making abilities in realistic scenarios [12][14].
- A non-linear scoring system evaluates model performance: positive scoring is conservative, and significant penalties are applied for critical errors [13][11].
- The tasks are categorized into 92 subfields, with separate assessments for Chinese and global contexts to accurately reflect regional differences [14][11].

Group 3: Model Performance Insights
- The strongest models currently achieve a pass rate of over 40%, indicating that while AI can deliver substantial value, it still has a way to go before being fully reliable for professional tasks [17][19].
- The average scores of top models suggest they cover many key points, but the pass rate shows that fewer than half of the tasks meet the required standards for delivery [19][20].
- Complex reasoning remains a challenge, with models struggling to provide executable details in tasks requiring deep understanding and multi-step reasoning [23][19].

Group 4: Future Directions
- The article emphasizes the need for AI to increase its deliverable value and to make that value stable, verifiable, and controllable, so that capability gains translate into productivity and revenue [25][26].
- The significance of $OneMillion-Bench lies in quantifying the capabilities of "digital employees," helping organizations determine which tasks can be confidently assigned to AI [26].
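The scoring asymmetry both articles describe (conservative positive credit per assessment point, heavy penalties for critical errors) can be sketched as follows. The point counts and penalty weight here are hypothetical choices for illustration, not the benchmark's actual rubric:

```python
# Sketch of a non-symmetric scoring rule: credit is earned slowly per
# assessment point, while each critical error draws a large penalty, so
# padding an answer with confident guesses cannot game the score.
# The 0.2 penalty weight is an assumed value, not from the benchmark.

def score_task(points_hit: int, total_points: int, critical_errors: int,
               penalty: float = 0.2) -> float:
    """Fraction of assessment points earned, minus a heavy per-error penalty,
    floored at zero."""
    base = points_hit / total_points
    return max(0.0, base - penalty * critical_errors)

print(score_task(20, 25, 0))  # 0.8 -- clean answer clears the 70% bar
print(score_task(20, 25, 2))  # 0.4 -- same coverage, two critical errors sink it
```

Under a rule like this, a model that covers many points but makes even a couple of high-stakes mistakes fails the task, which matches the articles' emphasis on logical structure and content quality over breadth.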