$1 of Tokens Unlocks $4,800 in Returns! AI Takes On a Million-Dollar Benchmark, and the Most Profitable Agent Emerges

Core Insights
- The article covers $OneMillion-Bench, a benchmark of expert-level tasks whose combined human-expert value exceeds $1 million. It finds that current AI can deliver approximately $480,000 worth of work for a token cost of about $100 [2][14].

Evaluation Framework
- $OneMillion-Bench was developed by xbench in collaboration with top institutions and experts. It follows a dual-track assessment system, AGI Tracking and Profession Aligned, which together target the limits of model intelligence and practical utility in real business workflows [2][3].
- The benchmark includes 400 high-difficulty tasks (200 in English and 200 in Chinese) across five major fields: finance, law, healthcare, natural sciences, and industry. The tasks are designed to reflect real-world expert work [5][9].

Economic Value Calculation
- Each task's economic value is the time a senior expert would need to complete it multiplied by that expert's hourly wage. The total across all tasks exceeds $1 million, which gives a more tangible measure of the value AI can deliver [7][8].
- The breakdown by domain shows that finance, law, and healthcare tasks account for a large share of the total, with finance tasks alone valued at approximately $296,432 in China and $183,726 globally [8].

Key Design Features
- The benchmark emphasizes high-value tasks that reflect real-world economic value. Each task is broken down into 15-35 assessment points, for a total of over 7,000 points [9].
- An asymmetric negative scoring mechanism prevents models from gaming the system and ensures that logical structure and content quality are prioritized in evaluation [10].
- The tasks are categorized into 92 subfields, with separate assessments for Chinese and global contexts to improve the relevance and accuracy of the evaluations [11].
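The value formula and the rubric idea described above can be made concrete with a minimal Python sketch. Only the rule "value = expert time x hourly wage" comes from the article. The `Task` class, the `rubric_score` function, the sample hours and wages, and the exact form of the penalty (negative-weight points that subtract from earned credit, with the result floored at zero) are hypothetical illustrations, since the article does not specify how the asymmetric negative scoring is actually computed.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """One benchmark task (hypothetical structure, not the benchmark's schema)."""
    domain: str
    expert_hours: float     # time a senior expert would need (made-up figure)
    hourly_wage_usd: float  # that expert's hourly rate (made-up figure)

    @property
    def value_usd(self) -> float:
        # Economic value = expert time multiplied by hourly wage.
        return self.expert_hours * self.hourly_wage_usd


def rubric_score(points):
    """Score one answer against a list of (weight, triggered) assessment points.

    ASSUMPTION: positive weights are credit for required content and negative
    weights are penalties for errors. The score is normalized by the positive
    weights only, so penalties can cancel credit but can never raise the
    ceiling. This is one possible reading of "asymmetric negative scoring".
    """
    max_positive = sum(w for w, _ in points if w > 0)
    if max_positive == 0:
        return 0.0
    earned = sum(w for w, triggered in points if triggered)
    return max(0.0, earned / max_positive)


# Illustrative data only; these are not figures from the benchmark.
tasks = [Task("finance", 10.0, 200.0), Task("law", 5.0, 300.0)]
total_value = sum(t.value_usd for t in tasks)

clean_answer = rubric_score([(2, True), (3, True), (-1, False)])
padded_answer = rubric_score([(2, True), (3, False), (-4, True)])
```

Under this reading, an answer that hits one easy point but also triggers a heavy penalty scores zero, so padding an answer with plausible but wrong content does not pay off.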
Performance Insights
- The top-performing models have a pass rate exceeding 40%. AI can therefore deliver substantial value, but it is still some way from being fully reliable on complex tasks [14][16].
- Average scores suggest competence, but the article argues that the pass rate (the share of tasks scoring 70% or above) is a more accurate measure of AI's readiness for real-world applications [16][17].
- Complex reasoning remains a challenge. Models often lack the depth and detail that high-stakes tasks require, particularly in fields like software engineering and healthcare [19].

Future Directions
- $OneMillion-Bench aims to further assess AI's capabilities on high-barrier expert tasks as AI moves from a basic efficiency tool to a digital employee capable of working alongside top human experts [20].
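The difference between average score and pass rate noted under Performance Insights can be illustrated with a short Python sketch. The 70% threshold and the $480,000 and $100 figures come from the article. The function names, the sample scores and task values, and the assumption that delivered value is the summed value of passed tasks are hypothetical, since the article does not state how delivered value is aggregated.

```python
def summarize(scores, task_values, pass_threshold=0.70):
    """Return (mean score, pass rate, delivered value) for one model.

    ASSUMPTION: a task's value counts as delivered only if its score reaches
    the pass threshold. The article does not spell out this aggregation.
    """
    passed = [s >= pass_threshold for s in scores]
    mean_score = sum(scores) / len(scores)
    pass_rate = sum(passed) / len(scores)
    delivered = sum(v for v, ok in zip(task_values, passed) if ok)
    return mean_score, pass_rate, delivered


def value_per_token_dollar(delivered_usd, token_cost_usd):
    """Dollars of expert work delivered per dollar spent on tokens."""
    return delivered_usd / token_cost_usd


# Illustrative scores and task values only; not benchmark data.
scores = [0.90, 0.65, 0.72, 0.40]
values = [1000.0, 2000.0, 3000.0, 4000.0]
mean_score, pass_rate, delivered = summarize(scores, values)

# Figures quoted in the article: about $480,000 of work for about $100 of tokens.
headline_ratio = value_per_token_dollar(480_000, 100)
```

In this toy data the mean score is about 0.67, which looks close to passing, yet only half of the tasks clear the 70% bar. The last line reproduces the headline ratio of roughly $4,800 of work per $1 of tokens.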