Minecraft Becomes a New "Exam Hall" for AI? High School Senior Uses the Game to Evaluate AI: DeepSeek-R1 Ranks Third

Core Insights
- A high school student, Adi Singh, has developed a new AI evaluation benchmark called MC-Bench, which uses the game Minecraft to assess AI models' capabilities in a more intuitive manner [1][2][10]
- Traditional standardized tests often give AI models an unfair advantage, because the models are optimized for those specific tasks, leading to discrepancies with real-world performance [2][8]
- MC-Bench lets users vote on AI-generated architectural designs in Minecraft, providing a crowdsourced method for evaluating AI performance [5][9]

Group 1: MC-Bench Overview
- MC-Bench evaluates AI models by having them create structures in Minecraft based on user prompts, such as "a crystal-clear wine glass filled with deep red wine" [2][5]
- Users vote to select the best creations, and results are revealed only after voting concludes [5][10]
- The project has drawn attention from major AI companies such as OpenAI, Google, and Anthropic, which provide computational resources but are not officially collaborating [10][13]

Group 2: Advantages of Game-Based Evaluation
- Minecraft is a familiar and visually engaging platform, making it easier for the general public to understand and participate in AI assessments [7][8]
- The game environment offers a controlled testing space in which AI reasoning and planning abilities can be evaluated safely [7][8]
- Game-based assessments can simulate real-world complexity, test AI decision-making skills, and provide a repeatable environment for comparison [7][8]

Group 3: Current Status and Future Plans
- MC-Bench currently tests mainly the basic construction abilities of AI models, tracking their progress since the GPT-3 era [10][16]
- Future plans include expanding the benchmark to more complex tasks that require long-term planning and goal-oriented actions [10][16]
- On the MC-Bench leaderboard, Claude 3.7 Sonnet ranks first and DeepSeek-R1 is currently in third place, indicating the platform's effectiveness in reflecting user experiences with these models [14][16]
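
The summary does not say how MC-Bench turns user votes into leaderboard positions. As a minimal sketch only, assuming each vote is a blind head-to-head choice between two builds and that models are ranked by plain win rate, the tally might look like the following. The function name `leaderboard`, the `(winner, loser)` vote format, and the placeholder model names are all hypothetical and are not taken from the project.

```python
from collections import defaultdict


def leaderboard(votes):
    """Rank models by win rate.

    `votes` is an iterable of (winner, loser) model-name pairs.
    This vote format is an assumption, not MC-Bench's documented schema.
    """
    wins = defaultdict(int)
    games = defaultdict(int)
    for winner, loser in votes:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    # Highest win rate first; ties are broken alphabetically for determinism.
    return sorted(
        ((model, wins[model] / games[model]) for model in games),
        key=lambda item: (-item[1], item[0]),
    )


# Hypothetical votes with placeholder names, not real MC-Bench data.
sample_votes = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
    ("model_c", "model_b"),
]
ranking = leaderboard(sample_votes)
for model, rate in ranking:
    print(f"{model}: {rate:.2f}")
```

Win rate is the simplest possible aggregate. A production leaderboard would more plausibly use a rating system that accounts for opponent strength and uneven numbers of comparisons, but nothing in the summary confirms which method MC-Bench uses.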