Werewolf Benchmark

36Kr · 2025-09-01 07:31

Core Insights
- Seven leading large language models (LLMs) competed in a game of Werewolf. GPT-5 emerged as champion with a 96.7% win rate, well ahead of second-place Gemini 2.5 Pro at 63.3% [1][2][3].

Group 1: Competition Overview
- A total of 210 matches were played, with each model playing 10 matches against each of the others [2][3].
- The models were GPT-5, Gemini 2.5 Pro, Gemini 2.5 Flash, Qwen3-235B-Instruct, GPT-5-mini, Kimi-K2-Instruct, and GPT-OSS-120B [1][3].
- The competition was designed to evaluate the models' social reasoning, deception capabilities, and resistance to manipulation [4][15].

Group 2: Game Mechanics
- Each game had two werewolves and four villagers, with a witch and a seer as additional special roles, creating a complex social dynamic [6][18].
- Play alternated between night and day phases: werewolves attacked at night, and during the day players discussed and voted to eliminate one player [6][18].

Group 3: Model Performance
- GPT-5 demonstrated exceptional strategic depth, often taking a leadership role and steering the game's narrative [8][25].
- It structured discussions by demanding evidence-based arguments from other players, which effectively dismantled opponents' positions [26][28].
- Gemini 2.5 Pro, by contrast, took a more pragmatic approach but struggled with overconfidence, which led to critical mistakes [34][36].

Group 4: Resistance and Manipulation Metrics
- When misleading villagers, GPT-5 succeeded roughly 93% of the time in getting them to eliminate innocent players during the first two days [81].
- The model also excelled at protecting key roles, never allowing special characters such as the seer or witch to be eliminated [83].
- The models varied in their ability to resist manipulation and maintain their roles; GPT-OSS-120B was the weakest in this regard [83][87].

Group 5: Future Implications
- The Werewolf Benchmark offers insight into AI social intelligence and decision-making, with future expansions planned to cover more models and more complex scenarios [87].
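
The match counts in the competition overview can be sanity-checked with a short sketch. It assumes a round-robin in which every pair of the seven models met 10 times, which is the reading consistent with the reported total of 210 matches, and it further assumes that each win rate is computed over all of a model's matches. The helper names `schedule_summary` and `implied_wins` are hypothetical and not part of the benchmark.

```python
from itertools import combinations

# Model names as listed in the article.
MODELS = [
    "GPT-5", "Gemini 2.5 Pro", "Gemini 2.5 Flash", "Qwen3-235B-Instruct",
    "GPT-5-mini", "Kimi-K2-Instruct", "GPT-OSS-120B",
]
MATCHES_PER_PAIR = 10  # assumption: every pairing played 10 matches


def schedule_summary(models, matches_per_pair):
    """Return (number of pairings, total matches, matches per model)."""
    pairs = list(combinations(models, 2))
    total_matches = len(pairs) * matches_per_pair
    matches_per_model = (len(models) - 1) * matches_per_pair
    return len(pairs), total_matches, matches_per_model


def implied_wins(win_rate_pct, matches_played):
    """Convert a reported win rate (percent) to the nearest whole win count."""
    return round(win_rate_pct / 100 * matches_played)


pairs, total, per_model = schedule_summary(MODELS, MATCHES_PER_PAIR)
print(pairs, total, per_model)           # pairings, total matches, matches per model
print(implied_wins(96.7, per_model))     # wins implied for GPT-5
print(implied_wins(63.3, per_model))     # wins implied for Gemini 2.5 Pro
```

Under these assumptions the schedule has 21 pairings, 210 matches in total, and 60 matches per model, so the reported win rates would correspond to about 58 wins for GPT-5 and 38 for Gemini 2.5 Pro.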