LLM Council (大模型议会)
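Karpathy's head-to-head LLM comparison method is great fun! Four major AIs compete and score each other anonymously, and the strongest one is a surprise
QbitAI (量子位) · 2025-11-23 04:09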
Core Insights
- The article covers "LLM Council", a new web app launched by Karpathy in which multiple large language models (LLMs) collaborate to answer a user's query [1][2][3]

Group 1: Application Overview
- The LLM Council app presents a chat interface similar to ChatGPT, but has multiple models discuss and answer each question collectively [2]
- The process has three main steps: simultaneous responses from multiple models, anonymous peer evaluation among the models, and a final answer compiled by a designated chair model [7][12][13]

Group 2: Model Evaluation Process
- In the first step, the models each respond to the question, and their answers are displayed for the user to review [7]
- In the second step, each model anonymously assesses the quality of the other models' responses, judging them on accuracy and insight [8][10]
- In the third step, a chair model consolidates the evaluations and responses into a single answer for the user [12][13]

Group 3: Insights on Model Performance
- Karpathy noted that the models generally agreed on the performance rankings: GPT-5.1 was rated the best and Claude the weakest, with Gemini 3 and Grok-4 in between [21]
- Karpathy's own assessment differed from those rankings: he found GPT-5.1 rich in content but lacking in structure, and Gemini 3 more concise [23]
- The models showed minimal bias and were willing to acknowledge when another model had given a better answer, which suggests that multi-model integration could be a significant area to explore in future LLM products [24]
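
The three-step process described above (parallel responses, anonymous peer evaluation, chair synthesis) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the app's actual code: the `Member` class, the `run_council` function, the "Response A/B/C" labels, and the average-position scoring are all hypothetical, and the "models" are plain stub functions rather than real LLM calls.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Member:
    """Hypothetical council member; both callables stand in for LLM calls."""
    name: str
    answer: Callable[[str], str]                      # question -> response
    rank: Callable[[str, Dict[str, str]], List[str]]  # labels, best first


# Hypothetical chair signature: (question, labelled responses, avg ranks) -> final answer
Chair = Callable[[str, Dict[str, str], Dict[str, float]], str]


def run_council(question: str, members: List[Member], chair: Chair,
                seed: int = 0) -> Tuple[Dict[str, str], Dict[str, float], str]:
    if len(members) < 2:
        raise ValueError("peer review needs at least two members")

    # Step 1: every member answers the question.
    responses = {m.name: m.answer(question) for m in members}

    # Step 2: hide identities behind shuffled labels, then let each member
    # rank the responses written by the others.
    names = list(responses)
    random.Random(seed).shuffle(names)
    label_of = {n: f"Response {chr(ord('A') + i)}" for i, n in enumerate(names)}
    anonymous = {label_of[n]: text for n, text in responses.items()}

    positions: Dict[str, List[int]] = {label: [] for label in anonymous}
    for m in members:
        others = {l: t for l, t in anonymous.items() if l != label_of[m.name]}
        for pos, label in enumerate(m.rank(question, others), start=1):
            positions[label].append(pos)
    # Assumed scoring rule: mean rank position, lower is better.
    avg_by_label = {l: sum(p) / len(p) for l, p in positions.items() if p}

    # Step 3: the chair sees the anonymous responses and the aggregate ranks.
    final = chair(question, anonymous, avg_by_label)

    avg_by_name = {n: avg_by_label[label_of[n]]
                   for n in responses if label_of[n] in avg_by_label}
    return responses, avg_by_name, final


# Stub behaviour for demonstration only: longer answers rank higher, and
# the chair simply returns the top-ranked response instead of synthesising.
def by_length(question: str, labelled: Dict[str, str]) -> List[str]:
    return sorted(labelled, key=lambda l: len(labelled[l]), reverse=True)


def pick_top(question: str, labelled: Dict[str, str],
             avg_rank: Dict[str, float]) -> str:
    return labelled[min(avg_rank, key=avg_rank.get)]


members = [
    Member("model_a", lambda q: "short", by_length),
    Member("model_b", lambda q: "a medium answer", by_length),
    Member("model_c", lambda q: "the longest answer of all three", by_length),
]
responses, avg_rank, final = run_council("What is an LLM council?", members, pick_top)
```

In a real deployment the `answer`, `rank`, and chair callables would wrap API calls to the participating models, and the chair would write a new synthesized answer rather than select an existing one. Shuffling the labels before review is what keeps a reviewer from inferring authorship from response order.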