Core Insights
- The article highlights the rapid growth of large-model assistants in China, now exceeding 100 million daily users, a 900% increase since April last year [3]
- A comprehensive evaluation of 14 large models was conducted, focusing on performance in everyday work-related tasks rather than programming or deep research [3][5]
- The evaluation used blind assessments of the models' responses to various prompts, revealing differences in their capabilities and user experiences [5][8]

Model Performance Summary
- The evaluation covered models from companies including OpenAI, Anthropic, Google, and several Chinese firms, with most models priced around $20 per month [4]
- ChatGPT received the highest scores in the blind assessments, followed by StepFun and SenseNova, while MiniMax Agent scored the lowest due to its simplistic approach [8][13]
- The models were tested on their ability to handle complex tasks, such as role-playing and brainstorming, with varying degrees of success [6][7]

User Interaction and Feedback
- Users reported that although the models' capabilities have improved, the practical experience did not always align with the benchmark scores advertised by the companies [3][5]
- The models were assessed on their ability to provide coherent and contextually relevant responses, with some models struggling with longer contexts or complex queries [8][23]

Long Text Processing and Document Handling
- None of the models achieved perfect results when processing long documents, indicating ongoing challenges in this area [23][25]
- Gemini and Yuanbao performed relatively well at extracting participant information from a lengthy conference manual, but issues such as hallucinations and incomplete data were noted [25][26]

Search and Information Retrieval
- The article discusses the models' capabilities in replacing traditional search engines: some models successfully retrieved specific articles and documents, while others struggled [53][60]
- ChatGPT and Kimi excelled at finding relevant content, while models like DeepSeek and Qwen failed to provide accurate links or information [69]

Conclusion
- The evaluation indicates that while large models have made significant strides in user engagement and task performance, notable gaps remain in their practical application and reliability [3][5][23]
Looking up information, persuading the boss, writing weekly reports: a large-model evaluation for office workers