Core Viewpoint
- The article discusses the limitations of advanced language models such as GPT-5 in understanding basic visual concepts, and argues that vision-centric models are needed to improve visual comprehension and reasoning [2][26].

Group 1
- Tairan He points out that while language is a powerful tool, it cannot fully meet the needs of the vision and robotics fields [2].
- He calls for the development of vision-centric vision-language models (VLMs) and vision-language-action (VLA) models to address these shortcomings [3].
- The ambiguity in the definition of "fingers" (for example, whether the thumb counts as a finger) illustrates the challenges language models face in interpreting visual concepts accurately [4][6].

Group 2
- Even top models such as Gemini 2.5 Pro have failed to answer these basic questions correctly, indicating a lack of robust visual understanding [10][24].
- Tairan He references a paper from Saining Xie's team that proposes a rigorous evaluation method for assessing the visual capabilities of multimodal large language models (MLLMs) [28].
- The new benchmark, CV-Bench, evaluates models on object counting, spatial reasoning, and depth perception, establishing stricter assessment standards [31].

Group 3
- Research shows that while advanced VLMs can reach 100% accuracy in recognizing common objects, their accuracy drops to about 17% on counterfactual images (see the probing sketch after this summary) [33].
- The article emphasizes that VLMs rely on memorized knowledge rather than genuine visual analysis, which limits their effectiveness [34].
- Martin Ziqiao Ma argues that initializing VLA models from large language models is a tempting but misleading shortcut, because it does not address the fundamental perception problem [36].
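To make the counterfactual probing idea from Groups 2 and 3 concrete, here is a minimal, hypothetical Python sketch of such an evaluation. The `CountingItem` schema, the `ask_vlm` callable, the image file names, and the "prior-only" stand-in model are all illustrative assumptions, not CV-Bench's actual protocol or data.

```python
# Minimal sketch of a counterfactual counting probe, loosely in the spirit of the
# evaluation described above. All names, file paths, and the stand-in model are
# hypothetical; this is not CV-Bench's actual protocol.
import re
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Tuple


@dataclass
class CountingItem:
    image_path: str        # path to the probe image (placeholder)
    question: str          # counting question shown to the model
    answer: int            # ground-truth count for THIS image
    counterfactual: bool   # True if the image contradicts world priors


def extract_count(reply: str) -> Optional[int]:
    """Pull the first integer out of a free-form model reply."""
    match = re.search(r"\d+", reply)
    return int(match.group()) if match else None


def split_accuracy(items: Iterable[CountingItem],
                   ask_vlm: Callable[[str, str], str]) -> Tuple[float, float]:
    """Exact-match accuracy on (normal, counterfactual) images."""
    hits = {False: 0, True: 0}
    totals = {False: 0, True: 0}
    for item in items:
        pred = extract_count(ask_vlm(item.image_path, item.question))
        totals[item.counterfactual] += 1
        hits[item.counterfactual] += int(pred == item.answer)
    return (hits[False] / max(totals[False], 1),
            hits[True] / max(totals[True], 1))


def prior_only_model(image_path: str, question: str) -> str:
    """Stand-in model that ignores the image and recites world knowledge,
    mimicking the memorization failure mode the article describes."""
    return "A hand has 5 fingers."


if __name__ == "__main__":
    probes = [
        CountingItem("hand_normal.png",
                     "How many fingers are on this hand?", 5, False),
        CountingItem("hand_six_fingers.png",
                     "How many fingers are on this hand?", 6, True),
    ]
    normal_acc, cf_acc = split_accuracy(probes, prior_only_model)
    print(f"normal: {normal_acc:.0%}  counterfactual: {cf_acc:.0%}")
    # -> normal: 100%  counterfactual: 0%
```

Swapping `prior_only_model` for a real VLM client would reproduce the qualitative pattern the article reports: near-perfect accuracy on ordinary images and a sharp drop on counterfactual ones whenever the model answers from memorized priors rather than from the pixels.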
"How many fingers does one hand have?" Did your GPT-5 get it right?
机器之心·2025-08-11 10:40