Multimodal Large Language Models (MLLMs)

Collective failing grades on real scientific research: the new SFE benchmark deals a heavy blow to mainstream multimodal LLMs
机器之心· 2025-07-09 09:52
Core Insights
- The article discusses advances in Artificial Intelligence for Science (AI4S) and the introduction of the Scientists' First Exam (SFE), a benchmark for evaluating the capabilities of multimodal large language models (MLLMs) in scientific domains [1][3][12].

Group 1: AI4S and SFE Overview
- AI4S has made significant progress in transforming scientific research through innovative tools, but becoming a truly revolutionary tool will require a comprehensive approach that integrates specialized domain knowledge [1].
- SFE aims to systematically assess the cognitive abilities of MLLMs across scientific disciplines, addressing the limitations of existing evaluations that focus primarily on knowledge recall [2][3].

Group 2: SFE Evaluation Framework
- SFE introduces a three-tier evaluation framework: Signal Perception (L1), Attribute Understanding (L2), and Comparative Reasoning (L3), covering five scientific fields with 66 high-value tasks (a schematic evaluation-loop sketch follows this summary) [4][10][12].
- The evaluation reveals that mainstream models perform well on traditional benchmarks but struggle significantly with high-level scientific tasks, with state-of-the-art models scoring only around 30 [4][18].

Group 3: Performance Insights
- Closed-source MLLMs outperform open-source models by 6-8% on average, with notable differences on specific tasks [20].
- Materials science is the strongest area for model performance, while astronomy is more challenging because of the complexity of its data [22][23].

Group 4: Model Development and Trends
- Recent models show significant improvements on high-level reasoning tasks, while progress on understanding tasks remains limited, indicating a shift toward enhanced reasoning capabilities [25][26].
- Scaling model size does not always correlate with improved scientific capability, suggesting that scientific data should be expanded in balance with model size [31][32].

Group 5: Future Directions and Ecosystem
- The SciPrismaX platform aims to build a rigorous and dynamic evaluation ecosystem for AI in science, incorporating multiple assessment dimensions and community collaboration [33][36].
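The article describes SFE's tiered framework but not its implementation. Below is a minimal sketch of how a three-tier evaluation loop of this kind could be organized; all names (SFETask, evaluate, model_answer, grade) and the scoring scheme are hypothetical and are not part of the published benchmark.

```python
# Minimal sketch of a three-tier benchmark harness in the spirit of SFE
# (Signal Perception L1 / Attribute Understanding L2 / Comparative Reasoning L3).
# Hypothetical structure for illustration; not the actual SFE pipeline.
from dataclasses import dataclass
from collections import defaultdict
from typing import Callable

@dataclass
class SFETask:
    field: str        # e.g. "materials", "astronomy"
    level: str        # "L1", "L2", or "L3"
    question: str
    image_path: str   # scientific signal rendered as an image
    reference: str    # gold answer

def evaluate(tasks: list[SFETask],
             model_answer: Callable[[str, str], str],
             grade: Callable[[str, str], float]) -> dict[str, float]:
    """Average score per cognitive level. `model_answer` wraps an MLLM call;
    `grade` compares a prediction against the reference on a 0.0-1.0 scale."""
    scores = defaultdict(list)
    for task in tasks:
        prediction = model_answer(task.question, task.image_path)
        scores[task.level].append(grade(prediction, task.reference))
    return {level: sum(vals) / len(vals) for level, vals in scores.items()}
```

Reporting scores per level rather than as a single average is what lets a benchmark like this show that models handle perception (L1) far better than comparative reasoning (L3).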
GPT-Kline: MCoT and Technical Analysis
HTSC· 2025-05-31 10:25
Investment Rating
- The report does not explicitly state an investment rating for the industry or the specific technology discussed.

Core Insights
- The research explores the application of Multimodal Chain of Thought (MCoT) in investment research, particularly in technical analysis of K-line (candlestick) charts, leading to the development of an automated analysis platform called GPT-Kline [1][4][13].
- MCoT enhances the reasoning capabilities of large models by combining multimodal understanding with logical reasoning, enabling more sophisticated analysis of complex tasks [2][21].
- OpenAI's O3 model demonstrates impressive image reasoning capabilities, marking a significant step toward artificial general intelligence (AGI) [2][37].

Summary by Sections

Multimodal Reasoning
- Multimodal collaboration is essential for large models to progress toward AGI, requiring proficiency in modalities beyond language alone [17].
- MCoT represents a significant advance, enabling models to reason over images rather than merely perceive them [21][31].

Application in Investment Research
- The report highlights the potential of MCoT in technical analysis, particularly with K-line charts, which encode vital trading information and patterns suitable for analysis [3][42].
- Applying the O3 model to technical analysis shows its ability to process K-line images, perform the necessary pre-processing, and generate analytical reports [3][43].

Development of GPT-Kline
- GPT-Kline combines MCoT with the capabilities of large models to create a specialized tool for K-line technical analysis, automating the entire workflow from chart annotation to report generation (an illustrative pipeline sketch follows this summary) [4][65].
- The platform provides a web interface designed for intuitive interaction, allowing users to engage with the analysis process effectively [4][83].

Model Comparison and Performance
- The report compares large models, including OpenAI's GPT-4o and the Gemini-2.5 series, on K-line analysis, identifying Gemini-2.5 Flash as a strong performer [66][96].
- The results indicate that OpenAI's models tend to be conservative in their outputs, while the Gemini models provide more comprehensive and accurate annotations [95][96].
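The report does not publish GPT-Kline's code. The sketch below only illustrates the general pattern of sending a K-line chart image to a multimodal chat model with a step-by-step analysis prompt via the OpenAI Python SDK; the prompt wording and the analyze_kline helper are assumptions, not the platform's actual workflow.

```python
# Illustrative MCoT-style call over a K-line chart image using the OpenAI SDK.
# The prompt and helper are hypothetical; this is not GPT-Kline's implementation.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def analyze_kline(image_path: str) -> str:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    prompt = (
        "Study this K-line (candlestick) chart step by step: "
        "1) identify the trend and key support/resistance levels, "
        "2) name any candlestick or chart patterns you observe, "
        "3) summarize the technical outlook in a short report."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # any multimodal chat model could be substituted
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# report = analyze_kline("kline_chart.png")
```

A full platform such as the one described would additionally draw annotations on the chart and assemble the model output into a structured report, but the multimodal prompt-and-respond step above is the core interaction.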
A new document-understanding benchmark that sharply reduces GPT-4o's accuracy reveals the shortcomings of large models
机器之心· 2025-05-24 04:07
Core Viewpoint
- The article discusses WildDoc, a benchmark for real-world document understanding that exposes the limitations of existing multimodal large language models (MLLMs) in complex document scenarios [1][3][19].

Group 1: Limitations of Existing Models
- Current MLLMs show significant performance drops on WildDoc compared with traditional benchmarks such as DocVQA; GPT-4o's average accuracy falls by 35.3% [12][13].
- Existing benchmarks fail to simulate the complexities of real-world environments, casting doubt on how well these models perform in practical applications [5][11].

Group 2: WildDoc Dataset
- WildDoc consists of over 12,000 manually captured document images that simulate challenges such as lighting, distortion, and varying camera angles, which are critical for assessing model robustness [3][7].
- The dataset introduces a consistency score metric to evaluate model stability across capture conditions, revealing performance bottlenecks in current MLLMs (an illustrative definition follows this summary) [3][19].

Group 3: Experimental Findings
- Physical distortions (wrinkles, bends) are the most challenging factors, with GPT-4o's accuracy dropping by 34.1-34.7% under such conditions [13][16].
- Non-frontal angles and degraded image quality also significantly hurt performance, and larger models do not necessarily overcome the challenges posed by real-world scenarios [13][16].

Group 4: Future Directions
- The research team suggests several strategies for improving MLLMs: data augmentation that simulates real-world conditions, robust feature learning to enhance adaptability, and incorporating more real-world document images into training data [19].
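The article mentions a consistency score but does not give its formula. One plausible reading, sketched below, is the fraction of questions a model answers correctly under every capture condition; this definition is an assumption for illustration, not WildDoc's published metric.

```python
# Illustrative consistency-style metric: the share of questions answered
# correctly under *all* capture conditions (lighting, angle, distortion).
# Assumed definition; the exact WildDoc formula is not given in the article.
def consistency_score(results: dict[str, dict[str, bool]]) -> float:
    """results[question_id][condition] -> whether the answer was correct."""
    if not results:
        return 0.0
    stable = sum(all(per_condition.values()) for per_condition in results.values())
    return stable / len(results)

# Example: q1 is correct under all three conditions, q2 fails when the
# document is wrinkled, so the score is 0.5.
example = {
    "q1": {"frontal": True, "dim_light": True, "wrinkled": True},
    "q2": {"frontal": True, "dim_light": True, "wrinkled": False},
}
print(consistency_score(example))  # 0.5
```

A metric of this shape penalizes models that only answer correctly on clean frontal scans, which is the failure mode the benchmark is designed to surface.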