Multimodal Large Language Models (MLLM)
A 10,000-Word Deep Dive into the Latest Progress of Multimodal Large Models (Modality Bridging)
自动驾驶之心· 2025-11-15 03:03
Core Insights
- The article discusses the emergence of Multimodal Large Language Models (MLLMs) as a significant research focus, highlighting their capabilities in multimodal tasks such as story generation from images and mathematical reasoning without OCR, and indicating a potential pathway toward general artificial intelligence [2][4].

Group 1: MLLM Architecture and Training
- MLLMs typically undergo large-scale pre-training on paired data to align different modalities, using datasets such as image-text pairs or automatic speech recognition (ASR) corpora [2].
- The Perceiver Resampler module maps variable-sized spatiotemporal visual features from a vision encoder to a fixed number of visual tokens, reducing the computational complexity of visual-text cross-attention (see the sketch after this list) [6][8].
- Training follows a two-phase strategy: the first phase learns visual-language representations from a frozen image encoder, while the second phase learns visual-to-language generation from a frozen LLM [22][24].

Group 2: Instruction Tuning and Data Efficiency
- Instruction tuning is crucial for improving the model's ability to follow user instructions, with learned queries introduced to interact with both visual and textual features [19][26].
- The article emphasizes the importance of diverse, high-quality instruction data for improving performance across tasks such as visual question answering (VQA) and OCR [44][46].
- Data-efficiency experiments indicate that shrinking the training set can still maintain high performance, suggesting room for further improvements in data utilization [47].

Group 3: Model Improvements and Limitations
- LLaVA-NeXT improves reasoning, OCR, and world knowledge, surpassing previous models on several benchmarks [40].
- Despite these advances, limitations remain, such as the inability to handle multiple images effectively and the risk of generating hallucinations in critical applications [39][46].
- The article discusses the need for efficient sampling methods and a balance between annotation quality and model processing capacity to mitigate hallucinations [48].
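To make the Perceiver Resampler bullet above concrete, here is a minimal PyTorch sketch of the mechanism: a fixed set of learned latent queries cross-attends to however many visual features the encoder produces, so the language model always receives the same number of visual tokens. Class and parameter names are illustrative, and the real Flamingo-style module additionally concatenates the latents into the keys/values and interleaves feed-forward blocks; treat this as a sketch of the idea, not the paper's implementation.

```python
import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    """Simplified resampler: learned latent queries cross-attend to a
    variable-length sequence of visual features and return a constant
    number of visual tokens."""

    def __init__(self, dim: int = 1024, num_latents: int = 64,
                 num_heads: int = 8, depth: int = 2):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.attn_layers = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(depth)
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (batch, n_patches, dim); n_patches varies per input
        b = visual_feats.shape[0]
        x = self.latents.unsqueeze(0).expand(b, -1, -1)
        for attn in self.attn_layers:
            # The latents query the (variable-length) visual features
            out, _ = attn(query=x, key=visual_feats, value=visual_feats)
            x = x + out  # residual update of the latent tokens
        return self.norm(x)  # (batch, num_latents, dim), fixed size

# 257 patch tokens in, 64 visual tokens out regardless of input length
feats = torch.randn(2, 257, 1024)
print(PerceiverResampler()(feats).shape)  # torch.Size([2, 64, 1024])
```

Because the output length is pinned to num_latents, the cost of downstream visual-text cross-attention no longer grows with image resolution or video length, which is exactly the complexity reduction the summary describes.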
"How many fingers does one hand have?" Did your GPT-5 get it right?
机器之心· 2025-08-11 10:40
Core Viewpoint
- The article discusses the limitations of advanced language models like GPT-5 in understanding basic visual concepts, highlighting the need for vision-centric models to improve visual comprehension and reasoning [2][26].

Group 1
- Tairan He points out that while language is a powerful tool, it struggles to fully meet the needs of the vision and robotics fields [2].
- There is a call for developing vision-centric vision-language models (VLMs) and vision-language-action (VLA) models to address these shortcomings [3].
- The ambiguity in the definition of "fingers" illustrates the challenges language models face in interpreting visual information accurately [4][6].

Group 2
- Even top models like Gemini 2.5 Pro have failed to answer basic questions correctly, indicating a lack of robust visual understanding [10][24].
- Tairan He references a paper from Saining Xie's team that proposes a rigorous method for assessing the visual capabilities of multimodal large language models (MLLMs) [28].
- The new benchmark, CV-Bench, evaluates models' abilities in object counting, spatial reasoning, and depth perception, establishing stricter assessment standards (a sketch of such a scoring harness follows this list) [31].

Group 3
- Research shows that while advanced VLMs can achieve 100% accuracy in recognizing common objects, their performance drops to about 17% on counterfactual images [33].
- The article emphasizes that VLMs rely on memorized knowledge rather than true visual analysis, which limits their effectiveness [34].
- Martin Ziqiao Ma argues that initializing VLA models from large language models is a tempting but misleading approach, as it does not address fundamental perception issues [36].
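As a rough illustration of what a CV-Bench-style evaluation involves, the sketch below scores a multiple-choice visual benchmark by task category. The Sample schema and the model_answer callable are hypothetical stand-ins invented for this example; the actual benchmark's data format and protocol may differ.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    image_path: str
    question: str        # e.g. "How many mugs are on the table?"
    choices: list[str]   # e.g. ["2", "3", "4", "5"]
    answer: str          # ground-truth choice
    task: str            # "counting" | "spatial" | "depth"

def accuracy_by_task(samples: list[Sample], model_answer) -> dict[str, float]:
    """model_answer(image_path, question, choices) -> chosen choice string.
    Returns exact-match accuracy per task category."""
    hits: dict[str, int] = {}
    totals: dict[str, int] = {}
    for s in samples:
        pred = model_answer(s.image_path, s.question, s.choices)
        totals[s.task] = totals.get(s.task, 0) + 1
        hits[s.task] = hits.get(s.task, 0) + int(pred == s.answer)
    return {task: hits[task] / totals[task] for task in totals}
```

Exact-match scoring over fixed choices is part of what makes such benchmarks "stricter": a model gets no partial credit for a fluent but visually wrong answer.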
From Figma to the Global Rise of China's Vertical Applications
Ge Long Hui· 2025-08-01 05:44
Group 1
- Figma is positioned as a major player in the design productivity revolution, targeting a $33 billion full-process product development ecosystem after starting from a $2.2 billion front-end design software market [2]
- The core Figma platform leverages lightweight design, community proliferation, and collaborative work as its key advantages in the global design tools market [2]
- Figma is integrating AI programming capabilities to enhance collaboration in coding development, with future potential for "no-code development" [4]

Group 2
- The global AI application landscape is on the verge of significant breakthroughs, particularly with the evolution of multimodal large language models (MLLMs) that can process data types beyond text [5]
- The monetization potential of multimodal applications is proving superior to that of pure text products, with companies like OpenAI and Anthropic achieving substantial annual recurring revenue (ARR) through general-purpose models [7]
- In the multimodal application space, Chinese companies have made significant strides in video generation, with firms like Meitu, Kuaishou, and Ruqi Software each exceeding $100 million in annual revenue [8]

Group 3
- The investment logic highlights premium opportunities for technology going overseas, as international users show a higher willingness to pay for AI services than domestic users [9]
- Kuaishou has seen an 8.7% overseas subscription conversion rate, three times its domestic rate, indicating a strong growth engine in international markets [9]
- Figma's comprehensive coverage of the design and development process creates a competitive ecosystem, while Chinese firms need to establish dual barriers of "AI + industry know-how" in vertical fields [10]

Group 4
- The rise of Figma and the explosion of multimodal models signal a paradigm shift in productivity tools, requiring AI entrepreneurs to pursue both foundational architecture innovation and deep dissection of vertical scenarios [11]
- The Chinese government's support for AI application development through initiatives like the "Digital China Construction 2025 Action Plan" provides a favorable environment for industry growth [10]
The Rise of Multimodal Large Models: Huatai Securities Predicts the Application Singularity Is Near
Sou Hu Cai Jing· 2025-07-13 23:44
Core Insights
- The report by Huatai Securities highlights the rapid development of multimodal large models (MLLMs) and their applications, indicating that the field is approaching a critical turning point [1][4][15]

Development Dynamics
- MLLMs are seen as an inevitable step in the evolution of large language models (LLMs), integrating capabilities from multiple modalities to expand application scenarios [1][6]
- MLLMs can be categorized into modular and native architectures, with the latter showing significant advantages in performance and efficiency, albeit with higher computational and engineering requirements (a toy sketch of the modular wiring follows this list) [1][6]

Commercialization Trends
- Globally, multimodal applications are progressing faster overseas than in China, first-tier companies are advancing faster than second-tier ones, and multimodal products are commercializing faster than text-based products [1][7]
- Overseas chatbot products, such as those from OpenAI and Anthropic, have achieved annual recurring revenue (ARR) exceeding $1 billion, while domestic chatbot commercialization remains at an early stage [1][7]

Video Generation Sector
- Chinese companies excel in the video generation field, with products like ByteDance's Seedance 1.0 and Kuaishou's Kling achieving significant market presence [2][8]
- Kuaishou's Kling reached an ARR of over $100 million within roughly 10 months of launch, a significant milestone for the domestic video generation sector [2][8]

Future Outlook
- The report anticipates that the singularity of multimodal large models and applications is approaching, driven by technological advances and accelerating commercialization [5][15]
- Integrated multimodal data processing will greatly expand AI's application scenarios, facilitating large-scale deployment across fields [4][15]

Investment Opportunities
- The report suggests potential investment opportunities in both compute and applications, highlighting the demand for computational resources from native multimodal models and growing AI needs in advertising, retail, and creative industries [9]
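To illustrate the modular-versus-native distinction mentioned above, here is a toy PyTorch sketch of the modular wiring: frozen unimodal components joined by a small trainable projector. Every class and dimension here is a stand-in invented for illustration, not any released model; a native architecture would instead train a single model end-to-end on interleaved multimodal tokens, which is where the higher compute and engineering cost comes from.

```python
import torch
import torch.nn as nn

class ModularMLLM(nn.Module):
    """Modular architecture in miniature: a frozen vision encoder and a
    frozen LLM bridged by a small trainable projector."""

    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vis_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.vision_encoder = vision_encoder.requires_grad_(False)
        self.llm = llm.requires_grad_(False)
        self.projector = nn.Linear(vis_dim, llm_dim)  # the only trained part

    def forward(self, pixels: torch.Tensor, text_embeds: torch.Tensor):
        vis_tokens = self.projector(self.vision_encoder(pixels))
        # Prepend projected visual tokens to the text embedding sequence
        return self.llm(torch.cat([vis_tokens, text_embeds], dim=1))

# Toy stand-ins just to exercise the wiring; real systems would use a
# pretrained ViT-style encoder and a decoder-only LLM.
class ToyEncoder(nn.Module):
    def forward(self, pixels):  # (b, 3, H, W) -> (b, 16, 1024) patch features
        return torch.randn(pixels.shape[0], 16, 1024)

class ToyLLM(nn.Module):
    def forward(self, embeds):  # echoes the embedding sequence
        return embeds

model = ModularMLLM(ToyEncoder(), ToyLLM())
out = model(torch.randn(2, 3, 224, 224), torch.randn(2, 8, 4096))
print(out.shape)  # torch.Size([2, 24, 4096])
```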