Multimodal Large Models
Nanjing University, LibLib.ai, and the CAS Institute of Automation jointly propose PosterCopilot, a "poster design large model" for layout reasoning and precise editing
机器之心· 2025-12-10 08:13
Core Viewpoint
- The article discusses the development of PosterCopilot, a professional-level poster design and editing model that addresses significant challenges in graphic design automation, particularly in layout reasoning and controllable editing [2][6][40].

Industry Pain Points
- Graphic design faces substantial challenges in achieving true automation, with existing models like Stable Diffusion struggling with layered structures, leading to material distortion and lack of fine control [6].
- Current multimodal models exhibit four critical shortcomings: severe element overlap, lack of visual feedback, regression to a single ground truth, and inability to perform layer-specific edits [8][10].

Core Achievements
- PosterCopilot aims to bridge the gap between single-step generation and professional workflows through a systematic solution that incorporates a three-stage training strategy [13][14].
- The innovative three-stage training includes:
  1. Perturbation Supervised Fine-Tuning (PSFT) to address geometric distortions [15].
  2. Visual-Reality Alignment Reinforcement Learning (RL-VRA) to correct overlaps and proportional issues [15].
  3. Aesthetic Feedback Reinforcement Learning (RLAF) to encourage exploration beyond ground truth layouts [15].

Generative Agent
- PosterCopilot functions as a comprehensive design assistant, facilitating seamless transitions from abstract design concepts to concrete materials through a reception model and T2I model [16][17].
- The model supports various professional scenarios, including full poster generation from provided assets, intelligent completion of missing materials, global theme transitions, intelligent size reconstruction, and multi-round fine-grained editing [21][23][28][29][31].

Experimental Results
- PosterCopilot outperforms existing commercial competitors and state-of-the-art models across multiple metrics, achieving an average win rate exceeding 74% in human evaluations [34][35].
- In assessments of layout rationality, text legibility, and element preservation, PosterCopilot demonstrates superior performance compared to models like Microsoft Designer and CreatiPoster [35][37].

Conclusion and Outlook
- By decoupling layout reasoning from generative editing and incorporating reinforcement learning to align with human aesthetics, PosterCopilot sets a new benchmark for intelligent design tools and offers a new paradigm for AI-assisted creative workflows [40].
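To make the layout-reasoning objective above more concrete, here is a minimal, hypothetical sketch of the kind of geometric reward a stage such as RL-VRA might compute over a predicted layout (bounding boxes for poster elements). The box format, function names, and weights are illustrative assumptions, not PosterCopilot's actual implementation; a real reward would also include aesthetic and text-legibility terms.

```python
from itertools import combinations

def overlap_area(a, b):
    """Intersection area of two axis-aligned boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    return ix * iy

def layout_reward(boxes, canvas=(1.0, 1.0), w_overlap=1.0, w_bounds=0.5):
    """Toy layout reward: penalize element overlap and out-of-canvas area.

    `boxes` is a list of (x, y, w, h) tuples in normalized coordinates.
    """
    cw, ch = canvas
    overlap = sum(overlap_area(a, b) for a, b in combinations(boxes, 2))
    out_of_bounds = 0.0
    for (x, y, w, h) in boxes:
        out_of_bounds += w * h - overlap_area((x, y, w, h), (0.0, 0.0, cw, ch))
    return -(w_overlap * overlap + w_bounds * out_of_bounds)

# Two poster elements that partially overlap -> negative reward
print(layout_reward([(0.1, 0.1, 0.4, 0.3), (0.3, 0.2, 0.4, 0.3)]))  # about -0.04
```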
Zhipu launches and open-sources the GLM-4.6V series of multimodal large models, building native multimodal tool-calling capability
Zheng Quan Ri Bao Wang· 2025-12-09 10:46
(Reporter: Liang Aonan) On December 8, Beijing Zhipu Huazhang Technology Co., Ltd. ("Zhipu") officially launched and open-sourced the GLM-4.6V series of multimodal large models, including the base GLM-4.6V (106B-A12B) for cloud and high-performance cluster scenarios and the lightweight GLM-4.6V-Flash (9B) for local deployment and low-latency applications.

Zhipu stated: "Zhipu's multimodal open-source week has begun, and we will keep open-sourcing more frontier models. Embrace the new paradigm of multimodal interaction, starting with GLM-4.6V."

Traditional tool calling is mostly text-only; when facing multimodal content such as images, video, and complex documents, it requires multiple intermediate conversions, which introduce information loss and engineering complexity.

GLM-4.6V was designed from the start around "images as parameters, results as context," building native multimodal tool-calling capability: images, screenshots, document pages, and similar inputs can be passed directly as tool parameters, without first being converted to text descriptions and parsed back, reducing losses along the pipeline. For tool results such as statistical charts, rendered webpage screenshots, or retrieved product images, the model can apply visual understanding again and fold them into subsequent reasoning.

The model natively supports tool calling driven by visual input, closing the loop from perception to understanding to execution. This allows GLM-4.6V to handle more complex visual tasks such as mixed text-and-image output, product recognition with deal recommendations, and assistant-style Agent scenarios.

According to the company, GLM-4.6 ...
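As an illustration of the "images as parameters, results as context" idea described above, the hypothetical message trace below shows an image being passed straight into a tool call and the tool's visual output being re-attached to the conversation for further reasoning. The field names and the search tool are invented for illustration only and are not GLM-4.6V's actual API.

```python
# Hypothetical message trace for a native multimodal tool call.
# Field names and the tool itself are illustrative, not the real GLM-4.6V interface.
conversation = [
    {"role": "user",
     "content": [
         {"type": "text", "text": "Find the cheapest listing for this product."},
         {"type": "image_url", "image_url": "https://example.com/shelf_photo.jpg"},
     ]},
    # The model calls a tool with the image itself as a parameter --
    # no intermediate caption or OCR pass is required.
    {"role": "assistant",
     "tool_calls": [
         {"name": "search_product_by_image",
          "arguments": {"image": "https://example.com/shelf_photo.jpg",
                        "max_results": 5}},
     ]},
    # The tool returns product images; they go back into context as images,
    # so the model can visually compare them in the next reasoning step.
    {"role": "tool",
     "name": "search_product_by_image",
     "content": [
         {"type": "image_url", "image_url": "https://shop.example.com/item_1.jpg"},
         {"type": "text", "text": "item_1: 129 CNY (mock data)"},
     ]},
]
```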
Are full images and slices not equivalent? LLaVA-UHD-v3 reveals the difference and introduces an efficient full-image modeling approach
机器之心· 2025-12-09 03:17
Core Insights
- The article discusses the advancements in multimodal large models (MLLMs) and the introduction of LLaVA-UHD v3, which addresses the challenge of efficiently processing high-resolution images while maintaining global understanding capabilities [2][3][10].

Group 1: Introduction of LLaVA-UHD v3
- LLaVA-UHD v3 introduces a new progressive visual compression framework (PVC) that consists of two core components: Refined Patch Embedding (RPE) and Windowed Token Compression (WTC) [4][10].
- The PVC framework significantly reduces the number of visual tokens while preserving global semantic consistency, enhancing the efficiency of native high-resolution visual encoding [4][10].

Group 2: Comparison of Encoding Methods
- The research team conducted a fair comparison between slice-based encoding (SBE) and global native-resolution encoding (GNE) using the same model architecture, training data, and evaluation protocols [5].
- GNE demonstrated a notable advantage in spatial perception and localization tasks, with an average improvement of approximately 11.0% over SBE [6].
- In general visual-language understanding tasks, GNE outperformed SBE by about 2.1%, indicating that GNE is more suitable for tasks requiring spatial awareness and high-resolution understanding [7].

Group 3: Efficiency and Performance of LLaVA-UHD v3
- The PVC architecture allows for a significant reduction in computational load while maintaining model capabilities, achieving a 2.4× speedup over MoonViT and running 1.9× faster than Qwen2.5-ViT [16].
- LLaVA-UHD v3 was trained on approximately 20 million image-text pairs, which is significantly fewer than competitors like Qwen2-VL (700 million) and MiniCPM-V2.6 (460 million), yet it remains highly competitive across various visual-language benchmarks [17].
- The model achieved a visual token compression rate of 64×, surpassing competitors, while still performing comparably or better in tasks requiring fine-grained visual information [17].

Group 4: Future Directions
- The article emphasizes the need for further exploration of visual encoding pre-training strategies suitable for multimodal tasks and the gradual introduction of linear complexity operators to replace traditional quadratic complexity attention mechanisms [20].
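To ground the idea of windowed token compression, here is a minimal sketch (assuming a square token grid and simple average pooling) that compresses a grid of visual tokens inside non-overlapping windows. It illustrates the token-reduction principle only; the actual WTC module in LLaVA-UHD v3 will differ.

```python
import torch

def windowed_token_compression(tokens: torch.Tensor, window: int = 2) -> torch.Tensor:
    """Compress an (H, W, C) grid of visual tokens by average-pooling each
    non-overlapping window x window block into a single token.

    A window of 2 gives 4x fewer tokens; a larger window (or stacked stages)
    is one way to reach high compression rates such as 64x.
    """
    H, W, C = tokens.shape
    assert H % window == 0 and W % window == 0, "grid must be divisible by window"
    tokens = tokens.reshape(H // window, window, W // window, window, C)
    return tokens.mean(dim=(1, 3))  # (H/window, W/window, C)

grid = torch.randn(32, 32, 1024)               # 1024 visual tokens
compressed = windowed_token_compression(grid, window=8)
print(grid.shape[0] * grid.shape[1], "->",
      compressed.shape[0] * compressed.shape[1])  # 1024 -> 16, a 64x reduction
```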
Zhipu launches and open-sources the GLM-4.6V series of multimodal large models
Bei Jing Shang Bao· 2025-12-08 12:34
Beijing Business Daily (Reporter: Wei Wei) On December 8, Zhipu officially launched and open-sourced the GLM-4.6V series of multimodal large models, including the base GLM-4.6V (106B-A12B) for cloud and high-performance cluster scenarios and the lightweight GLM-4.6V-Flash (9B) for local deployment and low-latency applications.

GLM-4.6V raises the training context window to 128k tokens, reaches SOTA visual-understanding accuracy at its parameter scale, and for the first time natively builds Function Call (tool calling) capability into the vision model architecture, connecting "visual perception" to "executable action" and providing a unified technical base for multimodal Agents in real business scenarios. The series is priced 50% lower than GLM-4.5V, with API pricing of 1 yuan per million input tokens and 3 yuan per million output tokens; GLM-4.6V-Flash is free for users. GLM-4.6V is also folded into the GLM Coding Plan, with dedicated MCP (Model Context Protocol) tools developed for 8 classes of user scenarios. ...
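As a quick sanity check on the quoted API pricing (1 yuan per million input tokens, 3 yuan per million output tokens), the small sketch below computes the cost of a single call; the request sizes are made-up examples, not figures from the article.

```python
INPUT_YUAN_PER_M = 1.0   # quoted GLM-4.6V input price, yuan per million tokens
OUTPUT_YUAN_PER_M = 3.0  # quoted output price, yuan per million tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in yuan for a single API call at the quoted GLM-4.6V rates."""
    return (input_tokens * INPUT_YUAN_PER_M
            + output_tokens * OUTPUT_YUAN_PER_M) / 1_000_000

# e.g. a document-understanding call that nearly fills the 128k context window
print(request_cost(input_tokens=120_000, output_tokens=2_000))  # 0.126 yuan
```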
The technology-obsessed "Whampoa Academy" of autonomous driving has posted these new technical updates......
自动驾驶之心· 2025-12-07 02:05
Core Insights
- The article emphasizes the importance of a comprehensive community for autonomous driving, aiming to provide a platform for knowledge sharing and networking among industry professionals and academic experts [8][25][29].

Community Development
- The "Autonomous Driving Heart Knowledge Planet" has been established to facilitate discussions on technology, trends, and changes in the autonomous driving sector, with over 4,000 members and a goal to reach nearly 10,000 in the next two years [8][11].
- The community offers a variety of resources, including videos, articles, learning paths, and job exchange opportunities, making it a valuable hub for both beginners and advanced learners [8][11][12].

Technical Resources
- The community has compiled over 40 technical routes covering various aspects of autonomous driving, such as end-to-end learning, multimodal models, and sensor fusion, which significantly reduces the time needed for research [11][25].
- Members can access detailed information on the latest advancements in autonomous driving technologies, including world models, VLA (Vision-Language-Action) models, and 3D object detection [25][49][51].

Job Opportunities
- The community provides job referral mechanisms with various autonomous driving companies, ensuring members can connect with potential employers quickly [17][25].
- Regular updates on job openings and industry trends are shared, helping members stay informed about career opportunities in the autonomous driving field [30][100].

Educational Content
- The community offers a structured learning path for newcomers, including foundational courses in mathematics, computer vision, and deep learning, tailored for those with no prior experience [19][25].
- Members can participate in live discussions and Q&A sessions with industry leaders, enhancing their understanding of current challenges and innovations in autonomous driving [12][92].
Right after Ilya's prediction, NEO, the world's first native multimodal architecture, arrives: vision and language are welded together for good
36Ke· 2025-12-05 07:06
Core Insights
- The AI industry is experiencing a paradigm shift as experts like Ilya Sutskever declare that the era of merely scaling models is over, emphasizing the need for smarter architectures rather than just larger models [1][26].
- A new native multimodal architecture called NEO has emerged from a Chinese research team, which aims to fundamentally disrupt the current modular approach to AI models [1][5].

Group 1: Current State of Multimodal Models
- Traditional multimodal models, such as GPT-4V and Claude 3.5, primarily rely on a modular approach that connects pre-trained visual encoders to language models, resulting in a lack of deep integration between visual and language processing [3][6].
- The existing modular models face three significant technical gaps: efficiency, capability, and fusion, which hinder their performance in complex tasks [6][7][8].

Group 2: NEO's Innovations
- NEO introduces a unified model that integrates visual and language processing from the ground up, eliminating the distinction between visual and language modules [8][24].
- The architecture features three core innovations: Native Patch Embedding, Native-RoPE for spatial encoding, and Native Multi-Head Attention, which enhance the model's ability to understand and process multimodal information [11][14][16].

Group 3: Performance Metrics
- NEO demonstrates remarkable data efficiency, achieving comparable or superior performance to leading models while using only 3.9 million image-text pairs for training, which is one-tenth of what other top models require [19][20].
- In various benchmark tests, NEO has outperformed other native vision-language models, showcasing its superior performance across multiple tasks [21][22].

Group 4: Implications for the Industry
- NEO's architecture not only improves performance but also lowers the barriers for deploying multimodal AI in edge devices, making advanced visual understanding capabilities accessible beyond cloud-based models [23][24].
- The open-sourcing of NEO's architecture signals a shift in the AI community towards more efficient and unified models, potentially setting a new standard for multimodal technology [24][25].
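To illustrate what "no separate vision module" means in practice, here is a minimal, hypothetical sketch of a unified model in the spirit of the architecture described above: image patches are embedded directly into the same token stream as text and processed by one shared attention stack. The layer sizes, class name, and the omission of NEO's Native-RoPE details are all simplifying assumptions, not NEO's actual implementation.

```python
import torch
import torch.nn as nn

class UnifiedMultimodalStack(nn.Module):
    """Toy 'native' multimodal model: text tokens and image patches share one
    embedding space and one attention stack (no separate vision encoder)."""

    def __init__(self, vocab_size=32000, d_model=512, patch=16, layers=4):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # Native patch embedding: pixels -> tokens in the same d_model space.
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, input_ids, pixel_values):
        txt = self.text_embed(input_ids)                  # (B, T, d)
        img = self.patch_embed(pixel_values)              # (B, d, H/p, W/p)
        img = img.flatten(2).transpose(1, 2)              # (B, N_patches, d)
        tokens = torch.cat([img, txt], dim=1)             # one shared sequence
        hidden = self.blocks(tokens)                      # shared attention over both modalities
        return self.lm_head(hidden[:, img.size(1):])      # predict text tokens

model = UnifiedMultimodalStack()
logits = model(torch.randint(0, 32000, (1, 12)), torch.randn(1, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 12, 32000])
```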
Sector divergence intensifies: the strongest AI tailwind arrives in 2026
36Ke· 2025-12-03 08:57
Core Insights
- The article emphasizes that 2026 will be a pivotal year for artificial intelligence (AI), marking a shift from "AI+" to "AI native," where AI fundamentally redefines system architectures and operational logic [1][3].

Group 1: AI Native Revolution
- AI native signifies a complete redesign of systems with AI as the core logic and capability, leading to a comprehensive transformation across technology architecture, business processes, organizational roles, and value creation methods [3][4].
- The transition from "AI+" to "AI native" is not merely an enhancement but a fundamental restructuring that makes intelligence an inherent attribute of applications rather than an added feature [3][4].
- Key characteristics of a true AI native system include natural language interaction, autonomous learning and adaptation, and the ability to complete tasks independently based on large language models and knowledge bases [4][5].

Group 2: Development Trends and Tools
- The rise of low-code/no-code platforms allows individuals without programming skills to create custom AI tools, fostering a surge in "one-person company" models [8].
- Major companies like Microsoft and ByteDance are embedding AI agents into office suites, creating end-to-end workflows that enhance productivity [8].
- The development of AI native applications requires a productized approach to various tools, such as platforms for deploying large models and automated fine-tuning tools, which are essential for widespread adoption [8].

Group 3: Physical AI Integration
- By 2026, AI will extend beyond screens into physical environments like cities, factories, hospitals, and homes, marking the era of Physical AI [10][11].
- Physical AI is characterized by its ability to connect digital and physical worlds, enabling actions based on real-time data and physical interactions [10][11].
- The evolution of AI has progressed through three stages: perceptual AI, generative AI, and now Physical AI, which can reason, plan, and act like humans [10][11].

Group 4: World Models and Their Impact
- World models are becoming crucial for AI's integration into the real world, allowing AI to shift from data-driven to rule-driven approaches, enabling predictive decision-making [19][21].
- These models enhance generalization capabilities, allowing AI to apply learned knowledge to new, unseen scenarios, which is vital for applications like autonomous driving [22][23].
- The development of world models involves understanding physical laws and simulating environments, which can significantly improve the performance of AI systems in complex real-world situations [24][25].

Group 5: Multimodal AI Capabilities
- The emergence of multimodal large models (MLLMs) will redefine industries by enabling AI to process and integrate various data types, such as text, images, and audio [15][17].
- MLLMs will enhance cross-modal understanding and generation, allowing for more sophisticated content creation and problem-solving capabilities [15][16].
- By 2026, MLLMs are expected to drive significant advancements across various sectors, including cultural heritage preservation, security, and intelligent driving [17][18].
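To make the "predictive decision-making" role of a world model concrete, here is a minimal, generic sketch of model-based planning: roll out candidate action sequences through a learned dynamics model and pick the one with the best predicted outcome. The dynamics and reward functions below are stand-ins, not any specific system described in the article.

```python
import random

def dynamics_model(state, action):
    """Stand-in for a learned world model: predicts the next state."""
    return [s + a for s, a in zip(state, action)]

def reward(state, goal=(1.0, 0.0)):
    """Stand-in task reward: negative squared distance to a goal state."""
    return -sum((s - g) ** 2 for s, g in zip(state, goal))

def plan(state, horizon=5, candidates=64):
    """Random-shooting planner: simulate candidate action sequences inside the
    world model and return the first action of the best-scoring sequence."""
    best_score, best_first_action = float("-inf"), None
    for _ in range(candidates):
        seq = [[random.uniform(-0.2, 0.2) for _ in state] for _ in range(horizon)]
        s, total = list(state), 0.0
        for a in seq:
            s = dynamics_model(s, a)   # imagine forward, no real-world step taken
            total += reward(s)
        if total > best_score:
            best_score, best_first_action = total, seq[0]
    return best_first_action

print(plan([0.0, 0.0]))
```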
China's first AI glasses for the visually impaired released: 300ms ultra-low latency, powered by Tongyi Qianwen
Feng Huang Wang· 2025-12-03 07:14
Core Viewpoint
- Hangzhou Tongxing Technology has launched China's first AI-assisted glasses for the visually impaired, utilizing multimodal large models to address navigation challenges faced by this demographic, particularly the "last mile" issue in travel [1].

Group 1: Product Features
- The AI glasses are equipped with a 121-degree ultra-wide dual-camera and consist of four components: the glasses, a mobile phone, a remote control ring, and a cane [1].
- The system achieves ultra-low latency of 300ms in obstacle avoidance scenarios, allowing real-time environmental analysis and road prompts as the user moves [1].
- In reading menus or locating stores, the large model switches strategies to provide detailed summaries and broadcasts of text and environmental details [1].

Group 2: Market Context
- There are over 17 million visually impaired individuals in China, who often rely heavily on human assistance for travel due to a lack of efficient tools beyond traditional canes [1].
- The market and technology director of Hangzhou Tongxing Technology, Chen Gang, stated that advancements in large model technology have significantly reduced computing costs to one-tenth of previous levels [1].
- The company employs a "base model reuse + fine-tuning optimization" approach, enabling rapid implementation of complex features like voice assistants and emergency assistance at a lower cost [1].

Group 3: Product Launch
- The AI-assisted glasses have officially entered the market, marking a significant development in assistive technology for the visually impaired [1].
CES 2026 preview: AI is the core theme, and Chinese companies may dominate the show again
36Ke· 2025-12-01 04:09
Core Insights
- CES 2026 is set to showcase significant advancements in AI technology, with major companies like Siemens, Caterpillar, AMD, and Lenovo focusing on AI in their presentations [5][8][19].
- The event will highlight a variety of AI hardware products, including AI glasses, AI PCs, AI smartphones, and humanoid robots, indicating a strong trend towards AI integration in consumer electronics [18][19].
- Chinese brands are expected to dominate CES, showcasing their technological innovations across various categories, reflecting their growing influence in the global market [40][41].

AI as the Central Theme
- AI will be the overarching theme of CES 2026, with confirmed keynote speeches from industry leaders emphasizing its importance [5][19].
- Companies like Siemens will demonstrate how AI and digital twin technology can transform manufacturing and infrastructure [8].
- Lenovo plans to unveil innovations related to AI-driven experiences, including applications in sports and personalized user interactions [11].

PC and Gaming Innovations
- Intel, AMD, and NVIDIA are anticipated to launch new products, including Intel's Panther Lake mobile processors and AMD's R9 9950X3D processor with enhanced cache capabilities [19][21].
- The introduction of new gaming processors and graphics cards is expected to attract significant attention from the gaming community [21][22].

Display Technology Competition
- Major TV manufacturers, including TCL and Hisense, are expected to showcase advancements in RGB display technology, competing with international brands like LG and Samsung [25][26].
- CES 2026 will feature a variety of display technologies, including Micro RGB LCD and Mini LED, highlighting the competitive landscape in the display sector [25][26].

Smart Cleaning Devices
- Chinese smart cleaning brands are set to unveil new products, including robotic vacuums and lawn mowers, reinforcing their leadership in the global smart cleaning market [27][30].
- The focus will be on comprehensive cleaning solutions that leverage AI and advanced navigation technologies [30].

Accessory and Audio Innovations
- Accessory brands like Baseus and Ugreen are expected to expand their product lines beyond traditional charging devices, venturing into audio and smart home solutions [31][34].
- The introduction of high-end audio products and smart home security devices will be a key focus for these brands at CES 2026 [36].

AI Glasses and New Hardware
- AI glasses are anticipated to be a major highlight, with various brands competing in this emerging category [38].
- The presence of established players and new entrants in the AI hardware space will create a dynamic showcase of innovative products [39].

Chinese Brands' Dominance
- Chinese companies are projected to play a pivotal role at CES, with a significant share of exhibitors and a focus on technological innovation rather than just cost competitiveness [40][41].
- The event serves as a platform for Chinese brands to demonstrate their rapid product development and engineering capabilities across multiple tech sectors [40][41].
An Illustrated Guide to the Qwen3-VL Multimodal Model
自动驾驶之心· 2025-11-29 02:06
Core Insights
- The article discusses the Qwen3-VL model, a visual language model (VLM) that processes both text and images as input, emphasizing its architecture and implementation details [3][4].

Group 1: Model Overview
- Qwen3-VL is an autoregressive AI model designed to handle multimodal inputs, specifically text and images [3].
- The model's architecture includes various components such as configuration files, modeling files, and processing files for images and videos [5][6].

Group 2: Source Code Analysis
- The source code of Qwen3-VL is structured into several classes, including Qwen3VLVisionMLP, Qwen3VLVisionPatchEmbed, and Qwen3VLForConditionalGeneration, each serving specific functions within the model [6][12].
- The Qwen3VLProcessor class converts input images into pixel values, utilizing the Qwen2-VL image processor for this task [7][10].

Group 3: Image Processing
- The image processing function involves resizing, normalizing, and preparing images for input into the model, ultimately returning pixel values that serve as input [8][9].
- The model processes images in batches, grouping them by size for efficient resizing and normalization [9].

Group 4: Model Execution Flow
- The Qwen3VLForConditionalGeneration class serves as the entry point for the model, where input pixel values and text input IDs are processed to generate outputs [15][16].
- The model's forward method outlines the steps taken to integrate image and text features, including embedding images into the input sequence [21][22].

Group 5: Vision Encoder
- The vision encoder of Qwen3-VL is custom-built, differing from existing models like CLIP, and utilizes a 3D convolution to convert images into hidden states [35][37].
- The encoder incorporates attention mechanisms and position encoding to enhance the model's ability to process visual data [40][41].

Group 6: Final Outputs
- The final output of the model combines the processed image and text features, which are then forwarded to the language model for further processing [33][34].
- The architecture allows for the integration of visual and textual data, enabling the model to generate coherent outputs based on multimodal inputs [44].
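As a compact illustration of the execution flow described above (pixel values → vision-encoder features → image embeddings spliced into the text sequence at image placeholder positions), here is a self-contained sketch. The placeholder id, shapes, and the masked in-place merge are simplified assumptions for illustration, not a copy of the Qwen3-VL source.

```python
import torch

IMAGE_TOKEN_ID = 151655  # assumed placeholder id, for illustration only

def merge_image_features(input_ids, text_embeds, image_embeds):
    """Splice vision-encoder output into the text embedding sequence.

    input_ids:    (seq_len,) token ids containing IMAGE_TOKEN_ID placeholders
    text_embeds:  (seq_len, hidden) embeddings from the language model's embed layer
    image_embeds: (num_image_tokens, hidden) projected vision-encoder features
    """
    mask = input_ids == IMAGE_TOKEN_ID                 # positions where image tokens belong
    assert mask.sum().item() == image_embeds.size(0), "placeholder/feature count mismatch"
    merged = text_embeds.clone()
    merged[mask] = image_embeds                        # scatter image features into the sequence
    return merged                                      # fed onward to the language model

hidden = 64
input_ids = torch.tensor([1, 2, IMAGE_TOKEN_ID, IMAGE_TOKEN_ID, 3, 4])
text_embeds = torch.randn(6, hidden)
image_embeds = torch.randn(2, hidden)                  # two visual tokens after compression
print(merge_image_features(input_ids, text_embeds, image_embeds).shape)  # torch.Size([6, 64])
```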