Zhipu AI's GLM-4.1V-Thinking Tops HuggingFace Trending Worldwide: Best Performance at Its Size
IPO早知道 · 2025-07-09 10:01
Core Viewpoint
- The GLM-4.1V-9B-Thinking model marks a significant leap from perception to cognition in the GLM series of visual models, showcasing advanced capabilities in multi-modal reasoning and understanding [1][5].

Model Performance
- GLM-4.1V-9B-Thinking has reached the top position on HuggingFace Trending, with its 9 billion parameters delivering strong results across a wide range of tasks [2].
- The model has outperformed much larger models, achieving the best results in 23 of 28 authoritative benchmarks, including MMStar and MMMU-Pro, demonstrating the potential of smaller models [4].

Multi-Modal Capabilities
- The model accepts a wide range of multi-modal inputs, including images, videos, and documents, and is designed for complex cognitive tasks (a minimal usage sketch follows this list) [4].
- Key capabilities include:
  - Video understanding: analyzing up to two hours of video for time, characters, events, and logical relationships [4].
  - Image question answering: deep analysis of and reasoning about image content [5].
  - Subject problem-solving: detailed step-by-step reasoning for problems in subjects such as mathematics and science [5].
  - Text recognition: accurate extraction and structuring of text and charts from images and videos [5].
  - Document interpretation: understanding and extracting information from documents in finance, government, and education [5].
  - Grounding: identifying specific regions in images and returning their coordinates for downstream tasks [5].
  - GUI agent capabilities: recognizing and interacting with elements on web and mobile interfaces [5].
  - Code generation: automatically writing front-end code from input images and text [5].
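
For readers who want to try the image question answering capability, the following is a minimal sketch using Hugging Face transformers. The repo id THUDM/GLM-4.1V-9B-Thinking, the image-text-to-text pipeline task, and the example image URL are assumptions not confirmed by the article; consult the official model card for the supported loading code.

```python
# Minimal sketch: image question answering with GLM-4.1V-9B-Thinking via the
# Hugging Face `image-text-to-text` pipeline (assumed to support this model).
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="THUDM/GLM-4.1V-9B-Thinking",  # assumed repo id; check the model card
    device_map="auto",
    torch_dtype="auto",
)

# A chat-style message mixing an image with a text question (multi-modal input).
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/chart.png"},  # placeholder URL
            {"type": "text", "text": "What trend does this chart show? Explain your reasoning."},
        ],
    }
]

# return_full_text=False returns only the newly generated answer text.
result = pipe(text=messages, max_new_tokens=512, return_full_text=False)
print(result[0]["generated_text"])
```

The same chat-message structure extends to the other input types the article lists, for example passing video frames or document page images in the content list, subject to whatever limits the model card specifies.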