Multi-modal Input
ByteDance Releases Seedance 2.0; Doubao and Jimeng Officially Announce Integration
Huan Qiu Wang · 2026-02-12 08:45
Core Insights
- ByteDance has launched its latest video generation model, Seedance 2.0, integrated into its AI products Doubao and Jimeng, allowing users to create AI videos using their digital avatars [1][2]
- The model supports four input modalities (image, video, audio, and text), letting users specify styles, actions, and atmospheres more intuitively; a hypothetical request sketch follows this summary [1][5]

Group 1
- Seedance 2.0 has been tested within a limited scope and has drawn global attention for its multi-modal capabilities and precision [2]
- An overseas creator compared Seedance 2.0's output with that of earlier models, noting a significant improvement in realism and richness that even caught the attention of Elon Musk [2]
- Users abroad are reportedly seeking Chinese phone numbers in order to access Seedance 2.0, indicating high demand [2]

Group 2
- Feng Ji, CEO of Game Science, praised Seedance 2.0 as the "strongest video generation model" currently available, highlighting its advances in understanding and integrating multi-modal information [5]
- The official technical report states that Seedance 2.0 employs a sparse architecture to improve training and inference efficiency, and that it shows strong generalization [5]
- The model excels at generating high-quality audio-visual content, supporting complex functions such as multi-modal references, video editing, and motion stability, with significant improvements in visual aesthetics and narrative control [5]
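To make the four-modality claim concrete, here is a minimal sketch of what a combined request might look like. Seedance 2.0's public API is not described in the article, so the payload shape, field names, and model identifier below are all illustrative assumptions.

```python
# Hypothetical four-modality generation request for a Seedance-style model.
# All field names and the model id are assumptions for illustration only.
import json

request = {
    "model": "seedance-2.0",  # assumed identifier
    "inputs": {
        "text": "A chase scene at dusk, handheld camera, film grain",
        "image": "refs/style_frame.png",   # style/appearance reference
        "video": "refs/camera_move.mp4",   # motion and camera-path reference
        "audio": "refs/score.wav",         # pacing and atmosphere reference
    },
    "duration_seconds": 8,
}

print(json.dumps(request, indent=2))
```

The point of the sketch is the shape of the input: each modality constrains a different aspect of the output (appearance, motion, pacing), which is what lets users specify styles, actions, and atmospheres in a single pass rather than through text alone.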
Computing: ByteDance's Seedance 2.0 Launches; a "Kill the Game"-Grade Product Makes a Stunning Debut
GOLDEN SUN SECURITIES · 2026-02-10 08:24
Investment Rating
- The report maintains an "Accumulate" rating for the industry [4]

Core Insights
- The release of ByteDance's Seedance 2.0 marks a significant advance in video generation technology, supporting multi-modal inputs (image, video, audio, text) and improving controllability, which is expected to drive industrialization of the video generation sector [1][21]
- The AI drama industry's traditional "gacha" mechanism, in which the randomness of AI-generated content forces repeated paid attempts, is addressed by Seedance 2.0's improved controllability, potentially reducing production costs and timelines [2][22]
- Sensitivity analysis indicates that adopting Seedance 2.0 could lower total generation costs by approximately 5% under conservative assumptions and by 37% under neutral assumptions, compared with industry peers; a worked example of the retry-cost logic follows this summary [2][30]

Summary by Sections
1. Seedance 2.0 Release
- Seedance 2.0 launches with enhanced capabilities, supporting four input types and improving instruction understanding, physical realism, and element consistency [1][11]
- The model's controllability allows precise replication of complex camera movements and actions, improving the overall video generation process [17][19]
2. Implications for the Industry
- The significant improvement in controllability is expected to drive a leap in the industrialization of video generation by cutting the costs of the "gacha" mechanism prevalent in the AI-generated content industry [22]
- Seedance 2.0 can lower production costs and timelines in the AI drama sector by minimizing the number of attempts needed to reach a satisfactory result [2][22]
3. Investment Recommendations
- The report suggests focusing on AI-drama companies such as Wanjing Technology, Kuaishou, Fubo Group, and New Guodu, as well as multi-modal companies like Danghong Technology and Hongruan Technology [3][30]
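The cost claim rests on simple retry arithmetic: if each generation attempt succeeds with some probability, the expected number of paid attempts per usable clip is the reciprocal of that probability. The sketch below illustrates the mechanism; the per-clip price and success rates are assumed numbers for illustration, not figures from the report.

```python
# "Gacha" retry-cost arithmetic behind the sensitivity analysis.
# Prices and success rates below are illustrative assumptions.

def total_cost(clips_needed: int, price_per_attempt: float,
               success_rate: float) -> float:
    """Expected spend when each attempt succeeds independently with
    probability `success_rate` (geometric expectation: 1/p attempts)."""
    expected_attempts = 1.0 / success_rate
    return clips_needed * expected_attempts * price_per_attempt

baseline = total_cost(clips_needed=100, price_per_attempt=2.0, success_rate=0.25)
improved = total_cost(clips_needed=100, price_per_attempt=2.0, success_rate=0.40)
saving = 1 - improved / baseline
print(f"baseline {baseline:.0f} -> improved {improved:.0f}, saving {saving:.1%}")
# Raising the hit rate from 25% to 40% cuts expected cost by 37.5%,
# roughly the magnitude of the report's neutral-case estimate.
```

Under these assumptions, a modest improvement in controllability translates into a large cost reduction because the savings compound across every clip in a production.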
Video Enters the Editable Era: 藏师傅 Shows You 可灵 O1, the Video Version of Banana
歸藏的AI工具箱 · 2025-12-02 05:18
Core Viewpoint
- The article introduces the launch of 可灵's O1, a unified video and image generation and editing tool that integrates multiple tasks into a single interface, allowing seamless video and image editing and generation.

Group 1: Features of O1
- O1 integrates multi-modal video models, combining reference videos, text-to-video, frame manipulation, content addition/removal, and style redrawing into a one-stop solution for generation and modification [2]
- It supports multi-modal inputs including images, videos, subjects, and text, enabling precise editing through natural language without masks or keyframes [2][4]
- The tool maintains consistency of character, prop, and scene features across shots through multi-angle subjects and reference materials, ensuring coherent visuals [2]

Group 2: Editing Capabilities
- Users can generate narrative shots lasting approximately 3 to 10 seconds, allowing flexible control over pacing and shot length [2]
- Edits are made directly through text prompts: users upload a video and specify changes using references [4][6]
- O1 supports single or multiple reference images for background or character modifications, enhancing the realism of the final output [7]

Group 3: Subject Creation and Consistency
- O1 introduces a new element called a "subject," which lets users create and select characters for easier integration into videos without repeated uploads [10][13]
- Users can upload multiple images from different angles to improve consistency of characters and scenes during video generation [13][17]
- This is particularly beneficial for e-commerce, as it keeps products consistent in appearance across camera movements [17]

Group 4: Style and Frame Generation
- O1 lets users convert video styles easily, supporting artistic styles such as felt, anime, and 8-bit pixel [19]
- It also supports frame generation, enabling complex effects by combining image references with frame inputs [20][21]
- O1's overall video-editing capabilities mark a significant advance, with the potential to create impressive effects with minimal effort; a sketch of how such an edit job might be structured follows this summary [29]
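To show how the pieces fit together (a source video, a plain-language instruction, reference images, and reusable "subjects"), here is a small data-model sketch. The class and field names are assumptions made for illustration; they are not 可灵's documented interface.

```python
# Illustrative data model for an O1-style natural-language edit job.
# All names here are assumptions, not 可灵's documented interface.
from dataclasses import dataclass, field

@dataclass
class Subject:
    """A reusable character or product; multiple angles improve consistency."""
    name: str
    reference_images: list[str]

@dataclass
class EditJob:
    source_video: str
    instruction: str                 # plain language, no masks or keyframes
    subjects: list[Subject] = field(default_factory=list)
    reference_images: list[str] = field(default_factory=list)  # e.g. new background

job = EditJob(
    source_video="clips/shot_04.mp4",
    instruction="Replace the background with a rainy night street; keep the host unchanged",
    subjects=[Subject("host", ["host_front.png", "host_left.png", "host_right.png"])],
    reference_images=["refs/rainy_street.png"],
)
print(f"{len(job.subjects)} subject(s), {len(job.reference_images)} reference(s)")
```

The design point is the named subject: bundling a character's multi-angle images under one handle is what lets the same person or product reappear consistently across shots without re-uploading for every edit.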
Zhipu's GLM-4.1V-Thinking Tops HuggingFace Trending Worldwide: Best Performance at Its Size
IPO早知道 · 2025-07-09 10:01
Core Viewpoint
- The GLM-4.1V-9B-Thinking model represents a significant leap from perception to cognition in the GLM series of visual models, showcasing advanced capabilities in multi-modal reasoning and understanding [1][5]

Model Performance
- GLM-4.1V-9B-Thinking reached the top of HuggingFace Trending, leveraging its 9 billion parameters to excel across a range of tasks [2]
- The model outperformed larger models, achieving the best results in 23 of 28 authoritative evaluations, including MMStar and MMMU-Pro, demonstrating the potential of smaller models [4]

Multi-Modal Capabilities
- The model supports a wide range of multi-modal inputs, including images, videos, and documents, and is designed for complex cognitive tasks [4]
- Key capabilities include (a minimal usage sketch follows this list):
  - Video understanding: analyzing up to two hours of video for time, characters, events, and logical relationships [4]
  - Image question answering: deep analysis of, and reasoning about, image content [5]
  - Subject problem-solving: detailed reasoning for problems in subjects such as mathematics and science [5]
  - Text recognition: accurate extraction and structuring of text and charts from images and videos [5]
  - Document interpretation: understanding and extracting information from documents in finance, government, and education [5]
  - Grounding: identifying specific regions in images and extracting their coordinates for downstream tasks [5]
  - GUI agent capabilities: recognizing and interacting with elements on web and mobile interfaces [5]
  - Code generation: automatically writing front-end code from an input image's text [5]
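Since the weights are distributed on HuggingFace, a hedged loading sketch is possible. The repository id, the image-text-to-text auto-class, and the chat-template message format below follow transformers' generic vision-language conventions and are assumptions; check the official model card for the exact supported interface and minimum transformers version.

```python
# Minimal image-question-answering sketch for GLM-4.1V-9B-Thinking.
# Assumes a recent transformers release with built-in support for the
# model; repo id and message format are assumptions, not verified here.
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "THUDM/GLM-4.1V-9B-Thinking"  # assumed repository id
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/report_chart.png"},
        {"type": "text", "text": "Extract the chart's data as a CSV table."},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```

The same message structure would carry the other listed capabilities (grounding, document interpretation, GUI screenshots) by swapping the image and the text instruction.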