Core Insights
- Baidu has officially announced a comprehensive upgrade of its Wenxin Assistant's AIGC creative capabilities, introducing an eight-modal creative matrix spanning AI-generated images, videos, music, and podcasts, and marking a significant shift from a search engine to a "comprehensive creation and service platform" [3]

Group 1: Creative Capabilities
- The upgraded Wenxin Assistant offers full-stack creative capabilities; the key breakthrough is video generation, which can produce 3-minute videos from a single text prompt, far beyond the roughly 10-second limit of traditional AI video tools [3]
- The platform integrates features such as "write a song in one sentence," MV production, and famous-scene imitation, supported by over 30 special-effect templates, forming a complete ecosystem for static image processing, dynamic video production, and audio creation [3]
- Daily user-generated AIGC content has exceeded 10 million items, demonstrating the scalable application value of multi-modal creative tools [3]

Group 2: Task Resolution Capabilities
- Wenxin Assistant has achieved a critical upgrade in task resolution by building a multi-tool invocation engine that lets users trigger cross-domain services with a single click, covering essential scenarios such as life planning, health consulting, education tutoring, and workplace tasks [4]
- The system can automatically combine text, images, and short videos to generate printable task cards, showcasing the Wenxin model's deep understanding of user intent and its resource-allocation capabilities [4]
- Continuous learning from user interaction data is optimizing tool-combination strategies and response efficiency [4]

Group 3: Interactive Digital Human Technology
- The newly released open real-time interactive digital human agent is a highlight of the upgrade; it is built on Wenxin model 4.5 and integrates the core strengths of NOVA digital human technology [4]
- The technology enables ultra-realistic interactive experiences, replicating a person's voice characteristics, movement habits, and micro-expressions from just 10 minutes of real human sample data, with industry-leading accuracy in lip-sync and naturalness of expression [5]
- The system offers millisecond-level responsiveness, with real-time dialogue latency kept within 100 milliseconds, and is compatible with multiple terminal scenarios [5]
- An open service ecosystem has been established, currently integrating expert digital avatars in fields such as law, emotional support, and tourism, with plans to open a third-party developer platform in the future [5]
Baidu Search's major upgrade to Wenxin Assistant: eight-modal creation plus a real-time interactive digital human agent reshape the AI interaction ecosystem