机器之心
Taking on Sora 2: Wan 2.6 makes it easy to customize characters and control shots, so anyone can be a director
机器之心· 2025-12-17 05:28
Core Insights
- The article highlights the rapid advancements in video generation technology, particularly the release of Alibaba's Wan 2.6 model, which significantly enhances user capabilities in video creation and storytelling [1][36]

Group 1: Technological Advancements
- OpenAI's Sora 2 introduced a "Cameo" feature that addresses the "character consistency" issue in AI video generation, transforming the process from unpredictable to controllable [1]
- Alibaba's Wan 2.6 model is noted for its comprehensive capabilities, including voice and image synchronization, allowing users to create videos with a high degree of realism and narrative coherence [3][9]
- The new model supports a maximum video generation duration of 15 seconds, the longest in the domestic market, and includes a "shot control" feature for professional storytelling [3][4]

Group 2: User Experience and Accessibility
- The Wan 2.5 version of the model made video creation accessible on mobile devices, while the 2.6 version further democratizes professional video production, enabling anyone to take on roles like director or actor [2][4]
- Users can create videos with high fidelity in both visual and auditory aspects, showcasing the model's ability to replicate character traits and emotional expressions accurately [11][24]

Group 3: Practical Applications
- The model's capabilities extend to generating complete narrative short films, making it suitable for advertising design and short drama production [16]
- The article emphasizes the model's potential in various creative fields, including AI comic production, advertising design, and short video creation, with over ten visual creation capabilities supported [35][36]

Group 4: Conclusion and Future Implications
- The release of Wan 2.6 signifies a shift from a mere "lottery" approach in AI video generation to a new phase of precise and controllable cinematic creation [36]
- The technology effectively removes barriers to creativity, allowing users to leverage their imagination as their primary production tool [37]
WAIC Future Tech 2026: Global tech exposure and partnerships, capital's next gold mine
机器之心· 2025-12-17 05:28
Core Viewpoint
- The event focuses on the launch of a collaborative innovation ecosystem in the AI sector, showcasing various projects that leverage AI technology across different industries [1][2]

Group 1: Event Overview
- The event is scheduled for December 20, 2025, at Tsinghua Science Park, Beijing, starting at 1:00 PM [5]
- It includes a launch ceremony for the collaborative innovation ecosystem and a roundtable with a mystery guest [2]

Group 2: Project Highlights
- A total of 14 projects will be presented, primarily focusing on AI applications, infrastructure, hardware, and cutting-edge technology, targeting seed to Series A funding stages [4]
- Notable projects include:
  - AI-driven solutions for global mineral resource discovery [7]
  - Data-driven to decision-driven paradigms in large enterprises [8]
  - Cooling solutions for high-density computing [8]
  - Robotics solutions for developers [8]
  - AI-powered intelligent assistants [8]
  - AI-driven entertainment and gaming solutions [8][9]
Experiential-memory magic: LightSearcher cuts AI tool calls by 39.6% and speeds up reasoning by 48.6%
机器之心· 2025-12-17 05:28
Core Insights
- The article discusses the challenges faced by existing RL-driven deep thinking models, particularly the trade-off between accuracy and efficiency: frequent calls to external search tools improve accuracy but significantly increase response time [2][6]
- The LightSearcher framework, introduced by the Beijing University of Posts and Telecommunications AI team, addresses these challenges by using experiential memory and adaptive reward shaping to improve efficiency while maintaining accuracy [3][9]

Summary by Sections

Introduction
- Deep thinking models need to control their use of search tools strategically; existing methods fall short in balancing accuracy and efficiency [6]

LightSearcher Framework
- LightSearcher optimizes the use of search tools through experiential memory, which turns implicit reasoning paths into explicit guiding experiences, and includes adaptive reward mechanisms [9][11]
- The framework's core components are:
  - Contrastive Experiential Reasoning, which builds a dynamic memory library from high- and low-quality reasoning paths [14]
  - Adaptive Reward Shaping, which minimizes redundant tool calls and balances accuracy and efficiency [14]
  - Experience-based RL training, which integrates accumulated experiences into prompt templates to guide efficient reasoning [14]

Experimental Results
- Comprehensive evaluations on multiple multi-hop QA benchmark datasets show that LightSearcher maintains competitive accuracy while reducing search tool calls by 39.6%, reasoning time by 48.6%, and token consumption by 21.2% [18]

Conclusion
- LightSearcher offers a new pathway for building efficient and reliable deep reasoning systems, with potential applications extending beyond multi-hop QA to areas like code synthesis and strategic planning [18][20]
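The adaptive reward shaping described above can be pictured as a penalty on redundant search calls layered on top of an accuracy reward. The sketch below is an illustrative assumption of how such shaping might look, not LightSearcher's actual reward function; the function name, baseline, and weights are all hypothetical.

```python
# Hypothetical sketch of adaptive reward shaping that discourages redundant
# search-tool calls while preserving the accuracy signal. Names, the baseline
# call budget, and the penalty weight are illustrative assumptions.

def shaped_reward(answer_correct: bool, num_tool_calls: int,
                  baseline_calls: int = 3, penalty: float = 0.1) -> float:
    """Accuracy reward minus a penalty for tool calls beyond a baseline.

    A correct answer earns 1.0; each search call beyond `baseline_calls`
    subtracts `penalty`, so the policy learns to search only when needed.
    """
    accuracy_reward = 1.0 if answer_correct else 0.0
    redundant_calls = max(0, num_tool_calls - baseline_calls)
    return accuracy_reward - penalty * redundant_calls

# A correct answer found frugally outscores one found with many searches.
print(shaped_reward(True, 2))   # 1.0
print(shaped_reward(True, 8))   # 0.5
```

Under this kind of shaping, the policy is still paid primarily for correctness, but the marginal search call has a cost, which is the mechanism that can drive down tool-call counts without sacrificing accuracy.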
SIGGRAPH Asia 2025: Moore Threads wins a top award at the graphics conference's 3DGS challenge and fully open-sources its self-developed LiteGS
机器之心· 2025-12-17 05:28
Core Insights
- Moore Threads won the silver medal at the 3D Gaussian Splatting Reconstruction Challenge during SIGGRAPH Asia 2025, showcasing its advanced algorithm capabilities and hardware-software optimization in next-generation graphics rendering technology [1][16]

Group 1: 3D Gaussian Splatting Technology
- 3D Gaussian Splatting (3DGS) is a revolutionary 3D scene representation and rendering technology that achieves an exceptional balance between image quality, efficiency, and resource usage, improving rendering efficiency by hundreds to thousands of times over traditional NeRF methods [4][19]
- 3DGS has shown strong adaptability and scalability in areas such as ray tracing, real-time VR/AR rendering, and multi-modal fusion, making it a key technology in the evolving landscape of graphics rendering [4][8]

Group 2: Competition Overview
- The 3DGS Reconstruction Challenge required participants to complete high-quality 3DGS reconstruction within 60 seconds from provided real terminal video sequences and SLAM point clouds, weighing both reconstruction quality and speed [10][12]
- The evaluation metrics included PSNR (peak signal-to-noise ratio) and reconstruction speed, ensuring a fair and authoritative ranking of the competing teams [12]

Group 3: Performance Results
- Moore Threads' team, "MT-AI," achieved an average PSNR of 27.58 and a reconstruction time of 34 seconds, placing third overall in the competition [17][20]
- The results highlighted the company's leading capabilities in 3DGS algorithm construction and hardware-software optimization [16][20]

Group 4: LiteGS Development
- Moore Threads developed the LiteGS library, which optimizes the entire pipeline from GPU systems to data management and algorithm design, achieving training acceleration of up to 10.8x while reducing parameter count by over 50% [20][25]
- LiteGS has been open-sourced on GitHub to promote collaboration and continuous evolution in 3D reconstruction and rendering technologies [27]

Group 5: Strategic Implications
- The success at the SIGGRAPH Asia competition reflects Moore Threads' strategic understanding of global technology trends and its ability to lead future graphics computing directions [28]
- The advancements in 3DGS technology highlight the high demands for algorithm and hardware collaboration, positioning Moore Threads as a forward-thinking player in the graphics intelligent computing field [28]
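PSNR, the quality metric used to rank challenge entries, follows directly from its standard definition: 10·log10(MAX² / MSE). A minimal sketch for 8-bit images (flat pixel lists here, purely for illustration):

```python
# PSNR (peak signal-to-noise ratio) from the textbook definition, the metric
# used to score 3DGS reconstructions in the challenge. Sketch for 8-bit data.
import math

def psnr(rendered, reference, max_val=255.0):
    """PSNR in dB between two equally sized pixel sequences."""
    mse = sum((a - b) ** 2 for a, b in zip(rendered, reference)) / len(reference)
    if mse == 0:
        return float("inf")  # identical images: no noise
    return 10.0 * math.log10(max_val ** 2 / mse)

print(psnr([100, 102], [100, 100]))  # mse = 2 → ≈ 45.12 dB
```

Higher is better: MT-AI's reported average of 27.58 dB means its renders deviate far less from the ground-truth views than a low-scoring reconstruction would.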
VGGT4D: Training-free 4D dynamic scene reconstruction that taps the potential of 3D foundation models
机器之心· 2025-12-17 02:05
Core Insights
- The article discusses VGGT4D, a framework developed by researchers from Hong Kong University of Science and Technology (Guangzhou) and Horizon Robotics, aimed at enabling 3D foundation models to handle dynamic 4D scenes without additional training costs [2][4][33]
- VGGT4D leverages hidden motion cues within the attention layers of the Visual Geometry Transformer (VGGT) to improve performance in tasks such as dynamic object segmentation, camera pose estimation, and long-sequence 4D reconstruction [2][4][6]

Research Background
- Traditional 3D foundation models like VGGT and DUSt3R excel in static scene reconstruction but struggle with dynamic 4D scenes that include moving objects, leading to significant performance drops [6][7]
- Existing solutions often face challenges such as high computational costs and reliance on external priors, which complicate the system [9][12]

Methodology
- VGGT4D introduces a training-free mechanism for attention feature mining and mask refinement, utilizing Gram matrices and gradient flows for high-precision dynamic-static separation [14][17]
- The framework addresses the limitations of standard attention maps by employing self-similarity Gram matrices to enhance the signal-to-noise ratio, allowing better extraction of motion cues [16][17]

Experimental Validation
- VGGT4D was evaluated on dynamic object segmentation, camera pose estimation, and 4D point cloud reconstruction across six benchmark datasets, demonstrating superior performance compared to other methods [22][23]
- In dynamic object segmentation, VGGT4D achieved the best performance on the DAVIS-2016 and DAVIS-2017 datasets, outperforming all variants without requiring any 4D-specific training [24][25]
- For camera pose estimation, VGGT4D consistently improved on the strong baseline set by the original VGGT model, achieving an Average Translation Error (ATE) of 0.164 on the VKITTI dataset, compared to 2.272 for MonST3R [27][28]

Conclusion
- VGGT4D successfully extends the capabilities of 3D foundation models to 4D dynamic scenes through effective internal feature extraction, providing a low-cost solution for 4D reconstruction and showcasing the potential of foundation models in zero-shot transfer tasks [33]
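The self-similarity Gram matrix idea can be illustrated with a toy sketch: tokens whose features sit close to the (mostly static) majority get a high row-average in the Gram matrix, while moving-object tokens stand out with low average similarity. This is a simplified illustration under assumed scoring rules, not VGGT4D's actual feature-mining pipeline; the function and threshold-free scoring are hypothetical.

```python
# Toy illustration of dynamic-static separation via a self-similarity Gram
# matrix, in the spirit of the approach described above. The scoring rule
# (1 minus row-mean cosine similarity) is an illustrative assumption.
import numpy as np

def dynamic_token_scores(features: np.ndarray) -> np.ndarray:
    """features: (num_tokens, dim) per-token features for one frame.

    Returns a per-token "dynamicness" score: 1 minus the token's mean
    cosine similarity to all tokens (its row average in the Gram matrix).
    """
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    gram = normed @ normed.T      # (num_tokens, num_tokens) self-similarity
    mean_sim = gram.mean(axis=1)  # static tokens resemble the majority
    return 1.0 - mean_sim

# Three clustered tokens (static background) and one outlier (moving object):
# the outlier receives the highest dynamicness score.
feats = np.array([[1.0, 0.0], [0.9, 0.1], [1.0, 0.1], [0.0, 1.0]])
scores = dynamic_token_scores(feats)
print(scores.argmax())  # 3, the outlier token
```

The appeal of this kind of signal is that it is already latent in a trained model's features, which is why no 4D-specific training is needed to exploit it.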
Shanghai Chuangzhi Academy Jingzhi Talent Forum | A call for top young talents from home and abroad, with a briefing on overseas talent policies
机器之心· 2025-12-17 02:05
Core Viewpoint
- Shanghai Chuangzhi Academy aims to create an innovative ecosystem that encourages value creation and provides abundant resources for talent in the field of artificial intelligence [3][10]

Group 1: Event Overview
- The "Super MVP" Talent Forum is scheduled for December 26-27, 2025, and late January 2026, combining online and offline formats [5]
- The forum invites top young talents worldwide from leading universities and research institutions in AI-related fields [6]

Group 2: Talent Requirements
- The academy seeks PhD candidates or recent graduates in AI-related disciplines such as computer science, mathematics, and physics from top universities [6]
- Candidates should hold formal teaching or research positions at prestigious institutions or be engaged in R&D roles at leading companies or startups [6]

Group 3: Institutional Background
- Established in July 2024, Shanghai Chuangzhi Academy is a collaborative initiative between the Ministry of Education and Shanghai to explore high-level talent cultivation [10]
- The academy takes a student-centered approach and aims to become a hub for AI innovation [10]

Group 4: Support and Resources
- The academy offers substantial computational resources, collaboration with top university mentors, and a strong engineering team [18]
- It emphasizes a flat organizational structure that promotes collaboration among independent principal investigators, students, and industry mentors [18]

Group 5: Compensation and Benefits
- Starting salaries are at the million-RMB level, with additional benefits including rental discounts on furnished housing and access to top educational and medical resources in Shanghai [20]
Just now: OpenAI launches the all-new ChatGPT Images, with Altman showing off his abs to promote it
机器之心· 2025-12-17 00:00
Core Viewpoint
- OpenAI has launched a new version of ChatGPT Images, enhancing image generation and editing capabilities while aiming to simplify user interaction and broaden accessibility in creative processes [10][34][44]

Group 1: New Features and Improvements
- The new ChatGPT Images is powered by OpenAI's flagship image generation model, offering precise editing while maintaining key details, with a fourfold increase in image generation speed [10][11]
- The model excels in various editing types, including adding, removing, combining, and replacing elements, allowing for detailed transformations while preserving important aspects of the original image [12][15]
- Enhanced instruction adherence enables the model to follow user commands more reliably, resulting in more accurate edits and better handling of complex compositions [24]

Group 2: User Experience and Accessibility
- The updated Images feature is designed to make the image generation experience more enjoyable and effortless, with numerous preset filters and prompts to inspire creativity [34][44]
- The new model is available to all ChatGPT users and offers a 20% reduction in image input and output costs compared to the previous version, allowing for more image generation within the same budget [37]
- OpenAI aims to lower the psychological barrier for users by introducing a standalone "Images" entry point and simplifying the interaction process, making it as easy as posting on social media [44]

Group 3: Competitive Landscape
- The release of ChatGPT Images signals a shift in the competitive landscape of AI image generation, from a focus on model capabilities to the overall product experience [43]
- OpenAI has not released quantitative benchmark results for this update, indicating a strategic emphasis on user experience rather than purely technical performance metrics [43]
Zhejiang University teams up with ByteDance: open-sourcing OpenVE-3M, a large-scale instruction-following video editing dataset
机器之心· 2025-12-17 00:00
The authors of this paper are from Zhejiang University and ByteDance. First author He Haoyang is a PhD student at Zhejiang University whose research focuses on video generation and editing; the corresponding author is Professor Xie Lei of Zhejiang University.

Highlights

Paper title: OpenVE-3M: A Large-Scale High-Quality Dataset for Instruction-Guided Video Editing

1. The authors present OpenVE-3M, a large-scale, high-quality, multi-category instruction-following video editing dataset of 3M sample pairs, organized into 2 major categories (spatially aligned and non-spatially aligned) and 8 subcategories.
2. The authors propose a stable pipeline for constructing high-quality, multi-category instruction-following video editing data that ensures editing quality while preserving diversity, to support community research.
3. The authors present OpenVE-Edit, an efficient and effective instruction-following video editing model that reaches SoTA with only 5B parameters, surpassing existing open-source 14B models.
4. The authors present a general, multi-category, and challenging instruction-following video editing benchmark that evaluates models in each category along 3 key dimensions and aligns closely with human judgments.

1. Research Motivation

Existing instruction-following video editing datasets such as InsViE-1M, Senorita-2M, and Ditto-1M mainly suffer from dataset ...
Has PPO-Clip's "blind spot" been patched? Kuaishou proposes entropy-ratio clipping, a key leap from local constraints to global stability
机器之心· 2025-12-16 10:22
This research was carried out by the large language model team at Kuaishou Technology; core authors include Su Zhenpeng and Pan Leiyu. The Kuaishou LLM team focuses on foundational LLM development and frontier innovations such as Agent RL, pragmatically exploring the capability boundaries of AGI while advancing new AI technologies and products. The team has previously open-sourced models including Klear-46B-A2.5B and Klear-Reasoner-8B, with Klear-Reasoner-8B achieving SOTA results on math and code benchmarks among models of the same parameter scale.

In the post-training stage of large language models, reinforcement learning has become the core paradigm for improving model capability and alignment quality. However, in the widely adopted off-policy training paradigm, the data used to update the current policy is generated by an older behavior policy, causing distribution drift. This typically pushes the policy outside the trust region and makes RL training unstable.

Although PPO mitigates part of the problem through its importance-sampling clipping mechanism, it can only constrain the probability changes of actions that were actually sampled, ignoring the global distribution drift of unsampled actions. To address these challenges, the Kuaishou research team proposes a novel entropy-ratio clipping method. Approaching the problem from a new angle, it stabilizes the global distribution by constraining the relative change of policy entropy, providing a more reliable control mechanism for RL training.

Research Background

RL training has long faced ...
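The core intuition above, constraining the relative change of policy entropy to catch global drift that PPO's per-action clip misses, can be sketched as a simple trust-band check on the entropy ratio H(new)/H(old). This is a minimal illustration under assumed thresholds, not the team's actual formulation; the band width and how the check combines with the PPO loss are hypothetical.

```python
# Minimal sketch of an entropy-ratio trust band in the spirit of the method
# described above. The band width eps and the boolean gate are illustrative
# assumptions; in practice this would modulate the PPO objective.
import math

def entropy(probs):
    """Shannon entropy of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_ratio_ok(old_probs, new_probs, eps: float = 0.2) -> bool:
    """True if full-distribution entropy changed by at most eps relatively."""
    ratio = entropy(new_probs) / entropy(old_probs)
    return 1.0 - eps <= ratio <= 1.0 + eps

old = [0.25, 0.25, 0.25, 0.25]      # uniform: H = ln 4 ≈ 1.386
drifted = [0.97, 0.01, 0.01, 0.01]  # near-deterministic: H ≈ 0.168
print(entropy_ratio_ok(old, old))      # True: no drift
print(entropy_ratio_ok(old, drifted))  # False: global collapse caught
```

Note what the check sees that per-action clipping does not: the drift in `drifted` involves the entire distribution, including actions that may never have been sampled, so a constraint on the whole-distribution entropy can flag it even when individual importance ratios stay within PPO's clip range.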
Wuwen Xinqiong (无问芯穹) unveils its agent service platform for the first time, using infrastructure to accelerate enterprise-grade "agent freedom"
机器之心· 2025-12-16 10:22
Core Viewpoint
- The future of enterprises will be characterized by the integration of multiple intelligent agents, significantly amplifying organizational creativity and impact, as stated by the CEO of Wuwen Xinqiong [1]

Group 1: Intelligent Agent Ecosystem
- The Wuwen Xinqiong Intelligent Agent Service Platform was officially launched to provide comprehensive support for enterprises in the intelligent agent era, from customization to commercialization [3]
- The platform aims to bridge the gap between infrastructure and agent development needs, addressing key challenges such as achieving production-level effectiveness and controlling costs [7][12]

Group 2: Core Competitiveness in the Intelligent Era
- The transition to the intelligent agent era accelerates the scaling of enterprise creativity, compressing the timeline from idea to industry [5]
- The platform offers ready-to-use agent capability templates and reliable hosting services, enhancing the effectiveness and stability of agent operations [9]

Group 3: Cost Control and Efficiency
- The platform integrates deeply with the underlying infrastructure to help enterprises flexibly control the costs of deploying intelligent agents, achieving efficiency improvements of 3 to 5 times over traditional service models [14]
- It supports the integration of various tools, eliminating over 70% of the redundant labor in agent tool integration [16]

Group 4: Real-World Applications and Impact
- The platform has been validated through collaborations with industry partners, exemplified by the "SysCoding Agent" built for enterprise system development, which achieved over 95% completeness in its initial output [19][21]
- The intelligent agent service model is being applied across industries, providing efficient and agile services that translate industry knowledge into long-term business value [23]

Group 5: Future Vision
- Wuwen Xinqiong aims to be a long-term partner for enterprises in the intelligent agent transformation, focusing on converting organizational knowledge into sustainable value and defining the next generation of production paradigms [25]
- The company emphasizes collaboration between academia and industry to create a closed loop of innovation and industry development [27]