The World's Most Feature-Complete Video Generation Model Is Here
量子位· 2025-12-17 10:00
Core Viewpoint
- Alibaba has launched the new Tongyi Wanxiang 2.6 model, the most comprehensive video generation model globally, covering capabilities such as text-to-video, image generation, and audio-driven video creation [1].

Group 1: Video Generation Capabilities
- The Wanxiang 2.6 model introduces multi-audio-driven video, along with audio-visual synchronization and multi-shot storytelling, features not available in Sora 2 [2].
- The model shows significant improvements in artistic style control, realistic portrait generation, and understanding of historical and cultural semantics in image generation [3][8].
- Its video generation capabilities include video reference generation, subject consistency, and natural audio-visual synchronization, which enhance the overall user experience [11][12].

Group 2: Performance Testing
- Initial tests show that Wanxiang 2.6 performs well in video subject consistency and prompt understanding, achieving near 1:1 replication of the subject's appearance and accurately matched lip movements [11].
- Multi-shot narrative generation is effective, with smooth transitions and coherent storytelling across shots, although some abstract actions still pose challenges [17][18].
- Aesthetic quality in video generation has improved, showing a cinematic feel and strong visual appeal, particularly in complex scenes such as cyberpunk cityscapes [14][24].

Group 3: Image Generation Enhancements
- Wanxiang 2.6 has advanced in style transfer, portrait generation, and bilingual text handling, demonstrating a better grasp of new aesthetic styles [19][22].
- The model successfully generated a food promotional poster with clear bilingual text and an appealing layout, indicating reliable aesthetic judgment [25][27].
- Overall, performance is commendable, with minor flaws in multi-character dialogue and complex action understanding, but the model is usable for daily short-video creation and secondary creation tasks [28][29].
Moore Threads' Algorithm Makes a Splash: Silver at a Top Graphics Conference, Now Open-Sourced
量子位· 2025-12-17 09:07
Core Viewpoint
- Moore Threads won the silver medal at the 3D Gaussian Splatting Reconstruction Challenge (3DGS Challenge) at SIGGRAPH Asia 2025, showcasing its algorithm capabilities and hardware-software co-optimization in next-generation graphics rendering [1][2][13].

Group 1: 3D Gaussian Splatting Technology
- 3D Gaussian Splatting (3DGS), proposed in 2023, is a revolutionary 3D scene representation and rendering technique that strikes an exceptional balance between image quality, efficiency, and resource usage [4].
- Compared with traditional Neural Radiance Fields (NeRF), 3DGS improves rendering efficiency by hundreds to thousands of times while maintaining realistic quality, and adapts well to ray tracing, real-time VR/AR rendering, and multi-modal fusion [4][6].
- 3DGS is becoming a key foundational technology in embodied-AI training scenarios, supporting accurate world modeling, path planning, environmental perception, and complex task execution [7][8].

Group 2: Competition and Performance
- The 3DGS Challenge required participants to complete high-quality 3DGS reconstruction within 60 seconds from real terminal video sequences and SLAM point clouds, with PSNR and reconstruction speed as the evaluation metrics [9][10].
- Moore Threads achieved an average PSNR of 27.58 with a reconstruction time of 34 seconds, ranking third overall and significantly outperforming many teams [15][16].

Group 3: LiteGS Development
- Moore Threads developed the LiteGS foundational library to optimize 3DGS training, substantially reducing training time and parameter count while maintaining high reconstruction quality [17][20].
- LiteGS achieves up to 10.8x training acceleration and cuts parameter count by more than 50%, while exceeding mainstream solutions in PSNR by 0.2–0.4 dB [20][21].
- LiteGS has been fully open-sourced on GitHub to promote collaboration and continuous evolution in 3D reconstruction and rendering [23].

Group 4: Strategic Implications
- Success at the international graphics competition reflects Moore Threads' grasp of global technology trends and its ambition to lead the future direction of graphics computing [23][25].
- The company will host the first MUSA Developer Conference on December 20-21, 2025, discussing how technologies like 3DGS can shape the future and empower fields such as embodied intelligence [25].
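The challenge's image-quality metric, PSNR, is a simple function of mean squared error between the rendered and ground-truth images. A minimal sketch, assuming standard 8-bit images (the example inputs are invented, not challenge data):

```python
import numpy as np

def psnr(reference: np.ndarray, reconstruction: np.ndarray, max_val: float = 255.0) -> float:
    """Peak Signal-to-Noise Ratio in dB between two images of the same shape."""
    mse = np.mean((reference.astype(np.float64) - reconstruction.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10((max_val ** 2) / mse)

# A uniform error of 10 gray levels on 8-bit images gives MSE = 100,
# hence PSNR = 10 * log10(255^2 / 100) ≈ 28.13 dB, close to the
# ~27.6 dB range reported for the challenge.
ref = np.full((64, 64), 128, dtype=np.uint8)
rec = np.full((64, 64), 138, dtype=np.uint8)
print(round(psnr(ref, rec), 2))  # → 28.13
```

Higher PSNR means lower pixel-wise error; a 0.2–0.4 dB edge like the one LiteGS reports is therefore a direct reduction in reconstruction error at fixed dynamic range.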
The Evolution of Large Models: Words to Worlds | A Conversation with SenseTime's Lin Dahua
量子位· 2025-12-17 09:07
Core Insights
- The article discusses the breakthrough of the SenseNova-SI model, developed by SenseTime, which has surpassed the Cambrian-S model in spatial intelligence capabilities [2][5][50].
- It highlights a shift in AI paradigms, away from merely scaling models and toward foundational research on multi-modal and spatial intelligence [9][20][22].

Model Performance
- SenseNova-SI achieved state-of-the-art (SOTA) results across spatial intelligence benchmarks, outperforming both open-source and proprietary models [4][5].
- Performance metrics show SenseNova-SI scoring higher than Cambrian-S in key areas such as spatial reasoning and hallucination suppression [50].

Paradigm Shift in AI
- The article argues that the traditional model-scaling approach is reaching its limits, necessitating a return to fundamental research [9][15][20].
- SenseTime's approach centers on a new architecture called NEO, which integrates visual and language processing at the core level for better understanding of spatial relationships [39][42].

Technological Innovations
- The NEO architecture processes visual and textual tokens simultaneously, enhancing the model's ability to understand and interact with the physical world [42][46].
- SenseNova-SI demonstrates a tenfold increase in data efficiency, requiring only 10% of the training data of comparable models to reach SOTA performance [49].

Industrial Application
- The article stresses the importance of making AI technologies economically viable, noting that high costs and slow processing are barriers to widespread adoption [55][58].
- SenseTime's SekoTalk product exemplifies AI applied to real-time video generation, reducing processing time from hours to real time [64][66].

Future Directions
- The article encourages young researchers and entrepreneurs to explore fields beyond large language models, such as embodied intelligence and AI for science [68][70].
- It concludes with a vision of China developing AI that deeply interacts with the physical world, positioning it as a leader in this emerging landscape [72][73].
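The article does not detail NEO's internals. As a generic illustration of what "processing visual and textual tokens simultaneously" can mean in a unified transformer, here is a toy early-fusion sketch; all shapes, names, and numbers are assumptions, not SenseTime's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

# Toy embeddings: image patches and a short text prompt, each already
# projected into the shared d_model space by their own encoders.
patch_tokens = rng.normal(size=(196, d_model))  # 14x14 patch grid
text_tokens = rng.normal(size=(12, d_model))    # 12 text tokens

# A unified early-fusion transformer sees one concatenated sequence,
# so self-attention can relate any word to any image region directly.
sequence = np.concatenate([patch_tokens, text_tokens], axis=0)

# Single-head self-attention over the fused sequence (no masking).
q = k = v = sequence
scores = q @ k.T / np.sqrt(d_model)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
fused = weights @ v

print(sequence.shape, fused.shape)  # → (208, 64) (208, 64)
```

The design point is that fusion happens inside attention rather than in a late bolt-on adapter, which is one plausible reading of "integrating visual and language processing at the core level."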
Helping Large Models Learn From Their Mistakes: NJUST, Baidu, and Others Propose a New Model Memory Method
量子位· 2025-12-17 09:07
Core Viewpoint
- The article discusses ViLoMem, a new method developed by Nanjing University of Science and Technology in collaboration with Baidu, which addresses large models' poor memory retention by separating visual and logical errors into distinct memory streams, enabling models to learn from past mistakes [1][5].

Group 1: ViLoMem Framework
- ViLoMem employs a dual-stream semantic memory system that lets models remember visual and logical errors separately, improving their ability to learn from experience [15][16].
- The framework has two main components, memory generation and memory retrieval, which together improve performance without altering model parameters [18][5].

Group 2: Memory Generation
- When the model fails on a task, ViLoMem activates two branches: a visual analysis module to identify visual errors and a logical analysis module to pinpoint logical mistakes, generating structured guidelines for each error type [19][20][21].
- Newly generated memories are similarity-matched against existing ones and either merged into more abstract rules or stored in new memory slots, preventing memory overload while abstracting general semantic patterns [22][24].

Group 3: Memory Retrieval
- Retrieval differs between streams: visual memory uses a two-stage process of image-level similarity search followed by question-semantic filtering [27][28].
- Logical memory retrieval first interprets the problem and then searches for relevant rules, which is more effective than simple keyword matching [29].

Group 4: Performance Improvement
- ViLoMem shows significant gains across six multimodal reasoning benchmarks, with notable improvements on mathematical tasks, such as +6.48 for GPT-4.1 on MathVision [2][31].
- Smaller models benefit even more: Qwen3-VL-8B gains +4.38 on MMMU [31].

Group 5: Cross-Model Memory Transfer
- An experiment showed that smaller models can score higher by using memories generated by larger models, a form of "free knowledge distillation" [34][36].
- This suggests experiences from stronger models can directly enhance weaker models without fine-tuning [36].
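The dual-stream store-and-retrieve pattern above can be sketched with embedding similarity. This is a minimal sketch of the idea only; the encoder, the stored rules, and the threshold are all invented, not ViLoMem's actual components:

```python
import numpy as np

DIM = 32

def embed(text: str) -> np.ndarray:
    """Stand-in for a real text/image encoder: a hash-seeded random unit vector."""
    v = np.random.default_rng(abs(hash(text)) % (2**32)).normal(size=DIM)
    return v / np.linalg.norm(v)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b)  # unit vectors, so the dot product is cosine similarity

# Two separate memory stores, one per error type, mirroring the
# dual-stream design (these entries are illustrative, not from the paper).
visual_memory = [
    {"key": embed("bar chart axis misread"), "rule": "Check axis scale before comparing bars."},
]
logical_memory = [
    {"key": embed("off-by-one in counting"), "rule": "Recount items explicitly, one by one."},
]

def retrieve(store: list, query_vec: np.ndarray, threshold: float = 0.5):
    """Return the best-matching guideline if it clears a similarity threshold."""
    best = max(store, key=lambda m: cosine(m["key"], query_vec))
    return best["rule"] if cosine(best["key"], query_vec) > threshold else None

# A new failure is routed to both streams with stream-specific queries.
rule_v = retrieve(visual_memory, embed("bar chart axis misread"))
rule_l = retrieve(logical_memory, embed("off-by-one in counting"))
print(rule_v)
print(rule_l)
```

Because retrieval only prepends recalled guidelines to the prompt, the base model's parameters are untouched, which is also what makes the cross-model memory transfer in Group 5 possible.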
Mining Motion Cues From Attention: Unlocking 4D Scene Reconstruction Without Training
量子位· 2025-12-17 09:07
Contributed by the VGGT4D team
QbitAI | WeChat account QbitAI

How can 3D Foundation Models trained on static scenes gain the ability to handle dynamic 4D scenes, without any additional training cost?

A research team from HKUST (Guangzhou) and Horizon Robotics proposes VGGT4D. By analyzing the internal mechanisms of the Visual Geometry Transformer (VGGT), the work discovers and exploits motion cues hidden in its attention layers.

The core question behind VGGT4D: can 4D perception be mined directly from a pretrained 3D foundation model, with no extra training?

As a training-free framework, VGGT4D achieves strong performance on dynamic object segmentation, camera pose estimation, and long-sequence 4D reconstruction.

The challenge of moving from 3D to 4D

In recent years, 3D foundation models such as VGGT and DUSt3R have excelled at static scene reconstruction. Their performance, however, often degrades significantly on dynamic 4D scenes containing moving objects (such as pedestrians and vehicles). Object motion not only disturbs background geometry modeling but also causes severe camera pose drift.

Existing solutions typically face two kinds of challenges: high computational or training cost: relying on heavy test-time ...
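The excerpt does not specify how the motion cues are extracted. As a hedged illustration of the general idea of thresholding a per-token attention statistic to separate dynamic from static tokens without training, here is a toy sketch; the attention values, the "staticness" score, and the threshold are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy attention map over 100 image tokens from a frozen transformer.
# Assumption for this sketch: tokens on a moving object receive
# anomalously weak attention from the rest of the scene.
n_tokens = 100
attention = rng.uniform(0.5, 1.0, size=(n_tokens, n_tokens))
moving = np.arange(10)        # tokens covering the moving object
attention[:, moving] *= 0.2   # other tokens attend weakly to them

# Column-mean attention as a per-token "staticness" score; a simple
# threshold then yields a training-free dynamic-object mask.
score = attention.mean(axis=0)
dynamic_mask = score < 0.5 * score.mean()
print(np.flatnonzero(dynamic_mask).tolist())  # → [0, 1, ..., 9]
```

Once dynamic tokens are masked out, the static remainder can be fed to the unmodified 3D pipeline for geometry and pose estimation, which is consistent with the training-free framing above.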
QbitAI Is Hiring Editors and Writers
量子位· 2025-12-17 09:07
From the editorial desk, Aofei Temple
QbitAI | WeChat account QbitAI

The AI wave is still surging, but if you don't yet know how to take part... why not join QbitAI?

We are a content platform centered on tracking the latest developments in AI. After eight years, we have top-tier influence, broad and well-recognized industry resources, and a prime vantage point for observing and learning at the frontier of the era.

We are currently hiring in three directions, and hope you are (or can become) a content expert in one of them:

- AI industry: covering infrastructure-layer innovation, including chips, AI Infra, and cloud computing;
- AI finance: covering AI venture capital and earnings reports, tracking capital flows along the industry chain;
- AI product: covering AI progress in applications and hardware devices.

All positions are full-time, based in Zhongguancun, Beijing. Roles at every seniority level are open; apply according to your background and experience:

- Experienced hires: editor, lead writer, and managing editor levels, matched to ability;
- Campus hires: fresh graduates; internships accepted, with conversion to full-time possible.

By joining us, you can:

- Stand at the crest of the AI wave: be the first to encounter the latest AI technologies and products, and build a complete picture of the field.
- Master new AI tools: apply new AI technologies and tools to your work to boost efficiency and creativity.
- Build personal influence: by writing exclusive, original ...
NVIDIA's Moat Just Got Wider: Quiet Acquisition of the Ace Open-Source Compute Scheduling Tool Used by Over Half of the World's Top Supercomputers; Even Thinking Machines Depends on It
量子位· 2025-12-17 03:38
Core Viewpoint
- NVIDIA's acquisition of SchedMD is a strategic move to widen its competitive edge in HPC and AI by folding SchedMD's Slurm system into its ecosystem, extending its influence beyond hardware into resource scheduling [1][11][15].

Group 1: SchedMD Overview
- SchedMD, founded in 2010, specializes in large-scale computing task scheduling technology [5].
- Its core asset is the open-source workload management system Slurm, which efficiently allocates computing resources across large numbers of devices for tasks such as AI model training and scientific research [6][8].
- Slurm is used by more than half of the TOP500 supercomputers worldwide, as well as by major tech companies such as Meta and by various AI startups [3][9][10].

Group 2: Strategic Importance of the Acquisition
- Integration costs are low thanks to a decade of collaboration between NVIDIA and SchedMD, enabling a smooth transition of technology and team integration [12][13].
- The strategic value lies in extending NVIDIA's influence from hardware to scheduling: even clients using AMD or Intel chips will depend on NVIDIA's ecosystem through Slurm [14][15].
- The move further solidifies NVIDIA's position with key customer groups, including supercomputing centers, cloud providers, and AI enterprises [16].

Group 3: Future Considerations
- NVIDIA has committed to keeping Slurm open-source and vendor-neutral, preserving access for users worldwide [18].
- Concerns remain about NVIDIA's ongoing investment in critical projects like Slinky, which supports Slurm-on-Kubernetes services, raising questions about the future stability of related offerings [19][21].
Google's Full-Chain Support for Going Global: A 3-Person Team Orchestrating a Thousand Agents Could Become a Unicorn | MEET2026
量子位· 2025-12-17 03:38
Core Insights
- The future will feature autonomous collaboration among intelligent agents that solve complex problems, automate workflows, and issue tasks autonomously, creating a new business model [1].
- AI agents are becoming new units of productivity, giving new meaning to the globalization logic of startups [2].
- The intelligent agent sector is only beginning; significant change is expected over the next one to two years, presenting a major opportunity for Chinese startups going global [3].

Google's Integrated Solutions for Startups
- Google has launched AI-driven integrated solutions to help startups globalize efficiently [4].
- The MEET2026 conference attracted nearly 1,500 offline attendees and over 3.5 million online viewers, highlighting strong interest in the topic [6].
- Startups face varied challenges during globalization, and Google's ecosystem can support them at every stage [7].

Stages of Startup Globalization
- The five stages of startup globalization are:
  1. **Ideation and Strategic Planning**: founders gather information and analyze competitors, often using Gemini for market research [8].
  2. **Product Launch**: Google Cloud provides stable cloud infrastructure support [9].
  3. **Market Validation**: Google Ads assists in reaching target customers [9].
  4. **Market Expansion**: Google Play and other services support expansion into new markets [9].
  5. **IPO Maturity**: Google's data analysis tools aid the final push before going public [10].

Challenges and Innovations in AI
- The AI field is evolving rapidly; challenges such as hallucination (inaccurate or fabricated information) are being addressed through better model training and engineering practices [11].
- The A2A (Agent-to-Agent) protocol aims to facilitate communication between intelligent agents across different enterprises [16].
- The shift from SaaS subscription models to outcome-based payment reflects a fundamental change in business logic, allowing small teams to scale significantly [18].

Gemini's Evolution and Capabilities
- Gemini has evolved from its initial version to Gemini 3, with significant advances in reasoning, understanding, and problem-solving [15].
- Key capabilities of Gemini 3 include:
  1. **Extended Context Window**: supports 1 million tokens, underscoring the importance of context engineering [21].
  2. **Native Multimodal Capability**: understands text, video, images, and audio with improved clarity and accuracy [22].
  3. **Function Calling Ability**: lets intelligent agents use external tools and services [23].
- Gemini 3 is described as the safest model to date, having undergone comprehensive safety assessments [24].
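Function calling, listed above, pairs a model's structured tool request with application-side dispatch. A generic toy loop illustrating the pattern; this does not use any real Gemini API, and the request format and tool names are invented for illustration:

```python
import json

# Application-side tool registry; in a real agent these would call
# external services (the names here are invented for illustration).
def get_weather(city: str) -> str:
    return f"Sunny in {city}"

TOOLS = {"get_weather": get_weather}

def handle_model_turn(model_output: str) -> str:
    """If the model emitted a JSON tool call, dispatch it; else pass the text through."""
    try:
        request = json.loads(model_output)
    except json.JSONDecodeError:
        return model_output  # plain-text answer, no tool needed
    result = TOOLS[request["name"]](**request["arguments"])
    # In a full loop, the result would be fed back to the model as a tool message.
    return result

# Simulated model turn requesting a tool call:
print(handle_model_turn('{"name": "get_weather", "arguments": {"city": "Beijing"}}'))
# Simulated plain-text turn:
print(handle_model_turn("The capital of France is Paris."))
```

The key design point is that the model never executes anything itself; it only emits a structured request, and the application decides whether and how to run the tool.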
Every Company Is Using AI Agents, but Do They Really Understand Them? | MEET2026 Roundtable
量子位· 2025-12-17 01:04
Core Insights
- The article discusses the evolution of AI Agents, arguing that a significant milestone will be reached when two of the three apps an individual uses most are AI Agents [1][72].
- Key metrics for a good AI Agent include controllability, explainability, and the ability to execute tasks consistently and stably [1].
- Many AI Agents currently run at negative gross margins, with the cost of completing tasks exceeding users' willingness to pay, a core challenge for entrepreneurs [2][49].

Group 1: Industry Perspectives
- 2025 is anticipated as the "Year of the Agent," marking the initial deployment of AI Agents in standardized scenarios such as customer service and claims processing, validating their technical feasibility and value [1][4].
- The industry must align technology, product, and business model to create a sustainable positive feedback loop for AI Agents [2][4].
- The roundtable featured insights from industry leaders, highlighting the need for a rational, pragmatic approach to deploying AI Agents across sectors [3][10].

Group 2: Product Development and Use Cases
- AI Agents are evolving from simple tasks to more complex functions, such as creating presentations and coding, demonstrating significant capability gains [23][25].
- Successful implementations have shown ROI improvements, particularly with multimodal models that better understand images and video [20][21].
- Coding agents have progressed from writing code to executing entire workflows, yielding 3-5x efficiency gains in software engineering tasks [25][35].

Group 3: Key Challenges and Future Directions
- A major challenge is the gap between operational costs and users' willingness to pay, which hinders scalability for many startups [49].
- Future evolution will likely focus on enhancing reliability and integrating agents into physical environments, requiring advances in both foundational models and engineering capabilities [56][57].
- The industry anticipates a significant increase in AI Agent penetration in 2026, driven by major investments from leading tech companies and the emergence of user-friendly applications [58][61].
Overtaking Nano Banana: OpenAI's Flagship Image Generation Model Goes Live
量子位· 2025-12-17 01:04
Core Viewpoint
- OpenAI has launched its new image generation model, GPT-Image-1.5, aiming to enhance practical usability and compete directly with other leading models in the market [2][13][14].

Model Features
- The model introduces four main highlights: improved instruction adherence, precise editing, better detail retention, and up to 4x faster generation than its predecessor [3][5][14].
- GPT-Image-1.5 is designed to keep key elements such as lighting, composition, and character appearance consistent across input, output, and multi-round editing [15][19].

Performance and Comparisons
- In benchmark tests, GPT-Image-1.5 is rated first in both text-to-image and image editing, surpassing Nano Banana Pro [33].
- Its instruction adherence rate is reported at up to 90%, a significant lead over competitors [35].

Pricing and Accessibility
- The GPT-Image-1.5 API cuts input and output costs by 20% compared with the previous version [39].
- Pricing varies by resolution: high-quality images cost approximately $133 per thousand and low-quality images around $9 per thousand [40].

Market Positioning
- With fine-grained editing capabilities and reduced pricing, OpenAI is positioning GPT-Image-1.5 as a productivity tool, indicating a strategic shift toward practical applications [41].
- The model is now available to all ChatGPT users and API users globally, a significant step in OpenAI's product offerings [38].
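The per-thousand prices reported above translate directly to per-image costs; a quick arithmetic check:

```python
# Per-image cost from the reported per-thousand prices (USD).
high_quality_per_1k = 133.0
low_quality_per_1k = 9.0

high_per_image = high_quality_per_1k / 1000  # $0.133 per image
low_per_image = low_quality_per_1k / 1000    # $0.009 per image
ratio = high_quality_per_1k / low_quality_per_1k

print(f"high: ${high_per_image:.3f}/image, low: ${low_per_image:.3f}/image, ~{ratio:.1f}x apart")
```

So the quality tiers differ by roughly 15x in unit cost, which is the gap a caller weighs when choosing resolution per request.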