AI video generation through the lens of "world understanding": how do Veo3 and Sora2 measure up? A new benchmark arrives
QbitAI · 2025-10-27 08:26
Contributed by the VideoVerse team | QbitAI (WeChat official account QbitAI). In recent years, Text-to-Video (T2V) models have made striking progress: from static-frame quality to coherent video storytelling, their capabilities have advanced substantially. The recent Sora2 craze in particular has people wondering whether a T2V model is already a true "world model". Design goals and core content: VideoVerse is built to evaluate T2V models on event-level temporal causality and world knowledge (physics, materials, common sense). The team defines ten evaluation dimensions from two perspectives: 1. Dynamic: Event Following (event ordering and causality), Mechanics, Interaction, Material Properties, Camera Control. 2. Static: Natural Constraints (natural/physical constraints), Common Sense, Attribution Correctness, 2D Layout, 3D Depth. Each prompt maps to several binary (Yes/No) ...
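The evaluation protocol sketched above, several binary (Yes/No) checks per prompt with each check tied to one of the ten dimensions, can be illustrated with a minimal aggregation sketch. The data layout and function name below are hypothetical illustrations, not the VideoVerse codebase:

```python
from collections import defaultdict

# Hypothetical layout (not VideoVerse's actual format): each generated
# video receives several binary judgments, each tagged with a dimension.
judgments = [
    {"dimension": "Event Following", "passed": True},
    {"dimension": "Event Following", "passed": False},
    {"dimension": "Mechanics",       "passed": True},
    {"dimension": "Common Sense",    "passed": True},
]

def per_dimension_accuracy(judgments):
    """Aggregate binary Yes/No judgments into a pass rate per dimension."""
    totals = defaultdict(lambda: [0, 0])  # dimension -> [passed, total]
    for j in judgments:
        totals[j["dimension"]][1] += 1
        if j["passed"]:
            totals[j["dimension"]][0] += 1
    return {dim: passed / total for dim, (passed, total) in totals.items()}

print(per_dimension_accuracy(judgments))
# {'Event Following': 0.5, 'Mechanics': 1.0, 'Common Sense': 1.0}
```

Binary checks keep the judging step simple and auditable; the per-dimension averages are what a leaderboard would report.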
Registration for the Annual AI Rankings is open! Five awards seeking the pioneering forces of the AI+ era
QbitAI · 2025-10-27 05:37
From the organizing committee, Aofeisi | QbitAI (WeChat official account QbitAI). To let more practitioners feel the leap of the intelligence wave, and to offer applause and encouragement to fellow travelers, we are officially opening registration for the 2025 Annual AI Rankings. The selection spans three dimensions (companies, products, and people) with five award categories: 2025 AI Flagship Company of the Year, 2025 AI Promising Startup of the Year, 2025 AI Outstanding Product of the Year, 2025 AI Outstanding Solution of the Year, and 2025 AI Person of the Year. Companies are warmly invited to apply! Let us witness the stars of the year together and light the way forward. Detailed criteria and registration instructions follow. The 2025 AI Flagship Company of the Year will recognize the most comprehensively capable companies in China's AI field. Eligibility: 1. Registered in China, or with a main business primarily serving the Chinese market; 2. Main business in AI or related industries, or AI applied extensively to the main business, with a leading position in its segment; 3. Mature products or services with real customer adoption and market recognition; 4. Notable breakthroughs over the past year in technical innovation, product deployment, market expansion, or business model. Selection criteria: The 2025 AI Promising Startup of the Year focuses on innovative entrepreneurial forces in China's AI field and will recognize the most investment-worthy and ...
Meituan's video generation model is here! Open-source SOTA right out of the gate
QbitAI · 2025-10-27 05:37
Yishui, Luyu from Aofeisi | QbitAI. Meituan, you really are addicted to crossing over, aren't you! (doge) That's right: the latest open-source SOTA video model again comes from this "food-delivery" company. The model, LongCat-Video, has 13.6B parameters, supports both text-to-video and image-to-video, and can generate videos several minutes long. Judging from the official demos, the generated videos are not only more realistic and natural, but the model's grasp of physics has strengthened yet again. Whether it is skateboarding in mid-air, a one-second special-effects transformation, or a first-person cycling video that must keep the frame consistent throughout (running well over four minutes): look closely, and the "AI flavor" of the footage is noticeably reduced. Benchmark results are equally impressive: its text-to-video capability is top-tier among open-source models, with overall quality surpassing PixVerse-V5 and Wan2.2-T2V-A14B, and on some core dimensions it even rivals Google's latest and strongest closed-source model, Veo3. And because it ships under the commercially permissive MIT license, even a Hugging Face senior director reacted with a triple question: "A Chinese team just released an MIT-licensed foundation video model???" Its long-video generation (stable 5-minute output) has also been hailed as "one step closer to the ultimate form of video AI". So, how good is a video model from a food-delivery company? Here are more examples. ...
Laying out OpenAI's full product line is startling; Altman lives up to his YC pedigree
QbitAI · 2025-10-27 05:37
Core Insights - OpenAI has adopted a strategy similar to major internet companies, focusing on expanding its product lines while leveraging its distribution channel, ChatGPT, which has approximately 1 billion users [2][4][27] - The approach involves creating a strong core application to monopolize distribution, followed by rapid experimentation with various products to identify viable offerings [25][28][30] Product Line Overview - OpenAI is developing a diverse range of products, including: - Collaborative tools for real-time interaction among ChatGPT users [9] - New AI models combining traditional large language models with reasoning capabilities [10] - ChatGPT-agent for creating and editing spreadsheets and presentations [11] - An AI-integrated web browser (Atlas) [12] - AI programming assistant (A-SWE) that simulates advanced software engineering tasks [14] - Humanoid robots and AI-driven personal devices [15][16] - Social media features for sharing ChatGPT usage experiences [17] - Personalized shopping recommendations within ChatGPT [19] - Customized models for internal AI tools based on unique client data [20] - Music generation AI for creating music from scratch [21] - The foundational ChatGPT chatbot [22] Strategic Goals - The strategy aims to first monetize through direct revenue-generating products like the AI programming assistant and then create an immersive ecosystem to retain users [32][33] - Future aspirations include integrating AI into everyday life through robots and personal devices, expanding the influence of AI beyond the virtual realm [34] Innovation and Risk Management - OpenAI's approach minimizes innovation risks by allowing for product failures without jeopardizing the core user base [29] - This strategy reflects a shift in the competitive landscape of AI, moving towards ecosystem-based competition rather than isolated breakthroughs [36] Historical Context - The current strategy is influenced by CEO Sam Altman's previous experience at Y 
Combinator, where the focus was on rapid growth through diverse product offerings [39][40] - OpenAI has transitioned from a purely academic institution to an AI-driven internet company, balancing profit pursuits with its mission to ensure AGI benefits humanity [43][45]
Goodbye, GUI! A Chinese Academy of Sciences team unveils an "LLM-friendly" computer-use interface
QbitAI · 2025-10-27 05:37
Core Viewpoint - The article discusses the limitations of current LLM agents in automating computer operations, attributing the main bottleneck to the traditional command-based graphical user interface (GUI) that has been in use for over 40 years [2][4]. Group 1: Issues with Current LLM Agents - Current LLM agents face two major pain points: low success rates and inefficiency when handling complex tasks [7]. - The command-based design of GUIs requires LLMs to perform both strategic planning and detailed operational tasks, leading to inefficiencies and increased cognitive load [6][9]. - Human users excel in visual recognition and quick decision-making, while LLMs struggle with visual information and have slower response times [8]. Group 2: Proposed Solution - Declarative Interfaces - The research team proposes a shift from command-based to declarative interfaces (GOI), allowing LLMs to focus on high-level task planning while automating the underlying navigation and interaction [10][12]. - GOI separates the strategy (what to do) from the mechanism (how to do it), enabling LLMs to issue simple declarative commands [14][15]. - The implementation of GOI involves two phases: offline modeling to create a UI navigation graph and online execution using a simplified interface [16][19]. Group 3: Experimental Results - The introduction of GOI significantly improved performance, with success rates increasing from 44% to 74% when using the GPT-5 model [21]. - Failure analysis showed that after implementing GOI, 81% of failures were due to strategic errors rather than mechanism errors, indicating a successful reduction in low-level operational mistakes [24][25]. Group 4: Future Implications - The research suggests that GOI provides a clear direction for designing interaction paradigms that are more suitable for large models [27]. 
- It raises the question of whether future operating systems and applications should natively offer LLM-friendly declarative interfaces to facilitate the development of more powerful and versatile AI agents [28].
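The strategy/mechanism split described above can be sketched in miniature. Everything below is a hypothetical illustration of the declarative idea, not the paper's actual GOI API (graph contents, function names, and action strings are invented): offline, the UI is modeled as a navigation graph whose edges are primitive interactions; online, the agent merely declares a target state and the runtime plans the low-level actions via shortest-path search.

```python
from collections import deque

# Offline phase (illustrative): UI modeled as a navigation graph where
# each edge is one primitive interaction that moves between UI states.
UI_GRAPH = {
    "Home":     {"Settings": "click(gear_icon)"},
    "Settings": {"Display": "click(display_tab)", "Home": "click(back)"},
    "Display":  {"Settings": "click(back)"},
}

def navigate(current, target):
    """Resolve a declarative goal ('be on `target` page') into a shortest
    sequence of primitive actions, via BFS over the navigation graph."""
    queue, seen = deque([(current, [])]), {current}
    while queue:
        node, actions = queue.popleft()
        if node == target:
            return actions
        for nxt, action in UI_GRAPH.get(node, {}).items():
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, actions + [action]))
    return None  # target unreachable from current state

# The LLM only declares WHERE to be; the mechanism figures out HOW.
print(navigate("Home", "Display"))
# ['click(gear_icon)', 'click(display_tab)']
```

This is the sense in which low-level operational mistakes drop out of the LLM's job: clicking errors become planner errors, and the model's remaining failures are strategic, matching the 81% figure reported above.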
Tesla's world simulator debuts at ICCV! VP personally demystifies the end-to-end autonomous driving roadmap
QbitAI · 2025-10-27 05:37
Core Viewpoint - Tesla has unveiled a world simulator for autonomous driving, showcasing its potential to generate realistic driving scenarios and enhance the training of AI models for self-driving technology [1][4][12]. Group 1: World Simulator Features - The simulator can create new challenging scenarios for autonomous driving tasks, such as unexpected lane changes by other vehicles [4][5]. - It allows AI to perform driving tasks in existing scenarios, avoiding pedestrians and obstacles [7][9]. - The generated scenario videos can also serve as a gaming experience for human users [9]. Group 2: End-to-End AI Approach - Tesla's VP Ashok Elluswamy emphasized that end-to-end AI is the future of autonomous driving, applicable not only to driving but also to other intelligent scenarios like the Tesla Optimus robot [12][13][14]. - The end-to-end neural network utilizes data from various sensors to generate control commands for the vehicle, contrasting with modular systems that are easier to develop initially but less effective in the long run [17]. - The end-to-end approach allows for better optimization and handling of complex driving situations, such as navigating around obstacles [18][21]. Group 3: Challenges and Solutions - One major challenge for end-to-end autonomous driving is evaluation, which Tesla addresses with its world simulator that trains on a vast dataset [22][24]. - The simulator can also facilitate large-scale reinforcement learning, potentially surpassing human performance [24]. - Other challenges include the "curse of dimensionality," interpretability, and safety guarantees, which require processing vast amounts of data [26][27][28]. Group 4: Data Utilization - Tesla collects data equivalent to 500 years of driving every day, using a complex data engine to filter high-quality samples for training [29][30]. - This extensive data collection enhances the model's generalization capabilities to handle extreme situations [30]. 
Group 5: Technical Approaches in the Industry - The industry is divided between two main approaches: VLA (Vision-Language-Action) models and world models, with companies like Huawei and NIO representing the latter [38][39]. - VLA proponents argue it leverages existing internet data for better understanding, while world model advocates believe it addresses the core issues of autonomous driving [41][42]. - Tesla's approach is closely watched due to its historical success in selecting effective strategies in autonomous driving development [43][44].
Camera parameters turn into images in seconds! A new model bridges understanding and generation, supporting image creation from any viewpoint
QbitAI · 2025-10-27 03:31
Core Viewpoint - The article discusses the introduction of the Puffin unified multimodal model, which integrates the understanding of camera parameters and the generation of corresponding perspective images, addressing previous limitations in multimodal models [2][12]. Research Motivation - The ability to understand scenes from any perspective and hypothesize about the environment beyond the field of view allows for the mental recreation of a real-world with free viewpoints [8]. - Cameras serve as crucial interfaces for machines to interact with the physical world and achieve spatial intelligence [9]. Model Design - The Puffin model combines language regression and diffusion-based generation capabilities, enabling understanding and creation of scenes from any angle [12]. - A geometric-aligned visual encoder is introduced to maintain geometric fidelity while ensuring strong semantic understanding, addressing performance bottlenecks in existing models [14]. Thinking with Camera Concept - The concept of "thinking with camera" allows for the decoupling of camera parameters in a geometric context, establishing connections between spatial visual cues and professional photography terminology [20][21]. - The model incorporates spatially constrained visual cues and professional photography terms to bridge the gap between low/mid-level camera geometry and high-level multimodal reasoning [22][23]. Shared Thinking Chain - A shared thinking chain mechanism is introduced to unify the reasoning processes between controllable image generation and understanding tasks, enhancing the model's ability to generate accurate spatial structures [28]. Puffin-4M Dataset - The Puffin-4M dataset consists of approximately 4 million image-language-camera triples, addressing the scarcity of multimodal datasets in the spatial intelligence domain [29][30]. 
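To make "camera parameters" concrete: a common minimal parameterization pairs orientation (e.g. roll and pitch) with field of view, from which a pinhole intrinsic matrix follows by standard geometry. The sketch below is textbook pinhole-camera math, not Puffin's actual code, and the function name is invented for illustration:

```python
import math

def intrinsics_from_fov(width, height, fov_x_deg):
    """Build a pinhole intrinsic matrix K from the horizontal field of view.

    Standard pinhole geometry: focal length in pixels
    f = (W/2) / tan(FoV_x / 2), principal point at the image center,
    square pixels assumed.
    """
    f = (width / 2) / math.tan(math.radians(fov_x_deg) / 2)
    return [
        [f,   0.0, width / 2],   # fx, skew, cx
        [0.0, f,   height / 2],  # -,  fy,   cy
        [0.0, 0.0, 1.0],
    ]

K = intrinsics_from_fov(640, 480, 90.0)
print(K[0][0])  # focal length in pixels (~320 for a 90° horizontal FoV)
```

A wider field of view means a shorter focal length, which is exactly the kind of geometric cue (perspective distortion, horizon placement) the model must connect to high-level reasoning.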
Experimental Results - Puffin demonstrates superior performance in camera understanding tasks, achieving significant improvements in accuracy compared to existing methods [36][38]. - The model's robustness is evident in various scene configurations, showcasing its capability for controllable image generation [41]. Applications - Puffin can assist in the insertion of virtual 3D objects into natural scene images through precise camera parameter predictions [43]. - The model can be flexibly extended to various cross-perspective tasks, including spatial imagination and world exploration, maintaining spatial consistency in generated results [44]. Future Plans - The team aims to enhance Puffin's cross-perspective capabilities and expand its application to video generation and understanding centered around camera parameters, promoting broader use in dynamic and immersive scenarios [45].
The first step of OpenAI's IPO plan is revealed; Altman's unorthodox moves stun Wall Street
QbitAI · 2025-10-27 03:31
Mengchen from Aofeisi | QbitAI. OpenAI is one step closer to an IPO. According to the latest news, SoftBank has approved the remaining $22.5 billion of its investment in OpenAI, on the condition that OpenAI completes its restructuring by year-end, paving the way for a listing. Meanwhile, Altman's unorthodox maneuvers have come to light: bypassing investment banks and lawyers, he relied mainly on trusted lieutenants to negotiate with NVIDIA, AMD, and others, orchestrating chip deals worth $1.5 trillion. This unconventional process produced agreements lacking detailed financial terms and constituting circular transactions, drawing broad criticism from analysts. If OpenAI's restructuring falls through, the money shrinks. SoftBank has truly gone all-in this time: the second $22.5 billion tranche, on top of the earlier $7.5 billion, brings its total investment in OpenAI to $30 billion. But the money is not free. OpenAI must complete its restructuring by year-end, converting from a nonprofit into a public benefit corporation, to clear the path to an IPO. This is part of the $41 billion funding round OpenAI announced in April, which pushed OpenAI's valuation to $260 billion. SoftBank has also left itself an out: if OpenAI cannot finish the restructuring by year-end, the investment shrinks from $30 billion to $20 billion. With NVIDIA, Oracle, AM ...
I'd suggest every delivery rider and courier get a pair of these glasses
QbitAI · 2025-10-27 03:31
Whoa! These glasses really should be issued to every courier~ Dispatching a package takes just a glance; no more running around with a handheld scanner. Yishui from Aofeisi | QbitAI. Now isn't that a textbook case of technology changing lives and freeing couriers' hands? One tester, after trying them firsthand, praised their hands-free capability: "You don't have to look down at your phone; you can look ahead, past the screen. You stay focused on what's in front of you." And with scanning plus real-time navigation, they would work for food-delivery riders too, a group that constantly needs to check their phones yet badly needs to stay alert for safety. When carrying an armful of packages to a door (hands full, unable to navigate), the glasses can guide the way in real time. No more teasing: this is the "gadget" that retail giant Amazon has just issued to its own delivery drivers, the smart glasses "Amelia". An early version is currently being field-tested by hundreds of drivers. Amazon says that, powered by advanced computer vision and AI, the glasses let logistics workers scan packages, get walking directions to the drop-off point, and capture proof of delivery, all without pulling a device out of their pocket. So what is the magic behind these glasses, and what is Amazon really after? Read on. Mass production by mid-next year, competing head-to-head with Meta. Zoomed in, the smart glasses Amazon built for this vertical logistics scenario look ...
99% of AI products have no real moat; startups need to nail "niche scenarios + ecosystem synergy" | A conversation with AI podcast tool Podwise
QbitAI · 2025-10-26 08:13
This article originally appeared on QbitAI Think Tank (量子位智库), by AI 100 Interviews. QbitAI Think Tank: connecting AI innovation, providing industry research. Analysts: Liu Mengyuan, Liu Tieying. QbitAI Think Tank | WeChat official account AI123All. Has AI podcasting already become a good business? Although the market's potential size remains unclear, different startups are carving out different directions on this niche track. About Podwise: Earlier, QbitAI Think Tank interviewed ListenHub, which uses AI to turn text into high-quality podcasts; later came 来福, positioned as a "personal AI radio station" that generates podcast content from user requests. Another representative product QbitAI Think Tank has been tracking takes the podcast + knowledge management route, using AI to transcribe, summarize, and structure podcasts: Podwise. Built by an independent team, Podwise was profitable from the day it launched. To dig into the logic, design, and plans behind the product, QbitAI Think Tank invited Podwise's founding team, who are also the three hosts of the 硬地骇客 podcast: Saito, 归归, and 一啸, for an in-depth conversation. In this interview, the three founders discussed Podwise's target users and use cases as an AI podcast productivity tool, how Podwise identified its PMF (product-market fit) and found its paying users, and also shared ...