量子位

No NeRF or Gaussian-splatting post-processing needed: turning videos into game-ready models becomes reality! The new method averages just 60 seconds per frame | ICCV 2025
量子位· 2025-07-19 05:15
Core Viewpoint - The article presents V2M4, a new method developed by a research team from KAUST that generates usable 4D mesh animations directly from monocular video, significantly improving the efficiency and usability of animation and game content creation [1][6].

Summary by Sections

Method Overview - V2M4 builds a systematic multi-stage pipeline covering camera trajectory recovery, appearance optimization, topology unification, and texture synthesis, allowing videos to be turned into models quickly [2][6].

Performance Metrics - The generated appearance and structure are faithfully reproduced, with an average processing time of about 60 seconds per frame, significantly faster than existing methods. The approach also handles long videos, performing well even on sequences of 300 frames [4][20].

Challenges in Video-to-Animation Conversion - Converting a video into continuous animated mesh assets has long been a challenge in visual computing, traditionally requiring costly setups such as multi-camera rigs and motion capture. Implicit methods such as NeRF can replicate appearance but struggle to output topologically consistent explicit meshes [4][5].

Camera Trajectory Recovery - V2M4 employs a three-stage camera estimation strategy to reconstruct the camera perspective for each video frame, converting "camera motion" into "mesh motion" to accurately model dynamic scenes [10][11].

Appearance Consistency Optimization - To address appearance discrepancies, V2M4 adapts null-text optimization, a strategy from image editing, to fine-tune the conditional embeddings of the generation network, enhancing the visual fidelity of the generated meshes [13][15].

Topology Unification - V2M4 introduces a frame-by-frame registration and topology unification mechanism, ensuring that all frames share a consistent topology, which is crucial for subsequent texture generation and temporal interpolation [16].

Texture Consistency Optimization - A shared global texture map is constructed for all frames to eliminate flickering and discontinuities, ensuring a smooth visual experience throughout the animation [17].

Animation Export - The generated mesh sequences are temporally interpolated and structurally packaged into a smooth animation that can be exported as a GLTF-compliant file for use in mainstream graphics and game engines [18].

Performance Validation - V2M4 is evaluated on challenging video data, demonstrating comprehensive advantages in reconstruction quality, runtime efficiency, and generalization [19][20].

Visual Comparison - Visual results show that V2M4 produces meshes with superior rendering detail, normal structure, and inter-frame consistency, achieving high fidelity and stable generation of continuous animations [21].
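As a concrete illustration of the "camera motion into mesh motion" step above: if each frame's mesh is reconstructed in its own camera coordinates, re-expressing it in one fixed reference camera makes the camera static and the mesh appear to move instead. The numpy sketch below is a minimal, assumed illustration of that coordinate change only; the function names and toy numbers are not from the V2M4 paper or code.

```python
# Minimal numpy sketch (hypothetical, not the authors' code) of turning recovered
# per-frame camera poses into mesh motion: vertices reconstructed in frame k's
# camera coordinates are re-expressed in a fixed reference camera frame.
import numpy as np

def to_homogeneous(R, t):
    """Build a 4x4 world-to-camera matrix from rotation R (3x3) and translation t (3,)."""
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T

def mesh_motion_from_camera_motion(verts_cam, T_world_to_cam, T_world_to_ref):
    """Re-express vertices given in frame-k camera coordinates in the reference camera."""
    verts_h = np.c_[verts_cam, np.ones(len(verts_cam))]          # Nx4 homogeneous
    T_cam_to_ref = T_world_to_ref @ np.linalg.inv(T_world_to_cam)
    return (verts_h @ T_cam_to_ref.T)[:, :3]

# Toy example: a triangle seen by a camera that has translated 0.5 m along x.
verts = np.array([[0.0, 0.0, 2.0], [0.1, 0.0, 2.0], [0.0, 0.1, 2.0]])
T_ref = to_homogeneous(np.eye(3), np.zeros(3))                   # frame 0 camera
T_k = to_homogeneous(np.eye(3), np.array([0.5, 0.0, 0.0]))       # frame k camera
print(mesh_motion_from_camera_motion(verts, T_k, T_ref))         # mesh shifted by -0.5 in x
```

Holding one reference camera fixed in this way is what lets a per-frame mesh sequence play back as object motion inside a game engine.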
An AI "stress interview": DeepSeek's performance plummets by nearly 30% | Tsinghua & Shanghai AI Lab
量子位· 2025-07-19 05:15
Core Viewpoint - The article discusses a new "stress test" framework called REST (Reasoning Evaluation through Simultaneous Testing), designed to evaluate the reasoning capabilities of large language models (LLMs) under pressure; it reveals significant performance drops, particularly in multi-task scenarios [1][3][20].

Group 1: Stress Test Framework
- The REST framework presents multiple questions to models simultaneously, simulating real-world complex reasoning scenarios [2][6].
- The framework was developed by research teams from Shanghai AI Lab, Tsinghua University, and Renmin University of China to address limitations in current evaluation methods [1][6].

Group 2: Performance Findings
- Top models such as DeepSeek-R1 showed a drastic accuracy drop of 29.1% on the AIME24 test set under stress conditions [3][11].
- The performance of various models was significantly affected, with smaller models (7B parameters) deteriorating faster under pressure than larger models (32B parameters) [13][19].

Group 3: Evaluation Limitations
- Current evaluation methods have three main issues: low differentiation among top models, high costs of developing new test questions, and a lack of realism when testing single questions [5][6].
- REST addresses these issues by combining multiple questions into a single prompt, allowing for a more comprehensive assessment of reasoning abilities [6][20].

Group 4: Key Reasoning Abilities
- The stress test evaluates several critical reasoning abilities, including context budget allocation, cross-question interference resistance, and dynamic cognitive load management [7][8][9].
- Models that effectively manage token allocation under pressure tend to perform better, demonstrating adaptive distribution of reasoning effort [17][19].

Group 5: Implications for Future Development
- The findings suggest that traditional single-question evaluations may overlook significant reasoning flaws, such as question omission and incorrect reasoning summaries [20].
- REST provides a new paradigm for constructing evaluation datasets that are more cost-effective and closer to real-world applications, offering insights for developing more robust LLMs [20].
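To make the "combine multiple questions into a single prompt" mechanism concrete, here is a minimal assumed sketch of how such a stress prompt and its accuracy metric could be wired up; `ask_model`, the answer parser, and the dummy data are placeholders, not the REST implementation.

```python
# Illustrative sketch (hypothetical, not the REST codebase): pack several questions
# into one prompt and score a model under this multi-question "stress" condition.
def build_stress_prompt(questions):
    """Concatenate several problems into a single prompt."""
    header = "Solve ALL of the following problems. Answer each one separately.\n\n"
    body = "\n\n".join(f"Problem {i + 1}: {q}" for i, q in enumerate(questions))
    return header + body

def stress_accuracy(question_answer_pairs, ask_model, extract_answers):
    """Fraction of questions answered correctly when asked simultaneously."""
    questions = [q for q, _ in question_answer_pairs]
    golds = [a for _, a in question_answer_pairs]
    response = ask_model(build_stress_prompt(questions))
    preds = extract_answers(response, len(questions))   # e.g. parse "Problem i: ..."
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

# Dummy model that only solves the first problem, mimicking the "question omission"
# failure mode that single-question evaluations never expose.
pairs = [("What is 2+2?", "4"), ("What is 3*5?", "15")]
dummy_model = lambda prompt: "Problem 1: 4\nProblem 2: ran out of budget"
parse = lambda text, n: [line.split(":", 1)[1].strip() for line in text.splitlines()][:n]
print(stress_accuracy(pairs, dummy_model, parse))       # 0.5
```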
The real-time AI video generation model backed even by Karpathy: it converts live streams on the fly, with unlimited duration and near-zero latency
量子位· 2025-07-19 05:15
Core Viewpoint - The article discusses the innovative AI startup Decart and its groundbreaking video model MirageLSD, which enables real-time, zero-latency video generation, revolutionizing live streaming, gaming, and video communication [4][5][7].

Group 1: Technology and Features
- MirageLSD is the first AI model to achieve zero-latency, unbounded real-time video generation, producing continuous video streams without time limitations [4][5].
- The model runs 16 times faster than previous models, generating video at 24 frames per second while accepting ongoing prompts, transitions, and edits during generation [6][28].
- It addresses the "error accumulation" issue found in traditional autoregressive video models, maintaining temporal coherence while generating content frame by frame [9][11].

Group 2: Innovations and Mechanisms
- The model employs a custom real-time stream diffusion model (Live-Stream Diffusion) that generates each frame from previously generated frames and user prompts, rather than relying on the entire video sequence [14].
- It uses Diffusion Forcing to denoise single frames independently during training, preserving coherence in frame generation [15].
- It incorporates a history-augmentation strategy that simulates artifacts during training so potential errors can be corrected preemptively [16].

Group 3: Performance and User Interaction
- MirageLSD's architecture includes an improved Transformer model and a specially designed visual encoder, which increase processing speed and reduce latency [18][20].
- A dynamic input mechanism processes player inputs with ultra-low latency, allowing immediate responses to changes in the environment [22].
- Users can perform actions such as changing outfits or transforming objects with minimal delay, showcasing the model's interactive capabilities [23].

Group 4: Company Background and Future Developments
- Decart, the company behind MirageLSD, was founded in 2023 and previously launched the Oasis model, which also supports real-time interaction [25][26].
- The team plans to regularly release upgrades and new features for MirageLSD, including facial consistency, voice control, and precise object manipulation to enhance the user experience [28].
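The frame-by-frame loop described in Group 2 can be sketched in a few lines. The code below is a toy stand-in with an assumed structure (not Decart's implementation): a "denoiser" produces each frame from a short window of previous frames plus the current prompt signal, so the prompt can change mid-stream, and stored history is occasionally perturbed in the spirit of the artifact-simulation idea.

```python
# Toy live-stream generation loop (hypothetical stand-in, not MirageLSD code):
# each frame depends only on recent generated frames and the *current* prompt.
import numpy as np

rng = np.random.default_rng(0)

def toy_denoiser(noisy, history, prompt_strength):
    """Stand-in for a single-frame diffusion denoiser conditioned on history + prompt."""
    context = history.mean(axis=0) if len(history) else np.zeros_like(noisy)
    return 0.6 * context + 0.3 * noisy + 0.1 * prompt_strength

def live_stream(num_frames, window=4, frame_shape=(8, 8)):
    history, frames = [], []
    prompt_strength = 0.0
    for t in range(num_frames):
        if t == num_frames // 2:
            prompt_strength = 1.0          # user edits the prompt mid-stream
        noisy = rng.standard_normal(frame_shape)
        frame = toy_denoiser(noisy, np.array(history[-window:]), prompt_strength)
        # crude nod to artifact simulation: occasionally corrupt the stored frame so a
        # trained model must recover from its own errors rather than compound them
        stored = frame + (0.05 * rng.standard_normal(frame_shape) if t % 5 == 0 else 0)
        history.append(stored)
        frames.append(frame)
    return frames

print(len(live_stream(24)))   # 24 frames, generated strictly one at a time
```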
Unitree's Wang Xingxing: A-share listing guidance filing announced
量子位· 2025-07-19 05:15
Lu Yu, Bai Jiao, reporting from Aofeisi
量子位 | WeChat official account QbitAI

After Zhi Hui Jun, Wang Xingxing has also arrived at the door of the capital markets.

Nine years after founding, Unitree Robotics (宇树科技) has finally reached the IPO stage, and this time it is no longer a rumor. According to the China Securities Regulatory Commission (CSRC) website, Unitree has filed for listing guidance with the Zhejiang Securities Regulatory Bureau and published its IPO and listing guidance filing report. This marks the start of Unitree's formal push for an A-share listing.

Wang Xingxing's shareholding was disclosed at the same time: he directly holds 23.82% of the company and controls 34.76% of its equity in total through limited-partnership platforms. The question of which company will claim the title of "first embodied-intelligence stock" is once again open.

Unitree launches its IPO

If all goes smoothly, a comprehensive evaluation of the company will be conducted as early as October 2025 to produce listing application documents that meet the requirements.

| Guidance target | Hangzhou Unitree Technology Co., Ltd. (杭州宇树科技股份有限公司, "Unitree Technology", "the Company") | | |
| --- | --- | --- | --- |
| Date of establishment | August 26, 2016 | | |
| Registered capital | RMB 364.0179 million (36,401.7906万元) | Legal representative | Wang Xingxing |
| Registered address | Room 306, Building 1, No. 88 Dongliu Road, Xixing Street, Binjiang District, Hangzhou, Zhejiang | | |
| Controlling shareholder and shareholding ratio | The Company's controlling shareholder and actual controller is Mr. Wang Xingxing, who directly holds 23. ... | | |
DeepSeek finally loses the open-source throne, but its successor still comes from China
量子位· 2025-07-18 08:36
Core Viewpoint - Kimi K2 has surpassed DeepSeek to become the top open-source model globally, ranking fifth overall and closely trailing top proprietary models such as Musk's Grok 4 [1][19].

Group 1: Ranking and Performance
- Kimi K2 achieved a score of 1420, placing it fifth overall with only a slight gap to the leading proprietary models [2][22].
- The top ten models now all score above 1400, indicating that open-source models are increasingly competitive with proprietary ones [20][21].

Group 2: Community Engagement and Adoption
- Kimi K2 has gained significant attention in the open-source community, with 5.6K stars on GitHub and nearly 100,000 downloads on Hugging Face [5][4].
- The CEO of AI search-engine startup Perplexity has publicly endorsed Kimi K2, citing strong internal evaluations and plans for further training based on this model [5][27].

Group 3: Model Architecture and Development
- Kimi K2 inherits the DeepSeek V3 architecture but makes several parameter adjustments to optimize performance [9][12].
- Key modifications in Kimi K2's structure include increasing the number of experts, halving the number of attention heads, retaining only the first layer as dense, and implementing more flexible expert routing [13][15].

Group 4: Industry Trends and Future Outlook
- The stereotype that open-source models are inferior is being challenged, with industry experts predicting that open source will increasingly outperform proprietary models [19][24].
- Tim Dettmers of the Allen Institute for AI suggests that open-source models defeating proprietary ones will become more common, highlighting their importance in localizing AI experiences [25][27].
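A hedged sketch of the kind of configuration deltas listed in Group 3; the field names and values below are illustrative placeholders roughly in line with the public model configs, not an authoritative diff of the two architectures.

```python
# Illustrative MoE config comparison (placeholder names/values, treat as approximate).
deepseek_v3_style = {
    "n_routed_experts": 256,     # routed MoE experts
    "n_attention_heads": 128,    # attention heads per layer
    "first_k_dense_layers": 3,   # leading layers kept dense instead of MoE
    "experts_per_token": 8,      # experts activated per token by the router
}

kimi_k2_style = {
    "n_routed_experts": 384,     # more experts...
    "n_attention_heads": 64,     # ...half the attention heads
    "first_k_dense_layers": 1,   # only the first layer stays dense
    "experts_per_token": 8,      # activation count unchanged; routing constraints relaxed
}

# Print only the fields that differ from the DeepSeek-V3-style baseline.
for key, base in deepseek_v3_style.items():
    new = kimi_k2_style[key]
    if new != base:
        print(f"{key}: {base} -> {new}")
```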
A unicorn in just 8 months: the "European Cursor" is valued at US$1.8 billion
量子位· 2025-07-18 08:36
Shi Ling, reporting from Aofeisi
量子位 | WeChat official account QbitAI

Founded only eight months ago, it has already become the newest unicorn, with its valuation soaring to US$1.8 billion.

It now has more than 2.3 million free active users and 180,000 paying subscribers, and its first-month retention rate for paying users has even surpassed ChatGPT's.

This is not a Silicon Valley legend but a rising AI star from Sweden: Lovable, which is reshaping the way people program using natural language. The company recently closed the largest Series A round in Swedish history, raising US$200 million.

In the months since launch, Lovable has drawn a steady stream of praise. One user said Lovable amazed him: after several development platforms (Bolt, V0, Replit) failed at the task, it generated a complete product website within just a few hours. Others have used it to build a new game.

The European Cursor

Like Cursor, Lovable aims to use large models to help users build applications, but it targets a user group with even greater potential: people who cannot code. Some users even plan to use it to build a startup within 30 days, documenting the whole process in public. The company said in a press release that such users and their experiments likely account for the bulk of the 10 million projects created on the platform so far.

Co-founder and CEO Osika said: "Our mission is to let anyone build. With the help of large ..."
A 7B model rivals GPT-4o in "EQ": Tencent cracks the open-domain RL problem, with scores jumping 5x
量子位· 2025-07-18 06:16
Core Insights
- The article discusses the challenges and solutions in optimizing large models for emotional intelligence in multi-turn dialogues using reinforcement learning (RL) [2][4][5].
- The proposed RLVER framework integrates a user simulator that acts as both the interaction environment and the reward source, addressing the three main challenges of RL in this context [2][5][11].

Group 1: Challenges in RL for Emotional Intelligence
- The three main challenges identified are:
  1. Environmental challenge: creating a realistic and diverse interaction environment for the model [2][4]
  2. Reward challenge: converting subjective user satisfaction into stable, long-term rewards [2][11]
  3. Training challenge: achieving stable and efficient multi-turn online RL training on large language models (LLMs) [2][4]

Group 2: RLVER Framework
- The RLVER framework utilizes a user simulator that embodies diverse user profiles and interaction scenarios, providing a rich and dynamic learning environment [7][8].
- The simulator updates its emotional state based on the model's responses, providing personalized feedback that enhances the model's learning experience [9][10].

Group 3: Performance Outcomes
- The Qwen2.5-7B model trained with RLVER achieved a score of 79.2 on the Sentient-Benchmark, a significant increase from 13.3, positioning it alongside top commercial models such as GPT-4o and Gemini 2.5 Pro [16][17].
- The model maintained its general capabilities in areas like mathematics and coding, avoiding "catastrophic forgetting" [17].

Group 4: Insights from Training
- The introduction of explicit "think-then-say" prompts improved the model's ability to understand and respond empathetically, leading to two distinct paths toward empathy: "thinking models" and "reactive models" [20][21].
- The choice of optimization algorithm (PPO vs. GRPO) revealed that focusing on specific dimensions of emotional intelligence can yield better overall performance [23][27].

Group 5: User Simulator Insights
- The RLVER team created two types of user simulators, with findings indicating that a more forgiving environment (the Vanilla simulator) benefits early-stage model growth more than a more challenging one [29][30].
- Models with explicit thinking structures demonstrated greater robustness in challenging environments, suggesting that reasoning capabilities can mitigate training instability [33].
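A minimal toy loop can make the "simulator as environment and reward" design concrete. The sketch below assumes a simplified structure (it is not Tencent's RLVER code): the simulator holds a persona and a running emotion score, updates that score after every model reply, and the final emotion is returned as the episode reward for a PPO/GRPO-style update.

```python
# Toy multi-turn RL episode with a user simulator acting as environment + reward
# source (hypothetical sketch; the keyword heuristic stands in for a learned simulator).
from dataclasses import dataclass

@dataclass
class SimulatedUser:
    persona: str
    emotion: float = 0.0     # running satisfaction in [-1, 1]

    def react(self, reply: str) -> float:
        """Update emotion from the model's reply; crude keyword rule as a stand-in."""
        if "sorry" in reply.lower() or "understand" in reply.lower():
            self.emotion = min(1.0, self.emotion + 0.5)
        else:
            self.emotion = max(-1.0, self.emotion - 0.25)
        return self.emotion

def run_episode(policy, user: SimulatedUser, opening: str, turns: int = 3) -> float:
    """Multi-turn dialogue; the simulator's final emotion becomes the RL reward."""
    message = opening
    for _ in range(turns):
        reply = policy(user.persona, message)
        user.react(reply)
        message = f"(user now feels {user.emotion:+.2f}) please continue"
    return user.emotion      # fed back into a PPO/GRPO-style policy update

# Dummy "empathetic" policy for illustration.
empathetic = lambda persona, msg: "I understand how hard this is, and I'm sorry."
print(run_episode(empathetic, SimulatedUser("stressed student"), "I failed my exam."))
```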
Large models' IMO 2025 math competition results are out
量子位· 2025-07-18 06:16
Core Viewpoint - The article reports the results of a mathematical model evaluation conducted by MathArena, highlighting that Gemini 2.5 Pro significantly outperformed its competitors on the IMO 2025 problems, achieving a total score above 30%, 89% higher than that of the second-place model, o3 [1][2].

Group 1: Evaluation Process
- The evaluation was organized by MathArena, which selected models based on their past performance in MathArena competitions: Gemini 2.5 Pro, o3, o4-mini, Grok 4, and DeepSeek-R1 [4].
- A unified prompt template was used for all models to ensure fairness, aligned with the Open Proof Corpus evaluation [5].
- Each model was run with its recommended hyperparameters and a maximum token limit of 64,000 [6].

Group 2: Scoring and Judging
- Four experienced human judges with IMO-level mathematical expertise were hired to assess the models, with each problem scored out of 7 points [10][11].
- Each model generated 32 initial answers, from which it selected its best four for final scoring [8].

Group 3: Performance Insights
- Many answers scored 3-4 points out of 7, a pattern that is rare among human contestants and that points to a gap between human and model capabilities [12].
- Models over-optimizing for the final answer format was noticeably reduced, suggesting progress in handling open-ended mathematical reasoning tasks [13].
- Gemini showed improvement in avoiding the fabrication of non-existent "theorems" compared with previous evaluations [14].

Group 4: Problem-Solving Performance
- The models struggled with geometry: the second and sixth problems received the lowest scores, and on the second problem only Grok 4 scored anything, at 4% [26][27].
- On the fourth problem, most models used methods similar to humans' but made logical errors; on the fifth, they identified correct strategies but failed to provide proofs [29].
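The sample-then-judge protocol in Group 2 can be expressed as a small helper. The sketch below is an assumed simplification: the generator, self-selection step, and judges are dummy placeholders, and aggregating the four retained answers by taking the best one is my assumption rather than MathArena's stated rule.

```python
# Hypothetical sketch of best-of-n sampling with model self-selection and 0-7 judging.
import random
import statistics

def evaluate_problem(generate, select_best, judge_scores, n_samples=32, n_keep=4):
    """Sample n candidate solutions, keep the model's best few, score each out of 7."""
    candidates = [generate() for _ in range(n_samples)]
    finalists = select_best(candidates, n_keep)              # model self-selects
    per_solution = [statistics.mean(judge_scores(sol)) for sol in finalists]
    return max(per_solution)                                 # assumed aggregation rule

# Dummy components so the sketch runs end to end.
random.seed(0)
gen = lambda: random.random()                                # "quality" of one sampled proof
pick = lambda cands, k: sorted(cands, reverse=True)[:k]      # pretend self-selection is perfect
grade = lambda sol: [round(7 * sol)] * 4                     # four judges, 0-7 points each
print(evaluate_problem(gen, pick, grade))                    # score out of 7 for one problem
```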
Meta's brand-new AI org structure revealed, and it has a bit of a ByteDance vibe
量子位· 2025-07-18 06:16
Editorial team, reporting from Aofeisi
量子位 | WeChat official account QbitAI

After a series of internal reorganizations at Meta, a brand-new structure is beginning to surface, and on closer inspection it looks strangely familiar.

Everyone knows Zuckerberg has been assembling a "Superintelligence Lab" with compensation packages averaging over US$100 million per person. The latest news is that a new organization of more than 3,400 people has been consolidated around this lab. Its top leader is Alexandr Wang, with the title of Chief AI Officer (CAIO); his deputy is former GitHub CEO Nat Friedman, who is mainly responsible for AI products and applications. Zuckerberg's poaching spree now makes much more sense.

After this adjustment, even Turing Award winner Yann LeCun, one of the "big three" of AI, reports to Wang, an MIT undergraduate dropout born in 1997.

But that is not the main point. The main point is how these 3,400-plus people are being divided up. Reportedly there are four groups in total.

Wait, isn't this exactly the ByteDance AI structure Zuckerberg has coveted and ground his teeth over for years? At ByteDance, Seed, led by Wu Yonghui, does the most cutting-edge AGI research and also owns foundation-model technology and architecture; the product teams then build applications and products on top of that AI base. The only difference is that Meta also has a new team working on Llama 5: because Wang is shaking Zuckerberg's stance on open versus closed source, Meta may end up walking on two legs ...
Breaking through the scale-drift problem in outdoor RGB-only SLAM: precise localization plus high-fidelity reconstruction | ICCV'25, open source
量子位· 2025-07-18 06:16
Contributed by the S3PO-GS team
量子位 | WeChat official account QbitAI

The scale-drift problem in outdoor SLAM finally has a new solution!

The latest result from the Hong Kong University of Science and Technology (Guangzhou): S3PO-GS, a 3D Gaussian framework built specifically for outdoor monocular SLAM, has been accepted to ICCV 2025. The highlight of this work is that it achieves global scale consistency for RGB-only monocular SLAM for the first time. On the three major outdoor benchmarks Waymo, KITTI, and DL3DV, S3PO-GS not only sets new state-of-the-art records for novel view synthesis but also reduces tracking error by 77.3% on DL3DV scenes.

What does this paper do?

In frontier fields such as autonomous driving, robot navigation, and AR/VR, the robustness of SLAM directly determines system performance. Current 3D Gaussian Splatting (3DGS) based SLAM methods excel in indoor scenes but still face severe challenges in unbounded outdoor environments with RGB-only input: the inherent lack of depth priors in monocular systems leads to insufficient geometric information, while introducing monocular depth estimation or end-to-end point-cloud models (such as MASt3R) as geometric priors causes system-level scale drift due to inter-frame scale inconsistency, a problem that is especially pronounced in complex outdoor scenes.

To address this dual bottleneck, the HKUST (Guangzhou) research team proposes the novel framework S3PO-GS, achieving global scale consistency for RGB monocular SLAM for the first time. The method relies on three core techniques ...
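To illustrate the inter-frame scale inconsistency described above, the sketch below shows a generic least-squares scale alignment of each frame's monocular depth prediction to shared anchor depths, which puts all frames on one common scale. This is an assumed illustration of the problem only, not S3PO-GS's actual technique (the description of its three core components is cut off in the excerpt); the anchor values are made up.

```python
# Generic per-frame scale alignment (hypothetical illustration, not the paper's method).
import numpy as np

def align_scale(pred_depth, anchor_depth):
    """Closed-form scale s minimizing ||s * pred - anchor||^2 over shared points."""
    s = float(np.dot(pred_depth, anchor_depth) / np.dot(pred_depth, pred_depth))
    return s, s * pred_depth

# Two frames observing the same anchor points; the raw monocular depths carry an
# arbitrary per-frame scale, which would cause scale drift if used directly.
anchors = np.array([2.0, 4.0, 6.0])
frame_a = 0.5 * anchors          # this frame's prior under-estimates depth by 2x
frame_b = 1.7 * anchors          # this frame's prior over-estimates depth by 1.7x
for name, pred in [("frame_a", frame_a), ("frame_b", frame_b)]:
    s, aligned = align_scale(pred, anchors)
    print(name, "scale:", round(s, 3), "aligned:", aligned)   # both align to [2, 4, 6]
```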