AI catches AI fakes and sets a new SOTA | Xiamen University & Tencent Youtu Lab
量子位· 2025-07-20 02:49
Core Viewpoint
- The article discusses AIGI-Holmes, an innovative method developed by Xiamen University and Tencent Youtu Lab for detecting AI-generated images, addressing the challenges of interpretability and generalization in existing detection models [2][12][36].

Group 1: Methodology
- AIGI-Holmes employs a "large model + visual expert" collaborative architecture to enhance image detection capabilities [2][5].
- The method uses a dual-visual-encoder architecture that integrates NPR visual experts to process both high-level semantics and low-level visual features [6].
- The Holmes Pipeline consists of three training phases: visual expert pre-training, supervised fine-tuning (SFT), and direct preference optimization (DPO) [7][22].

Group 2: Key Innovations
- The AIGI-Holmes method addresses two critical bottlenecks in existing detection technologies: lack of interpretability and limited generalization [12][36].
- A new dataset, Holmes-Set, containing 45,000 images and 20,000 annotations, was constructed to mitigate data scarcity and cover various types of generation defects [15][18].
- The model architecture includes a collaborative decoding strategy that merges predictions from the visual experts and the large language model to enhance detection accuracy [8][25].

Group 3: Performance Evaluation
- Experimental results indicate that AIGI-Holmes outperforms existing methods across all benchmarks in both detection accuracy and interpretability [10][29].
- The model achieved the best results on objective metrics (BLEU/ROUGE/METEOR/CIDEr) and in subjective evaluations compared with current advanced models [31].
- In robustness tests against common distortions such as JPEG compression and Gaussian blur, AIGI-Holmes maintained superior detection accuracy compared to other baseline methods [33][35].

Group 4: Future Directions
- The team acknowledges limitations such as the hallucination problem, where the model may misinterpret normal features as defects, and the need for a more fine-grained understanding of visual defects [36][39].
- Future work will focus on addressing the hallucination issue, enhancing fine-grained understanding capabilities, and developing objective evaluation metrics for visual-defect explanations [39].
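The collaborative decoding strategy mentioned in Group 2 can be pictured as a weighted fusion of two authenticity signals. The sketch below is a minimal illustration under assumed interfaces (the expert's fake-probability and the LLM's logits over "real"/"fake" answer tokens); it is not the paper's actual decoding rule.

```python
import numpy as np

def collaborative_decode(expert_fake_prob: float,
                         llm_token_logits: dict[str, float],
                         alpha: float = 0.5) -> str:
    """Fuse a visual expert's fake-probability with an LLM's real/fake
    token logits via a simple weighted average (illustrative only)."""
    # Turn the LLM's logits over the "fake"/"real" answer tokens into probabilities.
    logits = np.array([llm_token_logits["fake"], llm_token_logits["real"]])
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    llm_fake_prob = probs[0]

    # Weighted fusion; alpha balances the low-level expert signal
    # against the LLM's semantic judgment.
    fused = alpha * expert_fake_prob + (1 - alpha) * llm_fake_prob
    return "AI-generated" if fused > 0.5 else "real"

# Example: the expert is fairly sure the image is synthetic, the LLM is unsure.
print(collaborative_decode(0.92, {"fake": 0.3, "real": 0.1}))
```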
Serious vulnerability found in NVIDIA GPUs can crash model accuracy by 99.9%
量子位· 2025-07-20 02:49
克雷西 Henry, from Aofeisi | QbitAI (WeChat official account: QbitAI)

So far, researchers have successfully demonstrated the attack on an NVIDIA RTX A6000, but other models may also be affected.

NVIDIA recommends that users apply a mitigation, but that mitigation costs roughly 10% of model performance.

So what exactly is this vulnerability?

Not a bug, but a "physical attack"

White-hat hackers have found a serious vulnerability in NVIDIA GPUs.

Through an attack called GPUHammer, the accuracy of a large model running on the GPU can be driven from 80% straight down to 0.02%, leaving essentially nothing.

Researchers at the University of Toronto describe the attack as causing catastrophic "brain damage" in the model.

GPUHammer is the first Rowhammer attack to successfully target GPU memory.

It does not tamper with model files through code; it physically attacks your GPU memory.

It belongs to the Rowhammer family of attacks: by repeatedly "hammering" one row of memory, an attacker induces bit flips in adjacent rows (0s turning into 1s and 1s into 0s), silently corrupting data.

In shared-GPU environments such as cloud machine-learning platforms or VDI setups, a malicious tenant could launch a GPUHammer attack against a neighboring workload, degrading inference accuracy or corrupting cached model parameters.

It is fair to say that GPUHammer poses a devastating ... [to] the infrastructure of the AI era ...
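The Rowhammer mechanism described above comes down to single bit flips in DRAM. The toy, software-only simulation below flips one exponent bit of an FP16 weight to show why a single flip can be catastrophic; the real attack flips bits through hardware disturbance in GPU DRAM, not through code like this.

```python
import numpy as np

def flip_bit(weights: np.ndarray, index: int, bit: int) -> np.ndarray:
    """Flip one bit of one FP16 weight, mimicking a Rowhammer-style
    single-bit corruption (software simulation only)."""
    corrupted = weights.copy()
    raw = corrupted.view(np.uint16)       # reinterpret the FP16 bytes as integers
    raw[index] ^= np.uint16(1 << bit)     # XOR toggles the chosen bit
    return corrupted

weights = np.full(4, 0.02, dtype=np.float16)   # a few typical small weights
# Flipping the top exponent bit (bit 14) blows 0.02 up to a value over a thousand,
# which is why one flip in a weight tensor can wreck a model's accuracy.
print(weights, "->", flip_bit(weights, 0, 14))
```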
No NeRF or Gaussian-splatting post-processing needed: video turns into game-ready models, averaging just 60 seconds per frame | ICCV 2025
量子位· 2025-07-19 05:15
Core Viewpoint
- The article discusses a new method called V2M4, developed by a research team from KAUST, which enables the direct generation of usable 4D mesh animations from monocular video, significantly improving the efficiency and usability of animation and game content generation [1][6].

Summary by Sections

Method Overview
- V2M4 constructs a systematic multi-stage process that includes camera trajectory recovery, appearance optimization, topology unification, and texture synthesis, allowing videos to be transformed into models quickly [2][6].

Performance Metrics
- The generated appearance and structure closely match the input, with an average processing time of about 60 seconds per frame, significantly faster than existing methods. It also supports long videos, performing well even on sequences of 300 frames [4][20].

Challenges in Video-to-Animation Conversion
- Converting a video into continuous animated mesh assets has been a long-standing challenge in visual computing, traditionally requiring high-cost setups such as multi-camera rigs and motion capture. Implicit methods like NeRF can replicate appearance but struggle to output topologically consistent explicit meshes [4][5].

Camera Trajectory Recovery
- V2M4 employs a three-stage camera estimation strategy to reconstruct the camera perspective for each video frame, converting "camera motion" into "mesh motion" to accurately model dynamic scenes [10][11].

Appearance Consistency Optimization
- To address appearance discrepancies, V2M4 adopts the null-text optimization strategy from image editing to fine-tune the conditional embeddings of the generation network, enhancing the visual fidelity of the generated meshes [13][15].

Topology Unification
- V2M4 introduces a frame-by-frame registration and topology unification mechanism, ensuring that all frames share a consistent topology, which is crucial for subsequent texture generation and temporal interpolation [16].

Texture Consistency Optimization
- A shared global texture map is constructed for all frames to eliminate flickering and discontinuities, ensuring a smooth visual experience throughout the animation [17].

Animation Export
- The method applies temporal interpolation and structural encapsulation to the generated mesh sequences, producing a smooth animation sequence that can be exported as a GLTF-compliant file for use in mainstream graphics and game engines [18].

Performance Validation
- V2M4's performance is evaluated on challenging video data, demonstrating comprehensive advantages in reconstruction quality, runtime efficiency, and generalization [19][20].

Visual Comparison
- Visual results show that V2M4 generates meshes with superior rendering detail, normal structure, and inter-frame consistency, achieving high fidelity and stable generation of continuous animations [21].
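One reason topology unification matters for the export step is that, once every frame's mesh shares the same vertex count and ordering, in-between frames can be produced by simple vertex interpolation. The snippet below is an illustrative stand-in for that idea, not V2M4's actual interpolation code.

```python
import numpy as np

def interpolate_frames(verts_a: np.ndarray, verts_b: np.ndarray, n_steps: int):
    """Linearly interpolate vertex positions between two meshes that share
    the same topology (same vertex count and ordering). This only works
    because topology is unified first; it stands in for the temporal
    interpolation step that smooths the exported animation."""
    ts = np.linspace(0.0, 1.0, n_steps)
    return [(1.0 - t) * verts_a + t * verts_b for t in ts]

# Two keyframes of a single triangle, expanded into 5 frames.
frame_a = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0]], dtype=float)
frame_b = frame_a + np.array([0.0, 0.0, 0.5])        # the triangle drifts upward in z
print(len(interpolate_frames(frame_a, frame_b, 5)))  # -> 5
```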
An AI "stress interview": DeepSeek's performance plunges nearly 30% | Tsinghua & Shanghai AI Lab
量子位· 2025-07-19 05:15
Core Viewpoint
- The article discusses a new "stress test" framework called REST (Reasoning Evaluation through Simultaneous Testing) designed to evaluate the reasoning capabilities of large language models (LLMs) under pressure, revealing significant performance drops, particularly in multi-task scenarios [1][3][20].

Group 1: Stress Test Framework
- The REST framework allows multiple questions to be presented simultaneously to models, simulating real-world complex reasoning scenarios [2][6].
- The framework was developed by research teams from Shanghai AI Lab, Tsinghua University, and Renmin University of China to address limitations in current evaluation methods [1][6].

Group 2: Performance Findings
- Top models, such as DeepSeek-R1, showed a drastic accuracy drop of 29.1% on the AIME24 test set under stress conditions [3][11].
- The performance of various models was significantly affected, with smaller models (7B parameters) deteriorating faster under pressure compared to larger models (32B parameters) [13][19].

Group 3: Evaluation Limitations
- Current evaluation methods have three main issues: low differentiation among top models, high costs of developing new test questions, and a lack of realism in testing single questions [5][6].
- REST addresses these issues by combining multiple questions into a single prompt, allowing for a more comprehensive assessment of reasoning abilities [6][20].

Group 4: Key Reasoning Abilities
- The stress test evaluates several critical reasoning abilities, including context budget allocation, cross-question interference resistance, and dynamic cognitive load management [7][8][9].
- Models that effectively manage token allocation under pressure tend to perform better, demonstrating adaptive reasoning effort distribution [17][19].

Group 5: Implications for Future Development
- The findings suggest that traditional single-question evaluations may overlook significant reasoning flaws, such as question omission and incorrect reasoning summaries [20].
- REST provides a new paradigm for constructing evaluation datasets that are more cost-effective and closer to real-world applications, offering insights for developing more robust LLMs [20].
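The core mechanic described in Group 3, packing several questions into one prompt, can be sketched as a simple prompt builder. The template wording below is an assumption for illustration; the paper's exact prompt may differ.

```python
def build_rest_prompt(questions: list[str]) -> str:
    """Pack several benchmark questions into one prompt so the model must
    budget its reasoning across all of them at once (illustrative template)."""
    header = ("Solve all of the following problems. Answer each one "
              "separately and label your answers Q1, Q2, ...\n\n")
    body = "\n\n".join(f"Q{i + 1}. {q}" for i, q in enumerate(questions))
    return header + body

print(build_rest_prompt([
    "What is 17 * 24?",
    "How many primes are there below 20?",
    "Simplify (x^2 - 1) / (x - 1).",
]))
```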
Unitree's Wang Xingxing: A-share IPO tutoring filing announced
量子位· 2025-07-19 05:15
鹭羽 白交, from Aofeisi | QbitAI (WeChat official account: QbitAI)

After Zhihui Jun, Wang Xingxing has now arrived at the door of the capital markets.

Nine years after founding, Unitree Robotics has finally reached the IPO threshold, and this time it is not a rumor.

According to the website of the China Securities Regulatory Commission, Unitree has completed its tutoring registration with the Zhejiang Securities Regulatory Bureau and published its initial public offering (IPO) and listing tutoring filing report.

This marks Unitree Robotics' formal sprint toward an A-share listing.

Wang Xingxing's shareholding was also disclosed: he directly holds 23.82% and, through limited-partnership vehicles, controls a combined 34.76% of the equity.

The question of which company will become the "first embodied-intelligence stock" is once again up in the air.

Unitree Robotics kicks off its IPO

If all goes smoothly, a comprehensive assessment of the company will be carried out as early as October 2025, producing listing application documents that meet the requirements.

| Tutoring target | Hangzhou Unitree Robotics Co., Ltd. ("Unitree Robotics", "the Company") | | |
| --- | --- | --- | --- |
| Date of establishment | August 26, 2016 | | |
| Registered capital | RMB 364,017,906 | Legal representative | Wang Xingxing |
| Registered address | Room 306, Building 1, No. 88 Dongliu Road, Xixing Subdistrict, Binjiang District, Hangzhou, Zhejiang | | |
| Controlling shareholder and shareholding ratio | The controlling shareholder and actual controller is Mr. Wang Xingxing, who directly holds 23.82% of the Company ... | | |
The real-time AI video generation model even Karpathy invested in: live streams can be converted on the fly, with unlimited duration and near-zero latency
量子位· 2025-07-19 05:15
Core Viewpoint
- The article discusses the innovative AI startup Decart and its groundbreaking video model MirageLSD, which enables real-time, zero-latency video generation, revolutionizing live streaming, gaming, and video communication [4][5][7].

Group 1: Technology and Features
- MirageLSD is the first AI model to achieve zero-latency, infinite real-time video generation, allowing for continuous video streams without time limitations [4][5].
- The model operates at a speed 16 times faster than previous models, generating video at 24 frames per second and allowing for ongoing prompts, transitions, and edits during video generation [6][28].
- It addresses the "error accumulation" issue found in traditional autoregressive video models, ensuring temporal coherence while generating content frame by frame [9][11].

Group 2: Innovations and Mechanisms
- The model employs a custom real-time stream diffusion model (Live-Stream Diffusion) that generates each frame based on previously generated frames and user prompts, rather than relying on the entire video sequence [14].
- It utilizes Diffusion Forcing technology to independently denoise single frames during training, ensuring coherence in frame generation [15].
- The model incorporates a historical enhancement strategy to preemptively correct potential errors by simulating artifacts during training [16].

Group 3: Performance and User Interaction
- MirageLSD's architecture includes an improved Transformer model and a specially designed visual encoder, which enhances processing speed and reduces latency [18][20].
- The system features a dynamic input mechanism that processes player inputs with ultra-low latency, allowing for immediate responses to changes in the environment [22].
- Users can perform actions like changing outfits or transforming objects with minimal delay, showcasing the model's interactive capabilities [23].

Group 4: Company Background and Future Developments
- Decart, the company behind MirageLSD, was founded in 2023 and previously launched the Oasis model, which also supports real-time interactions [25][26].
- The team plans to regularly release upgrades and new features for MirageLSD, including facial consistency, voice control, and precise object manipulation to enhance user experience [28].
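The frame-by-frame, history-conditioned generation described in Group 2 can be summarized as a causal loop: keep a short window of recent frames, read the latest prompt, denoise the next frame, repeat. The sketch below uses a hypothetical `model.denoise_next_frame` interface and only shows the control flow, not Decart's implementation.

```python
from collections import deque

def live_stream_generate(model, first_frame, get_prompt, history_len=4):
    """Causal frame-by-frame generation: each new frame is denoised
    conditioned only on a short window of previous frames plus the current
    prompt, which is what allows an unbounded, editable stream."""
    history = deque([first_frame], maxlen=history_len)
    while True:
        prompt = get_prompt()                      # prompts can change mid-stream
        frame = model.denoise_next_frame(list(history), prompt)
        history.append(frame)                      # only recent frames are kept
        yield frame
```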
DeepSeek finally loses the open-source crown, but its successor still comes from China
量子位· 2025-07-18 08:36
Core Viewpoint
- Kimi K2 has surpassed DeepSeek to become the number one open-source model globally, ranking fifth overall, closely following top proprietary models like Musk's Grok 4 [1][19].

Group 1: Ranking and Performance
- Kimi K2 achieved a score of 1420, placing it fifth in the overall ranking, with only a slight gap from leading proprietary models [2][22].
- The top ten models now all have scores above 1400, indicating that open-source models are increasingly competitive with proprietary ones [20][21].

Group 2: Community Engagement and Adoption
- Kimi K2 has gained significant attention in the open-source community, with 5.6K stars on GitHub and nearly 100,000 downloads on Hugging Face [5][4].
- The CEO of AI search engine startup Perplexity has publicly endorsed Kimi K2, indicating its strong internal evaluation and future plans for further training based on this model [5][27].

Group 3: Model Architecture and Development
- Kimi K2 inherits the DeepSeek V3 architecture but includes several parameter adjustments to optimize performance [9][12].
- Key modifications in Kimi K2's structure include increasing the number of experts, halving the number of attention heads, retaining only the first layer as dense, and implementing flexible expert routing [13][15].

Group 4: Industry Trends and Future Outlook
- The stereotype that open-source models are inferior is being challenged, with industry experts predicting that open-source will increasingly outperform proprietary models [19][24].
- Tim Dettmers from the Allen Institute for AI suggests that open-source models defeating proprietary ones will become more common, highlighting their importance in localizing AI experiences [25][27].
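The architectural changes listed in Group 3 are easiest to see as a config diff against a DeepSeek-V3-style baseline. The field names and numbers below are illustrative assumptions rather than the official configuration files; they only encode the reported direction of each change.

```python
# Hypothetical config sketch contrasting the reported Kimi K2 tweaks with a
# DeepSeek-V3-style baseline (illustrative values, not official configs).
deepseek_v3_like = {
    "n_routed_experts": 256,
    "n_attention_heads": 128,
    "first_k_dense_layers": 3,   # first few layers stay dense
}

kimi_k2_like = {
    **deepseek_v3_like,
    "n_routed_experts": 384,     # more experts
    "n_attention_heads": 64,     # attention heads halved
    "first_k_dense_layers": 1,   # only the first layer stays dense
}

# Show which fields changed relative to the baseline.
for key in deepseek_v3_like:
    if deepseek_v3_like[key] != kimi_k2_like[key]:
        print(f"{key}: {deepseek_v3_like[key]} -> {kimi_k2_like[key]}")
```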
A unicorn in 8 months: Europe's answer to Cursor is valued at $1.8 billion
量子位· 2025-07-18 08:36
时令, from Aofeisi | QbitAI (WeChat official account: QbitAI)

Founded only 8 months ago, it has already become the newest unicorn, with its valuation soaring to $1.8 billion.

It now has over 2.3 million free active users and 180,000 paying subscribers, and its first-month retention rate for paying users has even surpassed ChatGPT's.

This is not a Silicon Valley legend but a rising AI star from Sweden: Lovable, which is reshaping how software gets built with natural language.

The company recently closed the largest Series A round in Swedish history, raising $200 million.

In the months since launch, Lovable has drawn a steady stream of praise.

One user said Lovable amazed him: after several development platforms (Bolt, V0, Replit) failed to get the job done, it generated a complete product website in just a few hours.

Another used it to generate a new game.

Europe's answer to Cursor

Like Cursor, Lovable uses large models to help users build applications, but it targets a group with even greater potential: people who cannot code.

Someone even plans to use it to build a startup in 30 days, with the entire process documented in public.

In a press release, the company said that such users and their experiments likely account for the bulk of the 10 million projects created on the platform so far.

Co-founder and CEO Osika said: "Our mission is to let anyone build. With large mo ..."
A 7B model's "EQ" rivals GPT-4o: Tencent cracks the open-domain RL problem, with scores jumping 5x
量子位· 2025-07-18 06:16
Core Insights
- The article discusses the challenges and solutions in optimizing large models for emotional intelligence in multi-turn dialogues using Reinforcement Learning (RL) [2][4][5].
- The proposed RLVER framework integrates a user simulator that acts as both the interaction environment and the reward source, addressing the three main challenges of RL in this context [2][5][11].

Group 1: Challenges in RL for Emotional Intelligence
- The three main challenges identified are:
  1. Environmental challenge: creating a realistic and diverse interaction environment for the model [2][4]
  2. Reward challenge: converting subjective user satisfaction into stable, long-term rewards [2][11]
  3. Training challenge: achieving stable and efficient multi-turn online RL training on large language models (LLMs) [2][4]

Group 2: RLVER Framework
- The RLVER framework utilizes a user simulator that embodies diverse user profiles and interaction scenarios, allowing for a rich and dynamic learning environment [7][8].
- This simulator updates its emotional state based on the model's responses, providing personalized feedback that enhances the model's learning experience [9][10].

Group 3: Performance Outcomes
- The Qwen2.5-7B model, trained using RLVER, achieved a score of 79.2 on the Sentient-Benchmark, a significant increase from 13.3, positioning it alongside top commercial models like GPT-4o and Gemini 2.5 Pro [16][17].
- The model maintained its general capabilities in areas like mathematics and coding, avoiding "catastrophic forgetting" [17].

Group 4: Insights from Training
- The introduction of explicit "think-then-say" prompts improved the model's ability to understand and respond empathetically, leading to two distinct paths towards empathy: "thinking models" and "reactive models" [20][21].
- The choice of optimization algorithms (PPO vs. GRPO) revealed that focusing on specific dimensions of emotional intelligence can yield better overall performance [23][27].

Group 5: User Simulator Insights
- The RLVER team created two types of user simulators, with findings indicating that a more forgiving environment (Vanilla simulator) is beneficial for early-stage model growth compared to a more challenging environment [29][30].
- Models with explicit thinking structures demonstrated greater robustness in challenging environments, suggesting that reasoning capabilities can mitigate training instability [33].
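The simulator-as-environment-and-reward setup in Group 2 can be sketched as a simple episode loop in which the simulated user's emotion score after each reply serves as the reward signal. All interfaces below are hypothetical; RLVER's actual training loop (PPO/GRPO over such trajectories) is more involved.

```python
def rlver_episode(policy, simulator, max_turns=8):
    """One multi-turn episode in an RLVER-style setup: a user simulator plays
    both the environment and the reward model, updating a hidden emotion
    score after every assistant reply (hypothetical interfaces)."""
    dialogue, rewards = [], []
    user_msg = simulator.opening_message()
    for _ in range(max_turns):
        dialogue.append(("user", user_msg))
        reply = policy.respond(dialogue)            # model's turn
        dialogue.append(("assistant", reply))
        user_msg, emotion = simulator.react(reply)  # simulator updates its emotional state
        rewards.append(emotion)                     # emotion score becomes the reward signal
        if simulator.done():
            break
    return dialogue, rewards
```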
Large models' IMO 2025 math competition results are out
量子位· 2025-07-18 06:16
Core Viewpoint
- The article discusses the results of a mathematical model evaluation conducted by MathArena on the IMO 2025 problems, highlighting that Gemini 2.5 Pro significantly outperformed its competitors, reaching a total score just above 30% of the available points, roughly 89% higher than the second-place model, o3 [1][2].

Group 1: Evaluation Process
- The evaluation was organized by MathArena, selecting models based on their past performances in MathArena competitions, including Gemini 2.5 Pro, o3, o4-mini, Grok 4, and DeepSeek-R1 [4].
- A unified prompt template was used for all models to ensure fairness, aligning with the Open Proof Corpus evaluation [5].
- Each model was run with its recommended hyperparameters and a maximum token limit of 64,000 [6].

Group 2: Scoring and Judging
- Four experienced human judges with IMO-level mathematics expertise were hired to assess the models, with each problem scored out of 7 points [10][11].
- Each model generated 32 initial answers, from which it selected its best four for final scoring [8].

Group 3: Performance Insights
- Many models scored between 3 and 4 points out of 7, a phenomenon less common in human testing, indicating a disparity in capabilities between humans and models [12].
- There was a notable reduction in models over-optimizing for the final answer format, suggesting progress in handling open-ended mathematical reasoning tasks [13].
- Gemini showed improvement in avoiding the fabrication of non-existent "theorems" compared to previous evaluations [14].

Group 4: Problem-Solving Performance
- The models struggled with geometry: the second and sixth problems yielded the lowest scores, and on the second problem only Grok 4 scored any points, at 4% [26][27].
- On the fourth problem, most models used methods similar to human contestants but made logical errors, while on the fifth problem they identified the correct strategy but failed to provide proofs [29].
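The best-of-32 selection step in Group 2 can be sketched as a ranking over candidate solutions. How each model actually ranked its own candidates is not specified here, so the `self_score` function below is a placeholder assumption.

```python
def select_best_answers(candidates: list[str], self_score, k: int = 4) -> list[str]:
    """Hedged sketch of the selection step: each model produced 32 candidate
    solutions per problem and submitted its top k for human grading
    (0-7 points per problem). `self_score` stands in for whatever internal
    ranking the model used."""
    return sorted(candidates, key=self_score, reverse=True)[:k]

# Toy usage: rank 32 draft solutions by length as a placeholder scoring function.
drafts = [f"proof sketch {i} " * (i + 1) for i in range(32)]
print(len(select_best_answers(drafts, self_score=len)))  # -> 4
```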