LLM Confidence Collapse! Google DeepMind Confirms: Opposing Opinions Make GPT-4o Readily Abandon Correct Answers
量子位· 2025-07-20 05:08
Core Viewpoint
- Research by Google DeepMind and University College London reveals that large language models (LLMs) exhibit the conflicting behaviors of being both confident and self-doubting, driven by their sensitivity to opposing feedback [2][3][21].

Group 1: Model Behavior
- LLMs tend to maintain their initial answers when they can see them, reflecting a human-like tendency to uphold one's viewpoint after making a decision [11][12].
- Conversely, when the initial answer is hidden, LLMs are more likely to change their answers, indicating excessive sensitivity to opposing suggestions, even when those suggestions are incorrect [13][21].
- This behavior diverges from human cognition, as humans typically do not abandon correct conclusions so easily on the basis of misleading information [15][21].

Group 2: Experimental Design
- The study used a two-round experiment in which an LLM first answered a binary-choice question and then received feedback from a fictional advice LLM (a minimal simulation sketch of this protocol follows the summary) [7][8].
- The key variable was whether the responding LLM could see its own initial answer, which significantly affected the final decision [9][10].

Group 3: Reasons for Inconsistent Behavior
- Over-reliance on external feedback, a byproduct of reinforcement learning from human feedback (RLHF), leaves the model without independent judgment about the reliability of incoming information [19][21].
- Decision-making rests on statistical pattern matching rather than logical reasoning, making LLMs susceptible to misleading signals [19][21].
- The absence of a robust memory mechanism for deeper reasoning means that, when the initial answer is not visible, the model is easily swayed by opposing suggestions [21][22].
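The two-round protocol summarized under Group 2 can be sketched as a small simulation harness. This is a minimal sketch based only on the summary above; the `ask_llm` helper, the prompt wording, and the "85% confident" phrasing are hypothetical placeholders, not the paper's actual setup.

```python
# Minimal sketch of the two-round "opposing advice" protocol described above.
# ask_llm() is a hypothetical stand-in for any chat-completion call; here it just
# guesses so the harness runs end to end.
import random

def ask_llm(prompt: str) -> str:
    """Placeholder model: returns 'A' or 'B' at random."""
    return random.choice(["A", "B"])

def run_trial(question: str, options: tuple, show_initial: bool) -> dict:
    # Round 1: the model commits to one of two options.
    initial = ask_llm(f"{question}\nA) {options[0]}  B) {options[1]}\nAnswer A or B.")

    # A fictional "advice LLM" always disagrees with the initial answer.
    opposing = "B" if initial == "A" else "A"
    advice = f"Another assistant is 85% confident the answer is {opposing}."

    # Round 2: the manipulated variable is whether the model sees its own first answer.
    reminder = f"Your earlier answer was {initial}.\n" if show_initial else ""
    final = ask_llm(f"{question}\n{reminder}{advice}\nGive your final answer, A or B.")
    return {"initial": initial, "final": final, "changed": initial != final}

# Compare flip rate with the initial answer hidden vs. visible.
for visible in (False, True):
    trials = [run_trial("Is 97 prime?", ("yes", "no"), visible) for _ in range(1000)]
    print("visible" if visible else "hidden", sum(t["changed"] for t in trials) / 1000)
```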
63% Faster! CAS Generative Renderer Breaks the Efficiency Bottleneck, Boosts Consistency by 20%, and Tackles the Embodied-Data Shortage
量子位· 2025-07-20 02:49
Contributed by the TC-Light team | 量子位 QbitAI

With embodied AI booming, a generative renderer built for embodied scenarios has arrived. TC-Light, developed by Professor Zhang Zhaoxiang's team at the Institute of Automation, Chinese Academy of Sciences, can realistically re-render the lighting and textures of long video sequences featuring the complex and intense motion typical of embodied training tasks, while maintaining good temporal consistency at low computational cost. It helps reduce the Sim2Real gap and enables Real2Real data augmentation, providing the large volumes of high-quality data needed for embodied-intelligence training. The paper, demo, and code are all publicly available.

Research background
Light and its interaction with the surrounding environment jointly shape how humans and embodied agents perceive both digital and physical worlds. However, collecting real-world data under varied lighting and scene conditions is costly, while simulators, although able to produce nearly unlimited data, are constrained by compute and typically approximate or simplify multi-bounce refraction, diffraction, and texture fidelity, inevitably sacrificing visual realism and creating a visual Sim2Real gap. If a generative model could re-render video captured in real or simulated environments under the desired lighting conditions, it would not only increase the diversity of existing real data but also remove the "CG look" caused by those approximations, yielding visually realistic sensor data from simulators; approaches including RL-CycleGAN ...
Terence Tao Responds to OpenAI's New Model Winning IMO Gold; a GPT-5 Test Version Also Surfaces
量子位· 2025-07-20 02:49
Core Insights
- OpenAI's latest model achieved gold-medal level at the 2025 International Mathematical Olympiad (IMO), solving 5 out of 6 problems and scoring 35 points out of a possible 42, surpassing this year's gold-medal threshold [1][2][11][12].

Group 1: Model Performance
- The model's performance was evaluated under conditions identical to those of human participants: two 4.5-hour exams, no tools or internet access, and natural-language explanations required for every solution [9][11].
- The gold-medal score of 35 points aligns with human participant results; only 5 of approximately 600 competitors achieved full marks this year [12].
- The evaluation process was rigorous, with each solution assessed by three former IMO medalists and consensus required before final scoring [13].

Group 2: Breakthrough Significance
- The achievement signifies a new level of creative thinking in problem-solving, with the model demonstrating rapid progress in reasoning time across various benchmarks, culminating in tackling the IMO's complex problems [14].
- The model's success indicates a departure from traditional reinforcement learning methods, showcasing its ability to construct intricate proofs akin to human mathematicians [14].

Group 3: Upcoming Developments
- Alexander Wei of OpenAI indicated that GPT-5 is set to be released soon, although the IMO gold-medal model remains an experimental research project with no immediate plans for public release [3][8].
- The discovery of the identifier "GPT-5-reasoning-alpha-2025-07-13" in third-party repositories suggests that GPT-5 is on the horizon [6][8].

Group 4: Community Reactions
- The announcement sparked significant discussion within the AI community, with mathematician Terence Tao expressing skepticism about the comparability of AI performance due to the lack of standardized testing environments [23][24].
- Tao emphasized that AI capabilities are influenced by various factors, including resources and methodologies, making it challenging to quantify performance uniformly [25][26].

Group 5: Independent Evaluations
- The MathArena platform conducted independent assessments, revealing that even the best-performing models, such as Gemini 2.5 Pro, scored only 13 points (31%), far below the bronze-medal threshold [34][35].
- The MathArena team expressed the need for transparency regarding OpenAI's methodology to validate the reported results [37].
Task-Level Rewards Boost App Agents' Reasoning: Taotian Proposes Mobile-R1, Where a 3B Model Can Beat 32B
量子位· 2025-07-20 02:49
Contributed by the Mobile-R1 team | 量子位 QbitAI

Existing mobile/app agents can adapt to real-time environments and execute actions, but most of them rely only on action-level rewards (via SFT or RL). Such rewards only guide the agent toward the single best action at each step, making it hard to cope with a constantly changing mobile environment. For example, given the instruction "Open Fliggy, go to hotel packages, open the popular livestream, find Fliggy Super VIP, and follow the streamer," Qwen2.5-VL-3B-Instruct fails at the second step.

The Taotian Group Algorithm Technology - Future Life Lab and the Diantao algorithm team jointly propose a multi-turn, task-oriented learning approach that combines online learning with trajectory correction, which may improve the agent's adaptability and exploration ability. They introduce an interactive reinforcement-learning framework with task-level rewards, Mobile-R1 (a toy reward sketch follows this summary). With it, Mobile-R1 completes the task above successfully.

To ensure training stability, the team proposes a three-stage training process: format fine-tuning, action-level training, and task-level training. They also introduce a new Chinese benchmark and a high-quality trajectory dataset, demonstrating the method's effectiveness for mobile agents.

Trajectory dataset
△ Trajectory dataset construction pipeline (figure)
The team uses Qwen2.5-VL-3B to execute a series of tasks to obtain initial trajectories, which are then manually annotated, ...
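To make the action-level vs. task-level distinction above concrete, here is a toy reward sketch. It is reconstructed from the summary only; the helper names, the completion check, and the 0.8/0.2 weighting are illustrative assumptions, not Mobile-R1's actual reward design.

```python
# Toy contrast between an action-level reward (scores each step in isolation)
# and a task-level reward (scores the whole multi-step trajectory). All helpers
# and weights are illustrative assumptions.
from typing import Callable

Action = dict            # e.g. {"type": "tap", "target": "hotel packages"}
Trajectory = list        # the list of Actions taken for one instruction

def action_level_reward(action: Action, reference_action: Action) -> float:
    """SFT/RL-style reward: did this single step match the annotated best action?"""
    return 1.0 if action == reference_action else 0.0

def task_level_reward(trajectory: Trajectory,
                      task_completed: Callable[[Trajectory], bool],
                      well_formed: Callable[[Action], bool]) -> float:
    """Trajectory-level reward: was the full instruction actually completed,
    and were the emitted actions well formed?"""
    completion = 1.0 if task_completed(trajectory) else 0.0
    formatting = sum(map(well_formed, trajectory)) / max(len(trajectory), 1)
    return 0.8 * completion + 0.2 * formatting   # weights made up for illustration
```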
AI Busting AI Fakes Takes SOTA | Xiamen University & Tencent Youtu
量子位· 2025-07-20 02:49
Core Viewpoint
- The article discusses the AIGI-Holmes method developed by Xiamen University and Tencent Youtu Lab for detecting AI-generated images, addressing the challenges of interpretability and generalization in existing detection models [2][12][36].

Group 1: Methodology
- AIGI-Holmes employs a "large model + visual expert" collaborative architecture to enhance image detection capabilities [2][5].
- The method uses a dual-visual-encoder architecture that integrates NPR visual experts to process both high-level semantics and low-level visual features [6].
- The Holmes Pipeline consists of three training phases: visual expert pre-training, supervised fine-tuning (SFT), and direct preference optimization (DPO) [7][22].

Group 2: Key Innovations
- AIGI-Holmes addresses two critical bottlenecks in existing detection technologies: lack of interpretability and limited generalization capability [12][36].
- A new dataset, Holmes-Set, containing 45,000 images and 20,000 annotations was constructed to ease data scarcity, covering various types of generation defects [15][18].
- The architecture includes a collaborative decoding strategy that merges predictions from the visual experts and the large language model to enhance detection accuracy (a toy fusion sketch follows this summary) [8][25].

Group 3: Performance Evaluation
- Experimental results indicate that AIGI-Holmes outperforms existing methods across all benchmarks in both detection accuracy and interpretability [10][29].
- The model achieved the best results on objective metrics (BLEU/ROUGE/METEOR/CIDEr) and in subjective evaluations compared with current advanced models [31].
- In robustness tests against common distortions such as JPEG compression and Gaussian blur, AIGI-Holmes maintained superior detection accuracy over other baseline methods [33][35].

Group 4: Future Directions
- The team acknowledges limitations such as hallucination, where the model may misinterpret normal features as defects, and the need for more fine-grained understanding of visual defects [36][39].
- Future work will focus on the hallucination issue, fine-grained understanding, and objective evaluation metrics for visual-defect explanations [39].
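The collaborative decoding idea mentioned in Group 2, merging the visual expert's prediction with the large model's judgment, could look roughly like the sketch below. The scoring functions, the fixed 0.5 fusion weight, and the threshold are assumptions for illustration, not the published AIGI-Holmes implementation.

```python
# Rough sketch of fusing a visual expert's "AI-generated" probability with a
# multimodal LLM's probability, in the spirit of the collaborative decoding
# described above. Both scorers below are dummy placeholders so the code runs.

def expert_score(image) -> float:
    """Hypothetical visual expert (e.g. an NPR-feature classifier): P(AI-generated)."""
    return 0.9   # placeholder value

def llm_score(image) -> float:
    """Hypothetical multimodal-LLM judgment mapped to P(AI-generated)."""
    return 0.6   # placeholder value

def collaborative_decision(image, alpha: float = 0.5, threshold: float = 0.5) -> bool:
    """Blend the two probabilities and threshold the result."""
    p = alpha * expert_score(image) + (1.0 - alpha) * llm_score(image)
    return p >= threshold

print(collaborative_decision(image=None))   # True: 0.75 >= 0.5
```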
Serious Vulnerability Exposed in NVIDIA GPUs, Crashing Model Accuracy by 99.9%
量子位· 2025-07-20 02:49
Core Viewpoint
- The article discusses a serious vulnerability in NVIDIA GPUs, exploited via an attack called GPUHammer, which can drive the accuracy of AI models running on affected GPUs from 80% down to as low as 0.02% [1][2][14].

Summary by Sections
Vulnerability Discovery
- A significant vulnerability in NVIDIA GPUs has been identified by white-hat hackers [1].
- The GPUHammer attack can lead to catastrophic failures in AI model accuracy [2][3].

Attack Mechanism
- GPUHammer is the first successful Rowhammer attack targeting GPU memory; it exploits a physical effect rather than a software bug [6].
- The attack repeatedly "hammers" a specific memory row, causing bit flips in adjacent rows and thereby silently altering data [7][8].
- Researchers successfully flipped critical bits in deep-learning model weights, leading to severe degradation in model performance (a software illustration of a single-bit flip follows this summary) [9][10].

Experimental Results
- The attack was tested on classic neural network architectures such as AlexNet, VGG, and ResNet, showing that even a single bit flip can collapse model performance [11][12].
- For instance, the accuracy of ResNet50 dropped from 80.26% to 0.02% after the attack [12].

Implications
- GPUHammer poses a significant threat to AI infrastructure, potentially leading to misidentifications in autonomous vehicles and misdiagnoses in medical AI applications [13][14].
- In shared GPU environments, a malicious tenant could exploit the vulnerability to affect the performance of adjacent workloads [13].

Mitigation Measures
- NVIDIA has recommended that users enable system-level error-correcting code (ECC) as a defense against GPUHammer attacks [15][16].
- ECC can correct single-bit errors but is limited against double-bit flips, and enabling it may reduce performance by roughly 3%-10% [19].

Future Considerations
- Different GPU architectures vary in susceptibility to Rowhammer attacks; models such as the RTX 3080 and A100 are less affected due to their distinct DRAM designs [22].
- Future GPU designs may include on-die ECC to strengthen protection against such attacks [22].
- As AI technology advances, the need for robust security measures will become increasingly critical; GPUHammer is likely only the beginning of such vulnerabilities [23].
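To see why a single flipped bit in a weight is so destructive, the snippet below emulates a one-bit flip in a float32 value in software. It only demonstrates the numerical consequence; it does not perform Rowhammer or touch GPU memory.

```python
# Software emulation of a single-bit flip in a float32 model weight. Flipping a
# high exponent bit turns a tiny weight into an enormous one, which is why one
# flipped DRAM bit can collapse a network's accuracy.
import struct

def flip_bit(value: float, bit: int) -> float:
    """Flip bit `bit` (0 = least significant) of the float32 encoding of `value`."""
    (as_int,) = struct.unpack("<I", struct.pack("<f", value))
    (result,) = struct.unpack("<f", struct.pack("<I", as_int ^ (1 << bit)))
    return result

w = 0.0123                      # a typical small network weight
print(flip_bit(w, 0))           # flipping a low mantissa bit barely changes the value
print(flip_bit(w, 30))          # flipping the top exponent bit makes it astronomically large
```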
No NeRF/Gaussian-Splat Post-Processing Needed: Turning Video into Game-Ready Models Becomes Reality! The New Method Averages Just 60 Seconds per Frame | ICCV 2025
量子位· 2025-07-19 05:15
Core Viewpoint
- The article discusses V2M4, a new method developed by a research team at KAUST that generates usable 4D mesh animations directly from monocular video, significantly improving the efficiency and usability of animation and game content generation [1][6].

Summary by Sections
Method Overview
- V2M4 constructs a systematic multi-stage pipeline comprising camera trajectory recovery, appearance optimization, topology unification, and texture synthesis, allowing videos to be transformed into models quickly [2][6].

Performance Metrics
- The generated appearance and structure are faithfully reproduced, with an average processing time of about 60 seconds per frame, far faster than existing methods; the method also handles long videos, performing well even on 300-frame sequences [4][20].

Challenges in Video-to-Animation Conversion
- Converting a video into continuous animated mesh assets has long been a challenge in visual computing, traditionally requiring costly multi-camera setups and motion capture; implicit methods such as NeRF can replicate appearance but struggle to output topologically consistent explicit meshes [4][5].

Camera Trajectory Recovery
- V2M4 employs a three-stage camera estimation strategy to reconstruct the camera pose for each video frame, converting "camera motion" into "mesh motion" so dynamic scenes can be modeled accurately [10][11].

Appearance Consistency Optimization
- To address appearance discrepancies, V2M4 adapts null-text optimization from image editing to fine-tune the conditional embeddings of the generation network, enhancing the visual fidelity of the generated meshes [13][15].

Topology Unification
- V2M4 introduces a frame-by-frame registration and topology-unification mechanism so that all frames share a consistent topology, which is crucial for subsequent texture generation and temporal interpolation (a toy illustration of the idea follows this summary) [16].

Texture Consistency Optimization
- A shared global texture map is constructed for all frames to eliminate flickering and discontinuities, ensuring a smooth visual experience throughout the animation [17].

Animation Export
- The method time-interpolates and packages the generated mesh sequences, producing a smooth animation that can be exported as a glTF-compliant file for mainstream graphics and game engines [18].

Performance Validation
- V2M4 is evaluated on challenging video data, demonstrating comprehensive advantages in reconstruction quality, runtime efficiency, and generalization capability [19][20].

Visual Comparison
- Visual results show that V2M4 generates meshes with superior rendering detail, normal structure, and inter-frame consistency, achieving high fidelity and stable generation of continuous animations [21].
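The topology-unification step above can be illustrated with a toy nearest-neighbor registration: every frame's mesh is re-expressed with the vertex ordering of a fixed reference mesh. This is only a conceptual sketch; V2M4's actual registration is far more sophisticated.

```python
# Toy illustration of "topology unification": snap each reference vertex to its
# nearest vertex on the current frame's candidate mesh, so every frame shares the
# reference mesh's vertex count and ordering (and hence its faces and UVs).
import numpy as np

def unify_topology(reference_vertices: np.ndarray, candidate_vertices: np.ndarray) -> np.ndarray:
    """reference_vertices: (N, 3); candidate_vertices: (M, 3); returns (N, 3)."""
    # Pairwise distances between reference and candidate vertices.
    d = np.linalg.norm(reference_vertices[:, None, :] - candidate_vertices[None, :, :], axis=-1)
    return candidate_vertices[d.argmin(axis=1)]

ref = np.random.rand(100, 3)             # fixed-topology reference mesh vertices
frame = np.random.rand(120, 3) + 0.05    # this frame's independently generated mesh
aligned = unify_topology(ref, frame)     # same shape/order as ref, lying on the frame's mesh
print(aligned.shape)                     # (100, 3)
```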
An AI "Stress Interview": DeepSeek Performance Plunges Nearly 30% | Tsinghua & Shanghai AI Lab
量子位· 2025-07-19 05:15
Core Viewpoint
- The article presents REST (Reasoning Evaluation through Simultaneous Testing), a new "stress test" framework designed to evaluate the reasoning capabilities of large language models (LLMs) under pressure, revealing significant performance drops in multi-question scenarios [1][3][20].

Group 1: Stress Test Framework
- REST presents multiple questions to a model simultaneously, simulating real-world complex reasoning scenarios (a minimal construction sketch follows this summary) [2][6].
- The framework was developed by research teams from Shanghai AI Lab, Tsinghua University, and Renmin University of China to address limitations in current evaluation methods [1][6].

Group 2: Performance Findings
- Top models such as DeepSeek-R1 showed a drastic accuracy drop of 29.1% on the AIME24 test set under stress conditions [3][11].
- Performance degraded across models, with smaller 7B-parameter models deteriorating faster under pressure than larger 32B-parameter models [13][19].

Group 3: Evaluation Limitations
- Current evaluation methods suffer from three main problems: low differentiation among top models, high cost of developing new test questions, and a lack of realism in single-question testing [5][6].
- REST addresses these issues by combining multiple questions into a single prompt, allowing a more comprehensive assessment of reasoning ability [6][20].

Group 4: Key Reasoning Abilities
- The stress test probes several critical reasoning abilities, including context budget allocation, resistance to cross-question interference, and dynamic cognitive load management [7][8][9].
- Models that manage token allocation effectively under pressure tend to perform better, demonstrating adaptive distribution of reasoning effort [17][19].

Group 5: Implications for Future Development
- The findings suggest that traditional single-question evaluations may overlook significant reasoning flaws, such as question omission and incorrect reasoning summaries [20].
- REST provides a new paradigm for constructing evaluation datasets that are more cost-effective and closer to real-world applications, offering insights for developing more robust LLMs [20].
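The core REST manipulation in Group 1, packing several questions into one prompt and scoring them jointly, can be sketched as follows. The prompt template and the naive exact-match parsing are illustrative assumptions, not the benchmark's released code.

```python
# Minimal sketch of a REST-style "stress" evaluation: concatenate k benchmark
# questions into one prompt and measure per-question accuracy, to be compared
# against the usual one-question-per-prompt accuracy.

def build_stress_prompt(questions: list) -> str:
    numbered = "\n\n".join(f"Question {i + 1}: {q}" for i, q in enumerate(questions))
    return (f"Answer all {len(questions)} questions below. "
            f"Give each final answer on its own line as 'Answer i: ...'.\n\n{numbered}")

def stress_accuracy(model, questions: list, answers: list, k: int = 3) -> float:
    correct, total = 0, 0
    for i in range(0, len(questions), k):
        batch_q, batch_a = questions[i:i + k], answers[i:i + k]
        reply = model(build_stress_prompt(batch_q))        # one call, several questions
        for j, gold in enumerate(batch_a):
            total += 1
            correct += f"Answer {j + 1}: {gold}" in reply  # naive exact-match check
    return correct / max(total, 1)

# Usage: pass any callable that maps a prompt string to a reply string.
print(stress_accuracy(lambda p: "Answer 1: 4", ["2+2=?", "3+3=?"], ["4", "6"], k=2))  # 0.5
```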
The Real-Time AI Video Generation Model Even Karpathy Backed: Converts Even Live Streams Instantly, Unlimited Duration, Near-Zero Latency
量子位· 2025-07-19 05:15
Core Viewpoint
- The article discusses the AI startup Decart and its video model MirageLSD, which enables real-time, zero-latency video generation and could reshape live streaming, gaming, and video communication [4][5][7].

Group 1: Technology and Features
- MirageLSD is the first AI model to achieve zero-latency, unlimited-duration real-time video generation, producing continuous video streams without a time limit [4][5].
- The model runs roughly 16 times faster than previous models, generating video at 24 frames per second while accepting ongoing prompts, transitions, and edits during generation [6][28].
- It addresses the error-accumulation problem found in traditional autoregressive video models, maintaining temporal coherence while generating content frame by frame [9][11].

Group 2: Innovations and Mechanisms
- The model employs a custom real-time stream diffusion approach (Live-Stream Diffusion) that generates each frame from previously generated frames and the user prompt, rather than from an entire pre-planned video sequence (a rough loop sketch follows this summary) [14].
- It applies Diffusion Forcing to independently denoise single frames during training, preserving coherence across generated frames [15].
- A history-augmentation strategy simulates artifacts during training so the model learns to correct potential errors preemptively [16].

Group 3: Performance and User Interaction
- MirageLSD's architecture includes an improved Transformer model and a specially designed visual encoder, which increases processing speed and reduces latency [18][20].
- A dynamic input mechanism processes player inputs with ultra-low latency, allowing immediate responses to changes in the environment [22].
- Users can perform actions such as changing outfits or transforming objects with minimal delay, showcasing the model's interactive capabilities [23].

Group 4: Company Background and Future Developments
- Decart, the company behind MirageLSD, was founded in 2023 and previously launched the Oasis model, which also supports real-time interaction [25][26].
- The team plans regular upgrades and new features for MirageLSD, including facial consistency, voice control, and precise object manipulation to enhance the user experience [28].
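The frame-by-frame loop described in Group 2, where each output frame is conditioned on a short history of previous outputs plus the live prompt rather than on a whole pre-planned clip, can be sketched roughly as below. The `denoise_frame` interface and the 8-frame history window are assumptions, not Decart's actual API.

```python
# Conceptual loop in the spirit of the Live-Stream Diffusion description above:
# each frame is produced from recent outputs plus the current prompt, so the
# stream runs indefinitely and reacts to prompt changes immediately.
from collections import deque

def run_live_stream(denoise_frame, input_frames, get_current_prompt, history_len=8):
    history = deque(maxlen=history_len)        # only a short window of past outputs
    for frame in input_frames:                 # e.g. webcam or game frames at 24 fps
        prompt = get_current_prompt()          # the prompt may change mid-stream
        out = denoise_frame(frame, list(history), prompt)
        history.append(out)                    # the next frame conditions on this one
        yield out                              # emitted immediately: no buffering

# Usage with trivial stand-ins, just to show the call shape:
stream = run_live_stream(lambda f, h, p: (f, p), range(5), lambda: "anime style")
print(list(stream))
```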
Unitree's Wang Xingxing: A-Share IPO Counseling Filing Announced
量子位· 2025-07-19 05:15
鹭羽 白交 | 量子位 QbitAI

After Zhi Hui Jun, Wang Xingxing has also arrived at the door of the capital markets. Nine years after founding, Unitree (宇树科技) has finally reached the IPO threshold, and this time it is no longer a rumor. According to the China Securities Regulatory Commission website, Unitree has filed for listing counseling with the Zhejiang Securities Regulatory Bureau and published its initial public offering (IPO) and listing counseling report, marking the official start of its push for an A-share listing.

Wang Xingxing's shareholding was also disclosed: he directly holds 23.82% of the company and controls 34.76% of the equity in total through limited-partnership platforms. The question of which company will become the "first embodied-intelligence stock" is once again open.

Unitree kicks off its IPO
If all goes smoothly, a comprehensive evaluation of the company will be completed as early as October 2025 to produce listing application documents that meet the requirements.

| Item | Detail |
| --- | --- |
| Counseling target | Hangzhou Unitree Technology Co., Ltd. (杭州宇树科技股份有限公司, "Unitree", the "Company") |
| Date of establishment | August 26, 2016 |
| Registered capital | RMB 364,017,906 (36,401.7906 万元) |
| Legal representative | Wang Xingxing (王兴兴) |
| Registered address | Room 306, Building 1, No. 88 Dongliu Road, Xixing Subdistrict, Binjiang District, Hangzhou, Zhejiang |
| Controlling shareholder and shareholding | The Company's controlling shareholder and actual controller is Mr. Wang Xingxing, who directly holds 23. ... |