量子位

Huawei's multi-path reasoning cracks large models' math bottleneck, accuracy tops 97% | ICML 2025
量子位· 2025-07-03 09:00
Submitted by the FoT team | 量子位 | 公众号 QbitAI

Large models keep getting bigger and stronger at general tasks, yet they still routinely stumble on complex problems in math, science, and logic. To tackle this pain point, Huawei's Noah's Ark Lab has proposed a new high-order reasoning framework: Forest-of-Thought (FoT). Borrowing the human cognitive habit of thinking from multiple angles and verifying repeatedly, FoT breaks the linear reasoning paradigm of traditional LLMs by building multiple parallel reasoning trees and introducing a dynamic self-correction mechanism together with a multi-perspective consensus decision strategy. The paper will be presented and open-sourced at ICML 2025 in July.

On this basis, FoT stands out across multiple mathematical reasoning tasks and shows stronger reasoning ability than ToT (Tree-of-Thought). Specifically, on GSM8K a QwQ-32B model combined with FoT reaches 97.33% accuracy, surpassing advanced models such as GPT-4o and rStar-Math; on the more challenging AIME 2024 benchmark it lifts accuracy to 53.33%, 6.66 percentage points above the comparison method rStar-Math. (Table 5 of the paper, which summarizes FoT's performance, is truncated in the source.)
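The core loop sketched above (spawn several independent reasoning trees, let each self-correct, then settle the final answer by consensus) can be pictured in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `solve_tree` is a hypothetical callable that runs one tree-structured reasoning pass and returns a candidate answer.

```python
from collections import Counter
from typing import Callable, List

def forest_of_thought(
    question: str,
    solve_tree: Callable[[str, int], str],  # hypothetical: one tree-search reasoning pass
    num_trees: int = 4,
) -> str:
    """Sketch of FoT-style inference: parallel reasoning trees + consensus vote."""
    candidates: List[str] = []
    for seed in range(num_trees):
        # Each tree explores the problem independently; in the paper the trees
        # also self-correct along the way, which solve_tree would encapsulate.
        candidates.append(solve_tree(question, seed))
    # Multi-perspective consensus: return the answer most trees agree on.
    answer, _ = Counter(candidates).most_common(1)[0]
    return answer
```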
Gemini lead spills the details! Unified token representation across modalities, with vision as a critical piece
量子位· 2025-07-03 06:58
Yishui and Wenle, from Aofeisi | 量子位 | 公众号 QbitAI

Just now, Ani Baddepudi, product lead for Gemini model behavior, went into full disclosure mode on Google's own developer channel, unpacking Gemini's multimodal technology in one go. Together with Logan Kilpatrick (right), a former OpenAI employee who now leads product for Google AI Studio, he dug into questions many people have long been curious about. In a sentence, the entire conversation revolved around Gemini's multimodality: the design philosophy behind it, its current applications, and its future direction.

The talk deserves attention because Gemini's multimodality is simply that prominent and that important. In December 2023, Google launched its natively multimodal Gemini 1.0, pushing the AI race from the text arena dominated by ChatGPT into the multimodal era. The latest Gemini 2.5 Pro (0605) not only takes another step up on code and reasoning tasks but also ranks first in visual capability, cementing Google's lead in the multimodal field.

Looking back at some of Gemini's original design decisions, their foresight and originality not only laid a solid foundation for later development but still offer guidance for the future. Take notes: the conversation is packed with substance, so let's get started.

Why was Gemini designed as multimodal from the start? An agent's ...
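The headline idea of a unified token representation (every modality mapped into one token sequence that a single transformer consumes) can be pictured with a toy example. The snippet below is purely illustrative: `text_tokenizer`, `image_tokenizer`, and the marker ids are invented stand-ins, not Gemini's actual tokenization.

```python
from typing import Callable, List

def build_multimodal_sequence(
    text_tokenizer: Callable[[str], List[int]],      # hypothetical text tokenizer
    image_tokenizer: Callable[[object], List[int]],  # hypothetical: image -> patch/codebook ids
    prompt: str,
    image: object,
) -> List[int]:
    """Toy view of a natively multimodal input: text and image tokens are
    interleaved into one stream so the model attends across modalities jointly."""
    BOI, EOI = -1, -2  # placeholder begin/end-of-image marker ids (illustrative)
    tokens = list(text_tokenizer(prompt))
    tokens += [BOI] + list(image_tokenizer(image)) + [EOI]
    return tokens
```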
Registration opens for the AI 100 product rankings
量子位· 2025-07-03 06:58
Core Insights
- The article highlights the ongoing "true user value" battle for AI products in China, emphasizing the need for products that address real user pain points rather than just novelty [1][2].

Group 1: AI Product Landscape
- As of April 2025, only 14 AI apps in China have over one million daily active users (DAU), and only 23 AI web products have over one million monthly active users (MAU), indicating a significant challenge in user retention and differentiation [2].
- The majority of AI applications are struggling with user abandonment and lack of innovation, which necessitates a focus on practical applications that can withstand market fluctuations [2].

Group 2: AI 100 Initiative
- The "AI 100" initiative by Quantum Bit Think Tank aims to provide a comprehensive reference for AI product innovation and transformation, featuring both quantitative and qualitative assessments [3][4].
- The initiative includes quarterly rankings such as "Comprehensive AI 100", focusing on user feedback, and "Emerging AI 100", targeting high-growth-potential products, along with monthly nominations for standout products [4][5].

Group 3: Evaluation Metrics
- The quantitative evaluation of "AI 100" is based on real user data, covering four primary indicators: user scale, user growth, user activity, and user stickiness, with over 20 secondary metrics (a toy composite-score sketch follows this entry) [5].
- The qualitative assessment considers long-term development potential through expert scoring and in-depth analysis of factors like underlying technology, market space, and monetization potential [5].

Group 4: Engagement and Community
- The Quantum Bit Think Tank is currently recruiting nominations for the first "AI 100" dual rankings of 2025, inviting entrepreneurs, investors, and AI enthusiasts to participate [7].
- In addition to the "AI 100" series, the organization offers ongoing analysis and insights into AI products through various formats, including data reports and deep-dive interviews [9].
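As a toy illustration of how the four quantitative pillars could be combined into one ranking score, here is a hedged sketch. The weights, and the assumption that each indicator is pre-normalized, are invented for illustration and are not the think tank's actual methodology.

```python
def ai100_score(user_scale: float, user_growth: float,
                user_activity: float, user_stickiness: float) -> float:
    """Hypothetical composite over the four primary indicators, each assumed
    pre-normalized to [0, 1]; the weights below are illustrative only."""
    weights = (0.30, 0.25, 0.25, 0.20)  # scale, growth, activity, stickiness
    values = (user_scale, user_growth, user_activity, user_stickiness)
    return sum(w * v for w, v in zip(weights, values))
```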
New work from Saining Xie's team: precise 3D scene control without prompts
量子位· 2025-07-03 04:26
Core Viewpoint
- The article discusses the innovative Blender Fusion framework developed by Saining Xie's team, which combines graphics tools (Blender) with diffusion models to enable precise control and flexible manipulation of visual compositions, moving beyond traditional text prompts [6][9].

Group 1: Blender Fusion Framework
- Blender Fusion allows users to control the positioning, rotation, and scaling of objects in generated images using keyboard or mouse inputs [2][4].
- The framework operates through a new pipeline with three main steps: object and scene separation, 3D editing in Blender, and high-quality image generation using diffusion models (sketched in code after this entry) [10][9].

Group 2: Step-by-Step Process
- The first step is object-centric layering: objects are separated from the original scene, and their 3D information is inferred using existing visual models such as Segment Anything Model (SAM) and Depth Pro [13][14].
- The second step is Blender-grounded editing, allowing detailed editing of the separated objects and camera control within Blender [18].
- The final step is generative compositing, where a dual-stream diffusion compositor enhances the visual quality of the rendered scene while maintaining global consistency [23][22].

Group 3: Techniques and Results
- Two important training techniques are introduced: source masking, which helps the model learn to restore complete images from conditional information, and simulated object jittering, which improves the model's ability to decouple camera and object movements [24].
- Blender Fusion demonstrates effective visual generation, maintaining spatial relationships and visual coherence in complex scene edits, including single-image processing and multi-image scene reorganization [25][29].

Group 4: User Experience and Implications
- The framework gives creators greater freedom and control, letting them manipulate visual elements without being constrained by text prompts [33].
- The pipeline from object layering to high-fidelity generation makes AI image synthesis more intuitive and flexible, akin to building with blocks [35].
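To make the three-step dataflow concrete, here is a minimal Python sketch. The `tools` bundle of wrappers (around SAM, Depth Pro, Blender, and the diffusion compositor) is hypothetical; none of these interfaces come from the paper, and the sketch only mirrors the pipeline order described above.

```python
def blender_fusion_pipeline(image, instructions, tools):
    """Illustrative dataflow for the three-step Blender Fusion pipeline."""
    # Step 1: object-centric layering -- separate objects and lift them to 3D.
    masks = tools.segment(image)          # e.g. a SAM wrapper (assumed interface)
    depth = tools.estimate_depth(image)   # e.g. a Depth Pro wrapper (assumed interface)
    objects_3d = [tools.lift_to_3d(m, depth) for m in masks]

    # Step 2: Blender-grounded editing -- reposition, rotate, and scale the
    # separated objects and adjust the camera according to the instructions.
    scene = tools.blender_edit(objects_3d, instructions)

    # Step 3: generative compositing -- a dual-stream diffusion compositor
    # refines the coarse Blender render while keeping global consistency.
    coarse = tools.render(scene)
    return tools.diffusion_composite(source=image, render=coarse)
```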
2,000 GitHub stars in a week! Chinese unified image generation model gets a major upgrade: understanding and quality both up, and it has learned to "reflect"
量子位· 2025-07-03 04:26
Core Viewpoint
- The article covers a major upgrade to OmniGen, a Chinese open-source unified image generation model: the newly released version 2.0 supports text-to-image generation, image editing, and theme-driven image generation [1][2].

Summary by Sections

Model Features
- OmniGen2 improves context understanding, instruction adherence, and image generation quality while keeping a simple architecture [2].
- The model supports both image and text generation, further integrating the multimodal technology ecosystem [2].
- Its capabilities include natural-language-based image editing, allowing local modifications such as object addition and removal, color adjustments, expression changes, and background replacements [6][7].
- OmniGen2 can extract specified elements from input images and generate new images based on those elements, excelling at maintaining object similarity rather than facial similarity [8].

Technical Innovations
- The model employs a separated architecture with a dual-encoder strategy using ViT and VAE, enhancing image consistency while preserving text generation capabilities [14][15].
- To address gaps in foundational data and evaluation, the team developed a process that derives image-editing and in-context reference data from video and image sources [18].
- Inspired by large language models, OmniGen2 integrates a reflection mechanism into its multimodal generation, iteratively improving outputs based on user instructions and its own prior generations (a minimal sketch of such a loop follows this entry) [20][21][23].

Performance and Evaluation
- OmniGen2 achieves competitive results on existing text-to-image and image-editing benchmarks [25].
- The new OmniContext benchmark, with eight task categories assessing consistency in person, object, and scene generation, targets limitations in current evaluation methods [27].
- OmniGen2 scored 7.18 on the new benchmark, outperforming other leading open-source models and balancing instruction adherence with subject consistency across task scenarios [28].

Deployment and Community Engagement
- The model's weights, training code, and training data will be fully open-sourced, giving community developers a foundation to optimize and extend the model [5][29].
- The model has drawn strong open-source interest, with over 2,000 GitHub stars within a week and hundreds of thousands of views on related topics [3].
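The reflection mechanism noted under Technical Innovations amounts to a generate, critique, regenerate loop. Below is a minimal hedged sketch of such a loop; `generate_image`, `critique`, and the `satisfied`/`notes` fields are hypothetical stand-ins, not OmniGen2's actual interfaces.

```python
def generate_with_reflection(instruction, generate_image, critique, max_rounds=3):
    """Illustrative multimodal reflection loop: generate an image, self-assess
    it against the instruction, and regenerate with the critique in context."""
    image = generate_image(instruction)              # hypothetical generator
    for _ in range(max_rounds):
        feedback = critique(instruction, image)      # hypothetical self-assessment
        if feedback.satisfied:                       # assumed field on the feedback object
            break
        # Fold the model's own critique back in so the next attempt can fix gaps.
        image = generate_image(instruction, prior=image, notes=feedback.notes)
    return image
```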
DeepSeek-R2!? Mystery model spotted in the arena, real identity fuels netizen speculation
量子位· 2025-07-03 04:26
Core Viewpoint
- The article discusses the emergence of a mysterious model named "steve" from DeepSeek, sparking speculation about its identity and how it compares to models like R2 and V4 [1][5][19].

Group 1: Model Identity and Speculation
- Users are speculating about "steve's" identity, with guesses ranging from R2 to V4 to an upgraded version of an older model [3][19].
- "steve" has been confirmed to be associated with DeepSeek, although further details about its identity remain undisclosed [8][19].
- The model is not visible on the public page, but traces of it can be found in the front-end code [5][6].

Group 2: Performance Comparison
- Initial tests show "steve" passing certain intelligence tests while failing some questions [11].
- In a comparison with V3, "steve" produced approximately 300 lines of game code, while V3 generated around 800 lines [13].
- Overall, "steve's" performance is seen as underwhelming compared to V3 and R1, raising doubts that it is R2 [22][19].

Group 3: Development and Release Timeline
- The anticipated release of R2 has been delayed again, reportedly because CEO Liang Wenfeng is dissatisfied with its performance [25].
- The slow progress of R2's development may be linked to a shortage of NVIDIA H20 chips [26].
- Rumored R2 specifications include 1.2 trillion parameters and 5.2 petabytes of training data, although these claims remain unverified [32].
The more large models reflect, the more they err: long-chain reasoning deepens hallucination through self-persuasion | BUPT
量子位· 2025-07-03 04:26
Submitted by the BUPT cybersecurity team | 量子位 | 公众号 QbitAI

The risk gap: long-chain CoT amplifies the "error snowball"

Reasoning LLMs (RLLMs) can decompose a complex problem into dozens of reasoning steps and produce seemingly rigorous conclusions. Yet as reasoning chains grow longer, an unsettling trend emerges: errors are no longer occasional slips but snowball along the chain. In high-stakes settings such as medicine, finance, and law, one tiny deviation can turn into a disaster. When chains stretch from 3 steps to 50+, the hallucination rate increases tenfold, and reflection nodes are powerless to stop it.

Unfortunately, current safety evaluation remains almost entirely outcome-level: judging answers right or wrong, measuring toxicity or not, like grading an exam by the final score alone. This overlooks a key question: how exactly do errors take root, spread, and solidify within the chain? Without insight into that mechanism, a targeted remedy is hard to find.

To answer it, the research team at Beijing University of Posts and Telecommunications (BUPT) ran chain-of-thought audit experiments and, for the first time, quantitatively exposed the metacognitive bias behind this "the more it thinks, the more wrong it gets" phenomenon: reflection in long-chain reasoning is not an error-correction mechanism but a "rationality certificate" issued to hallucinations. To stay semantically consistent with the user's prompt, the model would rather rewrite protocol definitions than reject the premise. The team first built a controlled knowledge domain from RFC protocol documents, then had models generate long chains of 30 to 60 steps, inserting reflection ...
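Here is a hedged sketch of what such a step-level chain-of-thought audit could look like in code. `generate_step`, `insert_reflection`, and `is_hallucinated` are hypothetical helpers standing in for the paper's instrumentation; the step counts mirror the 30-to-60-step setup described above.

```python
def audit_chain(question, generate_step, insert_reflection, is_hallucinated,
                num_steps=50, reflect_every=10):
    """Illustrative step-level audit: record where an error first enters a
    long reasoning chain and whether reflection nodes actually remove it."""
    chain, first_error = [], None
    for i in range(num_steps):
        step = generate_step(question, chain)  # hypothetical next-step generator
        if i > 0 and i % reflect_every == 0:
            # Insert a reflection node at key points, as in the experiment setup.
            step = insert_reflection(question, chain, step)  # hypothetical
        chain.append(step)
        # Check each step against the controlled, RFC-derived knowledge domain.
        if first_error is None and is_hallucinated(step):
            first_error = i
    error_rate = sum(is_hallucinated(s) for s in chain) / num_steps
    return {"first_error_step": first_error, "error_rate": error_rate}
```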
One fake résumé, five Silicon Valley AI salaries: this Indian guy is really something
量子位· 2025-07-03 04:26
Core Viewpoint
- A group of AI startup founders collectively accused an individual named Soham Parekh of deceiving them by working remotely for multiple companies simultaneously under false pretenses [1][9][10].

Group 1
- Soham Parekh allegedly worked for 3 to 4 startups at the same time, using a fabricated résumé and misleading information about his visa status [10][14][25].
- Founders reported that Parekh appeared professional during interviews, and several companies nearly hired him before background checks revealed his deceit [4][25][28].
- The incident sparked widespread discussion on social media, with many users creating memes and jokes highlighting the absurdity of the situation [5][33].

Group 2
- The initial complaint by Suhail Doshi led other founders to share similar experiences, indicating that this issue may not be isolated to one individual [21][40].
- A growing community of people openly share experiences of holding multiple jobs simultaneously, suggesting the practice is more common than previously thought [40][44].
- The phenomenon of "overemployment" raises ethical questions about employees holding multiple positions, even as companies often have multiple founders or executives managing several ventures [44][49].
The inside story of ChatGPT's birth revealed! The team was still agonizing the night before launch
量子位· 2025-07-03 00:45
Core Insights
- The article reveals the last-minute naming of "ChatGPT," finalized only the night before launch; it was originally to be called "Chat with GPT-3.5" [9][11].
- OpenAI's initial hesitance about releasing ChatGPT stemmed from doubts about its performance: only about half of its responses were deemed acceptable during testing [2][12].
- After release, ChatGPT's popularity exploded, and within just a few days the team realized its potential to change the world [3][13].

Group 1: ChatGPT Development and Impact
- The podcast features insights from Mark Chen and Nick Turley, key figures at OpenAI, discussing ChatGPT's rise and its implications [4][5].
- The team faced GPU shortages and service limitations that caused outages, which they addressed humorously with a "fail whale" page [13][15].
- OpenAI improved ChatGPT using Reinforcement Learning from Human Feedback (RLHF) to enhance user experience and retention [15][16].

Group 2: Image Generation Technology
- OpenAI's image generation work, particularly the DALL·E series, also gained significant attention: the first version was released in January 2021, and DALL·E 3 was integrated into ChatGPT in October 2023 [26][22].
- Unexpectedly heavy user engagement with ImageGen highlighted the need for models to generate high-quality outputs that align with user prompts [20][21].
- The team observed that ImageGen was used primarily for practical applications rather than entertainment, contrary to initial expectations [25].

Group 3: Code Generation and Internal Culture
- OpenAI has made strides in code generation with models like Codex and Code Interpreter, focusing on long-horizon problem-solving rather than immediate responses [33][37].
- The company prioritizes curiosity over formal qualifications in hiring, believing a strong desire to learn is crucial in the rapidly evolving AI landscape [39][40].
- Employees are encouraged to use programming tools to enhance productivity and gain insight into product development [37][45].

Group 4: Future Predictions and Challenges
- Predictions for the next 12 to 18 months include advances in AI reasoning and the emergence of new interaction forms, such as asynchronous workflows [47][50].
- Challenges include competition from Meta, which has led to a temporary operational pause and uncertainty regarding the release of future models like GPT-5 [61][62].
- OpenAI's leadership believes that hands-on engagement with AI technology is essential for users to overcome fears and misunderstandings [54][55].
Grok 4 leaks ahead of schedule, xAI raises a massive 70 billion RMB, and Musk announces plans to "rewrite the human knowledge base"
量子位· 2025-07-03 00:45
Core Viewpoint
- xAI, led by Elon Musk, has revealed the upcoming Grok 4 and Grok 4 Code models, skipping the planned Grok 3.5 and signaling an "extreme iteration" strategy of shipping major updates [3][4][12].

Group 1: Grok 4 Features and Ambitions
- Grok 4 is positioned as the "latest and most powerful flagship model," claiming unparalleled performance in natural language, mathematics, and reasoning [6].
- The model currently supports text, with vision and image generation expected soon, along with function calling, structured outputs, and deep reasoning capabilities (a hedged API sketch follows this entry) [7].
- Grok 4's context window is 130,000 tokens, smaller than many leading models, suggesting a focus on reasoning speed and real-time usability over long-text handling [8].
- The model targets enterprise applications such as data extraction, code generation, and text summarization, with domain knowledge in finance, healthcare, law, and science [10].
- Grok 4 Code is purpose-built for programming and can be integrated into code editors like Cursor [11].
- Musk's ambitions include using Grok 4's reasoning capabilities to rewrite the human knowledge base, correcting perceived errors and filling knowledge gaps [14].

Group 2: Funding and Infrastructure
- xAI has closed a $10 billion funding round (approximately 71.6 billion RMB), following a $6 billion Series C just over six months earlier [2][25].
- Participants in the latest round include Valor Equity Partners, Vy Capital, Andreessen Horowitz, Sequoia Capital, and Fidelity Management, using a combination of equity and debt [26].
- The funding will expand xAI's compute: a supercomputing center in Memphis, Tennessee already runs 200,000 GPUs, with a new 1-million-GPU facility planned [28][29].
- AI training workloads pose unique challenges to the power grid; traditional grid designs do not anticipate the rapid load swings that can trigger blackouts [30][32].
- To manage power consumption, xAI is deploying Tesla's Megapack energy storage system and working with utility companies to establish industry standards [35][37].
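For a picture of the developer-facing side, here is a hedged sketch of calling such a model through an OpenAI-compatible chat API, which xAI has exposed for earlier Grok versions. The base URL matches xAI's documented endpoint, but the model id "grok-4" and its availability are assumptions based on this report, not confirmed details.

```python
from openai import OpenAI  # xAI's API has been OpenAI-compatible for earlier Grok models

# Assumed endpoint and model id: illustrative only, not confirmed by xAI.
client = OpenAI(base_url="https://api.x.ai/v1", api_key="XAI_API_KEY")

response = client.chat.completions.create(
    model="grok-4",  # hypothetical model id based on this report
    messages=[
        {"role": "system", "content": "You are a precise data-extraction assistant."},
        {"role": "user", "content": "Summarize the key figures in this filing: ..."},
    ],
)
print(response.choices[0].message.content)
```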