Training 1.8x faster, inference cost down 78%! Precise question selection efficiently accelerates RL training | Tsinghua, KDD
量子位· 2026-02-09 09:50
Core Insights
- The article discusses the significant advancements in reasoning capabilities of large language models (LLMs) through reinforcement learning fine-tuning, particularly highlighting the high costs associated with inefficient training processes [1][2].

Group 1: Training Efficiency
- Traditional training methods like "Uniform Sampling" waste computational resources by randomly selecting questions that do not provide effective learning signals [2].
- The "Dynamic Sampling" approach, while more efficient, still incurs high costs due to the need for extensive self-evaluation by the model [2][6].
- The proposed MoPPS framework aims to dynamically predict question difficulty without the expensive self-evaluation process, thus enhancing training efficiency [3][6].

Group 2: MoPPS Framework
- MoPPS utilizes a lightweight Bayesian model to quickly estimate question difficulty, allowing for efficient selection of training data [8][10].
- The framework models each question as a "bandit" problem, using a Beta distribution to estimate success rates based on training feedback [9][10].
- MoPPS introduces a recursive update mechanism that improves difficulty estimation over time, adapting to the model's evolving capabilities [11][13].

Group 3: Performance Improvements
- MoPPS has demonstrated a training speed increase of 1.6x to 1.8x while reducing inference costs by up to 78.46% compared to traditional methods [18][21].
- The framework has shown significant advantages across various reasoning tasks, achieving better performance with fewer computational resources [18][21].
- The correlation between predicted and actual question difficulty is high, validating the effectiveness of MoPPS in accurately estimating task challenges [25][29].

Group 4: Versatility and Future Applications
- MoPPS is compatible with multiple reinforcement learning algorithms and can adapt to different sampling strategies, enhancing its applicability [26][28].
- The framework's ability to incorporate prior knowledge can further accelerate initial training phases, making it a versatile tool for large-scale model fine-tuning [28][31].
- The research indicates potential for broader applications in the reinforcement learning fine-tuning of larger models in the future [31].
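The Beta-distribution bandit idea in the MoPPS summary above can be illustrated with a minimal sketch. This is not the authors' implementation: the class and function names, the uniform Beta(1, 1) prior, the `decay` discount used for the recursive update, the Thompson-style sampling step, and the 0.5 target success rate are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass

_MIN_COUNT = 1e-3  # assumed floor so Beta parameters stay strictly positive


@dataclass
class QuestionPosterior:
    """Beta(alpha, beta) belief over the model's success rate on one question."""
    alpha: float = 1.0  # pseudo-count of successes (assumed uniform prior)
    beta: float = 1.0   # pseudo-count of failures (assumed uniform prior)

    def update(self, successes: int, failures: int, decay: float = 0.9) -> None:
        # Recursive update: discount old evidence (hypothetical decay factor),
        # then add the newest rollout outcomes as fresh pseudo-counts.
        self.alpha = max(decay * self.alpha + successes, _MIN_COUNT)
        self.beta = max(decay * self.beta + failures, _MIN_COUNT)

    def mean(self) -> float:
        # Posterior mean of the success rate.
        return self.alpha / (self.alpha + self.beta)

    def sample(self, rng: random.Random) -> float:
        # Draw one plausible success rate from the current belief.
        return rng.betavariate(self.alpha, self.beta)


def select_questions(posteriors, batch_size, target=0.5, rng=None):
    """Pick the questions whose sampled success rate is closest to `target`.

    No model rollouts are needed here; only the cheap posteriors are queried.
    """
    rng = rng or random.Random()
    sampled = {qid: p.sample(rng) for qid, p in posteriors.items()}
    ranked = sorted(sampled, key=lambda qid: abs(sampled[qid] - target))
    return ranked[:batch_size]


# Toy usage with made-up feedback from 8 rollouts per question.
posteriors = {qid: QuestionPosterior() for qid in ("q_easy", "q_mid", "q_hard")}
posteriors["q_easy"].update(successes=8, failures=0)
posteriors["q_mid"].update(successes=4, failures=4)
posteriors["q_hard"].update(successes=0, failures=8)
batch = select_questions(posteriors, batch_size=2, rng=random.Random(0))
```

Sampling from the posterior instead of using its mean keeps some exploration: a question with little feedback has a wide Beta distribution and will occasionally be selected even if its current mean is far from the target.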
The true face of AI programming: complete-project pass rate is only 27% | New benchmark from Shanghai Jiao Tong University
量子位· 2026-02-09 08:00
Core Insights
- The article discusses the limitations of AI programming agents in constructing complete software projects from scratch, highlighting a significant drop in performance when tasked with end-to-end project development compared to code completion tasks [6][18][28].

Group 1: AI Programming Agents Performance
- A recent study by a collaborative research team introduced ProjDevBench, the first benchmark to evaluate AI programming agents' ability to develop complete software projects from natural language requirements [5][10].
- The overall acceptance rate (AC rate) for submissions from six mainstream programming agents was only 27.38%, indicating a drastic decline in performance when moving from code completion to from-scratch project construction [7][18].
- The study revealed that AI agents excel at completing existing code but struggle with high-level architecture design and complex logic reasoning [28].

Group 2: Benchmarking Methodology
- ProjDevBench differs from traditional benchmarks by requiring agents to autonomously complete the entire development process without any initial code templates, simulating real-world software engineering tasks [10][30].
- The evaluation mechanism uses a dual assessment approach: an online judging (OJ) system for strict black-box testing (80% weight) and a code review process to identify issues not captured by OJ (20% weight) [13][30].
- The benchmark tasks were carefully selected from approximately 2,800 candidates, focusing on multi-file implementations and complex project-level tasks [14].

Group 3: Failure Modes and Limitations
- The analysis of submission results highlighted several failure modes, including misunderstanding specifications, weak boundary case handling, and a lack of time complexity analysis [21][22].
- AI agents often generated syntactically correct code but missed critical business logic, indicating a gap in understanding the requirements [21].
- The study found a negative correlation between the number of interactions and performance, suggesting that agents tend to get stuck in inefficient trial-and-error loops rather than engaging in deep reasoning [23][25].

Group 4: Future Directions
- The findings emphasize the need for future research to bridge the gap between code completion tools and fully autonomous software engineering capabilities [30].
- The benchmark currently includes only 20 tasks primarily in C++, with plans to expand to other programming languages and task types in the future [29].
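The 80%/20% dual assessment mentioned in the ProjDevBench summary above amounts to a weighted sum. The sketch below only illustrates that arithmetic: the function name, the 0 to 100 score scale, and the sample numbers are assumptions, and the benchmark's actual scoring code may differ.

```python
def combined_score(oj_score: float, review_score: float,
                   oj_weight: float = 0.8, review_weight: float = 0.2) -> float:
    """Weighted sum of an OJ score and a code-review score.

    Both inputs are assumed to be on the same 0-100 scale (hypothetical).
    """
    if abs(oj_weight + review_weight - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return oj_weight * oj_score + review_weight * review_score


# Made-up example: half the black-box tests pass, but the review is strong.
# 0.8 * 50 + 0.2 * 90 = 40 + 18 = 58
example = combined_score(oj_score=50.0, review_score=90.0)
```

With this weighting, a submission that fails every OJ test can score at most 20 out of 100, so black-box correctness dominates the result.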
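Paper by Peking University's Xie Junyi and Yuan Xinyi accepted by one of the four top mathematics journals! Together they crack a 50-year-old conjecture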
量子位· 2026-02-09 08:00
Mengchen, from Aofeisi
量子位 | WeChat official account QbitAI

A paper co-authored by Peking University's Xie Junyi and Yuan Xinyi has been accepted by one of the four top mathematics journals! And it is Acta Mathematica, the one with the lowest annual publication volume among the four.

| For Issue | Seq. | Title | Author(s) | Date of Acceptance |
| --- | --- | --- | --- | --- |
| : | | On Kähler Ricci shrinker surfaces | Yu Li, Bing Wang | 21 January 2025 |
| -- | | On Stevenhagen's conjecture | Peter Koymans, Carlo Pagano | 23 January 2025 |
| -- | | Ray structures on Teichmüller space | Huiping Pan, Michael Wolf | 8 June 2025 |
| -- | | Primes of the form p^2 + nq^2 | Ben Green, Mehtaab ...
Huang Xiaoming and Mahua FunAge guest-star! Zhiyuan Robotics' robot gala really knows how to put on a show
量子位· 2026-02-09 05:52
Core Viewpoint
- The article discusses the first large-scale robot gala organized by Zhiyuan Robotics, showcasing the advancements in robotics and the company's decision to host its own event instead of participating in the traditional Spring Festival Gala [1][32][56].

Group 1: Robot Gala Highlights
- The robot gala featured over 200 robots performing various acts, including dance, skits, magic, and martial arts, demonstrating their capabilities in entertainment [3][30].
- The event marked a significant evolution in robotics, showcasing robots not just performing tasks but also engaging in complex performances, such as the first robot skit and the first human-robot waltz [30][31].
- The gala included a variety of performances, such as a robot singing act from Shouxing Technology, known for its human-like robots, which added to the immersive experience [27][30].

Group 2: Industry Context and Competitors
- Several robotics companies are set to participate in this year's Spring Festival Gala, including Yushu Technology, Magic Atom, Galaxy General, and Songyan Power, all of which are at critical junctures in their development [36][48].
- Yushu Technology, having previously gained attention from its 2025 Spring Festival Gala performance, is pursuing an IPO, while Magic Atom and Galaxy General are also looking to leverage the gala for visibility and market validation [37][42].
- The competition for visibility at the Spring Festival Gala has intensified, prompting Zhiyuan Robotics to withdraw and focus its budget on research and development, which has generated significant discussion in the industry [54][56].
Mystery model "Pony Alpha" goes viral, reportedly GLM-5
量子位· 2026-02-09 05:52
Core Viewpoint
- The article discusses the launch of a new AI model called "Pony Alpha" by OpenRouter, which has generated significant interest and speculation regarding its capabilities and potential identity as a Chinese model, especially with the upcoming Lunar New Year [2][5][23].

Group 1: Model Features and Performance
- Pony Alpha is described as a "stealth model" with a context window of 200K and a maximum output of 131K, optimized for coding, reasoning, and role-playing [6][7][4].
- The model has demonstrated impressive front-end capabilities, comparable to top models like Claude Opus 4.6, achieving complex tasks with single prompt inputs [8].
- Users have successfully created applications such as a global radio broadcasting website and a music player, showcasing Pony Alpha's ability to generate extensive code and sophisticated UI designs [10][12].

Group 2: Speculations and Comparisons
- There is widespread speculation about the true identity of Pony Alpha, with guesses including various models like GLM-5, DeepSeek-V4, and Claude Opus 5.3, but no consensus has been reached [20][23].
- Evidence suggesting Pony Alpha may be a variant of GLM-5 includes user tests revealing similarities in tokenizer usage and stylistic features in generated outputs [23][25][26].
- The timing of the model's release aligns with announcements from other Chinese AI companies, indicating a competitive landscape leading up to the Lunar New Year [27][28].
ByteDance's open-source GUI Agent tops GitHub trending; core technology behind the Doubao phone passes 26k stars
量子位· 2026-02-08 07:11
Core Insights
- The article highlights the success of ByteDance's self-developed technology, specifically the GUI Agent model UI-TARS, which has topped GitHub's trending list and surpassed 26k stars, outperforming OpenAI's official Skills [1][3].

Group 1: Technology Overview
- UI-TARS is a multi-modal AI agent that can perform complex operations on various software through natural language commands, mimicking human interactions with screens [5][9].
- The core logic of UI-TARS is "purely vision-driven," allowing the AI to observe screens like a human eye, enabling it to operate regardless of whether APIs are available or interfaces are complex [11][12].
- The technology includes two main projects: Agent TARS, which operates in both web UI and server environments, and UI-TARS-desktop, a desktop application for local computer and browser operations [6][8].

Group 2: Development and Evolution
- UI-TARS aims to equip agents with four key capabilities: perception, action, reasoning, and memory [21].
- The project began a year ago and has evolved significantly, with the initial version leveraging 6 million high-quality tutorial data to enhance its deep thinking capabilities [20][24].
- Subsequent iterations, such as UI-TARS-1.5 and UI-TARS-2, have improved the agent's performance, addressing data bottlenecks and enhancing its ability to integrate various functionalities [26][28].

Group 3: Market Impact and Future Prospects
- The article notes that UI-TARS has become one of the most popular open-source multi-modal agents, with significant attention from industry leaders [30].
- The technology is positioned to revolutionize how AI interacts with users, as highlighted by industry figures who predict that products like UI-TARS will significantly impact the market by 2025 [32][34].
- The article concludes by emphasizing the potential of GUI agents to bridge the gap between AI capabilities and human tasks, suggesting a transformative effect on productivity and efficiency [37][38].
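Silicon Valley doesn't believe in loyalty! The AI industry now plays like the NBA, and scientists happily pocket "transfer fees"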
量子位· 2026-02-08 07:11
Core Viewpoint
- The loyalty of employees in Silicon Valley has diminished, with significant "acqui-hire" events occurring, indicating a shift towards a "mercenary" culture in the tech industry [1][3].

Group 1: Major Acqui-Hire Events
- In June 2025, Meta invested $14.3 billion to acquire Alexandr Wang from Scale AI [1].
- In July 2025, Google spent $2.4 billion to acquire technology from Windsurf, bringing in its founder Varun Mohan and research team into DeepMind [1].
- In December 2025, NVIDIA reached a $20 billion agreement with Groq to acquire its core inference technology and CEO Jonathan Ross along with key executives [1].

Group 2: Talent Mobility and Motivations
- Talent mobility is categorized into "voluntary" and "involuntary" job changes, with motivations including high salaries, access to cutting-edge resources, and the pursuit of promising technologies [4].
- The trend of researchers moving from Google to OpenAI began in early 2023, with at least five Google Brain researchers joining OpenAI before the launch of ChatGPT [6][7].

Group 3: High Salaries and Recruitment Strategies
- Meta's aggressive recruitment strategy included a compensation package of up to $300 million over four years, with the first year's salary exceeding $100 million [15].
- The competition for AI talent has led to a "mercenary culture," where employees prioritize financial incentives over loyalty to their companies [23][24].

Group 4: Acqui-Hire as a Strategy
- Acqui-hire has become a popular strategy among Silicon Valley giants, allowing companies to acquire talent without the complexities of full mergers [40].
- The case of Google acquiring Windsurf illustrates the potential fallout from such strategies, as remaining employees felt abandoned and betrayed [44].

Group 5: Cultural Shifts in the Tech Industry
- A cultural shift is occurring in the tech industry, where employees are increasingly wary of long-term commitments to a single company, driven by rapid technological advancements [54][57].
- The speed of innovation in AI means that working for a startup can yield experience equivalent to several years in traditional tech roles [57].

Group 6: Domestic Talent Wars
- The competition for AI talent is not limited to Silicon Valley; domestic companies are also aggressively recruiting from top labs, with Tencent and ByteDance making significant hires from OpenAI and Google DeepMind [60][62].

Group 7: The Value of AI Talent
- The scarcity of top AI talent makes them a strategic asset for companies, with the potential to significantly impact model training costs and performance [64].
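AI confidently talks nonsense about images? "One pull, one push" lets models see both completely and accurately | Microsoft x Tsinghua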
量子位· 2026-02-08 04:46
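Contributed by the BiPS team
量子位 | WeChat official account QbitAI

As the reasoning capabilities of vision-language models (VLMs) keep improving, a hidden problem has gradually surfaced: many errors come not from faulty reasoning but from "looking at the wrong thing."

In complex visual tasks, a model can often identify the objects correctly, understand the question, and even produce a complete reasoning chain, yet still reach a confident but wrong answer because it captured the wrong visual evidence.

Existing methods usually "give directions" at inference time, for example by generating visual prompts or calling external tools to align the evidence temporarily. These strategies work, but they have clear limitations: the forms of visual cue are restricted, they depend heavily on the specific task, and the inference overhead is high. More importantly, they raise a fundamental question: if a model always needs an external reminder to know "where to look," does it really understand the visual world?

To address this, Microsoft Research Asia and Tsinghua University propose BiPS (Bi-directional Perceptual Shaping), which reshapes the model's "way of looking at images" at the source.

Rather than temporarily prompting regions of interest at inference time, BiPS teaches the model during training which visual details must be attended to and which can be ignored for a given question. By systematically aligning questions with visual evidence, BiPS pushes the model to internalize one core capability: looking at the image with the question in mind. As a result, at inference time the model automatically focuses on the key regions and details that actually determine the answer, without any extra prompting.

Experiments show that this ...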
11 top mathematicians published a paper with no results, and Terence Tao recommends everyone pay attention
量子位· 2026-02-08 04:46
Core Viewpoint
- A new AI experiment initiated by 11 top mathematicians aims to test AI's ability to solve research-level mathematical problems, exploring the boundaries of "AI + Mathematics" [1][6].

Group 1: Experiment Overview
- The experiment, named "First Proof," involves AI solving 10 research-level math problems that mathematicians have encountered in their work [6].
- The problems cover various branches of mathematics, including combinatorial algebra, graph theory, algebraic topology, stochastic analysis, and symplectic geometry [10].
- Initially, 20 problems were proposed, but only 10 were selected based on four criteria, ensuring AI can understand the problem statement and that there are no hidden answers [10][17].

Group 2: AI Capabilities and Limitations
- Current AI systems, when tested with a single attempt, struggled to solve most of the proposed problems [24].
- The mathematicians believe that allowing human-AI interaction could improve AI's performance in providing better answers [25].
- The experiment aims to assess AI's ability to complete rigorous mathematical proofs, rather than its capacity to generate new theories or definitions [23].

Group 3: Data Integrity and Future Plans
- To minimize data contamination, the experiment restricts data sharing options and ensures that the answers remain confidential during the testing phase [26][27].
- Future plans include designing a second set of problems and refining the experimental design to create a reusable and comparable benchmark for research-level mathematical capabilities [28].
- The ultimate goal is to foster human-AI collaboration in mathematics, rather than AI replacing mathematicians [29].
量子位 is hiring editors and writers
量子位· 2026-02-08 04:46
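Editorial team, from Aofeisi
量子位 | WeChat official account QbitAI

The AI boom is still surging, but if you don't yet know how to take part ... why not come to 量子位?

We are a content platform built around tracking new progress in AI. After 8 years of accumulation, we now have top-tier influence, broad and well-recognized industry resources, and one of the best positions from which to observe and learn at the forefront of this era.

We are currently hiring in three directions, and we hope you are (or can become) a content expert in one of them:

- AI industry: covers innovation at the infrastructure layer, including chips, AI Infra, and cloud computing;
- AI finance: covers venture investment and earnings reports in the AI field, tracking capital flows along the industry chain;
- AI products: covers progress in AI applications and hardware devices.

Positions are open to:

- Experienced hires: editor, lead writer, and editor-in-chief levels, matched to ability;
- Campus hires: fresh graduates; internships are accepted and can convert to full-time.

Positions at every ability level are open for all roles, and you are welcome to apply based on your own background and experience. All positions are full-time and based in Zhongguancun, Beijing.

Join us and you will get:

- Stand at the crest of the AI wave: be among the first to encounter and understand the latest technologies and products in AI, and build a complete understanding of the field.
- Master new AI tools: apply new AI technologies and tools to your work to improve efficiency and creativity.
- Build personal influence: by writing exclusive original ...

Job details follow:

AI industry direction
Responsibilities: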