量子位
Who is the strongest "AI worker"? OpenAI ran the test itself, and first place wasn't its own model
量子位· 2025-09-26 04:56
Core Insights
- OpenAI has introduced a new benchmark called GDPval to evaluate the economic value of AI models on real-world tasks, covering 44 occupations that together contribute $3 trillion annually to the U.S. GDP [2][15]
- Claude Opus 4.1 emerged as the best-performing model, with 47.6% of its outputs rated comparable to human expert results, while GPT-5 followed at 38.8% [4][6]
- OpenAI's models show roughly linear performance improvement across generations, with significant advances in task accuracy and aesthetic quality [32][33]

Benchmark Overview
- GDPval focuses on nine key industries, each contributing over 5% of U.S. GDP, and selects occupations whose work is primarily digital [14]
- A total of 44 occupations were identified, and the recruited industry experts who designed the tasks had an average of 14 years of experience [15][18]
- The tasks are based on real work deliverables, requiring an average of 7 hours to complete, with some complex tasks taking weeks [19]

Evaluation Methodology
- OpenAI employed a blind expert pairwise-comparison method for task evaluation, achieving a 66% agreement rate with human expert ratings [26][27]
- Each task went through multiple rounds of human expert review, ensuring high quality and relevance [23][24]

Model Performance
- The evaluation showed that GPT-5 excels in accuracy on text-based tasks, while Claude handles diverse file formats better, showing strong visual perception and design capability [33]
- OpenAI noted that pairing AI models with human oversight could make task completion cheaper and faster [35][36]

Limitations and Future Plans
- GDPval has limitations, including a small dataset of only 44 occupations and a focus on knowledge work that excludes physical labor [40]
- OpenAI plans to expand GDPval's scope and enhance its realism and interactivity in future iterations [41]
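To make the pairwise evaluation protocol concrete, here is a minimal sketch of how blind comparisons between a model's deliverable and a human expert's deliverable can be aggregated into a "comparable or better" rate. This is an illustrative reconstruction, not OpenAI's grading code; the data structure, verdict labels, and tie handling are assumptions.

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class PairwiseJudgment:
    """One blind comparison between a model deliverable and a human expert deliverable."""
    task_id: str
    verdict: Literal["model_better", "human_better", "tie"]

def win_or_tie_rate(judgments: list[PairwiseJudgment]) -> float:
    """Fraction of tasks where the model output was rated as good as or better than the expert's."""
    if not judgments:
        return 0.0
    favorable = sum(j.verdict in ("model_better", "tie") for j in judgments)
    return favorable / len(judgments)

# Toy usage with made-up judgments: 2 of 3 tasks rated at least on par with the expert.
sample = [
    PairwiseJudgment("tax_memo_01", "human_better"),
    PairwiseJudgment("slide_deck_07", "tie"),
    PairwiseJudgment("legal_brief_03", "model_better"),
]
print(f"win-or-tie rate: {win_or_tie_rate(sample):.1%}")  # -> 66.7%
```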
A packed interview with OpenAI's two chiefs: the ultimate goal is an "automated researcher", and hiring is not about finding the most high-profile people
量子位· 2025-09-26 04:56
Core Insights
- OpenAI's latest interview reveals significant advances in GPT-5, focusing on long-horizon reasoning and bringing agentic behavior into mainstream applications [1][7][9]
- The company emphasizes protecting foundational research while avoiding distraction from short-term product competition [6][48]

Group 1: GPT-5 Developments
- GPT-5 aims to bring reasoning capabilities to the mainstream, moving beyond previous models that focused on immediate responses [8][10]
- The model represents a strategic shift toward stronger reasoning and agentic behavior, making these capabilities more accessible to users [9][10]

Group 2: Evaluation and Progress
- Current evaluation metrics are nearing saturation, necessitating new methods that assess a model's ability to discover new insights and make practical progress in economically relevant areas [12][13]
- OpenAI plans to focus on the time horizon over which models can reason and make progress, with current capabilities reaching roughly 1 to 5 hours [23][25]

Group 3: Automation and Research Goals
- OpenAI's long-term goal is an automated researcher capable of discovering new ideas, starting with automating its own internal research [20][21]
- The company sees the duration of autonomous operation as a key evaluation metric [25]

Group 4: Reinforcement Learning (RL)
- Despite outside skepticism, reinforcement learning continues to thrive, with OpenAI exploring new directions and ideas [27][29]
- The evolution of reward models is expected to accelerate, simplifying the creation of effective fine-tuning datasets [29][30]

Group 5: Programming and Coding
- OpenAI's GPT-5-codex is designed to optimize programming tasks, addressing earlier models' poor allocation of problem-solving time [32][34]
- The current state of coding tools is likened to the "uncanny valley": effective, but not yet fully comparable to human performance [37][41]

Group 6: Talent Acquisition and Research Culture
- OpenAI prioritizes persistence and the ability to learn from failure in its research culture, seeking people with a solid technical foundation [44][46]
- The company focuses on foundational research rather than merely following competitors, fostering an innovative environment [46][48]

Group 7: Resource Allocation
- Given additional resources, OpenAI would prioritize computational power, recognizing its critical role in research and development [49][51]
- The company maintains a long-term research focus, emphasizing the importance of compute and physical constraints in future advances [52]
How do you build a high-quality dataset of over 10 trillion tokens? An interview with Ruan Yilong (阮宜龙) of China Telecom Tianyi AI (天翼AI)
量子位· 2025-09-26 02:08
Jin Lei (金磊) from Aofeisi (凹非寺)
量子位 | WeChat official account QbitAI

As the saying goes, "whoever gets the data gets the world", and this centrally administered state-owned enterprise has clearly figured out high-quality datasets:

over 10 trillion tokens of general-purpose large-model corpus data, plus specialized datasets covering 14 key industries, adding up to 350TB of storage.

This enormous volume is not a pile of messy raw data, either. It is industry data that has been carefully annotated and optimized, multimodal content included, the kind that can be put to work in an industry right away.

Some readers may ask: does this really matter? The answer is a definite yes.

A high-quality dataset is a collection of data that has been through collection, processing, and other data-handling steps, can be used directly to develop and train AI models, and effectively improves model performance. Building high-quality datasets is critical because they directly determine an AI model's accuracy, generalization, and usability; quality data is the foundation for training efficient, accurate models.

That alone shows how much it matters.

So which state-owned enterprise is it? No suspense: it is the AI "national team", China Telecom Tianyi AI (天翼AI), whose Xingchen (星辰) MaaS platform is the key to building these high-quality datasets.

The Xingchen MaaS platform works like a data refinery: four core components operate in concert to form a complete "data-model-service" closed loop.

Among them, the base models act as the "power engine", providing fundamental cognition and reasoning; the data toolchain acts as the "raw-material depot", continuously supplying high-quality data resources; 模 ...
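The article does not publish the platform's internals, but the "data refinery" idea maps naturally onto a staged cleaning pipeline. Below is a minimal, hypothetical sketch of such a pipeline (normalization, hash-based deduplication, heuristic quality filtering, industry tagging); all function and field names are illustrative assumptions, not Xingchen MaaS APIs.

```python
import hashlib
import re

def normalize(text: str) -> str:
    """Collapse whitespace before downstream steps."""
    return re.sub(r"\s+", " ", text).strip()

def dedup(records: list[dict]) -> list[dict]:
    """Drop exact duplicates via a content hash (real pipelines also do fuzzy dedup)."""
    seen, kept = set(), []
    for rec in records:
        digest = hashlib.sha256(rec["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(rec)
    return kept

def quality_filter(records: list[dict], min_chars: int = 50) -> list[dict]:
    """Toy quality gate: keep documents above a length threshold."""
    return [r for r in records if len(r["text"]) >= min_chars]

def tag_industry(rec: dict, keyword_map: dict[str, str]) -> dict:
    """Attach a coarse industry label from keyword hits (stand-in for model-based labeling)."""
    label = next((ind for kw, ind in keyword_map.items() if kw in rec["text"]), "general")
    return {**rec, "industry": label}

def refine(raw: list[dict], keyword_map: dict[str, str]) -> list[dict]:
    cleaned = [{**r, "text": normalize(r["text"])} for r in raw]
    return [tag_industry(r, keyword_map) for r in quality_filter(dedup(cleaned))]
```

Production pipelines of this scale would additionally use fuzzy deduplication, model-based quality scoring, and multimodal handling, but the staged shape is the same.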
"Zero-human" medical research: a Tsinghua AI agent runs the whole pipeline from idea to paper autonomously
量子位· 2025-09-26 02:08
Contributed by the group of Suo Jinli (索津莉), Department of Automation, Tsinghua University
量子位 | WeChat official account QbitAI

Has medical research entered the "zero-human-labor" era?!

The group of Suo Jinli at Tsinghua University's Department of Automation has released OpenLens AI, the first fully autonomous AI research framework designed specifically for medical informatics.

For the first time, it closes the full automation loop from literature mining → experiment design → data analysis → code generation → a submission-ready paper.

Why build such a system? Mainly because medical informatics research is caught in an efficiency squeeze: multi-center data fusion, an explosion of knowledge, and the demand for cross-disciplinary collaboration are stretching the traditional research model ever thinner.

OpenLens AI introduces medicine-specific quality-control methods, produces publication-grade research papers, and compresses the research cycle from months to hours, heralding a "zero-human-labor" era for medical research.

Here are the details.

Five core modules: the dream team of AI research

OpenLens AI not only automates the full workflow but also sets a new bar for quality control, integrating four safeguard mechanisms.

OpenLens AI adopts a modular architecture in which five specialized agents work together to form a complete research-automation pipeline:

The supervisor module acts as the global coordinator, decomposing the user's query into structured subtasks and keeping the whole research workflow transparent and interpretable.

The literature reviewer builds an autonomous knowledge-exploration pipeline, using a ReAct-based reasoning framework to retrieve and synthesize relevant literature and give the research a solid theoretical ...
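The article names a ReAct-based reasoning framework for the literature reviewer but gives no implementation details. Below is a minimal, generic ReAct-style loop (thought, action, observation) for retrieving papers; the `llm` and `search_papers` callables and the prompt format are placeholders, not OpenLens AI's actual interfaces.

```python
from typing import Callable

def react_literature_review(
    question: str,
    llm: Callable[[str], str],            # returns the agent's next "Thought/Action" text
    search_papers: Callable[[str], str],  # returns a short summary of retrieved papers
    max_steps: int = 5,
) -> str:
    """Generic ReAct loop: the model alternates reasoning with tool calls until it answers."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)            # e.g. "Thought: ...\nAction: search[multimodal fusion]"
        transcript += step + "\n"
        if "Action: search[" in step:
            query = step.split("Action: search[", 1)[1].split("]", 1)[0]
            observation = search_papers(query)
            transcript += f"Observation: {observation}\n"
        elif "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip()
    return transcript  # fall back to the raw trace if no final answer was produced
```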
Up to 3.2x faster multimodal inference! Huawei Noah's Ark Lab's new algorithm accepted to NeurIPS 2025
量子位· 2025-09-26 02:08
Contributed by the ViSpec team
量子位 | WeChat official account QbitAI

Speed up multimodal large-model inference by up to 3.2x, without sacrificing any generation quality.

That is the latest research from Huawei's Noah's Ark Lab, now accepted to NeurIPS 2025.

Speculative decoding offers only limited speedups for VLMs

The multimodal capabilities of large models are advancing at an unprecedented pace, but one stubborn problem keeps getting more pronounced: inference speed.

When a model has to "look at pictures" while "talking", especially when generating long, image-rich responses, compute cost and latency rise sharply, which severely limits VLM deployment in real-time interaction, on edge devices, and in similar scenarios.

To make large models "talk" faster, academia and industry widely use speculative decoding. It works like a quick-thinking "strategist" (a small draft model) paired with a decisive "lord" (the large target model).

To date, speculative decoding has become a standard tool for accelerating large language model (LLM) inference, but applying it to multimodal large models (VLMs) has proven difficult: existing methods reach speedups of less than 1.5x, a limited gain.

To address this, Huawei's Noah's Ark Lab proposed a new inference-acceleration framework designed specifically for vision-language models, Vision-Aware Speculative Decoding (ViSpec), which for the first time in this field ...
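As background on the technique being adapted, here is a minimal sketch of the basic draft-and-verify loop in speculative decoding. It uses a simplified greedy acceptance rule rather than the full rejection-sampling scheme, and does not reflect ViSpec's vision-specific additions; the `draft_next` and `target_next` interfaces are assumptions.

```python
from typing import Callable

def speculative_decode(
    prompt: list[int],
    draft_next: Callable[[list[int]], int],   # small draft model: greedy next token
    target_next: Callable[[list[int]], int],  # large target model: greedy next token
    k: int = 4,
    max_new_tokens: int = 64,
) -> list[int]:
    """Greedy speculative decoding: the draft proposes k tokens, the target verifies them.
    Proposals are accepted up to the first disagreement, where the target's token is used."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1) Draft model cheaply proposes a block of k candidate tokens.
        proposal, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2) Target model verifies the proposals (in practice, one batched forward pass).
        for i, t in enumerate(proposal):
            verified = target_next(tokens + proposal[:i])
            if verified != t:
                tokens += proposal[:i] + [verified]  # keep the matching prefix, correct the rest
                break
        else:
            tokens += proposal  # every draft token matched the target's choice
    return tokens[: len(prompt) + max_new_tokens]
```

The speedup comes from the target model verifying a block of tokens in one pass instead of generating them one by one; the whole game is keeping the draft's acceptance rate high, which is exactly where VLMs have struggled.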
ChatGPT's new feature wants to be the first app you open in the morning
量子位· 2025-09-26 02:08
Shi Ling (时令) from Aofeisi (凹非寺)
量子位 | WeChat official account QbitAI

ChatGPT has a shiny new feature: ChatGPT Pulse.

It is billed as delivering personalized updates while you sleep, with no question asked, and serving you a set of carefully curated cards every morning.

Here is what it looks like in practice.

Sam Altman is cheering it on enthusiastically, saying:

"This is my favorite feature since ChatGPT launched. You can think of it as a very capable personal assistant."

But netizens do not seem convinced, calling it a perfect ad-recommendation machine: a way to wire ads straight into the chat interface in the morning and rake in money for the GPUs.

Let's look at the details.

Automatic pushes, no prompting needed

Fidji Simo, OpenAI's CEO of Applications, has said that "the next frontier is agents, AI assistants that can take actions on your behalf and work alongside you like a teammate."

At the same time, she believes ChatGPT used to be passive, essentially "you ask, it answers", leaving users to figure out on their own what to ask and what they need.

Now ChatGPT Pulse breaks with that passive mode and has learned to take the initiative. Without being prompted, it proactively tracks the things that matter most to you and delivers relevant information, creative ideas, and action guidance in a timely way.

Simply put, by learning from your conversation history and phone activity (such as a linked calendar, email, and Google Contacts), P ...
SOTA even with scarce data? Tsinghua & Shanghai AI Lab crack two major bottlenecks in robot RL
量子位· 2025-09-26 02:08
Core Viewpoint
- The article discusses SimpleVLA-RL, an end-to-end online reinforcement-learning training framework for Vision-Language-Action (VLA) models, aimed at improving the flexibility and performance of robots in complex environments while addressing existing training bottlenecks [3][12]

Group 1: Key Challenges in Existing Training Paradigms
- Current training paradigms face significant challenges, including high data-collection costs and insufficient generalization [2][8]
- Reliance on large-scale, high-quality robot operation trajectories limits scalability and raises costs, making data acquisition a major hurdle [8]
- The models struggle to generalize, particularly on out-of-distribution tasks and in new environments, with performance drops on long-horizon dependencies and compositional tasks [8][9]

Group 2: SimpleVLA-RL Framework
- SimpleVLA-RL combines interactive trajectory sampling, outcome-based rewards, and enhanced exploration to tackle the core challenges of VLA model training [5][6]
- The framework achieves state-of-the-art (SoTA) performance on standard benchmarks such as LIBERO and RoboTwin, with significant improvements even under limited data [5][21]
- With only a single demonstration per task, the average success rate on LIBERO rose from 48.9% to 96.9% after applying SimpleVLA-RL [5]

Group 3: Performance Metrics and Results
- SimpleVLA-RL reached an average success rate of 99.1% on LIBERO, with long-horizon tasks improving by 12.0 percentage points [21]
- On RoboTwin 1.0, the average success rate rose from 39.8% to 70.4%, with specific tasks such as "Blocks Stack" improving by 33.1 percentage points [23]
- On RoboTwin 2.0, the average success rate improved from 38.3% to 68.8% [25]

Group 4: Innovations and Discoveries
- Training gave rise to new operational strategies, such as the "Pushcut" phenomenon, in which the model autonomously discovers more efficient behaviors beyond the human demonstrations [10][31]
- This suggests that reinforcement learning lets VLA models move past the limits of human demonstration patterns, paving the way for future adaptive VLA models [31]
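To illustrate the "interactive trajectory sampling plus outcome-based reward" recipe in the abstract, here is a minimal REINFORCE-style training loop in which the only reward signal is whether a rollout succeeds. This is a generic sketch under assumed `env` and `policy` interfaces, not the SimpleVLA-RL implementation, which builds on large VLA models and a more sophisticated RL algorithm.

```python
import torch

def train_outcome_reward_rl(policy, optimizer, env, epochs=100, rollouts_per_epoch=16):
    """Policy-gradient training driven purely by a binary task-success reward."""
    for _ in range(epochs):
        losses = []
        for _ in range(rollouts_per_epoch):
            obs = env.reset()
            log_probs, done, success = [], False, False
            while not done:
                dist = policy(obs)                     # assumed: returns a torch distribution
                action = dist.sample()
                log_probs.append(dist.log_prob(action).sum())
                obs, done, success = env.step(action)  # assumed env API
            reward = 1.0 if success else 0.0           # outcome-based reward: no dense shaping
            # REINFORCE: push up log-probabilities of actions taken in successful rollouts.
            losses.append(-reward * torch.stack(log_probs).sum())
        optimizer.zero_grad()
        torch.stack(losses).mean().backward()
        optimizer.step()
```

The appeal of this setup is that it needs no per-step reward engineering: the environment only has to report task success, which is what makes it practical for robot manipulation benchmarks.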
Xiaomi 17 on sale from 4,499 yuan, debuting the fifth-generation Snapdragon 8! Lei Jun: 50 billion yuan for in-house chips
量子位· 2025-09-25 23:54
Core Viewpoint
- Xiaomi's latest launch event showcased a clear shift toward positioning itself as a hardcore technology company, emphasizing innovation across its product lineup, particularly the Xiaomi 17 series, which aims to compete directly with Apple's iPhone [6][7][10]

Group 1: Xiaomi 17 Series
- The Xiaomi 17 series includes three models: standard, Pro, and Pro Max, with a starting price of 4,499 yuan [3][11]
- The series debuts the new fifth-generation Snapdragon 8 mobile platform, built on a 3nm process with a peak frequency of 4.6GHz, positioning it as a top-tier flagship [14][15]
- The design balances light weight with a premium feel: 8.06mm thick and 191g, with a 6.3-inch display on the standard and Pro models and a 6.9-inch display on the Pro Max [18][21]
- The camera system has been enhanced with Leica tuning, focusing on portrait photography with new algorithms for skin-tone restoration and detail enhancement [44][46]
- The Pro and Pro Max introduce a new rear "back screen", adding another surface for interaction and notifications [40][41]

Group 2: Battery and Display Innovations
- The standard Xiaomi 17 carries a 7000mAh battery, while the Pro Max reaches 7500mAh, giving it better endurance than the iPhone 17 [34][35]
- The display uses new red light-emitting materials that improve brightness efficiency by 11.4%, a notable advance for domestic manufacturing [29][31]

Group 3: Xiaomi Pad 8
- The Xiaomi Pad 8 series was also launched, featuring an 11.2-inch 3.2K display and a starting price of 2,199 yuan, designed to be light and portable [50][51]
- The Pad runs the new Surge OS 3, enabling desktop-like workflows, a wide range of applications, and multitasking [57][59]
- The standard version is powered by the Snapdragon 8s Gen 4 processor, while the Pro version moves up to a flagship-tier Snapdragon 8-series chip, with significant performance improvements [63][64]

Group 4: Future Aspirations
- Xiaomi's CEO emphasized the company's commitment to developing its own SoC, with a planned investment of 50 billion yuan over the next decade, aiming at the high-end market [68][69]
Does the algorithm behind Musk's new model come from NVIDIA???
量子位· 2025-09-25 23:54
Core Viewpoint
- Grok-4-fast has shown exceptional cost reduction and efficiency, surpassing even GPT-5 with its routing-based design [1][38]

Group 1: Performance and Efficiency
- Grok-4-fast's impressive reasoning efficiency is attributed to advanced scaling of computational power [2]
- The technology underlying Grok is speculated to be linked to NVIDIA's algorithmic advances, in particular a new model called Jet-Nemotron [3][4]
- Jet-Nemotron-2B delivers performance comparable to leading open-source models while running roughly 53x faster [7]

Group 2: Technological Innovations
- The key innovation behind Jet-Nemotron (and, the article speculates, behind Grok-4-fast's efficiency) is a new framework called PostNAS, which sharply reduces training cost and allows a more thorough exploration of model structures [10][11]
- PostNAS builds hybrid-architecture models that retain the essential full-attention layers while removing redundant ones to improve efficiency [13][14]
- The framework has four core components: full-attention layer placement, selection of the best linear-attention module, design of improved linear-attention modules, and hardware-aware architecture search [12]

Group 3: Attention Mechanisms
- The NVIDIA team evaluated six state-of-the-art linear-attention modules; Gated DeltaNet achieved the highest accuracy thanks to its data-dependent gating mechanism and delta rule [18][19]
- JetBlock, a more advanced linear-attention module, uses dynamic convolution to generate convolution kernels adaptively from input features, outperforming Gated DeltaNet in accuracy on mathematical-reasoning and retrieval tasks [21][24]

Group 4: Hardware Optimization
- NVIDIA's hardware-aware architecture search optimizes the parameters that matter on real hardware rather than raw parameter count, which is a poor proxy for actual hardware efficiency [27][28]
- The team found that the size of the key-value (KV) cache is the critical factor for throughput in long-context and long-generation settings, and targeted their optimization accordingly [30][31]

Group 5: Industry Impact
- PostNAS is expected to influence the AI industry by providing a low-cost, high-efficiency way to explore architectures on top of any pre-trained Transformer [34]
- Jet-Nemotron is open source, allowing vendors to adopt it without retraining from scratch, cutting costs significantly while maintaining accuracy [36][42]
- If adopted by major AI companies such as OpenAI and Google, Jet-Nemotron-style techniques could bring broad improvements in model performance and cost efficiency [43]
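To make the hardware-aware search criterion concrete, here is a small sketch that scores candidate hybrid architectures by their estimated KV-cache footprint (the quantity the article says dominates long-context throughput) and keeps only those within a memory budget. The formula and the candidate configurations are illustrative assumptions, not NVIDIA's actual search code.

```python
from dataclasses import dataclass

@dataclass
class HybridConfig:
    name: str
    full_attn_layers: int  # layers that keep full attention (and thus a growing KV cache)
    num_kv_heads: int
    head_dim: int

def kv_cache_bytes(cfg: HybridConfig, seq_len: int, batch: int, bytes_per_elem: int = 2) -> int:
    """Estimated KV-cache size: keys and values for every full-attention layer.
    Linear-attention layers keep constant-size state, so they are ignored here."""
    return 2 * cfg.full_attn_layers * cfg.num_kv_heads * cfg.head_dim * seq_len * batch * bytes_per_elem

def within_budget(cands: list[HybridConfig], seq_len: int, batch: int, budget_gb: float) -> list[HybridConfig]:
    limit = budget_gb * 1024**3
    return [c for c in cands if kv_cache_bytes(c, seq_len, batch) <= limit]

# Toy comparison at a 64k context, batch 8, fp16 cache.
candidates = [
    HybridConfig("all-full-attn", full_attn_layers=28, num_kv_heads=8, head_dim=128),
    HybridConfig("hybrid-4-full", full_attn_layers=4, num_kv_heads=8, head_dim=128),
]
for c in candidates:
    print(c.name, f"{kv_cache_bytes(c, seq_len=65536, batch=8) / 1024**3:.1f} GiB")
print([c.name for c in within_budget(candidates, 65536, 8, budget_gb=16.0)])  # -> ['hybrid-4-full']
```

The point of scoring by cache size rather than parameter count is that, at long contexts, the KV cache (not the weights) is what limits batch size and therefore throughput.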
Meta poaches OpenAI's Yang Song (宋飏)! A key figure in the rise of diffusion models joins MSL and reunites with Tsinghua alumnus Shengjia Zhao (赵晟佳)
量子位· 2025-09-25 13:00
Core Viewpoint
- Meta has successfully recruited Yang Song, a prominent researcher from OpenAI, a move that has drawn significant attention in the AI research community given his notable contributions to diffusion models and generative modeling [1][6][7]

Group 1: Yang Song's Background and Achievements
- Yang Song is recognized as a key contributor to the rise of diffusion models and led OpenAI's Strategic Explorations Team [10][11]
- He graduated from Tsinghua University at the age of 16 and later earned his PhD at Stanford University under Stefano Ermon [20][36]
- His best-known work includes Consistency Models, which outperform diffusion models in sampling speed, generating images far faster [12][14][17]

Group 2: Impact of Yang Song's Work
- The Consistency Models developed by Yang Song can generate 64 images at 256×256 resolution in roughly 3.5 seconds, a substantial improvement over prior models [12][14]
- His follow-up research produced Continuous-Time Consistency Models, which address the stability and scalability issues of the earlier formulation and have been trained at a scale of 1.5 billion parameters [15][18]
- These advances are seen as potential game-changers for generative modeling, with some discussion suggesting they could "end" the dominance of diffusion models [18][19]

Group 3: Meta's Strategic Recruitment
- Recruiting Yang Song is part of Meta's broader strategy of strengthening its AI capabilities by attracting top talent from leading organizations such as OpenAI [9][10]
- The move is seen as a significant loss for OpenAI, with many colleagues expressing surprise at his departure [6][7]
- The motivation behind such moves is speculated to go beyond financial incentives, since many researchers prioritize impactful work and collaboration opportunities [9]
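For readers unfamiliar with the technique named here, the core idea of a consistency model (as defined in Song et al.'s 2023 paper) can be written compactly. The notation below follows that paper's standard formulation and is included only as background, not as anything specific to the recruitment story.

```latex
% A consistency function f maps any point x_t on a probability-flow ODE
% trajectory {x_t}, t in [eps, T], back to the trajectory's origin,
% which is what enables one-step (or few-step) generation.
\begin{aligned}
&f(x_t, t) = f(x_{t'}, t') \quad \text{for all } t, t' \in [\epsilon, T]
  \text{ on the same trajectory (self-consistency)},\\
&f(x_\epsilon, \epsilon) = x_\epsilon \quad \text{(boundary condition)},\\
&\hat{x}_\epsilon = f_\theta(x_T, T), \quad x_T \sim \mathcal{N}(0, T^2 I)
  \quad \text{(single-step sampling)}.
\end{aligned}
```

Because a single evaluation of $f_\theta$ replaces the many denoising steps of a diffusion sampler, this is where the speed advantage cited above comes from.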