Could grinding math problems actually hurt large models? CMU evaluates 20+ models and flags a training trap
量子位· 2025-07-07 06:13
Core Viewpoint
- The article discusses the relationship between the mathematical reasoning capabilities of large language models (LLMs) and their ability to transfer these skills to other tasks, highlighting that models trained with reinforcement learning (RL) show better transferability than those trained with supervised fine-tuning (SFT) [4][11].

Group 1: Mathematical Reasoning and Transferability
- Research indicates that only models trained with RL can effectively transfer mathematical reasoning skills to other tasks, while SFT models show limited or no transfer [4][11].
- A Transferability Index (TI) is introduced to quantify the extent to which improvements in mathematical reasoning carry over to other reasoning and non-reasoning tasks [8][9].
- If TI is greater than 0, there is a positive transfer effect to other tasks; if less than 0, the transfer is negative [9].

Group 2: Experimental Findings
- The study evaluated over 20 models across tasks spanning mathematical reasoning, other reasoning tasks (e.g., medical reasoning), and non-reasoning tasks (e.g., common-sense dialogue) [7].
- Models fine-tuned with RL consistently achieve higher transferability metrics on both reasoning and non-reasoning tasks, while SFT models often suffer negative transfer on non-reasoning tasks [11].

Group 3: Model Representation and Performance
- PCA analysis reveals that RL fine-tuned models exhibit minimal shifts in representation space, indicating they retain previously learned knowledge while improving in specific domains [15].
- RL models show lower KL divergence on reasoning and non-reasoning tasks than SFT models, suggesting more stable and precise representation updates [16][18].
- The findings suggest that RL is crucial for achieving transferable reasoning capabilities in LLMs, marking another victory for reinforcement learning [19].
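The TI sign convention described above can be made concrete with a toy calculation. The paper's exact formula is not reproduced here; this sketch assumes TI is the ratio of the accuracy gain on a target task to the gain on math, which preserves the stated property that TI > 0 marks positive transfer and TI < 0 negative transfer.

```python
def transferability_index(math_gain: float, other_gain: float) -> float:
    """Hypothetical Transferability Index: ratio of the performance gain
    on a target (non-math) task to the gain on math reasoning, both
    measured as accuracy deltas relative to the base model.

    TI > 0: math training helped the other task (positive transfer).
    TI < 0: math training hurt it (negative transfer).
    This is an illustrative formulation, not the paper's exact definition.
    """
    if math_gain == 0:
        return 0.0
    return other_gain / math_gain

# Toy numbers: an RL-tuned model gains 10 points on math and 4 points on
# a medical-reasoning benchmark; an SFT model gains 12 on math but loses
# 3 on a non-reasoning task.
rl_ti = transferability_index(0.10, 0.04)    # positive transfer
sft_ti = transferability_index(0.12, -0.03)  # negative transfer
print(rl_ti > 0, sft_ti < 0)
```

Under this toy definition, the RL model's TI is positive and the SFT model's is negative, matching the qualitative pattern the study reports.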
AI spots hidden heart-disease risk invisible to doctors, with nearly 90% accuracy far beyond human experts | Nature sub-journal
量子位· 2025-07-07 06:13
Core Viewpoint
- The article covers the breakthrough of the MAARS model, a multi-modal AI model developed at Johns Hopkins University that significantly improves prediction of sudden cardiac death risk by analyzing raw MRI images, reaching up to 93% accuracy in certain populations [2][10][12].

Group 1: MAARS Model Overview
- MAARS uses a 3D Vision Transformer architecture to analyze LGE-CMR (Late Gadolinium Enhancement Cardiac Magnetic Resonance) images, avoiding subjective interpretation by human doctors [7][16].
- It can identify hidden fibrotic scar patterns in MRI images that clinicians often overlook, which are critical signals of potentially fatal arrhythmias [8][9].
- The model raises diagnostic accuracy for hypertrophic cardiomyopathy (HCM) from 50% to nearly 90% [11].

Group 2: Performance Metrics
- In internal validation, MAARS achieved a prediction accuracy (AUROC) of 89%, rising to 93% for high-risk individuals aged 40 to 60 [20][10].
- Compared with traditional clinical guidelines, MAARS improves HCM risk-stratification precision by 0.27-0.35 [21].

Group 3: Multi-modal Data Integration
- MAARS integrates multiple data types, including 40 structured fields from electronic health records (EHR) and 27 specialized indicators from ultrasound and CMR reports, enhancing its predictive power [18][19].
- Its design comprises three single-modal branches and a multi-modal fusion module, allowing it to extract features from different data sources effectively [14][15].

Group 4: Interpretability and Clinical Application
- Unlike black-box AI models, MAARS has an interpretable design that quantifies each input feature's contribution to the prediction, building clinical trust [23].
- This transparency supports personalized treatment planning, helping doctors make better-informed decisions about interventions such as implanting defibrillators [27].

Group 5: Research Team and Future Directions
- MAARS is led by Professor Natalia Trayanova of Johns Hopkins University, who has a notable background in computational cardiology [28][29].
- The team plans to extend the MAARS algorithm to other conditions such as dilated cardiomyopathy and ischemic heart disease, promoting the use of AI in cardiovascular medicine [32].
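The three-branch-plus-fusion layout described above can be sketched in miniature. Everything below is illustrative: the encoders are plain linear-plus-ReLU projections rather than the paper's 3D Vision Transformer and tabular encoders, the embedding sizes are arbitrary, and the weights are random instead of trained. Only the input widths (40 EHR fields, 27 report indicators) follow the article; the flattened 64-value "image" is a toy stand-in for an LGE-CMR embedding.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16  # shared embedding size (arbitrary for this sketch)

def branch(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """One single-modal branch: a linear projection into the shared
    embedding space followed by a ReLU (illustrative only)."""
    return np.maximum(w @ x, 0.0)

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + np.exp(-z))

# Random projection weights: 40 EHR fields, 27 report indicators,
# and a toy 64-dim imaging embedding.
w_ehr, w_rep, w_img = (rng.normal(size=(dim, n)) * 0.1 for n in (40, 27, 64))
w_fuse = rng.normal(size=3 * dim) * 0.1  # fusion module: one linear head

def maars_like_risk(ehr, report, image) -> float:
    """Concatenate the three branch embeddings and map to a risk
    probability in (0, 1)."""
    fused = np.concatenate(
        [branch(ehr, w_ehr), branch(report, w_rep), branch(image, w_img)]
    )
    return float(sigmoid(w_fuse @ fused))

risk = maars_like_risk(rng.normal(size=40), rng.normal(size=27), rng.normal(size=64))
print(0.0 < risk < 1.0)
```

The point of the sketch is structural: each modality is encoded separately, and only the fusion head sees all three, which is also what makes per-feature contribution analysis (Group 4) tractable.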
DeepSeek inference sped up by as much as 6x! Open-source study adds a "thinking progress bar" and cuts compute by 30%
量子位· 2025-07-07 06:13
不圆 from 凹非寺
量子位 | Official account QbitAI

Should DeepSeek's reasoning be detailed or fast? Now you can choose.

A research team from Tel Aviv University has developed a new method that can monitor and control the length of thinking paths in LLMs.

"Overclocking" cuts unnecessary reasoning steps, letting the model reach conclusions faster while avoiding the performance drop caused by overthinking.

The model has been open-sourced on GitHub.

It fits LLM reasoning tasks with a progress bar, and can also control reasoning depth and adjust reasoning speed.

Compared with the original model, the accelerated model used nearly 6x fewer tokens and still arrived at correct answers.

When LLMs carry out structured reasoning, they implicitly track their relative position within the thinking phase and encode this information in their hidden states.

The paper proposes a "Thinking Progress Vector" (TPV) that predicts, in real time, the model's relative position in the reasoning phase, and visualizes the model's reasoning dynamics as a progress bar.

By intervening on the TPV, the model's reasoning can be sped up or slowed down, enabling "overclocking" and "downclocking".

Method: monitor and control reasoning depth in real time

To learn to reason effectively, a model must implicitly learn to track its progress through the thinking phase, maintaining an estimate of, for example, how close it is to the final answer.

Since progress tracking depends on ...
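The TPV idea — a linear probe on hidden states that predicts relative reasoning progress, which can then be pushed along to "overclock" — can be sketched on synthetic data. The hidden states below are simulated, not real LLM activations, and the intervention is a simple additive nudge along the probe direction; it illustrates the mechanism, not the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 32  # toy hidden-state dimension

# Synthetic "hidden states": progress p in [0, 1] is linearly encoded
# along a fixed direction, plus noise (a stand-in for real LLM states).
true_dir = rng.normal(size=d)
p = rng.uniform(size=500)
H = np.outer(p, true_dir) + rng.normal(scale=0.1, size=(500, d))

# Fit the Thinking Progress Vector as a least-squares linear probe:
# find tpv such that H @ tpv approximates p.
tpv, *_ = np.linalg.lstsq(H, p, rcond=None)

def predict_progress(h: np.ndarray) -> float:
    """Probe's estimate of how far through the thinking phase h is."""
    return float(h @ tpv)

h = 0.3 * true_dir                    # a state ~30% through the reasoning
h_over = h + 0.5 * tpv / (tpv @ tpv)  # "overclock": nudge along the TPV
print(predict_progress(h_over) > predict_progress(h))
```

The nudge is scaled so the probe's predicted progress rises by exactly 0.5; in the paper's setting the analogous intervention on real hidden states is what shortens the thinking phase.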
"NVIDIA GPUs are a pile of *!" A blogger's 6,000-word screed: melted connectors, missing units, bricked drivers, and threats to the media
量子位· 2025-07-07 04:02
克雷西 from 凹非寺
量子位 | Official account QbitAI

NVIDIA keeps fumbling, and it has provoked widespread anger.

A blogger published a screed against NVIDIA, listing N "charges" from GPU quality defects to purchase difficulties.

The blogger pulled no punches, flatly declaring that NVIDIA is a pile of *.

In the nearly 6,000-word post, blogger Sebin criticizes NVIDIA end to end, from products to sales strategy. The accusations include, but are not limited to:

The post stirred up a storm and triggered extensive discussion on Hacker News.

Some commenters said outright that high-end GPUs have gradually become luxury goods.

Others countered that Jensen Huang has seized every market opportunity with GPUs and hardware-software technology, will keep doing so, and that NVIDIA will stay ahead for a long time.

- Substandard product quality, with faults such as melted connectors and missing ROPs units;
- Unreasonable sales strategy, plagued by insufficient stock, bundled sales, and scalper hoarding;
- A moat that locks down its technology and breaks backward compatibility;
- Market monopoly, with manipulation of media and review outlets.

So what exactly does the blogger's screed say?

Note: the views in this article belong solely to the original blogger and commenters.

Frequent failures: product quality is worrying

Take this year's newly released 50-series GPUs: one of the best-known faults is the 5090's melted-connector incident.

And the fault did not start with the 50 series; it had already shown up on the 4090.

More dramatic still, when the 50-series cards went on sale ...
Karpathy's latest brainwave, "bacterial programming": good code should share three traits with bacteria
量子位· 2025-07-07 04:02
Core Viewpoint
- The article discusses Andrej Karpathy's new concept of "Bacterial Code," which emphasizes small, modular, self-contained code blocks that are easy to copy and paste, inspired by the evolutionary strategies of bacteria [1][5][6].

Group 1: Concept of Bacterial Code
- Bacterial Code has three main characteristics: small code blocks, modularity, and self-containment, allowing easy replication [1][6][12].
- Open-source communities can thrive through "horizontal gene transfer," much as bacteria share genetic material [2][12].
- Karpathy's insights derive from the survival strategies of bacteria, which have evolved to colonize diverse environments through efficient genetic coding [7][8].

Group 2: Principles of Bacterial Code
- The first principle is "smallness": every line of code consumes energy, creating a natural self-optimization mechanism [8][11].
- The second is "modularity": code should be organized into interchangeable modules, akin to bacterial operons, promoting high cohesion and low coupling [11][12].
- The third is "self-containment": snippets should stand alone, without relying on complex configuration or external libraries [13][14].

Group 3: Limitations and Future Directions
- While Bacterial Code suits rapid prototyping, it is not suited to building complex systems, which require more intricate structures, like eukaryotic genomes [15][16].
- Karpathy suggests a hybrid approach that draws on the strengths of both bacterial and eukaryotic coding strategies [16].

Group 4: Evolution of Software Development
- Karpathy previously introduced concepts such as Software 3.0, a shift toward programming with natural-language models [18][25].
- Software has undergone major transformations in recent years, moving from traditional coding to model training and now to natural-language programming [19][23][31].
- The future of software development will involve collaboration between humans and large models, leading to semi-autonomous applications [28][30].

Group 5: Context Engineering
- Context Engineering is highlighted as a crucial skill for using large language models (LLMs) effectively, requiring a careful balance of information to optimize performance [36][39].
- It involves understanding LLM behavior and integrating elements such as task descriptions and multimodal data [40][41].
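As a concrete illustration of the three traits, a "bacterial" snippet might look like the following: small, single-purpose, and dependent only on the standard library, so it can be copied into any codebase unchanged. The example itself is ours, not Karpathy's.

```python
import re
import unicodedata

def slugify(text: str) -> str:
    """Turn arbitrary text into a URL-safe slug.

    A deliberately "bacterial" snippet in Karpathy's sense: small (a few
    lines), modular (does exactly one job), and self-contained (stdlib
    only), so it survives copy-paste into any codebase.
    """
    # Strip accents, then collapse every non-alphanumeric run to a dash.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode()
    return re.sub(r"[^a-zA-Z0-9]+", "-", text).strip("-").lower()

print(slugify("Hello, World!"))  # hello-world
```

A snippet like this spreads by "horizontal gene transfer": nothing about it assumes a framework, a config file, or a package layout, which is exactly the self-containment principle.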
The Mona Lisa wipes out nearly all the large models! Netizens: got it, AI can't squint
量子位· 2025-07-06 05:12
Core Viewpoint
- The article examines the challenges large AI models face in recognizing a particular artwork by Japanese artist Akiyoshi Kitaoka, highlighting their limitations in visual perception and recognition [1][3].

Group 1: AI Model Performance
- ChatGPT could identify the image only as a face and failed to recognize the specific individual depicted [4][12].
- Gemini misidentified the person entirely, a significant recognition error [6][15].
- Grok could not recognize the image and requested a clearer photo, indicating it could not handle such visual tasks [16].

Group 2: Domestic AI Model Analysis
- The domestic model Doubao performed similarly to Gemini, recognizing the image style and facial contours but failing to identify the specific person [18].
- Doubao's deep-thinking mode led it to conclude, incorrectly, that the image depicted Albert Einstein, exposing a flawed reasoning process [20].
- Qwen3-235B-A22B identified the image as a silhouette but did not name the individual, reflecting only a partial understanding of the visual content [21][22].

Group 3: Successful Recognition
- The o3-Pro model stood out by successfully recognizing the artwork, attributed to stronger reasoning than its non-Pro counterpart [26][29].
- There was discussion about whether o3-Pro used search to achieve the recognition, but it was clarified that it did not [31].
- The article suggests that prompting the model with hints about the artwork could yield better recognition, akin to a guessing game [34].
Who is Jiahui Yu? The "$100-million-a-year" AI researcher, a USTC gifted-youth-class prodigy and direct protégé of Yonghui Wu
量子位· 2025-07-06 05:12
邓思邈 from 凹非寺
量子位 | Official account QbitAI

Jiahui Yu (余家辉).

A name almost invisible on the Chinese internet, yet one that has set Silicon Valley's two AI giants at each other's throats.

Born in Cixi, Zhejiang, he was admitted to USTC's gifted youth class in his second year of high school; did his PhD under Thomas Huang (黄煦涛), the "father of computer vision"; is a direct protégé of Yonghui Wu (吴永辉), now head of ByteDance's Seed; and his rise overlaps strikingly with that of Jianchao Yang (杨建朝), a core Doubao technical heavyweight...

Zuckerberg personally stepped in to poach him, with a reported "$100 million annual salary" that set a new record in the AI talent market.

Altman, visibly rattled, publicly called Meta's tactics "distasteful" and remarked pointedly that "some people will do anything for money." OpenAI staff lamented internally: this is a huge loss.

After graduating from USTC in 2016, he went alone to UIUC (University of Illinois Urbana-Champaign) for his PhD under Thomas Huang, the godfather of a generation of Chinese researchers in AI vision, which gave him a solid academic foundation.

Netizens even joked: a top AI researcher now earns as much as Cristiano Ronaldo's transfer fee, yet is less famous than a Z-list influencer.

For a moment, the AI world was more dramatic than show business.

Stranger still, while the whole world debated the sky-high pay package, the man himself seemed to vanish: no response, no statement, not even a social media post.

So who exactly is Jiahui Yu, and what earned him all this?

An unreproducible résumé

Jiahui Yu is exactly 30, born in 1995, and gifted from an early age; in 2012 he ...
Jensen Huang lands another post-95 Chinese talent! A $400 million acquisition of an AI startup
量子位· 2025-07-06 05:12
(Compared with Zuckerberg's lavish spending on hiring, this really isn't much.)

The Toronto-based startup CentML, co-founded by post-95 Chinese engineer Wang Shang (王尚) and his advisor, specializes in optimizing how AI applications run.

Wang Shang earned his PhD at the University of Toronto, worked at NVIDIA as a senior software engineer on GPU performance optimization for deep learning, and later served as CTO of CentML; rejoining NVIDIA now is a return to his old employer.

鹭羽 白交 from 凹非寺
量子位 | Official account QbitAI

Another AI startup founded by a post-95 Chinese engineer has been acquired by Jensen Huang!

For a mere $400 million, the entire staff has been brought over to NVIDIA.

Wang Shang, his three co-founders, and 15 engineers have all now joined NVIDIA. The acquisition is expected to strengthen NVIDIA's CUDA toolchain and give developers more efficient options for deploying AI models.

First Zuckerberg lured away eight core OpenAI members with "$100 million salaries"; now Jensen Huang has rushed into the fray, absorbing talent at pace through acquisitions.

Has Silicon Valley kicked the hornet's nest of a talent war?

NVIDIA acquires a startup for $400 million

This time, Huang was determined to win.

According to foreign media, several other companies also expressed interest in bidding for CentML, but NVIDIA ultimately secured the AI startup with a total deal value of over $400 million.

Of that, the base acquisition price is estimated at over $300 million, with an additional earn-out tied to performance targets ...
About 2x lossless acceleration for Diffusion! A training-inference co-designed cache-learning framework arrives | HKUST & Beihang & SenseTime
量子位· 2025-07-06 05:12
Core Viewpoint
- The article presents a new caching acceleration solution, HarmoniCa, which addresses the slow inference and high cost of diffusion models, particularly the Diffusion Transformer (DiT) architecture, achieving high-performance lossless acceleration [1][7][30].

Group 1: HarmoniCa Framework
- HarmoniCa is designed to overcome the DiT architecture's deployment speed bottlenecks, enabling efficient training-inference collaboration through a feature-caching acceleration framework [1][30].
- The framework introduces two key mechanisms, Step-Wise Denoising Training (SDT) and the Image Error Proxy Objective (IEPO), which align training with inference to improve performance [10][15][16].

Group 2: Mechanisms of HarmoniCa
- SDT simulates the full denoising process during training, reducing error accumulation and improving the clarity and stability of the final image [11][12][15].
- IEPO optimizes for final image quality rather than intermediate noise error, ensuring the training objective matches the ultimate goal of high-quality generation [15][16].

Group 3: Experimental Results
- HarmoniCa was tested against methods including Learning-to-Cache (LTC) and heuristic caching, demonstrating superior image quality and acceleration [17][19][20].
- In high-compression scenarios (10-step inference), HarmoniCa kept its image-quality edge, achieving lower FID scores than LTC while improving cache utilization [19][22].

Group 4: Comparison with Other Techniques
- HarmoniCa outperformed traditional pruning and quantization methods, providing stable acceleration without specialized hardware, making it a more versatile deployment option [21][24].
- The framework is compatible with quantization, boosting inference speed while maintaining image quality, indicating its potential as an "acceleration plugin" [24][25].

Group 5: Cost Analysis
- HarmoniCa showed clear advantages in both training and inference costs, cutting training time by roughly 25% versus mainstream methods with minimal impact on inference throughput [27][28].
- The Router added at inference is lightweight, occupying only 0.03% of parameters [28].

Group 6: Conclusion
- HarmoniCa represents a new paradigm in caching acceleration, underscoring the importance of synchronizing training and inference to achieve performance, efficiency, and adaptability in real-world deployments [29][30].
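The cache-or-recompute Router described above can be sketched as a toy denoising loop. This is not HarmoniCa's implementation: the "DiT block" is a stand-in function and the router logits are hand-set rather than trained with SDT/IEPO; it only shows how per-step binary cache decisions cut the number of expensive forward passes.

```python
import numpy as np

class CachedDiT:
    """Toy sketch of HarmoniCa-style feature caching: a lightweight
    per-step Router decides at inference whether to reuse the cached
    feature from an earlier step or recompute it. Router weights here
    are hand-set; in HarmoniCa they are learned."""

    def __init__(self, router_logits):
        self.router_logits = np.asarray(router_logits, dtype=float)
        self.cache = None
        self.recomputes = 0  # count of expensive forward passes

    def expensive_block(self, x):
        self.recomputes += 1
        return np.tanh(x)  # stand-in for a DiT transformer block

    def step(self, t, x):
        # Negative logit => reuse cache (if one exists); else recompute.
        use_cache = self.cache is not None and self.router_logits[t] < 0
        if not use_cache:
            self.cache = self.expensive_block(x)
        return self.cache

# Router chooses to recompute on steps 0, 3, 6, 9 and reuse otherwise.
logits = [1 if t % 3 == 0 else -1 for t in range(10)]
model = CachedDiT(logits)
x = np.linspace(-1.0, 1.0, 4)
for t in range(10):
    x = model.step(t, x)
print(model.recomputes)  # 4 of 10 steps recomputed
```

The training side of HarmoniCa (SDT unrolling the whole denoising trajectory, IEPO scoring the final image) exists precisely to pick these binary decisions so the cheap reuse steps cost as little quality as possible.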
A conversation with the No. 1 AI bookkeeping app, Kapi Bookkeeping (咔皮记账): a million-plus users in six months in a niche market, and how an AI startup mines incremental demand
量子位· 2025-07-05 09:59
Core Viewpoint
- The article discusses the emergence and growth of the AI bookkeeping app Kapi Bookkeeping, which positions itself as a personal CFO for young people, using AI to simplify and improve the bookkeeping experience and reaching over one million users within six months [2][5][41].

Group 1: Product Overview
- Kapi Bookkeeping is designed as an AI-native personal life assistant targeting young adults aged 22 to 30, primarily in first- and second-tier cities, who are beginning to recognize the importance of financial management [7][8].
- The app offers AI bookkeeping (text/voice/multi-modal), AI budgeting, financial analysis, and multi-asset account management, making bookkeeping easier and faster [5][9].
- It has achieved a leading position in the AI bookkeeping sector, with over one million users in just six months [5][41].

Group 2: Market Positioning and User Engagement
- The app tackles the difficulty of sustaining bookkeeping habits: many users want to track their spending, but few keep it up because traditional methods are tedious [8][9].
- Kapi uses AI to streamline the bookkeeping process, making it less burdensome and more appealing to users who previously struggled to maintain it [9][12].
- Product development is driven by continuous user feedback, enabling iterative improvements based on real-world usage [5][26].

Group 3: User Experience and Functionality
- The most praised feature is the AI bookkeeping flow, which automatically extracts data from user input and greatly reduces manual entry [19][24].
- A "life timeline" feature contextualizes spending within a timeline, making the record more relatable [19][24].
- Kapi aims to evolve beyond simple bookkeeping into a comprehensive financial agent that offers proactive suggestions and insights based on user data [46][47].

Group 4: Future Directions and Challenges
- The company acknowledges the rapid evolution of AI and the need to keep pace with new model developments to stay competitive [49].
- Its long-term goal is to drive positive changes in users' financial behavior, such as raising savings rates and managing debt more effectively [35][41].
- The app is currently free, focusing on refining the user experience before considering monetization [41][42].
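The core AI-bookkeeping step, turning a free-form note into a structured ledger entry, can be sketched as follows. Kapi does this with a model; the rule-based parser and the field names (amount/category/note) here are purely hypothetical stand-ins.

```python
import re

def parse_entry(text: str) -> dict:
    """Toy sketch of the AI-bookkeeping step: extract a structured
    ledger entry from a free-form note. A real system like Kapi uses a
    model; this regex-and-keyword version is only an illustration, and
    the output schema is hypothetical."""
    m = re.search(r"(\d+(?:\.\d+)?)", text)
    amount = float(m.group(1)) if m else 0.0
    # A tiny keyword-to-category table; a production system would
    # classify with a model rather than a lookup.
    categories = {"coffee": "dining", "taxi": "transport", "rent": "housing"}
    category = next(
        (v for k, v in categories.items() if k in text.lower()), "other"
    )
    return {"amount": amount, "category": category, "note": text.strip()}

print(parse_entry("Coffee with Sam, 4.50"))
```

However it is implemented, collapsing "type a sentence" into "ledger entry recorded" is what removes the manual-entry friction the article identifies as the main reason bookkeeping habits fail.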