量子位

No More Guesswork When Choosing an LLM! Virginia Tech's Model Selection Framework Accepted to ICML 2025
量子位· 2025-06-18 04:58
Core Viewpoint - The article discusses the introduction of the LensLLM framework by researchers from Virginia Tech, which significantly improves model selection efficiency while reducing costs by nearly 90% [1][2].
Group 1: Model Selection Challenges
- In the era of rapidly emerging large language models (LLMs), selecting the right model has become a major pain point for AI engineers and researchers [2][6].
- Existing methods rely heavily on experience and trial-and-error, making it difficult to balance cost and effectiveness [8][9].
- Incorrect model selection can lead to wasted GPU resources, slowed product iteration, and even project failures [7].
Group 2: LensLLM Framework
- LensLLM aims to end the era of model selection based on intuition, providing a theoretical foundation derived from a new PAC-Bayes generalization bound [9].
- The framework reveals the nonlinear performance changes of LLMs during fine-tuning based on different data scales [9][10].
- It introduces a phase transition phenomenon in model performance, transitioning from "pre-power law" to "power law" as data volume increases [13][14].
Group 3: Performance Prediction and Cost Efficiency
- LensLLM utilizes a neural tangent kernel (NTK) enhanced scaling law model to accurately predict model performance with minimal data fine-tuning [18][19].
- The framework demonstrates superior prediction accuracy compared to baseline methods, achieving RMSE errors as low as one-fifth of previous methods [22][24].
- It significantly reduces computational costs, with a maximum reduction of 88.5% compared to full tuning methods while maintaining a high selection accuracy of 91.1% [26][27].
Group 4: Future Applications
- LensLLM is positioned not only as a model selection tool but also as a potential core component for model evaluation and management [28].
- Future explorations will include expanding LensLLM to multi-task environments and complex model structures, aiming to create a more universal intelligent model selection system [28].
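To make the idea of predicting performance from small fine-tuning runs more concrete, here is a minimal sketch that fits a plain power law to loss measured at a few small data sizes and extrapolates to a larger one. It is not the LensLLM method: the NTK-enhanced model and the "pre-power law" phase are not reproduced, the irreducible loss is assumed to be zero, and the function names (`fit_power_law`, `predict_loss`, `rmse`) and the synthetic numbers are purely illustrative.

```python
import math


def fit_power_law(sizes, losses):
    """Fit loss = b * size ** (-beta) by least squares in log-log space.

    Illustrative assumption: there is no irreducible loss term.
    Returns the pair (b, beta).
    """
    xs = [math.log(d) for d in sizes]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def predict_loss(b, beta, size):
    """Extrapolate the fitted power law to a new training-set size."""
    return b * size ** (-beta)


def rmse(predicted, actual):
    """Root-mean-square error between predicted and observed values."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


# Synthetic example: pretend these losses came from cheap fine-tuning runs.
small_sizes = [200, 400, 800, 1600]
small_losses = [5.0 * d ** -0.3 for d in small_sizes]
b, beta = fit_power_law(small_sizes, small_losses)
print("fitted b and beta:", b, beta)
print("predicted loss at 100000 samples:", predict_loss(b, beta, 100_000))
```

In a selection setting one would fit such a curve per candidate model and rank candidates by the extrapolated loss; the RMSE helper mirrors the metric the summary uses to compare predictors.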
AI Plays Pokémon and Uncovers a 30-Year-Old Code Bug! Google Paper Describes How the AI Beat the Game, Showing It Can Handle Complex Tasks
量子位· 2025-06-18 04:58
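梦晨 (Mengchen), reporting from Aofeisi. 量子位 | official account QbitAI
Google has released the technical report for its Gemini 2.5 family of large models, and one of the main highlights is, of all things, an AI playing Pokémon. That's right, the game from childhood memories. Google devotes a great deal of space to what Gemini 2.5 Pro actually did while playing Pokémon Blue: in the 70-page paper, the keyword "Pokemon" appears 59 times.
In particular, the report notes that when the AI-controlled character is close to dying, Gemini 2.5 Pro falls into a "panic" state that significantly degrades the model's reasoning ability, to the point where it forgets to use basic functions such as the pathfinding tool. This panic behavior appeared many times, so often that viewers of the livestream could accurately tell from the AI's behavior patterns when it was "panicking".
It all began at the end of March, when an independent developer, Joel Zhang, set up a "Gemini Plays Pokémon" livestream on Twitch. The original goal was simply to livestream the development of agent tooling capable of playing through the full game. Gemini 2.5 Pro exceeded expectations: during the test period it beat the game outright, becoming Pokémon League Champion, entering the Hall of Fame, and reaching the peak of its "AI life".
The whole run took 831 hours, far more than the few dozen hours an average human player needs. On the second run, which officially used fixed agent tooling, the completion time was cut in half. AI shows astonishing gaming skill, complex ...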
MiniMax Shows Off AI Video Acrobatics: More Impressive the Longer You Watch, with Remarkably Strong Instruction Following
量子位· 2025-06-18 00:54
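白交 (Baijiao), reporting from Aofeisi. 量子位 | official account QbitAI
Video effects this intricate and polished are all AI-generated? And they are all new capabilities of the latest Chinese AI large model?? That's right: they all come from Hailuo 2.0 (海螺2.0), just released by MiniMax, which can handle extreme physics scenarios and natively supports 1080P.
It can do this:
Prompt: The character in the frame juggles throwing knives with fast and fluid motion.
Even a fast-changing scene like this holds up. According to the official description, the newly upgraded model reaches top-tier levels in both instruction following and generation quality, and its cost efficiency sets a new record.
Hailuo02
The latest officially released examples show some details of this upgrade. It can also spin and leap through the air without stopping:
Prompt: Acrobatic performance: a performer swings rapidly on an aerial executing high-difficulty moves as the camera follows.
Take the handling of light and shadow, for example. Even fairly ...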
Large Models Need Self-Reflection Too: Shanghai AI Lab Synthesizes a "Mistake Notebook" That Lifts LLM Math Scores by 13.3%
量子位· 2025-06-18 00:54
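By analyzing in depth the errors that models make, the authors built an "error-correction" dataset and used a reflection mechanism to guide the model smoothly from a wrong line of reasoning to the correct answer.
Contributed by the LEMMA project team. 量子位 | official account QbitAI
Does a large model need not only correct knowledge but also a "mistake notebook"? Shanghai AI Lab has proposed a new way of learning: it constructs "error-reflection-correction" data so that large models can imitate the human learning pattern of learning from and reflecting on mistakes. As a result, on Llama3-8B, accuracy on math problems rose by 13.3% on average.
The method is called LEMMA (Learning from Errors for Mathematical Advancement), and it is designed specifically to teach large models how to learn from errors. The model not only became more accurate but also gained strong autonomous error-correction and generalization abilities. The paper has been published in ACL'25 Findings.
The authors first systematically analyzed seven major categories of errors that current mainstream large models commonly make on math problems (such as misunderstanding the question, confusing formulas, and calculation mistakes) and found that the distribution of these errors is highly consistent across models. The results show that the most frequent error is misunderstanding the question, which accounts for more than 40%, followed by the next two most common types, formula confusion and calculation errors.
| Error Type | Definition | ...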
A 13-Year-Old CEO Joins the Agent Startup Scene
量子位· 2025-06-17 09:16
Core Viewpoint - The article highlights the rising trend of young entrepreneurs in the AI startup space, exemplified by 13-year-old Michael Goldstein, who founded FloweAI, a company focused on developing a general AI agent for various tasks [1][2][3].
Group 1: Company Overview
- FloweAI is an AI startup founded by Michael Goldstein, who aims to create a versatile AI agent capable of performing tasks such as PPT creation, document writing, and flight booking [2][3].
- The company has set a business goal of generating $10,000 in monthly revenue and aims to expand its operations to a million-dollar scale [3][31].
- FloweAI currently supports web-based usage and offers a free tier for users to test up to 10 tasks per month, with a paid Pro version available for CAD 20 (approximately RMB 105) for unlimited access and advanced features [6][7].
Group 2: Product Features
- The AI tool can generate presentations, with a recent test resulting in a 10-page PPT on the development of the Agent industry, completed in 6.5 minutes [11][13].
- FloweAI generates corresponding file code, allowing users to modify presentations directly within the workspace [15].
- The platform is continuously evolving, with plans to add more features such as Gmail management and improved task handling capabilities [25].
Group 3: Market Position and Feedback
- Compared to more established AI tools like Manus and Genspark, FloweAI's PPT creation capabilities are still developing, with users noting issues such as text and frame size mismatches and basic content depth [18][20][21].
- User feedback indicates that while the tool enhances visual appeal, it lacks detailed content and advanced editing features [20][21].
Group 4: Young Entrepreneurs Trend
- The article emphasizes a growing trend of young individuals entering the AI entrepreneurship space, with examples like a 10-year-old developing a SaaS tool for monitoring phone numbers [33][35].
- This trend showcases the enthusiasm and initiative of younger generations in exploring AI's potential, suggesting that age is not a barrier to technological innovation and entrepreneurship [36].
Ghibli-Style "Mobile Game" Goes Viral, Generated with 可灵 (Kling) + Midjourney! A Tutorial Is Out and the Effect Can Be Replicated
量子位· 2025-06-17 09:16
Core Viewpoint - The article discusses the emergence of AI-generated games in the style of Studio Ghibli, highlighting their aesthetic appeal and the technology behind their creation [4][8].
Group 1: AI Game Creation
- The games are created using AI tools such as 可灵AI (Kling AI) and Midjourney, where creators provide text prompts to generate visuals and videos [3][8].
- Users can replicate similar effects easily by following the provided guidelines [9][14].
- The article showcases examples of AI-generated scenes, including fishing and market exploration, emphasizing the immersive and interactive potential of these creations [12][22].
Group 2: Market Impact and Growth
- 可灵AI has shown rapid growth, with an annual revenue run rate exceeding $100 million as of March, and monthly revenues surpassing 100 million RMB in April and May [32][33].
- According to a report by 中金 (CICC), 可灵AI's market share in global AI film tools has reached 30.7% [34].
Crowd-Tested Web Programming Rankings: DeepSeek-R1 Overtakes Claude 4 and Is Crowned World No. 1
量子位· 2025-06-17 07:41
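一水 (Yishui), reporting from Aofeisi. 量子位 | official account QbitAI
On LiveCodeBench it is nearly on par with OpenAI o3-high, so much so that many netizens speculate it is the rumored R2.
Is Claude's position as the king of coding no longer secure?? The latest results from the large model arena are in: DeepSeek's new R1 took first place in web programming, narrowly beating Claude Opus 4. Bear in mind that Claude Opus 4 is widely recognized as "the world's strongest coding model".
So what exactly is DeepSeek-R1-0528, that it can beat Claude Opus 4 at coding? From the name you might take it for a minor version update, but in fact:
Date range: 10/1/2024 to 5/1/2025
| Rank | Model | Pass ... ↓ | Easy… | Medium… | Hard ... |
| --- | --- | --- | --- | --- | --- |
| 1 | O4-Mini (High) | 79.5 | 98.8 | 86.7 | 63.8 |
| 2 | O3 (High) | 75.4 | 98.8 | 81.9 | 57.9 |
| 4 | Deep ...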
On the Ground at CVPR: Crowds Pack the Booths of Chinese Exhibitors, and Tencent Stands Out with 40+ Accepted Papers
量子位· 2025-06-17 07:41
Core Insights
- The CVPR 2025 conference showcased significant participation from Chinese companies, highlighting their growing influence in the global AI and computer vision landscape [3][7][30].
- The conference emphasized advanced topics such as multimodal and 3D generation technologies, with Gaussian Splatting emerging as a key focus area [6][15][17].
- The acceptance rate for papers at CVPR 2025 was 22.1%, indicating a competitive environment and increasing recognition for high-quality research [11][13].
Group 1: Conference Highlights
- The conference received a record number of submissions, with 13,008 valid papers and 2,878 accepted, reflecting a growing interest in cutting-edge research [11].
- Key topics included multimodal models, diffusion models, and large language models, with "multimodal" appearing 175 times in accepted paper titles [14].
- The integration of computer vision and graphics was noted, with a significant rise in 3D-related research due to advancements in neural rendering [17][18].
Group 2: Chinese Companies' Participation
- Chinese companies, particularly Tencent, demonstrated strong engagement, with Tencent alone having over 40 accepted papers across various research areas [32].
- The participation of Chinese firms in sponsorship and workshops indicates their commitment to advancing technology and attracting talent [34][36].
- Tencent's investment in R&D reached approximately 70.686 billion RMB in 2024, showcasing their dedication to AI and technology development [44].
Group 3: Talent Acquisition and Development
- The conference served as a platform for companies to attract top talent, with Tencent's "Qingyun Plan" offering competitive salaries and career advancement opportunities [50][51].
- The focus on technical talent is evident, with 73% of Tencent's workforce in technology roles, emphasizing the importance of skilled personnel in driving innovation [51].
- The initiative aims to create a positive cycle where talent is nurtured and retained, contributing to the company's long-term technological advancements [46][48].
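High-Quality Data Synthesis Without Hundreds of Billions of Parameters! This Open-Source Framework Lets Small Models "Team Up for a Comeback", with 7B Performance Closing In on 72B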
量子位· 2025-06-17 07:41
Core Viewpoint - The GRA framework (Generator–Reviewer–Adjudicator) proposed by Shanghai AI Lab and Renmin University of China enables small models to collaboratively generate high-quality training data without the need for large-scale language model distillation [1][2][13].
Group 1: GRA Framework Overview
- GRA operates on the principle of "multi-person collaboration" and "role division," simulating a peer review process to ensure data quality [7][12].
- The framework consists of three main roles: Generator, Reviewer, and Adjudicator, each contributing to the data generation and evaluation process [8][9][10].
Group 2: Experimental Results
- GRA-generated data quality matches or exceeds that of single large language models across ten mainstream datasets, showing significant performance improvements [2][14].
- The GRA framework integrates five open-source small language models, demonstrating that collaboration among smaller models can yield competitive results against larger models [14][17].
Group 3: Performance Metrics
- GRA-generated data improved training performance by an average of 6.18% on LLaMA-3.1 and 11.81% on Qwen-2.5 compared to original data [16].
- GRA's performance is only 0.59% lower than the Qwen-72B distilled version, while outperforming it by 8.83% when trained on Qwen-2.5 data [17].
Group 4: Advantages of GRA
- GRA enhances data diversity and quality, filling gaps in the original seed data and providing a broader semantic coverage [18].
- The data quality is validated through a robust review process, with over 87.3% of samples receiving high consistency scores [19].
- GRA-generated data presents a higher task difficulty, increasing the effectiveness of training for small models [20].
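The Generator, Reviewer, and Adjudicator roles described above can be pictured with a small sketch. The summary does not spell out how the roles interact, so the control flow here is an assumption: reviewers each score a generated sample, samples with consistent scores are accepted or rejected by their mean, and an adjudicator is consulted only when reviewers disagree. All names (`Sample`, `gra_round`), the 0 to 10 score scale, and both thresholds are hypothetical, and the lambdas stand in for calls to small language models.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Callable, List, Optional


@dataclass
class Sample:
    instruction: str
    response: str


# Hypothetical role signatures; real roles would wrap small language models.
Generator = Callable[[str], Sample]
Reviewer = Callable[[Sample], float]  # assumed score scale: 0 to 10
Adjudicator = Callable[[Sample, List[float]], bool]


def gra_round(
    seed: str,
    generator: Generator,
    reviewers: List[Reviewer],
    adjudicator: Adjudicator,
    accept_score: float = 7.0,  # assumed acceptance threshold
    max_spread: float = 1.5,    # assumed disagreement threshold
) -> Optional[Sample]:
    """Run one generate-review-adjudicate round; return the sample if kept."""
    sample = generator(seed)
    scores = [review(sample) for review in reviewers]
    if pstdev(scores) > max_spread:
        # Reviewers disagree: defer to the adjudicator's decision.
        return sample if adjudicator(sample, scores) else None
    # Reviewers agree: keep the sample only if the consensus score is high.
    return sample if mean(scores) >= accept_score else None


# Toy stand-ins for the three roles.
toy_generator = lambda seed: Sample(seed, "draft answer for " + seed)
toy_reviewers = [lambda s: 8.0, lambda s: 8.0, lambda s: 9.0]
toy_adjudicator = lambda sample, scores: False
print(gra_round("What is 2 + 2?", toy_generator, toy_reviewers, toy_adjudicator))
```

Routing only contested samples to the adjudicator keeps the extra model call rare, which is one plausible reading of the peer-review analogy in the summary.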
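AI Operations Get an "Emergency Brake"! Tongyi and the Institute of Automation's AI Decision Diagnosis Model Reaches SOTA Error-Correction Accuracy for GUI Agents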
量子位· 2025-06-17 07:41
Core Viewpoint - The article discusses the introduction of the GUI-Critic-R1 model by Alibaba's Tongyi Lab in collaboration with the Chinese Academy of Sciences, which aims to diagnose decisions made by GUI agents before execution to prevent irreversible errors and unnecessary operations [1].
Group 1: Error Correction Examples
- Example 1: The model successfully guided the agent to use the search box in the Joplin application to find a file instead of incorrectly navigating back [2].
- Example 2: The model identified an incorrect action of clicking the "Statistics" button and suggested clicking "Expense Log" instead to fulfill the task of deleting duplicate expenses [4].
- Example 3: The model advised terminating the task when the agent incorrectly decided to press the record button again while filming a video [6].
Group 2: Importance of Pre-Execution Feedback
- In dynamic environments, errors made by GUI agents can lead to a series of subsequent failures, necessitating higher accuracy in single-step operations [8].
- Due to limited self-reflection capabilities, MLLMs often struggle to independently detect their own errors, highlighting the need for additional feedback mechanisms [9][10].
- Providing feedback on decision-making before executing actions is crucial to avoid dangerous and redundant operations [11][12][13].
Group 3: Implementation of GUI-Critic-R1
- The GUI-Critic-R1 model incorporates a pre-execution reflection mechanism to provide effective feedback for GUI automation tasks [16].
- A data collection pipeline was established, resulting in a dataset of 6,000 high-quality chain-of-thought annotations for training the model [16][21].
- The training method includes a reinforcement fine-tuning cold start and suggestion-aware group relative policy optimization to enhance the model's reasoning and generalization capabilities [17][18][26].
Group 4: Performance Evaluation
- The GUI-Critic-R1 model demonstrated strong competitive performance across various scenarios, outperforming some closed-source models and validating the effectiveness of the S-GRPO approach [36][38].
- The model achieved the best success rate in the AndroidWorld benchmark, confirming its ability to detect errors and provide corrective suggestions effectively [38].
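The pre-execution feedback idea can be sketched as a guard around each agent step: the agent proposes an action, a critic judges it before anything touches the interface, and a rejected action is re-proposed using the critic's suggestion. This is only an illustration of that control flow under stated assumptions; `Critique`, `guarded_step`, the retry limit, and the rule-based toy critic are hypothetical stand-ins and do not represent the GUI-Critic-R1 model or its interface.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Critique:
    ok: bool
    suggestion: Optional[str] = None


def guarded_step(
    propose: Callable[[str, Optional[str]], str],  # (state, hint) -> action
    critic: Callable[[str, str], Critique],        # (state, action) -> verdict
    execute: Callable[[str], None],
    state: str,
    max_retries: int = 2,                          # assumed retry budget
) -> Optional[str]:
    """Execute an action only after the critic approves it.

    Returns the executed action, or None if no proposal was approved,
    in which case nothing has been executed.
    """
    hint: Optional[str] = None
    for _ in range(max_retries + 1):
        action = propose(state, hint)
        verdict = critic(state, action)
        if verdict.ok:
            execute(action)
            return action
        hint = verdict.suggestion  # feed the correction back to the agent
    return None


# Toy run loosely modeled on Example 2 above (rule-based stand-ins).
executed: List[str] = []


def toy_agent(state: str, hint: Optional[str]) -> str:
    return hint if hint else "click Statistics"


def toy_critic(state: str, action: str) -> Critique:
    if action == "click Expense Log":
        return Critique(ok=True)
    return Critique(ok=False, suggestion="click Expense Log")


guarded_step(toy_agent, toy_critic, executed.append, "expense app home screen")
print(executed)
```

Because `execute` is called only after approval, a wrong first proposal costs one extra critic call rather than an irreversible action in the interface.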