New work from Tsinghua's Tang Jie: can large models play GuanDan?
量子位· 2025-09-10 10:01
Core Viewpoint
- The research indicates that large models can effectively play various card games, demonstrating their capabilities in complex decision-making scenarios [2][4][52].

Group 1: Model Performance
- Different models exhibit varying performance across different card games, with fine-tuned models showing superior results compared to API-based and base models [3][40].
- Among the API-based models, GPT-4o performs the best overall, while GLM-4 demonstrates strong capabilities in games like DouDizhu and GuanDan [39][40].
- Fine-tuned models, particularly GLM4-9B-Chat-mix, excel in multiple games, including DouDizhu, GuanDan, and Uno, indicating their versatility [42][40].

Group 2: Game Selection and Learning Methodology
- The research team selected eight popular card games based on their complexity and the availability of high-quality models and data [8].
- The learning process involved generating high-quality interaction data through teacher models and opponents, allowing the large language models to learn effectively [14][16].
- The complexity of the games influenced the number of training instances collected, with more complex games like DouDizhu and GuanDan requiring larger datasets [20][21].

Group 3: Inter-Game Influence
- The study found that models trained on similar games can enhance each other's performance, while those trained on games with significant rule differences may experience performance conflicts [52][49].
- For instance, models trained on GuanDan showed good performance in DouDizhu, suggesting a positive transfer of skills between these games [45].

Group 4: Generalization and Capability
- The research indicates that while training on card games, the general capabilities of the models may decline, but this can be mitigated by incorporating general data into the training process [56][54].
- The mixed training approach allowed for some recovery of general capabilities, demonstrating the balance between specialized game skills and broader knowledge [56].
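The Group 4 point about mixing general data back into game fine-tuning can be made concrete with a small sketch. This is a hypothetical illustration of the general technique, not the paper's actual pipeline; the file formats and the mixing ratio are assumptions.

```python
import json
import random

def build_mixed_training_set(game_files, general_file, general_ratio=0.3, seed=0):
    """Mix card-game interaction data with general instruction data so that
    fine-tuning on games does not erode general capabilities.

    game_files    : JSONL files of (prompt, response) pairs collected from
                    teacher models playing DouDizhu, GuanDan, Uno, etc.
    general_file  : JSONL file of general instruction-tuning samples.
    general_ratio : fraction of the final set drawn from general data
                    (an assumed knob, not a value from the paper).
    """
    rng = random.Random(seed)

    game_samples = []
    for path in game_files:
        with open(path, encoding="utf-8") as f:
            game_samples.extend(json.loads(line) for line in f)

    with open(general_file, encoding="utf-8") as f:
        general_samples = [json.loads(line) for line in f]

    # Size the general slice so it makes up `general_ratio` of the final mix.
    n_general = int(len(game_samples) * general_ratio / (1 - general_ratio))
    mixed = game_samples + rng.sample(general_samples, min(n_general, len(general_samples)))
    rng.shuffle(mixed)
    return mixed
```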
Qwen scores again: the world's fastest open-source model is born, topping 2,000 tokens/sec!
量子位· 2025-09-10 10:01
Core Viewpoint
- The article discusses the launch of K2 Think, billed as the world's fastest open-source AI model, developed by MBZUAI and G42 AI, achieving over 2,000 tokens per second with only 32 billion parameters [1][3][8].

Group 1: Model Performance
- K2 Think has demonstrated processing speeds exceeding 2,000 tokens per second, with individual tests recording 2,730.4 tokens/second and 2,224.7 tokens/second [10][14][18].
- The model performs well on various mathematical benchmarks, scoring 90.83 on AIME'24 and 81.24 on AIME'25 [25].

Group 2: Technical Innovations
- K2 Think incorporates several technical innovations [31]:
  1. Supervised fine-tuning for long-chain reasoning, allowing the model to think step by step rather than jumping straight to an answer.
  2. Reinforcement learning with verifiable rewards, improving performance in mathematics and logic.
  3. Planning before reasoning, so the model outlines a solution before working through the details.
  4. Best-of-N sampling during reasoning, generating multiple candidate answers and selecting the best one.
  5. Speculative decoding, generating and verifying answers in parallel to cut redundant computation.
  6. Hardware acceleration on the Cerebras WSE, enabling the high-speed output.

Group 3: Model Background
- K2 Think is built on the Qwen 2.5-32B model from Hugging Face, indicating a connection to Chinese technology [6][5].
- Despite having only 32 billion parameters, K2 Think claims to match the performance of flagship models from OpenAI and DeepSeek [24].
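The best-of-N sampling step listed above is a generic technique; a minimal sketch follows. The `generate` and `score` callables are hypothetical placeholders (any sampler plus any verifier or reward model), not K2 Think's actual interfaces.

```python
def best_of_n(generate, score, prompt, n=8):
    """Draw n candidate answers and keep the one the scorer likes best.

    generate(prompt) -> str        : a sampling-based generator (placeholder).
    score(prompt, answer) -> float : a verifier / reward model (placeholder).
    """
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda ans: score(prompt, ans))
```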
Kuaishou's AI "super employee" goes live: one sentence cuts a complete short video, from copy to publishing in one go
量子位· 2025-09-10 08:01
Core Viewpoint
- The article discusses the launch of Kwali, an AIGC (Artificial Intelligence Generated Content) tool from Kuaishou that enables users to create complete promotional videos in just a few minutes by simply stating their requirements, significantly lowering the barriers to video production [1][2][37].

Group 1: Functionality and Features
- Kwali integrates multiple agents into a single framework to assist in video creation, including a material library and digital human resources, allowing users to generate high-quality short videos without prior filming skills [2][4].
- The process involves several agents that handle different tasks: intent analysis, script generation, material matching, and editing, all of which can operate independently and in parallel [5][18][42].
- Users can upload their private materials, which the system will automatically tag for easy future access, facilitating seamless integration with the platform's material library [14].

Group 2: Production Process
- The video creation process starts with Kwali breaking down the user's request into key selling points, audience, and context tags, followed by script writing and material selection [8][22].
- The script includes dialogue and corresponding visual descriptions, designed to capture audience attention, and is generated based on analysis of popular videos in the same category [28][30].
- After gathering the necessary visual materials, Kwali matches appropriate fonts and background music, and synthesizes voiceovers using TTS technology before final editing [33][35].

Group 3: Industry Impact
- The introduction of Kwali represents a fundamental shift in the video production supply chain, reducing the need for extensive resources and time traditionally required for creating promotional content [37][38].
- The new model allows small businesses and individual brands to produce content more frequently and affordably, transforming video marketing into a lightweight tool for daily operations [40][45].
- The streamlined process enables rapid testing of new creative ideas, allowing businesses to quickly adapt and respond to market demands, ultimately enhancing their marketing efficiency [44][46].
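To illustrate how agents that operate independently and in parallel can be composed into a pipeline like the one described above, here is a minimal sketch. The agent names and the split between sequential and parallel steps are assumptions for illustration, not Kwali's actual architecture.

```python
from concurrent.futures import ThreadPoolExecutor

def make_video(request, agents):
    """Sequential steps feed parallel ones: intent analysis and scripting come
    first, then material matching and voiceover synthesis run side by side,
    and an editing agent assembles the result.

    `agents` is a dict of callables standing in for individual agents
    (intent, script, materials, voiceover, edit) -- names are illustrative.
    """
    intent = agents["intent"](request)      # selling points, audience, context tags
    script = agents["script"](intent)       # dialogue + shot-by-shot visual notes

    with ThreadPoolExecutor() as pool:
        clips_future = pool.submit(agents["materials"], script)  # match library/user clips
        voice_future = pool.submit(agents["voiceover"], script)  # TTS narration
        clips, voice = clips_future.result(), voice_future.result()

    return agents["edit"](script, clips, voice)  # fonts, BGM, final cut
```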
Genuinely PhD-level: GPT-5 gives the first explicit convergence rate for the fourth moment theorem, with only light guidance from math professors
量子位· 2025-09-10 08:01
Core Insights
- GPT-5 has successfully extended the qualitative fourth moment theorem to a quantitative form with explicit convergence rates, marking a significant advancement in mathematical research [1][2][10].

Group 1: Research Achievements
- The original theorem indicated that convergence would occur but did not specify the speed of convergence; GPT-5's contribution clarifies this aspect [2].
- OpenAI co-founder Greg Brockman expressed satisfaction with the progress made using GPT-5 in mathematical research [4].
- GPT-5 Pro improved known boundary values in convex optimization from 1/L to 1.5/L within minutes, showcasing its capabilities [8].

Group 2: Research Methodology
- A controlled experiment was conducted by three mathematics professors using the Malliavin–Stein framework to test GPT-5's ability to generalize the fourth moment theorem [9][10].
- Initial prompts were based on a paper that established a qualitative fourth moment theorem applicable to two Wiener–Itô integrals with differing parity [11].
- GPT-5 provided a generally correct conclusion but made errors in reasoning that could jeopardize the proof's validity [13][14].

Group 3: Iterative Improvement
- Upon identifying errors, researchers prompted GPT-5 to check its formulas and provide detailed derivations, leading to further corrections [15].
- GPT-5 was able to format the results into a research paper structure, including an introduction, main theorem statements, and a complete proof process [17].
- The AI suggested that the method could be extended to non-Gaussian frameworks, indicating its potential for broader applications [20].

Group 4: Further Exploration
- Researchers aimed to extend the findings to Poisson cases, recognizing structural differences between Gaussian and Poisson scenarios [21][24].
- GPT-5 initially overlooked a critical fact regarding non-negativity in Poisson cases but was able to correct itself after specific guidance from researchers [26][28].

Group 5: Publication Challenges
- The authors initially intended to list GPT-5 as a co-author but were informed by arXiv that AI cannot be credited as an author [29].
- Ultimately, the paper was submitted without GPT-5 listed as an author, reflecting ongoing discussions about AI's role in academic contributions [30].
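For orientation, the classical single-chaos version of the result in question can be stated as follows; this is the standard Nualart–Peccati formulation plus a schematic quantitative bound, given as background rather than as the paper's two-integral statement.

```latex
% Qualitative fourth moment theorem (Nualart–Peccati), single fixed chaos:
% if $F_n$ are multiple Wiener–Itô integrals of fixed order $q \ge 2$ with
% $\mathbb{E}[F_n^2] \to 1$, then
\[
  F_n \xrightarrow{\;d\;} N(0,1)
  \quad\Longleftrightarrow\quad
  \mathbb{E}[F_n^4] \to 3 .
\]
% A quantitative refinement bounds the distance to the Gaussian by the
% fourth-moment gap, schematically
\[
  d\bigl(F_n,\, N(0,1)\bigr) \;\le\; C_q \,\sqrt{\mathbb{E}[F_n^4] - 3},
\]
% which is the kind of explicit convergence rate the article says GPT-5
% derived for the setting of two Wiener–Itô integrals with differing parity.
```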
Tencent's version of "Claude Code" is here! The L4 era of AI programming is coming
量子位· 2025-09-10 08:01
Core Viewpoint
- Tencent has launched the AI CLI tool CodeBuddy Code and opened public testing for CodeBuddy IDE, marking a significant step in AI programming tools, particularly in the CLI format, which is becoming a foundational infrastructure for enterprise-level development [1][3][14].

Group 1: Product Overview
- CodeBuddy IDE is an independent AI IDE currently in public testing, with the domestic version being free and the international version offering a limited Pro model experience during the testing phase [2][3].
- CodeBuddy Code is designed for professional engineers, allowing natural language to drive the entire development and operations lifecycle, enhancing automation efficiency [3][23].
- The product matrix includes CodeBuddy IDE, CodeBuddy Code, and CodeBuddy plugins, with the latter already officially launched and available for free use [3][8].

Group 2: Market Context
- The emergence of CodeBuddy Code comes at a time when developers are moving away from Claude Code due to recent controversies, positioning Tencent's offering as a timely alternative [6].
- The AI CLI format, pioneered by Claude Code, has changed the market landscape, integrating traditional CLI advantages with AI capabilities suitable for automation and enterprise development [11][14].

Group 3: Development Trends
- AI programming tools are evolving through five levels, with the CLI format representing a significant advancement, allowing AI to transition from a supportive role to a driving force in software engineering [11][16].
- The CLI mode is particularly advantageous for enterprise-level teams, covering the entire software lifecycle from task breakdown to deployment [19][20].

Group 4: Performance Metrics
- Tencent reports that over 90% of its engineers are using CodeBuddy, resulting in an average coding time reduction of over 40%, with AI-generated code accounting for more than 50% of the total [20][21].
- The proportion of AI-generated code in code reviews has increased from 12% to 35%, indicating a growing reliance on AI in the development process [20].

Group 5: Features and Functionality
- CodeBuddy Code supports natural language interaction, allowing users to describe tasks without needing to learn complex commands, and manages project context in a traceable and shareable manner [26][24].
- The platform integrates seamlessly with Git, CI/CD, and monitoring systems, facilitating high-efficiency collaboration among multiple agents [25][26].
- The memory system of CodeBuddy Code includes project memory, user memory, and global memory, enabling long-term context management across projects [29].

Group 6: Future Directions
- The CLI-driven intelligent programming platform represents a new direction for enterprise-level AI programming, transforming developers into AI collaborative architects [37][38].
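As an illustration of how a project/user/global memory hierarchy like the one described in Group 5 can be organized, here is a generic sketch; the scopes, keys, and lookup order are assumptions, not CodeBuddy Code's actual implementation or API.

```python
from dataclasses import dataclass, field

@dataclass
class LayeredMemory:
    """Three memory scopes consulted from most to least specific:
    project memory (per-repository context), user memory (personal
    preferences), and global memory (organization-wide conventions)."""
    project: dict = field(default_factory=dict)
    user: dict = field(default_factory=dict)
    global_: dict = field(default_factory=dict)

    def recall(self, key: str):
        # Most specific scope wins; fall back to broader scopes.
        for scope in (self.project, self.user, self.global_):
            if key in scope:
                return scope[key]
        return None

# Illustrative usage with made-up keys and values.
mem = LayeredMemory(
    project={"build_cmd": "pnpm build"},
    user={"editor": "vim"},
    global_={"code_style": "google"},
)
assert mem.recall("build_cmd") == "pnpm build"  # found in project scope
assert mem.recall("code_style") == "google"     # falls back to global scope
```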
No more choosing between fast and slow thinking: Huawei's open-source 7B model switches freely, cutting chain-of-thought length by nearly 50% with accuracy unchanged
量子位· 2025-09-10 08:01
允中, from 凹非寺 — 量子位 | WeChat official account QbitAI

A domestically developed open-source model means you no longer have to choose between fast thinking and slow thinking!

Huawei has just released openPangu-Embedded-7B-v1.1, which packs a dual "thinking engine" into only 7B parameters.

For a long time, large models could not offer fast-thinking and slow-thinking modes at the same time, a major pain point for the industry. Amid the current large-model melee, every major player is looking for a way to break the deadlock, but until now the open-source field has lacked a model that can switch freely between fast and slow thinking modes.

Fast, or slow? AI, too, has "decision paralysis" when facing problems of different difficulty.

Now, through a progressive fine-tuning strategy and a distinctive adaptive fast/slow-thinking mode, openPangu-Embedded-7B-v1.1 supports both manually switching between "fast thinking" and "slow thinking" and automatically, seamlessly switching between the two modes according to problem difficulty.

Simple questions get answered in a flash; complex tasks get careful deliberation. This fills the gap in open-source large models' capabilities here and delivers a win for both efficiency and accuracy.

Across multiple authoritative benchmarks covering general knowledge, math, and code, the model's accuracy improves substantially over its predecessor, and introducing automatic mode switching does not sacrifice accuracy. On benchmarks such as CMMLU, openPangu-Embedded-7B-v1.1 maintains accuracy while shortening the average chain-of-thought length by nearly 50 ...
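A minimal sketch of the general idea behind difficulty-driven switching between the two modes follows; the difficulty estimator, threshold, and generator callables are hypothetical placeholders, not Huawei's actual mechanism.

```python
def answer(question, fast_generate, slow_generate, estimate_difficulty,
           threshold=0.5, mode=None):
    """Route a question to fast or slow thinking.

    fast_generate / slow_generate : callables that return a direct answer or a
        step-by-step (chain-of-thought) answer, respectively (placeholders).
    estimate_difficulty(question) -> float in [0, 1] (placeholder).
    mode : "fast" / "slow" for manual switching, or None to switch
        automatically based on estimated difficulty.
    """
    if mode is None:
        mode = "slow" if estimate_difficulty(question) > threshold else "fast"
    return fast_generate(question) if mode == "fast" else slow_generate(question)
```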
The first Data Agent benchmark is here! 2,007 test tasks cover databases, PDFs, video, and audio heterogeneous data sources in one sweep
量子位· 2025-09-10 08:01
FDABench team contribution — 量子位 | WeChat official account QbitAI

Are data agents actually any good? Benchmark them and find out!

Nanyang Technological University and the National University of Singapore, together with Huawei, have open-sourced FDABench, the first comprehensive benchmark dedicated to data agents performing analysis over heterogeneous, mixed data.

The benchmark spans 50+ data domains, sets multiple difficulty levels and task types, and introduces an Agent-Expert collaboration framework to ensure test-case quality and data consistency, while supporting Data Agents, RAG, semantic operators, and four typical Data Agent workflow patterns.

The team used FDABench to evaluate a range of data agent systems and found that each system shows distinct strengths in response quality, accuracy, latency, and token cost.

Details below.

Covering databases, PDFs, video, and audio heterogeneous data sources in one sweep

The growing demand for data-driven decision-making has created an urgent need for data agents that can integrate structured and unstructured data for analysis.

△ Example of a Data Agent

To address these challenges, the team proposed FDABench, the first data agent benchmark designed specifically to evaluate agents in multi-source data analysis scenarios.

First, because it is difficult to design test cases that can evaluate an agent's full range of abilities on multi-source analysis tasks, a comprehensive data agent ...
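To make the evaluation setup concrete, here is a generic sketch of a heterogeneous-source benchmark record and the loop that collects the four kinds of metrics mentioned above (response quality, accuracy, latency, token cost); the field names and scoring are illustrative, not FDABench's actual schema.

```python
import time
from dataclasses import dataclass

@dataclass
class BenchTask:
    question: str
    sources: list          # e.g. ["database", "pdf", "video", "audio"]
    difficulty: str        # e.g. "easy" / "medium" / "hard"
    reference_answer: str

def evaluate(agent, tasks, grade):
    """Run a data agent over benchmark tasks and average four metrics.

    agent(task) -> (answer, tokens_used)        : the system under test (placeholder).
    grade(question, answer, reference) -> float : a quality scorer (placeholder).
    """
    results = []
    for task in tasks:
        start = time.time()
        answer, tokens = agent(task)
        results.append({
            "accuracy": float(answer.strip() == task.reference_answer.strip()),
            "quality": grade(task.question, answer, task.reference_answer),
            "latency_s": time.time() - start,
            "token_cost": tokens,
        })
    n = len(results)
    return {k: sum(r[k] for r in results) / n for k in results[0]} if n else {}
```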
NVIDIA's new GPU, purpose-built for ultra-long context and video generation
量子位· 2025-09-10 01:28
henry, from 凹非寺 — 量子位 | WeChat official account QbitAI

Jensen Huang is going after token-heavy workloads.

Just now, at the AI Infra Summit, NVIDIA announced a brand-new GPU built for million-token-scale code generation and generative-video applications: the NVIDIA Rubin CPX GPU.

According to Huang, Rubin CPX is the first CUDA GPU purpose-built for massive-context AI, letting models reason over millions of tokens "in one breath".

What's more, Rubin CPX supposedly saves you money the more you use it: every $100 million invested brings in $5 billion in token revenue. (A 50x return, if you take Jensen's word for it.)

Industry players such as Cursor, Runway, and Magic also say Rubin CPX will bring breakthroughs in code productivity, generative video creation, and autonomous large-model agents, respectively.

So then, what exactly is this GPU?

The first CUDA GPU built for massive-context AI

Rubin CPX is based on the NVIDIA Rubin architecture, uses a monolithic die design, and has NVFP4 compute built in, targeting high performance and high energy efficiency for AI inference.

Its performance gains show up mainly in the following areas:

Here, we can do a simple comparison against the A100. In terms of compute ...
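The quoted economics is simply a multiple; as a sanity check on the arithmetic (these are NVIDIA's claimed figures, not independent numbers):

```python
investment_usd = 100_000_000                  # "every $100 million invested"
claimed_revenue_usd = 5_000_000_000           # "$5 billion in token revenue"
print(claimed_revenue_usd / investment_usd)   # 50.0 -> the quoted 50x return
```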
A wave of Claude cancellations! Accused of quietly swapping in a cut-down model at peak hours, an engineer lists nine grievances and calls on everyone to unsubscribe
量子位· 2025-09-10 01:28
梦晨, from 凹非寺 — 量子位 | WeChat official account QbitAI

More than 2,000 people have liked the post, and quite a few have backed it up by actually unsubscribing. Among them are heavy users on the top-priced 20x Max plan.

Anthropic used to enjoy an excellent reputation in the developer community, with Claude Code alone estimated at an annualized revenue of $500 million. So what did it do to anger everyone?

Engineer Ahmad Osman lists the major grievances, and even that list isn't complete, which gives you a sense of how angry this developer is.

Claude's big crisis is not about some recent questionable moves; the product itself is broken.

An AI engineer has already taken the lead in calling on everyone to unsubscribe (the "PoS" here stands for Piece of Shit).

Commenters added that the worst part is the model quietly getting worse while you waste an hour before realizing it; no professional development environment should be unable to pin a version.

Fine. The ranting is done and the cancellations are in, but the work still has to get done; nobody is going back to writing code by hand the old-fashioned way. So what do people use now?

Many have flocked to OpenAI Codex next door, even catching the attention of Sam Altman himself.

The rise of OpenAI Codex

If you had opened the Claude Code subreddit on Reddit a few days ago, you would have found it full of discussions about OpenAI Codex, enough to make you wonder whether you had walked into the wrong room.

During daytime peak hours, what users get is a cut-down ...
Cook squeezes the toothpaste tube dry: the 5,999-yuan iPhone 17 gets a high refresh rate, and the new earbuds add heart-rate monitoring and simultaneous translation
量子位· 2025-09-09 20:23
According to Cook, this round of new products puts design at the core of everything. Judging by the results, the iPhone line's camera module has indeed largely said goodbye to the old "bathroom heater" look.

克雷西 鱼羊, from 凹非寺 — 量子位 | WeChat official account QbitAI

The standard iPhone finally gets a high refresh rate!

At the Apple launch event that just wrapped up, the iPhone, AirPods Pro, and Apple Watch took the stage in turn.

| New iPhone 17 Pro | New iPhone Air | New iPhone 17 |
| --- | --- | --- |
| Innovative design for peak performance and extra-long battery life. | The thinnest iPhone yet, with a high-performance core inside. | More appeal, plus added durability. |

Of course, what excites Apple fans most is that the entire iPhone lineup gets a high refresh rate this time. Indeed, the model may be entry-level, but the refresh rate is not.

Beyond the iPhone, the earbuds and the watch also see major upgrades; the toothpaste tube has been squeezed dry. AirPods Pro, for instance, has become a smart wearable, capable of simultaneous translation and heart-rate monitoring. The Apple Watch now supports 5G connectivity and adds a major new health feature.

You have to admit, Cook's knife work has genuinely softened this time (here's hoping NVIDIA's Jensen also ...