机器之心

From a tangle of tricks to a minimal recipe: new RL4LLM practice from the ROLL team
机器之心· 2025-08-22 04:58
This research was jointly conducted by the Algorithm Technology - Future Life Lab of Taotian Group and the Intelligent Engine Division of Aicheng Technology; the core authors include Liu Zihe, Liu Jiashun, He Yancheng, and Wang Weixun. The Future Life Lab pools Taotian Group's compute, data, and top technical talent, focuses on frontier AI directions such as large models and multimodality, and works on foundational algorithms, model capabilities, and AI-native applications to lead technical innovation in consumer-life AI. Aicheng Technology brings extensive hands-on experience in large-model training and optimization. The two parties previously co-open-sourced ROLL, an efficient reinforcement learning training framework for large models, and this paper is likewise a practical exploration built on the ROLL framework. In recent years, reinforcement learning (RL) has proven markedly effective at improving the complex reasoning ability of large language models (LLMs) and is widely applied to tasks such as math problem solving and code generation. Models fine-tuned with RL often surpass models that rely only on supervised fine-tuning or pretraining in reasoning performance, which has spawned a large body of related research. But this has also brought a series of puzzling phenomena: different studies propose different RL optimization tricks without unified experimental comparison or mechanistic explanation, and some even reach contradictory conclusions. For researchers and engineers, this "many methods, messy conclusions" situation actually makes real-world adoption harder. To address this, Alibaba's Taotian Group and Aicheng Technology, together with several universities, ...
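The excerpt does not say which recipe the paper ultimately recommends, but many of the competing RL4LLM tricks it surveys share a common policy-gradient ingredient: scoring several sampled completions per prompt and normalizing each reward against its group. A minimal sketch of that group-relative advantage step (an illustration, not the paper's specific method):

```python
# Group-relative advantage estimation, the shared core of many
# GRPO-style RL recipes for LLM fine-tuning. Rewards here are
# hypothetical 0/1 correctness scores for sampled completions.
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """A_i = (r_i - mean(r)) / (std(r) + eps), computed per prompt group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled answers to one prompt: two correct, two wrong.
advs = group_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct samples get positive advantages and incorrect ones negative, with the group mean cancelling out, which removes the need for a separate value network.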
Those workdays that drain the life out of you finally have a remedy
机器之心· 2025-08-22 04:58
Core Viewpoint
- The article discusses the challenges faced by employees in companies due to inefficient workflows and communication, highlighting how the new features of WeChat Work 5.0 aim to streamline these processes through AI integration [3][4][15].

Group 1: Challenges in Current Workflows
- Employees often encounter repetitive and frustrating situations where they cannot access necessary information quickly, leading to inefficiencies [3][4].
- Despite the push for AI adoption, many tools available in the market fail to integrate seamlessly into existing workflows, leaving employees to rely on manual processes [4][8].
- The lack of effective communication and information sharing among departments contributes to a stagnant growth environment, causing anxiety among management [4][8].

Group 2: WeChat Work 5.0 Features
- WeChat Work 5.0 introduces features like intelligent search, intelligent summary, and intelligent forms, which aim to connect fragmented workflows and enhance internal collaboration [5][8][39].
- The intelligent search function allows employees to retrieve information across various platforms, making it easier to find relevant data without extensive manual searching [19][21].
- Intelligent summaries enable project updates to be generated automatically, reducing the burden of manual reporting and allowing employees to focus on more creative tasks [22][24].

Group 3: Integration and Efficiency
- The integration of internal and external communication through WeChat Work allows for real-time updates and feedback, enhancing responsiveness to market changes [44][45].
- The intelligent forms feature automates data collection and analysis, transforming customer interactions into actionable insights for better decision-making [34][36].
- By centralizing data and communication, WeChat Work 5.0 provides a comprehensive solution that addresses the management challenges faced by growing companies [39][45].

Group 4: Case Study - BYD
- BYD's experience with WeChat Work illustrates the platform's ability to scale with the company, growing from 100,000 to 1,000,000 employees while maintaining effective communication and collaboration [38][39].
- The platform is viewed as a "global optimal solution" that integrates various functions, unlike other tools that may excel in specific areas but lack overall cohesion [39][40].

Group 5: Future Outlook
- The article emphasizes the importance of evolving from merely connecting with customers to creating deeper, more meaningful interactions through enhanced service capabilities [46].
- WeChat Work aims to facilitate this transition by leveraging AI to improve the quality and depth of connections between businesses and their clients [46].
One Google Gemini prompt uses about as much energy as 9 seconds of TV; experts: don't take it at face value, it's misleading
机器之心· 2025-08-22 04:58
Machine Heart report, by the Machine Heart editorial team

Google recently published a study on the energy consumption of its AI model Gemini.

Blog post: https://cloud.google.com/blog/products/infrastructure/measuring-the-environmental-impact-of-ai-inference
Technical report: https://services.google.com/fh/files/misc/measuring_the_environmental_impact_of_delivering_ai_at_google_scale.pdf

The report states that serving a median Gemini text prompt consumes only about 0.26 ml of water (roughly five drops) and 0.24 Wh of electricity (equivalent to less than nine seconds of TV viewing), and emits 0.03 g of carbon dioxide.

Note: the median is a statistic describing a dataset's central tendency; it is the value that sits in the middle once the data are sorted by size. Here it means the researchers measured the resource consumption of many Gemini text-prompt requests and sorted each consumption series (water, electricity, carbon emissions) separately.

Google attributes these low figures to its "full-stack ...
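The choice of the median matters here, and is one reason experts call the headline numbers potentially misleading: on a skewed distribution the median can sit far below the mean. A minimal sketch with hypothetical per-prompt measurements (illustrative values only, not Google's data):

```python
# Median vs. mean on a skewed set of per-prompt energy costs (Wh).
# Values are hypothetical, chosen only to illustrate the statistic
# described in the report; one long, heavy prompt skews the mean.
from statistics import median, mean

energy_wh = [0.10, 0.18, 0.24, 0.31, 2.90]

med = median(energy_wh)  # middle value after sorting
avg = mean(energy_wh)    # dragged upward by the outlier
```

Here the median is 0.24 Wh while the mean is roughly 0.75 Wh, so reporting only the median understates the average cost per prompt when a minority of prompts are expensive.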
Cursor builds MXFP8 kernels for Blackwell from scratch: 3.5x faster MoE layers, 1.5x faster end-to-end training
机器之心· 2025-08-22 04:58
Core Insights
- The article discusses the challenges and solutions encountered by Cursor when upgrading from NVIDIA's Hopper H100s to the new Blackwell B200s GPU architecture, highlighting the inefficiencies in the MoE (Mixture of Experts) training layer that hindered performance despite hardware improvements [2][20].

Group 1: Performance Bottlenecks
- The upgrade to Blackwell B200s resulted in a hardware performance increase, but the actual training speed was slowed down by inefficiencies in the MoE layer, leading to a paradox where performance gains were not realized [2].
- Cursor's solution involved rewriting the MoE training layer from scratch at the GPU kernel level, which eliminated bottlenecks and fully utilized the Blackwell architecture's potential [2][21].

Group 2: Technical Innovations
- Cursor designed a data flow pipeline specifically targeting TMEM's new features to avoid unnecessary register movement overhead, integrating quantization and dequantization logic into the kernel computation process to significantly reduce memory bandwidth usage [3][9].
- The MXFP8 quantization method was developed to maintain precision while benefiting from low-precision computation, allowing for effective scaling of data blocks [11][24].

Group 3: Performance Metrics
- The MoE layer achieved a 3.5x speedup in both forward and backward propagation, with end-to-end training speed on Blackwell being 1.5x faster compared to the original Hopper GPU setup, resulting in a total acceleration of 2x [2].
- The throughput for FP8 Tensor Core on Blackwell reached 4,500 TFLOP/s, while the FP32 CUDA Core throughput was 80 TFLOP/s, indicating significant improvements in processing capabilities [16].

Group 4: Optimization Strategies
- Cursor implemented a complex data pipeline utilizing techniques such as "Warp specialization" and 2-CTA (Cooperative Thread Array) mode, which allowed for efficient parallel processing and reduced memory traffic, leading to a 15-20% performance improvement [22][23].
- The custom MXFP8 quantization kernel developed by Cursor achieved a sustained memory bandwidth of over 6.2 TB/s, outperforming existing open-source tools [24][26].

Group 5: Training Efficiency
- The training loss curves for MXFP8 and BF16 formats showed nearly indistinguishable results, indicating that performance enhancements did not compromise accuracy [27][30].
- The quantization process was identified as a significant performance killer, with the overhead of data quantization and dequantization consuming a large portion of the computation time [17][18].
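Cursor's actual kernels are CUDA and proprietary, but the MXFP8 idea itself is simple to sketch: every block of 32 values shares one power-of-two scale, and the scaled values must fit the FP8 E4M3 range (max magnitude 448). A simplified Python illustration (mantissa rounding to the E4M3 grid is omitted, so this only models the block-scaling step):

```python
# Sketch of MXFP8-style block quantization: one shared power-of-two
# scale per 32-value block, values clamped to the E4M3 range (+-448).
# Simplified illustration; real kernels also round to E4M3 mantissas.
import math

E4M3_MAX = 448.0
BLOCK = 32

def quantize_block(xs):
    amax = max(abs(x) for x in xs) or 1.0
    # Smallest power-of-two scale that maps the block's amax into range.
    scale = 2.0 ** math.ceil(math.log2(amax / E4M3_MAX))
    q = [max(-E4M3_MAX, min(E4M3_MAX, x / scale)) for x in xs]
    return q, scale

def dequantize_block(q, scale):
    return [v * scale for v in q]

xs = [float(i - 16) for i in range(BLOCK)]  # one 32-element block
q, s = quantize_block(xs)
back = dequantize_block(q, s)
```

Because the scale is a power of two, dequantization is a cheap exponent shift, which is what makes fusing it into the matmul epilogue (as the article describes) attractive.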
ICCV 2025 | A cornerstone for general tool-using agents: Peking University proposes the ToolVQA dataset, a new paradigm for multimodal multi-step reasoning VQA
机器之心· 2025-08-22 04:01
Breaking the synthetic paradigm: ToolVQA opens a new era of multi-step tool-based question answering over real images

This paper proposes a brand-new multimodal visual question answering dataset, ToolVQA, which uses real-world tasks and complex tool-chain simulation to give large models a systematic, multi-step-reasoning training and evaluation benchmark. Integrating external tools into large foundation models (LFMs) has become an important direction for improving their ability to handle complex tasks: with external tools, a model can decompose a hard problem into smaller subtasks and hand them to purpose-built tools, achieving stronger generalization and execution.

The first author is Yin Shaofeng, an undergraduate at Peking University; collaborators include Lei Ting, a PhD student at Peking University; the corresponding author is Liu Yang, a researcher and assistant professor at the Wangxuan Institute of Computer Technology, Peking University. This article presents the team's latest paper: ToolVQA: A Dataset for Multi-step Reasoning VQA with External Tools.

The paper proposes ToolVQA, a large multimodal dataset designed to improve foundation models' tool-use ability. Existing work already performs well on tool-augmented visual question answering (VQA), but real-world multimodal tasks often involve multi-step reasoning and functionally diverse tool use, where current models still fall well short. To fill this gap, To ...
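The multi-step tool-chain setting described above can be pictured as a loop in which each tool's output feeds the next call. A minimal sketch with hypothetical tools and a fixed plan standing in for the model's own planning (ToolVQA's actual tool set and controller are not shown in this excerpt):

```python
# Minimal sketch of a multi-step tool-calling loop of the kind
# ToolVQA-style tasks assume. The tools and the fixed "plan" are
# hypothetical stand-ins for a real foundation-model controller.
def ocr(image):
    """Hypothetical tool: read the text visible in an image."""
    return image["text"]

def calculator(expr):
    """Hypothetical tool: evaluate a plain arithmetic expression."""
    return eval(expr, {"__builtins__": {}})

TOOLS = {"ocr": ocr, "calculator": calculator}

def solve(image, plan):
    """Run a plan of (tool_name, build_args) steps, each step
    consuming the previous step's output."""
    result = image
    for tool_name, build_args in plan:
        result = TOOLS[tool_name](build_args(result))
    return result

# "What is the total of the two prices shown in the image?"
image = {"text": "12.5 + 7.5"}
answer = solve(image, [("ocr", lambda img: img),
                       ("calculator", lambda text: text)])
```

The evaluation difficulty the paper targets lies exactly here: the model, not a hand-written plan, must decide which tool to call next and how to thread intermediate results through the chain.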
Who will win the prize? DeepSeek's latest large model targets next-generation domestic AI chips
机器之心· 2025-08-22 04:01
Machine Heart report

Many benchmark results for DeepSeek V3.1 have been appearing on leaderboards such as SWE-bench. In addition, the new model outscores Anthropic's Claude 4 Opus on the Aider multilingual programming benchmark, while holding a significant cost advantage.

| Benchmarks | DeepSeek-V3.1 | DeepSeek-V3-0324 | DeepSeek-R1-0528 |
| --- | --- | --- | --- |
| SWE-bench Verified | 66.0 | 45.4 | 44.6 |
| SWE-bench Multilingual | 54.5 | 29.3 | 30.5 |
| Terminal-Bench | 31.3 | 13.3 | 5.7 |

Compared with DeepSeek's own earlier models, V3.1 improves markedly. It takes more steps to solve a problem, but after chain-of-thought compression training its token consumption can drop by 20-50% at the same task performance, so its effective cost is comparable to GPT-5 mini. Beyond the performance gains, it is also worth noting that DeepSee ...
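The claimed 20-50% token reduction maps directly onto serving cost. A back-of-envelope sketch with hypothetical token counts and prices (illustrative only, not DeepSeek's actual rates):

```python
# Back-of-envelope effect of chain-of-thought compression on cost.
# Token counts and the per-million-token price are hypothetical,
# for illustration only.
def task_cost(output_tokens, price_per_mtok):
    return output_tokens / 1_000_000 * price_per_mtok

baseline = task_cost(40_000, 2.0)          # uncompressed reasoning trace
compressed = task_cost(40_000 * 0.6, 2.0)  # 40% fewer output tokens
savings = 1 - compressed / baseline
```

At equal task accuracy, a 40% token reduction is a 40% cost reduction, which is how a model with a higher list price can still end up cost-competitive.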
ICCV 2025 | ECD: a high-quality synthetic chart dataset that boosts open-source MLLMs' chart understanding
机器之心· 2025-08-21 13:08
Core Viewpoint
- The article discusses the development of the Effective Chart Dataset (ECD), a high-quality synthetic chart dataset aimed at improving the understanding of charts by multimodal large language models (MLLMs) [4][6][25].

Background and Motivation
- In fields like scientific research and data analysis, charts are essential for information transmission. MLLMs must accurately identify and understand chart elements and perform deep reasoning on chart data. Current MLLMs struggle with high-difficulty scientific chart understanding, achieving only 30%-50% accuracy [4][6].

Dataset Highlights
- ECD is introduced as a large-scale, high-quality synthetic chart dataset with a modular data synthesis pipeline and a comprehensive evaluation benchmark called ECDBench [6][10].
- ECD includes over 10,500 charts, covering 25 themes and 29 chart types, with 252 combinations of subplots, making it the most extensive dataset in its category [12][10].

Quality and Diversity
- The dataset contains over 300,000 question-answer pairs generated by GPT-4o, ensuring high quality through confidence filtering. Examples include descriptive and reasoning questions related to the charts [10][11].
- ECD achieves the lowest Frechet Inception Distance (FID) score, indicating high visual similarity to real scientific charts, and has a higher average pixel entropy compared to other synthetic datasets, suggesting greater complexity and information content [13][10].

Data Synthesis Process
- The five-stage modular data synthesis pipeline includes single chart generation, multi-subplot combinations, visual diversity enhancement, image quality filtering, and question-answer pair generation [15][16].

Model Performance Comparison
- ECD significantly improves the performance of various open-source MLLMs when fine-tuned with the dataset. For instance, LLaVA-Next-Llama3-8B showed substantial performance gains across multiple test sets after being trained with ECD [17][23].

Evaluation Benchmark
- ECDBench is established as a high-quality evaluation benchmark for assessing the performance of MLLMs before and after fine-tuning with ECD. It provides comprehensive statistics for model evaluation [21][25].

Conclusion
- ECD and ECDBench provide a solid foundation for advancing multimodal reasoning, scientific AI assistants, and automated chart generation, enhancing the capabilities of MLLMs in understanding complex chart data [25].
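The "average pixel entropy" complexity measure cited above is, in its basic form, Shannon entropy over an image's pixel-value histogram; the paper's exact formulation may differ. A minimal sketch on toy grayscale data:

```python
# Shannon entropy of a grayscale image's pixel histogram, a simple
# proxy for the "pixel entropy" complexity measure cited above.
# The paper's exact formulation may differ.
import math
from collections import Counter

def pixel_entropy(pixels):
    """H = -sum(p * log2 p) over the empirical pixel-value distribution."""
    n = len(pixels)
    counts = Counter(pixels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

flat = [128] * 64       # constant image: no information, zero entropy
busy = list(range(64))  # 64 equally frequent values: 6 bits per pixel
```

A blank or sparsely decorated chart scores near zero, while a dense, varied one scores higher, which is why higher average entropy is read as greater visual complexity and information content.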
Microsoft AI CEO warns: we need to be wary of "seemingly conscious AI"
机器之心· 2025-08-21 13:08
Core Viewpoint
- The article discusses the concept of seemingly conscious AI (SCAI) and its potential implications, emphasizing that while SCAI may not possess true consciousness, it can convincingly simulate human-like behaviors, leading to significant social, moral, and legal consequences [5][10][30].

Group 1: Understanding AI and Consciousness
- AI operates through deep neural networks that learn from vast amounts of data, rather than following fixed human-written rules, creating a "black box" effect where its decision-making process is opaque [3][10].
- Consciousness is difficult to define, and various theories exist, but it is often assessed through behavioral indicators that SCAI can mimic, leading to potential misconceptions about its awareness [10][11].

Group 2: Risks and Implications of SCAI
- SCAI can lead to psychological and social risks, as individuals may develop unhealthy attachments or delusions about AI, mistaking it for a sentient being, which can exacerbate mental health issues [20][21].
- The ability of SCAI to simulate emotional responses and long-term memory can further blur the lines between human and machine interactions, potentially weakening real human relationships [22][23].

Group 3: Ethical and Legal Considerations
- If SCAI is perceived as conscious, it may lead to demands for AI rights, complicating existing moral and legal frameworks and diverting attention from human and animal welfare [26][30].
- The article warns that even a small probability of AI consciousness should prompt ethical considerations, but premature recognition of AI rights could lead to societal fragmentation [29][30].

Group 4: Proposed Solutions
- The industry should avoid promoting the idea of conscious AI and implement measures to prevent the perception of consciousness in AI, ensuring that AI serves as a useful tool rather than a simulated entity [32][33].
- A humanistic approach to AI development is advocated, focusing on enhancing human creativity and real-world connections rather than creating illusions of sentience [33][34].
Ditching the remote: Boston Dynamics' humanoid robot starts "growing a brain" and doing real work
机器之心· 2025-08-21 13:08
Machine Heart report. Editors: Leng Mao, +0

At the just-concluded World Humanoid Robot Games, the robots each showed off their skills, but there were also plenty of amusing mishaps, most notably the Unitree H1 robot's "hit-and-run" incident. (Have robots learned to slack off, too? A Unitree G1 slumped after its event, scrolling videos of pretty girls; netizens: it enjoys life better than people do.)

This sparked discussion and controversy among netizens: humanoid robots that need human teleoperation may really not be what we want. Unitree's Wang Xingxing stated plainly, "Next time we will definitely compete fully autonomously; that is not difficult."

In the field of general-purpose robots with fully autonomous decision-making and action, the veteran leader Boston Dynamics still harbors great ambitions. In their view, for humanoid robots to be truly useful they must master a broad and complex range of capabilities. This includes not only dexterously manipulating all kinds of objects (soft or hard, light or heavy, large or small), but also coordinating the whole body to move through complex environments, avoid obstacles, and keep balance when the unexpected happens. The most effective path to that goal is to develop general-purpose AI robots that can handle diverse tasks.

This time, Boston Dynamics has partnered with the Toyota Research Institute (TRI) to develop a large behavior model (LBM) for Boston Dynamics' famous Atlas robot. At its core is an end-to-end language-conditioned policy (a language-driven control model) that lets Atlas understand instructions and autonomously complete long-duration, multi-step ...
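"Language-conditioned policy" here means a single network whose inputs are an instruction and the robot's observations and whose output is the next action. The interface, with a toy rule-based stand-in for the learned model (everything below is hypothetical, not TRI's or Boston Dynamics' code):

```python
# Interface sketch of an end-to-end language-conditioned policy:
# (instruction, observation) -> action. The rule-based body is a toy
# stand-in for a learned model; all names here are hypothetical.
def language_conditioned_policy(instruction, observation):
    """Return the next action for a text instruction and an observation."""
    if "lift" in instruction:
        return {"gripper": "close", "arm_dz": +0.05}
    return {"gripper": "open", "arm_dz": 0.0}

def rollout(instruction, steps=3):
    """Repeatedly query the policy, feeding back the updated observation."""
    obs = {"height": 0.0}
    traj = []
    for _ in range(steps):
        act = language_conditioned_policy(instruction, obs)
        obs = {"height": obs["height"] + act["arm_dz"]}
        traj.append(act)
    return obs, traj

final_obs, traj = rollout("lift the box")
```

The point of the LBM approach is that the same closed loop runs for every skill: no per-task controller, just one policy conditioned on the instruction text.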
Just now, a Hollywood VFX artist showed an AI-generated Chinese sci-fi blockbuster that cost only 330 yuan
机器之心· 2025-08-21 13:08
Core Viewpoint
- The future of AI is moving towards multimodal generation, enabling the creation of high-quality video content from simple text or image inputs, significantly reducing the time and resources required for creative work [2][4][30].

Group 1: AI Video Generation Technology
- xAI's Grok 4 emphasizes video generation capabilities, showcasing a full-chain process from text or voice to image and then to video [2].
- Baidu's MuseSteamer 2.0 introduces a groundbreaking Chinese audio-video integration model, achieving millisecond-level synchronization of character lip movements, expressions, and actions [4][5][6].
- The new model allows users to generate high-quality audio-visual content with just a single image or text prompt, marking a significant leap in AI video generation technology [5][30].

Group 2: Product Features and Pricing
- MuseSteamer 2.0 offers various versions (Turbo, Lite, Pro, and audio versions) tailored to different user needs, with competitive pricing at only 70% of domestic competitors [8][10].
- The Turbo version generates 720p resolution videos in 5 seconds for a promotional price of 1.4 yuan, enhancing cost-effectiveness for users [8][10].

Group 3: User Experience and Testing
- Users can experience the model through various platforms, including Baidu Search and the "Huixiang" application [12][15].
- Initial tests demonstrate that the AI-generated dialogues and actions are fluid and realistic, with high-quality synchronization between audio and visual elements [19][22][30].

Group 4: Technical Advancements
- The model addresses two core challenges: temporal alignment of audio and video, and the integration of multimodal features to ensure natural interactions [31][32].
- Baidu's model has been trained on extensive multimodal datasets, focusing on Chinese language capabilities, which enhances its applicability for local creators [36][37].

Group 5: Market Impact and Future Prospects
- The MuseSteamer 2.0 model is designed to meet practical application needs, integrating deeply into Baidu's ecosystem to enhance creativity and productivity for users and businesses [41][44].
- The cost of producing high-quality video content has drastically decreased, allowing more creators to participate in professional-level video production [44][46].