机器之心
A new paradigm for diffusion LLM inference: breaking the generation-length limit with dynamic, adaptive adjustment
机器之心· 2025-08-08 10:18
With the release of diffusion large language models (DLLMs) such as Gemini-Diffusion and Seed-Diffusion, this line of work has become a hot topic in both industry and academia. Current DLLMs, however, must use a preset, fixed generation length at inference time, and each task needs dedicated length tuning to reach its best results. To address this fundamental limitation, MMLab at The Chinese University of Hong Kong, Shanghai AI Laboratory, and collaborators propose DAEDAL, which lets a DLLM adjust its response length on its own according to the problem at hand, closing a key gap between DLLMs and autoregressive LLMs and laying a foundation for more flexible, efficient, and capable diffusion LLMs.
Paper title: Beyond Fixed: Variable-Length Denoising for Diffusion Large Language Models
Paper: https://arxiv.org/abs/2508.00819
Code: https://github.com/Li-Jinsong/DAEDAL
DAEDAL is a training-free denoising strategy: starting from a single, short initial length, it lets the model expand the length dynamically during generation as needed, matching existing denoising strategies whose generation lengths are carefully tuned per benchmark ...
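The excerpt only describes the mechanism at a high level, so the sketch below illustrates, in schematic Python, what a variable-length denoising loop of this kind might look like: start from a short fully masked canvas, commit high-confidence tokens each step, and append extra mask tokens when the tail fills up without an end marker. The `dummy_denoiser`, the expansion signal, and all thresholds are assumptions for illustration, not the authors' actual algorithm.

```python
import random

MASK, EOS = "<mask>", "<eos>"

def dummy_denoiser(seq):
    """Stand-in for a diffusion LLM: returns (token, confidence) per position.
    A real DLLM would predict all masked positions in parallel."""
    return [(("tok%d" % i, random.random()) if t == MASK else (t, 1.0))
            for i, t in enumerate(seq)]

def variable_length_denoise(prompt, init_len=16, max_len=256,
                            fill_thresh=0.9, expand_block=16, steps=64):
    # Start from a short, fully masked canvas instead of a preset long one.
    seq = [MASK] * init_len
    for _ in range(steps):
        preds = dummy_denoiser(prompt + seq)[len(prompt):]
        # Commit only high-confidence predictions; keep the rest masked.
        seq = [tok if s != MASK or conf >= fill_thresh else MASK
               for s, (tok, conf) in zip(seq, preds)]
        # Illustrative expansion signal: if the tail is already filled but no EOS
        # has appeared, the answer likely needs more room, so grow the canvas.
        if MASK not in seq[-expand_block:] and EOS not in seq and len(seq) < max_len:
            seq += [MASK] * expand_block
        if MASK not in seq:
            break
    return seq

print(variable_length_denoise(["Q:", "2+2?"]))
```

The key behavior the paper attributes to DAEDAL at a high level is exactly this: length is no longer a hyperparameter fixed per benchmark but something the model grows on demand during denoising.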
From Debugger to Developer: NoCode-bench, a new benchmark for the low-code era, endorsed by the SWE-Bench authors
机器之心· 2025-08-08 07:53
Core Insights
- The article discusses the introduction of a new benchmark called NoCode-bench, aimed at evaluating the capabilities of large language models (LLMs) in natural-language-driven feature-addition tasks in software development [3][27].
- Current LLMs show a low success rate of only 20% on these tasks, highlighting significant challenges in AI's ability to handle real-world software development scenarios [3][26].

Group 1: Benchmark Development
- NoCode-bench was developed to address the limitations of existing benchmarks like SWE-bench, which primarily focus on bug fixing rather than feature addition [6][27].
- The benchmark emphasizes the importance of understanding software documentation changes to implement new features, reflecting a more realistic development environment [6][27].
- The construction of NoCode-bench involved a rigorous five-phase process, starting from selecting well-maintained open-source projects to filtering instances based on developer-verified release notes [8][10][16].

Group 2: Challenges Identified
- The tasks in NoCode-bench present three main challenges:
  1. Increased input complexity: documentation changes are nearly twice as long as bug reports, requiring stronger long-text comprehension [12].
  2. Difficulty in locating changes: tasks often involve multiple files and code blocks, demanding strong cross-file editing capabilities [13].
  3. Greater editing volume: nearly 20% of tasks require modifications of over 200 lines of code, increasing the risk of errors [14].

Group 3: Model Performance Evaluation
- A comprehensive evaluation of six leading LLMs, including Claude-4-Sonnet and GPT-4o, revealed disappointing success rates, with the best-performing model achieving only 15.79% success [18][26].
- The analysis of failure cases identified three primary reasons for poor performance: lack of cross-file editing ability, insufficient understanding of codebase structure, and inadequate tool-invocation capabilities [20][21][22].

Group 4: Future Directions
- The research indicates that current LLMs are not ready for the complexities of document-driven feature development, suggesting a need for further advancements in AI capabilities [24][27].
- The findings provide a roadmap for future AI software engineers, focusing on improving cross-file editing, codebase comprehension, and tool interaction [27].
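For readers unfamiliar with how benchmarks of this kind score a model, the sketch below shows a schematic evaluation step in the spirit of NoCode-bench and SWE-bench: apply a model-generated patch to a checked-out repository, then run the developer-written tests that gate the instance. The paths, test IDs, and the use of pytest are placeholders; this is not the benchmark's official harness.

```python
import subprocess

def evaluate_instance(repo_dir: str, patch_file: str, test_ids: list[str]) -> bool:
    # Apply the model's patch; an instance fails outright if the patch is malformed.
    apply = subprocess.run(["git", "-C", repo_dir, "apply", patch_file],
                           capture_output=True)
    if apply.returncode != 0:
        return False
    # Run the gate tests; the instance counts as resolved only if they all pass.
    tests = subprocess.run(["python", "-m", "pytest", "-q", *test_ids],
                           cwd=repo_dir, capture_output=True)
    return tests.returncode == 0

# The reported success rate is then simply the fraction of resolved instances.
```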
Is GPT-5 really underwhelming? 机器之心 hands-on test; netizens: give us back 4o, give us back 4.5
机器之心· 2025-08-08 07:53
Reported by the 机器之心 editorial team. Waking up this morning, our feeds were flooded with GPT-5. In last night's launch livestream, which ran over an hour, OpenAI walked through GPT-5's performance and demonstrated many practical use cases; we won't repeat them here. Interested readers can see: "Just now, Altman releases GPT-5! 'PhD-level' intelligence free for everyone, while errors in the benchmark charts draw mockery across the internet." Altman tweeted that GPT-5 is their smartest model to date. Some give it good reviews, some give it bad ones; online opinion on GPT-5 is mixed. One tester with nearly two weeks of early access said it shows enormous progress over previous versions, reaching new heights in scientific reasoning, factual accuracy, and creative expression. LMArena benchmark results are also in: GPT-5 ranks first across text, web development, vision, hard problems, coding, math, creative writing, long queries, and more. User @emollick considers GPT-5 very smart and capable of all kinds of tasks, calling it a major breakthrough; for example, he had it build a procedural brutalist-architecture generator in which buildings can be dragged, edited, and iteratively refined in a slick way. Others report clear improvements in front-end experience, reduced hallucinations, and better writing quality that free and enterprise users will notice. But plenty of reviewers were unimpressed. User @ ...
Just now, Altman releases GPT-5! "PhD-level" intelligence free for everyone, while errors in the benchmark charts draw mockery across the internet
机器之心· 2025-08-07 20:48
Reported by the 机器之心 editorial team. Did everyone watch? After years of waiting, GPT-5 was finally released in the early hours of this morning. We were full of anticipation, and the nervousness of several core OpenAI staff was plainly visible during the livestream. Throughout the stream, Altman fired off more than a dozen tweets highlighting what to look for in GPT-5. Since there is a lot to cover, we walk through it below following his tweets. First, GPT-5 is a unified model: you no longer need to switch between different models; it decides by itself when deeper reasoning is needed. Although Altman stressed that benchmarks are not what matters, OpenAI still posted plenty of scores, for example in math, coding, visual perception, and health; the specific numbers are as follows. On pricing, GPT-5 comes in Free, Plus, and Pro plans. According to Altman, even the free tier gets "PhD-level intelligence" (the standard GPT-5, with reasoning), Plus users face looser usage limits, and Pro users get GPT-5 Pro. For developers, the three API versions are priced as follows: standard GPT-5 costs $1.25 per million input tokens and $10 per million output tokens, with the GPT-5 mini and nano versions cheaper. In the math domain: on the 2025 AIME test, without ...
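To make the quoted API prices concrete, here is a small, self-contained cost calculation. The token counts are invented example values, and only the standard GPT-5 rates quoted above are used; mini and nano rates are not assumed.

```python
# Illustrative cost estimate using the standard GPT-5 prices quoted above:
# $1.25 per 1M input tokens, $10 per 1M output tokens.
INPUT_PRICE_PER_M = 1.25
OUTPUT_PRICE_PER_M = 10.00

def api_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1e6) * INPUT_PRICE_PER_M + (output_tokens / 1e6) * OUTPUT_PRICE_PER_M

# e.g. a chat turn with a 2,000-token prompt and an 800-token reply (hypothetical sizes):
print(f"${api_cost(2_000, 800):.4f}")  # $0.0025 + $0.0080 = $0.0105
```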
Peking University and ByteDance jointly release SWE-Swiss: a "Swiss Army knife" for fixing code bugs, with a complete recipe aimed at open-source SOTA
机器之心· 2025-08-07 20:48
Figure 1: Performance versus model size on SWE-bench Verified. The study's 32B model, SWE-Swiss, achieves a top score of 60.2%, placing it in the same tier as much larger models such as Kimi-Dev and DeepSeek-R1-0528. This shows that the training recipe lets a much smaller model reach the same SOTA performance level, underscoring its efficiency. Recently, a joint study by Peking University, ByteDance's Seed team, and the University of Hong Kong proposed a complete "recipe" named SWE-Swiss for efficiently training AI models that solve software engineering problems. The team's 32B-parameter model, SWE-Swiss-32B, achieves 60.2% accuracy on the authoritative SWE-bench Verified benchmark, a new SOTA in its size class. The work demonstrates that, with careful methodological design, a mid-sized model can reach top-tier performance, offering a new direction for applying AI to software engineering. To support the community, the model and datasets will be fully open-sourced. Introduction: challenges and opportunities for software engineering AI. Automatically resolving real-world software issues is a formidable challenge for large language models (LLMs). Compared with pure code generation, this task requires models to understand ...
The cloud-computing leader partners with OpenAI for the first time: freedom of model "choice" is the ultimate win
机器之心· 2025-08-07 10:30
Core Viewpoint
- The collaboration between Amazon Web Services (AWS) and OpenAI marks a significant shift in the AI cloud-service landscape, breaking Microsoft's monopoly on reselling OpenAI's software and services and enhancing AWS's competitive edge in the large-model cloud-service market [3][15].

Summary by Sections

Collaboration Announcement
- AWS announced support for OpenAI's newly open-sourced models, gpt-oss (120b and 20b), and Anthropic's Claude Opus 4.1, through its platforms Amazon Bedrock and Amazon SageMaker AI [1][4][16].

Strategic Importance
- This partnership allows AWS to fill a critical gap in its model library, enhancing its "Choice Matters" strategy, which emphasizes the importance of diverse model options for various industry needs [7][10][15].

Model Ecosystem Development
- AWS's platforms now host over 400 mainstream commercial and open-source large models, facilitating a diverse AI ecosystem that accelerates technology adoption and innovation in the AI industry [10][18].

Performance and Cost Efficiency
- The performance of gpt-oss-120b is reported to be three times more cost-effective than Google's Gemini, five times that of DeepSeek-R1, and twice that of OpenAI's o4, providing budget-friendly access to top-tier AI capabilities for small and medium enterprises [14][15].

Enhanced Model Deployment
- AWS's Amazon SageMaker JumpStart allows rapid deployment of advanced foundation models, including OpenAI's offerings, enabling efficient customization and optimization for AI applications [14][24].

Future Prospects
- The collaboration is expected to create a win-win situation, expanding OpenAI's market reach while solidifying AWS's position as a leading platform for deploying and running various AI models [15][19].

AI Ecosystem Transformation
- AWS is evolving from a cloud-service provider into an AI capability-aggregation platform, enhancing its role in the AI ecosystem and providing better service to customers and developers [19][29].

Model Selection Flexibility
- The "Choice Matters" strategy addresses the diverse needs of different tasks, allowing developers to select models based on specific requirements, thus maximizing efficiency and effectiveness in AI applications [21][24].

Conclusion
- The integration of multiple models into a single platform is anticipated to lead to a significant surge in AI application development, enabling innovative solutions through the combination of various models [30][31].
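As a concrete illustration of what hosting these models on Bedrock looks like from a developer's side, here is a minimal sketch using boto3's Bedrock Runtime Converse API. The model identifier string and region are assumptions for illustration; the actual ID for gpt-oss should be taken from the Bedrock model catalog, and availability depends on the region.

```python
import boto3

# Minimal sketch: calling an open-weight model hosted on Amazon Bedrock via the
# Converse API. The model ID below is a placeholder; verify the real identifier
# for gpt-oss in the Bedrock console for your region before use.
client = boto3.client("bedrock-runtime", region_name="us-west-2")

response = client.converse(
    modelId="openai.gpt-oss-120b-1:0",  # hypothetical ID, for illustration only
    messages=[{"role": "user", "content": [{"text": "Summarize MapReduce in two sentences."}]}],
    inferenceConfig={"maxTokens": 256, "temperature": 0.2},
)

print(response["output"]["message"]["content"][0]["text"])
```

The same Converse call shape works across the 400+ models the article mentions, which is the practical substance of the "Choice Matters" pitch: swapping models is largely a matter of changing `modelId`.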
The next wave set to upend the internet: the Agentic Web is here!
机器之心· 2025-08-07 10:30
Core Viewpoint
- The article discusses the emergence of the "Agentic Web," a new paradigm in internet usage where AI agents autonomously complete tasks based on user-defined goals, marking a significant shift from traditional web interactions [3][6][57].

Group 1: Paradigm Shifts in the Web
- The internet has undergone three major paradigm shifts: from a keyword-driven "directory web" in the PC era, to a recommendation-driven "content explosion" in the mobile era, and now to an "action network" driven by AI agents [8][9][15].
- In the Agentic Web, the role of the web transitions from being an information repository to an ecosystem of actionable resources for AI agents [13][15].

Group 2: Definition and Structure of the Agentic Web
- The Agentic Web is defined as a distributed, interactive ecosystem where AI agents, powered by large language models (LLMs), continuously plan, coordinate, and execute goal-oriented tasks [16][17].
- Users interact with the web by delegating tasks to AI agents, which autonomously handle the execution of these tasks [20][21].

Group 3: Core Dimensions of the Agentic Web
- The structure of the Agentic Web can be understood through three core dimensions: intelligence, interaction, and economy [24][28].
- The "Agent Attention Economy" signifies a shift in focus from human clicks to AI-agent interactions, changing the metrics of commercial competition [29].

Group 4: Application Scenarios
- The capabilities of the Agentic Web can be categorized into transactional, informational, and communicational tasks, enabling AI agents to perform a wide range of functions from booking tickets to conducting research [30][31].
- In transactional tasks, users can simply state their needs, and AI agents will autonomously complete the entire process, enhancing efficiency [33].

Group 5: Challenges Ahead
- The implementation of the Agentic Web faces systemic challenges, including the need for improved AI capabilities, robust network infrastructure, and a redefined economic model [42][43].
- Key challenges include ensuring the reasoning and memory capabilities of AI agents, managing security risks associated with external tools, and establishing effective communication protocols for multi-agent collaboration [44][51].

Group 6: Socio-Economic Implications
- The rise of the Agentic Web poses significant implications for traditional advertising models, labor markets, and economic structures, necessitating a reevaluation of how value is created and distributed in the digital economy [56][58].
- As AI agents become more prevalent, there is a pressing need to address the potential displacement of jobs and ensure equitable economic benefits [56][58].
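To make the delegation pattern described in the application scenarios above more tangible, here is a schematic plan-and-act loop in Python. The `llm_plan` planner and the tool registry are stand-ins invented for illustration; they do not correspond to any specific framework or protocol named by the article.

```python
from typing import Callable, Dict, List

# Hypothetical tools an agent could draw on; in the Agentic Web these would be
# services exposed to agents rather than pages rendered for humans.
TOOLS: Dict[str, Callable[[str], str]] = {
    "search_flights": lambda q: f"3 flights found for '{q}'",
    "book_flight":    lambda q: f"booked: {q}",
}

def llm_plan(goal: str, history: List[str]) -> Dict[str, str]:
    """Stand-in for an LLM planner that decides the next tool call or finishes.
    A real agent would prompt an LLM with the goal, tool descriptions, and history."""
    if not history:
        return {"tool": "search_flights", "arg": goal}
    if len(history) == 1:
        return {"tool": "book_flight", "arg": "cheapest option"}
    return {"tool": "finish", "arg": "done"}

def run_agent(goal: str, max_steps: int = 5) -> List[str]:
    # The user states a goal once; the agent loops over plan -> act until done.
    history: List[str] = []
    for _ in range(max_steps):
        step = llm_plan(goal, history)
        if step["tool"] == "finish":
            break
        history.append(TOOLS[step["tool"]](step["arg"]))
    return history

print(run_agent("SFO to NYC next Friday, economy"))
```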
Teaching AI to read between the lines: the AI4SG team releases the first mental-health stigma corpus, tackling the problem of detecting implicit bias
机器之心· 2025-08-07 09:42
First author Han Meng is a PhD student at the National University of Singapore working on computational methods for psychological constructs. Corresponding author Yi-Chieh Lee is an assistant professor at the National University of Singapore whose research spans conversational AI, human-computer interaction, and mental-health technology. Co-author Renwen Zhang is an assistant professor at Nanyang Technological University specializing in computational communication and contributed the communication-studies perspective. Jungup Lee is an associate professor at the National University of Singapore with deep expertise in mental health, providing key domain knowledge for the study. Mental-health problems affect hundreds of millions of people worldwide, yet patients often carry a double burden: the suffering caused by the illness itself, and the prejudice and discrimination that come from society. World Health Organization data show that a substantial share of people with mental-health conditions delay or refuse treatment for fear of social discrimination. This "stigmatization" acts like an invisible barrier: it hinders patients' recovery and has become a serious social problem. Patients must face bias from different social settings while coping with their illness. What makes things harder is that stigma often appears in subtle, covert forms in everyday conversation, which even advanced AI systems struggle to identify effectively. Although natural language processing research on hate speech and offensive-language detection is plentiful, computational resources targeting mental-health stigma remain scarce. Existing datasets mainly come from ...
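To show the shape of the detection task, here is a minimal text-classification sketch. It is not the paper's method, and the example sentences and labels are invented for illustration; a real study would train on an annotated corpus such as the one released here.

```python
# Minimal sketch of stigma detection as binary text classification.
# Toy data below is invented; 1 = stigmatizing, 0 = non-stigmatizing.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "People with depression are just lazy and want attention.",
    "She takes medication for anxiety and is doing really well.",
    "You can't trust someone who has been to a psychiatrist.",
    "Therapy helped my friend recover after a difficult year.",
]
labels = [1, 0, 1, 0]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)

print(clf.predict(["He is dangerous because he has schizophrenia."]))
```

Surface-level lexical classifiers like this are exactly what such a corpus is meant to stress-test: implicit stigma rarely contains overt slurs, so models that rely on word-level cues tend to miss the "reading between the lines" cases the article highlights.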
Can DeepSeek's GRPO cause model collapse? A look at Qwen3's new paradigm, GSPO
机器之心· 2025-08-07 09:42
Reported by the 机器之心 editorial team. As is well known, training a large language model typically has two stages. The first stage is "pre-training": developers train the model on large-scale text corpora so it learns to predict the next word in a sentence. The second stage is "post-training," which teaches the model to better understand and follow human instructions. Fine-tuning in the post-training stage can be viewed as a special form of reinforcement learning, and the RL algorithms used for fine-tuning large language models (LLMs) have been evolving along a clear path. Initially, OpenAI pioneered reinforcement learning from human feedback (RLHF) to improve ChatGPT. At the core of RLHF, human annotators score multiple responses generated by the model and pick the best one as a training reference. This works, but it is slow, expensive, and labor-intensive, usually requiring a small but specialized data-annotation team. DeepSeek's key innovation was to automate this step with RL: instead of relying on human judgments for every example, the model learns correct behavior from reward signals obtained during exploration, which significantly cuts costs, improves efficiency, and ultimately delivers high performance at low cost. OpenAI used Proximal Policy Optimization (PPO) in training ChatGPT. ...
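The excerpt stops before the details of GRPO and GSPO, but the group-relative idea at GRPO's core fits in a few lines: sample several responses to the same prompt, score them, and normalize each reward against the group, which removes the need for a separate value (critic) model. The sketch below shows only this advantage computation; it omits the clipped policy-ratio objective and KL regularization used in full training, and the reward values are invented.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: each response's reward is normalized against the
    group of responses sampled for the same prompt (mean-centered, std-scaled)."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# e.g. 4 sampled answers to one prompt, scored by a rule-based or reward model:
print(group_relative_advantages([1.0, 0.0, 0.5, 0.0]))
```

GSPO, as the title suggests, revisits how these per-token or per-sequence ratios are formed; the specifics are in the full article beyond this excerpt.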
Triple incentives plus full-cycle support: Jimeng upgrades this program to give AI creators a traceable growth path
机器之心· 2025-08-07 09:42
Core Viewpoint
- The article discusses the comprehensive upgrade of the "AI Creator Growth Program" by Jimeng AI, emphasizing the need for a supportive ecosystem for creators in the AI content-production landscape, which has seen a significant transformation due to AI technology [9][10].

Summary by Sections

AI Content Creation Revolution
- The past year has witnessed a revolution in content creation driven by AI technology, breaking down traditional barriers and allowing individual creators to produce high-quality content with minimal resources [9].
- The efficiency of creation has been redefined, leading to fundamental changes in content forms, styles, and cost structures [9].

Challenges for Creators
- Despite the advancements, creators face challenges such as high competition, lack of sustainable growth paths, limited monetization opportunities, and insufficient support within the creative ecosystem [9][10].

Jimeng AI Creator Growth Program
- Launched in February, the program aims to provide tangible support to creators through incentives, collaboration opportunities, and traffic distribution, having already supported 3,802 creators and distributed over 28 million points [11].
- The program has been upgraded to include a three-tier support system for potential stars, advanced creators, and super creators, offering resources such as point rewards, platform traffic, and project access [11][12].

Tiered Support Mechanism
- Potential stars can earn points by publishing content, with rewards for popular ideas and for meeting content standards [13].
- Advanced creators can access additional benefits, including cash rewards for top-performing content and various resources for growth [14][15].
- Super creators receive the most comprehensive support, including significant point rewards, priority access to projects, and funding for their own initiatives [16][17].

Community Building and Ecosystem
- Jimeng AI aims to build a sustainable, growth-oriented creative ecosystem, integrating various AI capabilities for a seamless creation experience [20].
- The platform encourages a diverse and decentralized community of creators, fostering collaboration and quality content production [23].
- Regular activities such as online workshops and creative challenges are organized to stimulate community engagement and provide visibility for creators [24].