机器之心
At the Jam-Packed World Robot Conference, 自变量 Shows Off True General-Purpose Embodied Intelligence
机器之心· 2025-08-08 10:18
机器之心 report. Editor: 泽南

It can tidy up the house, make scented sachets, and even flash heart and peace signs. Embodied intelligence has evolved this far; genuinely meeting users' needs seems within reach.

This morning, the 2025 World Robot Conference (WRC) officially opened. Recent AI breakthroughs prompted exhibiting companies to roll out new technology, and among the packed booths we saw a host of robots powered by "embodied intelligence," many making their public debut. From household chores, industrial logistics, and manufacturing assembly to dance performances, they seemingly do it all, in all manner of form factors, rather reminiscent of the recent "hundred-model war" among large models. One company, however, went against the grain and achieved true general-purpose intelligence: one brain, many uses.

The leading domestic startup 自变量机器人 has set a new standard for embodied intelligence.

One brain, many uses, covering diverse scenarios

At its WRC booth, 自变量's general-purpose wheeled dual-arm robot 「小量」 was making scented sachets as exclusive little gifts for visitors. It runs 自变量's self-developed general embodied foundation model WALL-A and learned to make sachets autonomously within just a few days. It is not picky about its working environment: however complex the sound and lighting around the exhibition floor, and however the crowds move, nothing disturbs its fine motor work. Given an instruction, the robot can pick up different sachet pouches according to a visitor's preference and, within a space of less than 10 cm, coordinate both arms to handle the complex deformation of the soft material and complete the sachet …
o3 Sweeps Grok 4 4-0 to Take the Crown: Results of the First Large-Model Showdown Are In
机器之心· 2025-08-08 10:18
Core Viewpoint
- The first Kaggle AI Chess Championship concluded with o3 defeating Grok 4 decisively, showcasing the advancements in AI chess models and their competitive capabilities [2][4][15].

Group 1: Championship Results
- o3 won the championship by sweeping Grok 4 with a score of 4-0 [4][15].
- Gemini 2.5 Pro secured third place by defeating o4-mini with a score of 3.5-0.5 [4][17].

Group 2: Performance Analysis
- Grok 4, initially a strong contender, made critical mistakes during the final match, leading to its unexpected defeat [6][7][8].
- In the first game, Grok 4 lost a piece early on, which set a negative tone for the rest of the match [8][10].
- The second game featured a risky opening strategy from Grok 4 that resulted in a significant blunder, allowing o3 to capitalize easily [10][12].
- The third game saw Grok 4 fail to maintain its position, leading to a complete loss despite initial promise [12][13].
- The final game was closely contested, but o3 demonstrated superior endgame skills, ultimately securing victory [13][15].

Group 3: Insights on Competitors
- Gemini 2.5 Pro's performance was marked by inconsistency, with several amateur-level mistakes during its matches [17][19].
- Despite the chaotic nature of the matches, Gemini managed to secure third place, indicating potential for future improvements [24].
A New Paradigm for Diffusion LLM Inference: Breaking the Generation-Length Limit with Dynamic Adaptive Adjustment
机器之心· 2025-08-08 10:18
With the release of diffusion large language models (DLLMs) such as Gemini-Diffusion and Seed-Diffusion, the field has become a hot direction in both industry and academia. Current DLLMs, however, must use a preset, fixed generation length at inference time, which has to be tuned separately for each task to reach optimal performance.

To address this fundamental limitation, MMLab at The Chinese University of Hong Kong, Shanghai AI Laboratory, and collaborators propose DAEDAL, which lets a DLLM autonomously adjust its answer length to the problem at hand. This closes a key gap between DLLMs and autoregressive LLMs and lays a foundation for more flexible, efficient, and capable diffusion language models.

Paper title: Beyond Fixed: Variable-Length Denoising for Diffusion Large Language Models
Paper: https://arxiv.org/abs/2508.00819
Code: https://github.com/Li-Jinsong/DAEDAL

DAEDAL is a training-free denoising strategy: starting from a unified and very short initial length, it lets the model expand the length dynamically during generation as needed, matching existing denoising strategies that carefully tune the generation … on every benchmark …
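The idea of growing the canvas during denoising can be illustrated with a toy loop. This is a minimal sketch in the spirit of DAEDAL's variable-length denoising, not the paper's implementation: the stand-in denoiser `toy_model`, the token constants, and the expansion rule are all hypothetical.

```python
# Toy variable-length denoising: the sequence starts as a short canvas of MASK
# tokens; whenever the (stand-in) model is not confident the answer can end
# within the current canvas, the canvas is extended with more MASK tokens.
MASK, EOS = -1, 0  # hypothetical token ids for illustration

def toy_model(seq):
    """Stand-in denoiser: fills the first MASK with the next positive token
    and reports low end-of-sequence confidence until 5 real tokens exist."""
    filled = [t for t in seq if t not in (MASK, EOS)]
    out = list(seq)
    for i, t in enumerate(out):
        if t == MASK:
            out[i] = len(filled) + 1
            break
    eos_conf = 1.0 if len(filled) + 1 >= 5 else 0.2
    return out, eos_conf

def daedal_style_decode(init_len=2, expand=2, max_len=16, threshold=0.5):
    seq = [MASK] * init_len
    while MASK in seq and len(seq) <= max_len:
        seq, eos_conf = toy_model(seq)
        # Canvas is full but the model doubts the answer fits: grow it.
        if MASK not in seq and eos_conf < threshold:
            seq = seq + [MASK] * expand
    return [t for t in seq if t != MASK]

print(daedal_style_decode())  # canvas grew from length 2 on demand
```

The point of the sketch is only the control flow: length is not a hyperparameter fixed up front but a quantity the model adjusts mid-generation.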
From Debugger to Developer: NoCode-bench, a New Benchmark for the Low-Code Era, Endorsed by the SWE-Bench Authors
机器之心· 2025-08-08 07:53
Core Insights
- The article discusses the introduction of a new benchmark called NoCode-bench, aimed at evaluating the capabilities of large language models (LLMs) in natural language-driven feature addition tasks in software development [3][27].
- Current LLMs show a low success rate of only 20% in performing these tasks, highlighting significant challenges in AI's ability to handle real-world software development scenarios [3][26].

Group 1: Benchmark Development
- NoCode-bench was developed to address the limitations of existing benchmarks like SWE-bench, which primarily focus on bug fixing rather than feature addition [6][27].
- The benchmark emphasizes the importance of understanding software documentation changes to implement new features, reflecting a more realistic development environment [6][27].
- The construction of NoCode-bench involved a rigorous five-phase process, starting from selecting well-maintained open-source projects to filtering instances based on developer-verified release notes [8][10][16].

Group 2: Challenges Identified
- The tasks in NoCode-bench present three main challenges:
  1. Increased complexity of input, with document changes being nearly twice as long as bug reports, requiring better long-text comprehension [12].
  2. Difficulty in locating changes, as tasks often involve multiple files and code blocks, demanding high cross-file editing capabilities [13].
  3. Greater editing volume, with nearly 20% of tasks requiring modifications of over 200 lines of code, increasing the risk of errors [14].

Group 3: Model Performance Evaluation
- A comprehensive evaluation of six leading LLMs, including Claude-4-Sonnet and GPT-4o, revealed disappointing success rates, with the best-performing model achieving only 15.79% success [18][26].
- The analysis of failure cases identified three primary reasons for poor performance: lack of cross-file editing ability, insufficient understanding of codebase structure, and inadequate tool invocation capabilities [20][21][22].

Group 4: Future Directions
- The research indicates that the current state of LLMs is not ready for the complexities of document-driven feature development, suggesting a need for further advancements in AI capabilities [24][27].
- The findings provide a roadmap for future AI software engineers, focusing on improving cross-file editing, codebase comprehension, and tool interaction [27].
Is GPT-5 Really a Letdown? 机器之心's Hands-On Test; Netizens: Give Us Back 4o, Give Us Back 4.5
机器之心· 2025-08-08 07:53
机器之心 report, by the 机器之心 editorial team

Waking up this morning, social feeds were flooded with GPT-5. In last night's launch livestream, which ran over an hour, OpenAI walked through GPT-5's performance and demonstrated many practical use cases; we won't repeat them here. Interested readers can refer to: "Just Now: Altman Releases GPT-5! 'PhD-Level' Intelligence Free for Everyone, but a Benchmark-Chart Error Draws Mockery Across the Internet."

Altman tweeted that GPT-5 is their smartest model to date. Online, however, the reviews are mixed: some praise it, some pan it.

One early tester, after nearly two weeks of advance access, reported that GPT-5 shows enormous progress over previous versions, reaching new heights in scientific reasoning, factual accuracy, and creative expression. LMArena benchmark results are also in: GPT-5 ranks first across the board, including text, web development, vision, hard problems, coding, math, creativity, and long queries.

User @emollick considers GPT-5 very smart and capable of all kinds of tasks, calling it a major breakthrough. For example, he had it build a procedural brutalist-architecture generator that lets you drag and edit buildings in slick ways and keeps improving them.

Others noted that GPT-5 brings marked improvements in front-end experience, hallucination reduction, and writing quality, and that free and enterprise users will feel a clear upgrade.

But plenty of reviewers panned it. User @ …
Just Now: Altman Releases GPT-5! "PhD-Level" Intelligence Free for Everyone, but a Benchmark-Chart Error Draws Mockery Across the Internet
机器之心· 2025-08-07 20:48
机器之心 report, by the 机器之心 editorial team

Did you all watch it? After years of waiting, GPT-5 finally launched in the early hours of this morning. We were full of anticipation, and the nervousness of OpenAI's core staff during the livestream was plain to see. Over the course of the stream, Altman fired off more than a dozen tweets highlighting GPT-5's key points. Since there is a lot to cover, we will walk through it using his tweets as the thread.

First, this is a unified model: when using it you no longer need to switch between different models, as it decides on its own when deeper thinking is needed.

Although Altman stressed that benchmarks don't matter, they still showed plenty of scores, for example in math, coding, visual perception, and health. The detailed scores are as follows:

On pricing, GPT-5 comes in Free, Plus, and Pro plans. According to Altman, even the free tier gets "PhD-level intelligence" (the standard GPT-5, with reasoning); Plus users face looser usage caps, and Pro users get access to GPT-5 Pro.

For developers, API pricing for the three GPT-5 variants is as follows: the standard GPT-5 costs $1.25 per million input tokens and $10 per million output tokens, with GPT-5 mini and nano cheaper still.

Math: on the 2025 AIME test, without …
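The per-million-token prices quoted above make request costs easy to estimate. A minimal sketch, assuming the standard-tier rates from the article; the function name and the example token counts are illustrative, not from OpenAI's documentation:

```python
# Back-of-the-envelope API cost from per-million-token prices
# (standard GPT-5 as quoted above: $1.25 / 1M input, $10 / 1M output).
def gpt5_cost_usd(input_tokens, output_tokens,
                  input_price=1.25, output_price=10.0):
    """Dollar cost of one request at per-million-token prices."""
    return (input_tokens / 1e6) * input_price \
         + (output_tokens / 1e6) * output_price

# e.g. a request with 20k input tokens and 2k output tokens:
print(gpt5_cost_usd(20_000, 2_000))  # 0.025 + 0.02 = 0.045 dollars
```

Note how the asymmetric pricing means output tokens dominate the bill once a response is even a few thousand tokens long.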
Peking University and ByteDance Jointly Release SWE-Swiss: A "Swiss Army Knife" for Fixing Code Bugs, with a Complete Recipe Aimed at Open-Source SOTA
机器之心· 2025-08-07 20:48
Figure 1: Performance versus model size on SWE-bench Verified. The study's 32B model, SWE-Swiss, achieves a top score of 60.2%, placing it in the same tier as much larger models such as Kimi-Dev and DeepSeek-R1-0528. This shows the study's training recipe lets a far smaller model reach the same SOTA performance level, underscoring its efficiency.

Recently, a joint study by Peking University, ByteDance's Seed team, and the University of Hong Kong proposed a complete "recipe" named SWE-Swiss for efficiently training AI models to solve software-engineering problems. The team's 32B-parameter model, SWE-Swiss-32B, scores 60.2% on the authoritative SWE-bench Verified benchmark, a new SOTA for its size class. The work demonstrates that, with careful methodological design, mid-sized models are fully capable of top-tier performance, offering a new path for applying AI to software engineering. To support the community, the study's models and datasets will be fully open-sourced.

Introduction: challenges and opportunities for software-engineering AI

Automatically solving real-world software issues is a daunting challenge for large language models (LLMs). Compared with pure code generation, the task requires models to understand …
The Next Wave Disrupting the Internet: The Agentic Web Is Here!
机器之心· 2025-08-07 10:30
Core Viewpoint
- The article discusses the emergence of the "Agentic Web," a new paradigm in internet usage where AI agents autonomously complete tasks based on user-defined goals, marking a significant shift from traditional web interactions [3][6][57].

Group 1: Paradigm Shifts in the Web
- The internet has undergone three major paradigm shifts: from a keyword-driven "directory web" in the PC era, to a recommendation-driven "content explosion" in the mobile era, and now to an "action network" driven by AI agents [8][9][15].
- In the Agentic Web, the role of the web transitions from being an information repository to an ecosystem of actionable resources for AI agents [13][15].

Group 2: Definition and Structure of the Agentic Web
- The Agentic Web is defined as a distributed, interactive ecosystem where AI agents, powered by large language models (LLMs), continuously plan, coordinate, and execute goal-oriented tasks [16][17].
- Users interact with the web by delegating tasks to AI agents, which autonomously handle the execution of these tasks [20][21].

Group 3: Core Dimensions of the Agentic Web
- The structure of the Agentic Web can be understood through three core dimensions: intelligence, interaction, and economy [24][28].
- The "Agent Attention Economy" signifies a shift in focus from human clicks to AI agent interactions, changing the metrics of commercial competition [29].

Group 4: Application Scenarios
- The capabilities of the Agentic Web can be categorized into transactional, informational, and communicational tasks, enabling AI agents to perform a wide range of functions from booking tickets to conducting research [30][31].
- In transactional tasks, users can simply state their needs, and AI agents will autonomously complete the entire process, enhancing efficiency [33].

Group 5: Challenges Ahead
- The implementation of the Agentic Web faces systemic challenges, including the need for improved AI capabilities, robust network infrastructure, and a redefined economic model [42][43].
- Key challenges include ensuring the reasoning and memory capabilities of AI agents, managing security risks associated with external tools, and establishing effective communication protocols for multi-agent collaboration [44][51].

Group 6: Socio-Economic Implications
- The rise of the Agentic Web poses significant implications for traditional advertising models, labor markets, and economic structures, necessitating a reevaluation of how value is created and distributed in the digital economy [56][58].
- As AI agents become more prevalent, there is a pressing need to address the potential displacement of jobs and ensure equitable economic benefits [56][58].
The Cloud-Computing Leader Partners with OpenAI for the First Time: Freedom of Model "Choice" Is the Ultimate Victory
机器之心· 2025-08-07 10:30
Core Viewpoint
- The collaboration between Amazon Web Services (AWS) and OpenAI marks a significant shift in the AI cloud service landscape, breaking Microsoft's monopoly on reselling OpenAI's software and services, and enhancing AWS's competitive edge in the large model cloud service market [3][15].

Summary by Sections

Collaboration Announcement
- AWS announced support for OpenAI's newly open-sourced models, gpt-oss (120b and 20b), and Anthropic's Claude Opus 4.1, through its platforms Amazon Bedrock and Amazon SageMaker AI [1][4][16].

Strategic Importance
- This partnership allows AWS to fill a critical gap in its model library, enhancing its "Choice Matters" strategy, which emphasizes the importance of diverse model options for various industry needs [7][10][15].

Model Ecosystem Development
- AWS's platforms now host over 400 mainstream commercial and open-source large models, facilitating a diverse AI ecosystem that accelerates technology adoption and innovation in the AI industry [10][18].

Performance and Cost Efficiency
- The performance of gpt-oss-120b is reported to be three times more cost-effective than Google's Gemini, five times that of DeepSeek-R1, and twice that of OpenAI's o4, providing budget-friendly access to top-tier AI capabilities for small and medium enterprises [14][15].

Enhanced Model Deployment
- AWS's Amazon SageMaker JumpStart allows for rapid deployment of advanced foundational models, including OpenAI's offerings, enabling efficient customization and optimization for AI applications [14][24].

Future Prospects
- The collaboration is expected to create a win-win situation, expanding OpenAI's market reach while solidifying AWS's position as a leading platform for deploying and running various AI models [15][19].

AI Ecosystem Transformation
- AWS is evolving from a cloud service provider to an AI capability aggregation platform, enhancing its role in the AI ecosystem and providing better service to customers and developers [19][29].

Model Selection Flexibility
- The "Choice Matters" strategy addresses the diverse needs of different tasks, allowing developers to select models based on specific requirements, thus maximizing efficiency and effectiveness in AI applications [21][24].

Conclusion
- The integration of multiple models into a single platform is anticipated to lead to a significant surge in AI application development, enabling innovative solutions through the combination of various models [30][31].
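For models hosted on Amazon Bedrock, invocation typically goes through the boto3 `bedrock-runtime` Converse API. A hedged sketch follows: the request-building helper and the model ID are placeholders for illustration (the exact Bedrock identifier for gpt-oss should be taken from the Bedrock model catalog), and the live call requires configured AWS credentials.

```python
# Sketch: invoking a Bedrock-hosted model via boto3's Converse API.
# build_converse_request() is a hypothetical helper that assembles the
# request shape converse() expects; the model ID used in examples is a
# placeholder, not a confirmed Bedrock identifier.
def build_converse_request(model_id, prompt, max_tokens=512):
    """Assemble keyword arguments for bedrock-runtime's converse()."""
    return {
        "modelId": model_id,
        "messages": [{"role": "user", "content": [{"text": prompt}]}],
        "inferenceConfig": {"maxTokens": max_tokens, "temperature": 0.2},
    }

def ask(model_id, prompt):
    """Send one prompt and return the model's text reply (needs AWS creds)."""
    import boto3  # requires credentials and a region with Bedrock access
    client = boto3.client("bedrock-runtime")
    resp = client.converse(**build_converse_request(model_id, prompt))
    return resp["output"]["message"]["content"][0]["text"]
```

Because every Bedrock model sits behind the same Converse interface, swapping `model_id` is all it takes to compare gpt-oss against Claude or any of the other hosted models, which is exactly the "Choice Matters" point the article makes.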
Teaching AI to Read Between the Lines: The AI4SG Team Releases the First Mental-Health Stigma Corpus, Cracking the Problem of Implicit-Bias Detection
机器之心· 2025-08-07 09:42
The paper's first author, Han Meng, is a PhD student at the National University of Singapore working on computational methods for psychological constructs. The corresponding author, Yi-Chieh Lee, is an assistant professor at the National University of Singapore researching conversational AI, human-computer interaction, and mental-health technology. Co-author Renwen Zhang is an assistant professor at Nanyang Technological University specializing in computational communication, contributing a communication-studies perspective to the work. Jungup Lee is an associate professor at the National University of Singapore with deep expertise in mental health, providing key domain knowledge for the study.

Mental-health problems affect hundreds of millions of people worldwide, yet patients often carry a double burden: the suffering of the illness itself, plus prejudice and discrimination from society. World Health Organization data show that a substantial share of people with mental-health conditions delay or refuse treatment for fear of social discrimination.

This "stigmatization" acts like an invisible barrier: it obstructs patients' path to recovery and has become a serious social problem. While enduring their illness, patients must also face bias across different social settings. What makes it more complicated is that stigma often takes subtle, covert forms in everyday conversation, forms that even advanced AI systems struggle to identify reliably.

Although natural language processing has produced substantial research on hate-speech and offensive-language detection, computational resources targeting mental-health stigma remain scarce. Existing datasets mainly come from …