机器之心
At the Jam-Packed World Robot Conference, 自变量 Shows Off True General-Purpose Embodied Intelligence
机器之心· 2025-08-08 10:18
机器之心 report. Editor: 泽南

It can tidy up the house, make scented sachets, and even flash heart and peace signs. Embodied intelligence has evolved this far; genuinely meeting users' needs seems within reach.

This morning, the 2025 World Robot Conference (WRC) officially opened. Recent AI breakthroughs prompted exhibiting companies to roll out new technology, and among the packed booths we saw a host of robots powered by "embodied intelligence," many making their public debut. From household chores, industrial logistics, and manufacturing assembly to dance performances, they seemingly do it all, in all manner of form factors, rather reminiscent of the recent "hundred-model war" among large models. One company, however, went against the grain and achieved true general-purpose intelligence: one brain, many uses.

The leading domestic startup 自变量机器人 has set a new standard for embodied intelligence.

One brain, many uses, covering diverse scenarios

At its WRC booth, 自变量's general-purpose wheeled dual-arm robot 「小量」 was making scented sachets as exclusive little gifts for visitors. It runs 自变量's self-developed general embodied foundation model WALL-A and learned to make sachets autonomously within just a few days. It is not picky about its working environment: however complex the sound and lighting around the exhibition floor, and however the crowds move, nothing disturbs its fine motor work. Given an instruction, the robot can pick up different sachet pouches according to a visitor's preference and, within a space of less than 10 cm, coordinate both arms to handle the complex deformation of the soft material and complete the sachet …
o3 Sweeps Grok 4 4-0 to Take the Crown: Results of the First Large-Model Showdown Are In
机器之心· 2025-08-08 10:18
Core Viewpoint
- The first Kaggle AI Chess Championship concluded with o3 defeating Grok 4 decisively, showcasing the advancements in AI chess models and their competitive capabilities [2][4][15].

Group 1: Championship Results
- o3 won the championship by sweeping Grok 4 with a score of 4-0 [4][15].
- Gemini 2.5 Pro secured third place by defeating o4-mini with a score of 3.5-0.5 [4][17].

Group 2: Performance Analysis
- Grok 4, initially a strong contender, made critical mistakes during the final match, leading to its unexpected defeat [6][7][8].
- In the first game, Grok 4 lost a piece early on, which set a negative tone for the rest of the match [8][10].
- The second game featured a risky opening strategy from Grok 4 that resulted in a significant blunder, allowing o3 to capitalize easily [10][12].
- The third game saw Grok 4 fail to maintain its position, leading to a complete loss despite initial promise [12][13].
- The final game was closely contested, but o3 demonstrated superior endgame skills, ultimately securing victory [13][15].

Group 3: Insights on Competitors
- Gemini 2.5 Pro's performance was marked by inconsistency, with several amateur-level mistakes during its matches [17][19].
- Despite the chaotic nature of the matches, Gemini managed to secure third place, indicating potential for future improvements [24].
A New Paradigm for Diffusion LLM Inference: Breaking the Generation-Length Limit with Dynamic Adaptive Adjustment
机器之心· 2025-08-08 10:18
With the release of diffusion large language models (DLLMs) such as Gemini-Diffusion and Seed-Diffusion, the field has become a hot direction in both industry and academia. Current DLLMs, however, must use a preset, fixed generation length at inference time, which has to be tuned separately for each task to reach optimal performance.

To address this fundamental limitation, MMLab at The Chinese University of Hong Kong, Shanghai AI Laboratory, and collaborators propose DAEDAL, which lets a DLLM autonomously adjust its answer length to the problem at hand. This closes a key gap between DLLMs and autoregressive LLMs and lays a foundation for more flexible, efficient, and capable diffusion language models.

Paper title: Beyond Fixed: Variable-Length Denoising for Diffusion Large Language Models
Paper: https://arxiv.org/abs/2508.00819
Code: https://github.com/Li-Jinsong/DAEDAL

DAEDAL is a training-free denoising strategy: starting from a unified and very short initial length, it lets the model expand the length dynamically during generation as needed, matching existing denoising strategies that carefully tune the generation … on every benchmark …
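The idea of growing the canvas during denoising can be illustrated with a toy loop. This is a minimal sketch in the spirit of DAEDAL's variable-length denoising, not the paper's implementation: the stand-in denoiser `toy_model`, the token constants, and the expansion rule are all hypothetical.

```python
# Toy variable-length denoising: the sequence starts as a short canvas of MASK
# tokens; whenever the (stand-in) model is not confident the answer can end
# within the current canvas, the canvas is extended with more MASK tokens.
MASK, EOS = -1, 0  # hypothetical token ids for illustration

def toy_model(seq):
    """Stand-in denoiser: fills the first MASK with the next positive token
    and reports low end-of-sequence confidence until 5 real tokens exist."""
    filled = [t for t in seq if t not in (MASK, EOS)]
    out = list(seq)
    for i, t in enumerate(out):
        if t == MASK:
            out[i] = len(filled) + 1
            break
    eos_conf = 1.0 if len(filled) + 1 >= 5 else 0.2
    return out, eos_conf

def daedal_style_decode(init_len=2, expand=2, max_len=16, threshold=0.5):
    seq = [MASK] * init_len
    while MASK in seq and len(seq) <= max_len:
        seq, eos_conf = toy_model(seq)
        # Canvas is full but the model doubts the answer fits: grow it.
        if MASK not in seq and eos_conf < threshold:
            seq = seq + [MASK] * expand
    return [t for t in seq if t != MASK]

print(daedal_style_decode())  # canvas grew from length 2 on demand
```

The point of the sketch is only the control flow: length is not a hyperparameter fixed up front but a quantity the model adjusts mid-generation.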
From Debugger to Developer: NoCode-bench, a New Benchmark for the Low-Code Era, Endorsed by the SWE-Bench Authors
机器之心· 2025-08-08 07:53
Core Insights
- The article discusses the introduction of a new benchmark called NoCode-bench, aimed at evaluating the capabilities of large language models (LLMs) in natural language-driven feature addition tasks in software development [3][27].
- Current LLMs show a low success rate of only 20% in performing these tasks, highlighting significant challenges in AI's ability to handle real-world software development scenarios [3][26].

Group 1: Benchmark Development
- NoCode-bench was developed to address the limitations of existing benchmarks like SWE-bench, which primarily focus on bug fixing rather than feature addition [6][27].
- The benchmark emphasizes the importance of understanding software documentation changes to implement new features, reflecting a more realistic development environment [6][27].
- The construction of NoCode-bench involved a rigorous five-phase process, starting from selecting well-maintained open-source projects to filtering instances based on developer-verified release notes [8][10][16].

Group 2: Challenges Identified
- The tasks in NoCode-bench present three main challenges:
  1. Increased complexity of input, with document changes being nearly twice as long as bug reports, requiring better long-text comprehension [12].
  2. Difficulty in locating changes, as tasks often involve multiple files and code blocks, demanding high cross-file editing capabilities [13].
  3. Greater editing volume, with nearly 20% of tasks requiring modifications of over 200 lines of code, increasing the risk of errors [14].

Group 3: Model Performance Evaluation
- A comprehensive evaluation of six leading LLMs, including Claude-4-Sonnet and GPT-4o, revealed disappointing success rates, with the best-performing model achieving only 15.79% success [18][26].
- The analysis of failure cases identified three primary reasons for poor performance: lack of cross-file editing ability, insufficient understanding of codebase structure, and inadequate tool invocation capabilities [20][21][22].

Group 4: Future Directions
- The research indicates that the current state of LLMs is not ready for the complexities of document-driven feature development, suggesting a need for further advancements in AI capabilities [24][27].
- The findings provide a roadmap for future AI software engineers, focusing on improving cross-file editing, codebase comprehension, and tool interaction [27].
Is GPT-5 Really a Letdown? 机器之心's Hands-On Test; Netizens: Give Us Back 4o, Give Us Back 4.5
机器之心· 2025-08-08 07:53
机器之心 report, by the 机器之心 editorial team

Waking up this morning, social feeds were flooded with GPT-5. In last night's launch livestream, which ran over an hour, OpenAI walked through GPT-5's performance and demonstrated many practical use cases; we won't repeat them here. Interested readers can refer to: "Just Now: Altman Releases GPT-5! 'PhD-Level' Intelligence Free for Everyone, but a Benchmark-Chart Error Draws Mockery Across the Internet."

Altman tweeted that GPT-5 is their smartest model to date. Online, however, the reviews are mixed: some praise it, some pan it.

One early tester, after nearly two weeks of advance access, reported that GPT-5 shows enormous progress over previous versions, reaching new heights in scientific reasoning, factual accuracy, and creative expression. LMArena benchmark results are also in: GPT-5 ranks first across the board, including text, web development, vision, hard problems, coding, math, creativity, and long queries.

User @emollick considers GPT-5 very smart and capable of all kinds of tasks, calling it a major breakthrough. For example, he had it build a procedural brutalist-architecture generator that lets you drag and edit buildings in slick ways and keeps improving them.

Others noted that GPT-5 brings marked improvements in front-end experience, hallucination reduction, and writing quality, and that free and enterprise users will feel a clear upgrade.

But plenty of reviewers panned it. User @ …
Just Now: Altman Releases GPT-5! "PhD-Level" Intelligence Free for Everyone, but a Benchmark-Chart Error Draws Mockery Across the Internet
机器之心· 2025-08-07 20:48
机器之心 report, by the 机器之心 editorial team

Did you all watch it? After years of waiting, GPT-5 finally launched in the early hours of this morning. We were full of anticipation, and the nervousness of OpenAI's core staff during the livestream was plain to see. Over the course of the stream, Altman fired off more than a dozen tweets highlighting GPT-5's key points. Since there is a lot to cover, we will walk through it using his tweets as the thread.

First, this is a unified model: when using it you no longer need to switch between different models, as it decides on its own when deeper thinking is needed.

Although Altman stressed that benchmarks don't matter, they still showed plenty of scores, for example in math, coding, visual perception, and health. The detailed scores are as follows:

On pricing, GPT-5 comes in Free, Plus, and Pro plans. According to Altman, even the free tier gets "PhD-level intelligence" (the standard GPT-5, with reasoning); Plus users face looser usage caps, and Pro users get access to GPT-5 Pro.

For developers, API pricing for the three GPT-5 variants is as follows: the standard GPT-5 costs $1.25 per million input tokens and $10 per million output tokens, with GPT-5 mini and nano cheaper still.

Math: on the 2025 AIME test, without …
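The per-million-token prices quoted above make request costs easy to estimate. A minimal sketch, assuming the standard-tier rates from the article; the function name and the example token counts are illustrative, not from OpenAI's documentation:

```python
# Back-of-the-envelope API cost from per-million-token prices
# (standard GPT-5 as quoted above: $1.25 / 1M input, $10 / 1M output).
def gpt5_cost_usd(input_tokens, output_tokens,
                  input_price=1.25, output_price=10.0):
    """Dollar cost of one request at per-million-token prices."""
    return (input_tokens / 1e6) * input_price \
         + (output_tokens / 1e6) * output_price

# e.g. a request with 20k input tokens and 2k output tokens:
print(gpt5_cost_usd(20_000, 2_000))  # 0.025 + 0.02 = 0.045 dollars
```

Note how the asymmetric pricing means output tokens dominate the bill once a response is even a few thousand tokens long.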
Peking University and ByteDance Jointly Release SWE-Swiss: A "Swiss Army Knife" for Fixing Code Bugs, with a Complete Recipe Aimed at Open-Source SOTA
机器之心· 2025-08-07 20:48
Figure 1: Performance versus model size on SWE-bench Verified. The study's 32B model, SWE-Swiss, achieves a top score of 60.2%, placing it in the same tier as much larger models such as Kimi-Dev and DeepSeek-R1-0528. This shows the study's training recipe lets a far smaller model reach the same SOTA performance level, underscoring its efficiency.

Recently, a joint study by Peking University, ByteDance's Seed team, and the University of Hong Kong proposed a complete "recipe" named SWE-Swiss for efficiently training AI models to solve software-engineering problems. The team's 32B-parameter model, SWE-Swiss-32B, scores 60.2% on the authoritative SWE-bench Verified benchmark, a new SOTA for its size class. The work demonstrates that, with careful methodological design, mid-sized models are fully capable of top-tier performance, offering a new path for applying AI to software engineering. To support the community, the study's models and datasets will be fully open-sourced.

Introduction: challenges and opportunities for software-engineering AI

Automatically solving real-world software issues is a daunting challenge for large language models (LLMs). Compared with pure code generation, the task requires models to understand …
The Next Wave Disrupting the Internet: The Agentic Web Is Here!
机器之心· 2025-08-07 10:30
Core Viewpoint
- The article discusses the emergence of the "Agentic Web," a new paradigm in internet usage where AI agents autonomously complete tasks based on user-defined goals, marking a significant shift from traditional web interactions [3][6][57].

Group 1: Paradigm Shifts in the Web
- The internet has undergone three major paradigm shifts: from a keyword-driven "directory web" in the PC era, to a recommendation-driven "content explosion" in the mobile era, and now to an "action network" driven by AI agents [8][9][15].
- In the Agentic Web, the role of the web transitions from being an information repository to an ecosystem of actionable resources for AI agents [13][15].

Group 2: Definition and Structure of the Agentic Web
- The Agentic Web is defined as a distributed, interactive ecosystem where AI agents, powered by large language models (LLMs), continuously plan, coordinate, and execute goal-oriented tasks [16][17].
- Users interact with the web by delegating tasks to AI agents, which autonomously handle the execution of these tasks [20][21].

Group 3: Core Dimensions of the Agentic Web
- The structure of the Agentic Web can be understood through three core dimensions: intelligence, interaction, and economy [24][28].
- The "Agent Attention Economy" signifies a shift in focus from human clicks to AI agent interactions, changing the metrics of commercial competition [29].

Group 4: Application Scenarios
- The capabilities of the Agentic Web can be categorized into transactional, informational, and communicational tasks, enabling AI agents to perform a wide range of functions from booking tickets to conducting research [30][31].
- In transactional tasks, users can simply state their needs, and AI agents will autonomously complete the entire process, enhancing efficiency [33].

Group 5: Challenges Ahead
- The implementation of the Agentic Web faces systemic challenges, including the need for improved AI capabilities, robust network infrastructure, and a redefined economic model [42][43].
- Key challenges include ensuring the reasoning and memory capabilities of AI agents, managing security risks associated with external tools, and establishing effective communication protocols for multi-agent collaboration [44][51].

Group 6: Socio-Economic Implications
- The rise of the Agentic Web poses significant implications for traditional advertising models, labor markets, and economic structures, necessitating a reevaluation of how value is created and distributed in the digital economy [56][58].
- As AI agents become more prevalent, there is a pressing need to address the potential displacement of jobs and ensure equitable economic benefits [56][58].
The Cloud-Computing Leader Partners with OpenAI for the First Time: Freedom of Model "Choice" Is the Ultimate Victory
机器之心· 2025-08-07 10:30
Core Viewpoint
- The collaboration between Amazon Web Services (AWS) and OpenAI marks a significant shift in the AI cloud service landscape, breaking Microsoft's monopoly on reselling OpenAI's software and services, and enhancing AWS's competitive edge in the large model cloud service market [3][15].

Summary by Sections

Collaboration Announcement
- AWS announced support for OpenAI's newly open-sourced models, gpt-oss (120b and 20b), and Anthropic's Claude Opus 4.1, through its platforms Amazon Bedrock and Amazon SageMaker AI [1][4][16].

Strategic Importance
- This partnership allows AWS to fill a critical gap in its model library, enhancing its "Choice Matters" strategy, which emphasizes the importance of diverse model options for various industry needs [7][10][15].

Model Ecosystem Development
- AWS's platforms now host over 400 mainstream commercial and open-source large models, facilitating a diverse AI ecosystem that accelerates technology adoption and innovation in the AI industry [10][18].

Performance and Cost Efficiency
- The performance of gpt-oss-120b is reported to be three times more cost-effective than Google's Gemini, five times that of DeepSeek-R1, and twice that of OpenAI's o4, providing budget-friendly access to top-tier AI capabilities for small and medium enterprises [14][15].

Enhanced Model Deployment
- AWS's Amazon SageMaker JumpStart allows for rapid deployment of advanced foundational models, including OpenAI's offerings, enabling efficient customization and optimization for AI applications [14][24].

Future Prospects
- The collaboration is expected to create a win-win situation, expanding OpenAI's market reach while solidifying AWS's position as a leading platform for deploying and running various AI models [15][19].

AI Ecosystem Transformation
- AWS is evolving from a cloud service provider to an AI capability aggregation platform, enhancing its role in the AI ecosystem and providing better service to customers and developers [19][29].

Model Selection Flexibility
- The "Choice Matters" strategy addresses the diverse needs of different tasks, allowing developers to select models based on specific requirements, thus maximizing efficiency and effectiveness in AI applications [21][24].

Conclusion
- The integration of multiple models into a single platform is anticipated to lead to a significant surge in AI application development, enabling innovative solutions through the combination of various models [30][31].
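For models hosted on Amazon Bedrock, invocation typically goes through the boto3 `bedrock-runtime` Converse API. A hedged sketch follows: the request-building helper and the model ID are placeholders for illustration (the exact Bedrock identifier for gpt-oss should be taken from the Bedrock model catalog), and the live call requires configured AWS credentials.

```python
# Sketch: invoking a Bedrock-hosted model via boto3's Converse API.
# build_converse_request() is a hypothetical helper that assembles the
# request shape converse() expects; the model ID used in examples is a
# placeholder, not a confirmed Bedrock identifier.
def build_converse_request(model_id, prompt, max_tokens=512):
    """Assemble keyword arguments for bedrock-runtime's converse()."""
    return {
        "modelId": model_id,
        "messages": [{"role": "user", "content": [{"text": prompt}]}],
        "inferenceConfig": {"maxTokens": max_tokens, "temperature": 0.2},
    }

def ask(model_id, prompt):
    """Send one prompt and return the model's text reply (needs AWS creds)."""
    import boto3  # requires credentials and a region with Bedrock access
    client = boto3.client("bedrock-runtime")
    resp = client.converse(**build_converse_request(model_id, prompt))
    return resp["output"]["message"]["content"][0]["text"]
```

Because every Bedrock model sits behind the same Converse interface, swapping `model_id` is all it takes to compare gpt-oss against Claude or any of the other hosted models, which is exactly the "Choice Matters" point the article makes.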
Teaching AI to Read Between the Lines: The AI4SG Team Releases the First Mental-Health Stigma Corpus, Cracking the Problem of Implicit-Bias Detection
机器之心· 2025-08-07 09:42
The paper's first author, Han Meng, is a PhD student at the National University of Singapore working on computational methods for psychological constructs. The corresponding author, Yi-Chieh Lee, is an assistant professor at the National University of Singapore researching conversational AI, human-computer interaction, and mental-health technology. Co-author Renwen Zhang is an assistant professor at Nanyang Technological University specializing in computational communication, contributing a communication-studies perspective to the work. Jungup Lee is an associate professor at the National University of Singapore with deep expertise in mental health, providing key domain knowledge for the study.

Mental-health problems affect hundreds of millions of people worldwide, yet patients often carry a double burden: the suffering of the illness itself, plus prejudice and discrimination from society. World Health Organization data show that a substantial share of people with mental-health conditions delay or refuse treatment for fear of social discrimination.

This "stigmatization" acts like an invisible barrier: it obstructs patients' path to recovery and has become a serious social problem. While enduring their illness, patients must also face bias across different social settings. What makes it more complicated is that stigma often takes subtle, covert forms in everyday conversation, forms that even advanced AI systems struggle to identify reliably.

Although natural language processing has produced substantial research on hate-speech and offensive-language detection, computational resources targeting mental-health stigma remain scarce. Existing datasets mainly come from …