机器之心

Renmin University & ByteDance Seed: Using μP to Scale Diffusion Transformers Efficiently
机器之心· 2025-06-26 06:10
Core Viewpoint
- The research introduces the μP theory to optimize hyperparameter tuning in diffusion Transformers, significantly reducing the computational cost of hyperparameter searches while enhancing model performance [2][24].

Group 1: Introduction and Background
- The research is a collaboration between Renmin University of China and ByteDance, focusing on optimizing diffusion Transformers, which have become essential in modern visual generation models [1].
- The μP theory (Maximal Update Parametrization) is a significant milestone in the Tensor Programs infinite-width network theory, allowing Transformers of different sizes to share optimal hyperparameters [7].

Group 2: μP Theory Application
- The μP theory has been successfully applied to diffusion Transformers, despite their architectural differences from standard Transformers, enabling effective hyperparameter transfer from smaller models to larger ones [8][10].
- The research demonstrates that hyperparameters can be effectively transferred across different model sizes, leading to improved training efficiency and performance [12][15].

Group 3: Experimental Validation
- Systematic experiments were conducted on various models, including DiT, PixArt, and MMDiT, showing that hyperparameters found on smaller models can be successfully applied to larger models, achieving superior results compared to manually tuned baselines [2][21][24].
- In the MMDiT experiments, hyperparameters from a 0.18B model were successfully used to train an 18B model, with the computational cost of the hyperparameter search being only 3% of that required for manual tuning [21][24].

Group 4: Future Implications
- The findings suggest that μP will be a crucial tool for the future scaling of foundation models, underscoring the importance of theoretical advances in AI for large-scale practical applications [24].
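To make the scaling rule concrete, here is a minimal sketch of μP-style hyperparameter transfer for Adam, in the spirit of the μTransfer recipe rather than the authors' released code: the learning rate and init scale tuned on a narrow proxy model are reused at larger width after dividing by the width multiplier for matrix-like parameters. The helper name `mup_param_groups` and the base values are illustrative, and the special treatment of embedding and output layers in full μP is omitted.

```python
# Minimal sketch of muP-style hyperparameter transfer for Adam (illustrative,
# not the authors' implementation). Hyperparameters are tuned on a proxy model
# of width `base_width`; for a wider model, matrix-like ("hidden") parameters
# get their learning rate and init scale shrunk by the width multiplier, while
# vector-like parameters keep the base values.
import math
import torch
import torch.nn as nn


def mup_param_groups(model: nn.Module, base_width: int, width: int,
                     base_lr: float = 3e-4, base_std: float = 0.02):
    """Build Adam param groups with muP-style width scaling (hidden matrices only)."""
    m = width / base_width  # width multiplier relative to the tuned proxy model
    matrix_like, vector_like = [], []
    for p in model.parameters():
        if p.ndim >= 2:  # attention / MLP weight matrices
            nn.init.normal_(p, std=base_std / math.sqrt(m))
            matrix_like.append(p)
        else:            # biases, norm gains and other vector-like parameters
            vector_like.append(p)
    return [
        {"params": matrix_like, "lr": base_lr / m},  # LR shrinks with width
        {"params": vector_like, "lr": base_lr},      # stays at the base value
    ]


# Tune base_lr / base_std on a narrow proxy, then reuse them at the target width.
big_model = nn.Sequential(nn.Linear(2048, 8192), nn.GELU(), nn.Linear(8192, 2048))
optimizer = torch.optim.AdamW(mup_param_groups(big_model, base_width=256, width=2048))
```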
Breaking: Meta Cleans Out OpenAI's Zurich Office, Poaching the Three ViT Authors
机器之心· 2025-06-26 04:35
Core Viewpoint
- Meta has aggressively recruited top AI researchers from OpenAI, indicating a strategic move to regain its competitive edge in the AI sector [3][6][9].

Group 1: Recruitment and Strategy
- Meta CEO Mark Zuckerberg has successfully poached three researchers, Lucas Beyer, Alexander Kolesnikov, and Xiaohua Zhai, from OpenAI's Zurich office [4][5].
- The recruitment is part of a broader strategy by Zuckerberg, who is personally reaching out to hundreds of top talents in the AI field with lucrative compensation packages, including offers worth up to $100 million [6][7].
- Meta's recent $14 billion investment in AI startup Scale and the hiring of its CEO, Alexandr Wang, to lead a new superintelligence team further underscore its commitment to AI development [7].

Group 2: Responses from OpenAI
- OpenAI CEO Sam Altman has downplayed concerns about the talent exodus, suggesting that the best talents are not leaving for Meta [9].
- In response to Meta's recruitment efforts, OpenAI is also increasing funding and development opportunities for its researchers to retain talent [9].

Group 3: Background of Key Researchers
- Xiaohua Zhai holds a PhD in Computer Science from Peking University and was a significant contributor to multimodal research at Google DeepMind before joining OpenAI [12][14][15].
- Lucas Beyer, also an influential AI researcher, completed his studies at RWTH Aachen University and worked at Google Brain and DeepMind [18][20].
- Alexander Kolesnikov, who holds a PhD in machine learning and computer vision, built a notable research record at Google Brain and DeepMind before joining OpenAI [24][26].
New Breakthrough in Embodied World Models: Horizon Robotics & GigaAI Propose a Geometry-Consistent Video World Model to Enhance Robot Policy Learning
机器之心· 2025-06-26 04:35
In recent years, as artificial intelligence has moved from perceptual intelligence toward decision-making intelligence, world models have become an important research direction in robotics. A world model lets an agent model its environment and predict future states, enabling more efficient planning and decision-making.

At the same time, embodied data has drawn explosive attention, because current embodied algorithms depend heavily on large-scale real-robot demonstration data, and collecting such data is costly and time-consuming, severely limiting scalability and generalization. Simulation platforms offer a relatively low-cost way to generate data, but the significant visual and dynamics differences between simulation and the real world (the sim-to-real gap) make policies trained in simulation hard to transfer directly to real robots, limiting their practical effectiveness. How to efficiently acquire, generate, and use high-quality embodied data has therefore become one of the core challenges in robot learning.

Project page: https://horizonrobotics.github.io/robot_lab/robotransfer/

Imitation learning has become one of the key approaches in robotic manipulation: by having a robot "imitate" expert demonstrations, effective policy models can be built quickly for complex tasks. However, such methods typically rely on large amounts of high-quality real-robot ...
ScienceBoard Arrives: The First Multimodal Agent Evaluation Environment Featuring Scientific Tasks, Real Interaction, and Automatic Evaluation
机器之心· 2025-06-26 00:30
Large-model agents for assisting scientific research are quietly changing.

First author Qiushi Sun is a PhD student at the School of Computing and Data Science, the University of Hong Kong, and holds a master's degree from the Department of Data Science at the National University of Singapore. His main research interests are computer-using agents and code intelligence, and he has published multiple papers at top NLP and ML venues including ACL, EMNLP, ICLR, and COLM. The OS-Copilot team behind this work has previously released a series of computer-agent studies such as OS-Atlas, OS-Genesis, and SeeClick, which have been widely adopted in academia and industry.

1 Background and Motivation

Over the past few years, with the rapid progress of LLMs and VLMs, we have witnessed the broad application of AI in natural language processing, programming, image understanding, and beyond. In scientific research, a field central to the accumulation of human knowledge, agents built on these powerful models are quietly becoming "new collaborators" in the research workflow.

In the early days, AI's role in science was mostly that of an "analyzer": helping analyze data, write up literature, and generate figures. With the arrival of computer-using agents (CUAs), that role is undergoing a fundamental shift. Unlike traditional language-model assistants, these agents can operate a computer the way a human does, clicking and dragging through graphical interfaces ...
Kaiming He's New Role: Distinguished Scientist at Google DeepMind
机器之心· 2025-06-26 00:30
Core Viewpoint
- The article discusses the recent news of Kaiming He joining Google DeepMind as a part-time Distinguished Scientist, highlighting his significant contributions to AI and computer vision [2][4][24].

Group 1: Kaiming He's Background and Achievements
- Kaiming He achieved the highest score in the 2003 Guangdong Province college entrance examination and was admitted to Tsinghua University [8].
- He completed his PhD at the Chinese University of Hong Kong under the supervision of Xiaoou Tang and has held positions at Microsoft Research Asia, Facebook AI Research, and MIT [9].
- His research has received multiple awards, including the CVPR best paper award in 2009 and 2016, and he has over 710,000 citations according to Google Scholar [10][12].

Group 2: Research Contributions
- His most notable work is the ResNet paper published in 2016, which has been cited over 280,000 times and is considered one of the most cited papers of the 21st century [15][18].
- That work addresses the gradient propagation problem in deep networks, establishing a fundamental building block of modern deep learning models [18][19].
- He has also contributed to the development of Masked Autoencoders, which have gained wide popularity in the computer vision community [20].

Group 3: Future Prospects at Google
- The article expresses anticipation for Kaiming He's potential contributions at Google, particularly in generative modeling, as suggested by his recent research [6][24].
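The "gradient propagation" point boils down to the identity shortcut of a residual block; the following is a generic PyTorch sketch of such a block, illustrative rather than the original ResNet code.

```python
# Generic residual block sketch (illustrative, not the original ResNet code).
# The identity shortcut y = x + F(x) lets gradients flow through the addition
# unchanged, which is what eases optimization of very deep networks.
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.relu(x + self.body(x))  # identity shortcut around the block


x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```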
An 8B Model Can Surpass GPT-4o! ParallelComp: 128K Length Extrapolation Enabled by Parallel KV Cache Compression
机器之心· 2025-06-25 06:50
Author Jing Xiong (熊璟) is a first-year PhD student at the University of Hong Kong, advised by Prof. 黄毅 and Prof. 孔令鹏. He has published papers at top venues and journals including ICLR, ICML, NeurIPS, ACL, EMNLP, and TMLR, and his research focuses on efficient large language model inference and automated theorem proving. He serves as a reviewer for NAACL, EMNLP, ACL, ICML, ICLR, NeurIPS, COLING, and other conferences. Homepage: https://menik1126.github.io/

Introduction: The Bottleneck and the Breakthrough in Long-Context Inference

As large language models (LLMs) grow more capable, AI's need to understand and process extremely long texts has become more important than ever. Mainstream LLMs, relying on mechanisms such as rotary position embedding (RoPE), can handle contexts of 4K-8K tokens efficiently during training, but once inference-time extrapolation reaches lengths of 128K or more, models run into GPU memory bottlenecks and problems such as attention sink. Conventional truncation schemes easily lose information, which greatly limits how far large models can be applied in real scenarios.

Efficient long-context inference currently faces two main bottlenecks: length extrapolation of the position encoding, and the memory bottleneck during length extrapolation. Existing position encodings fall into two categories: one is frequency-based NTK interpolation ...
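The memory bottleneck mentioned above comes from the KV cache growing linearly with context length. The sketch below illustrates the general idea of chunk-wise KV cache compression; it is not ParallelComp's actual algorithm, and the attention-mass scoring rule and chunk sizes are assumptions for demonstration only.

```python
# Generic sketch of chunk-wise KV cache compression (illustrative only; not
# the ParallelComp algorithm). Each prefill chunk keeps only the keys/values
# whose accumulated attention mass is highest, bounding cache growth.
import torch


def compress_chunk(keys, values, queries, keep: int):
    """keys/values: (T, d); queries: (Tq, d). Keep the `keep` most-attended tokens."""
    scores = torch.softmax(queries @ keys.T / keys.shape[-1] ** 0.5, dim=-1)
    importance = scores.sum(dim=0)                  # attention mass per key
    idx = importance.topk(min(keep, keys.shape[0])).indices.sort().values
    return keys[idx], values[idx]


def chunked_prefill(k_all, v_all, chunk_len=1024, keep_per_chunk=128):
    """Process a long sequence chunk by chunk, compressing the cache as we go."""
    kept_k, kept_v = [], []
    for start in range(0, k_all.shape[0], chunk_len):
        k = k_all[start:start + chunk_len]
        v = v_all[start:start + chunk_len]
        # Use the chunk's own keys as stand-in queries for scoring (assumption).
        ck, cv = compress_chunk(k, v, queries=k, keep=keep_per_chunk)
        kept_k.append(ck)
        kept_v.append(cv)
    return torch.cat(kept_k), torch.cat(kept_v)


d = 64
k_all, v_all = torch.randn(8192, d), torch.randn(8192, d)
k_cache, v_cache = chunked_prefill(k_all, v_all)
print(k_cache.shape)  # torch.Size([1024, 64]): 8 chunks x 128 kept tokens
```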
Making Multimodal Large Models "Think It Through Before Drawing"! HKU and Collaborators Open-Source GoT-R1: Reinforcement Learning Unlocks a New Paradigm for Visual Generation Reasoning
机器之心· 2025-06-25 06:50
Core Viewpoint
- The article discusses the significant advancements of multimodal large models in generating high-fidelity images from complex text prompts, while highlighting the remaining challenges in accurately interpreting spatial relationships and multi-object attributes [1][2].

Group 1: Introduction of GoT-R1
- A research team from the University of Hong Kong, the Chinese University of Hong Kong, and SenseTime has introduced GoT-R1, an important advancement following the Generation Chain-of-Thought (GoT) framework [2].
- GoT-R1 enhances the semantic-spatial reasoning capabilities of multimodal large models through the innovative application of reinforcement learning, allowing the model to autonomously explore and learn better reasoning strategies [3][5].

Group 2: Limitations of the GoT Framework
- The GoT framework improves image generation accuracy and controllability by explicitly planning semantic content and spatial layout before image generation, but its reasoning ability is limited by supervised fine-tuning data built from predefined templates [4][13].
- GoT-R1 aims to overcome these limitations by introducing reinforcement learning into the semantic-spatial reasoning process, enabling the model to learn and optimize reasoning paths independently [5][13].

Group 3: Reward Mechanism in GoT-R1
- GoT-R1 constructs a comprehensive and effective reward mechanism for visual generation tasks, evaluating multiple dimensions of the generated results, including semantic consistency, spatial accuracy, and overall aesthetic quality [13][14]. The reward framework includes:
  1. Reasoning Process Evaluation Reward (RPR) [14]
  2. Reasoning-to-Image Alignment Reward (RRI), which quantifies adherence to the reasoning chain using Intersection over Union (IoU) [15]
  3. Semantic Alignment Reward (Rsem) and Spatial Alignment Reward (Rspa), which assess the completeness and accuracy of the reasoning chain against the original text prompt [16]
  4. Text-to-Image Alignment Reward (RPI), which evaluates the overall consistency of the generated image with the original text prompt [17]

Group 4: Performance Evaluation of GoT-R1
- GoT-R1 was evaluated on the challenging T2I-CompBench, where it established new state-of-the-art (SOTA) performance, achieving the highest scores in five of the six evaluation categories [21][23].
- The model showed significant advantages in handling complex, multi-layered instructions, particularly on the "Complex" benchmark [23].
- Compared with the baseline model, GoT-R1-7B achieved up to a 15% improvement on the evaluation metrics, showcasing the effectiveness of reinforcement learning in enhancing the model's reasoning capabilities [24][25].

Group 5: Comparison of Reasoning Chains
- A comparative analysis using GPT-4o found that the reasoning chains generated by GoT-R1 were preferred over those from the baseline model across all evaluation categories, particularly in spatial relationship understanding [25][26].
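To illustrate how an IoU-based spatial reward can be combined with other reward terms, here is a small sketch; the aggregation weights and helper names are assumptions, not GoT-R1's exact reward formulas.

```python
# Illustrative sketch of composing IoU-based spatial rewards with other reward
# terms (weights and aggregation are assumptions, not GoT-R1's exact formula).
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), normalized coords


def iou(a: Box, b: Box) -> float:
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def spatial_alignment_reward(planned: Dict[str, Box], detected: Dict[str, Box]) -> float:
    """Mean IoU between boxes planned in the reasoning chain and boxes found
    in the generated image; objects missing from the image score 0."""
    if not planned:
        return 0.0
    return sum(iou(box, detected[name]) if name in detected else 0.0
               for name, box in planned.items()) / len(planned)


def total_reward(r_process: float, r_spatial: float, r_semantic: float,
                 r_image: float, weights=(0.25, 0.25, 0.25, 0.25)) -> float:
    """Weighted sum of the reward components (weights are illustrative)."""
    terms = (r_process, r_spatial, r_semantic, r_image)
    return sum(w * r for w, r in zip(weights, terms))


planned = {"cat": (0.1, 0.1, 0.4, 0.5), "ball": (0.6, 0.6, 0.9, 0.9)}
detected = {"cat": (0.12, 0.15, 0.42, 0.55), "ball": (0.5, 0.55, 0.8, 0.85)}
print(total_reward(0.8, spatial_alignment_reward(planned, detected), 0.9, 0.85))
```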
Top Robotics Conference RSS 2025 Announces Its Awards! Research Led by Pieter Abbeel Wins the Outstanding Demo Paper Award
机器之心· 2025-06-25 06:50
Report by the 机器之心 editorial team. Congratulations to the winners.

RSS (Robotics: Science and Systems) is a top academic conference in robotics. Held annually since 2005, the conference aims to advance scientific research and the application of technology in robotics.

Awards page: https://roboticsconference.org/program/awards/

Outstanding Demo Paper Award

Paper title: Demonstrating MuJoCo Playground

Paper abstract: The work presents MuJoCo Playground, a fully open-source robot learning framework built on MJX whose core design goal is to drastically simplify the entire pipeline of simulation environment setup, model training, and sim-to-real transfer. With a simple "pip install playground" command, researchers can complete minute-scale policy training on a single GPU.

The framework supports multiple robot platforms, including quadrupeds, humanoids, dexterous hands, and robotic arms, and enables zero-shot sim-to-real transfer directly from state observations or pixel-level inputs.

This year's conference was held June 21-25 in Los Angeles. The Outstanding Demo Paper Award, Outstanding Systems Paper Award, Outstanding Student Paper Award, Outstanding Paper Award ...
After Prompt Engineering and RAG, LangChain Says Context Engineering Is Taking Off!
机器之心· 2025-06-25 04:06
Core Viewpoint
- Context engineering is emerging as a crucial skill for AI engineers, shifting the focus from traditional prompt engineering to providing structured, dynamic context that lets large language models (LLMs) perform tasks effectively [3][7][15].

Group 1: Definition and Importance of Context Engineering
- Context engineering involves constructing dynamic systems that provide accurate information and tools in the right format, enabling LLMs to complete tasks effectively [9][10].
- Its significance lies in addressing common failures in AI systems, which often stem from inadequate context or incorrect information being provided to the model [12][15].
- Unlike prompt engineering, which focuses on crafting clever prompts, context engineering emphasizes delivering complete, well-structured context to improve model performance [17][19].

Group 2: Components of Effective Context Engineering
- Effective context engineering requires accurate information, as models cannot infer context they are never given [12][19].
- The format of the context is critical; how information is communicated to the LLM can significantly affect its responses [13][19].
- Tools must be used appropriately to access external information, and the returned data should be formatted so that the LLM can easily make use of it [20].

Group 3: Transition from Prompt Engineering to Context Engineering
- The shift from prompt engineering to context engineering is driven by the increasing complexity of applications, which calls for a more comprehensive approach to providing context [16][17].
- Prompt engineering can be viewed as a subset of context engineering, with the focus moving from a single input prompt to managing and formatting dynamic sets of data [17][18].
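As a concrete illustration of "constructing dynamic systems that provide information and tools in the right format," here is a minimal plain-Python sketch, not LangChain's API; names such as `ContextBuilder`, `fake_retriever`, and `call_llm` are placeholders.

```python
# Minimal context-engineering sketch in plain Python (not LangChain's API):
# the system around the model dynamically gathers and formats whatever the
# task needs (instructions, retrieved facts, tool output, history) before the
# LLM is ever called. All names here are illustrative placeholders.
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ContextBuilder:
    instructions: str
    retrievers: List[Callable[[str], List[str]]] = field(default_factory=list)
    tools: Dict[str, Callable[[str], str]] = field(default_factory=dict)
    history: List[str] = field(default_factory=list)

    def build(self, user_query: str) -> str:
        # Pull in dynamic information relevant to *this* query.
        facts = [doc for r in self.retrievers for doc in r(user_query)]
        tool_notes = [f"{name}: {fn(user_query)}" for name, fn in self.tools.items()]
        # Format everything so the model can actually use it.
        return "\n\n".join([
            f"## Instructions\n{self.instructions}",
            "## Retrieved facts\n" + "\n".join(f"- {f}" for f in facts),
            "## Tool results\n" + "\n".join(tool_notes),
            "## Conversation so far\n" + "\n".join(self.history),
            f"## User\n{user_query}",
        ])


def fake_retriever(query: str) -> List[str]:
    return [f"(document snippet relevant to: {query})"]


builder = ContextBuilder(
    instructions="Answer using only the retrieved facts.",
    retrievers=[fake_retriever],
    tools={"calculator": lambda q: "42"},
)
prompt = builder.build("What is the answer?")
print(prompt)  # this string would then be passed to the model, e.g. call_llm(prompt)
```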
The Ultimate Question for Embodied Intelligence: Build "Humans" or Build "Productivity"?
机器之心· 2025-06-25 04:06
Report by 机器之心; editor: Wu Xin (吴昕).

At Huawei Developer Conference 2025 (HDC 2025), Huawei released the CloudRobo embodied intelligence platform. The platform can be seen as a "technical foundation" for embodied intelligence: by using "strong intelligence" in the cloud to empower robot bodies, it sidesteps the pain points of slow on-device intelligence progress and high deployment costs, and charts an embodied-intelligence deployment path with the broadest reach and the fastest time to realization.

"Huawei Cloud's goal is to turn every connected body into an embodied intelligent robot," said Zhang Ping'an, CEO of Huawei Cloud.

By not building the robot "body" itself and instead providing cloud-side technical enablement, Huawei Cloud has chosen a strategic direction that better fits its own strengths, and one that also brings a new perspective on how embodied intelligence can develop.

What embodied intelligence pursues is not a particular body "configuration" or how intelligent the body itself is; rather, from the end-game perspective of "being more useful," the goal is to make every machine, from humanoids to mobile robots to trucks, "embodied-intelligent," accelerating their real adoption in the physical world.

This end-game thinking greatly broadens the room for imagination in industrializing embodied intelligence and points to a potentially efficiency-optimal path for commercial deployment.

Practice in industry confirms the feasibility of this path: in industrial spray painting, CloudRobo helps EFORT (埃夫特) robotic arms quickly adapt to new spraying tasks; in semiconductor manufacturing, CloudRobo enables Youibot (优艾智合) logistics robots to synchronize with production systems in real time, update task plans, and carry out material handling and transport.

Its partners, including Youibot and EFORT, ...