机器之心
AI Is Starting to "Use Computers Freely": Jilin University Proposes the "ScreenExplorer" Agent
机器之心· 2025-06-27 04:02
Core Viewpoint
- The article discusses the development of a vision-language model (VLM) agent named ScreenExplorer, which is designed to autonomously explore and interact within open graphical user interface (GUI) environments, marking a significant step towards achieving general artificial intelligence (AGI) [2][3][35].

Group 1: Breakthroughs and Innovations
- The research introduces three core breakthroughs in the training of VLM agents for GUI exploration [6].
- A real-time interactive online reinforcement learning framework is established, allowing the VLM agent to interact with a live GUI environment [8][11].
- The introduction of a "curiosity mechanism" addresses the sparse feedback issue in open GUI environments, motivating the agent to explore diverse interface states [10][12].

Group 2: Training Methodology
- The training involves a heuristic and world model-driven reward system that encourages exploration by providing immediate rewards for diverse actions [12][24].
- The GRPO algorithm is utilized for reinforcement learning training, calculating the advantage of actions based on rewards obtained (see the sketch after this summary) [14][15].
- The training process allows for multiple parallel environments to synchronize reasoning, execution, and recording, enabling "learning by doing" [15].

Group 3: Experimental Results
- Initial experiments show that without training, the Qwen2.5-VL-3B model fails to interact effectively with the GUI [17].
- After training, the model demonstrates improved capabilities, successfully opening applications and navigating deeper into pages [18][20].
- The ScreenExplorer models outperform general models in exploration diversity and interaction effectiveness, indicating a significant advancement in autonomous GUI interaction [22][23].

Group 4: Skill Emergence and Conclusion
- The training process leads to the emergence of new skills, such as cross-modal translation and complex reasoning abilities [29][34].
- The research concludes that ScreenExplorer effectively enhances GUI interaction capabilities through a combination of exploration rewards, world models, and GRPO reinforcement learning, paving the way for more autonomous agents and progress towards AGI [35].
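The group-relative advantage step in GRPO can be made concrete with a short sketch. This is a minimal illustration of the general GRPO advantage computation, not ScreenExplorer's training code; the diversity and curiosity reward terms and their 0.5/0.5 weighting are placeholder assumptions.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantage: normalize each sampled action's reward by
    the mean and std of its sampling group (no learned value critic)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: for one GUI state, the agent samples a group of 4 candidate actions.
# Each reward mixes a heuristic diversity term with a world-model curiosity term
# (values and the 0.5/0.5 weighting are illustrative, not from the paper).
diversity = np.array([0.2, 0.9, 0.1, 0.6])   # how novel the resulting screen state is
curiosity = np.array([0.5, 0.8, 0.1, 0.3])   # world-model prediction error on the transition
rewards = 0.5 * diversity + 0.5 * curiosity
print(grpo_advantages(rewards))  # actions above the group mean get positive advantage
```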
Breaking the Bottleneck of General-Domain Reasoning: RLPR, New Reinforcement Learning Research from the Tsinghua NLP Lab
机器之心· 2025-06-27 00:49
Core Viewpoint
- The article discusses the introduction of a novel reinforcement learning technique called Reinforcement Learning with Reference Probability Reward (RLPR), which addresses the limitations of existing methods in generalizing to diverse domains beyond mathematics and coding [4][24].

Group 1: RLPR Technology Overview
- RLPR significantly enhances the quality of probability-based rewards through the Prob-to-Reward method, outperforming likelihood-based baseline methods in performance and training stability (see the sketch after this summary) [7][24].
- The technology introduces a dynamic filtering mechanism based on reward standard deviation, further improving the stability and performance of reinforcement learning [8][17].

Group 2: Effectiveness of PR
- The research team found that the generation probability of reference answers in large language models (LLMs) directly reflects the quality of the model's reasoning process, indicating a strong correlation between the model's reasoning accuracy and the probability of generating correct reference answers [11][24].
- The PR mechanism effectively captures the model's self-assessment of reasoning quality, demonstrating its reliability in evaluating output [11][13].

Group 3: Advantages Over Existing Methods
- Unlike existing RLVR methods that require extensive human resources for domain-specific validation rules, RLPR generates reward scores with a simple forward pass, making it more efficient in handling the complexity of natural language [13][24].
- RLPR's dynamic filtering mechanism retains samples with high reward standard deviation for training, enhancing training stability and effectiveness [17][24].

Group 4: Robustness and Validation
- The research team evaluated the quality of different reward sources using the ROC-AUC metric, showing that PR outperformed rule-based rewards and verifier-model rewards at the 0.5B-parameter model scale, with further improvements possible as model capabilities increase [19][21].
- RLPR demonstrated stable performance improvements across various training templates and base models, including Gemma and Llama, surpassing the performance of traditional rule-based RLVR baselines [22][24].
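A rough sketch of the two mechanisms summarized above: scoring a rollout by the probability the policy itself assigns to the reference answer (one forward pass, no external verifier), and filtering out prompts whose rollout rewards barely vary. The mean-probability aggregation, function names, and the 0.05 threshold are illustrative assumptions rather than the paper's exact formulation.

```python
import torch

def prob_reward(ref_token_logprobs: torch.Tensor) -> torch.Tensor:
    """Probability-based reward for one rollout: mean per-token probability of the
    reference answer under the policy, conditioned on the generated reasoning."""
    return ref_token_logprobs.exp().mean()

def keep_prompt(rollout_rewards: torch.Tensor, min_std: float = 0.05) -> bool:
    """Dynamic filtering: train only on prompts whose sampled rollouts disagree
    enough (high reward standard deviation) to carry a learning signal."""
    return rollout_rewards.std().item() >= min_std

# Example: 4 sampled rollouts for one prompt; each row holds the policy's
# probabilities of the reference-answer tokens given that rollout's reasoning.
ref_logprobs = torch.log(torch.tensor([
    [0.70, 0.60, 0.80],   # reasoning that leads toward the reference answer
    [0.20, 0.10, 0.30],   # reasoning that drifts away from it
    [0.65, 0.55, 0.75],
    [0.25, 0.15, 0.20],
]))
rewards = torch.stack([prob_reward(r) for r in ref_logprobs])
print(rewards, keep_prompt(rewards))
```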
Google Open-Sources Gemma 3n: Runs in 2 GB of Memory, the Strongest Multimodal Model Under 10 Billion Parameters
机器之心· 2025-06-27 00:49
Core Viewpoint
- Google has made a significant advancement in edge AI with the release of the new multimodal model Gemma 3n, which brings powerful multimodal capabilities to devices like smartphones, tablets, and laptops, previously only available on advanced cloud models [2][3].

Group 1: Model Features
- Gemma 3n supports native multimodal input and output, including images, audio, video, and text [5].
- The model is optimized for device efficiency, with two versions (E2B and E4B) that require only 2GB and 3GB of memory to run, despite having original parameter counts of 5 billion and 8 billion respectively [5].
- The architecture includes innovative components such as the MatFormer architecture for computational flexibility and a new audio and vision encoder based on MobileNet-v5 [5][7].

Group 2: Architectural Innovations
- The MatFormer architecture allows for elastic reasoning and dynamic switching between E4B and E2B inference paths, optimizing performance and memory usage based on current tasks (see the sketch after this summary) [12].
- The use of per-layer embedding (PLE) technology significantly enhances memory efficiency, allowing a large portion of parameters to be loaded and computed on the CPU, reducing the memory burden on GPU/TPU [14][15].

Group 3: Performance Enhancements
- Gemma 3n has achieved quality improvements in multilingual support, mathematics, coding, and reasoning, with the E4B version scoring over 1300 on the LMArena benchmark [5].
- The model introduces key-value cache sharing (KV Cache Sharing) to accelerate the processing of long content inputs, improving the time-to-first-token for streaming applications [18][19].

Group 4: Audio and Visual Capabilities
- The audio capabilities of Gemma 3n include a universal speech model (USM) that generates tokens every 160 milliseconds, enabling high-quality speech-to-text transcription and translation [21].
- The MobileNet-V5-300M visual encoder provides advanced performance for multimodal tasks on edge devices, supporting various input resolutions and achieving high throughput for real-time video analysis [24][26].

Group 5: Future Developments
- Google plans to release more details in an upcoming technical report on MobileNet-V5, highlighting its significant performance improvements and architectural innovations [28].
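The MatFormer idea of a smaller model nested inside a larger one can be sketched as a feed-forward block whose hidden width is sliced at inference time, so the same weights serve a full path and a narrower path. This is a conceptual stand-in, not Gemma 3n's implementation; the class name, dimensions, and prefix-slicing rule are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class NestedFFN(nn.Module):
    """Feed-forward block whose hidden width can be sliced at inference, so one set
    of weights serves both a full and a reduced compute path (a conceptual stand-in
    for MatFormer-style nesting)."""
    def __init__(self, d_model=64, d_hidden=256):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x, hidden_frac=1.0):
        h = int(self.up.out_features * hidden_frac)
        # Use only the first h hidden units: the "small" model is a prefix slice
        # of the "large" one, so no separate weights are stored.
        w_up, b_up = self.up.weight[:h], self.up.bias[:h]
        w_down = self.down.weight[:, :h]
        hidden = torch.relu(x @ w_up.T + b_up)
        return hidden @ w_down.T + self.down.bias

ffn = NestedFFN()
x = torch.randn(2, 64)
full = ffn(x, hidden_frac=1.0)   # full path (E4B-like)
lite = ffn(x, hidden_frac=0.5)   # reduced path (E2B-like), same weights
print(full.shape, lite.shape)
```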
ICCV 2025 Results Are Out: 24% Acceptance Rate. Did You Get Your Ticket to Hawaii?
机器之心· 2025-06-26 06:10
Reported by 机器之心. Editor: +0

ICCV 2025 will be held in Hawaii, USA, from October 19 to 25. ICCV has just sent submitters the notifications of this year's paper acceptance results.

The data show that the conference received 11,239 valid submissions this year, all of which entered the review process. The program committee recommended 2,699 papers for acceptance, for a final acceptance rate of 24%.

Compared with previous editions, the 2025 submission count is nearly three times that of 2019, reflecting the rapid expansion of the computer vision field and increasingly active academic research.

Despite the sharp rise in submissions, ICCV's acceptance rate has remained relatively stable over the past few years, staying roughly within the 25%-26% range.

Following CVPR 2025, ICCV 2025 has also implemented a new policy aimed at strengthening accountability and integrity. The program chairs identified 25 highly irresponsible reviewers and, as a result, desk-rejected the 29 papers associated with them. Twelve of these rejected papers would otherwise have been accepted, which has sparked controversy.

ICCV 2023 received 8,260 submissions and accepted 2,160, an acceptance rate of about 26.15%. ICCV 2021 received 6,152 submissions and accepted 1,612, an acceptance rate of 26.20%. ICCV 2019 received 43 ...
Five Large Models Sit the "Shandong" Gaokao: Gemini and Doubao Take the Top Science and Humanities Scores
机器之心· 2025-06-26 06:10
Reported by 机器之心. Editors: Yang Wen, +0

No prompt engineering was used at any point in the test; every input was an original gaokao question. DeepSeek R1 received only the question text, while the other models received both the question text and a screenshot of the question. Total scores were computed in the 3 (Chinese, math, foreign language) + 3 (science comprehensive / humanities comprehensive) format to rank the five models.

On the final score sheet, all five large models exceeded 620 points on the humanities track; under Shandong's gaokao score-conversion system, Doubao's 683 points would be enough to contend for Tsinghua or Peking University. On the science track, the gaps between models are more pronounced: Gemini and Doubao have reached the level of a safe admission to a leading 985 university, while Claude 4 and o3 scored below 600.

In last year's full-subject gaokao evaluation, large models could barely reach the first-tier admission line. Faced with complex math and physics problems, they could produce answers, but their reasoning was shallow and their chains of logic were loose, often giving the impression of pure guessing. Only a year later, technical advances have brought a qualitative leap, and large models now show increasingly strong logical reasoning and the ability to solve deep problems.

Chinese, math, and English show little differentiation; science totals trail the humanities

On the core subjects of Chinese, math, and foreign language, the evaluated models performed excellently across the board, all reaching the level of top examinees, with relatively little separation between them. However, the o3 model went off-topic on its essay, which lowered its Chinese score and dragged down its total. This year ...
Renmin University & ByteDance Seed: Using μP for Efficient Scaling of Diffusion Transformers
机器之心· 2025-06-26 06:10
Core Viewpoint
- The research introduces the μP theory to optimize hyperparameter tuning in diffusion Transformers, significantly reducing the computational cost of hyperparameter searches while enhancing model performance [2][24].

Group 1: Introduction and Background
- The research is a collaboration between Renmin University of China and ByteDance, focusing on optimizing diffusion Transformers, which have become essential in modern visual generation models [1].
- The μP theory, or Maximal Update Parametrization, is a significant milestone in the Tensor Program infinite-width network theory, allowing different-sized Transformers to share optimal hyperparameters [7].

Group 2: μP Theory Application
- The μP theory has been successfully applied to diffusion Transformers, despite their architectural differences from standard Transformers, enabling effective hyperparameter transfer from smaller models to larger ones (see the sketch after this summary) [8][10].
- The research demonstrates that hyperparameters can be effectively transferred across different model sizes, leading to improved training efficiency and performance [12][15].

Group 3: Experimental Validation
- Systematic experiments were conducted on various models, including DiT, PixArt, and MMDiT, showing that hyperparameters found in smaller models can be successfully applied to larger models, achieving superior results compared to manually tuned baselines [2][21][24].
- In the MMDiT experiments, hyperparameters from a 0.18B model were successfully utilized in training an 18B model, with the computational cost of hyperparameter searches being only 3% of that required for manual tuning [21][24].

Group 4: Future Implications
- The findings suggest that μP will be a crucial tool for the future expansion of foundational models, emphasizing the importance of theoretical advancements in AI for large-scale practical applications [24].
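The transfer recipe behind μP can be sketched with the standard scaling rules for hidden weight matrices under an Adam-style optimizer: hyperparameters are tuned once on a narrow proxy model, then the per-layer learning rate is divided by the width multiplier and the initialization standard deviation by its square root. This is a generic μP sketch with placeholder numbers, not the paper's diffusion-Transformer-specific parametrization.

```python
import math

def mup_hidden_hparams(base_width: int, target_width: int,
                       base_lr: float, base_init_std: float):
    """Generic muP scaling for hidden (matrix-like) weights with an Adam-style
    optimizer: lr scales like base_width / target_width, init std like 1 / sqrt(fan_in)."""
    width_mult = target_width / base_width
    lr = base_lr / width_mult
    init_std = base_init_std / math.sqrt(width_mult)
    return lr, init_std

# Hyperparameters tuned on a small proxy model (placeholder numbers)...
base_width, base_lr, base_init_std = 384, 3e-4, 0.02
# ...transferred to much wider target models without re-searching.
for target_width in (768, 1536, 3072):
    lr, std = mup_hidden_hparams(base_width, target_width, base_lr, base_init_std)
    print(f"width={target_width}: lr={lr:.2e}, init_std={std:.4f}")
```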
Just Now: Meta Sweeps OpenAI's Zurich Office, Poaching Three ViT Authors
机器之心· 2025-06-26 04:35
Core Viewpoint
- Meta has aggressively recruited top AI researchers from OpenAI, indicating a strategic move to regain its competitive edge in the AI sector [3][6][9].

Group 1: Recruitment and Strategy
- Meta CEO Mark Zuckerberg has successfully poached three researchers, Lucas Beyer, Alexander Kolesnikov, and Xiaohua Zhai, from OpenAI's Zurich office [4][5].
- The recruitment is part of a broader strategy by Zuckerberg, who is personally reaching out to hundreds of top talents in the AI field, offering lucrative compensation packages, including offers worth up to $100 million [6][7].
- Meta's recent investment of $14 billion in AI startup Scale and the hiring of its CEO, Alexandr Wang, to lead a new superintelligence team further emphasize its commitment to AI development [7].

Group 2: Responses from OpenAI
- OpenAI CEO Sam Altman has downplayed concerns regarding the talent exodus, suggesting that the best talents are not leaving for Meta [9].
- In response to Meta's recruitment efforts, OpenAI is also increasing funding and development opportunities for its researchers to retain talent [9].

Group 3: Background of Key Researchers
- Xiaohua Zhai has a strong academic background, holding a PhD in Computer Science from Peking University, and was a significant contributor to multimodal research at Google DeepMind before joining OpenAI [12][14][15].
- Lucas Beyer, who has also been influential in AI research, completed his studies at RWTH Aachen University and has worked at Google Brain and DeepMind [18][20].
- Alexander Kolesnikov, with a PhD in machine learning and computer vision, has a notable research history at Google Brain and DeepMind before joining OpenAI [24][26].
New Breakthrough in Embodied World Models: Horizon Robotics & GigaAI Propose a Geometry-Consistent Video World Model to Enhance Robot Policy Learning
机器之心· 2025-06-26 04:35
In recent years, as artificial intelligence has evolved from perceptual intelligence toward decision-making intelligence, world models have gradually become an important research direction in robotics. A world model aims to let an agent model its environment and predict future states, enabling more efficient planning and decision-making.

At the same time, embodied data has drawn explosive attention. Current embodied algorithms depend heavily on large-scale real-robot demonstration data, and collecting such data is typically expensive and labor-intensive, which severely limits scalability and generalization. Simulation platforms offer a relatively low-cost way to generate data, but the significant visual and dynamical differences between simulated and real environments (the sim-to-real gap) make it difficult to transfer policies trained in simulation directly to real robots, limiting their practical effectiveness. How to efficiently acquire, generate, and use high-quality embodied data has therefore become one of the core challenges in robot learning.

Project page: https://horizonrobotics.github.io/robot_lab/robotransfer/

Imitation learning has become one of the key approaches in robotic manipulation. By having a robot "imitate" expert demonstrations, effective policy models can be built quickly for complex tasks. However, such methods typically rely on large amounts of high-quality real robot ...
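To make concrete why imitation learning leans on large demonstration sets, here is a minimal behavior-cloning sketch: the policy is fit by regressing expert actions from observations, so it only covers the states the demonstrations happen to visit. This is a generic illustration with made-up dimensions and synthetic data, not the method proposed in the article.

```python
import torch
import torch.nn as nn

# Toy expert demonstrations: (observation, action) pairs.
# In practice these come from costly real-robot teleoperation.
obs = torch.randn(256, 32)             # e.g. flattened visual / proprioceptive features
expert_actions = torch.randn(256, 7)   # e.g. 7-DoF end-effector commands

policy = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 7))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

for step in range(200):
    pred = policy(obs)
    loss = nn.functional.mse_loss(pred, expert_actions)  # behavior-cloning objective
    opt.zero_grad()
    loss.backward()
    opt.step()

# The learned policy only covers states present in the demonstrations, which is why
# scaling or augmenting demonstration data (e.g. with a world model) matters so much.
```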
ScienceBoard: The First Multimodal-Agent Evaluation Environment for Scientific Tasks with Real Interaction and Automatic Evaluation
机器之心· 2025-06-26 00:30
Large-model agents for assisting scientific research are quietly changing.

1 Background and Motivation

First author Sun Qiushi is a PhD student at the School of Computing and Data Science, the University of Hong Kong, and holds a master's degree from the Department of Data Science at the National University of Singapore. His main research interests are computer-using agents and code intelligence, and he has published multiple papers at top NLP and ML venues including ACL, EMNLP, ICLR, and COLM. The OS-Copilot team behind this work has previously released a series of computer-agent studies such as OS-Atlas, OS-Genesis, and SeeClick, which are widely used in academia and industry.

Over the past few years, with the rapid progress of LLMs and VLMs, we have witnessed AI being applied widely in natural language processing, programming, image understanding, and other fields. In scientific research, a field central to the accumulation of human knowledge, agents built on these powerful models are quietly becoming "new collaborators" in research workflows.

In the early days, AI's role in science was mostly that of an "analyzer": helping analyze data, write up literature, and generate figures. With the emergence of computer-using agents (CUAs), this role is undergoing a fundamental shift. Unlike traditional language-model assistants, these agents can operate a computer the way a human does, clicking and dragging through the graphical interface ...
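The observe-act loop that such computer-using agents run can be sketched in a few lines: capture the screen, ask a vision-language model for the next GUI action, execute it, and repeat. The `propose_action` stub below is a hypothetical stand-in for a real VLM call, the action format is made up, and this is not ScienceBoard's actual interface; pyautogui is used only as one common way to take screenshots and issue clicks.

```python
import pyautogui  # common library for screenshots and programmatic mouse/keyboard control

def propose_action(screenshot):
    """Hypothetical stand-in for a VLM call that maps a screenshot to a GUI action."""
    return {"type": "click", "x": 100, "y": 200}

for _ in range(3):  # a tiny observe-act loop
    screenshot = pyautogui.screenshot()            # observe the current screen
    action = propose_action(screenshot)            # the agent decides what to do next
    if action["type"] == "click":
        pyautogui.click(action["x"], action["y"])  # act through the GUI, as a human would
```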
Kaiming He's New Role: Distinguished Scientist at Google DeepMind
机器之心· 2025-06-26 00:30
Core Viewpoint
- The article discusses the recent news of Kaiming He joining Google as a part-time Distinguished Scientist at DeepMind, highlighting his significant contributions to the field of AI and computer vision [2][4][24].

Group 1: Kaiming He's Background and Achievements
- Kaiming He achieved the highest score in the 2003 Guangdong Province college entrance examination and was admitted to Tsinghua University [8].
- He completed his PhD at the Chinese University of Hong Kong under the supervision of Xiaoou Tang and has held positions at Microsoft Research Asia, Facebook AI Research, and MIT [9].
- His research has received multiple awards, including the best paper award at CVPR in 2009 and 2016, and he has over 710,000 citations according to Google Scholar [10][12].

Group 2: Research Contributions
- Kaiming He's most notable work includes the ResNet paper published in 2016, which has been cited over 280,000 times and is considered one of the most cited papers of the 21st century [15][18].
- His research addresses the gradient propagation problem in deep networks, establishing fundamental components for modern deep learning models (see the sketch after this summary) [18][19].
- He has also contributed to the development of the Masked Autoencoders model, which has gained popularity in the computer vision community [20].

Group 3: Future Prospects at Google
- The article expresses anticipation for Kaiming He's potential contributions at Google, particularly in the area of generative modeling, as suggested by his recent research [6][24].
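The gradient-propagation fix at the heart of ResNet is the identity shortcut: each block learns a residual on top of its input, so gradients have a direct path through very deep stacks. A minimal sketch follows, simplified to fully connected blocks rather than the paper's convolutional ones.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = x + F(x): the identity shortcut gives gradients a direct path through
    deep stacks, the core idea behind ResNet."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.f(x)  # residual connection

deep_net = nn.Sequential(*[ResidualBlock() for _ in range(50)])
x = torch.randn(4, 64, requires_grad=True)
deep_net(x).sum().backward()
print(x.grad.abs().mean())  # gradients survive 50 blocks thanks to the shortcuts
```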