EMBODIED WEB AGENTS: Bridging Physical and Digital Realms for Integrated Agent Intelligence

Group 1
- The article describes how fragmented current AI agents are: web agents excel at handling digital information, while embodied agents focus on physical interaction, and the two domains rarely work together [4]
- The research team proposes a new paradigm, Embodied Web Agents (EWA), that aims to bridge physical embodiment and web-based reasoning seamlessly [4]

Group 2
- The team builds a unified simulation environment that integrates three modules: outdoor environments based on the Google Street View / Earth APIs for navigation in real cities, indoor environments using AI2-THOR for high-fidelity kitchen scenes, and a custom-built web environment with five functional websites [5][8][10]
- The team constructs the EWA-Bench benchmark, which contains 1,500 tasks across five domains; 75% of the tasks require multiple environment switches, which tests cross-domain coordination [11]

Group 3
- Experiments show a large gap between leading models and humans: overall accuracy is 34.72% for GPT-4o and 30.56% for Gemini, compared with 90.28% for humans [13]
- Cross-domain coordination problems are the primary cause of errors, accounting for 66.6% of failures; the models perform well on pure web tasks but struggle with physical interaction [15]

Group 4
- The article highlights two contributions: the first formalization of the "embodied web agent" conceptual framework, and the release of the first simulation environment that integrates the physical and digital domains [21]
- The results indicate that current large language models (LLMs) face a significant bottleneck in cross-domain collaboration, a capability that is crucial for improving agent intelligence [22]