Common Coding Theory

ICCV 2025 | Zhejiang University, CUHK, and others propose EgoAgent: a unified first-person perception-action-prediction agent
机器之心 · 2025-10-16 04:51
Core Insights
- The article discusses EgoAgent, a first-person joint predictive agent model that learns visual representation, human action, and world-state prediction simultaneously, inspired by human cognitive learning mechanisms [2][5][21]
- EgoAgent breaks with the traditional separation of perception, control, and prediction in AI, enabling a more integrated learning approach [6][21]

Group 1: Model Overview
- EgoAgent is designed to simulate the continuous interaction between the human brain, body, and environment, enabling AI to learn through experience rather than observation alone [5][6]
- The model's core architecture, JEAP (Joint Embedding-Action-Prediction), learns all three tasks jointly within a unified Transformer framework [6][8]

Group 2: Technical Mechanisms
- EgoAgent adopts an interleaved "state-action" joint prediction scheme, encoding first-person video frames and 3D human actions into a single unified sequence; a sketch of this idea follows this summary [8][10]
- The model pairs a Predictor with an Observer in a collaborative mechanism that strengthens its self-supervised learning over time; a sketch of that pattern follows as well [8][10]

Group 3: Performance and Results
- EgoAgent significantly outperforms existing models on three key tasks: first-person world-state prediction, 3D human motion prediction, and visual representation [12][13][15]
- For instance, the 300M-parameter EgoAgent improves Top-1 accuracy by 12.86% and mAP by 13.05% over the latest first-person visual representation model [13]

Group 4: Future Applications
- The model has broad application prospects, particularly in robotics and AR/VR, where it can enhance scene perception and interaction in complex environments [21]
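The article gives no code, but the interleaved state-action design is concrete enough to sketch. The PyTorch toy below is one plausible reading of it, not the paper's implementation: the class name, feature dimensions, encoder/head layout, and masking choices are all assumptions.

```python
import torch
import torch.nn as nn

class JointStateActionPredictor(nn.Module):
    """Toy JEAP-style model (hypothetical): interleave state and action
    tokens in one sequence and let a single causal Transformer predict
    both streams."""

    def __init__(self, state_dim=768, action_dim=63, embed_dim=512,
                 depth=6, heads=8):
        super().__init__()
        # Separate encoders map frame features and 3D-pose actions
        # into a shared token space (all dimensions are illustrative).
        self.state_proj = nn.Linear(state_dim, embed_dim)
        self.action_proj = nn.Linear(action_dim, embed_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=depth)
        # Two heads read the same backbone: one predicts the next world
        # state (in embedding space), the other the next action.
        self.state_head = nn.Linear(embed_dim, embed_dim)
        self.action_head = nn.Linear(embed_dim, action_dim)

    def forward(self, states, actions):
        # states:  (B, T, state_dim)  per-frame visual features
        # actions: (B, T, action_dim) per-frame 3D human actions
        s = self.state_proj(states)
        a = self.action_proj(actions)
        # Interleave along time into [s_1, a_1, s_2, a_2, ...].
        B, T, D = s.shape
        seq = torch.stack([s, a], dim=2).reshape(B, 2 * T, D)
        mask = nn.Transformer.generate_square_subsequent_mask(2 * T)
        h = self.backbone(seq, mask=mask.to(seq.device))
        # With this ordering, a state token (even position) can predict
        # the action taken at that step, and an action token (odd
        # position) can predict the resulting next world state.
        pred_action = self.action_head(h[:, 0::2])  # a_t from s_<=t
        pred_state = self.state_head(h[:, 1::2])    # s_{t+1} from (s_t, a_t)
        return pred_state, pred_action
```

Interleaving means one causal backbone alternates between the two prediction problems: each state token anticipates the action taken at that step, and each action token anticipates the resulting next world state, which is how perception, control, and prediction end up trained inside a single sequence model.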
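The Predictor-Observer pairing is described only at a high level. It resembles the exponential-moving-average (EMA) teacher used in joint-embedding self-supervised methods such as BYOL and JEPA, so the sketch below assumes that pattern; the function names, momentum value, and loss choice are illustrative and not confirmed by the article.

```python
import torch
import torch.nn.functional as F

def ema_update(observer, predictor_encoder, momentum=0.996):
    """One EMA step (assumed mechanism): the Observer slowly tracks the
    Predictor's encoder, providing stable self-supervised targets."""
    with torch.no_grad():
        for p_obs, p_pred in zip(observer.parameters(),
                                 predictor_encoder.parameters()):
            p_obs.mul_(momentum).add_(p_pred, alpha=1 - momentum)

def predictive_loss(predicted_embed, target_frames, observer):
    """Compare the Predictor's guess of the next world state against
    the Observer's gradient-free embedding of the observed frame."""
    with torch.no_grad():
        target_embed = observer(target_frames)
    return F.smooth_l1_loss(predicted_embed, target_embed)
```

Because the Observer is updated only by EMA and supplies gradient-free targets, the Predictor cannot trivially collapse to a constant embedding, which is what would let this kind of self-supervised signal keep improving over time.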