Zhejiang University researcher Peng Sida: What role does low-level spatial perception technology play in training robots? | GAIR 2025
雷峰网· 2025-12-15 07:44
Core Viewpoint
- The article discusses advances in embodied intelligence and general spatial perception, focusing on the work of a research team led by Peng Sida at Zhejiang University. The team aims to enhance robotic capabilities through improved camera pose estimation, depth estimation, and object motion estimation, which are essential for decision-making and for generating training data for humanoid robots [2][3].

Group 1: Camera Pose Estimation
- The traditional method for camera pose estimation is COLMAP, which extracts features from images and matches them across views to generate a 3D point cloud [5].
- A Transformer-based approach, LoFTR, has been proposed to improve image matching, especially in challenging environments [9].
- The MatchAnything method strengthens cross-modal matching, allowing better integration of heterogeneous data sources such as infrared and visible-light images [10][11].
- The Detector-free SfM method addresses the limitations of existing algorithms by iteratively optimizing multi-view matches and the 3D model [13][16].
- The VGGT model dramatically speeds up camera pose estimation, cutting processing time from hours to seconds, but struggles with large-scale scenes [21][23].
- The Scal3R method introduces a global view to keep local scene predictions consistent, improving overall model performance [24].

Group 2: Depth Estimation
- Depth estimation is crucial for embodied intelligence; the Pixel-Perfect-Depth approach aims to eliminate "flying points" in depth predictions by optimizing directly in pixel space [31][34].
- The method also improves video depth estimation by maintaining temporal continuity and integrating semantic features [36].
- The Prompt Depth Anything algorithm enhances robotic grasping and can also be applied in other fields, including autonomous driving [41].
- The InfiniDepth solution provides a more comprehensive geometric estimate by evaluating sub-pixel depth in addition to per-pixel depth, improving precision in robotic applications [43].

Group 3: Object Motion Estimation
- Converting human behavior data into effective training data is essential for the development of embodied intelligence, and relies on depth information, camera motion, and semantic tracking [45].
- The SpatialTracker model improves tracking by projecting 2D images into 3D space, enabling trajectory optimization with Transformer models [48].
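As a concrete illustration of the matching stage that COLMAP-style pipelines build on before triangulating a point cloud, here is a minimal nearest-neighbour descriptor matcher with Lowe's ratio test in NumPy. The function name and toy descriptors are illustrative only, not COLMAP's actual API:

```python
import numpy as np

def match_descriptors(desc_a, desc_b, ratio=0.8):
    """Nearest-neighbour matching with Lowe's ratio test.

    desc_a: (N, D) feature descriptors from image A
    desc_b: (M, D) feature descriptors from image B
    Returns a list of (i, j) index pairs that pass the ratio test.
    """
    matches = []
    for i, d in enumerate(desc_a):
        # Euclidean distance from descriptor i to every descriptor in image B
        dists = np.linalg.norm(desc_b - d, axis=1)
        j1, j2 = np.argsort(dists)[:2]      # two closest candidates
        if dists[j1] < ratio * dists[j2]:   # reject ambiguous matches
            matches.append((i, int(j1)))
    return matches

# Toy example: descriptor 0 in A is closest to descriptor 1 in B, and vice versa.
a = np.array([[1.0, 0.0], [0.0, 1.0]])
b = np.array([[0.0, 0.9], [0.9, 0.1], [5.0, 5.0]])
print(match_descriptors(a, b))  # [(0, 1), (1, 0)]
```

Matches that survive the ratio test are what an SfM system then feeds into triangulation and bundle adjustment to recover camera poses and 3D points.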
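The 2D-to-3D lifting step that SpatialTracker performs before trajectory optimization can be sketched as a pinhole-camera unprojection. This is a minimal sketch assuming known intrinsics and per-pixel depth; the function and parameters are illustrative, not the model's actual interface:

```python
import numpy as np

def unproject(pixels, depths, K):
    """Lift 2D pixel tracks into 3D camera-space points.

    pixels: (N, 2) pixel coordinates (u, v)
    depths: (N,)   per-pixel depth values
    K:      (3, 3) camera intrinsics matrix
    Returns (N, 3) points: depth * K^-1 @ [u, v, 1].
    """
    ones = np.ones((pixels.shape[0], 1))
    homog = np.hstack([pixels, ones])      # (N, 3) homogeneous pixel coords
    rays = homog @ np.linalg.inv(K).T      # back-projected viewing rays
    return rays * depths[:, None]          # scale each ray by its depth

# Toy pinhole camera: focal length 100, principal point (50, 50).
K = np.array([[100.0,   0.0, 50.0],
              [  0.0, 100.0, 50.0],
              [  0.0,   0.0,  1.0]])
pts = unproject(np.array([[50.0, 50.0], [150.0, 50.0]]),
                np.array([2.0, 2.0]), K)
print(pts)  # pixel at the principal point lifts to (0, 0, 2); offset pixel to (2, 0, 2)
```

Tracking in this lifted 3D space, rather than on the raw 2D image plane, is what lets a Transformer optimize trajectories that respect scene geometry.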
GAIR 2025 World Models Forum: a hundred schools of thought contend, from general perception to video and physical world models
雷峰网· 2025-12-13 09:13
Core Insights
- The article discusses the current state and future prospects of world models in the context of embodied intelligence, highlighting diverse research directions and the need for consensus in the field [2][3].

Group 1: General Overview of World Models
- The GAIR Global AI and Robotics Conference featured a forum on world models, where young scholars presented research on topics such as general perception, 3D technology, and digital human reconstruction [2].
- Research on world models is still in its infancy, with many subfields emerging, indicating a rich and varied landscape of inquiry [2].

Group 2: Key Presentations and Innovations
- 彭思达 (Peng Sida) of Zhejiang University presented on general spatial perception technologies for embodied intelligence, focusing on camera pose estimation, depth estimation, and object motion estimation, all crucial for robotic decision-making [5][6].
- His team proposed a Transformer-based method for camera pose estimation that improves image matching in challenging environments, enhancing the accuracy of spatial perception [7].
- The team also introduced the Pixel-Perfect-Depth approach, which improves depth estimation by optimizing directly in pixel space, avoiding the information loss associated with traditional models [8].

Group 3: Advancements in Digital Human Reconstruction
- 修宇亮 (Xiu Yuliang) of Westlake University discussed high-precision digital human reconstruction, presenting the UP2You method, which reduces modeling time from 4 hours to 1.5 minutes by converting noisy data into usable multi-view images [20][21].
- The ETCH method was introduced to model internal human structure accurately by defining the relationship between clothing and skin, addressing earlier modeling inaccuracies [22].
- 修宇亮 emphasized that digital human reconstruction will increasingly rely on fine-tuning existing foundation models rather than training from scratch [23].
Group 4: Innovations in Physical World Modeling
- 王广润 (Wang Guangrun) of Sun Yat-sen University presented a new model for physical world modeling, the in-situ Tweedie discrete diffusion model, which aims to improve data-training efficiency and model performance [26][27].
- The presentation highlighted the need to decouple physical modeling from spatial modeling to improve the adaptability of AI systems in real-world applications [28].

Group 5: The Role of 3D Technology in AI
- 韩晓光 (Han Xiaoguang) of The Chinese University of Hong Kong discussed the evolution of 3D generation technology and its critical role in video generation, emphasizing that 3D models must stay relevant amid rapid advances in 2D video generation [31][32].
- He identified key trends in 3D generation, including increased detail, structural organization, and alignment with 2D inputs, while also addressing the challenges posed by video generation technologies [32][33].
- 韩晓光 concluded that 3D technology is essential for building trustworthy AI systems, since it offers a more interpretable representation than high-dimensional latent variables [34].

Group 6: Future Directions and Collaborative Efforts
- The roundtable discussion stressed the importance of collaboration and consensus in developing world models, with participants noting that hardware advances must accompany algorithmic improvements [37][39].
- The discussion highlighted the potential of a technical alliance focused on world models to foster cooperation and innovation in the field [39].