Embodied Artificial Intelligence

Context is memory! HKU & Kuaishou propose a scene-consistent interactive video world model whose memory rivals Genie3, and it arrived earlier!
量子位· 2025-08-21 07:15
Core Viewpoint
- The article discusses a new framework called "Context-as-Memory," developed by a research team from the University of Hong Kong and Kuaishou, which significantly improves scene consistency in interactive long video generation by efficiently utilizing historical context frames [8][10][19].

Summary by Sections

Introduction to Context-as-Memory
- The framework addresses the issue of scene inconsistency in AI-generated videos by using a memory retrieval system that selects relevant historical frames to maintain continuity [10][19].

Types of Memory in Video Generation
- Two types of memory are identified: dynamic memory for short-term actions and behaviors, and static memory for scene-level and object-level information [12][13].

Key Concepts of Context-as-Memory
- Long video generation requires long-term historical memory to maintain scene consistency over time [15].
- Memory retrieval is crucial, as directly using all historical frames is computationally expensive; a memory retrieval module is needed to filter useful information [15].
- Context memory is created by concatenating selected context frames with the input, allowing the model to reference historical information during frame generation [15][19].

Memory Retrieval Method
- The model employs a camera trajectory-based search method to select context frames that overlap significantly with the current frame's visible area, enhancing both computational efficiency and scene consistency [20][22] (a minimal sketch of this retrieval idea follows this summary).

Dataset and Experimental Results
- A dataset was created using Unreal Engine 5, containing 100 videos with 7601 frames each, to evaluate the effectiveness of the Context-as-Memory method [23].
- Experimental results show that Context-as-Memory outperforms baseline and state-of-the-art methods in memory capability and generation quality, demonstrating its effectiveness in maintaining long video consistency [24][25].

Generalization of the Method
- The method's generalization was tested using various styles of images as initial frames, confirming its strong memory capabilities in open-domain scenarios [26][27].

Research Team and Background
- The research was a collaboration between the University of Hong Kong, Zhejiang University, and Kuaishou, led by PhD student Yu Jiwen under Professor Liu Xihui [28][33].
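The camera-trajectory-based retrieval described under "Memory Retrieval Method" can be pictured with a small sketch. Everything below is an assumption for illustration: field-of-view overlap is approximated from camera poses by viewing-direction agreement and spatial proximity, and the helper names (`fov_overlap_score`, `retrieve_context_frames`) and parameters are hypothetical rather than the authors' implementation.

```python
import numpy as np

def fov_overlap_score(pose_a, pose_b, fov_deg=90.0, max_dist=10.0):
    """Rough proxy for how much two camera views overlap.

    A pose is (position: (3,) array, forward: (3,) unit vector).
    Real frustum-intersection geometry would be more faithful; this
    proxy just rewards similar viewing directions and nearby positions.
    """
    pos_a, dir_a = pose_a
    pos_b, dir_b = pose_b
    cos_angle = float(np.dot(dir_a, dir_b))        # 1.0 = identical direction
    dist = float(np.linalg.norm(pos_a - pos_b))    # camera separation
    if cos_angle <= np.cos(np.deg2rad(fov_deg)):   # views point too far apart
        return 0.0
    return cos_angle * max(0.0, 1.0 - dist / max_dist)

def retrieve_context_frames(history, current_pose, k=4):
    """Pick the k historical frames whose views overlap most with the
    current camera; these become the context frames concatenated with
    the input during generation."""
    scored = [(fov_overlap_score(pose, current_pose), idx)
              for idx, (_frame, pose) in enumerate(history)]
    scored.sort(reverse=True)
    return [history[idx][0] for score, idx in scored[:k] if score > 0.0]
```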
Diffusion world model LaDi-WM substantially improves the success rate and cross-scene generalization of robot manipulation
具身智能之心· 2025-08-18 00:07
Core Viewpoint
- The article discusses the development of LaDi-WM (Latent Diffusion-based World Models), a novel world model that enhances robotic operation performance through predictive strategies, addressing the challenge of accurately predicting future states in robot-object interactions [1][5][28].

Group 1: LaDi-WM Overview
- LaDi-WM utilizes pre-trained vision foundation models to create latent space representations that encompass both geometric and semantic features, facilitating strategy learning and cross-task generalization in robotic operations [1][5][10] (a minimal sketch of such a latent state follows this summary).
- The framework consists of two main phases: world model learning and policy learning, which iteratively optimizes action outputs based on predicted future states [9][12].

Group 2: Methodology
- The world model learning phase involves extracting geometric representations using DINOv2 and semantic representations using Siglip, followed by an interactive diffusion process to enhance dynamic prediction accuracy [10][12].
- The policy model training incorporates future predictions from the world model as additional inputs, guiding the model to improve action predictions and reduce output distribution entropy over iterations [12][22].

Group 3: Experimental Results
- In virtual experiments on the LIBERO-LONG dataset, LaDi-WM achieved a success rate of 68.7% with only 10 training trajectories, outperforming previous methods by a significant margin [15][16].
- The framework demonstrated strong performance on the CALVIN D-D dataset, completing tasks with an average length of 3.63, indicating robust capabilities in long-duration tasks [17][21].
- Real-world experiments showed a 20% increase in success rates for tasks such as stacking bowls and drawer operations, validating the effectiveness of LaDi-WM in practical scenarios [25][26].

Group 4: Scalability and Generalization
- The scalability experiments indicated that increasing the training data for the world model led to reduced prediction errors and improved policy performance [18][22].
- The generalization capability of the world model was highlighted by its ability to guide policy learning across different environments, achieving better performance than models trained solely in the target environment [20][21].
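As a rough picture of the latent state described in Group 1, the sketch below concatenates patch features from a frozen geometric encoder and a frozen semantic encoder into a single representation that a diffusion dynamics model could then predict forward. The two encoders stand in for DINOv2 and Siglip; the function, its signature, and the assumption that both produce matching patch grids are illustrative, not the authors' implementation.

```python
import torch

def build_latent_state(image, geometric_encoder, semantic_encoder):
    """Form a LaDi-WM-style latent by fusing geometric and semantic features.

    geometric_encoder / semantic_encoder are frozen vision backbones
    (DINOv2-like and Siglip-like); both are assumed to return patch
    features of shape (B, N, D) over the same patch grid.
    """
    with torch.no_grad():                      # foundation models stay frozen
        geo = geometric_encoder(image)         # (B, N, D_geo) geometric cues
        sem = semantic_encoder(image)          # (B, N, D_sem) semantic cues
    return torch.cat([geo, sem], dim=-1)       # (B, N, D_geo + D_sem)
```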
CoRL 2025 | Latent-space diffusion world model LaDi-WM substantially improves the success rate and cross-scene generalization of robot manipulation policies
机器之心· 2025-08-17 04:28
Core Viewpoint
- The article discusses the introduction of LaDi-WM (Latent Diffusion-based World Models), a novel world model that utilizes latent space diffusion to enhance robot operation performance through predictive strategies [2][28].

Group 1: Innovation Points
- LaDi-WM employs a latent space representation constructed using pre-trained vision foundation models, integrating both geometric features (derived from DINOv2) and semantic features (derived from Siglip), which enhances the generalization capability for robotic operations [5][10].
- The framework includes a diffusion strategy that iteratively optimizes output actions by integrating predicted states from the world model, leading to more consistent and accurate action results [6][12].

Group 2: Framework Structure
- The framework consists of two main phases: world model learning and policy learning [9].
- **World Model Learning**: Involves extracting geometric and semantic representations from observation images and implementing a diffusion process that allows interaction between these representations to improve dynamic prediction accuracy [10].
- **Policy Model Training and Iterative Optimization**: Utilizes future predictions from the world model to guide policy learning, allowing for multiple iterations of action optimization, which reduces output distribution entropy and enhances action prediction accuracy [12][18] (a minimal sketch of this refinement loop follows this summary).

Group 3: Experimental Results
- In extensive experiments on virtual datasets (LIBERO-LONG, CALVIN D-D), LaDi-WM demonstrated a significant increase in success rates for robotic tasks, achieving a 27.9% improvement on the LIBERO-LONG dataset and reaching a success rate of 68.7% with minimal training data [15][16].
- The framework's scalability was validated, showing that increasing training data and model parameters consistently improved success rates in robotic operations [18][20].

Group 4: Real-World Application
- The framework was also tested in real-world scenarios, including tasks like stacking bowls and opening drawers, where LaDi-WM improved the success rate of original imitation learning strategies by 20% [24][25].
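The iterative action optimization described above might look roughly like the loop below: the policy proposes an action, the world model imagines the resulting future latent, and that prediction is fed back into the next policy pass. All names, signatures, and the fixed number of refinement passes are assumptions for illustration, not the paper's code.

```python
import torch

def refine_action(policy, world_model, obs_latent, instruction, n_iters=3):
    """Iteratively refine an action with world-model feedback.

    policy(obs, instr, future) -> action, and
    world_model(obs, action) -> predicted future latent,
    are placeholder callables standing in for the learned models.
    """
    predicted_future = torch.zeros_like(obs_latent)  # no prediction on pass 0
    action = None
    for _ in range(n_iters):
        # Policy conditions on the observation, the task instruction,
        # and the world model's latest future prediction.
        action = policy(obs_latent, instruction, predicted_future)
        # World model imagines where this action would take the scene;
        # the refined prediction guides the next pass.
        predicted_future = world_model(obs_latent, action)
    return action
```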
谢耘: Nobel laureate Hinton's perfunctory appearance is a corruption of science
Hu Xiu· 2025-08-04 05:57
Group 1
- The article discusses the contrasting views on artificial intelligence (AI), highlighting a divide between pessimistic and optimistic perspectives among experts [2][3][5]
- It emphasizes that while AI can perform certain tasks, it lacks true understanding and reasoning capabilities, relying instead on statistical methods [7][8][10]
- The article critiques the notion that AI's intelligence is akin to human intelligence, arguing that there are fundamental differences in understanding and reasoning [11][12][24]

Group 2
- The lack of a solid scientific foundation for AI is noted, with historical references to Turing's work being described as subjective and not meeting scientific standards [10][12][14]
- The article points out that AI's reliance on statistical methods has led to practical applications but does not equate to theoretical breakthroughs in science [15][17]
- It suggests that AI is merely a part of the broader information technology landscape, which aims to enhance human capabilities rather than replace them [19][20][21]

Group 3
- The historical context of technological development is discussed, indicating that reliance on empirical craftsmanship has limitations compared to scientific advancements [22][24]
- The article warns against the potential for misinformation and the dilution of scientific rigor in the discourse surrounding AI, especially as society enters a "post-science" era [24][25]
- It concludes that the aspiration to create machines with human-like consciousness remains unattainable without a deeper scientific understanding of consciousness itself [23][24]
Hede Innovation (00400): Using edge AI computing "Nvidia Jetson" as a cornerstone to empower the humanoid robot sector
智通财经网· 2025-07-28 11:55
Group 1
- Nvidia and Hede Innovation held an online seminar focusing on humanoid robots and their integrated hardware and software solutions [1]
- The upcoming flagship platform, Jetson Thor, is set to launch in August, emphasizing edge AI computing for humanoid robots [1]
- Nvidia's three computing platforms (DGX, Jetson, and Omniverse) provide a comprehensive solution for training, simulation optimization, and deployment of embodied robots [1]

Group 2
- Humanoid robots are seen as a key hardware node for breakthroughs in embodied artificial intelligence, with global spending in the robotics sector projected to approach $370 billion by 2028, growing at a CAGR of 13.2% [2]
- Hede Innovation is a core supplier in the AI computing supply chain, representing major brands like Nvidia, Intel, and Microsoft, and is focusing on the Jetson series for edge AI applications [2]
- The performance of Hede Innovation is expected to benefit from the leadership of Nvidia Jetson products in the edge AI field, reinforcing its core position in the AI computing supply chain [3]
VLN-PE: A physically realistic VLN platform supporting humanoid, quadruped, and wheeled robots (ICCV'25)
具身智能之心· 2025-07-21 08:42
Core Insights
- The article introduces VLN-PE, a physically realistic platform for Vision-Language Navigation (VLN), addressing the gap between simulated models and real-world deployment challenges [3][10][15]
- The study highlights the significant performance drop (34%) when transferring existing VLN models from simulation to physical environments, emphasizing the need for improved adaptability [15][30]
- The research identifies the impact of various factors such as robot type, environmental conditions, and the use of physical controllers on model performance [15][32][38]

Background
- VLN has emerged as a critical task in embodied AI, requiring agents to navigate complex environments based on natural language instructions [6][8]
- Previous models relied on idealized simulations, which do not account for the physical constraints and challenges faced by real robots [9][10]

VLN-PE Platform
- VLN-PE is built on GRUTopia, supporting various robot types and integrating high-quality synthetic and 3D rendered environments for comprehensive evaluation [10][13]
- The platform allows for seamless integration of new scenes, enhancing the scope of VLN research and assessment [10][14]

Experimental Findings
- The experiments reveal that existing models show a 34% decrease in success rates when transitioning from simulated to physical environments, indicating a significant gap in performance [15][30]
- The study emphasizes the importance of multi-modal robustness, with RGB-D models performing better under low-light conditions than RGB-only models [15][38]
- The findings suggest that training on diverse datasets can improve the generalization capabilities of VLN models across different environments [29][39]

Methodologies
- The article evaluates various methodologies, including single-step discrete action classification models and multi-step continuous prediction methods, highlighting the potential of diffusion strategies in VLN [20][21]
- The research also explores the effectiveness of map-based zero-shot large language models (LLMs) for navigation tasks, demonstrating their potential in VLN applications [24][25]

Performance Metrics
- The study employs standard VLN evaluation metrics, including trajectory length, navigation error, success rate, and others, to assess model performance [18][19]
- Additional metrics are introduced to account for physical realism, such as fall rate and stuck rate, which are critical for evaluating robot performance in real-world scenarios [18][19] (a minimal metric-aggregation sketch follows this summary)

Cross-Embodiment Training
- The research indicates that cross-embodiment training can enhance model performance, allowing a unified model to generalize across different robot types [36][39]
- The findings suggest that using data from multiple robot types during training leads to improved adaptability and performance in various environments [36][39]
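As a rough picture of how such physically grounded metrics could be aggregated, here is a minimal sketch. The episode fields, the 3 m success radius (a common VLN convention), and the function name are assumptions for illustration; the platform's exact definitions may differ.

```python
import numpy as np

def summarize_episodes(episodes, success_radius=3.0):
    """Aggregate VLN-PE-style metrics over a list of evaluation episodes.

    Each episode dict is assumed to carry:
      'final_dist' - metres from the stop position to the goal,
      'path_len'   - metres travelled,
      'fell'       - True if the robot fell during the episode,
      'stuck'      - True if the robot got stuck and never finished.
    """
    ne = np.mean([e["final_dist"] for e in episodes])                    # navigation error
    sr = np.mean([e["final_dist"] <= success_radius for e in episodes])  # success rate
    tl = np.mean([e["path_len"] for e in episodes])                      # trajectory length
    fall_rate = np.mean([e["fell"] for e in episodes])
    stuck_rate = np.mean([e["stuck"] for e in episodes])
    return {"NE": ne, "SR": sr, "TL": tl,
            "FallRate": fall_rate, "StuckRate": stuck_rate}
```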
HKU's reinforcement-learning-driven embodied navigation method for continuous environments: VLN-R1
具身智能之心· 2025-07-04 09:48
Core Viewpoint
- The article presents the VLN-R1 framework, which utilizes large vision-language models (LVLM) for continuous navigation in real-world environments, addressing limitations of previous discrete navigation methods [5][15].

Research Background
- The VLN-R1 framework processes first-person video streams to generate continuous navigation actions, enhancing the realism of navigation tasks [5].
- The VLN-Ego dataset is constructed using the Habitat simulator, providing rich visual and language information for training LVLMs [5][6].
- The importance of visual-language navigation (VLN) is emphasized as a core challenge in embodied AI, requiring real-time decision-making based on natural language instructions [5].

Methodology
- The VLN-Ego dataset includes natural language navigation instructions, historical frames, and future action sequences, designed to balance local details and overall context [6].
- The training method consists of two phases: supervised fine-tuning (SFT) to align action predictions with expert demonstrations, followed by reinforcement fine-tuning (RFT) to optimize model performance [7][9] (a minimal SFT sketch follows this summary).

Experimental Results
- In the R2R task, VLN-R1 achieved a success rate (SR) of 30.2% with the 7B model, significantly outperforming traditional models without depth maps or navigation maps [11].
- The model demonstrated strong cross-domain adaptability, outperforming fully supervised models in the RxR task with only 10K samples used for RFT [12].
- The design of predicting future actions was found to be crucial for performance, with the best results obtained by predicting six future actions [14].

Conclusion and Future Work
- VLN-R1 integrates LVLM and reinforcement learning fine-tuning, achieving state-of-the-art performance in simulated environments and showing the potential for small models to match larger ones [15].
- Future research will focus on validating the model's generalization capabilities in real-world settings and exploring applications in other embodied AI tasks [15].
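The SFT phase described under Methodology can be pictured as ordinary sequence supervision over the next few discrete actions. The sketch below assumes a small discrete action set, a model that emits one logit vector per future step, and a six-step horizon (the horizon the summary reports working best); all names and signatures are illustrative, not the authors' code.

```python
import torch
import torch.nn.functional as F

ACTIONS = ["FORWARD", "TURN_LEFT", "TURN_RIGHT", "STOP"]   # assumed action set

def sft_loss(model, instruction, history_frames, expert_actions, horizon=6):
    """Supervised fine-tuning loss for a VLN-R1-style policy.

    model(instruction, history_frames) is assumed to return logits of
    shape (horizon, len(ACTIONS)); expert_actions is a LongTensor of
    expert action indices. The loss aligns the predicted future action
    sequence with the expert demonstration.
    """
    logits = model(instruction, history_frames)   # (horizon, n_actions)
    targets = expert_actions[:horizon]            # (horizon,)
    return F.cross_entropy(logits, targets)
```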
Robot vision-language navigation enters the R1 era! HKU and Shanghai AI Lab propose a brand-new embodied intelligence framework
量子位· 2025-06-25 00:33
Core Insights
- The article discusses the advancements in visual language navigation technology, specifically the VLN-R1 model developed by the University of Hong Kong and Shanghai AI Lab, which enables robots to navigate complex environments using natural language instructions without relying on discrete maps [1][3].

Group 1: Performance and Efficiency
- VLN-R1 demonstrates strong performance on the VLN-CE benchmark, surpassing the results of larger models with only a 2-billion-parameter model after RFT training [2].
- In long-distance navigation tasks, VLN-R1 showcases "cross-domain transfer," achieving superior performance with only 10,000 RxR samples after pre-training on R2R, highlighting its data efficiency [2][15].

Group 2: Innovation in Navigation
- The core challenge of visual language navigation (VLN) is to enable agents to autonomously complete navigation tasks based on natural language commands while integrating real-time visual perception [3].
- Traditional navigation systems rely on discrete topological maps, limiting their adaptability to complex environments and dynamic changes [4][5].

Group 3: Training Mechanisms
- VLN-R1 employs a two-stage training approach combining supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT) to enhance decision-making capabilities [7].
- The model utilizes group relative policy optimization (GRPO) to generate multiple action plans for the same instruction, optimizing strategies based on relative performance [7].
- A time decay reward (TDR) mechanism is introduced to prioritize immediate actions, ensuring the model focuses on current obstacles before planning future steps [8][9] (a minimal TDR/GRPO sketch follows this summary).

Group 4: Dataset and Memory Management
- The VLN-Ego dataset, created using the Habitat simulator, includes 630,000 R2R and 1.2 million RxR training samples, emphasizing first-person perspectives and real-time decision-making [12].
- A long-short term memory sampling strategy is implemented to balance recent experiences with long-term memory, allowing the model to respond effectively to sudden changes in the environment [14].

Group 5: Future Implications
- The research indicates that the key to embodied intelligence lies in creating a closed-loop learning system that mimics human perception, decision-making, and action [16].
- The framework's reproducibility and scalability are enhanced by the open availability of the VLN-Ego dataset and training methods, promoting the transition of AI from "digital intelligence" to "embodied cognition" across various applications [16].
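A minimal sketch of the TDR and GRPO ideas described in Group 3 follows. The decay factor, the exact-match reward, and the normalisation are assumptions for illustration; the paper's formulas may differ.

```python
import numpy as np

def time_decayed_reward(pred_actions, expert_actions, gamma=0.8):
    """Time Decay Reward (TDR) sketch: earlier steps in the predicted
    action sequence are weighted more heavily, so getting the immediate
    action right matters more than distant planning."""
    steps = min(len(pred_actions), len(expert_actions))
    if steps == 0:
        return 0.0
    weights = np.array([gamma ** t for t in range(steps)])
    hits = np.array([float(pred_actions[t] == expert_actions[t]) for t in range(steps)])
    return float((weights * hits).sum() / weights.sum())

def group_relative_advantages(rewards):
    """GRPO-style advantage: each sampled action plan for the same
    instruction is scored relative to the group mean (normalised by the
    group std), avoiding a learned value critic."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)
```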
博原资本 partners with 银河通用 to establish "博银合创", accelerating embodied AI's empowerment of industrial automation
投中网· 2025-06-18 02:21
Core Viewpoint
- The establishment of "博银合创" marks a significant step towards the industrialization of embodied artificial intelligence in China, aiming to enhance global smart manufacturing through collaboration and innovation [1][22].

Group 1: Company Formation and Objectives
- Bosch Group's investment platform, 博原资本, has partnered with leading Chinese embodied intelligence company 银河通用 to form a joint venture named "博银合创" [1].
- The new company will focus on complex assembly and intelligent quality inspection, developing agile robots to promote the large-scale implementation of embodied AI in industrial settings [1][9].
- 博银合创 aims to create a complete growth path from early incubation to independent financing and commercialization, establishing a globally competitive smart manufacturing enterprise [9][15].

Group 2: Market Potential and Technological Advancements
- According to the International Federation of Robotics (IFR), the global industrial robot market is expected to exceed $80 billion by 2025, with embodied intelligence-driven collaborative robots likely to capture over half of this market [5].
- Embodied AI integrates perception, cognition, and action capabilities, enabling robots to make autonomous decisions and execute tasks accurately in dynamic environments, thus driving the flexibility and intelligence of manufacturing [5][12].

Group 3: Strategic Collaborations and Innovations
- 博银合创 has signed a strategic cooperation memorandum with UAES to establish a joint laboratory, "RoboFab," focusing on pilot applications of embodied AI in manufacturing [19].
- The collaboration aims to bridge the gap between foundational research and industrial practice, accelerating the development of reliable and efficient smart robot solutions [20].
- 博原资本's "博原启世" platform will play a crucial role in supporting the joint venture by facilitating resource integration and market expansion [14][22].

Group 4: Future Directions and Global Strategy
- 博银合创 is positioned to explore a new paradigm of "global design, local manufacturing" in smart manufacturing, with plans for localized deployment in key manufacturing markets such as Europe, North America, and Southeast Asia [22].
- The company will continue to collaborate with industry partners to build an open and efficient industrial cooperation system, promoting the large-scale deployment of embodied AI in global manufacturing [22].
博原资本 establishes wholly-owned holding platform "博原启世": has already partnered with 银河通用 to establish "博银合创"
IPO早知道· 2025-06-18 01:26
Core Viewpoint
- The establishment of "博银合创" marks a significant step towards the industrialization of embodied artificial intelligence, focusing on complex manufacturing processes and the development of agile robots to enhance automation in the manufacturing sector [2][4][23].

Group 1: Company Initiatives
- 博原资本 has launched a wholly-owned platform, "博原启世," to strategically incubate and reconstruct the ecosystem of embodied artificial intelligence [2][12].
- A joint venture, "博银合创," has been formed with 银河通用 to focus on core manufacturing scenarios such as complex assembly and intelligent quality inspection [2][8].
- 博银合创 aims to create a complete growth path from early incubation to independent financing and commercialization, establishing a globally competitive intelligent manufacturing enterprise [9][14].

Group 2: Technological Advancements
- The global industrial robot market is projected to exceed $80 billion by 2025, with embodied intelligence-driven collaborative robots expected to capture over half of this market [4].
- 博银合创 will leverage 银河通用's self-developed simulation training and synthetic data technology to build a standardized, modular training and deployment system for rapid iteration and large-scale deployment of robotic products [8][12].
- The company is positioned to address key challenges in traditional automation, focusing on high-complexity manufacturing processes that require flexible and precise solutions [8][11].

Group 3: Strategic Collaborations
- 博银合创 has signed a strategic cooperation memorandum with UAES to establish a joint laboratory, "RoboFab," focusing on pilot applications of embodied artificial intelligence in typical manufacturing processes [19][20].
- 博原启世 will facilitate connections between cutting-edge technology companies and industrial resources, expanding collaborative practices to create a tailored network for embodied artificial intelligence [15][21].
- The OpenBosch innovation platform will play a crucial role in the global collaboration system of 博原启世, providing scenario matching and pilot support for incubation projects [21].

Group 4: Future Outlook
- 博原资本 plans to deepen its layout in key areas such as technology standards, production line modules, and data systems to promote localized deployment of embodied robots in major manufacturing markets like Europe, North America, and Southeast Asia [23][24].
- The future strategy includes building an open and efficient industrial cooperation system to facilitate the large-scale deployment of embodied artificial intelligence in global manufacturing [24].