World Models
I had decided to go into embodied intelligence, but now I'm having second thoughts...
自动驾驶之心· 2025-07-05 09:12
Core Insights
- The article discusses the evolving landscape of embodied intelligence, highlighting its transition from a period of hype to a more measured approach as the technology matures but has not yet reached the productivity stage [2].

Group 1: Industry Trends
- Embodied intelligence has gained significant attention over the past few years, but the industry now recognizes that it is still in the early stages of development [2].
- There is growing demand for skills in multi-sensor fusion and robotics, particularly SLAM and ROS, which are crucial for working in embodied intelligence [3][4].
- Many companies in the robotics sector are developing rapidly, with numerous startups receiving substantial funding, indicating a positive outlook for the industry in the coming years [3][4].

Group 2: Job Market and Skills Development
- The job market for algorithm positions is competitive, with a focus on cutting-edge technologies such as end-to-end models, VLA, and reinforcement learning [3].
- Candidates with a robotics background and a solid grasp of the latest technologies are likely to find opportunities, especially since traditional robotics remains a primary product line [4].
- The article encourages individuals to strengthen their technical skills in robotics and embodied intelligence to stay competitive in the job market [3][4].

Group 3: Community and Resources
- The article promotes a community platform offering resources for learning about autonomous driving and embodied intelligence, including video courses and job postings [5].
- The community aims to gather professionals and students interested in smart driving and embodied intelligence, fostering collaboration and knowledge sharing [5].
- The platform provides access to the latest industry trends, technical discussions, and job opportunities, making it a valuable resource for those entering or advancing in the field [5].
Think Before You Act: Embodied Intelligence Must Also Learn to Imagine the Future and Execute the Best Option | RSS 2025
机器之心· 2025-07-05 05:53
Core Viewpoint
- The article discusses the development of a new framework called FOREWARN, which combines world models and multimodal language reasoning to enhance the deployment-time intelligence of robotic systems, enabling them to make real-time decisions without additional data collection [5][21].

Group 1: Research Background
- The first author, Wu Yilin, is a second-year PhD student at Carnegie Mellon University focusing on object manipulation and lifelong learning in robotics [1].
- The second author, Tian Ran, is a PhD candidate at UC Berkeley and a research scientist at NVIDIA, working on the safe and reliable application of foundation models in robotics [2].

Group 2: Challenges in Deployment Intelligence
- Current embodied intelligence models often struggle in real-world deployments because they cannot adapt to environmental disturbances and variations in user preference, leading to execution failures [3][21].
- The two main deployment challenges are predicting the future consequences of actions and evaluating the predicted outcomes against task goals and user preferences [8][10].

Group 3: FOREWARN Framework
- The framework consists of two modules, Foresight (simulating future outcomes) and Forethought (evaluating those outcomes), enabling a more structured decision-making process [11].
- The system uses a world model to predict environmental changes for candidate actions and a fine-tuned multimodal language model to interpret these predictions semantically [12][18].

Group 4: Innovation Highlights
- The framework achieves cross-modal alignment between the world model's predictions and the language model's understanding, enabling closed-loop reasoning from perception to decision-making [18].
- FOREWARN automates decision-making, significantly lowering deployment barriers and labor costs by selecting optimal action plans in real time [19].
Group 5: Performance Evaluation
- The FOREWARN framework improved the success rate of robotic tasks from below 30% to 70%-80%, demonstrating its effectiveness in adapting to changing task instructions and user preferences [21].
- Even under varying conditions, the system maintained a success rate of 60%-80%, showing its robustness and adaptability [21].

Group 6: Future Directions
- The research team identifies three challenges for broader application: increasing the diversity and generalization of the underlying policies, addressing data scarcity, and optimizing reasoning efficiency and computational cost [23].
- Ongoing advances in multimodal language models and world models are expected to further improve robots' deployment intelligence, enabling them to autonomously select safe and reasonable operational plans from natural-language instructions [23].
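The Foresight/Forethought loop summarized above amounts to a generate-predict-evaluate-select procedure over candidate action plans. The sketch below is illustrative only: the callables `predict_future`, `describe_outcome`, and `score_outcome` are hypothetical placeholders standing in for the world model and the fine-tuned multimodal language model, not the paper's actual API.

```python
def select_best_action(candidates, observation, instruction,
                       predict_future, describe_outcome, score_outcome):
    """Pick the candidate action plan whose predicted outcome best
    matches the task instruction and user preferences."""
    best_plan, best_score = None, float("-inf")
    for plan in candidates:
        # Foresight: the world model rolls the plan forward
        # from the current observation.
        predicted_state = predict_future(observation, plan)
        # Forethought: a multimodal language model turns the prediction
        # into a semantic description, then scores it against the
        # instruction and preferences.
        outcome_text = describe_outcome(predicted_state)
        score = score_outcome(outcome_text, instruction)
        if score > best_score:
            best_plan, best_score = plan, score
    return best_plan
```

Because only the highest-scoring plan is executed, the robot can adapt to a new instruction at deployment time without collecting any additional training data.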
750 Cities and 5,000 Hours of First-Person Video: Shanghai AI Lab Open-Sources a High-Quality Video Dataset for World Exploration
量子位· 2025-07-05 04:03
Core Viewpoint
- The Sekai project aims to create a high-quality video dataset that serves as a foundation for interactive video generation, visual navigation, and video understanding, emphasizing the importance of high-quality data in building world models [1][2].

Group 1: Project Overview
- Sekai is a collaborative effort involving institutions such as Shanghai AI Lab, Beijing Institute of Technology, and the University of Tokyo, focusing on world exploration through a continuously iterated high-quality video dataset [2].
- The dataset includes over 5,000 hours of first-person walking and drone footage from more than 750 cities across 101 countries, with detailed labels such as text descriptions, location, weather, time, crowd density, scene type, and camera trajectory [2][10].

Group 2: Dataset Composition
- Sekai consists of two complementary datasets: Sekai-Real, real-world videos sourced from YouTube, and Sekai-Game, high-fidelity game footage [3].
- Sekai-Real was curated from over 8,600 hours of YouTube video, requiring a minimum resolution of 1080p, a frame rate above 30 FPS, and publication within the last three years [3][5].
- Sekai-Game was built from over 60 hours of gameplay in the high-fidelity game "Lushfoil Photography Sim", capturing realistic lighting effects and consistent image formats [3][9].

Group 3: Data Processing and Quality Control
- Data collection gathered 8,623 hours of video from YouTube and over 60 hours from games; preprocessing reduced this to 6,620 hours of Sekai-Real and 40 hours of Sekai-Game [5][6].
- Video annotation for Sekai-Real used large visual-language models for efficient labeling, and the dataset underwent rigorous quality control, including brightness assessment and video quality scoring [7][8].
- The final dataset features segments ranging from 1 minute to nearly 6 hours, with an average length of 18.5 minutes, and includes structured location information and detailed content classification [10].

Group 4: Future Goals
- The Sekai team aims to leverage the dataset to advance world modeling and multimodal intelligence, supporting applications in world generation, video understanding, and autonomous navigation [10].
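The Sekai-Real admission criteria summarized above (minimum 1080p resolution, frame rate above 30 FPS, publication within the last three years, plus automated quality scoring) can be sketched as a simple per-video filter. The `VideoMeta` schema and the `min_quality` threshold are assumptions for illustration; the project's actual curation tooling is not shown here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class VideoMeta:
    height: int            # vertical resolution in pixels
    fps: float             # frame rate
    published: datetime    # upload date
    quality_score: float   # hypothetical automated quality score in [0, 1]


def passes_sekai_real_filter(v: VideoMeta, now: datetime,
                             min_quality: float = 0.5) -> bool:
    """Apply the stated resolution, frame-rate, and recency criteria,
    plus an assumed quality-score cutoff."""
    recent = now - v.published <= timedelta(days=3 * 365)
    return (v.height >= 1080 and v.fps > 30
            and recent and v.quality_score >= min_quality)
```

A pipeline of this shape would be run over the raw 8,623 hours of collected footage before the finer-grained brightness and quality checks.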
New Survey: Learning Embodied Intelligence from Physical Simulators and World Models
具身智能之心· 2025-07-04 09:48
Core Insights
- The article surveys advances in embodied intelligence for robotics, emphasizing the integration of physical simulators and world models as crucial for developing robust embodied AI systems [4][6].
- It highlights a unified grading system for intelligent robots that categorizes their capabilities from basic mechanical execution to advanced social intelligence [6][67].

Group 1: Embodied Intelligence and Robotics
- Embodied intelligence is defined as the ability of robots to interact with the physical world, enabling perception, action, and cognition through physical feedback [6].
- Physical simulators provide a controlled environment for training and evaluating robotic agents, while world models enhance a robot's internal representation of its environment for better prediction and decision-making [4][6].
- The authors maintain a resource repository of the latest literature and open-source projects to support the development of embodied AI systems [4].

Group 2: Grading System for Intelligent Robots
- The proposed grading model includes five progressive levels (IR-L0 to IR-L4) assessing autonomy, task handling, and social interaction capabilities [6][67].
- Each level reflects the robot's ability to perform tasks, from complete reliance on human control (IR-L0) to fully autonomous social intelligence (IR-L4) [6][67].
- The grading system aims to provide a unified framework for evaluating and guiding the development of intelligent robots [6][67].

Group 3: Physical Simulators and World Models
- Physical simulators such as Isaac Sim use GPU acceleration for high-fidelity simulation, addressing data collection costs and safety issues [67].
- World models, such as diffusion models, enable internal representations for predictive planning, bridging the gap between simulation and real-world deployment [67].
- The article discusses the complementary roles of simulators and world models in enhancing robotic capabilities and operational safety [67].

Group 4: Future Directions and Challenges
- The future of embodied intelligence involves developing structured world models that integrate machine learning and AI to improve adaptability and generalization [68].
- Key challenges include high-dimensional perception, causal reasoning, and real-time processing, all of which must be addressed for effective deployment in complex environments [68].
- Advances in 3D structured modeling and multimodal integration will be critical for the next generation of intelligent agents [68].
Xiaomi Hiring (Experienced & Campus) | Algorithm Researcher, Autonomous Driving and Robotics Embodied Intelligence (VLA Direction)
具身智能之心· 2025-07-03 13:36
Job Description
We are looking for an outstanding researcher/scientist to join our frontier exploration team and help define and build the "brain" of next-generation autonomous driving and robotics. You will work on breakthrough research into an Embodied Foundation Model that deeply integrates vision-language-action (VLA) capabilities and possesses exceptional spatial perception and spatial reasoning.

Core responsibilities include:
- Frontier algorithm research and development: design and implement leading embodied multimodal large models. Your research will go beyond existing VLA frameworks to explore how to build a World Model capable of understanding the complex three-dimensional world and performing long-horizon, multi-step task planning.
- Core model capability breakthroughs: lead breakthroughs in the following key capabilities:
  - Multimodal scene understanding: fuse vision, language, radar, and other multi-source information to achieve deep understanding and spatial perception of dynamic, open environments.
  - Learning and adaptation mechanisms: research reinforcement learning (RL), imitation learning (IL), and self-supervised learning methods so that the model can continuously learn and evolve from massive data and interaction with the environment.
- Technical vision and roadmap: lead the construction of a generalizable, efficient embodied intelligence foundation model, providing core support for technology evolution over the next 1-3 years, and explore its potential for unified application across autonomous driving and general robotics.
- Complex semantic reasoning and decision-making: enable the model to understand vague, abstract human instructions and, combined with ...
A First: WorldVLA, a Fully Autoregressive Model That Fuses World Models and Action Models
机器之心· 2025-07-03 08:01
Core Viewpoint
- Alibaba's Damo Academy has introduced WorldVLA, a model that integrates a World Model and an Action Model into a unified autoregressive framework, enhancing understanding and generation across text, images, and actions [1][4].

Summary by Sections

Research Overview
- Vision-Language-Action (VLA) models have become a significant focus in robotic action modeling, typically built on large-scale pretrained multimodal language models (MLLMs) with added action-output capabilities [4].
- Existing VLA models often lack a deep understanding of actions, treating them merely as outputs rather than analyzing them as inputs [5].

Model Description
- WorldVLA addresses the limitations of both VLA models and world models by using a unified autoregressive mechanism for action and image understanding and generation [5][10].
- It employs three independent encoders for images, text, and action data, which share the same vocabulary to enable cross-modal tasks [12].

Mechanism and Strategy
- The World Model component generates visual representations from input actions, learning the physical dynamics of the environment, while the Action Model enhances visual understanding [7].
- An action attention masking strategy mitigates error accumulation when generating multiple actions, significantly improving performance in action-chunking tasks [8][14].

Experimental Results
- On the LIBERO benchmark, WorldVLA achieved a 4% improvement in grasp success rate over traditional action models and a 10% reduction in Fréchet Video Distance (FVD) compared to traditional world models [8].
- The attention mask strategy improved grasp success rates by 4% to 23% in action-chunking tasks [8].

Comparative Analysis
- WorldVLA outperformed other models across various metrics, demonstrating its effectiveness in integrating action and world modeling [18].
- The model's ability to generate the next frame based on actions and images showcases its advanced capabilities in visual prediction [24].
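The action attention masking strategy described above can be illustrated with a small mask-construction helper: each action token in a chunk attends to the vision/text prefix and to itself, but not to previously generated action tokens, which is one way to keep an early bad action from corrupting the rest of the chunk. The token layout and masking rule here are an illustrative sketch, not WorldVLA's exact implementation.

```python
import numpy as np


def action_chunk_mask(prefix_len: int, num_actions: int) -> np.ndarray:
    """Return a (T, T) boolean attention mask; True = attention allowed.

    The first `prefix_len` positions are vision/text tokens; the last
    `num_actions` positions are the action tokens of one chunk.
    """
    T = prefix_len + num_actions
    mask = np.tril(np.ones((T, T), dtype=bool))  # standard causal mask
    for i in range(prefix_len, T):
        # Action token i may see the prefix and itself, but is blocked
        # from attending to earlier action tokens in the chunk.
        mask[i, prefix_len:i] = False
    return mask
```

With this mask, each action in the chunk is conditioned only on the shared observation, so an error in one generated action cannot propagate through the attention of the actions that follow it.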
What Did the "Grandfather" of Chinese Cars Look Like? 70 Years of Change in the Blink of an Eye!
电动车公社· 2025-07-02 15:59
Core Viewpoint
- The article traces the evolution of the Chinese automotive industry, from manual craftsmanship to its position as the world's largest producer and exporter of automobiles, along with current advances in technology and culture within the sector [1].

Group 1
- The Beijing Automobile Museum serves as a platform for reflecting on the history of Chinese automotive development and its cultural roots [1].
- National-level models are significant for understanding the progress of China's automotive industry [1].
- The article also considers the future of new energy vehicles and the direction of automotive culture in China [1].

Group 2
- Recent vehicle launches are introduced, notably the Xiaopeng G7, indicating ongoing innovation in the market [3].
- New national battery standards are discussed, suggesting regulatory changes that could affect the industry [3].
- The article explores world models and the underlying logic of AI and intelligent driving, pointing to a shift toward advanced technology in automotive operations [3].
RoboScape: A Physics-Informed Embodied World Model with a 68.3% Improvement in Action Controllability
具身智能之心· 2025-07-02 10:18
Core Viewpoint
- The article discusses RoboScape, a physics-informed embodied world model that improves video generation quality by integrating physical knowledge into the modeling process, addressing limitations of existing models in physical perception and object manipulation [4][23].

Research Background and Core Issues
- Existing embodied intelligence models show significant limitations in physical perception, particularly in contact-rich robot scenarios, leading to unrealistic object deformation and motion discontinuities [4].
- Current attempts to integrate physical knowledge fall into three categories: physical prior regularization, knowledge distillation from physics simulators, and material field modeling, each with its own limitations [4].

Core Method
- The core idea is to learn an embodied world model as a dynamics function that predicts the next visual observation from past observations and robot actions [5].

Robot Data Processing Pipeline
- A four-step processing pipeline builds a multimodal dataset with physical priors on top of the AGIBOT-World dataset [6].

RoboScape Model Architecture
- The model uses an autoregressive Transformer framework for controllable robot video generation, integrating physical knowledge through two auxiliary tasks [8].

Temporal Depth Prediction
- A temporal depth prediction branch enhances 3D geometric consistency, using a dual-branch cooperative autoregressive Transformer [10].

Adaptive Keypoint Dynamics Learning
- The model uses self-supervised tracking of contact-driven keypoints to implicitly encode material properties, adapting to the most active keypoints based on motion amplitude [11].

Joint Training Objective
- The overall training objective integrates several loss functions to balance the contributions of the different components [13].
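A joint objective of the kind summarized above (a video-prediction loss plus weighted auxiliary physics-informed terms for depth and keypoint dynamics) can be sketched as a simple weighted sum. The weights and term names below are illustrative assumptions, not RoboScape's published values.

```python
def joint_loss(video_loss: float, depth_loss: float, keypoint_loss: float,
               w_depth: float = 0.5, w_kp: float = 0.25) -> float:
    """Combine the main video-prediction loss with the two auxiliary
    physics-informed losses. The weights w_depth and w_kp are
    hypothetical hyperparameters balancing the auxiliary tasks."""
    return video_loss + w_depth * depth_loss + w_kp * keypoint_loss
```

In practice each term would be a differentiable tensor rather than a float, but the balancing structure is the same: the auxiliary tasks shape the shared representation without dominating the primary video-generation objective.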
Experimental Validation
- The model is evaluated along three dimensions: appearance fidelity, geometric consistency, and action controllability, showing superior results compared to baseline models [15].

Dataset and Implementation Details
- The dataset comprises 50,000 video segments covering 147 tasks and 72 skills; training used 32 NVIDIA A800 GPUs over five epochs [16].

Downstream Application Validation
- In robot policy training, the model's performance is close to that of training on real data, indicating the effectiveness of synthetic data for complex tasks [19].

Conclusion and Future Plans
- RoboScape integrates physical knowledge into video generation without relying on external physics engines; the team plans to combine generative world models with real robots for further validation in practical scenarios [23][24].
"Commercialization in Three Years": How Can Ha Luo Make Robotaxi Work?
Core Insights
- The article discusses the competitive landscape of the Robotaxi industry, highlighting the shift from technology development to commercialization and scaling [1].
- Ha Luo's entry into the Robotaxi market is backed by its user data and local operational experience, as well as a major investment partnership with Ant Group and CATL [2][6].
- The company aims to achieve commercialization within three years, focusing first on the domestic market before expanding internationally [9][15].

Company Strategy
- Ha Luo plans a differentiated competition strategy, building a multi-layered, accessible operational platform that integrates various car manufacturers and technology partners [4].
- The platform will allow resource sharing among partners, reducing operational costs and lowering the barriers for cities to introduce Robotaxi services [4].
- The company emphasizes data acquisition, particularly long-tail data, to enhance model training for autonomous driving [5].

Investment and Partnerships
- The joint venture with Ant Group and CATL involves an initial investment of over 3 billion yuan, aimed at advancing L4 autonomous driving technology [2][6].
- Ant Group will contribute AI infrastructure and algorithm research, while CATL will provide battery technology and operational support [7].

Technical Development
- Ha Luo acknowledges the challenges of developing L4 technology, particularly acquiring functional cases and long-tail data [9].
- The company is exploring a dual approach, combining AI-driven methods with traditional sensor technologies such as LiDAR for greater reliability [13][14].

Market Positioning
- The company positions itself as a latecomer with unique advantages, leveraging the industry's maturity to make targeted investments [3].
- Ha Luo aims to create a commercially viable L4 product that is not only technologically sound but also economically feasible for consumers [8][12].