DreamVLA

A look at DreamVLA: letting robots look first, think, then act
具身智能之心· 2025-08-11 00:14
Core Viewpoint
- The article introduces DreamVLA, a new Vision-Language-Action model that enhances robotic decision-making by integrating comprehensive world knowledge, allowing robots to predict dynamic environments and make more accurate action decisions [1][27].

Group 1: Background and Need for Advanced VLA Models
- Traditional VLA models map visual inputs and language commands directly to actions, which can lead to interference from irrelevant information in complex environments [3][5].
- DreamVLA addresses this by adding a layer of "thinking": before planning actions it predicts world knowledge, including dynamic areas, depth information, and semantic features [5][27].

Group 2: Model Architecture and Functionality
- DreamVLA operates on a "perception-prediction-action" cycle, treating the task as an inverse dynamics problem: the actions needed to reach a predicted future state are derived from that prediction [7][27].
- The model processes three types of inputs, visual images, language commands, and the robot's own state, using a dedicated encoder for each [10][14].

Group 3: World Knowledge Prediction
- Rather than predicting actions directly, DreamVLA first predicts world knowledge comprising dynamic areas, depth maps, and semantic features [11][18].
- Dynamic-area prediction uses CoTracker to identify moving objects and generate masks that highlight task-relevant areas while filtering out the static background (a simplified sketch of this masking idea appears after this summary) [12][15].
- Depth prediction estimates the spatial relationships of objects, generating depth maps that assist in obstacle avoidance [13][17].
- Semantic prediction employs DINOv2 and SAM to extract high-level semantic information, which is then encoded into a unified "world embedding" used for action generation [18][22].

Group 4: Action Generation
- The action-generation component uses a diffusion Transformer to produce future action sequences from the latent action embedding derived from the multi-modal inputs [23][27].
- A structured attention mechanism keeps multi-step action reasoning coherent and prevents cross-modal knowledge leakage (a minimal sketch of such a block mask also appears after this summary) [19][31].

Group 5: Performance and Validation
- DreamVLA achieved an average task-completion length of 4.44 on the CALVIN ABC-D benchmark, outperforming previous methods by 3.5%, and reached a 76.7% success rate on real-world tasks [25][27].
- Ablation studies confirmed the contributions of the individual components, demonstrating the model's robustness and generalization capabilities [25][31].
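To make the dynamic-area idea concrete, here is a minimal sketch that thresholds per-pixel motion magnitude between consecutive frames into a binary mask, keeping moving regions and suppressing the static background. DreamVLA itself derives these masks from CoTracker point tracks; the generic flow field and the `motion_mask` helper below are simplified assumptions for illustration, not the paper's implementation.

```python
import torch

def motion_mask(flow: torch.Tensor, threshold: float = 1.0) -> torch.Tensor:
    """Binarize a dense motion field into a dynamic-area mask.

    flow: (B, 2, H, W) per-pixel displacement between consecutive frames
          (e.g. from point tracks or an optical-flow model).
    Returns a (B, 1, H, W) float mask: 1 for moving pixels, 0 for static background.
    """
    magnitude = torch.linalg.vector_norm(flow, dim=1, keepdim=True)  # (B, 1, H, W)
    return (magnitude > threshold).float()

# Example: a toy flow field in which only the top-left quadrant moves.
flow = torch.zeros(1, 2, 8, 8)
flow[:, :, :4, :4] = 2.0
mask = motion_mask(flow)
print(mask[0, 0])  # 1s in the top-left quadrant, 0s elsewhere
```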
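The structured attention can likewise be sketched as a block attention mask: dynamic, depth, and semantic queries all attend to the shared multi-modal context but are blocked from attending to each other, so knowledge of one type does not leak into another. The token layout, group sizes, and the `build_block_attention_mask` helper below are illustrative assumptions rather than the model's exact configuration.

```python
import torch

def build_block_attention_mask(n_ctx: int, n_dyn: int, n_depth: int, n_sem: int) -> torch.Tensor:
    """Boolean mask (True = attention allowed) over a sequence laid out as
    [shared context | dynamic queries | depth queries | semantic queries].

    Every token may attend to the shared context; each query group may also attend
    within itself, but never to the other query groups, keeping knowledge types separate.
    """
    total = n_ctx + n_dyn + n_depth + n_sem
    allowed = torch.zeros(total, total, dtype=torch.bool)
    allowed[:, :n_ctx] = True                           # everyone sees the shared context
    bounds = [n_ctx, n_ctx + n_dyn, n_ctx + n_dyn + n_depth, total]
    for start, end in zip(bounds[:-1], bounds[1:]):
        allowed[start:end, start:end] = True            # within-group attention only
    return allowed

allowed = build_block_attention_mask(n_ctx=6, n_dyn=2, n_depth=2, n_sem=2)
# torch.nn.MultiheadAttention expects True = "masked out", so pass attn_mask=~allowed.
```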
DreamVLA: the world's first "world knowledge prediction" VLA model, with a manipulation success rate of nearly 80%
具身智能之心· 2025-07-10 13:16
Core Insights
- The article discusses the potential of Vision-Language-Action (VLA) models to enhance robotic manipulation by integrating image generation and action prediction, and highlights the limitations of existing methods in forming a closed perception-prediction-action loop [3][16].
- DreamVLA is introduced as a model that predicts comprehensive world knowledge to improve robotic performance, focusing on dynamic areas, depth perception, and high-level semantic features [4][5][16].

Research Background and Motivation
- Current VLA models are limited by full-image prediction, which leads to information redundancy and a lack of critical world knowledge such as dynamic, spatial, and semantic understanding [3].
- DreamVLA aims to build a more effective perception-prediction-action loop by predicting comprehensive world knowledge, thereby improving how robots interact with their environment [3].

Model Design Core Ideas
- DreamVLA focuses on three core features essential for task execution: dynamic-area prediction, depth perception, and high-level semantic features [4][5].
- Dynamic-area prediction uses optical-flow models to identify moving regions in a scene, focusing the model on task-critical areas [4].
- Depth perception is obtained through depth-estimation algorithms, providing 3D spatial context, while high-level semantic features are integrated from several vision models to enhance understanding of future states [5].

Structural Attention and Action Generation
- A block-structured attention mechanism separates queries into dynamic, depth, and semantic sub-queries, preventing cross-type knowledge leakage and keeping the representations disentangled [6].
- A diffusion Transformer decoder separates action representations from the shared latent features, transforming Gaussian noise into action sequences through iterative self-attention and denoising (see the sampling sketch after this summary) [8].

Experimental Results and Analysis
- In benchmark tests, DreamVLA achieved an average task length of 4.44, outperforming methods such as RoboVLM and Seer [9][10].
- Real-world experiments with the Franka Panda robotic arm showed an average success rate of 76.7%, significantly higher than the baseline models [10].

Ablation Study Insights
- Analyzing the contribution of each knowledge type showed that dynamic-area prediction provided the largest performance gain, while depth and semantic cues offered smaller but still valuable improvements [11].
- Predicting future knowledge outperformed merely reconstructing current observations, indicating that prediction provides better guidance for actions [12].
- The block-structured attention mechanism improved average task length from 3.75 to 4.44, demonstrating its effectiveness in reducing cross-signal interference [13].

Core Contributions and Limitations
- DreamVLA recasts VLA models into a perception-prediction-action framework, giving the planner comprehensive foresight through predicted dynamic, spatial, and high-level semantic information [16].
- The model is currently limited to parallel-gripper manipulation and relies on RGB data; future work plans to incorporate more diverse data types and improve generalization and robustness [15][16].
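The action decoder described above can be approximated by a standard DDPM-style sampling loop: starting from Gaussian noise, a small Transformer denoiser conditioned on the latent embedding is applied iteratively to recover an action chunk. The `ActionDenoiser` module, its dimensions, and the linear noise schedule below are illustrative assumptions; DreamVLA's actual diffusion Transformer and schedule may differ.

```python
import torch
import torch.nn as nn

class ActionDenoiser(nn.Module):
    """Toy stand-in for the diffusion Transformer: predicts the noise added to an action chunk."""
    def __init__(self, action_dim=7, cond_dim=512, d_model=256, steps=50):
        super().__init__()
        self.in_proj = nn.Linear(action_dim, d_model)
        self.cond_proj = nn.Linear(cond_dim, d_model)
        self.time_emb = nn.Embedding(steps, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.out_proj = nn.Linear(d_model, action_dim)

    def forward(self, noisy_actions, t, cond):
        # noisy_actions: (B, horizon, action_dim), t: (B,), cond: (B, cond_dim)
        h = self.in_proj(noisy_actions) + self.cond_proj(cond)[:, None] + self.time_emb(t)[:, None]
        return self.out_proj(self.encoder(h))

@torch.no_grad()
def sample_actions(model, cond, horizon=8, action_dim=7, steps=50):
    """DDPM-style ancestral sampling: iteratively denoise Gaussian noise into an action sequence."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(cond.shape[0], horizon, action_dim)        # start from pure noise
    for t in reversed(range(steps)):
        t_batch = torch.full((cond.shape[0],), t, dtype=torch.long)
        eps = model(x, t_batch, cond)                           # predicted noise
        mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x

cond = torch.randn(1, 512)                        # latent embedding from the VLA backbone (assumed size)
actions = sample_actions(ActionDenoiser(), cond)  # (1, 8, 7): an 8-step, 7-DoF action chunk
```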