LeCun's world model now runs on a single GPU
量子位· 2026-03-24 04:59
Core Insights
- The article covers the newly open-sourced LeWorldModel (LeWM), a LeCun-style world model with a greatly simplified training recipe that runs on a single GPU and plans in about one second [1][2].

Group 1: Model Architecture and Training
- LeWorldModel (LeWM) is based on the JEPA architecture and takes raw pixel input to predict future states at high speed [2][3].
- The model simplifies the JEPA recipe: an encoder converts images into latent features, a predictor forecasts the next features from the current features and an action, and Gaussian regularization prevents representational collapse [6][11].
- The architecture has two core components: an encoder that compresses each image into a short latent feature vector, and a predictor that estimates the next features from the current features and the intended action [7][8].

Group 2: Performance Metrics
- LeWM achieves a 96% success rate on the Push-T task, 18% higher than the earlier PLDM method, and even surpasses DINO-WM given proprioceptive (body-state) input [17].
- On the Reacher task, LeWM outperforms PLDM and is comparable to DINO-WM; on OGBench-Cube it remains competitive while slightly trailing DINO-WM [17].
- Planning is about 48 times faster than DINO-WM: under one second per task versus roughly 47 seconds for DINO-WM [19][20].

Group 3: Loss Functions and Training Simplification
- The key innovation of LeWM is that it uses only two loss functions: a prediction loss that pushes the predictor to match the next frame's features, and a regularization loss that enforces a standard Gaussian distribution on the feature vectors to prevent model collapse [11][12].
- The total loss is the prediction loss plus a weighted regularization loss; the regularization weight is the only hyperparameter that needs tuning, which greatly simplifies training [13].

Group 4: Experimental Results and Insights
- Experiments show that LeWM outperforms the earlier end-to-end JEPA method (PLDM) and matches or exceeds DINO-WM, while being easier to train, faster, and smaller in parameter count [14].
- The model captures the core structure and dynamics of the environment, accurately predicting object motion and flagging "physically impossible" scenarios [24][25].
- Under visual versus physical disturbances the model reacts differently: it registers surprise at physics violations but stays indifferent to mere color changes [26][28].
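The encoder/predictor split and the two-loss objective described above can be sketched as follows. This is a minimal illustrative sketch in PyTorch, not the released LeWM code: the network sizes, the `lewm_loss` name, the detached target, and the moment-matching form of the Gaussian regularizer are all assumptions. The article only specifies that an encoder maps pixels to latents, a predictor maps (latent, action) to the next latent, and the total loss is the prediction loss plus a single weighted regularization term.

```python
# Minimal JEPA-style latent world model sketch (illustrative, not official LeWM code).
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Compress an image into a short latent feature vector."""
    def __init__(self, latent_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim),
        )

    def forward(self, img):
        return self.net(img)

class Predictor(nn.Module):
    """Predict the next latent features from the current features and an action."""
    def __init__(self, latent_dim=64, action_dim=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, z, action):
        return self.net(torch.cat([z, action], dim=-1))

def lewm_loss(encoder, predictor, obs, action, next_obs, reg_weight=0.1):
    """Total loss = prediction loss + reg_weight * Gaussian regularization.

    reg_weight is the single tunable hyperparameter; the moment-matching
    regularizer (zero mean, unit variance per batch) is one plausible way
    to enforce a standard Gaussian on the features -- an assumption here.
    """
    z = encoder(obs)
    z_next = encoder(next_obs)
    z_pred = predictor(z, action)
    pred_loss = ((z_pred - z_next.detach()) ** 2).mean()
    # Push batch statistics toward a standard normal to prevent collapse.
    reg_loss = (z.mean(0) ** 2).mean() + ((z.var(0) - 1.0) ** 2).mean()
    return pred_loss + reg_weight * reg_loss
```

With only these two terms, training reduces to picking `reg_weight`, which matches the article's claim that a single hyperparameter needs tuning.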
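The sub-second planning claim is plausible because rollouts happen entirely in latent space, with no decoding back to pixels. The article does not specify LeWM's planner, so the random-shooting scheme below, with hypothetical placeholder networks standing in for the trained encoder and predictor, is purely an illustration of latent-space planning, not the method's actual search procedure.

```python
# Illustrative random-shooting planner in latent space (the planner is an
# assumption; the article does not describe LeWM's actual search procedure).
import torch
import torch.nn as nn

# Hypothetical stand-ins for a trained encoder and predictor.
LATENT_DIM, ACTION_DIM = 64, 2
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, LATENT_DIM))
predictor_core = nn.Linear(LATENT_DIM + ACTION_DIM, LATENT_DIM)

def predictor(z, action):
    """One latent-space step: next features from current features + action."""
    return predictor_core(torch.cat([z, action], dim=-1))

def plan(obs, goal_obs, horizon=5, n_samples=256):
    """Sample candidate action sequences, roll each one out entirely in
    latent space, and return the first action of the sequence whose final
    latent lands closest to the goal latent."""
    with torch.no_grad():
        z_goal = encoder(goal_obs)                   # (1, D)
        actions = torch.randn(n_samples, horizon, ACTION_DIM)
        z = encoder(obs).expand(n_samples, -1)       # (N, D)
        for t in range(horizon):
            z = predictor(z, actions[:, t])          # never decode to pixels
        cost = ((z - z_goal) ** 2).sum(dim=-1)       # latent distance to goal
        return actions[cost.argmin(), 0]             # best first action
```

Because each rollout step is a single small forward pass over a batch of candidates, hundreds of sequences can be evaluated in milliseconds, which is consistent with the reported 48x speedup over DINO-WM's roughly 47-second planning.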