Packing a "running code world" into AI: Meta open-sources its first Code World Model, teaching AI to think like a programmer

Core Insights
- Meta's FAIR team has launched the Code World Model (CWM), a 32-billion-parameter large language model (LLM) with a context length of up to 131k tokens, aimed at integrating "world model" concepts into code generation and reasoning [1][2][3]
- CWM is designed not only to write code but also to simulate code execution, reason about program state, and detect and fix its own bugs, deepening the model's understanding of how code runs [2][3]

Training Phases
- The training of CWM is divided into three main phases:
  - Pre-training on 8 trillion tokens, of which roughly 30% are code-related [3][4]
  - Mid-training on 5 trillion tokens of world-modeling data, which also extends the context length to 131k tokens [4][6]
  - Post-training (SFT + RL): 100 billion tokens of supervised fine-tuning for instruction following and reasoning, followed by large-scale multi-task reinforcement learning over 172 billion tokens [4][10]

Data Utilization
- CWM's world-model capabilities are driven by two main types of mid-training data:
  - Python execution traces, which teach the model how executing code step by step changes local state [6][8]
  - Interaction trajectories from an automated agent that carries out tasks inside repositories, totaling about 3 million trajectories collected from 10.2k executable images and 3.15k repositories [9]

Performance Metrics
- In benchmarks, CWM posted strong results: 65.8% pass@1 on SWE-bench Verified with test-time scaling enabled, plus notable scores on LiveCodeBench (68.6%), Math-500 (96.6%), and AIME 2024 (76.0%) [10][12]
- CWM is competitive with larger or closed-source LLMs, approaching GPT-4-level performance, though it has limitations in certain editing formats and multi-language scenarios [12]

Industry Reception
- The release has drawn significant attention, with Meta's AI researchers actively promoting it and highlighting its potential impact on software development [13][15]
- Open-sourcing CWM's training checkpoints is praised as valuable for academic and engineering replication, though concerns remain about the model's computational demands and the need for testing in real development environments [15]
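The Python execution traces described above pair each executed step with the resulting local variable state. The exact trace format CWM trains on is not reproduced here, but the idea can be sketched with Python's standard `sys.settrace` hook, which fires before each source line runs (the function and variable names below are illustrative, not from the CWM paper):

```python
import sys

steps = []  # list of (line number, snapshot of local variables)

def trace_locals(frame, event, arg):
    # "line" events fire just before each source line executes,
    # so each snapshot shows the locals as they stood at that point.
    if event == "line":
        steps.append((frame.f_lineno, dict(frame.f_locals)))
    return trace_locals  # keep tracing inside this frame

def target():
    x = 1
    y = x + 2
    x = y * 2

sys.settrace(trace_locals)
target()
sys.settrace(None)

for lineno, local_vars in steps:
    print(lineno, local_vars)
```

Serialized (line, state) pairs like these let a model learn to predict how each statement transforms the program state, rather than treating code as static text.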
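The pass@1 metric quoted in the benchmark results is commonly computed with the unbiased pass@k estimator introduced in the Codex paper (Chen et al., 2021); whether CWM's evaluation uses exactly this estimator is an assumption here. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimate of the probability that at least one of k
    # samples is correct, given that c of n generated samples passed.
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# For k=1 this reduces to the fraction of correct samples:
print(pass_at_k(10, 3, 1))  # ~0.3
```

Averaging this quantity over all benchmark problems gives the reported percentage; k=1 simply measures how often a single sampled solution passes the tests.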