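Top Open-Source Model for Industrial Code: Beihang University Team Generates 2.5 Million Verified Samples in Real Simulation Environments to Fix General Models' Poor Fit for Industrial Coding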

Core Insights
- The article argues that general-purpose code models struggle with industrial code because they lack an understanding of hardware semantics and of the language constructs specific to industrial programming [3][6][10].

Group 1: InCoder-32B Model Overview
- InCoder-32B is the first code foundation model designed specifically for industrial code. It is a 32-billion-parameter decoder-only Transformer [9][10].
- The model aims to unify multiple industrial code domains, including chip design, GPU kernel optimization, embedded systems, compiler optimization, and 3D modeling, while remaining competitive on general coding tasks [10][22].

Group 2: Data Production and Validation
- The model was trained on 2.5 million execution-validated samples, produced through a four-step process: task construction, candidate generation, execution validation, and feedback-driven repair [16][19].
- To ensure high-quality training data, the team built four types of industrial simulation environments that replicate the actual tools and execution semantics industrial engineers use [13][14][15].

Group 3: Training Methodology
- InCoder-32B is trained in three progressive stages: pre-training on 15 trillion tokens, mid-training to extend the context length, and specialized training on the validated industrial code data [22][25].

Group 4: Model Performance
- InCoder-32B achieved significant gains on industrial code benchmarks, with a VeriScope score of 80.7 and a fix rate of 80.0% in chip design [25].
- The model also performed strongly on general coding benchmarks, scoring 94.5% on HumanEval and 91.8% on MBPP [25].
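The four-step data process summarized under Group 2 (task construction, candidate generation, execution validation, feedback-driven repair) can be illustrated with a minimal Python sketch. This is not the team's implementation, which the article does not show. The names `Task`, `Sample`, `produce_validated_samples`, `toy_generate`, and `toy_validate` are hypothetical, and the toy validator simply executes a Python snippet where the real pipeline would call an industrial simulator or toolchain.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Task:
    prompt: str  # step 1: a constructed task description


@dataclass
class Sample:
    task: Task
    code: str
    repair_rounds: int  # how many repair attempts were needed


def produce_validated_samples(
    tasks: List[Task],
    generate: Callable[[Task, Optional[str]], str],
    validate: Callable[[Task, str], Tuple[bool, str]],
    max_repairs: int = 2,
) -> List[Sample]:
    """Keep only candidates that pass execution validation."""
    accepted: List[Sample] = []
    for task in tasks:
        feedback: Optional[str] = None
        for round_idx in range(max_repairs + 1):
            code = generate(task, feedback)      # step 2: candidate generation
            ok, feedback = validate(task, code)  # step 3: execution validation
            if ok:
                accepted.append(Sample(task, code, round_idx))
                break
            # step 4: the validator's feedback drives the next attempt
    return accepted


def toy_generate(task: Task, feedback: Optional[str]) -> str:
    # Hypothetical stand-in for a model call: the first attempt has a bug,
    # and any attempt that receives feedback returns the corrected code.
    if feedback is None:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"


def toy_validate(task: Task, code: str) -> Tuple[bool, str]:
    # Hypothetical stand-in for a simulator run: execute the candidate
    # and check one known input/output pair.
    namespace: dict = {}
    try:
        exec(code, namespace)
        result = namespace["add"](2, 3)
    except Exception as exc:
        return False, f"execution error: {exc!r}"
    if result != 5:
        return False, f"expected 5, got {result}"
    return True, ""
```

Calling `produce_validated_samples([Task("write add(a, b)")], toy_generate, toy_validate)` returns one sample that needed a single repair round. Samples that never pass validation within `max_repairs` are dropped rather than kept as training data, which is the point of execution validation in this sketch.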

Group 5: Error Analysis
- A systematic analysis of 1,882 failure samples identified five core issues: compilation and syntax errors, insufficient industrial API knowledge, functional correctness issues, output format violations, and performance optimization shortcomings [26][28].
- The most common failure type was compilation and syntax errors, particularly in chip design, where 71% of failures were attributed to format errors and mismatched port declarations [27].

Group 6: Open Source Information
- The model and its code have been open-sourced on Hugging Face and GitHub under the Apache 2.0 license, promoting accessibility and collaboration within the community [29].