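VinciCoder: A Unified Multimodal Code Generation Framework with Visual-Feedback Reinforcement Learning; Data, Code, and Model Weights Open-Sourced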

Core Insights
- The article discusses the limitations of traditional supervised fine-tuning (SFT) in multimodal code generation and introduces VinciCoder, a unified model that leverages visual reinforcement learning (ViRL) to enhance visual fidelity and code executability [2][6][22]
- VinciCoder employs a two-phase strategy combining large-scale SFT with coarse-to-fine ViRL to address the challenges faced by existing models in generating diverse code from various visual inputs [2][7][22]

Limitations of Traditional SFT
- Traditional SFT suffers from a "visual gap" between training objectives and final tasks, leading to issues such as local optimization that fails to ensure global code executability and a lack of visual feedback during training [6][13]
- The absence of visual feedback is critical, as minor code modifications can lead to significant changes in rendered images, highlighting the need for a mechanism that provides global visual feedback [6][7]

VinciCoder's Approach
- VinciCoder's innovation lies in shifting the reward mechanism from the text domain to the visual domain, utilizing a large-scale SFT to build foundational code capabilities, followed by a ViRL phase to optimize visual fidelity and executability [7][12]
- The training framework consists of a "1.6M large-scale SFT phase" and a "42k coarse-to-fine ViRL phase," enabling strong code understanding and high-fidelity visual alignment [7][12]

Large-Scale SFT and Code Optimization
- The research team created a large-scale SFT corpus containing 1.6 million image-code pairs, which includes a new task of "visual code optimization" where the model corrects defective code to align with target images [10][12]

Coarse-to-Fine ViRL Framework
- VinciCoder introduces a coarse-to-fine visual reward mechanism that directly derives reward signals from visual outputs, addressing the lack of "visual-code" feedback in traditional SFT [12][14]
- The framework evaluates visual similarity at both global (coarse) and local (fine) levels, enhancing the model's ability to generate accurate code [14]

Experimental Results
- VinciCoder demonstrated superior performance across multiple multimodal code generation benchmarks, outperforming both open-source and closed-source models, establishing new state-of-the-art (SOTA) standards [16][18]
- The model's performance in challenging tasks, such as Image-to-SVG and chemical formula generation, rivals that of top closed-source models, showcasing its effectiveness [16][18]

Research Significance and Future Applications
- The research presents a new paradigm for multimodal code generation, emphasizing the importance of visual feedback in guiding code generation processes [19][20]
- VinciCoder's success illustrates the potential of reinforcement learning to bridge the gap between visual and code modalities, paving the way for future developments in generalized multimodal intelligence [20][22]
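
To make the coarse-to-fine visual reward described above more concrete, the following is a minimal Python sketch of how a reward could be computed from rendered output rather than from code text. It is not VinciCoder's implementation: the article does not state which similarity metric, patch layout, or weighting is used, so the pixel-level cosine similarity, the `grid`, `factor`, and `w_coarse` parameters, and all function names here are illustrative assumptions. The only ideas taken from the article are that code which fails to render earns no visual reward, and that similarity is scored at both a global (coarse) and a local (fine) level.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity of two arrays, flattened to vectors (assumed metric)."""
    a = a.astype(np.float64).ravel()
    b = b.astype(np.float64).ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        # At least one vector is all zeros: treat identical inputs as a match.
        return 1.0 if np.array_equal(a, b) else 0.0
    return float(np.dot(a, b) / denom)


def downsample(img, factor):
    """Average-pool an (H, W) or (H, W, C) image by an integer factor."""
    h, w = img.shape[:2]
    h2, w2 = h - h % factor, w - w % factor
    img = img[:h2, :w2].astype(np.float64)
    blocks = img.reshape(h2 // factor, factor, w2 // factor, factor, -1)
    return blocks.mean(axis=(1, 3))


def split_patches(img, grid):
    """Split an image into grid x grid non-overlapping patches."""
    h, w = img.shape[:2]
    ph, pw = h // grid, w // grid
    return [
        img[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw]
        for r in range(grid)
        for c in range(grid)
    ]


def coarse_to_fine_reward(rendered, target, grid=4, factor=8, w_coarse=0.5):
    """Hypothetical reward: weighted mix of global and patch-level similarity.

    `rendered` is None when the generated code failed to execute or render.
    """
    if rendered is None or rendered.shape != target.shape:
        return 0.0  # assumption: non-executable or mismatched output earns nothing
    # Coarse level: compare heavily downsampled versions of the whole image.
    coarse = cosine_similarity(downsample(rendered, factor),
                               downsample(target, factor))
    # Fine level: compare corresponding local patches and average the scores.
    fine = float(np.mean([
        cosine_similarity(p, q)
        for p, q in zip(split_patches(rendered, grid),
                        split_patches(target, grid))
    ]))
    return w_coarse * coarse + (1.0 - w_coarse) * fine
```

In an RL loop, each sampled program would be rendered and scored against the target image with a function like `coarse_to_fine_reward`. The fine term is what lets a small local defect lower the reward even when the overall layout matches, which is the kind of signal the article says token-level SFT objectives lack.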