
ByteDance Seed open-sources its first code model, taking multiple SOTA results at its scale and proposing a paradigm of using small models to manage data

量子位 · 2025-05-11 04:20
Core Viewpoint
- ByteDance's Seed has released Seed-Coder, an 8-billion-parameter code generation model that surpasses Qwen3 and achieves multiple state-of-the-art (SOTA) results across benchmarks [1][7].

Model Overview
- Seed-Coder ships in three versions: Base, Instruct, and Reasoning [6].
- The model has a 32K context length, was trained on 6 trillion tokens, and is released under the permissive MIT open-source license [10].

Data Management and Processing
- Seed-Coder employs a "model-centered" data-processing approach, using models themselves to curate the training data [12].
- The data filtering pipeline has several stages, including deduplication with SHA256 and MinHash algorithms, which reduced the original data volume by approximately 98% [15][16].
- A scoring model trained on over 220,000 code documents filters out low-quality code files, yielding a corpus that supports 89 programming languages and contains around 1 trillion unique tokens [19].

Data Sources
- Seed-Coder collected 74 million commit records from 140,000 high-quality GitHub repositories, selected for at least 100 stars, 10 forks, 100 commits, and 100 days of maintenance activity [21].
- The model also extracts data from web archives, distinguishing two types of raw data: HTML pages with explicit code tags and pages without them, applying both exact and approximate deduplication techniques [27][28].

Pre-training Phases
- Pre-training is divided into two phases: conventional pre-training on file-level code and code-related web data, followed by continued pre-training that incorporates all data categories along with high-quality datasets to enhance performance [34][35].

Model Variants and Innovations
- Two special variants of Seed-Coder have been developed to further expand its utility [36].
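The two-stage deduplication mentioned above (exact matching via SHA256, near-duplicate detection via MinHash) can be sketched roughly as follows. This is a minimal illustration of the general technique, not Seed-Coder's actual pipeline; production systems use tuned shingling, many hash permutations, and LSH banding to scale.

```python
import hashlib

def sha256_key(text: str) -> str:
    # Exact-duplicate key: SHA256 of whitespace-normalized content.
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def minhash_signature(text: str, num_perm: int = 64) -> tuple:
    # Toy MinHash over 5-gram character shingles; each "permutation"
    # is simulated by seeding Python's hash with a different integer.
    shingles = {text[i:i + 5] for i in range(max(1, len(text) - 4))}
    return tuple(
        min(hash((seed, s)) & 0xFFFFFFFF for s in shingles)
        for seed in range(num_perm)
    )

def jaccard_estimate(sig_a, sig_b) -> float:
    # Fraction of matching signature slots estimates Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def dedup(docs, near_threshold: float = 0.85):
    seen_exact, kept, kept_sigs = set(), [], []
    for doc in docs:
        key = sha256_key(doc)
        if key in seen_exact:
            continue  # exact duplicate, drop
        sig = minhash_signature(doc)
        if any(jaccard_estimate(sig, s) >= near_threshold for s in kept_sigs):
            continue  # near duplicate, drop
        seen_exact.add(key)
        kept.append(doc)
        kept_sigs.append(sig)
    return kept
```

Exact dedup catches byte-identical (here, whitespace-normalized) copies cheaply, while MinHash catches files that differ only slightly; running both in sequence is what makes a ~98% reduction plausible on heavily mirrored code corpora.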
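The quality-filtering step above relies on a learned scoring model; its internals are not described in the article, so the sketch below only shows the surrounding filter logic with a hypothetical heuristic scorer standing in for the trained model.

```python
def keep_high_quality(files, score_fn, threshold: float = 0.5):
    # score_fn stands in for Seed-Coder's learned quality scorer;
    # any callable returning a score in [0, 1] fits this interface.
    return [f for f in files if score_fn(f) >= threshold]

def heuristic_score(code: str) -> float:
    # Toy stand-in scorer (NOT the real model): rewards commented code,
    # penalizes very long lines, clips the result into [0, 1].
    lines = code.splitlines() or [""]
    comment_ratio = sum(l.lstrip().startswith("#") for l in lines) / len(lines)
    long_ratio = sum(len(l) > 120 for l in lines) / len(lines)
    return max(0.0, min(1.0, 0.5 + comment_ratio - long_ratio))
```

The point of the "small model manages data" paradigm is that `score_fn` is itself a trained model, so quality judgments scale to billions of files without hand-written rules per language.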
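The repository selection criteria listed above (at least 100 stars, 10 forks, 100 commits, and 100 days of maintenance) amount to a simple predicate over repo metadata. The record type and field names below are illustrative assumptions, not Seed-Coder's actual schema.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Repo:
    # Hypothetical metadata record; field names are illustrative.
    name: str
    stars: int
    forks: int
    commits: int
    first_commit: date
    last_commit: date

def is_high_quality(repo: Repo) -> bool:
    # Thresholds as reported in the article: >=100 stars, >=10 forks,
    # >=100 commits, >=100 days of maintenance activity.
    maintained_days = (repo.last_commit - repo.first_commit).days
    return (repo.stars >= 100 and repo.forks >= 10
            and repo.commits >= 100 and maintained_days >= 100)
```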
- ByteDance has also launched other models, including the video generation model Seaweed and the reasoning model Seed-Thinking-v1.5, emphasizing cost-effectiveness and performance improvements [39][40].

Strategic Direction
- ByteDance's Seed is focusing on open-source initiatives and lowering barriers to access, with ongoing adjustments within its AI Lab to explore foundational research toward AGI [44].