Open Source and Openness

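ByteDance Seed open-sources its first code model, takes multiple SOTA results among models of its size, and proposes a paradigm of using small models to manage data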
量子位 (QbitAI) · 2025-05-11 04:20
Core Viewpoint
- ByteDance's Seed team has released Seed-Coder, an 8-billion-parameter code generation model that surpasses Qwen3 and achieves multiple state-of-the-art (SOTA) results among models of comparable size across several benchmarks [1][7].

Model Overview
- Seed-Coder comes in three versions: Base, Instruct, and Reasoning [6].
- The model has a 32K context length, was trained on 6 trillion tokens, and is released under the permissive MIT open-source license [10].

Data Management and Processing
- Seed-Coder takes a "model-centered" approach to data processing, using a model to curate the training data [12].
- The filtering pipeline runs in several stages. Deduplication with SHA256 (exact matches) and MinHash (near matches) reduced the original data volume by approximately 98% [15][16].
- A scoring model trained on over 220,000 code documents filters out low-quality code files. The resulting corpus covers 89 programming languages and contains around 1 trillion unique tokens [19].

Data Sources
- Seed-Coder collected 74 million commit records from 140,000 high-quality GitHub repositories. A repository qualified only if it had at least 100 stars, 10 forks, 100 commits, and 100 days of maintenance activity [21].
- The team also extracted data from web archives, distinguishing two types of raw data: HTML pages with explicit code tags and pages without them. Both exact and approximate deduplication were applied [27][28].

Pre-training Phases
- Pre-training is divided into two phases: conventional pre-training on file-level code and code-related web data, followed by continued pre-training that incorporates all data categories along with high-quality datasets to enhance performance [34][35].

Model Variants and Innovations
- Two special variants of Seed-Coder have been developed to further expand its utility [36].
- ByteDance has also launched other models, including a video generation model (Seaweed) and a reasoning model (Seed-Thinking-v1.5), emphasizing cost-effectiveness and performance improvements [39][40].

Strategic Direction
- ByteDance's Seed team is focusing on open-source initiatives and lowering barriers to access, with ongoing adjustments within its AI Lab to explore foundational research in AGI [44].
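The exact and approximate deduplication stages described in the Seed-Coder summary above can be illustrated with a minimal sketch. It is not the Seed-Coder pipeline: the summary names only SHA256 and MinHash, so the shingle width, signature length, similarity threshold, and every function name here are assumptions chosen for illustration. The pairwise comparison is also a simplification; large-scale pipelines typically bucket signatures with locality-sensitive hashing.

```python
import hashlib
import re

NUM_HASHES = 64           # assumed MinHash signature length
SHINGLE_SIZE = 5          # assumed shingle width, in tokens
NEAR_DUP_THRESHOLD = 0.8  # assumed Jaccard cutoff for near-duplicates


def exact_key(text: str) -> str:
    """SHA-256 digest used to detect byte-identical files."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def shingles(text: str, k: int = SHINGLE_SIZE) -> set:
    """Set of overlapping k-token windows; whitespace and punctuation are ignored."""
    tokens = re.findall(r"\w+", text)
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def minhash_signature(text: str, num_hashes: int = NUM_HASHES) -> list:
    """For each salted hash function, keep the minimum hash over all shingles."""
    shingle_set = shingles(text)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in shingle_set
        ))
    return signature


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching signature slots, which estimates Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def deduplicate(docs: list) -> list:
    """Drop exact duplicates first, then near-duplicates (quadratic; fine for a demo)."""
    seen_hashes = set()
    kept_signatures = []
    kept_docs = []
    for doc in docs:
        key = exact_key(doc)
        if key in seen_hashes:
            continue  # byte-identical to an earlier file
        seen_hashes.add(key)
        sig = minhash_signature(doc)
        if any(estimated_jaccard(sig, other) >= NEAR_DUP_THRESHOLD
               for other in kept_signatures):
            continue  # near-duplicate of a file already kept
        kept_signatures.append(sig)
        kept_docs.append(doc)
    return kept_docs
```

Running the cheap exact pass first means the more expensive MinHash signature is computed only for files that are not byte-identical to something already seen.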
Under the shadow of a "supply cutoff": the breakthrough moment for domestic operating systems
Guan Cha Zhe Wang · 2025-05-08 14:22
Core Viewpoint
- The breakthrough for domestic operating systems will not come from forced imitation driven by "replacement anxiety." It will come from using "demand insight" to identify advantageous scenarios and from breaking technical barriers through open-source collaboration, thereby becoming irreplaceable in niche markets [1][3][4].

Group 1: Market Context
- Domestic operating systems have made significant strides, particularly in consumer electronics, with Huawei's new phones exclusively using "pure HarmonyOS" [3].
- The global IoT device count is expected to exceed 64 billion by 2025, with China's market size surpassing 4.5 trillion yuan and accounting for over 30% [3].
- The market is fragmented, and the major players position themselves differently: open-source HarmonyOS aims for "all-scenario unification," Xiaomi's Vela targets smart homes, and Alibaba focuses on the industrial internet [3][4].

Group 2: Technological Insights
- RT-Thread, a domestic open-source embedded operating system, has an installed base of over 2.5 billion units, making it the largest domestic embedded OS by installation volume [6][16].
- RT-Thread's microkernel and lightweight design were driven by market demand, particularly from small chips that cannot run Linux or Android [6][14].
- Its success is attributed to meeting specific needs in fragmented scenarios, such as real-time response requirements in satellites and automotive applications [7][14].

Group 3: Competitive Landscape
- Domestic operating systems still face challenges, including weak developer ecosystems and the difficulty of adapting to many different chips and toolchains [7][19].
- RT-Thread differentiates itself as a neutral, open third-party platform that collaborates with various manufacturers and has a solid understanding of the domestic market [21][31].
- Competitors include major players such as Huawei and Xiaomi, while RT-Thread focuses on foundational operating systems for vehicles and industrial applications [31][36].

Group 4: Future Outlook
- Operating systems may go one of two ways: building a self-contained ecosystem like Apple's, or adhering to an open-source model [7][25].
- Ongoing geopolitical tensions and the push for domestic alternatives are accelerating the development of domestic operating systems [7][24].
- RT-Thread is well positioned to adapt to RISC-V, an open instruction set standard that is gaining traction, which strengthens its competitive edge [37][42].