SmolLM3 - filings, earnings calls, financial reports, news

SmolLM3

Search documents

HuggingFace发布超200页「实战指南」，从决策到落地「手把手」教你训练大模型

3 6 Ke· 2025-11-09 23:58

Core Insights - HuggingFace released an extensive technical blog detailing the end-to-end experience of training advanced LLMs, emphasizing the "chaotic reality" of LLM development [1][4] Group 1: Training Considerations - The blog raises a critical question before diving into technical details: "Do you really need to train this model?" [7] - It lists common misconceptions for training models, such as having idle computing power or following trends, and provides a flowchart to help determine the necessity of training a custom model [9] - Custom pre-training is suitable for three main areas: research, production, and strategic open-source initiatives [12][13] Group 2: Team Dynamics - Successful LLM training teams typically possess two key traits: a small initial team size (2-3 people) and sufficient computational resources for rapid iteration [14] Group 3: Experimental Approach - The blog emphasizes the importance of conducting numerous experiments (ablation studies) to inform decisions on architecture, optimizers, and data combinations [15] - A structured process for setting up ablation experiments is outlined, recommending starting with a proven architecture to leverage existing optimizations [16] Group 4: Model Architecture - The blog details the decision-making process for designing LLM architectures, using the SmolLM3 model as an example, which has 3 billion parameters [25] - It discusses the trade-offs between dense, MoE (Mixture of Experts), and hybrid architectures, ultimately opting for a dense architecture for SmolLM3 due to deployment constraints [28] Group 5: Data Management - Data quality is highlighted as a critical factor in LLM training, often outweighing the importance of model architecture [30][31] - The blog discusses the evolution from static data mixing to multi-stage training, where data proportions are dynamically adjusted based on performance [34] Group 6: Training Infrastructure - The blog stresses the importance of robust infrastructure for LLM training, likening it to an industrial-grade oven necessary for baking a cake [50] - It provides insights into GPU requirements for training, using the SmolLM3 model as a case study, which utilized 384 H100 GPUs over nearly a month [54] Group 7: Post-Training Phase - The blog outlines the post-training phase, emphasizing the need for clear objectives and the selection of appropriate frameworks and tools [43][46] - It discusses the significance of supervised fine-tuning as a starting point for most post-training processes [49]

Large Language Model (LLM) Training

Artificial Intelligence

SmolLM3

Large Language Model (LLM) Training

Artificial Intelligence

SmolLM3

HuggingFace发布超200页「实战指南」，从决策到落地「手把手」教你训练大模型

机器之心· 2025-11-09 11:48

Core Insights - HuggingFace recently published an extensive technical blog detailing the end-to-end experience of training advanced LLMs, emphasizing the "chaotic reality" of LLM development [1][4] - The blog provides in-depth technical details, code snippets, and debugging tips, making it a valuable resource for readers interested in building LLMs [5] Group 1: Training Considerations - A critical question posed is whether one truly needs to train a model from scratch, given the availability of world-class open-source models [9] - The article lists common misconceptions about training models, such as having idle computing power or following trends without a clear purpose [11] - A flowchart is provided to help determine if training a custom model is necessary, suggesting that training should only be considered when existing models and fine-tuning do not meet specific needs [12][14] Group 2: Custom Pre-training Scenarios - Custom pre-training is suitable for three main areas: research, production, and strategic open-source initiatives [15] - The goals of these areas dictate training decisions, such as model size and architecture [17] - The decision-making process involves planning and validation through systematic experiments [18] Group 3: Team Composition and Experimentation - Successful LLM training teams typically start small, with 2-3 members, focusing on sufficient computing power and rapid iteration [19] - The blog emphasizes the importance of empirical experimentation, particularly through ablation studies, to inform model decisions [21][30] - A complete process for setting up ablation experiments is outlined, recommending starting with a proven architecture [22] Group 4: Framework Selection and Data Management - Choosing the right training framework is crucial, balancing functionality, stability, and throughput [24] - The article compares several mainstream frameworks, highlighting the importance of high-quality data management for effective training [25] - Data curation is described as an art, where the quality and mix of data significantly influence model performance [41][42] Group 5: Model Architecture and Tokenization - The blog discusses various model architectures, including dense, MoE (Mixture of Experts), and hybrid models, with SmolLM3 using a dense architecture for memory constraints [36][37] - Tokenization is highlighted as a critical factor, with the choice of vocabulary size and algorithm impacting model performance [38] - The article stresses the need for careful selection of hyperparameters tailored to specific architectures and datasets [39] Group 6: Training Process and Infrastructure - The training process is likened to a marathon, requiring thorough preparation and the ability to handle unexpected challenges [51] - Infrastructure is emphasized as a critical component often overlooked, with detailed considerations for GPU selection and monitoring [63][66] - The blog provides insights into the GPU requirements for training SmolLM3, illustrating the balance between training time, cost, and efficiency [70] Group 7: Post-training and Evaluation - The post-training phase is crucial for refining the model's capabilities, with specific goals outlined for the SmolLM3 model [55][58] - The article discusses the importance of selecting appropriate frameworks and tools for post-training, including supervised fine-tuning and reinforcement learning [60] - Evaluation metrics and continuous monitoring are essential for assessing model performance and ensuring improvements [64]

Large Language Model (LLM) Training

Artificial Intelligence

SmolLM3

Large Language Model (LLM) Training

Artificial Intelligence

SmolLM3

硬核拆解大模型，从 DeepSeek-V3 到 Kimi K2 ，一文看懂 LLM 主流架构

机器之心· 2025-08-07 09:42

Core Viewpoint - The article discusses the evolution of large language models (LLMs) over the past seven years, highlighting that while model capabilities have improved, the overall architecture has remained consistent. It questions whether there have been any disruptive innovations or if advancements have been incremental within the existing framework [2][5]. Group 1: Architectural Innovations - The article details eight mainstream LLMs, including DeepSeek and Kimi, analyzing their architectural designs and innovative approaches [5]. - DeepSeek V3, released in December 2024, introduced key architectural technologies that enhanced computational efficiency, distinguishing it among other LLMs [10][9]. - The multi-head latent attention mechanism (MLA) is introduced as a memory-saving strategy that compresses key and value tensors into a lower-dimensional latent space, significantly reducing memory usage during inference [18][22]. Group 2: Mixture-of-Experts (MoE) - The MoE layer in the DeepSeek architecture allows for multiple parallel feedforward submodules, significantly increasing the model's parameter capacity while reducing computational costs during inference through sparse activation [23][30]. - DeepSeek V3 features 256 experts in each MoE module, with a total parameter count of 671 billion, but only activates 9 experts per token during inference [30]. Group 3: OLMo 2 and Its Design Choices - OLMo 2 is noted for its high transparency in training data and architecture, which serves as a reference for LLM development [32][34]. - The architecture of OLMo 2 includes a unique normalization strategy, utilizing RMSNorm and QK-norm to enhance training stability [38][46]. Group 4: Gemma 3 and Sliding Window Attention - Gemma 3 employs a sliding window attention mechanism to reduce memory requirements for key-value (KV) caching, representing a shift towards local attention mechanisms [53][60]. - The architecture of Gemma 3 also features a dual normalization strategy, combining Pre-Norm and Post-Norm approaches [62][68]. Group 5: Mistral Small 3.1 and Performance - Mistral Small 3.1, released in March 2023, outperforms Gemma 3 in several benchmarks, attributed to its custom tokenizer and reduced KV cache size [73][75]. - Mistral Small 3.1 adopts a standard architecture without the sliding window attention mechanism used in Gemma 3 [76]. Group 6: Llama 4 and MoE Adoption - Llama 4 incorporates MoE architecture, similar to DeepSeek V3, but with notable differences in the activation of experts and overall design [80][84]. - The MoE architecture has seen significant development and adoption in 2025, indicating a trend towards more complex and capable models [85]. Group 7: Kimi K2 and Its Innovations - Kimi K2, with a parameter count of 1 trillion, is recognized as one of the largest LLMs, utilizing the Muon optimizer variant for improved training performance [112][115]. - The architecture of Kimi K2 is based on DeepSeek V3 but expands upon its design, showcasing the ongoing evolution of LLM architectures [115].

昆仑万维发布并开源Skywork-R1V 3.0版本；浙江大学发布高精准基因组设计AI模型丨AIGC日报

创业邦· 2025-07-10 00:00

Group 1 - Kunlun Wanwei released and open-sourced Skywork-R1V 3.0, achieving a score of 76.0 in the comprehensive multimodal evaluation MMMU, surpassing closed-source models like Claude-3.7-Sonnet (75.0) and GPT-4.5 (74.4), nearing the level of human junior experts (76.2) [1] - Hugging Face announced the release and open-sourcing of the small parameter model SmolLM3, which supports six languages and features a 128k context window, enabling deep and non-deep reasoning modes [1] - Zhejiang University developed a deep learning AI model named "Nuwa CE" for genomic prediction design, achieving over 90% accuracy in predicting phenotypic changes due to mutations in genomic regulatory regions, with results published in the journal Cell [1] Group 2 - Hugging Face's desktop robot Reachy Mini is now available for order, featuring two versions: Reachy Mini Wireless priced at $449 (approximately 3224 RMB) and Reachy Mini Lite at $299 (approximately 2147 RMB), both designed for developers [1][2] - Both versions of Reachy Mini are open-source DIY kits, comparable in size to a plush toy, equipped with screens and antenna structures, allowing users to program via Python and access over 1.7 million AI models and 400,000 datasets through the Hugging Face Hub [2]

Kunlun(SZ:300418)

AIGC

Artificial Intelligence

Artificial Intelligence

腾讯研究院· 2025-07-09 14:49

Group 1: Veo 3 Upgrade - The Google Veo 3 upgrade allows audio and video generation from a single image, maintaining high consistency across multiple angles [1] - The new feature is implemented through the Flow platform's "Frames to Video" option, enhancing camera movement capabilities, although the Gemini Veo3 entry is currently unavailable [1] - User tests indicate natural expressions and effective performances, marking a significant breakthrough in AI storytelling applicable in advertising and animation [1] Group 2: Hugging Face 3B Model - Hugging Face has released the open-source 3B parameter model SmolLM3, outperforming Llama-3.2-3B and Qwen2.5-3B, supporting a 128K context window and six languages [2] - The model features a dual-mode system allowing users to switch between deep thinking and non-thinking modes [2] - It employs a three-stage mixed training strategy, trained on 11.2 trillion tokens, with all technical details, including architecture and data mixing methods, made available [2] Group 3: Kunlun Wanwei Skywork-R1V 3.0 - Kunlun Wanwei has open-sourced the Skywork-R1V 3.0 multimodal model, achieving a score of 142 in high school mathematics and 76 in MMMU evaluation, surpassing some closed-source models [3] - The model utilizes a reinforcement learning strategy (GRPO) and key entropy-driven mechanisms, achieving high performance with only 12,000 supervised samples and 13,000 reinforcement learning samples [3] - It excels in physical reasoning, logical reasoning, and mathematical problem-solving, setting a new performance benchmark for open-source models and demonstrating cross-disciplinary generalization capabilities [3] Group 4: Vidu Q1 Video Creation - Vidu Q1's multi-reference video feature allows users to upload up to seven reference images, enabling strong character consistency and zero storyboard video generation [4] - Users can combine multiple subjects with simple prompts, with clarity upgraded to 1080P, and support for character material storage for repeated use [5] - Test results show it is suitable for creating multi-character animation trailers, supporting frame extraction and quality enhancement, reducing video production costs to less than 0.9 yuan per video [5] Group 5: VIVO BlueLM-2.5-3B Model - VIVO has launched the BlueLM-2.5-3B edge multimodal model, which excels in over 20 evaluations and supports GUI interface understanding [6] - The model allows flexible switching between long and short thinking modes, introducing a thinking budget control mechanism to optimize reasoning depth and computational cost [6] - It employs a sophisticated structure (ViT+Adapter+LLM) and a four-stage pre-training strategy, enhancing efficiency and mitigating the text capability forgetting issue in multimodal models [6] Group 6: DeepSeek-R1 System - The X-Masters system, developed by Shanghai Jiao Tong University and DeepMind Technology, has achieved a score of 32.1 in the "Human Last Exam" (HLE), surpassing OpenAI and Google [7] - The system is built on the DeepSeek-R1 model, enabling smooth transitions between internal reasoning and external tool usage, using code as an interactive language [7] - X-Masters employs a decentralized-stacked multi-agent workflow, enhancing reasoning breadth and depth through collaboration among solvers, critics, rewriters, and selectors, with the solution fully open-sourced [7] Group 7: Zhihui Jun's Acquisition - Zhihui Jun's Zhiyuan Robot has acquired control of the listed company Shuangwei New Materials for 2.1 billion yuan, aiming for a 63.62%-66.99% stake [8] - Following the acquisition, Shuangwei New Materials' stock resumed trading with a limit-up, reaching a market value of 3.77 billion yuan, with the actual controller changing to Zhiyuan CEO Deng Taihua and core team members including "Zhihui Jun" Peng Zhihui [8] - This acquisition, conducted through "agreement transfer + active invitation," is seen as a landmark case for new productivity enterprises in A-shares following the implementation of national policies [8] Group 8: AI Model Usage Trends - In the first half of 2025, the Gemini series models captured nearly half of the large model API market, with Google leading at 43.1%, followed by DeepSeek and Anthropic at 19.6% and 18.4% respectively [9] - DeepSeek V3 has maintained a high user retention rate since its launch, ranking among the top five in usage, while OpenAI's model usage has fluctuated significantly [9] - The competitive landscape shows differentiation: Claude-Sonnet-4 leads in programming (44.5%), Gemini-2.0-Flash excels in translation, GPT-4o leads in marketing (32.5%), and role-playing remains highly fragmented [9] Group 9: AI User Trends - A report by Menlo Ventures indicates that there are 1.8 billion AI users globally, with a low paid user rate of only 3%, and a high student usage rate of 85%, while parents are becoming heavy users [10] - AI is primarily used for email writing (19%), researching topics of interest (18%), and managing to-do lists (18%), with no single task dependency exceeding one-fifth [10] - The next 18-24 months are expected to see six major trends in AI: rise of vertical tools, complete process automation, multi-person collaboration, explosion of voice AI, physical AI in households, and diversification of business models [10]

生成式AI

大模型

Artificial Intelligence

Artificial Intelligence

Veo 3

SmolLM3

Skywork-R1V 3.0

AI日报丨五大投行集体唱多美股！“科技七巨头”扛起盈利大旗

美股研究社· 2025-07-09 11:25

Core Insights - The article emphasizes the rapid development of artificial intelligence technology and its potential opportunities in the market [1] Group 1: AI Developments - Hugging Face has open-sourced a new model, SmolLM3, which has 30 billion parameters and significantly outperforms similar models like Llama-3.2-3B and Qwen2.5-3B [3] - SmolLM3 supports six languages and features a 128k context window, allowing for flexible reasoning modes [3] Group 2: Market Performance - The Magnificent 7 index of major U.S. tech stocks fell by 0.07%, with notable movements including Tesla rebounding by 1.32% and Amazon dropping by 1.84% [4][5] - AMD and Eli Lilly saw increases of 2.24% and 0.62%, respectively, while Berkshire Hathaway B shares fell by 0.12% [5] Group 3: Corporate Actions - OpenAI's CEO, Sam Altman, downplayed concerns over talent acquisition from Meta Platforms, indicating no direct communication with Zuckerberg since the talent shifts began [5] - Meta Platforms has acquired a minority stake in EssilorLuxottica, valued at approximately €3 billion (about $3.5 billion), as part of its investment in the smart glasses sector [6] Group 4: Financial Outlook - Goldman Sachs has raised its year-end target for the S&P 500 index from 6100 to 6600, indicating a potential 5.9% upside for the U.S. stock market [6] - The upcoming earnings season is viewed as a critical moment for assessing market strength, with expectations of a 4.5% year-over-year growth in average earnings per share for S&P 500 constituents [7] - The weakening U.S. dollar, down 10% year-to-date, is expected to benefit large-cap tech companies that derive about 60% of their revenue from international markets [7] Group 5: Corporate Leadership Changes - Apple has appointed Sabih Khan as the new Chief Operating Officer, succeeding Jeff Williams, who will focus on design and health initiatives [10][11] - Khan has been with Apple since 1995 and has played a key role in the company's supply chain and manufacturing strategies [11]