Gemma3
Layer counts in small models are strangely finicky: 12/32/64 layers work well, 16/24/48 work badly
QbitAI · 2026-01-11 04:02
Core Insights
- The article reports significant findings about the 70M small model, emphasizing that the choice of architecture matters less than previously thought, while the model's "shape" (its depth-width ratio) is more critical [1][2]

Group 1: Model Architecture and Performance
- The optimal number of layers for small models is 32, with 12 and 64 layers also performing well, while 16-, 24-, and 48-layer configurations yield poor results [2][15]
- The gap between "good" and "bad" layer configurations exceeds 6 percentage points: "good" configurations average around 38% accuracy, "bad" configurations around 32% [15][16]
- The hidden dimension must be at least 512 for optimal performance, with the 32-layer configuration achieving the highest score of 38.50% [18][23]

Group 2: Comparative Analysis of Architectures
- A comparison of 12 different architectures, including LLaMA3 and Qwen3, shows that modern architectures perform similarly in the 70M-parameter range, with average differences of less than 2% [25][26]
- Improvements in modern architectures are designed primarily for models above 700 million parameters and provide no measurable advantage for 70M models [27]

Group 3: Diffusion Models vs. Autoregressive Models
- Diffusion models, while slightly lower in average accuracy (31-32%), deliver 3.8x faster inference and lower hallucination rates than autoregressive models [28][30]
- Adding a "Canon layer" improves factual accuracy by about 1% for autoregressive models and over 2% for diffusion models, at minimal additional parameter cost [35][36]

Group 4: New Model Development
- The Dhara-70M model combines the best features of autoregressive and diffusion models, built on the LLaMA3-Canon architecture and converted using the WSD method [41][42]
- Dhara-70M has 71.34M parameters, 32 layers, and a hidden size of 384, and is designed for high throughput and factual accuracy [44]

Group 5: Recommendations for Model Builders
- The article advises small-language-model builders to focus on the fundamental depth-width ratio rather than chasing the latest architectural trends, especially for applications requiring high-speed processing and factual accuracy [45]
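The depth-width framing above can be made concrete with a rough parameter-count sketch. The formula below is a generic decoder-only approximation, not the article's method; the vocabulary size (32,000) and MLP expansion ratio (4) are assumptions, and norms, biases, and any weight tying are ignored, so the totals only loosely bracket the reported 71.34M.

```python
def transformer_params(n_layers, d_model, vocab_size=32_000, mlp_ratio=4):
    """Rough decoder-only parameter count: token embeddings plus, per layer,
    attention projections (4 * d^2) and MLP weights (2 * mlp_ratio * d^2)."""
    embed = vocab_size * d_model
    per_layer = 4 * d_model**2 + 2 * mlp_ratio * d_model**2
    return embed + n_layers * per_layer

# Three "shapes" around a comparable small-model budget: deep-narrow,
# the article's favored 32 x 384, and shallow-wide.
for layers, d in [(64, 256), (32, 384), (12, 640)]:
    print(f"{layers:>2} layers x {d:>3} hidden -> "
          f"{transformer_params(layers, d) / 1e6:.1f}M params")
```

At a fixed budget the same parameter mass can be spent on depth or width, which is exactly the trade-off the layer-count experiments probe.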
Cambridge opens the black box of large-model failures: stop blaming weak reasoning, the errors are in execution
36Kr · 2025-10-13 10:46
Core Insights
- The core argument is that the difficulties large models face on long-horizon tasks stem primarily not from reasoning ability but from execution ability [1][6][20]

Group 1: Execution Challenges
- Large models' performance declines as task length increases, indicating that execution stability is a critical area needing more attention [6][14]
- Even with improved single-step accuracy, overall task execution can suffer because accuracy decays over multiple steps, a phenomenon the authors call self-conditioning [3][33]
- The ability to execute plans reliably is essential, especially as the industry moves toward intelligent agents that handle entire projects rather than isolated problems [4][6]

Group 2: Performance Metrics
- The researchers propose several evaluation metrics: Step Accuracy, Turn Accuracy, Turn Complexity, Task Accuracy, and Horizon Length [7][12]
- As the number of steps in a task grows, model accuracy tends to decline, which is central to understanding the limits of current models [9][31]
- Larger models maintain higher task accuracy over more rounds, suggesting that scaling up model size can enhance execution capabilities [32][36]

Group 3: Self-Conditioning Effect
- Self-conditioning is identified as a major driver of the accuracy decline on long-horizon tasks: earlier errors raise the likelihood of later mistakes [33][35]
- Even with perfect knowledge and planning, models can still fail on long-chain tasks due to unstable execution [20][28]
- Simply increasing model size does not alleviate self-conditioning, which remains a challenge for long-horizon execution [36][37]

Group 4: Thinking Models
- "Thinking" models show greater resilience to self-conditioning and can execute longer tasks in a single round [43]
- Models such as Qwen3 with thinking enabled execute significantly longer tasks than their non-thinking counterparts [43]
- The findings support the idea that a structured reason-before-acting approach enhances large models' performance on complex tasks [43]
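The compounding effect described above can be illustrated under a simplifying independence assumption (which self-conditioning makes optimistic in practice): if each step succeeds with probability p, an n-step task succeeds with probability p^n, and a horizon length at a given success threshold follows directly. The function names here are illustrative, not the paper's definitions.

```python
import math

def task_accuracy(step_acc, n_steps):
    """Probability of completing all n steps, assuming independent steps
    (self-conditioning, described above, makes real models do worse)."""
    return step_acc ** n_steps

def horizon_length(step_acc, threshold=0.5):
    """Longest task length still completed with probability >= threshold."""
    return math.floor(math.log(threshold) / math.log(step_acc))

# Even 99% per-step accuracy collapses over long tasks:
print(task_accuracy(0.99, 100))  # ~0.366
print(horizon_length(0.99))      # 68 steps before success drops below 50%
```

This is why small gains in single-step accuracy translate into disproportionately longer executable horizons, matching the article's emphasis on execution stability.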
X @Polyhedra
Polyhedra· 2025-09-25 12:00
6/ Currently working on Gemma3 quantization, focusing on:
- Learning the new model architecture
- Adding KV cache support (which accelerates inference)
- Implementing quantization support for some new operators

Full operator support will require 1+ additional day, plus more time for accuracy testing. Stay tuned for more updates 🔥
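For context on why a KV cache speeds up decoding: at each step only the new token's key/value projections are computed, while earlier tokens' projections are reused from the cache. A minimal single-head sketch in plain Python; the dimensions and random weights are toy values and this is unrelated to Gemma3's actual architecture or Polyhedra's implementation.

```python
import math
import random

random.seed(0)
d = 8  # toy head dimension

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rand_mat():
    return [[random.gauss(0.0, d ** -0.5) for _ in range(d)] for _ in range(d)]

Wq, Wk, Wv = rand_mat(), rand_mat(), rand_mat()
K_cache, V_cache = [], []  # grows by one entry per decoded token

def decode_step(x):
    """Attend the new token over all cached keys/values.

    Only the new token's K/V projections are computed here; earlier
    tokens' projections come from the cache, making cached decoding
    O(1) projections per step instead of O(t)."""
    K_cache.append(matvec(Wk, x))
    V_cache.append(matvec(Wv, x))
    q = matvec(Wq, x)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
              for k in K_cache]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    w = [wi / z for wi in w]
    return [sum(wi * v[j] for wi, v in zip(w, V_cache)) for j in range(d)]
```

Quantizing the cached K/V tensors is one of the places accuracy testing matters, since cache precision affects every subsequent step.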
Computer ETF (512720) rises over 1.6%; domestic large-model breakthroughs may catalyze computing-power demand
National Business Daily · 2025-08-11 03:56
Group 1
- The Kimi K2 model uses 32 billion activation parameters to achieve trillion-parameter-scale capability, surpassing international open-source models such as Gemma3 and Llama4 and ranking in the top 5 of the large-model arena [1]
- Kimi K2 employs a self-developed MuonClip optimizer to overcome training-stability issues and improves task generalization through intelligent data-synthesis technology inspired by ACEBench, enabling it to autonomously generate complex front-end code and accurately decompose instructions into structured sequences [1]
- Kimi K2's open-source strategy is expected to lower AI-agent development costs and drive application-layer innovation, forming a full-stack product matrix with B-end enterprise-level APIs and the C-end multimodal Kimi-VL, validating long-text and visual-interaction scenarios [1]

Group 2
- The Computer ETF (512720) has risen over 1.6%; it tracks the CS Computer Index (930651), which selects Shanghai- and Shenzhen-listed companies in computer hardware, software, and services, reflecting the overall performance of computer-related securities with high growth and volatility characteristics [1]
OpenAI to launch a $50 million fund supporting non-profit and community organizations; Kimi K2 tops the global open-source model rankings | AIGC Daily
Cyzone · 2025-07-20 01:15
Group 1
- Manus co-founder Ji Yichao published a lengthy technical analysis reflecting on the company's journey from early success to recent challenges, including layoffs and account closures on domestic platforms [1]
- Chinese models dominate the global open-source rankings, with Kimi K2, DeepSeek R1, and Qwen3 taking the top three spots ahead of Google's Gemma3 and Meta's Llama4, indicating a significant advance in China's AI capabilities [1]
- OpenAI announced a $50 million initial fund to support non-profit and community organizations, aiming to leverage AI for transformative impact in education, economic opportunity, community organizing, and healthcare [1]
- Perplexity, an Nvidia-backed AI startup, is negotiating with mobile-device manufacturers to pre-install its Comet AI mobile browser, challenging Google's dominance in the mobile market [2]
First pixel-space reasoning: a 7B model beats GPT-4o, letting VLMs use "eyes and brain together" like humans
QbitAI · 2025-06-09 09:27
Core Viewpoint
- The article discusses the transition of Visual Language Models (VLMs) from "perception" to "cognition," introducing "Pixel-Space Reasoning," which lets models interact with visual information directly at the pixel level, enhancing their understanding and reasoning capabilities [1][2][3]

Group 1: Key Developments in VLM
- Mainstream VLMs are limited by their reliance on text tokens, which can lose critical information in high-resolution images and dynamic video scenes [2][4]
- Pixel-Space Reasoning enables models to perform visual operations directly, allowing a more human-like interaction with visual data [3][6]
- This new reasoning paradigm shifts from text-mediated understanding to native visual operations, improving the model's ability to capture spatial relationships and dynamic details [6][7]

Group 2: Overcoming Learning Challenges
- The research team identified a "cognitive inertia" challenge: the model's established text-reasoning skills hinder the development of new pixel-operation skills, creating a "learning trap" [8][9]
- To address this, they designed a reinforcement-learning framework that combines intrinsic curiosity incentives with extrinsic correctness rewards, encouraging the model to explore visual operations [9][12]
- The framework includes constraints that enforce a minimum rate of pixel-space reasoning and balance exploration against computational efficiency [10][11]

Group 3: Performance Validation
- Pixel-Reasoner, built on the Qwen2.5-VL-7B model, achieved impressive results across four visual-reasoning benchmarks, outperforming models such as GPT-4o and Gemini-2.5-Pro [13][19]
- It reached 84.3% accuracy on V* Bench, significantly higher than its competitors [13]
- It scored 73.8% on TallyQA-Complex, showing its ability to distinguish between similar objects in images [19][20]

Group 4: Future Implications
- Pixel-space reasoning is positioned not as a replacement for text reasoning but as a complementary pathway, enabling a dual-track understanding of the world [21]
- As multimodal reasoning capabilities evolve, the industry is moving toward machines that can "see more clearly and think more deeply" [21]
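The curiosity-plus-correctness scheme with a pixel-op floor described above can be sketched as simple reward shaping. Every name and coefficient below is hypothetical, chosen only to illustrate the structure, and none of it is taken from the paper.

```python
def shaped_reward(correct, used_pixel_ops, pixel_op_rate,
                  curiosity_bonus=0.1, min_pixel_rate=0.3, penalty=0.5):
    """Hypothetical reward shaping: an extrinsic correctness reward, an
    intrinsic curiosity bonus for attempting pixel-level operations, and
    a penalty when the rolling rate of pixel-space rollouts falls below
    a floor (countering the "cognitive inertia" toward text-only
    reasoning)."""
    r = 1.0 if correct else 0.0   # extrinsic: was the final answer right?
    if used_pixel_ops:
        r += curiosity_bonus      # intrinsic: encourage visual operations
    if pixel_op_rate < min_pixel_rate:
        r -= penalty              # constraint: enforce minimum pixel-op usage
    return r
```

The floor term is what keeps the policy from collapsing back into its already strong text-reasoning habits while the pixel skills are still weak.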
Samsung's chip business goes big on AI
Semiconductor News (半导体芯闻) · 2025-05-09 11:08
Core Viewpoint
- Samsung Electronics' DS (semiconductor) division is expanding its use of external AI models from Google and Microsoft, moving away from a closed internal AI system to improve work efficiency in semiconductor design and development [1][2]

Group 1: Introduction of External AI Models
- Samsung's DS department has officially introduced external open-source AI models, including Google's "Gemma3," Microsoft's "Phi-4," and Meta's "Llama4," to improve operational efficiency [1]
- The decision to adopt an open multi-model environment aims to leverage each model's strengths for specific tasks, such as Phi-4 for numerical processing and Gemma3 for image analysis [2]

Group 2: Transition from Closed to Open AI Strategy
- Samsung previously relied on a closed strategy built around its internal AI, "DS Assistant," which faced limitations in using external data and in strengthening competitiveness in semiconductor design [2]
- The DS department initially approved the use of ChatGPT in March 2023, but data-security concerns led it to develop a more secure internal AI solution instead [2]

Group 3: Internal Deployment of AI Models
- The external AI models will run internally on the company's data servers to prevent leaks of internal information, providing a secure environment while improving work efficiency [3]
- Samsung plans to review and support open-source AI models by work type to further enhance productivity [3]