Transformer Architecture
AI Solves Math Problems Relying Only on the Last Token
量子位· 2025-09-14 05:05
Core Insights
- The research indicates that in mental-arithmetic tasks, computation is concentrated on the last token rather than distributed across all tokens, suggesting that global information access is unnecessary for such tasks [1][11].

Group 1: Research Methodology
- Researchers employed Context-Aware Mean Ablation (CAMA) and Attention-Based Peeking (ABP) to conduct a series of ablation experiments on models such as Llama-3-8B [2][22].
- The experiments aimed to identify the "minimum computation" a model needs to perform well by systematically removing or altering parts of the model [3].
- A sparse subgraph termed "All-for-One" (AF1) was identified, which computes efficiently with minimal layers and limited information transfer [4][5].

Group 2: Model Structure and Functionality
- In the AF1 structure, the initial layers (L_wait) do not perform input-specific calculations but instead carry out general preparatory work [7].
- Information is then transferred to the last token through intermediate layers (L_transfer), after which the last token independently performs the final calculation [8][9].
- This separation of general computation from input-specific computation underlies the model's efficiency on arithmetic tasks [10].

Group 3: Experimental Findings
- Llama-3-8B requires only its first 14 layers for general computation, followed by 2 layers for information transfer, with the remaining layers devoted to the last token's own computation [24][26].
- AF1_llama demonstrated high fidelity across eight tasks, maintaining performance close to the original model [28][29].
- Specific attention heads matter for arithmetic, yet the model retained roughly 95% accuracy even after nearly 60 heads were removed, indicating redundancy among attention heads [30].

Group 4: Generalization and Limitations
- AF1_llama generalized well to other direct arithmetic forms but failed on tasks requiring semantic understanding, such as word problems and Python code [32][34].
- Similar AF1-like subgraphs were found in Pythia and GPT-J, though these models showed shorter waiting periods and less clear-cut performance boundaries than Llama [35][36].

Group 5: Contributions and Innovations
- The work advances understanding of arithmetic reasoning and cross-token computation mechanisms in large language models [37].
- The methodologies introduced, CAMA and ABP, offer innovative approaches that could extend beyond arithmetic to broader applications [37].
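The ablation idea behind these experiments can be illustrated in a few lines: replace a component's activations with their mean over a batch, so that input-specific information is erased while "typical" computation still flows through the network. This is only a minimal sketch of generic mean ablation, not the paper's exact context-aware CAMA procedure; the function name and nested-list layout are illustrative assumptions.

```python
def mean_ablate(activations, positions):
    """Mean-ablate activations at chosen token positions.

    activations: batch of sequences, each a list of per-token vectors.
    positions: token indices whose vectors are replaced by the batch mean.
    Returns a copy; the input is left untouched.
    """
    n = len(activations)
    dim = len(activations[0][0])
    # Deep-copy so the original activations are preserved.
    ablated = [[vec[:] for vec in seq] for seq in activations]
    for p in positions:
        # Batch-mean vector at position p: erases what is specific to
        # each input while keeping an "average" signal in place.
        mean = [sum(seq[p][d] for seq in activations) / n for d in range(dim)]
        for seq in ablated:
            seq[p] = mean[:]
    return ablated
```

A faithfulness experiment would then compare the model's outputs on ablated versus original activations to see which positions actually carry the arithmetic.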
After My Advisor Pointed Me Toward Multimodal Perception Research......
自动驾驶之心· 2025-09-07 23:34
Core Viewpoint
- The article discusses the ongoing industry debate over the safety and efficacy of different sensor technologies for autonomous driving, focusing on the advantages of LiDAR, a debate sharpened by Elon Musk's well-known vision-only stance [1].

Summary by Sections

Section 1: Sensor Technology in Autonomous Driving
- LiDAR provides significant advantages such as long-range perception, high frame rates for real-time sensing, robustness in adverse conditions, and three-dimensional spatial awareness, addressing key challenges in autonomous-driving perception [1].
- Integrating multiple sensor types, including LiDAR, radar, and cameras, enhances system reliability through multi-sensor fusion, currently the mainstream approach in high-end production intelligent-driving systems [1].

Section 2: Multi-Modal Fusion Techniques
- Traditional fusion methods fall into three categories: early fusion, mid-level fusion, and late fusion, each with its own strengths and weaknesses [2].
- The current cutting edge is end-to-end fusion built on the Transformer architecture, which uses cross-modal attention to learn deep relationships between data modalities, improving the efficiency and robustness of feature interaction [2].

Section 3: Educational Initiatives
- Interest in multi-modal perception fusion is growing among graduate students, many of whom seek guidance and mentorship to build understanding and practical skills [2].
- A structured course is offered to help students systematically grasp key theory, develop practical coding skills, and improve academic writing [5][10].

Section 4: Course Structure and Outcomes
- The course spans 12 weeks of online group research followed by 2 weeks of paper guidance, concluding with a 10-week maintenance period for the research paper [21].
- Participants will study classic and cutting-edge research papers, work through coding implementations, and learn methodologies for topic selection, experimentation, and paper writing [20][21].
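The cross-modal attention described in Section 2 can be sketched in miniature: one modality's features form the queries, another's form the keys and values. Below is a single-head, pure-Python illustration; real fusion stacks are multi-head, learned, and batched, and all names here are illustrative assumptions.

```python
import math

def cross_modal_attention(queries, keys, values):
    """Single-head cross-attention: e.g. camera features (queries)
    attend over LiDAR features (keys/values)."""
    dim = len(queries[0])
    out = []
    for q in queries:
        # Scaled dot-product scores against every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in keys]
        # Numerically stable softmax over the scores.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Output is the attention-weighted mix of value vectors.
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(dim)])
    return out
```

With a single key, the output is exactly that key's value; with several, each query token pulls a soft mixture of the other modality's features, which is what lets the fusion learn cross-modal correspondences.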
LatePost Exclusive | Li Auto's Self-Developed Smart-Driving Chip Begins On-Vehicle Road Tests, with Some Compute Performance Exceeding NVIDIA's Thor-U
晚点LatePost· 2025-08-28 06:09
Core Viewpoint
- Li Auto's self-developed autonomous driving chip, the M100, has passed key pre-mass-production milestones and is expected to enter mass production next year, aiming to improve the efficiency and cost-effectiveness of the company's autonomous-driving algorithms [4][6].

Summary by Sections

Chip Development
- The M100 has completed functional and performance testing, matching the effective computing power of 2 NVIDIA Thor-U chips on large-language-model tasks and 3 Thor-U chips on traditional visual tasks [4][6].
- The company has budgeted several billion dollars for its chip self-development project, reflecting the high costs of chip development [6].

Strategic Approach
- Li Auto is pursuing a dual strategy: relying on external partners such as NVIDIA and Horizon for current market competitiveness while developing its own chip as a future core advantage [7][8].
- CTO Xie Yan is leading a combined hardware-software development strategy to maximize chip performance and efficiency [6].

Market Positioning
- In its current electric-vehicle lineup, Li Auto uses NVIDIA's high-performance chips in flagship models, while its range-extended models use either NVIDIA Thor-U or Horizon Journey 6M chips depending on the autonomous-driving version [8].
- The core motivation for a self-developed chip is to optimize performance specifically for Li Auto's algorithms, improving cost-effectiveness and efficiency [8].
Exclusive | Li Auto's Self-Developed Smart-Driving Chip Begins On-Vehicle Road Tests, with Some Compute Performance Exceeding NVIDIA's Thor-U
晚点Auto· 2025-08-28 03:51
Core Viewpoint
- Li Auto's self-developed autonomous driving chip, the M100, has passed key pre-mass-production milestones and is expected to enter mass production next year, strengthening the company's competitive position in the autonomous-driving market [3][5].

Group 1: Chip Development and Performance
- The M100 delivers effective computing power comparable to 2 NVIDIA Thor-U chips on large-language-model tasks and to 3 Thor-U chips on traditional visual tasks such as image recognition [3][5].
- Li Auto has budgeted several billion dollars for its chip self-development project, indicating the significant investment such technology requires [5].

Group 2: Strategic Partnerships and Current Solutions
- Until the M100 reaches mass production, Li Auto will continue to rely on existing partnerships with NVIDIA and Horizon Robotics for its chip solutions [5][7].
- Its range-extended models use either NVIDIA Thor-U or Horizon Journey 6M chips depending on the specific version of its AD Max and AD Pro autonomous-driving systems [7].

Group 3: R&D Strategy and Challenges
- CTO Xie Yan is driving a combined hardware-software development strategy to maximize chip performance and efficiency, aiming to outperform competitors [5][6].
- Integrating hardware and software in chip development is complex, requiring deep technical expertise and effective cross-department collaboration [6].
What Meta Didn't Do, NVIDIA Did: A New Architecture with 6x Throughput, Trained on 20 Trillion Tokens
36Kr· 2025-08-19 02:33
Core Insights
- NVIDIA has launched a new 9B model, the NVIDIA Nemotron Nano 2, built on a Mamba-Transformer hybrid architecture that achieves up to 6x higher inference throughput than the industry benchmark Qwen3-8B while matching or exceeding it on complex reasoning tasks [1][23].

Group 1: Model Architecture and Performance
- Nemotron Nano 2 is based on the Mamba-2 architecture, which replaces most of the self-attention layers in a traditional Transformer, yielding significant speedups on complex reasoning tasks [10][15].
- The model is competitive across benchmarks in mathematics, code generation, and general reasoning, performing on par with or better than comparable open-source models such as Qwen3-8B and Gemma3-12B [23][24].
- Notable scores include 97.8% on MATH500 and 72.1% on AIME25, showcasing its mathematical reasoning and general-knowledge capabilities [24].

Group 2: Training and Data Utilization
- Training used a massive dataset of 20 trillion tokens and advanced FP8 training techniques to build a 12-billion-parameter foundational model, which was then distilled to 9 billion parameters [17][22].
- Training drew on high-quality data spanning mathematics, code, and multilingual question answering, ensuring a robust pre-training corpus [18][25].
- NVIDIA has also released a comprehensive pre-training dataset, Nemotron-Pre-Training-Dataset-v1, containing 6.6 trillion tokens from diverse domains [25][27].

Group 3: Open Source Commitment
- NVIDIA has committed to open-sourcing the Nemotron models on the HuggingFace platform, providing the 9B model, its base version, and the larger 12B model, along with the associated datasets [25][30].
- This contrasts with other companies shifting toward more closed-source strategies and reflects NVIDIA's ongoing contributions to the open-source community [27].
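The throughput claim rests on the state-space recurrence at the heart of Mamba-style layers: each step updates a fixed-size state, so per-token inference cost is O(1) instead of attention's O(t) over a growing context. Below is a deliberately scalar toy version; real Mamba-2 layers use input-dependent (selective) parameters, multi-dimensional states, and parallel scans, and the fixed coefficients here are illustrative assumptions.

```python
def ssm_scan(xs, a=0.9, b=1.0, c=1.0):
    """Linear state-space recurrence: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t.

    Only the scalar state h survives between steps, which is why
    SSM-style layers avoid attention's growing key-value cache.
    """
    h, ys = 0.0, []
    for x in xs:
        h = a * h + b * x   # fixed-size state update
        ys.append(c * h)    # readout
    return ys
```

A unit impulse decays geometrically through the state, which is the mechanism by which such layers carry (and gradually forget) context without attending back over it.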
From GPT-2 to gpt-oss: A Deep Dive into the Evolution of OpenAI's Open Models
机器之心· 2025-08-18 05:15
Core Insights
- OpenAI has released its first open-weight models since GPT-2 in 2019, gpt-oss-120b and gpt-oss-20b, which can run locally thanks to various optimizations [4][5]
- The article provides a detailed analysis of the architectural advancements from GPT-2 to gpt-oss and compares gpt-oss with Qwen3 [4][5]

Model Architecture Overview
- gpt-oss-20b can run on consumer-grade GPUs with 16 GB of RAM, while gpt-oss-120b requires a single H100 with 80 GB of RAM or more [10]
- The architecture of the gpt-oss models appears conventional, as leading LLM developers often use similar foundational architectures with minor adjustments [10][11]

Changes Since GPT-2
- Significant changes from GPT-2 include the removal of Dropout, the adoption of RoPE for positional encoding, and the replacement of GELU with Swish/SwiGLU [20][22][29]
- Mixture of Experts (MoE) layers increase parameter capacity while maintaining efficiency by activating only a subset of experts for each token [39]
- Grouped Query Attention (GQA) is introduced as a more efficient alternative to Multi-Head Attention (MHA) [41]
- Sliding-window attention is applied in gpt-oss to reduce memory usage and computational cost [47]
- RMSNorm replaces LayerNorm for better efficiency in large-scale LLMs [52]

Comparison with Qwen3
- gpt-oss-20b has a wider architecture with more attention heads, while Qwen3 is deeper, with more transformer blocks [69][70]
- gpt-oss uses fewer but larger experts, whereas Qwen3 uses more, smaller experts [72]
- Both models use grouped query attention, but gpt-oss adds sliding-window attention to limit context size [82]

Additional Insights
- gpt-oss models are designed for inference, allowing users to control the inference workload easily [93]
- Training compute for gpt-oss is estimated at 2.1 million H100 GPU-hours, comparable to other large models [92]
- The MXFP4 optimization allows gpt-oss models to run on a single GPU, enhancing accessibility [98]
- Benchmark results indicate gpt-oss performs comparably to proprietary models, although it has not yet been extensively tested [101][106]
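To make one of the listed changes concrete, here is the LayerNorm → RMSNorm swap as a minimal pure-Python sketch. RMSNorm drops the mean subtraction and bias of LayerNorm, keeping only a root-mean-square rescale plus a learned gain, which saves work at scale; the function name and list-based interface are illustrative, not from the article or any framework.

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: rescale x by its root mean square, then apply a
    per-dimension learned gain. No mean subtraction, no bias."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]
```

With a unit gain, the output always has (approximately) unit RMS regardless of the input's scale, which is the stabilizing property the normalization layer provides.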
In Depth | The Founder of Cerebras, NVIDIA's Latest Challenger, in Conversation with a Former Google Executive: We Are at a Stage Where the Inflection Point Cannot Be Predicted
Z Potentials· 2025-08-15 03:53
Core Insights
- The article discusses the transformative impact of AI on industries, emphasizing the role of open source and data in global AI competition, the challenges of AI safety and alignment, and the limits that power supply places on AGI development [2][16].

Group 1: AI Hardware Innovations
- Cerebras Systems, led by CEO Andrew Feldman, focuses on building the fastest and largest AI computing hardware, crucial for the growing demand for AI technologies [2][3].
- The company's chip is 56 times larger than the largest conventional chip, designed specifically for AI workloads that require massive simple computations and unique memory-access patterns [8][9].
- Collaboration between hardware and software is essential for accelerating AGI development, with a focus on optimizing matrix multiplication and memory-access speed [11][12].

Group 2: Open Source and Global Competition
- The open-source ecosystem is seen as a vital arena for innovation, particularly helping smaller companies and startups compete against firms with significantly more capital [18][19].
- The cost of processing tokens has fallen dramatically, from $100 per million tokens to as low as $1.50 or $2, fostering innovation and broader application of the technology [19].
- Competition in AI is perceived as primarily between the US and China, with emerging markets also adopting Chinese open-source models [18].

Group 3: Power Supply and AGI Development
- Power supply is identified as a critical constraint on AGI development, with high electricity costs in Europe posing particular challenges [42][45].
- The large data centers essential to AI operations will need significant energy resources, such as nuclear power [44][46].
- The future of AGI may depend on building new nuclear power plants to meet the energy demands of advanced AI systems [46].

Group 4: AI Safety and Alignment
- AI alignment refers to ensuring that AI systems reflect human values and norms, with ongoing efforts to develop testing methods that check models for potential dangers [35][36].
- Maintaining alignment in self-improving systems remains a challenge, raising concerns about the risks of releasing advanced AI without proper oversight [37][38].
- Responsibility for AI safety is shared between hardware and software, emphasizing the need for collaboration in addressing these challenges [39].
The Release of GPT-5 Signals That Transformer-Based Large Language Models Are Nearing the End of the Road. Where Is the Next Wave?
老徐抓AI趋势· 2025-08-15 03:00
Core Viewpoint
- The release of GPT-5 marks a significant moment for the AI industry: a shift from a transformative era of large language models to a phase of incremental improvement, suggesting the Transformer architecture may be reaching its limits [6][56].

Performance Analysis
- GPT-5 improves on various core metrics, achieving 94.6% accuracy on the AIME math competition without tools and 100% with tools, but the progress over previous models is less dramatic than in earlier generations [9][12].
- On the HLE ("Humanity's Last Exam") benchmark, GPT-5 Pro scored 42%, a notable increase from the previous model's 24.3% [16].
- On programming, GPT-5 scored 74.9% on the SWE-bench Verified test, slightly surpassing Anthropic's Claude Opus 4.1 [21][24].
- GPT-5 is significantly cheaper than its competitors, with input costs at $1.25 per million tokens, pointing toward price competition in the market [26][27].

Industry Trends
- The GPT-5 launch event was more elaborate but lacked the excitement of earlier launches, reflecting a shift in how OpenAI presents its advances [8][9].
- The AI industry is moving toward a phase where quality and user experience are prioritized alongside raw capability, indicating a maturing market [8][12].
- Potential saturation of training data and parameters suggests the industry may soon struggle to achieve further breakthroughs with current architectures [34][37].

Future Directions
- Two potential directions for AI development are algorithmic innovation, such as hierarchical reasoning models, and upgrading data types to more complex modalities like video and sensor data [38][41].
- The industry is transitioning from competing on quality to competing on price, which could squeeze profit margins [43].

Conclusion
- The release of GPT-5 signifies both a peak and a potential turning point, with future advances likely requiring new architectures or data modalities to sustain growth [56].
A 10,000-Word Analysis of the DeepSeek MoE Architecture!
自动驾驶之心· 2025-08-14 23:33
Core Viewpoint
- The article provides a comprehensive overview of the Mixture of Experts (MoE) architecture, focusing on the evolution and implementation of DeepSeek's MoE models (V1, V2, V3) and their optimizations for token distribution and load balancing [2][21][36].

Group 1: MoE Architecture Overview
- MoE (Mixture of Experts) is a model architecture that uses multiple expert networks to enhance performance, with sparse activation well suited to cloud-scale serving [2][3].
- Interest in MoE surged with the release of Mistral AI's Mixtral model, which highlighted the potential of sparse architectures [2][3].
- The Switch Transformer introduced a routing mechanism in which tokens select their top-K experts, optimizing the processing of diverse knowledge [6][10].

Group 2: DeepSeek V1 Innovations
- DeepSeek V1 addresses two problems in existing MoE practice, knowledge mixing and knowledge redundancy, both of which hinder expert specialization [22][24].
- It introduces fine-grained expert division and shared experts to enhance specialization and reduce redundancy, capturing knowledge more efficiently [25][26].
- A load-balancing mechanism ensures tokens are distributed evenly across experts, mitigating training inefficiencies [32].

Group 3: DeepSeek V2 Enhancements
- DeepSeek V2 builds on V1's design with three key optimizations focused on load balancing [36].
- It limits the number of devices used for routing experts, reducing communication overhead during training and inference [37].
- A new communication load-balancing loss ensures equitable token distribution across devices, further optimizing performance [38].

Group 4: DeepSeek V3 Developments
- DeepSeek V3 changes the MoE layer computation, replacing the softmax gating function with a sigmoid to improve computational efficiency [44].
- It eliminates auxiliary load-balancing losses in favor of a learnable bias term that steers routing, enhancing load balance during training [46].
- A sequence-level auxiliary loss prevents extreme imbalance within individual sequences, ensuring more stable training [49].
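The V3 routing change described above, sigmoid affinities plus a learnable per-expert bias that influences which experts are selected, can be sketched as follows. This is a toy, unbatched version; the function signature and the choice to normalize gate weights over only the selected experts are illustrative assumptions.

```python
import math

def route_tokens(scores, bias, k=2):
    """Sigmoid-gated top-k expert routing with a selection bias.

    scores: raw router logits for one token, one per expert.
    bias: learnable per-expert offsets used only during selection,
          nudging traffic toward underloaded experts.
    Returns {expert_index: gate_weight} for the k chosen experts.
    """
    affinity = [1.0 / (1.0 + math.exp(-s)) for s in scores]  # sigmoid, not softmax
    biased = [a + b for a, b in zip(affinity, bias)]
    topk = sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)[:k]
    # Gate weights come from the unbiased affinities: the bias steers
    # selection without distorting the mixing weights.
    total = sum(affinity[i] for i in topk)
    return {i: affinity[i] / total for i in topk}
```

Raising an expert's bias pulls more tokens toward it without changing how strongly any token weights it, which is how a bias term can replace an auxiliary balancing loss.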
A Thousand Teams Compete! The First "Qizhi Cup" Algorithm Competition Concludes Successfully, Driving AI Applications into Practice
机器之心· 2025-08-14 04:57
Core Viewpoint
- Artificial intelligence is transitioning from theoretical exploration to large-scale application, becoming a new engine for high-quality economic and social development in China [1]

Group 1: Event Overview
- The "Qizhi Cup" algorithm innovation application challenge was officially launched on May 20, 2025 by Qiyuan Laboratory, aiming to promote the practical application of intelligent algorithms [1]
- The competition attracted 1,022 teams from universities, research institutions, and technology companies, with three teams winning across the tracks [2][20]

Group 2: Competition Tracks
- The competition featured three main tracks: "Robust Instance Segmentation of Satellite Remote Sensing Images," "Drone Ground Target Detection for Embedded Platforms," and "Adversarial Challenges for Multimodal Large Models" [4][20]
- Each track focused on a core capability: robust perception, lightweight deployment, or adversarial defense [4]

Group 3: Track Summaries

Robust Instance Segmentation of Satellite Remote Sensing Images
- This track targeted precise segmentation of complex targets in high-resolution remote-sensing images, addressing challenges such as occlusion and domain differences [6]
- The champion team from South China University of Technology used an optimized Co-DETR model, enhancing feature learning through multi-task training [8][9]

Drone Ground Target Detection for Embedded Platforms
- This track required algorithms to achieve high recognition accuracy while operating efficiently on resource-constrained platforms [9][21]
- The winning team, "Duan Yan Wu Ping," achieved high precision under hardware limitations by transitioning from YOLOv11 to a Transformer-based Co-DETR model [10][12]

Adversarial Challenges for Multimodal Large Models
- This track evaluated models on accuracy, robustness, and resistance to attacks in visible-light remote-sensing scenarios [14]
- The winning team from Sun Yat-sen University developed a robust and reliable model using a systematic optimization approach [16][18]

Group 4: Industry Implications
- The "Qizhi Cup" serves as a platform for integrating cutting-edge algorithms with practical applications, emphasizing the adaptability and engineering feasibility of models in dynamic environments [20][21]
- The competition fosters AI talent development, deepening participants' understanding of business and data while bridging the gap between theory and engineering [23]