Transformer Architecture

From GPT-2 to gpt-oss: A Deep Dive into the Evolution of OpenAI's Open Models
机器之心· 2025-08-18 05:15
Core Insights
- OpenAI has released gpt-oss-120b and gpt-oss-20b, its first open-weight models since GPT-2 in 2019; thanks to various optimizations they can be run locally [4][5]
- The article provides a detailed analysis of the architectural advances from GPT-2 to gpt-oss and compares gpt-oss with Qwen3 [4][5]

Model Architecture Overview
- gpt-oss-20b can run on consumer-grade GPUs with 16 GB of memory, while gpt-oss-120b requires a single H100 GPU with 80 GB of memory or more [10]
- The gpt-oss architecture looks fairly conventional, as leading LLM developers tend to build on similar foundational architectures with minor adjustments [10][11]

Changes Since GPT-2
- The article highlights the major changes since GPT-2, including the removal of Dropout, the adoption of RoPE for positional encoding, and the replacement of GELU with Swish/SwiGLU (see the code sketch after this summary) [20][22][29]
- Mixture of Experts (MoE) layers increase parameter capacity while maintaining efficiency by activating only a subset of experts for each token [39]
- Grouped Query Attention (GQA) is introduced as a more efficient alternative to Multi-Head Attention (MHA) [41]
- gpt-oss applies sliding-window attention to reduce memory usage and computational cost [47]
- RMSNorm replaces LayerNorm for better efficiency in large-scale LLMs [52]

Comparison with Qwen3
- gpt-oss-20b has a wider architecture with more attention heads, while Qwen3 is deeper, with more transformer blocks [69][70]
- gpt-oss uses fewer but larger experts, whereas Qwen3 uses a larger number of smaller experts [72]
- Both models use grouped query attention, but gpt-oss additionally applies sliding-window attention to limit the attended context [82]

Additional Insights
- The gpt-oss models are reasoning models, and users can easily adjust the reasoning effort [93]
- The training compute for gpt-oss is estimated at 2.1 million H100 GPU hours, comparable to other large models [92]
- MXFP4 quantization allows the gpt-oss models to run on a single GPU, improving accessibility [98]
- Benchmark results indicate that gpt-oss performs comparably to proprietary models, although it has not yet been extensively tested in the wild [101][106]
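To make two of the architectural changes listed above concrete, here is a minimal PyTorch sketch of RMSNorm in place of LayerNorm and a SwiGLU feed-forward block in place of the GELU MLP. This is an illustrative toy with made-up dimensions, not gpt-oss's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """RMSNorm rescales by the root-mean-square of the activations.
    Unlike LayerNorm it skips mean subtraction and the bias term,
    which saves computation at large scale."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class SwiGLUFeedForward(nn.Module):
    """Gated feed-forward block: Swish(x @ W1) * (x @ W3), then project back.
    This is the SwiGLU variant that replaced the GELU MLP used in GPT-2."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate branch
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # value branch
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

# Toy usage with arbitrary dimensions: normalize, then apply the gated MLP.
x = torch.randn(2, 8, 64)                       # (batch, tokens, hidden)
y = SwiGLUFeedForward(64, 256)(RMSNorm(64)(x))
print(y.shape)                                  # torch.Size([2, 8, 64])
```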
In Depth | Cerebras Founder, NVIDIA's Latest Challenger, in Conversation with a Former Google Executive: We Are at a Stage Where the Inflection Point Cannot Be Predicted
Z Potentials· 2025-08-15 03:53
Core Insights
- The article discusses the transformative impact of AI on industries, emphasizing the role of open source and data in global AI competition, the challenges of AI safety and alignment, and power supply as a limit on the development of AGI [2][16]

Group 1: AI Hardware Innovations
- Cerebras Systems, led by CEO Andrew Feldman, focuses on building the fastest and largest AI computing hardware, which is crucial for the growing demand for AI technologies [2][3]
- The company's wafer-scale chip is 56 times larger than the largest conventional chip and is designed specifically for AI workloads, which require massive amounts of simple computation and unusual memory-access patterns [8][9]
- Close collaboration between hardware and software is essential for accelerating AGI development, with a focus on optimizing matrix multiplication and memory-access speed [11][12]

Group 2: Open Source and Global Competition
- The open-source ecosystem is seen as a vital area for innovation, particularly benefiting smaller companies and startups competing against far better-capitalized firms [18][19]
- The cost of processing tokens has fallen dramatically, from $100 per million tokens to as low as $1.50 or $2, fostering innovation and broader application of the technology [19]
- Competition in AI is perceived to be primarily between the US and China, with emerging markets also adopting Chinese open-source models [18]

Group 3: Power Supply and AGI Development
- Power supply is identified as a critical limitation for AGI development, with high electricity costs in Europe posing particular challenges [42][45]
- The discussion highlights the need for significant energy resources, such as nuclear power, to support the large data centers essential for AI operations [44][46]
- The article suggests that the future of AGI may depend on building new nuclear power plants to meet the energy demands of advanced AI systems [46]

Group 4: AI Safety and Alignment
- AI alignment means ensuring that AI systems reflect human values and norms, with ongoing efforts to develop testing methods that check models for potential dangers [35][36]
- The hard problem is maintaining alignment in self-improving systems, raising concerns about the risks of releasing advanced AI without proper oversight [37][38]
- Responsibility for AI safety is shared between hardware and software, underscoring the need for collaboration in addressing these challenges [39]
GPT-5's Release Signals That Transformer-Based Large Language Models Are Nearing Their Limit; Where Is the Next Wave?
老徐抓AI趋势· 2025-08-15 03:00
Core Viewpoint
- The release of GPT-5 marks a significant moment for the AI industry: the era of transformative leaps in large language models is giving way to incremental improvement, suggesting that the Transformer architecture may be approaching its limits [6][56]

Performance Analysis
- GPT-5 improves on various core metrics, for example reaching 94.6% accuracy on the AIME math competition without tools and 100% with tools, but the gains over previous models are less dramatic than before [9][12]
- On Humanity's Last Exam (HLE), GPT-5 Pro scored 42%, a notable increase from the previous model's 24.3% [16]
- For programming, GPT-5 scored 74.9% on SWE-bench Verified, slightly surpassing Anthropic's Claude Opus 4.1 [21][24]
- GPT-5 is significantly cheaper to use than its competitors, with input priced at $1.25 per million tokens (see the cost illustration after this summary), pointing to potential price competition in the market [26][27]

Industry Trends
- The GPT-5 launch event was more elaborate yet lacked the excitement of earlier launches, reflecting a shift in how OpenAI presents its advances [8][9]
- The AI industry is moving into a phase where quality and user experience are prioritized alongside raw capability, indicating a maturing market [8][12]
- The likely saturation of training data and parameter scaling suggests the industry may soon struggle to achieve further breakthroughs with current architectures [34][37]

Future Directions
- Two potential directions for future AI development are algorithmic innovation, such as hierarchical reasoning models, and richer data types, including more complex modalities like video and sensor data [38][41]
- The industry is shifting from competing on "better quality" to competing on "lower prices," which could squeeze profit margins [43]

Conclusion
- The release of GPT-5 marks both a peak and a potential turning point for the AI landscape, with further advances likely requiring new architectures or new data modalities to sustain growth [56]
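As a rough illustration of the pricing point above, the snippet below computes the cost of a single request at the quoted $1.25-per-million-token input price; the output price and the token counts are hypothetical placeholders for illustration, not figures from the article.

```python
INPUT_PRICE_PER_M = 1.25    # USD per million input tokens (quoted in the article)
OUTPUT_PRICE_PER_M = 10.00  # hypothetical output price, not taken from the article

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request at the prices above."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# e.g. an 8,000-token prompt with a 1,000-token completion
print(f"${request_cost(8_000, 1_000):.4f}")   # $0.0200
```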
A 10,000-Word Analysis of the DeepSeek MoE Architecture!
自动驾驶之心· 2025-08-14 23:33
Core Viewpoint
- The article gives a comprehensive overview of the Mixture of Experts (MoE) architecture, focusing on the evolution and implementation of DeepSeek's MoE models (V1, V2, V3) and their optimizations for token routing and load balancing [2][21][36]

Group 1: MoE Architecture Overview
- MoE (Mixture of Experts) is a model architecture that uses multiple expert networks to raise capacity; its sparse activation makes it well suited to large-scale cloud deployment [2][3]
- Interest in the MoE architecture surged with the release of Mistral AI's Mixtral model, which highlighted the potential of sparse architectures [2][3]
- The Switch Transformer introduced a routing mechanism in which each token selects its top-K experts, letting different experts specialize in different kinds of knowledge [6][10]

Group 2: DeepSeek V1 Innovations
- DeepSeek V1 addresses two main issues in existing MoE practice, knowledge mixing and knowledge redundancy, both of which hinder expert specialization [22][24]
- The model introduces fine-grained expert division and shared experts to improve specialization and reduce redundancy, allowing knowledge to be captured more efficiently [25][26]
- The architecture includes a load-balancing mechanism to spread tokens evenly across experts, mitigating training inefficiencies [32]

Group 3: DeepSeek V2 Enhancements
- DeepSeek V2 builds on V1's design with three key optimizations focused on load balancing [36]
- The model limits the number of devices a token's routed experts may span, reducing communication overhead during training and inference [37]
- A new communication load-balancing loss ensures tokens are distributed equitably across devices, further improving efficiency [38]

Group 4: DeepSeek V3 Developments
- DeepSeek V3 changes how the MoE layer computes routing scores, replacing the softmax with a sigmoid to improve computational efficiency (see the toy routing sketch after this summary) [44]
- The model drops the auxiliary load-balancing losses and instead uses a learnable bias term to steer routing, which balances load during training [46]
- A sequence-level auxiliary loss is added to prevent extreme imbalance within individual sequences, making training more stable [49]
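The routing ideas summarized above can be sketched in a few lines. The toy layer below combines a shared expert with sigmoid-scored top-K routing and a learnable bias that affects only expert selection, loosely in the spirit of the DeepSeek V3 description; it is a simplified illustration rather than DeepSeek's actual code, and all dimensions and module choices are arbitrary.

```python
import torch
import torch.nn as nn

class ToyMoELayer(nn.Module):
    """Toy MoE layer: one shared expert sees every token, and each token is
    routed to its top-K experts by sigmoid gate scores; a learnable bias
    steers which experts get picked but does not rescale their outputs."""
    def __init__(self, dim: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, n_experts, bias=False)
        self.route_bias = nn.Parameter(torch.zeros(n_experts))   # load-balancing bias
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_experts)])
        self.shared_expert = nn.Linear(dim, dim)                  # always active

    def forward(self, x: torch.Tensor) -> torch.Tensor:          # x: (tokens, dim)
        scores = torch.sigmoid(self.router(x))                   # sigmoid, not softmax
        _, top_idx = torch.topk(scores + self.route_bias, self.top_k, dim=-1)
        gates = torch.gather(scores, -1, top_idx)                 # bias-free gate weights
        gates = gates / gates.sum(dim=-1, keepdim=True)           # normalize chosen gates
        routed = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e                      # tokens routed to expert e
                if mask.any():
                    routed[mask] += gates[mask, slot:slot + 1] * expert(x[mask])
        return self.shared_expert(x) + routed

tokens = torch.randn(16, 32)                  # 16 tokens, hidden size 32
print(ToyMoELayer(32)(tokens).shape)          # torch.Size([16, 32])
```

In a real system the per-expert loop would be replaced by batched dispatch across devices, which is exactly where the device-limited routing and communication-balancing losses described for V2 come in.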
A Thousand Teams Compete! The First "Qizhi Cup" Algorithm Competition Concludes Successfully, Accelerating Real-World AI Deployment
机器之心· 2025-08-14 04:57
Core Viewpoint
- Artificial intelligence is transitioning from theoretical exploration to large-scale application, becoming a new engine for high-quality economic and social development in China [1]

Group 1: Event Overview
- The "Qizhi Cup" algorithm innovation application challenge was officially launched on May 20, 2025, by Qiyuan Laboratory, aiming to promote the practical application of intelligent algorithms [1]
- The competition attracted 1,022 teams from universities, research institutions, and technology companies, with three teams winning across the different tracks [2][20]

Group 2: Competition Tracks
- The competition featured three main tracks: "Robust Instance Segmentation of Satellite Remote Sensing Images," "Drone Ground Target Detection for Embedded Platforms," and "Adversarial Challenges for Multimodal Large Models" [4][20]
- Each track focused on core capabilities such as robust perception, lightweight deployment, and adversarial defense [4]

Group 3: Track Summaries
Robust Instance Segmentation of Satellite Remote Sensing Images
- This track aimed at precise segmentation of complex targets in high-resolution remote sensing images, addressing challenges like occlusion and domain differences [6]
- The champion team from South China University of Technology utilized an optimized Co-DETR model, enhancing feature learning through multi-task training [8][9]
Drone Ground Target Detection for Embedded Platforms
- This track required algorithms to achieve high recognition accuracy while operating efficiently on resource-constrained platforms [9][21]
- The winning team, "Duan Yan Wu Ping," achieved high precision under hardware limitations by transitioning from YOLOv11 to a Transformer-based Co-DETR model [10][12]
Adversarial Challenges for Multimodal Large Models
- This track evaluated models on accuracy, robustness, and resistance to attacks in visible-light remote sensing scenarios [14]
- The winning team from Sun Yat-sen University developed a robust and reliable model using a systematic optimization approach [16][18]

Group 4: Industry Implications
- The "Qizhi Cup" serves as a platform for integrating cutting-edge algorithms with practical applications, emphasizing the adaptability and engineering feasibility of models in dynamic environments [20][21]
- The competition fosters AI talent development, enhancing participants' understanding of business and data while bridging the gap between theory and engineering [23]
Farewell to the Transformer, Reshaping the Machine Learning Paradigm: Shanghai Jiao Tong University's First "Brain-Like" Large Model Is Born
机器之心· 2025-08-13 09:29
Core Viewpoint
- The article discusses the introduction of BriLLM, a new language model inspired by human brain mechanisms, which aims to overcome the limitations of traditional Transformer-based models, such as high computational demands, lack of interpretability, and context size restrictions [3][8]

Group 1: Limitations of Current Models
- Current Transformer-based models face three main issues: high computational requirements, black-box interpretability, and context size limitations [6][8]
- The self-attention mechanism in Transformers has a time and space complexity of O(n²), leading to increased computational costs as input length grows [7]
- The internal logic of Transformers lacks transparency, making it difficult to understand the decision-making process within the model [7][8]

Group 2: Innovations of BriLLM
- BriLLM introduces a new learning mechanism called SiFu (Signal Fully-connected Flowing), which replaces traditional prediction operations with signal transmission, mimicking the way neural signals operate in the brain (a toy illustration follows this summary) [9][13]
- The model architecture is based on a directed graph in which all nodes are interpretable, unlike traditional models that offer limited interpretability only at the input and output layers [9][19]
- BriLLM supports unlimited context length without increasing model parameters, allowing long sequences to be handled efficiently [15][16]

Group 3: Model Specifications
- BriLLM comes in two versions, BriLLM-Chinese and BriLLM-English, with non-sparse model sizes of 16.90 billion parameters for both languages [21]
- The sparse version of the Chinese model has 2.19 billion parameters and the English version 0.96 billion, a parameter reduction of roughly 90% [21]
- The design allows multiple modalities to be integrated, so the model can process not just language but also visual and auditory inputs [25][26]

Group 4: Future Prospects
- The team aims to develop a multi-modal brain-inspired AGI framework that integrates perception and motion [27]
- BriLLM has been selected for funding under Shanghai Jiao Tong University's "SJTU 2030" plan, which supports groundbreaking research projects [27]
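As a rough intuition for the signal-flow idea described above, the toy below treats vocabulary tokens as nodes of a dense directed graph and picks the next token as the node receiving the strongest accumulated signal from the context. This is an assumption-laden sketch of the general concept only; it is not BriLLM's SiFu algorithm, its edge parameterization, or its training procedure.

```python
import numpy as np

# Toy setup: every vocabulary token is a graph node; edge_weight[i, j] is the
# strength of the directed edge from token i to token j (random here, learned
# in a real model). All names and numbers are illustrative assumptions.
vocab = ["the", "cat", "sat", "on", "mat"]
rng = np.random.default_rng(0)
edge_weight = rng.random((len(vocab), len(vocab)))

def next_token(context_ids: list[int]) -> str:
    """Propagate a unit signal from each context node along its outgoing
    edges and pick the node where the accumulated signal is largest."""
    signal = np.zeros(len(vocab))
    for node in context_ids:
        signal += edge_weight[node]        # signal flows to all neighbours
    signal[context_ids] = -np.inf          # toy simplification: skip context nodes
    return vocab[int(np.argmax(signal))]

print(next_token([0, 1]))                  # continuation after "the cat"
```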
A Deep Conversation on the GPT-5 Launch: The Backlash of Over-Marketing and the Impasse in AI Breakthroughs
Hu Xiu· 2025-08-12 09:05
Core Insights
- GPT-5 has been released, but it does not represent a significant step towards Artificial General Intelligence (AGI) [1]
- The launch event revealed several issues, including presentation errors and reliance on debunked theories, which highlighted weaknesses in the Transformer architecture [1]
- Despite these shortcomings, GPT-5 is still considered a competent AI product, and OpenAI plans to implement aggressive commercialization strategies in key sectors [1]

Technical Development
- The development of GPT-5 faced various technical bottlenecks, leading to the choice of a specific architecture to overcome these challenges [1]
- The limitations of the scaling law have been encountered, raising questions about future technological pathways for AI advancement [1]

Commercial Strategy
- OpenAI aims to rapidly establish a presence in three main application areas: education, healthcare, and programming [1]
- The company's approach suggests a focus on leveraging GPT-5's capabilities to solidify its market position [1]
Guotai Haitong | Industry: The Technical Evolution and Industry Insights of AI Agents
国泰海通证券研究· 2025-08-08 09:24
Core Insights
- The evolution of AI Agents is fundamentally driven by the paradigm shift towards large language models (LLMs) as the "brain," with commercial value demonstrated through vertical applications that address specific industry pain points with high precision [1][2]
- AI Agents are reshaping software development and human-computer interaction, transitioning from traditional architectures to modern LLM-based frameworks that enable autonomous planning, environmental perception, and tool invocation [1][2]

Technical Evolution
- The core of the AI Agent's technical advancement lies in the changes introduced by modern LLM-based architectures, moving away from traditional agent designs limited by hardware and pre-programmed rules [2]
- The modern LLM-based agent architecture consists of three main modules, brain, perception, and action, and allows multiple specialized agents to collaborate or compete to overcome the limitations of a single agent on complex tasks (see the sketch after this summary) [2]

Industry Chain Formation
- A complete industry chain is emerging: upstream is dominated by a few tech giants providing foundational models and computing power, while midstream sees the rise of open-source frameworks and platforms that lower development barriers [3]
- Downstream applications fall into general-purpose agents for complex multi-step tasks and vertical agents deeply integrated with industry knowledge, showing significant commercial value in sectors like software development, law, finance, and healthcare [3]

Challenges and Future Trajectory
- Despite rapid advances, AI Agents face challenges such as limits on LLM planning and reasoning, context window constraints, memory bottlenecks, multi-agent collaboration issues, and evaluation difficulties [3]
- The future development of AI Agents will depend on the continued evolution of foundational LLMs, the spread of multimodal perception, and the restructuring of the software and hardware ecosystem, moving closer to AGI [3]
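A minimal sketch of the brain/perception/action loop described above, assuming a hypothetical call_llm function standing in for the foundational model and toy tool functions in place of real integrations; the tool-call format is an arbitrary convention for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable

def call_llm(prompt: str) -> str:
    """Placeholder for the LLM 'brain' (in practice an API call).
    Here it returns a canned tool-use decision so the loop runs end to end."""
    return "TOOL:search:latest LLM agent frameworks"

@dataclass
class Agent:
    tools: dict[str, Callable[[str], str]]              # action: available tools
    memory: list[str] = field(default_factory=list)     # perception / context store

    def step(self, observation: str) -> str:
        self.memory.append(observation)                  # perceive the environment
        decision = call_llm("\n".join(self.memory))      # plan with the LLM "brain"
        if decision.startswith("TOOL:"):                 # act by invoking a tool
            _, name, arg = decision.split(":", 2)
            result = self.tools[name](arg)
            self.memory.append(f"tool {name} returned: {result}")
            return result
        return decision                                   # or answer directly

agent = Agent(tools={"search": lambda q: f"3 results for '{q}'"})
print(agent.step("user asks about agent frameworks"))
```

Multi-agent setups repeat this loop per agent and route each agent's output into the others' memories, which is where the collaboration and evaluation challenges noted above arise.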
You Can Clearly Feel That Programmer Interviews Have Changed...
猿大侠· 2025-07-23 03:25
Core Viewpoint
- The article emphasizes the importance of integrating existing programming skills with large model technologies to enhance career prospects in the AI field, rather than abandoning current skills [1]

Summary by Sections
Course Overview
- A course titled "Large Model Application Development Practical Training" is designed to help developers master AI application development from scratch through practical projects and code breakdowns [1]
- The course includes insights from industry experts and real case studies from major companies, providing participants with high-paying job opportunities and internal referrals [1][15]
Course Content
- The curriculum covers essential concepts such as RAG (Retrieval-Augmented Generation), AI Agents, and the Transformer architecture, focusing on practical applications and fine-tuning techniques [9][11]
- It consists of five modules: basics, tools, advanced topics, competitions, and practical applications, ensuring a comprehensive learning path [9]
Target Audience
- The course is aimed at developers looking to connect with product teams, build technical barriers, avoid job insecurity, and enhance their skills for future career development [13]
- It is particularly relevant for programmers concerned about job stability as they age, especially those nearing the 35-year mark [13]
Success Metrics
- The course has served over 20,000 students, receiving positive feedback and helping many secure high-paying job offers [11]
- Participants learn to customize models for specific industries such as manufacturing, healthcare, and finance, improving task accuracy and efficiency [11]
Practical Experience
- The course includes detailed case studies of popular AI applications, allowing participants to gain hands-on experience and build a portfolio of practical projects [16]
- Students learn to implement AI technologies in various business scenarios, enhancing their employability [16]
Career Development
- The course offers insights into current job market trends for large model roles, including salary expectations and career growth opportunities [20]
- Continuous internal referral opportunities are provided, ensuring participants have a direct pathway to high-paying positions at leading companies [20]
Lately, the Hiring Market for Programmers Has Gone Crazy...
程序员的那些事· 2025-07-22 03:48
Core Viewpoint
- The article emphasizes the importance of integrating existing programming skills with large model technologies to enhance career prospects and salary opportunities in the AI field [1]

Group 1: Course Offerings
- A course titled "Large Model Application Development Practical Training" is designed to help developers master the complete AI application development process through practical projects and code breakdowns [1]
- The course covers essential technologies such as RAG, AI Agents, and the Transformer architecture, providing a comprehensive learning path from basics to advanced applications [8]
- The course has served over 20,000 students and has received positive feedback, with many participants securing high-paying job offers [10]

Group 2: Learning Outcomes
- Participants learn to fine-tune mainstream large models like DeepSeek and Qwen for specific scenarios, improving model performance and task accuracy [10]
- The course includes practical applications of RAG technology for efficient knowledge retrieval and generation in sectors such as law, healthcare, and finance [10]
- Students also learn to design and develop AI Agents for multi-task collaboration and complex problem-solving in industry-specific contexts [10]

Group 3: Career Development
- The course aims to help participants build technical barriers, avoid job insecurity, and support their career development over the next 20 years [12]
- It offers insights into current job market trends, salary expectations, and career paths from the perspective of hiring managers [19]
- The program provides reliable internal referral opportunities and direct hiring benefits, facilitating quicker access to high-paying job offers [19]