Transformer Architecture
Lisa Su and Her "De-Nvidia" War
创业邦· 2026-03-18 03:40
Core Viewpoint
- AMD has experienced significant growth under CEO Lisa Su, with its market value increasing from less than $3 billion in 2014 to over $315 billion today, a more than 100-fold increase [5]
- Competition in the AI chip market has evolved into a multi-dimensional battle over performance, cost optimization, and sustainable energy supply, with key players like TSMC and Samsung playing crucial roles in supply chain dynamics [5]
Group 1: AMD's Strategic Moves
- AMD's shift toward AI began in 2018 with the launch of the Instinct series of data center GPUs, marking its entry into AI workloads [9]
- Lisa Su has emphasized accelerating demand for AI, predicting a ten-year AI supercycle, and highlighted the importance of inference capabilities with the introduction of the MI300X series [9][11]
- The MI350 series, announced in mid-2025, claims a 35-fold performance improvement over its predecessor, underscoring AMD's commitment to advancing AI technology [11]
Group 2: Partnerships and Collaborations
- AMD has signed a 6-gigawatt GPU supply agreement with OpenAI that includes warrants for AMD stock, a strategic partnership aligning both companies' interests [19]
- A similar agreement was established with Meta, further solidifying AMD's position in the AI infrastructure market [19]
- These partnerships are seen as transformative for AMD, accelerating procurement and ecosystem development [20]
Group 3: Supply Chain and Production Capacity
- AMD has secured 8% of TSMC's advanced packaging capacity, which is critical for producing its AI chips [27]
- The company anticipates producing 900,000 MI400 chips in 2026, which requires a stable supply of HBM memory, currently a challenge due to competition with Nvidia [27][28]
- Lisa Su's upcoming visit to South Korea aims to secure memory supply agreements with key partners like Samsung, highlighting the importance of supply chain management in AMD's strategy [28][29]
Group 4: Financial Performance and Future Outlook
- AMD reported record revenue, net profit, and free cash flow in 2025, with data center revenue growing 39% year-over-year to $5.4 billion [30]
- The company targets a compound annual growth rate of over 35% in the next three to five years, driven by the expansion of its data center AI business [30]
- AMD's strategy focuses on deepening ties with major clients, investing in future technologies, and building a robust supply chain ecosystem to reduce reliance on Nvidia [30]
2017: Making an Oppenheimer
创业邦· 2026-03-12 10:22
Core Insights
- The article discusses the revolutionary impact of the Transformer architecture introduced in the 2017 Google paper "Attention Is All You Need," which has become the foundation for AI advances including ChatGPT [6][7][13]
- It highlights how major tech companies, particularly Google, initially underestimated the Transformer's significance, being more focused on other AI projects such as AlphaGo and DeepMind [9][10][12]
- The rapid growth of ChatGPT, which gained over 1 million users within five days and 100 million within two months, signals a new industrial revolution in AI [13]
Group 1: Historical Context
- The article traces the evolution of AI from Geoffrey Hinton's 2012 breakthrough in computer vision, which laid the groundwork for AI commercialization [16][18]
- It contrasts the rapid advances in computer vision with the struggles of natural language processing (NLP) before the Transformer's introduction [19][20]
Group 2: Technical Developments
- The Attention mechanism was introduced in Google's GNMT system to improve machine translation but remained limited by the sequential inefficiency of RNNs [24][25]
- The Transformer eliminated RNNs entirely, relying on self-attention and parallel processing to greatly improve computational efficiency (a minimal self-attention sketch follows this summary) [25][26]
Group 3: Competitive Landscape
- OpenAI was the first to leverage the Transformer architecture effectively, leading to the GPT series, beginning with GPT-1 in 2018 [30][31]
- Competition intensified with Google's release of BERT, which outperformed GPT-1 on various benchmarks and marked a divergence in technical philosophy between OpenAI and Google [34][35]
Group 4: Scaling Laws and Industry Impact
- Scaling Laws, the observation that increasing model parameters and compute improves performance, became a focal point of AI development, particularly with the release of GPT-3 [40][41]
- The success of GPT-3, with 175 billion parameters, demonstrated the viability of Scaling Laws and triggered a rush among companies to build competing models [45][46]
Group 5: Ethical Considerations and Future Directions
- Concerns about the ethical implications of AI models, particularly the potential for harmful content, led to InstructGPT, which aimed to align AI outputs with human values [49][50]
- The article concludes by emphasizing the ongoing tension between technological advancement and ethics, suggesting that while humanity is closer to general AI, significant challenges remain [56][57]
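To make the parallelism point concrete, here is a minimal sketch of single-head scaled dot-product self-attention in NumPy. It illustrates only the core operation, not the paper's full multi-head implementation; all sizes and weights are arbitrary placeholders.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention (single head).

    X: (seq_len, d_model) input embeddings. Unlike an RNN, nothing
    here is sequential: all positions are processed together via
    matrix multiplications.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # project inputs
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (seq_len, seq_len) affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # each output mixes all positions

rng = np.random.default_rng(0)
seq_len, d_model = 8, 16
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (8, 16)
```

Every output row is computed from all input positions in one batch of matrix multiplications, which is exactly the property that lets Transformers train in parallel where RNNs had to step token by token.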
Large Models: The Birth of Superhuman Intelligence, Toward a Silicon-Based Civilization
泽平宏观· 2026-03-11 16:06
Core Insights
- AI large models are reshaping how humans live and work, breaking traditional skill barriers and enabling ordinary individuals to become super individuals or one-person companies [3][5]
- Approximately 84% of the global population has never interacted with AI, indicating an infrastructure opportunity comparable to the internet's early days [3][12]
- The development of large models will follow five decisive trends, including exponential growth in demand for reasoning compute and the shift from pre-training to post-training as a core breakthrough [4][25]
Group 1: Social Impact of AI Large Models
- AI assistants such as ChatGPT and Gemini represent a shift toward more intelligent tools capable of complex tasks [5]
- Skill barriers are being dismantled, allowing anyone to become a creator; natural-language tools now let non-coders build software [5][6]
- Education systems will need to adapt, emphasizing critical thinking and creativity over rote memorization as traditional skills lose value [9][10]
- AI will transform work and life paradigms, ushering in a collaborative era in which AI acts as a second brain for individuals [10][11]
- Access to top-tier professional services in fields like healthcare and law will become more equitable, broadening access to expert knowledge [11]
Group 2: Cognitive Divide and AI Accessibility
- Despite AI's productivity gains, a cognitive divide is emerging: with 84% of the global population lacking AI exposure, non-adopters risk marginalization [12][15]
- AI penetration remains low, with only about 16% of people having used even free AI tools, leaving the majority outside the technology's benefits [12][15]
Group 3: Technical Foundations of AI Large Models
- At their core, large models predict the next word from vast data, using algorithms loosely analogous to human brain function (a toy sketch of this objective follows this summary) [16]
- The 2017 breakthrough of the Transformer architecture enabled parallel processing and efficient computation [17]
- Emergent capabilities appear when model parameters exceed a critical threshold, yielding enhanced reasoning and thinking abilities [18]
Group 4: Future Development Directions
- The focus is shifting from raw computational power to algorithm optimization and sensory evolution, favoring more efficient models [20]
- The industry is moving from a "power race" to an "architecture revolution," emphasizing algorithmic efficiency and multi-modal processing [20][21]
- The global landscape of AI large models is consolidating, with leading companies establishing dominance through superior technology and data resources [21][29]
Group 5: Trends in AI Large Models
- Demand for reasoning compute will grow exponentially as AI applications go mainstream, sharply increasing consumption of computational resources [25]
- Post-training will become essential for overcoming algorithmic bottlenecks, focusing on specific tasks and vertical scenarios [26]
- World models will gain traction, enabling AI to understand physical laws and interact with real environments [27]
- AI capability will concentrate in leading companies, particularly in China, which are positioned to dominate the global market [29]
- Ensuring alignment between human values and AI decision-making will be critical as AI capabilities surpass human intelligence [30]
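As a concrete illustration of "predicting the next word," the toy sketch below scores a bigram count model with the next-token negative log-likelihood, the same objective shape large models are trained on. The bigram model and corpus are stand-in assumptions, not any production system.

```python
import math
from collections import Counter, defaultdict

# Toy corpus; real models train on trillions of tokens.
corpus = "the cat sat on the mat the cat ate".split()

# A bigram count model stands in for a large model's
# next-token distribution P(next | context).
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def p_next(prev, token):
    total = sum(counts[prev].values())
    return counts[prev][token] / total if total else 0.0

# Training loss = average negative log-likelihood of the next token.
nll = [-math.log(p_next(p, n)) for p, n in zip(corpus, corpus[1:])]
print(f"avg next-token NLL: {sum(nll) / len(nll):.3f}")
```

A large model replaces the bigram table with a Transformer over the full context, but the score it is trained to minimize has this same form.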
FlashAttention-4 Officially Released: A Major Overhaul of the Algorithm Pipeline, with Matrix-Multiplication-Level Speed
机器之心· 2026-03-06 04:31
Core Insights
- FlashAttention-4 has officially launched after a year of development, a major update to this line of deep learning optimization technology [1]
- Core author Tri Dao reports that the attention mechanism now executes nearly as fast as matrix multiplication on Blackwell GPUs [1]
Hardware Trends
- The AI industry is rapidly transitioning to Blackwell-architecture systems such as the B200 and GB200, which exhibit asymmetric hardware scaling [5]
- Tensor Core throughput has risen 2.25x from Hopper H100 to Blackwell B200, while shared memory bandwidth has remained roughly unchanged [6]
Attention Mechanism Optimization
- FlashAttention-4 maximizes the overlap between matrix multiplication and other bottleneck resources, reaching up to 1605 TFLOPs/s on B200 (BF16), a 71% utilization rate [10]
- The new algorithm adds mechanisms to break through bottlenecks, including polynomial approximations of the exponential function and a new online softmax that avoids 90% of unnecessary rescaling (a schematic of the online-softmax recurrence follows this summary) [1][10]
Collaborative Design Features
- The design exploits Blackwell's new hardware features, including Tensor Memory (TMEM) and fully asynchronous fifth-generation Tensor Cores [12]
- 2-CTA MMA allows UMMA operations to be shared across two CTAs, reducing redundant data transfer and resource usage [13]
Performance Benchmarking
- FlashAttention-4 outperforms cuDNN 9.13 and Triton in forward and backward passes, with speedups of 1.1–1.3x and 2.1–2.7x, respectively [19]
- The results indicate that FlashAttention-4 can significantly improve the efficiency of attention mechanisms in long-sequence scenarios [19]
Community Impact
- The release has generated significant interest, with PyTorch announcing backend support so researchers can prototype custom attention variants more efficiently [24][26]
- Under constrained workloads, users can see 1.2–3.2x improvements over Triton, removing the trade-off between flexibility and high performance [28]
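For readers unfamiliar with the online-softmax recurrence these kernels build on, here is a small NumPy schematic. It is not FlashAttention-4's CUDA kernel, and the specific heuristic credited with avoiding 90% of rescales is not public in this summary; the sketch shows only the standard tile-by-tile recurrence, in which rescaling is needed exactly when a tile raises the running maximum.

```python
import numpy as np

def online_softmax_weighted_sum(scores, values, tile=4):
    """Streaming softmax(scores) @ values, one tile at a time.

    Keeps a running max m, normalizer l, and accumulator acc.
    Rescaling of l and acc is only needed when a tile raises the
    running max, the step FlashAttention-style kernels try to skip.
    """
    m, l = -np.inf, 0.0
    acc = np.zeros(values.shape[-1])
    for i in range(0, len(scores), tile):
        s, v = scores[i:i + tile], values[i:i + tile]
        m_new = max(m, s.max())
        if m_new != m:                      # rescale only on a new max
            scale = np.exp(m - m_new)       # exp(-inf) = 0 on the first tile
            l, acc, m = l * scale, acc * scale, m_new
        w = np.exp(s - m)
        l += w.sum()
        acc += w @ v
    return acc / l

rng = np.random.default_rng(1)
scores, values = rng.normal(size=16), rng.normal(size=(16, 8))
ref = np.exp(scores - scores.max()); ref /= ref.sum()
assert np.allclose(online_softmax_weighted_sum(scores, values), ref @ values)
```

In a real kernel the rescale multiplies registers holding partial outputs, so skipping it whenever the running max is unchanged saves meaningful non-matmul work.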
DeepSeek Update Panned as Colder and Dumber: "More Cringe-Inducing Than the Sentimental Youth Literature of 20 Years Ago." Industry Insiders: This Version Is a "Speed Edition" That Trades Quality for Speed
Mei Ri Jing Ji Xin Wen· 2026-02-12 16:42
Core Insights
- DeepSeek has begun gray-release (limited rollout) testing of its flagship model, extending the context window to 1 million tokens, up from the 128K tokens of version 3.1 released last August [1][6]
- User feedback points to a shift in the model's interaction style, with complaints about a perceived loss of personality and warmth; the model's "coldness" became a trending topic on social media [1][4]
- DeepSeek version 4 is expected in mid-February 2026; the current build is a speed-optimized iteration that sacrifices some quality for performance testing [6]
User Experience
- Users report that the model now addresses them as "users" rather than by personalized nicknames, prompting dissatisfaction with its emotional engagement [4][5]
- Some users find the model overly objective and rational, while others appreciate its increased attention to the user's psychological state rather than only the questions asked [5]
Technical Developments
- DeepSeek's V-series models are designed for optimal performance, with the V3 model marking a significant milestone thanks to its efficient MoE architecture (a generic top-k routing sketch follows this summary) [6][7]
- Recent innovations include the mHC architecture for optimizing information flow in deep Transformers and the Engram memory module, which separates static knowledge from dynamic computation and cuts the cost of long-context reasoning [7]
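The summary does not describe DeepSeek's MoE internals, so the sketch below shows only the generic mixture-of-experts idea it alludes to: a gate routes each token to its top-k experts, so compute scales with k rather than with the total expert count. All dimensions, the gating rule, and the expert shapes are illustrative assumptions, not DeepSeek's actual design.

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, n_experts, top_k = 16, 8, 2

# Each "expert" stands in for a small feed-forward sub-network.
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]
W_gate = rng.normal(size=(d_model, n_experts))

def moe_layer(x):
    """Route one token to its top-k experts; only k of n_experts run."""
    logits = x @ W_gate
    top = np.argsort(logits)[-top_k:]          # chosen expert indices
    gate = np.exp(logits[top] - logits[top].max())
    gate /= gate.sum()                         # renormalized gate weights
    # Compute cost scales with top_k, not with n_experts.
    return sum(g * (x @ experts[i]) for g, i in zip(gate, top))

token = rng.normal(size=d_model)
print(moe_layer(token).shape)  # (16,)
```

This sparsity is why MoE models can grow total parameters far faster than per-token compute, the efficiency property the article credits to the V3 line.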
Why Chinese Models Lead in AI Video
Hua Er Jie Jian Wen· 2026-02-11 04:25
Core Insights
- The emergence of ByteDance's Seedance 2.0 marks a significant shift: Chinese models in AI video are no longer just catching up but leading [1]
- Seedance 2.0 represents a deeper change in AI video, turning it into a stable industrial product rather than a mere artistic endeavor [1]
Group 1: Historical Context
- Chinese companies have held a clear lead in AI video for some time, with Kuaishou's Keling 2.0 achieving a 367% advantage over Sora in character consistency and generation stability [2]
- Stability is crucial in AI video: it determines whether characters stay consistent and whether generated results can be reliably reproduced [2]
Group 2: Methodological Evolution
- Several Chinese companies have continued to advance along the same path, integrating video generation into workflows for e-commerce, advertising, and gaming [3]
- Chinese models lead in AI video because they treat video generation as an engineering problem rather than merely a matter of model intelligence [3]
Group 3: Technical Foundations
- Generating complex data through a process of destruction and reconstruction is the idea behind Diffusion models, foundational to AI video generation (a sketch of the forward noising process follows this summary) [3][4]
- Diffusion models excel at generating visually appealing content but lack an understanding of event order and causality, which leads to disjointed video output [5][6]
Group 4: Structural Understanding
- The Transformer architecture supplies the missing piece, understanding relationships and sequences in video, complementing the capabilities of Diffusion models [6][8]
- A clear division of labor has emerged, with Transformers handling structural planning and Diffusion models the actual content generation [8][15]
Group 5: Practical Applications
- Chinese model teams recognized that the core challenge in video generation lies in execution rather than generation alone, decomposing traditional filmmaking processes into model constraints [14][18]
- This engineering approach has optimized content production pipelines, making AI video generation a reliable industrial capability rather than an artistic experiment [18][22]
Group 6: Future Implications
- The significance of Seedance 2.0 lies in stabilizing the "prompt-generation-final product" pipeline, making it a practical tool for users [20]
- While Chinese models still trail in knowledge-intensive fields such as large language models, they lead in process-intensive areas like AI video thanks to engineering efficiency and scalable implementation [21][22]
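The "destruction and reconstruction" process can be sketched directly. Below is the standard closed-form forward (noising) step of a diffusion process under a toy linear noise schedule; the learned reverse denoiser, which does the "reconstruction," is omitted, and the schedule values are assumptions chosen for illustration.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # toy linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)     # cumulative signal retention

def q_sample(x0, t, rng):
    """Forward diffusion, jumping straight to step t in closed form:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise
    """
    noise = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * noise

rng = np.random.default_rng(3)
x0 = rng.normal(size=(8, 8))            # stand-in for an image or video frame
for t in (0, 499, 999):                 # progressively destroyed signal
    xt = q_sample(x0, t, rng)
    corr = np.corrcoef(x0.ravel(), xt.ravel())[0, 1]
    print(t, round(float(corr), 3))
```

The printed correlation between x0 and x_t falls toward zero as t grows, which is the "destruction" half; a trained network learns to run this process backwards, frame by frame, which is where the ordering and causality problems described above arise.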
Tsinghua and Qwen Join Forces to Reshape the Normalization Paradigm, Returning Transformers to "Deep" Learning
机器之心· 2026-02-10 11:03
Core Insights
- The article introduces SiameseNorm, a novel architecture that reconciles the trade-offs between Pre-Norm and Post-Norm in Transformer models, improving both training stability and representation capacity [4][34]
Group 1: Background and Context
- "Siamese twins," a term originating with the famous 19th-century conjoined twins from Siam, was adopted in neural networks for Siamese Networks, which use shared weights to measure input similarity [2]
- Pre-Norm and Post-Norm are two critical paradigms in modern AI, both aimed at stabilizing large-model training [2][3]
Group 2: Challenges with Existing Norms
- Pre-Norm suffers from a "depth failure" problem: deep parameters contribute little to the model's representational capability, limiting its "effective depth" [3]
- Post-Norm has higher representational potential but introduces significant training instability, making it hard to use in modern Transformer pre-training paradigms [3][10]
Group 3: SiameseNorm Architecture
- SiameseNorm uses a dual-stream architecture that decouples optimization dynamics, letting Pre-Norm and Post-Norm characteristics coexist without compromising either (the norm-placement contrast is sketched after this summary) [7][19]
- Each residual block receives combined gradients from both paradigms, enabling stable training at high learning rates without added computational cost [7][20]
Group 4: Experimental Validation
- On a 1.3-billion-parameter model, SiameseNorm achieved a perplexity (PPL) of 10.57, outperforming both Pre-Norm and Post-Norm architectures [22][25]
- On arithmetic tasks, SiameseNorm raised accuracy from 28.1% (Pre-Norm) to 39.6%, a 40.9% relative improvement, demonstrating gains in effective depth and reasoning [24]
Group 5: Mechanism Insights
- Analysis shows both streams maintain significant weight contributions, indicating effective use of features from both Pre-Norm and Post-Norm [27]
- The Post-Norm stream dominates final predictions, suggesting it chiefly enhances feature expression once training stabilizes [31][32]
Group 6: Conclusion
- SiameseNorm elegantly integrates the robustness of Pre-Norm with the expressive potential of Post-Norm, offering a clear path for developers targeting higher learning rates and deeper Transformer networks [34]
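The Pre-Norm/Post-Norm distinction underlying the paper is simply where LayerNorm sits relative to the residual connection. The PyTorch sketch below contrasts the two placements in one residual block; SiameseNorm's actual dual-stream merge rule is not given in this summary, so it is only noted in a comment rather than implemented.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One Transformer-style residual block with switchable norm placement."""
    def __init__(self, d, mode="pre"):
        super().__init__()
        self.norm = nn.LayerNorm(d)
        self.ff = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.mode = mode

    def forward(self, x):
        # Pre-Norm: x + F(LN(x)). The identity path is untouched, which is
        # stable but prone to the "depth failure" the article describes.
        if self.mode == "pre":
            return x + self.ff(self.norm(x))
        # Post-Norm: LN(x + F(x)). More expressive, harder to train.
        # SiameseNorm (per the summary) runs both placements as coupled
        # streams over shared weights; its merge rule is not public here.
        return self.norm(x + self.ff(x))

x = torch.randn(2, 10, 64)
print(Block(64, "pre")(x).shape, Block(64, "post")(x).shape)
```

Pre-Norm preserves an unnormalized residual highway (easy gradients, weaker effective depth), while Post-Norm normalizes the sum (stronger expression, fragile training); this is exactly the trade-off SiameseNorm is said to dissolve.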
Big Tech's AI Power Handover: The Post-90s Generation Collectively Steps Up
虎嗅APP· 2026-02-03 13:52
Core Viewpoint
- The article discusses the generational shift in leadership at major Chinese tech companies, particularly in AI, highlighting the rise of younger leaders more attuned to rapid AI advances and the waning relevance of older executives' experience in this fast-evolving landscape [4][30]
Group 1: Leadership Changes in Major Tech Companies
- At major firms like Tencent and Alibaba, younger leaders, primarily of the post-90s generation, are taking charge of AI initiatives, marking a significant shift in the industry [5][6]
- Tencent has recently appointed young talents such as Yao Shunyu and Pang Tianyu, seen as pivotal figures in the company's AI strategy, a departure from traditional leadership models [5][6]
- Alibaba's Lin Junyang, a key figure behind the Qwen model, exemplifies younger leaders driving innovation and community engagement in AI [20][21]
Group 2: The Value of Experience vs. New Approaches
- Traditional tech experience is losing value under new AI paradigms, where intuition and rapid adaptation matter more than established practice [7][12]
- The "Transformer-native generation" of young leaders was exposed early to foundational AI research, letting them navigate modern AI development without the constraints of outdated methodologies [8][11]
- The rapid pace of AI progress is forcing changes in decision-making structures, as seen in Tencent's restructuring to let younger leaders report directly to top executives [16][18]
Group 3: The Role of Community and Open Source
- Younger leaders such as Lin Junyang grasp the importance of community and open-source collaboration in AI, in contrast to the more insular approaches of earlier generations [20][21]
- The success of models like Qwen is attributed to strong community engagement, reflecting a shift in competitive strategy within the AI sector [20][21]
Group 4: ByteDance's Unique Approach
- ByteDance diverges from Tencent and Alibaba by integrating experienced leaders like Wu Yonghui, who brings deep knowledge from Google and focuses on system-level integration rather than pure innovation [22][24]
- This reflects ByteDance's need for cohesive integration of AI capabilities across its platforms, contrasting with competitors' more experimental focus [24][25]
Group 5: The Inevitable Power Transition
- The transition to younger leadership is framed as a natural evolution driven by the rapid pace of knowledge acquisition in AI, where innovative thinking overshadows accumulated experience [27][30]
- In the AI era, the ability to adapt to and understand new technologies matters more than accumulated experience, marking a significant shift in workplace dynamics [30][31]
AI Is Here: Why Can't Big Tech Keep Its Executives? | Barron's Picks
Tai Mei Ti APP· 2026-01-26 10:44
Core Insights
- The article discusses the transition of tech executives from large companies to startups, driven by the AI revolution and the limits of traditional corporate structures [2][5][24]
- It identifies two waves of entrepreneurs: "tech believers" focused on model development, and "business translators" who prioritize commercialization [17][20]
Group 1: Reasons for Departure
- Executives are leaving large firms because established corporate cultures conflict structurally with the innovative demands of AI development [5][9]
- The rise of AI technologies, particularly the Transformer architecture, has prompted many to seek opportunities outside their companies, where they can pursue innovative projects free of bureaucratic constraints [5][6]
- Slow decision-making in large firms hinders rapid innovation, pushing talented individuals toward ventures where they can explore new ideas more freely [11][12]
Group 2: Characteristics of Departing Executives
- Departing executives often combine deep technical knowledge with a strong understanding of AI, making them valuable assets in the startup ecosystem [17][25]
- They can integrate resources and build teams, crucial for the collaborative nature of AI projects [25]
- Their insight into industry needs and market demand positions them to identify and capitalize on new business opportunities [25][26]
Group 3: Challenges Faced by Large Firms
- Large companies struggle to retain talent because of lengthy decision-making processes and a culture that prioritizes risk minimization over opportunity maximization [10][11]
- Attractive compensation packages fail to address the underlying problems of organizational structure and innovation [10][12]
- The inability to provide an environment conducive to experimentation and risk-taking further worsens talent retention [12][13]
Group 4: Investment Trends
- Investors increasingly favor executives from major tech firms, viewing them as reliable signals of potential success in an uncertain AI landscape [24][25]
- This shift reflects capital's broader effort to mitigate the risks of new technologies by backing experienced leaders [24][26]
- An emerging "hunting mechanism" among investors shows a proactive approach to identifying and supporting promising talent from large companies [27][28]
The Harvard-Dropout "Three Musketeers" Building AI Chips Just Raised RMB 3.5 Billion
创业邦· 2026-01-24 04:10
Core Viewpoint
- The rise of specialized chips, particularly ASICs designed for Transformer-based AI models, is challenging the dominance of general-purpose GPUs like NVIDIA's. Etched.ai, a startup founded by Harvard dropouts, has recently raised $500 million, pushing its valuation near $5 billion, and aims to reshape the AI hardware landscape with its dedicated chips [4][19]
Company Overview
- Etched.ai was founded by Gavin Uberti, Chris Zhu, and Robert Wachen, who all dropped out of Harvard to build ASIC chips specifically for Transformer models, distinguishing the company from general-purpose GPU manufacturers [4][8]
- The company has drawn significant semiconductor talent, including experts from Intel and other tech giants, to strengthen its chip design and development [13]
Technology and Product
- The flagship Sohu chip is designed to run Transformer models far more efficiently than general-purpose GPUs, achieving 90% hardware utilization versus an average of 30% for GPUs (a back-of-envelope utilization comparison follows this summary) [18][22]
- The company claims one Sohu chip matches the performance of 160 NVIDIA H100 GPUs while consuming less power, positioning it as a more economical and efficient choice for enterprises needing specialized AI processing [18]
Market Position and Strategy
- Etched.ai targets a niche in the AI inference market by focusing solely on the Transformer architecture, which it expects to dominate the AI landscape, enabling optimized performance and reduced energy consumption [15][22]
- Multiple successful funding rounds signal strong investor confidence in its technology and market potential; the latest round was led by Stripes Group and included notable investors such as Peter Thiel and Palantir [19][20]
Competitive Landscape
- Specialized chip companies like Etched.ai and Groq represent an industry shift toward dedicated AI accelerators rather than general-purpose GPUs, driven by the realization that most compute is spent on similar model architectures [22][23]
- Etched.ai joins a new wave of companies challenging established players like NVIDIA with chips optimized specifically for AI workloads, particularly inference tasks [23][27]
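The utilization claim reduces to simple arithmetic: effective throughput is peak throughput times utilization. The sketch below works through the comparison with a placeholder peak figure, not a published spec for either chip.

```python
# Back-of-envelope: effective throughput = peak * utilization.
# The peak value is a placeholder assumption, not a published spec.
peak_tflops = 1000.0                  # assume the same nominal peak per chip

gpu_effective = peak_tflops * 0.30    # ~30% utilization cited for GPUs
asic_effective = peak_tflops * 0.90   # ~90% cited for the Sohu ASIC

print(f"GPU effective:  {gpu_effective:.0f} TFLOPs")
print(f"ASIC effective: {asic_effective:.0f} TFLOPs")
print(f"ratio: {asic_effective / gpu_effective:.1f}x at equal peak FLOPs")
```

On utilization alone the gap is only 3x, so the claimed 160x-H100 equivalence would have to come mostly from raw per-chip throughput and architectural specialization rather than utilization.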