Transformer

Flash Attention Author's Latest Podcast: NVIDIA's GPU Dominance Will End Within Three Years
量子位· 2025-09-29 04:57
How much longer can NVIDIA stay "dominant"? No more than three years! Does achieving AGI require a new architecture? No, the Transformer is enough! "Inference costs have dropped 100x in recent years, and another 10x drop is likely!" These "hot takes" come from Tri Dao, the author of Flash Attention and a co-author of Mamba. He is also Chief Scientist at TogetherAI and a professor at Princeton University. SemiAnalysis has praised his contributions to the NVIDIA ecosystem as an important part of its moat. In the latest episode of the podcast Unsupervised Learning, Tri Dao shared in-depth views on the GPU market, inference costs, model architectures, and future AI trends, and laid out a reasoned case for the "hot takes" above: within the next 2-3 years, as specialized chips emerge for different classes of workloads, including low-latency agentic systems, high-throughput batch processing, and interactive chatbots, the AI hardware landscape will shift from NVIDIA's current roughly 90% dominance toward a more diverse ecosystem. Techniques such as MoE architectures, inference optimization, model quantization, and model-hardware co-design are driving ...
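Part of the quoted cost curve comes from quantization, which shrinks weight storage and the memory traffic that dominates decode-time inference. As a rough illustration (not from the podcast; the helper names and toy tensor below are our own), here is a minimal sketch of symmetric int8 weight quantization:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: w is approximated by scale * q."""
    scale = np.abs(w).max() / 127.0                      # map the largest weight to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)      # toy weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print("bytes fp32:", w.nbytes, "bytes int8:", q.nbytes)  # 4x smaller footprint
print("max abs error:", np.abs(w - w_hat).max())         # small rounding error
```

Storing weights in int8 cuts their footprint 4x versus fp32; this is one lever, alongside MoE sparsity and hardware co-design, behind the cost declines Tri Dao describes.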
The First Transformable Dual-Screen Gaming Handheld Is $700
CNET· 2025-09-24 04:16
>> I wanted to, especially if I wanted to pop this screen under. No.
>> Let's say yes.
>> But what a device. Plus, it's very cool.
>> The OneXSugar. Yeah, it's nice to play with a transformer. ...
Saining Xie Recalls His OpenAI Interview Seven Years Ago: Whiteboard Coding, a Five-Hour Meeting, and Leaving After Dark
机器之心· 2025-08-29 09:53
Core Insights
- The article discusses the unique interview experiences of AI researchers at major tech companies, highlighting the differences in interview styles and the focus areas of these companies [1][9][20].

Group 1: Interview Experiences
- Lucas Beyer, a researcher with extensive experience at top AI firms, initiated a poll about memorable interview experiences at companies like Google, Meta, and OpenAI [2][20].
- Saining Xie shared that his interviews at various AI companies were unforgettable, particularly the rigorous two-hour marathon at DeepMind, which involved solving over 100 math and machine learning problems [5][6].
- The interview process at Meta was described as more academic, focusing on discussions with prominent researchers rather than just coding [6][7].

Group 2: Company-Specific Insights
- The interview style at Google Research was likened to an academic job interview, with a significant emphasis on research discussions rather than solely on coding challenges [7].
- OpenAI's interview process involved a lengthy session focused on a reinforcement learning problem, showcasing the company's commitment to deep research engagement [8][9].
- The interview questions reflect the research priorities of these companies, such as Meta's focus on computer vision and OpenAI's emphasis on reinforcement learning [9][20].

Group 3: Notable Interviewers and Candidates
- Notable figures like John Schulman and Noam Shazeer served as interviewers, indicating the high caliber of talent involved in hiring at these firms [7][9].
- Candidates shared memorable moments from their interviews, such as solving complex problems on napkins or engaging in deep discussions about research topics [19][20].
A New Round of Intelligent-Driving Competition Enters the Real-World Stage
Hu Xiu· 2025-08-27 10:38
Core Viewpoint
- The recent surge in advancements in intelligent driving technology is driven by several key factors, leading to a new competitive landscape in the industry [2][3][20].

Group 1: Industry Dynamics
- Major players in the intelligent driving sector have recently launched their latest capabilities, reminiscent of past industry waves driven by end-to-end models [2][3].
- Competition has intensified amid regulatory pressure and public sentiment, which delayed some companies' planned releases [6][20].
- Companies are prioritizing the release of "basic versions" of their technologies to avoid being outpaced by competitors [6][20].

Group 2: Technological Advancements
- The VLA model has surpassed previous end-to-end models, with its performance floor exceeding the ceiling of earlier technologies [3][5].
- The transition from CNN to Transformer architectures has significantly enhanced the model's ability to mimic human cognitive processes [8][10].
- The VLA model incorporates a chain-of-thought (CoT) capability, allowing more logical and reliable decision-making in complex driving scenarios [11][12].

Group 3: Model Features and Performance
- The VLA model's architecture is designed to handle perception, reasoning, and action execution, making it better suited to intelligent driving than previous models [9][10].
- The model's ability to analyze and respond to potential risks has improved, enabling a form of "defensive driving" akin to human behavior [12][13][17].
- The VLA model can interpret traffic signs and respond to verbal commands, enhancing user interaction and operational efficiency [17][20].

Group 4: Future Outlook
- The VLA model is still in an optimization phase, and significant improvements are needed to fully realize its CoT capabilities [18][19].
- The industry is expected to focus on data collection and on refining the performance of intelligent driving models in the coming period [19][20].
- The cost of implementing VLA technology varies: higher-end models are more adaptable, while lower-cost models may require further optimization [20].
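To make the perception-reasoning-action split concrete, here is a hedged toy sketch of a VLA-style forward pass; the module names, sizes, and wiring are illustrative assumptions, not any vendor's production stack:

```python
import torch
import torch.nn as nn

class ToyVLA(nn.Module):
    """Illustrative vision-language-action pipeline: perceive -> reason -> act."""
    def __init__(self, d=256, n_actions=3):
        super().__init__()
        self.vision = nn.Sequential(nn.Conv2d(3, 32, 8, stride=8), nn.ReLU(),
                                    nn.Flatten(), nn.LazyLinear(d))   # camera frame -> feature vector
        self.reason = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True),
            num_layers=2)                                             # fuse scene + instruction tokens
        self.act = nn.Linear(d, n_actions)                            # e.g. steering, throttle, brake

    def forward(self, image, instruction_tokens):
        scene = self.vision(image).unsqueeze(1)                       # (B, 1, d) scene token
        tokens = torch.cat([scene, instruction_tokens], dim=1)        # prepend scene to language tokens
        fused = self.reason(tokens)                                   # joint "reasoning" pass
        return self.act(fused[:, 0])                                  # act from the scene slot

model = ToyVLA()
out = model(torch.randn(2, 3, 64, 64), torch.randn(2, 8, 256))
print(out.shape)  # torch.Size([2, 3])
```

The chain-of-thought behavior described in the article would sit inside the reasoning stage (intermediate token generation before action decoding), which this sketch deliberately omits.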
DiT Comes Under Sudden Attack, and Saining Xie Responds Calmly
量子位· 2025-08-20 07:48
Core Viewpoint
- The article discusses recent criticisms of DiT (Diffusion Transformers), a cornerstone model in the diffusion field, and highlights the importance of scientific scrutiny and empirical validation in research [3][10].

Group 1: Criticism of DiT
- A user raised multiple concerns about DiT, claiming it is flawed both mathematically and structurally, even questioning whether DiT contains genuine Transformer elements [4][12].
- The criticisms draw on the paper "TREAD: Token Routing for Efficient Architecture-agnostic Diffusion Training," which introduces a strategy that passes early-layer tokens to deeper layers without modifying the architecture or adding parameters [12][14].
- The critic argues that the rapid decrease in FID (Fréchet Inception Distance) early in training indicates that DiT's architecture has inherent properties that let it learn the dataset too easily [15].
- TREAD reportedly trains 14 times faster than DiT after 400,000 iterations and 37 times faster at its best performance after 7 million iterations, suggesting that gains this large may undermine previous methods [16][17].
- The critic also suggests that disabling parts of the network during training should render the network ineffective [19].
- It is noted that the more network units in DiT are replaced with identity mappings during training, the better the model's evaluation results become [20].
- DiT's architecture is said to require logarithmic scaling to represent the signal-to-noise-ratio differences across the diffusion process, pointing to potential issues with output dynamics [23].
- Concerns are raised about the adaptive layer normalization method, suggesting that DiT processes conditional inputs through a standard MLP (multi-layer perceptron) without clear Transformer characteristics [25][26].

Group 2: Response from Saining Xie
- Saining Xie, an author of DiT, responded to the criticisms, asserting that TREAD's findings do not invalidate DiT [27].
- He acknowledges TREAD's contributions but argues its effectiveness comes from regularization enhancing feature robustness, not from DiT being incorrect [28].
- He notes that Lightning DiT, an upgraded version of DiT, remains a strong option and should be preferred when conditions allow [29].
- He also states that there is no evidence the post-layer normalization in DiT causes problems [30].
- He summarizes improvements made over the past year, focusing on internal representation learning and various methods for enhancing model training [32].
- He identifies the sd-vae (the Stable Diffusion variational autoencoder used by DiT) as the model's real pain point, particularly its high computational cost for processing images at 256×256 resolution [34].
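For context on the adaptive-layer-normalization dispute, here is a minimal sketch of adaLN-style conditioning in the spirit of DiT, where a small MLP regresses per-channel scale and shift from the conditioning embedding. It is a simplification; variable names and sizes are our own, not the original implementation:

```python
import torch
import torch.nn as nn

class AdaLNBlock(nn.Module):
    """Simplified adaLN conditioning: an MLP maps the conditioning embedding
    to per-channel scale/shift applied after a parameter-free LayerNorm."""
    def __init__(self, d):
        super().__init__()
        self.norm = nn.LayerNorm(d, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.SiLU(), nn.Linear(d, 2 * d))  # cond -> (scale, shift)

    def forward(self, x, cond):
        scale, shift = self.mlp(cond).chunk(2, dim=-1)
        # broadcast the conditioning over all tokens
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

block = AdaLNBlock(d=256)
x = torch.randn(2, 16, 256)      # (batch, tokens, channels)
cond = torch.randn(2, 256)       # timestep/class embedding
print(block(x, cond).shape)      # torch.Size([2, 16, 256])
```

The critic's point is that this conditioning path is "just an MLP"; the response is that the attention blocks it modulates are where the Transformer character of DiT lives.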
Is DiT Mathematically and Formally Wrong? Saining Xie Responds: Don't Do Science in Your Head
机器之心· 2025-08-20 04:26
Core Viewpoint
- The article discusses criticisms of the DiT model, including alleged architectural flaws, and introduces a new method called TREAD that reportedly improves training efficiency and image-generation quality compared to DiT [1][4][6].

Group 1
- A recent post on X claims that DiT has architectural defects, sparking significant discussion [1].
- TREAD achieves a 14x to 37x training speedup on the FID metric when applied to the DiT backbone, indicating better generation quality per unit of training [2][6].
- The post argues that DiT's FID stabilizes too early during training, suggesting it has "latent architectural defects" that prevent further learning from the data [4].

Group 2
- TREAD employs a "token routing" mechanism to enhance training efficiency without altering the model architecture, using a saved partial token set to reduce computational cost [6].
- Saining Xie, an author of the original DiT paper, acknowledges the criticisms while emphasizing experimental validation over theoretical assertion [28][33].
- He also concedes that DiT's architecture has some real weaknesses, particularly its use of post-layer normalization, which is known to be unstable for tasks with large variations in numerical range [13][36].

Group 3
- The article notes that DiT relies on a simple MLP network to process critical conditional data, which limits its expressive power [16].
- Saining Xie argues that DiT's real issue lies in its sd-vae component, which is inefficient and was overlooked for a long time [36].
- The ongoing debate around DiT reflects the iterative nature of algorithmic progress, in which existing models are continuously questioned and improved [38].
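The token-routing idea the summary attributes to TREAD can be sketched as follows; the random selection rule, merge point, and function names are simplifying assumptions for illustration, not the paper's exact algorithm:

```python
import torch
import torch.nn as nn

def tread_route(tokens, blocks, keep_ratio=0.5):
    """Illustrative token routing in the spirit of TREAD: a subset of tokens
    skips the middle blocks via a saved identity path and is merged back
    afterwards, so those blocks process fewer tokens per step."""
    B, N, D = tokens.shape
    n_skip = N - int(N * keep_ratio)
    perm = torch.randperm(N)
    routed, kept = perm[:n_skip], perm[n_skip:]

    saved = tokens[:, routed]            # identity path: no compute, no new parameters
    x = tokens[:, kept]
    for blk in blocks:                   # only kept tokens pay for the middle layers
        x = blk(x)

    out = torch.empty_like(tokens)
    out[:, kept] = x
    out[:, routed] = saved               # reinsert the saved tokens downstream
    return out

blocks = nn.ModuleList(
    [nn.TransformerEncoderLayer(64, 4, batch_first=True) for _ in range(2)])
print(tread_route(torch.randn(2, 16, 64), blocks).shape)  # torch.Size([2, 16, 64])
```

Routing is used only during training; because no parameters are added or removed, the full network is still available at inference, which is why the method is architecture-agnostic.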
The Starting Point of End-to-End VLA: A Chat About Large Language Models and CLIP
自动驾驶之心· 2025-08-19 07:20
Core Viewpoint
- The article discusses the development and significance of end-to-end (E2E) algorithms in autonomous driving, emphasizing how advanced techniques such as large language models (LLMs), diffusion models, and reinforcement learning (RL) are integrated to enhance autonomous systems [21][31].

Summary by Sections

Section 1: Overview of End-to-End Autonomous Driving
- The first chapter gives a comprehensive overview of the evolution of end-to-end algorithms, explaining the transition from modular pipelines to end-to-end solutions and discussing the advantages and challenges of each paradigm [40].

Section 2: Background Knowledge
- The second chapter covers the technical stack around end-to-end systems, detailing the roles of LLMs, diffusion models, and reinforcement learning, which are crucial for understanding the field's job market [41][42].

Section 3: Two-Stage End-to-End Systems
- The third chapter examines two-stage end-to-end systems: their emergence, advantages, and disadvantages, with a review of notable works such as PLUTO and CarPlanner [42][43].

Section 4: One-Stage End-to-End and VLA
- The fourth chapter highlights one-stage end-to-end systems, covering subfields such as perception-based methods and the latest advances in VLA (Vision-Language-Action) models, which are pivotal to the ultimate goals of autonomous driving [44][50].

Section 5: Practical Application and RLHF Fine-Tuning
- The fifth chapter includes a major project on RLHF (Reinforcement Learning from Human Feedback) fine-tuning, with practical guidance on building pre-training and reinforcement-learning modules applicable to VLA-related algorithms [52].

Course Structure and Learning Outcomes
- The course aims to give participants a solid grounding in end-to-end autonomous driving technologies, covering the essential frameworks and methodologies and preparing them for roles in the industry [56][57].
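Since CLIP is one of the building blocks named in the title, here is a hedged sketch of CLIP-style contrastive training; the toy embeddings stand in for real image/text encoder outputs, and the function name is our own:

```python
import torch
import torch.nn.functional as F

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss over an (N x N) similarity matrix:
    matched image-text pairs sit on the diagonal."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature          # scaled cosine similarities
    targets = torch.arange(len(logits))           # i-th image matches i-th caption
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# toy embeddings standing in for encoder outputs on a batch of 8 pairs
loss = clip_style_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```

This shared image-text embedding space is what makes CLIP a natural perception front-end for the VLA models the course builds toward.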
Musk: Google Is the Most Likely Leader of the AI Industry
36Kr· 2025-08-15 01:21
Core Viewpoint
- Elon Musk praised Google as the most likely leader in the artificial intelligence industry due to its significant computational and data advantages, while also indicating that this situation may change in the coming years [1].

Group 1: Google's Position in AI
- Google is recognized as a cornerstone of the AI field, having introduced the Transformer in its groundbreaking 2017 paper "Attention Is All You Need," the architecture underpinning large language models like ChatGPT [1].
- Google has invested in AI startups such as Anthropic and Safe Superintelligence, holding a 14% stake in Anthropic [1].
- In its recent Q2 earnings call, Google announced a $10 billion increase in capital expenditures for the year, raising the total to $85 billion, aimed at meeting the growing demand for chips and AI products [2].

Group 2: Musk's Disputes and xAI
- Musk's comments coincided with escalating tensions with OpenAI CEO Sam Altman, including Musk's threat to sue Apple for promoting OpenAI's ChatGPT over his own xAI's Grok chatbot [3].
- Musk and Altman co-founded OpenAI in 2015, but their relationship soured after Musk left the board in 2018 due to differing views [3].
- Musk established xAI in July 2023, raising over $12 billion across multiple funding rounds later that year [3].
- Tesla is set to hold a shareholder vote on whether to invest in xAI, although the timing of the vote has not been specified [4].
- Musk indicated that if the decision were his alone, Tesla would already have invested in xAI [5].
Wang Guan Again: A 27M-Parameter Model Beats o3-mini! The Post-00s Founder Who Turned Down Musk Really Is Different
Sohu Caijing· 2025-08-10 04:21
Core Insights
- The article discusses the Hierarchical Reasoning Model (HRM), a new AI model from Sapient Intelligence that outperforms much larger models with far fewer parameters [3][5][18].
- HRM uses techniques inspired by how the brain works, allowing it to perform complex reasoning tasks efficiently without relying on traditional pre-training methods [4][12][14].

Model Performance
- With only 27 million parameters, HRM surpassed larger models like o3-mini-high and Claude 3.7 on various tests, achieving 40.3% accuracy on the ARC-AGI challenge [16][18].
- On extreme Sudoku tasks, HRM demonstrated near-perfect accuracy, while traditional models struggled significantly [16][18].

Technical Innovations
- HRM employs a dual-module recurrent design that mimics the brain's hierarchical processing and time-scale separation, balancing global direction-setting with efficient local execution [4][7].
- The model incorporates a hierarchical convergence mechanism to avoid premature convergence, allowing it to adaptively set new goals as the high-level state updates [9][11].
- It uses approximate gradient techniques to reduce memory usage and computational cost, in line with biological learning patterns [12].
- A deep supervision mechanism allows periodic evaluation and adjustment during learning, helping correct deviations promptly [13][14].

Developer Background
- HRM's creator, Wang Guan, is a young entrepreneur who previously declined offers from major tech figures, including Elon Musk, aiming instead to rethink AI architecture [20][22].
- Wang co-founded Sapient Intelligence in 2024, focusing on models with strong reasoning and planning capabilities [22].
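A hedged sketch of the dual-timescale recurrence described above: a slow high-level module periodically resets the goal while a fast low-level module iterates toward it. Dimensions, cell types, and update rules are illustrative assumptions, not Sapient Intelligence's implementation:

```python
import torch
import torch.nn as nn

class ToyHRM(nn.Module):
    """Two coupled recurrent modules on different time scales: the low-level
    cell runs several inner steps per single high-level update, echoing the
    hierarchy and time-scale separation the article attributes to HRM."""
    def __init__(self, d=128, inner_steps=4):
        super().__init__()
        self.high = nn.GRUCell(d, d)   # slow module: updates the goal state
        self.low = nn.GRUCell(d, d)    # fast module: iterates toward the goal
        self.inner_steps = inner_steps

    def forward(self, x, n_cycles=3):
        z_h = torch.zeros_like(x)      # high-level (goal) state
        z_l = torch.zeros_like(x)      # low-level (working) state
        for _ in range(n_cycles):
            for _ in range(self.inner_steps):   # fast inner loop
                z_l = self.low(x + z_h, z_l)    # condition on the current goal
            z_h = self.high(z_l, z_h)           # slow update: set a new goal
        return z_h

model = ToyHRM()
print(model(torch.randn(2, 128)).shape)  # torch.Size([2, 128])
```

The nested loop is the key structural point: because the low-level state is re-driven by a fresh high-level goal each cycle, the system avoids settling into a single fixed point too early, which is the "hierarchical convergence" idea in the summary.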
The Autonomous Driving Heart Technical Discussion Groups Are Here!
自动驾驶之心· 2025-07-29 07:53
Core Viewpoint
- The article announces the establishment of a leading communication platform for autonomous driving technology in China, covering industry, academic, and career-development topics [1].

Group 1
- The platform, named "Autonomous Driving Heart" (自动驾驶之心), aims to facilitate discussion and exchange among professionals across the many subfields of autonomous driving technology [1].
- The technical discussion groups cover a wide range of topics, including large models, end-to-end systems, VLA, BEV perception, multi-modal perception, occupancy, online mapping, 3DGS, multi-sensor fusion, Transformers, point-cloud processing, SLAM, depth estimation, trajectory prediction, high-precision maps, NeRF, planning and control, model deployment, autonomous-driving simulation and testing, product management, hardware configuration, and AI job exchange [1].
- Interested individuals are encouraged to join the community by adding the WeChat assistant and providing their company/school, nickname, and research direction [1].