Freshly graduated MIT PhD prodigy snapped up by former OpenAI CTO, with annual pay possibly starting at 3 million RMB
36Kr · 2026-01-09 08:12
Core Insights
- Guangxuan Xiao, a PhD graduate from MIT, has officially joined Thinking Machines to focus on pre-training large models [1][6][10]
- His academic background includes dual degrees from Tsinghua University in Computer Science and Finance, along with numerous awards and research experiences [6][8][10]

Group 1: Academic and Professional Background
- Guangxuan Xiao graduated from Tsinghua University with dual degrees in Computer Science and Finance, receiving multiple prestigious awards during his studies [6][8]
- He completed his PhD at MIT under the supervision of Professor Song Han, focusing on efficient algorithms and systems for large language models [10][18]
- Xiao has interned at major tech companies, including Meta and NVIDIA, where he contributed to research on efficient attention mechanisms and large language model optimization [10][12][18]

Group 2: Research Contributions
- Xiao's doctoral thesis addresses significant challenges in large language models, proposing solutions for issues such as memory overflow and slow inference [18][19]
- His research introduced SmoothQuant, which achieves lossless quantization of billion-parameter models without retraining, and enabled constant-memory streaming inference for long sequences (a minimal sketch of the scale-migration idea behind SmoothQuant follows this summary) [19][20]
- The thesis also covers DuoAttention and XAttention, which improve performance while reducing memory usage [19][20]

Group 3: Company Insights
- Thinking Machines offers competitive salaries, with average base salaries reaching $500,000, significantly higher than those at established companies such as OpenAI and Anthropic [21][25]
- The company is positioned to attract top talent in the AI field, reflecting the ongoing talent war in Silicon Valley [21][28]
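The SmoothQuant work mentioned in Group 2 rests on a simple observation: activation outliers that make INT8 quantization lossy can be migrated into the weights through a per-channel scale, after which both operands quantize well with no retraining. Below is a minimal PyTorch sketch of that scale migration; the helper names, the α = 0.5 migration strength, the symmetric per-tensor INT8 quantizer, and the toy tensors are illustrative assumptions, not the released SmoothQuant implementation.

```python
import torch

def smooth_scales(act_absmax, weight, alpha=0.5):
    """Per-input-channel smoothing scale s_j = max|X_j|^alpha / max|W_j|^(1-alpha).

    act_absmax: per-channel max |X| from calibration data, shape [in_features]
    weight:     linear weight, shape [out_features, in_features]
    Dividing activations by s and multiplying weight columns by s leaves
    X @ W.T unchanged but evens out the dynamic ranges of both operands.
    """
    w_absmax = weight.abs().amax(dim=0)
    return (act_absmax.pow(alpha) / w_absmax.pow(1 - alpha)).clamp(min=1e-5)

def quantize_int8(t):
    """Symmetric per-tensor INT8 quantization (sketch only)."""
    step = t.abs().max() / 127.0
    return (t / step).round().clamp(-127, 127), step

# Toy calibration: activations with a few outlier channels, random weight.
x = torch.randn(4, 16) * torch.linspace(0.1, 10.0, 16)   # [batch, in_features]
w = torch.randn(8, 16)                                    # [out_features, in_features]

s = smooth_scales(x.abs().amax(dim=0), w)                 # migrate outliers into the weight
x_q, sx = quantize_int8(x / s)                            # in practice x/s is fused into the preceding LayerNorm
w_q, sw = quantize_int8(w * s)
y_int8 = (x_q @ w_q.t()) * (sx * sw)                      # integer-valued matmul, rescaled to float
print((y_int8 - x @ w.t()).abs().max())                   # only small rounding error remains
```

Because the scale cancels exactly inside the matrix product, the only error introduced is INT8 rounding, which is why this kind of method needs no retraining.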
Song Han and co-authors propose FlashMoBA: 7.4x faster than MoBA, with sequences scaling to 512K without memory overflow
机器之心 · 2025-11-18 05:08
Core Insights
- The article introduces Mixture of Block Attention (MoBA), a novel attention mechanism that applies the mixture-of-experts (MoE) principle to attention, allowing the model to decide autonomously which positions to attend to [2][4]
- MoBA shows significant potential for long contexts by letting each query sparsely attend to a limited number of key-value blocks, greatly reducing computational cost (see the routing sketch after this summary) [3][4]
- The article identifies the performance problems of existing MoBA implementations at small block sizes and introduces FlashMoBA, a hardware-friendly CUDA kernel designed to execute MoBA efficiently under small-block configurations [7][12]

Performance Analysis
- The original MoBA implementation hits performance bottlenecks at smaller block sizes, running slower than dense attention [11][41]
- FlashMoBA optimizes MoBA's execution, achieving up to 14.7x speedup over FlashAttention-2 in small-block scenarios [8][43]
- Experiments show that reducing the block size from 512 to 128 improves perplexity from 20.9 to 19.7 and RULER accuracy from 38.8% to 56.0% for a 340M-parameter model [30][31]

Technical Improvements
- The article outlines two main improvement paths for MoBA: using smaller block sizes and applying short convolutions to the keys to improve routing accuracy [5][36]
- FlashMoBA employs a three-kernel design that minimizes memory-access inefficiency and aligns computation with the GPU architecture, significantly improving performance [16][21]
- The forward kernel uses a "collect and densify" strategy to handle MoBA's irregular sparsity, which is crucial for efficient computation [22][26]

Experimental Results
- Experiments on 8× H100 80GB GPUs show that the optimized MoBA model outperforms dense attention across a range of benchmarks [30][39]
- Key-convolution variants (kconv3 and kconv5) improve model quality, with kconv3 raising language-modeling accuracy from 45.1% to 45.6% for the 340M model [36][37]
- Overall, the results indicate that small block sizes are essential for MoBA to match the performance of dense attention [41][42]
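To make the block-routing idea concrete, here is a minimal single-head PyTorch sketch of MoBA-style sparse attention: keys are grouped into fixed-size blocks, each query scores the mean-pooled block keys, keeps its own block plus the top-k highest-scoring non-future blocks, and attends only to visible tokens inside those blocks. The optional left-padded depthwise convolution stands in for the short key convolution ("kconv3") discussed above. The function name, shapes, and dense-mask formulation are illustrative assumptions; FlashMoBA itself implements this routing with fused CUDA kernels and a collect-and-densify gather rather than dense masks.

```python
import torch
import torch.nn.functional as F

def moba_attention(q, k, v, block_size=128, top_k=3, kconv=None):
    """MoBA-style sparse attention sketch (single head, causal).

    q, k, v: [seq_len, head_dim]. Each query routes to `top_k` past key blocks
    by scoring mean-pooled block keys, always keeps its own block, and attends
    only inside the selected blocks. Dense masks keep the sketch readable; an
    efficient kernel would instead gather the selected blocks into contiguous tiles.
    """
    T, d = q.shape
    if kconv is not None:                                  # short depthwise conv on keys ("kconv3")
        kc = F.pad(k.t().unsqueeze(0), (kconv.kernel_size[0] - 1, 0))  # left pad => causal
        k = kconv(kc).squeeze(0).t()

    n_blocks = (T + block_size - 1) // block_size
    k_pad = F.pad(k, (0, 0, 0, n_blocks * block_size - T))
    block_repr = k_pad.view(n_blocks, block_size, d).mean(dim=1)       # [n_blocks, d]

    pos = torch.arange(T)
    q_block = pos // block_size                                        # block index of each query
    gate = q @ block_repr.t()                                          # [T, n_blocks] routing scores
    future = torch.arange(n_blocks)[None, :] > q_block[:, None]
    gate = gate.masked_fill(future, float("-inf"))                     # never route to future blocks

    keep = torch.zeros(T, n_blocks, dtype=torch.bool)
    keep.scatter_(1, gate.topk(min(top_k, n_blocks), dim=1).indices, True)
    keep.scatter_(1, q_block[:, None], True)                           # always keep the query's own block
    keep &= ~future

    # Token j is visible to query i iff j's block is kept for i and j <= i.
    visible = keep[:, pos // block_size] & (pos[None, :] <= pos[:, None])
    scores = (q @ k.t()) / d ** 0.5
    scores = scores.masked_fill(~visible, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# Toy usage: 512 tokens, 64-dim head, 64-token blocks, 2 routed blocks per query.
T, d = 512, 64
q, k, v = torch.randn(T, d), torch.randn(T, d), torch.randn(T, d)
kconv3 = torch.nn.Conv1d(d, d, kernel_size=3, groups=d, bias=False)    # 3-tap depthwise key conv
out = moba_attention(q, k, v, block_size=64, top_k=2, kconv=kconv3)
print(out.shape)                                                        # torch.Size([512, 64])
```

Because every query always keeps at least its own (causally masked) block, the softmax over the visible set is never empty; the speedups reported for FlashMoBA come from replacing this dense masking with block gathers sized to match GPU tiles.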