Dissecting Nvidia Blackwell - Tensor Cores, PTX Instructions, SASS, Floorsweep, Yield

Microbenchmarking, tcgen05, 2SM MMA, UMMA, TMA, LDGSTS, UBLKCP, Speed of Light, Distributed Shared Memory, GPC Floorsweeps, SM Yield

KIMBO CHEN AND DYLAN PATEL ∙ APR 01, 2026 ∙ PAID

Nvidia's Datacenter Blackwell GPU (SM100) represents one of the ...
MLSys 2025 Best Paper Awards announced: FlashInfer, the open-source project from Zihao Ye, Tianqi Chen, and collaborators, is among the winners

Jiqizhixin (机器之心) · 2025-05-14 04:36
Core Insights

- The article highlights the recognition of two outstanding papers in the field of machine learning systems, both authored by Chinese researchers, awarded at the MLSys 2025 conference [1][29].

Group 1: FlashInfer

- FlashInfer, a collaborative research project initiated by the University of Washington, Carnegie Mellon University, and OctoAI, aims to create a flexible inference kernel library for large language models (LLMs) [4].
- NVIDIA has integrated FlashInfer's capabilities into various projects, enhancing LLM inference performance [2].
- FlashInfer significantly improves computational performance across a range of inference scenarios, reducing inter-token latency by 29% to 69% compared to state-of-the-art LLM deployment solutions [7].
- The system employs a block-sparse format and composable formats to optimize memory access and reduce redundancy in key-value cache storage [9][11].
- FlashInfer supports Just-In-Time (JIT) compilation of customizable attention computation templates, allowing flexibility for different application needs [9][20].
- The system's design includes a load-balancing scheduling algorithm that adapts to dynamic user requests while maintaining compatibility with static configurations [9][26].

Group 2: The Hidden Bloat in Machine Learning Systems

- The second awarded paper examines software bloat in machine learning systems: unused code and functionality that degrades performance and wastes resources [31].
- The proposed method, Negativa-ML, identifies and eliminates bloat in ML frameworks by analyzing shared libraries, reducing device code size by up to 75% and host code size by up to 72% [32].
- By reducing bloat, Negativa-ML can decrease peak host memory usage, peak GPU memory usage, and execution time by up to 74.6%, 69.6%, and 44.6%, respectively [32].
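The block-sparse key-value cache layout mentioned above can be illustrated with a minimal sketch: KV tensors live in one shared pool of fixed-size pages, and each request addresses its pages through a CSR-style index table, so requests of different lengths share the pool without duplicated storage. This is only a conceptual toy, not FlashInfer's actual API; `PAGE_SIZE`, `kv_pool`, and `gather_keys` are illustrative names.

```python
import numpy as np

PAGE_SIZE = 4   # tokens per page (block row height); illustrative value
HEAD_DIM = 8    # key/value feature dimension; illustrative value

# One global pool of pages; pages are handed out to requests on demand.
num_pages = 6
kv_pool = np.random.randn(num_pages, PAGE_SIZE, HEAD_DIM).astype(np.float32)

# Per-request page tables in CSR style (indptr + indices over the pool).
# Request 0 owns pages [0, 2]; request 1 owns pages [1, 3, 5].
indptr = np.array([0, 2, 5])
indices = np.array([0, 2, 1, 3, 5])

def gather_keys(request_id: int) -> np.ndarray:
    """Materialize one request's keys from its block-sparse page list."""
    pages = indices[indptr[request_id]:indptr[request_id + 1]]
    # Concatenate the request's pages into a dense (tokens, head_dim) view.
    return kv_pool[pages].reshape(-1, HEAD_DIM)

k0 = gather_keys(0)   # shape (2 * PAGE_SIZE, HEAD_DIM) = (8, 8)
k1 = gather_keys(1)   # shape (3 * PAGE_SIZE, HEAD_DIM) = (12, 8)
```

An attention kernel operating on this layout never needs a per-request contiguous copy of the cache; it walks the page indices directly, which is what makes the format composable with paged allocation and load-balanced scheduling.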