Large Language Model Inference
Multiplication via Addition? Huawei Mathematicians Step In: High-Performance Design and Optimization of Ascend Operators Lifts Performance by 30%!
机器之心· 2025-05-23 04:17
Core Viewpoint
- The article discusses the rapid advancements in large language models (LLMs) and the challenges they face in inference, particularly speed and energy efficiency. It highlights Huawei's hardware-software co-design solutions, focusing on three key technologies that improve inference speed and energy efficiency [2][4][11].

Group 1: Key Technologies
- AMLA transforms complex multiplication into addition operations, raising chip utilization to 71% and improving attention-operator performance by over 30% [4][5].
- Fusion-operator optimization combines multiple operators into a single composite operator, enhancing parallelism and reducing redundant data movement, which yields substantial performance improvements in model inference [7][9].
- SMTurbo enables ultra-low-latency memory sharing across 384 cards, achieving sub-microsecond delays and improving memory-access throughput by over 20% in cross-machine communication scenarios [10][9].

Group 2: Future Developments
- Future research on AMLA will focus on optimizing the MLA operator for quantization scenarios, broadening its applicability [12].
- Fusion-operator optimization will be extended to more model architectures, promoting efficient inference of large language models on Huawei's Ascend hardware [12].
- Load/Store optimization will balance read and write loads, targeting practical gains at large batch sizes in DeepSeek dispatch-and-combine scenarios [12].
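The article does not explain how AMLA replaces multiplication with addition. A classical trick in the same spirit (purely illustrative, not Huawei's actual algorithm) is Mitchell's approximation: for positive floats, adding the IEEE-754 bit patterns and subtracting the bias roughly multiplies the values, because the exponent fields add exactly and the mantissa fields act as approximate log2 terms.

```python
import struct

def float_to_bits(x: float) -> int:
    # Reinterpret a float32 value as its 32-bit integer pattern.
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_float(b: int) -> float:
    # Reinterpret a 32-bit integer pattern as a float32 value.
    return struct.unpack("<f", struct.pack("<I", b & 0xFFFFFFFF))[0]

def approx_mul(a: float, b: float) -> float:
    # Mitchell-style multiplication for positive floats: add bit patterns,
    # subtract the bit pattern of 1.0f so the exponent bias is not doubled.
    BIAS = 0x3F800000  # IEEE-754 bit pattern of 1.0f
    return bits_to_float(float_to_bits(a) + float_to_bits(b) - BIAS)
```

Mitchell's approximation carries up to roughly 11% relative error; a production scheme such as AMLA would need correction terms, which the article does not describe.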
Zihao Ye, Tianqi Chen, et al.'s Open-Source Project FlashInfer Selected: MLSys 2025 Best Paper Awards Announced
机器之心· 2025-05-14 04:36
Core Insights
- The article highlights two outstanding-paper awards at the MLSys 2025 conference, both going to papers authored by Chinese researchers [1][29].

Group 1: FlashInfer
- FlashInfer, a collaborative research project initiated by the University of Washington, Carnegie Mellon University, and OctoAI, aims to build a flexible inference kernel library for large language models (LLMs) [4].
- NVIDIA has integrated FlashInfer's capabilities into various projects, enhancing LLM inference performance [2].
- FlashInfer significantly improves computational performance across inference scenarios, reducing inter-token latency by 29% to 69% compared with state-of-the-art LLM deployment solutions [7].
- The system employs a block-sparse format and composable formats to optimize memory access and reduce redundancy in key-value cache storage [9][11].
- FlashInfer supports just-in-time (JIT) compilation of customizable attention-computation templates, providing flexibility for different application needs [9][20].
- Its design includes a load-balanced scheduling algorithm that adapts to dynamic user requests while remaining compatible with static configurations [9][26].

Group 2: The Hidden Bloat in Machine Learning Systems
- The second awarded paper examines software bloat in machine learning systems: unused code and functionality that degrade performance and waste resources [31].
- The proposed method, Negativa-ML, identifies and eliminates bloat in ML frameworks by analyzing shared libraries, reducing device code size by up to 75% and host code size by up to 72% [32].
- By removing bloat, Negativa-ML decreases peak host memory usage, peak GPU memory usage, and execution time by up to 74.6%, 69.6%, and 44.6%, respectively [32].
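The block-sparse KV-cache idea behind FlashInfer's storage format can be sketched in a few lines: tokens are stored in fixed-size blocks drawn from a shared pool, and each request holds only a table of block indices, so sequences of different lengths share one allocation. The names, block size, and pool layout below are illustrative assumptions, not FlashInfer's actual API.

```python
import numpy as np

BLOCK = 4     # tokens per KV block (toy size; real systems use larger blocks)
HEAD_DIM = 8  # per-token KV vector width (toy size)

# Shared pool of KV blocks; each request owns a list of block indices.
pool = np.zeros((16, BLOCK, HEAD_DIM), dtype=np.float32)

def append_token(block_table: list, length: int, kv: np.ndarray,
                 free_blocks: list) -> int:
    # Allocate a new block from the free list only when the last one is full.
    if length % BLOCK == 0:
        block_table.append(free_blocks.pop())
    blk = block_table[-1]
    pool[blk, length % BLOCK] = kv
    return length + 1

def gather_kv(block_table: list, length: int) -> np.ndarray:
    # Reassemble the logically contiguous KV sequence from scattered blocks.
    full = pool[block_table].reshape(-1, HEAD_DIM)
    return full[:length]
```

Because blocks are fixed-size and indexed indirectly, appending a token never copies existing KV data, and an attention kernel can walk the block table directly instead of requiring one contiguous buffer per request.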