Large Model Inference

Blockbuster: Goldman Sachs Raises Cambricon's Target Price to RMB 1,835; the "Cambricon King" Tops Wuliangye in Market Cap and Moutai in Share Price? Post-85 Founder Chen Tianshi Worth Over RMB 150 Billion, Star Trader Zhang Jianping in the Spotlight
Sou Hu Cai Jing· 2025-08-25 02:37
Boosted by the dual positives of Nvidia halting H20 chip production and DeepSeek "igniting" domestic chips, the A-share chip sector was the strongest direction in the market last week, with sector leader Cambricon red hot. After hitting the daily limit on Friday, its share price reached RMB 1,243, second only to Kweichow Moutai among A-shares; its market capitalization exceeded RMB 520 billion, surpassing well-known companies such as Midea Group, Wuliangye, East Money, Shanghai Pudong Development Bank, CITIC Securities, and Hengrui Medicine.

Analysts see the speculation in Cambricon as resting on three core theses. First, accelerating domestic substitution: amid geopolitical pressures, demand from domestic cloud vendors and major internet companies for self-controllable AI chips is growing rapidly, and Cambricon benefits directly. Second, exploding demand for large-model inference: the rapid rise of home-grown large models, represented by DeepSeek, is driving strong demand for high-performance AI inference chips. Third, industry leadership: Cambricon, dubbed "China's Nvidia," is seeing the value of its accumulated expertise in AI chip architecture design and hardware-software co-optimization gradually emerge; on the capital side, its private placement application has been approved by the Shanghai Stock Exchange, with RMB 3.985 billion in proceeds earmarked for large-model chips and a software platform.

More recently, news that Nvidia has suspended H20 production further catalyzed the surge in demand for domestic AI chip substitution, and the official release of DeepSeek-V3.1 with adaptation to domestic chip architectures has fully opened up the imagination space for Cambricon and its peers. Notably, Cambricon founder Chen Tianshi was born in 1985 and graduated from the University of Science and Technology of China. Now, as Cambricon's share price ...
"六边形战士"GPU公司完成亿元新融资
是说芯语· 2025-08-24 01:39
After eight years of continuous R&D and product iteration, Xindongli Technology has built a complete matrix of AI computing products. Its core technology is the Reconfigurable Parallel Processor (RPP) architecture, a self-developed processor architecture designed specifically for parallel computing, and its in-house AI chips have already been adapted to mainstream open-source large models. According to news on August 21, the Tsinghua-affiliated innovative-architecture compute chip company Zhuhai Xindongli Technology Co., Ltd. (珠海市芯动力科技有限公司, "Xindongli Technology") recently closed a B2 round of nearly RMB 100 million led by Feitu Venture Capital (飞图创投). The proceeds will go mainly toward industrializing the RPP chip, upgrading core technology R&D, and accelerating expansion into the edge-computing and AI inference chip markets. Founded in 2017, Xindongli Technology has R&D centers in Zhuhai, Shenzhen, Xi'an, and the United States. In March this year, the company announced a B1 round of tens of millions of yuan led by Changshi Capital (长石资本), with Datai Capital (达泰资本), Jiangmen Changxin (江门长信), Shuoming (硕明), and other institutions participating. Its M.2 accelerator card delivers up to 32 TOPS of compute and 60 GB/s of memory bandwidth with dynamically controllable power consumption, and can run large models on laptops and similar devices; it has been adapted to open-source models including DeepSeek, Llama3-8B, Stable Diffusion, Qwen (通义千问), and BitNet. With this new round of financing closed, Xindongli Technology plans to continue toward building domestically owned high-end general-purpose chips. (Reposted from 芯东西.)
Dahua Technology (002236): Server Business Expected to Become a New Growth Driver
HTSC· 2025-08-19 02:04
Investment Rating
- The investment rating for the company is maintained at "Buy" with a target price of RMB 28.56 [1][6].
Core Views
- The company is expected to open new growth avenues in its server business, particularly given the rising demand for AI computing power [8][9].
- The company has entered the procurement systems of major clients, which is expected to strengthen its brand in the computing power industry [9][12].
- Overall performance in the first half of 2025 shows positive growth across all business lines, with a notable improvement in profitability and cash flow [15][16].
Financial Data Summary
- The company's market capitalization is RMB 59,786 million, with a closing price of RMB 18.19 as of August 18, 2025 [2].
- Revenue projections for 2024-2027 are RMB 32,181 million, RMB 33,275 million, RMB 35,165 million, and RMB 38,002 million, with growth rates of -0.12%, 3.40%, 5.68%, and 8.07% respectively [5].
- Net profit attributable to the parent company is projected at RMB 2,906 million in 2024, rising to RMB 4,208 million by 2027, with corresponding growth rates of -60.53%, 31.91%, 1.28%, and 8.39% [5].
Business Performance Overview
- In the first half of 2025, the company posted revenue of RMB 15.181 billion, up 2.12% year on year, and a net profit of RMB 2.476 billion, up 36.80% [15][16].
- The G-end (government) business generated RMB 1.851 billion in revenue, up 4.68%, while the B-end (enterprise) business generated RMB 4.219 billion, up 8.17% [10][16].
- The overseas business accounted for 50.25% of total revenue, with modest growth of 1.91% year on year [10][16].
Future Outlook
- The company expects steady growth in the second half of 2025, focusing on policy opportunities and overseas market expansion [11][17].
- The server business is expected to benefit from rising demand for AI and computing power, with significant contracts already secured [9][12].
Is Chain-of-Thought an Illusion? Re-examining Large-Model Reasoning from a Data-Distribution Perspective; Musk Replies and Grok Gets Defensive
机器之心· 2025-08-14 09:11
Core Viewpoint
- The research suggests that Chain-of-Thought (CoT) reasoning in large language models (LLMs) may not represent true reasoning but rather a replication of patterns learned from training data, leading to fragility on out-of-distribution tasks [2][10][37].
Data Distribution Perspective on CoT
- The effectiveness of CoT is attributed to the "structured inductive bias" learned within the training distribution, indicating that the reasoning chains are reproductions of common patterns rather than genuine logical deductions [13][37].
- A theoretical framework is introduced to quantify the relationship between training and testing distributions, highlighting how distribution shifts affect reasoning performance [15].
Experimental Findings on Generalization
- In "task generalization," the model shows nearly 100% accuracy within the training distribution, but accuracy drops to 0.01% under slight distribution shifts, indicating a lack of true generalization [23].
- Supervised fine-tuning on a small amount of new data can restore performance, but this only expands the existing distribution boundary without improving abstract generalization [24].
- In "length generalization," even minor changes in input sequence length significantly affect performance, with the model tending to generate reasoning chains matching the lengths seen in training [26].
- The model is highly sensitive to format changes: even minor alterations to input prompts can cause complete reasoning failures [28].
Universal Sensitivity to Distribution Shifts
- Sensitivity to distribution shift is observed across sampling temperatures and model sizes, indicating that the issue is not isolated to specific models [31].
Practical Implications
- In high-risk fields such as healthcare and finance, relying on CoT for robust reasoning is cautioned against, as plausible but misleading reasoning chains can be more dangerous than outright wrong answers [34].
- Evaluation methods that rely on validation sets closely matching the training distribution may overestimate robustness; stricter out-of-distribution testing is needed [35].
- Supervised fine-tuning can quickly boost performance on specific tasks, but it does not endow models with true abstract reasoning ability [36].
A Conversation with Houmo Intelligence CEO Wu Qiang: In the Future, 90% of Data Processing May Happen at the Edge and on End Devices
Guan Cha Zhe Wang· 2025-07-30 06:41
Core Insights
- The World Artificial Intelligence Conference (WAIC 2025) highlighted the development of domestic computing power chips, particularly the M50 chip from Houmo Intelligence, designed for large model inference in AI PCs and smart terminals [1][4].
- Houmo Intelligence's CEO, Wu Qiang, emphasized a shift in the focus of large models from training to inference, and from cloud intelligence to edge and endpoint intelligence [1][4].
Company Overview
- Houmo Intelligence was founded in 2020 and focuses on high-performance AI chips built on integrated storage-and-computing (compute-in-memory) technology [3].
- The M50 chip is regarded as a milestone for Houmo Intelligence, showcasing its progress over the past two years [3].
Product Specifications
- The M50 chip delivers 160 TOPS of INT8 and 100 TFLOPS of bFP16 physical computing power, with up to 48 GB of memory and 153.6 GB/s of bandwidth, at a typical power consumption of only 10 W [4].
- Houmo Intelligence's product matrix covers computing solutions from edge to endpoint, including the LQ50 Duo M.2 card for AI PCs and companion robots [4].
Market Positioning
- Wu Qiang stated that domestic companies should pursue differentiated technology paths rather than directly copying international giants such as NVIDIA and AMD [4].
- Houmo Intelligence aims to combine compute-in-memory technology with large models to enable offline usability and data privacy [4].
Future Developments
- The release of the M50 chip is viewed as a starting point, with more chips planned to address computing power, power consumption, and bandwidth constraints in edge and endpoint AI [5].
- Houmo Intelligence has begun research on next-generation DRAM-PIM technology, targeting 1 TB/s of on-chip bandwidth and triple the energy efficiency of current levels [9].
Target Markets
- The M50 chip applies to consumer terminals, smart offices, and smart industries, with a focus on offline processing to reduce data transmission risks [8].
- Potential customers include Lenovo's next-generation AI PCs, iFlytek's smart voice devices, and China Mobile's new 5G+AI edge computing equipment [8].
Stanford's Large-Model Reasoning Course Is Now Free, Taught by the Founder of Google's Reasoning Team
量子位· 2025-07-25 07:59
Core Viewpoint
- The article discusses the reasoning capabilities of large language models (LLMs) and emphasizes the importance of intermediate reasoning steps in improving model confidence and accuracy in problem-solving [5][10][34].
Group 1: Importance of Reasoning in LLMs
- Reasoning in LLMs refers to the intermediate thought process produced before the final answer, which can significantly improve the model's ability to solve complex problems [5][11].
- Introducing a chain of thought (CoT) lets LLMs tackle inherently serial problems without growing the model size, helping bridge the gap between Transformers and Turing machines [12][13].
- The presence of reasoning steps increases the accuracy and reliability of answers and reduces the likelihood of random guessing [14][17].
Group 2: Enhancing Model Confidence
- Answers derived through a reasoning process come with greater model confidence, since they rest on logical deductions rather than mere guesses [19][20].
- Denny Zhou notes that pre-trained models already possess reasoning capability even without fine-tuning, although such outputs are not prioritized under greedy decoding [21][24].
Group 3: Methods to Improve Reasoning
- The CoT-decoding method selects reasoning paths from the top-k alternative first tokens, improving performance on reasoning tasks and approaching the effectiveness of instruction-tuned models (a minimal sketch follows below) [26].
- Supervised fine-tuning (SFT) trains models on human-written step-by-step solutions, but it may generalize poorly to new scenarios [27][28].
- Reinforcement learning fine-tuning has emerged as a powerful way to elicit reasoning, encouraging longer responses and improving performance through iterative training [31].
Group 4: Future Directions
- Denny Zhou identifies key areas for future breakthroughs, including tasks without unique verifiable answers and practical applications beyond benchmark testing [35][40].
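For readers who want to see roughly what CoT-decoding looks like in practice, below is a minimal sketch assuming a Hugging Face causal LM: it branches over the top-k candidates for the first generated token, continues each branch greedily, and keeps the path whose continuation shows the highest average gap between top-1 and top-2 token probabilities. The model name, the value of k, and scoring the whole continuation (rather than only the answer span, as in the original method) are simplifying assumptions, not the course's exact recipe.

```python
# Hedged sketch of CoT-decoding: branch on the top-k first tokens, decode
# greedily from each branch, and keep the most "confident" path.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # assumption: any causal LM works here
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def cot_decode(prompt: str, k: int = 5, max_new_tokens: int = 128):
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        first_logits = model(**inputs).logits[0, -1]     # logits for the first new token
    topk_ids = torch.topk(first_logits, k).indices       # k alternative starting tokens

    best = None
    for tid in topk_ids:
        branch_ids = torch.cat([inputs.input_ids, tid.view(1, 1)], dim=-1)
        out = model.generate(
            branch_ids,
            max_new_tokens=max_new_tokens,
            do_sample=False,                              # greedy after the branch point
            output_scores=True,
            return_dict_in_generate=True,
        )
        # Confidence = mean gap between top-1 and top-2 probabilities over the
        # continuation (a simplification of the answer-span scoring in the paper).
        gaps = []
        for step_logits in out.scores:
            probs = torch.softmax(step_logits[0], dim=-1)
            top2 = torch.topk(probs, 2).values
            gaps.append((top2[0] - top2[1]).item())
        conf = sum(gaps) / max(len(gaps), 1)
        text = tok.decode(out.sequences[0, inputs.input_ids.shape[1]:], skip_special_tokens=True)
        if best is None or conf > best[0]:
            best = (conf, text)
    return best  # (confidence, decoded reasoning path)

print(cot_decode("Q: I have 3 apples and buy 2 more. How many apples do I have? A:"))
```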
Does AI Really Need to Think "Like a Human"? AlphaOne Reveals a "Way of Thinking" Suited to Large Models
机器之心· 2025-06-23 07:44
Core Viewpoint
- The article discusses a new reasoning framework called AlphaOne, which suggests that AI models should follow a "slow thinking first, fast thinking later" schedule at test time, in contrast to the conventional human-like reasoning paradigm [4][5][6].
Group 1: Introduction of AlphaOne
- AlphaOne introduces a global reasoning-control hyperparameter α that lets a model switch from slow to fast reasoning without additional training, significantly improving reasoning accuracy and efficiency [6][12].
- The framework challenges the assumption that AI must think like humans and proposes a more effective reasoning strategy [6][4].
Group 2: Mechanism of AlphaOne
- The core mechanism is a unified control point called the α-moment, which dictates when to transition from slow to fast thinking (a minimal sketch of this control loop follows below) [16][18].
- Before the α-moment, the model uses a probability-driven strategy to encourage deep reasoning; after the α-moment, it switches to a fast-thinking mode [20][24].
Group 3: Experimental Results
- Across six reasoning tasks, AlphaOne achieved higher accuracy than existing approaches, including a +6.15% gain for a 1.5-billion-parameter model [28][29].
- Despite relying on a slow-thinking mechanism, AlphaOne reduced the average number of generated tokens by 14%, demonstrating its efficiency [30].
Group 4: Scalability and Flexibility
- The α-moment allows the length of the thinking phase to be scaled: the number of slow-thinking markers can be increased or decreased according to the α value [34].
- The framework remains robust across a wide range of α values, indicating good generality [34].
Group 5: Future Directions
- The article suggests future research directions, including more sophisticated slow-thinking scheduling strategies and cross-modal reasoning applications [46][48].
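To make the α-moment idea concrete, here is a minimal sketch of the control loop under stated assumptions: a plain greedy decoding loop that, before the token budget α·N is reached, occasionally appends a slow-thinking marker after a sentence boundary, and stops inserting markers once the α-moment has passed. The model name, the " Wait," marker text, the probability, and the budget are illustrative assumptions, not the paper's exact settings.

```python
# Hedged sketch of α-moment scheduling: slow thinking before α·N tokens,
# fast thinking afterwards. Recomputes the full prefix each step for simplicity.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # assumption: any causal LM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

@torch.no_grad()
def alpha_one_generate(prompt, alpha=1.4, base_budget=256, p_slow=0.2, max_new=512):
    """Greedy decoding with an alpha-moment at alpha * base_budget new tokens."""
    alpha_moment = int(alpha * base_budget)
    wait_ids = tok(" Wait,", add_special_tokens=False).input_ids  # assumed slow-thinking marker
    ids = tok(prompt, return_tensors="pt").input_ids
    n_new = 0
    while n_new < max_new:
        next_id = int(model(ids).logits[0, -1].argmax())
        if next_id == tok.eos_token_id:
            break
        ids = torch.cat([ids, torch.tensor([[next_id]])], dim=-1)
        n_new += 1
        # Slow-thinking phase: before the alpha-moment, stochastically append the
        # marker after a sentence ends, nudging the model to keep reasoning.
        if n_new < alpha_moment and tok.decode([next_id]).endswith(".") and torch.rand(1).item() < p_slow:
            ids = torch.cat([ids, torch.tensor([wait_ids])], dim=-1)
            n_new += len(wait_ids)
        # Fast-thinking phase: past the alpha-moment no markers are added, so the
        # model is free to wrap up and answer.
    return tok.decode(ids[0], skip_special_tokens=True)

print(alpha_one_generate("Q: What is 17 * 24? Think step by step. A:"))
```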
Half the Industry Is Coming! Full Speaker Lineup Revealed for the China AI Computing Power Conference, with Agendas Announced for the Concurrent Heterogeneous Mixed-Training and Supernode Workshops
傅里叶的猫· 2025-06-17 15:30
Core Viewpoint
- The 2025 China AI Computing Power Conference will be held on June 26 in Beijing, focusing on the evolving AI computing power landscape driven by DeepSeek [1][2].
Group 1: Conference Overview
- The conference will feature nearly 30 prominent speakers delivering keynotes, reports, and panel discussions on AI computing power [1].
- It includes a main venue for high-level forums and specialized sessions, as well as closed-door workshops for selected attendees [2].
Group 2: Keynote Speakers
- Notable speakers include Li Wei from the China Academy of Information and Communications Technology, who will discuss cloud computing standards [4][8].
- Wang Hua, Vice President of Moore Threads, will present on training large models with FP8 precision [12][13].
- Yang Gongyifan, CEO of Zhonghao Xinying, will share insights on high-end chip design and development [14][16].
- Xu Lingjie, CEO of Magik Compute, will address the evolution of compilation technology in AI infrastructure [18][22].
- Chen Xianglin from Qujing Technology will discuss innovations in optimizing large-model inference [28][31].
Group 3: Specialized Forums
- The conference will host specialized forums on AI inference computing power and intelligent computing centers, featuring industry leaders discussing cutting-edge technologies [2][4].
- The closed-door workshops will focus on heterogeneous training and supernode technologies, aimed at industry professionals [2][67][71].
Group 4: Ticketing and Participation
- The conference offers several ticket types, including free audience tickets and paid VIP tickets, with an application process for attendance [72].
Lossless Math Reasoning with Just 10% of the KV Cache! An Open-Source Method That Solves the "Memory Overload" Problem of Reasoning Models
量子位· 2025-06-16 04:49
Contributed by the R-KV team | QbitAI official account (量子位)

Reasoning models are impressive, but a simple arithmetic problem can yield three full pages of reasoning, much of it repetitive "filler" that buries the point...

Now there is an efficient compression method that turns a model's "rambling" into controllable memory entries. R-KV is open source: GPU memory down 90%, throughput up 6.6x, accuracy preserved at 100%. It ranks tokens in real time, balancing importance against redundancy, and keeps only tokens that are informative and diverse, thereby removing the redundancy that plagues large-model reasoning and making "long reasoning" no longer a luxury. Project details are available via the link at the end of the article.

R-KV in three steps: redundancy identification + importance scoring + dynamic eviction (see the sketch below)

Chain-of-Thought (CoT) makes an LLM's solution process visible, but it also makes reasoning length balloon. Take DeepSeek-R1-Llama-8B as an example: a single AIME math problem can produce 32,000 tokens; the model weights occupy 15.5 GB and the KV cache consumes another 4.1 GB, so GPU memory runs out almost instantly.

[Visualization: R-KV vs. SnapKV]

Existing KV compression methods (SnapKV, StreamingLLM, H2O, etc.) are designed mainly for long inputs. Once the model starts "rambling" on the output side, similar sentences assign each other high attention scores, which breaks the "evict the lowest-attention tokens" strategy: ...
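As a rough illustration of the "important and non-redundant" selection idea, here is a minimal NumPy sketch under stated assumptions: each cached token gets an importance score (the attention it has received) and a redundancy penalty (cosine similarity to tokens already kept), and tokens are greedily selected until a fixed KV budget is filled. The weighting, array shapes, and greedy selection are illustrative simplifications, not R-KV's exact formulation.

```python
# Hedged sketch of importance-plus-non-redundancy KV selection under a budget.
import numpy as np

def rkv_select(keys: np.ndarray, attn_recv: np.ndarray, budget: int, lam: float = 0.5):
    """keys: (T, d) cached key vectors; attn_recv: (T,) attention each token received."""
    T = keys.shape[0]
    normed = keys / (np.linalg.norm(keys, axis=1, keepdims=True) + 1e-8)
    kept: list[int] = []
    for _ in range(min(budget, T)):
        best, best_score = -1, -np.inf
        for t in range(T):
            if t in kept:
                continue
            # Redundancy = highest cosine similarity to any token already kept.
            red = max((float(normed[t] @ normed[j]) for j in kept), default=0.0)
            score = lam * attn_recv[t] - (1 - lam) * red  # important AND non-redundant
            if score > best_score:
                best, best_score = t, score
        kept.append(best)
    return sorted(kept)  # indices of KV entries to retain

# Toy usage: keep 4 of 10 cached tokens.
rng = np.random.default_rng(0)
print(rkv_select(rng.normal(size=(10, 8)), rng.random(10), budget=4))
```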
SGLang Inference Engine: Key Technical Points and Deployment Practice | AICon Beijing Preview
AI前线· 2025-06-13 06:42
Core Insights
- SGLang has gained significant traction in the open-source community, reaching nearly 15K GitHub stars and over 100,000 monthly downloads by June 2025, a sign of its popularity and performance [1].
- Major industry players such as xAI, Microsoft Azure, NVIDIA, and AMD have adopted SGLang in their production environments, underscoring its reliability and effectiveness [1].
- The large-scale expert-parallel deployment solution that SGLang fully open-sourced in May 2025 is described as the only one able to reproduce the performance and cost reported in the official blog [1].
Technical Advantages
- SGLang's core advantages are a high-performance implementation and easily modifiable code, which distinguish it from other open-source solutions [3].
- Key techniques such as prefill-decode (PD) separation, speculative decoding, and KV cache offloading have been developed to raise performance and resource utilization while reducing cost (a speculative-decoding sketch follows below) [4][6].
Community and Development
- The SGLang community plays a central role in driving technical evolution and application deployment, with industrial deployment experience at the scale of more than 100,000 GPUs guiding its roadmap [5].
- Its open-source nature encourages broad participation and contribution, fostering community and accelerating real-world adoption [5].
Performance Optimization Techniques
- PD separation addresses latency fluctuations caused by prefill requests interrupting decoding, yielding more stable and uniform decoding latency [6].
- Speculative decoding reduces decoding latency by predicting multiple tokens at once, significantly raising decoding speed [6].
- KV cache offloading stores previously computed KV caches on larger storage devices, cutting recomputation and response latency in multi-turn dialogues [6].
Deployment Challenges
- Developers often underestimate the importance of tuning the many configuration parameters, which can materially affect deployment efficiency even with ample compute [7].
- The complexity of parallel deployment technologies creates compatibility challenges and demands careful resource management and load balancing [4][7].
Future Directions
- Growing model sizes require more GPUs and more efficient parallel strategies to achieve high-performance, low-cost deployments [7].
- The upcoming AICon event in Beijing will focus on AI technology advances and industry applications, providing a venue to explore these topics further [8].
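As background on the speculative decoding technique mentioned above (not SGLang's actual implementation), here is a minimal sketch assuming two Hugging Face causal LMs that share a tokenizer: a small draft model proposes k tokens greedily, the large target model scores the whole proposal in one forward pass, and the longest agreeing prefix is accepted plus one bonus token from the target. The model names and the greedy-verification variant are simplifying assumptions; production engines use rejection sampling, batching, and careful KV-cache management.

```python
# Hedged sketch of speculative decoding with greedy verification.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: draft and target share a tokenizer (e.g., two sizes of one model family).
draft_name, target_name = "Qwen/Qwen2.5-0.5B", "Qwen/Qwen2.5-7B"
tok = AutoTokenizer.from_pretrained(target_name)
draft = AutoModelForCausalLM.from_pretrained(draft_name)
target = AutoModelForCausalLM.from_pretrained(target_name)

@torch.no_grad()
def speculative_step(ids: torch.Tensor, k: int = 4) -> torch.Tensor:
    """One speculate-and-verify step; ids has shape (1, seq_len)."""
    # 1) The draft model proposes k tokens greedily.
    proposal = ids
    for _ in range(k):
        nxt = draft(proposal).logits[0, -1].argmax()
        proposal = torch.cat([proposal, nxt.view(1, 1)], dim=-1)
    # 2) The target model scores the whole proposal in a single forward pass.
    tgt_logits = target(proposal).logits[0]
    tgt_pred = tgt_logits[ids.shape[1] - 1 : -1].argmax(dim=-1)  # target's choice at each drafted position
    drafted = proposal[0, ids.shape[1]:]
    # 3) Accept the longest prefix where draft and target agree, then append the
    #    target's own next token so every step yields at least one new token.
    n_accept = 0
    while n_accept < k and int(drafted[n_accept]) == int(tgt_pred[n_accept]):
        n_accept += 1
    accepted = proposal[:, : ids.shape[1] + n_accept]
    bonus = tgt_logits[ids.shape[1] + n_accept - 1].argmax().view(1, 1)
    return torch.cat([accepted, bonus], dim=-1)

ids = tok("The capital of France is", return_tensors="pt").input_ids
for _ in range(8):  # run a few speculative steps
    ids = speculative_step(ids)
print(tok.decode(ids[0], skip_special_tokens=True))
```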