Large Language Models

Federal Reserve: Total Recall? Evaluating the Macroeconomic Knowledge of Large Language Models (English version)
Sou Hu Cai Jing· 2025-07-08 02:02
Core Insights
- The report evaluates the performance of large language models (LLMs) in recalling macroeconomic knowledge, focusing on the Claude 3.5 Sonnet model's ability to estimate historical macroeconomic variables and data release dates [1][8][10]
- Findings indicate that while LLMs demonstrate impressive recall for certain economic indicators, they also exhibit significant shortcomings, particularly in handling volatile data series and in avoiding look-ahead bias [2][11][18]

Group 1: Performance Evaluation
- LLMs show strong recall for historical unemployment rates and Consumer Price Index (CPI) values, accurately recalling quarterly values back to World War II [11][44]
- The model struggles with more volatile series such as real GDP growth and industrial production growth, often missing high-frequency fluctuations while capturing broader business-cycle trends [11][45]
- The model's GDP estimates mix first-print values with subsequent revisions, leading to inaccuracies in historical analysis and in real-time forecasting simulations [12][14]

Group 2: Data Release Dates
- LLMs recall historical data release dates with reasonable accuracy but occasionally misestimate them by a few days [16]
- Release-date recall is sensitive to prompt details: prompt adjustments reduce one type of error while increasing another [16]
- On average, about 20.2% of days show at least one series with recall issues, limiting the reliability of LLMs for historical analysis and real-time forecasting [2][16]

Group 3: Look-Ahead Bias
- Evidence suggests that LLMs may inadvertently incorporate future data values when estimating historical data, even when instructed to ignore future information [15][18]
- This look-ahead bias complicates the use of LLMs in historical analysis and as real-time forecasters, since it reflects a tendency to blend past and future information [18][22]
- The report notes that these errors resemble human forecasting mistakes, pointing to a fundamental challenge in the LLMs' recall capabilities [18][22]
Choosing the Right Large Language Model: Llama, Mistral, and DeepSeek
36Kr· 2025-06-30 05:34
Core Insights
- Large language models (LLMs) have gained popularity and are foundational to AI applications, with uses ranging from chatbots to data analysis [1]
- The article analyzes and compares three leading open-source LLM families: Llama, Mistral, and DeepSeek, focusing on their performance and technical specifications [1]

Group 1: Model Specifications
- Each model family offers several parameter sizes (7B, 13B, up to 65-70B); the parameter count directly determines the compute (FLOPs) required for inference [2]
- For instance, Llama's and Mistral's 7B models require approximately 14 billion FLOPs per generated token, while the larger Llama-2-70B model requires about 140 billion FLOPs per token, roughly ten times the compute [2]
- DeepSeek offers a 7B version and a larger 67B version, with compute requirements similar to Llama's 70B model [2]

Group 2: Hardware Requirements
- Smaller models (7B-13B) can run on a single modern GPU, while larger models require multiple GPUs or specialized hardware [3][4]
- For example, Mistral 7B requires about 15 GB of GPU memory, while Llama-2-13B needs approximately 24 GB [3]
- The largest models (65B-70B) necessitate 2-4 GPUs or dedicated accelerators because of their memory footprint [4]

Group 3: Memory Requirements
- Raw inference memory grows with model size: 7B models occupy around 14-16 GB and 13B models around 26-30 GB [5]
- Fine-tuning requires additional memory for optimizer states and gradients, often 2-3 times the model's weight memory [6]
- Techniques such as LoRA and QLoRA reduce fine-tuning memory by freezing most weights and training a small number of additional parameters [7]

Group 4: Performance Trade-offs
- In production there is a trade-off between latency (time for a single input to produce a result) and throughput (results produced per unit time) [9]
- Interactive applications such as chatbots need low latency, while batch-processing tasks prioritize high throughput [10][11]
- Smaller models (7B, 13B) generally have lower per-token latency than larger models (70B), which can generate only a few tokens per second because of their higher compute demands [10]

Group 5: Production Deployment
- All three model families are compatible with mainstream open-source tools and have active communities [12][13]
- Deployment options include local GPU servers, cloud inference on platforms such as AWS, and even high-end CPUs for the smaller models [14][15]
- The models support quantization techniques, allowing efficient deployment and integration with various serving frameworks [16]

Group 6: Safety Considerations
- Open-source models lack the robust built-in safety features of proprietary models, so deployments must add their own safety layers [17]
- These may include content-filtering systems and rate limiting to prevent misuse [17]
- Community efforts to harden open models are underway, but they still lag behind proprietary counterparts in this regard [17]

Group 7: Benchmark Performance
- Despite their smaller size, these models perform well on standard benchmarks: Llama-3-8B achieves around 68.4% on MMLU, 79.6% on GSM8K, and 62.2% on HumanEval [18]
- Mistral 7B scores approximately 60.1% on MMLU and 50.0% on GSM8K, while DeepSeek excels with 78.1% on MMLU and 85.5% on GSM8K [18][19][20]
- These results reflect significant advances in model design and training techniques, allowing smaller models to compete with larger ones [22][25]
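The sizing rules of thumb above (roughly 2 FLOPs per parameter per generated token, about 2 bytes per parameter for fp16 weights, and 2-3x the weight memory for full fine-tuning) can be sketched as a quick estimator. The constants are back-of-the-envelope assumptions, not measured values:

```python
# Back-of-the-envelope sizing for LLM inference and fine-tuning.
# Assumed rules of thumb (approximations, not measurements):
#   - inference: ~2 FLOPs per parameter per generated token
#   - fp16 weights: 2 bytes per parameter
#   - full fine-tuning: ~2-3x the weight memory for gradients/optimizer state

def inference_flops_per_token(n_params: float) -> float:
    """Approximate FLOPs needed to generate one token."""
    return 2.0 * n_params

def fp16_weight_gb(n_params: float) -> float:
    """Approximate fp16 weight memory in GB (1 GB = 1e9 bytes)."""
    return 2.0 * n_params / 1e9

def finetune_gb(n_params: float, overhead: float = 3.0) -> float:
    """Approximate full fine-tuning memory: weights plus optimizer state."""
    return fp16_weight_gb(n_params) * overhead

for name, n in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    print(f"{name}: {inference_flops_per_token(n) / 1e9:.0f} GFLOP/token, "
          f"{fp16_weight_gb(n):.0f} GB weights, "
          f"~{finetune_gb(n):.0f} GB to fine-tune")
```

For a 7B model this reproduces the article's figures: about 14 billion FLOPs per token and 14-16 GB of weight memory, with activation and KV-cache overhead accounting for the upper end of the quoted ranges.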
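LoRA, mentioned under memory requirements, freezes the pretrained weight W and trains only a low-rank update B·A. A minimal numpy sketch (the hidden size and rank are illustrative assumptions, not any particular model's configuration) shows why the trainable-parameter count collapses:

```python
import numpy as np

# Minimal LoRA sketch: the frozen weight W gets a trainable low-rank
# update B @ A, so only r * (d_in + d_out) parameters are trained
# instead of d_in * d_out. Dimensions here are illustrative.
rng = np.random.default_rng(0)
d_in, d_out, r = 4096, 4096, 8               # hidden size and small rank

W = rng.standard_normal((d_out, d_in))       # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01    # trainable, small random init
B = np.zeros((d_out, r))                     # trainable, zero init: the
                                             # update starts as a no-op

x = rng.standard_normal(d_in)
y = W @ x + B @ (A @ x)                      # LoRA forward pass

full = d_in * d_out
lora = r * (d_in + d_out)
print(f"full: {full:,} params, LoRA: {lora:,} params "
      f"({100 * lora / full:.2f}% of full)")
```

With rank 8 the adapter trains 65,536 parameters against 16.8 million for the full matrix, under 0.4%, which is why optimizer-state memory shrinks so dramatically during fine-tuning.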
Sberbank First Deputy CEO: Sberbank plans to launch a large language model with reasoning capabilities in the near future.
news flash· 2025-06-18 08:06
Core Viewpoint
- The First Deputy CEO of Sberbank announced plans to launch a large language model with reasoning capabilities in the near future [1]

Company Summary
- Sberbank is focusing on the development of advanced AI technologies, specifically a large language model that can perform reasoning tasks [1]
How Long Until AI Becomes a Capable Assistant to Mathematicians?
Ke Ji Ri Bao· 2025-06-17 01:18
Core Viewpoint
- The article discusses the current state and future potential of AI in assisting mathematical research, highlighting both advances in and limits of AI's ability to solve complex mathematical problems.

Group 1: AI Advancements in Mathematics
- The U.S. Defense Advanced Research Projects Agency (DARPA) launched the "Exponential Mathematics" program to develop AI systems that significantly enhance the efficiency of mathematical research [1]
- New-generation large language models (LLMs) such as OpenAI's o3 and Anthropic's Claude 4 with extended thinking have shown improvements, performing at levels close to excellent high-school students in competitions [2]
- Google's AlphaProof system combines LLMs with AlphaZero-style game-playing AI, achieving results comparable to a silver medalist at the International Mathematical Olympiad [2]
- Google's AlphaEvolve model has found solutions to long-standing mathematical and computational problems that outperform existing human methods [2]

Group 2: Limitations of AI in Mathematics
- Despite these impressive performances, experts believe current AI models cannot yet assist genuine mathematical research, since competition problems are closer to intellectual games with recognizable patterns [2]
- A test by Epoch AI revealed that LLMs struggled with high-difficulty problems designed to avoid previously seen training data, indicating significant limits to their problem-solving ability [3]
- AI also faces challenges with "super-long reasoning chains": complex problems may require millions of steps to solve, making it difficult for AI to find correct solutions [5]

Group 3: Innovative Approaches and Future Directions
- Researchers are developing methods that package multiple steps into "super steps" to tackle complex problems, an approach that has already produced breakthroughs on classic unsolved problems [5][6]
- Exploring new mathematical ideas remains crucial; AI tools such as AlphaEvolve can generate and refine solutions while leaving room for human intervention to supply inspiration [7]
- AI is seen as a potential tool for discovering new mathematical objects, but it currently lacks true creativity, and significant innovations are still attributed to human mathematicians [8]
Daily Institutional Analysis: June 13
Xin Hua Cai Jing· 2025-06-13 08:29
Group 1
- HSBC's head of foreign-exchange strategy indicates that geopolitical risks are putting pressure on the British pound, a risk-sensitive currency, which dropped to around 1.3530 against the US dollar [1]
- Danske Bank analysts report that the recent 30-year US Treasury auction showed strong demand, alleviating concerns about long-term Treasury demand and pushing yields below the critical 5% level [1]
- Sweden's Nordea Bank anticipates that the Swedish central bank will lower interest rates in June, reflecting expectations among fixed-income investors [2]

Group 2
- Analysts from Mizuho Securities highlight that current geopolitical tensions have not been fully reflected in market volatility, with the risk of full-scale conflict increasing [2]
- HSBC Global Research predicts the Philippine central bank will lower its policy rate to 5.25%, a change from its earlier call for rates on hold, citing low inflation and slow economic growth [2]
- Economists from Wilmington Trust suggest the long-term impact of US tariffs is more likely to be economic weakness than inflation, with consumers beginning to cut back on non-essential spending [2]

Group 3
- RSM's chief economist notes that rising prices in the US appliance market reflect cost increases from earlier import tariffs, emphasizing the importance of consumer behavior in determining how persistent inflation will be [3]
- Goldman Sachs analysts report that the US data-center securitization market has surged from $5 billion to $30 billion, driven by increased capital expenditure in cloud computing and policy support [3]
- Data-center occupancy rates are expected to peak by mid-2026, with growth primarily fueled by large investments in facilities equipped with thousands of GPUs for large language models [3]
CAS Team Develops Its Own Large Model to Automatically Design Ultra-Powerful Chips
半导体行业观察· 2025-06-12 00:41
Core Viewpoint
- The article discusses the development of QiMeng, an innovative system for fully automated hardware and software design of processor chips, addressing the challenges that advances in information technology and the limitations of existing methods pose for traditional design paradigms [1][5][18].

Group 1: Challenges in Processor Chip Design
- Traditional design paradigms face three fundamental limitations: constraints of manufacturing technology, limited design resources, and the increasing diversity of ecosystems [4][5].
- The physical limits of semiconductor manufacturing processes, particularly below 7nm, pose significant challenges, necessitating innovative design methods [4][5].
- The traditional design process is labor-intensive and requires extensive expertise, leading to prolonged development cycles and high costs [5][6].

Group 2: Automation in Processor Chip Design
- Automated processor chip design aims to streamline the entire design and verification process, leveraging artificial intelligence to surpass manual design capabilities [5][6].
- Automation can significantly reduce human intervention, enhance design efficiency, shorten development cycles, and lower costs while allowing rapid customization of chip architectures [5][6].
- The latest breakthroughs in large language models (LLMs) and multi-agent systems create new opportunities for fully automated processor chip design [6][18].

Group 3: QiMeng System Overview
- QiMeng consists of three layers: a Large Processor Chip Model (LPCM) at the bottom, hardware and software design agents in the middle, and various application programs at the top [9][10].
- LPCM is designed to address key challenges in processor chip design, including knowledge-representation gaps, data scarcity, correctness guarantees, and enormous solution spaces [10][25].
- The system aims to integrate all components and execute iterative design processes to establish a complete QiMeng system [2][12].

Group 4: LPCM Innovations
- LPCM employs a multi-modal architecture to understand and represent the graph data inherent in processor chip design, addressing the knowledge-representation gap [10][26].
- A cross-stage collaborative design database is essential for training LPCM, enabling the generation of large-scale, cross-stage-aligned processor chip design data [28][29].
- LPCM's feedback-driven reasoning mechanism incorporates both functional-correctness feedback and performance feedback to ensure high-quality design outputs [32][34].

Group 5: Hardware and Software Design Agents
- The hardware design agent utilizes a dual feedback mechanism to achieve end-to-end automated design from functional specifications to physical layouts [11][45].
- The software design agent automates the adaptation and optimization of foundational software, addressing the challenges posed by diverse instruction-set architectures [50][51].
- Both agents are designed to work collaboratively, enhancing the overall efficiency and effectiveness of the automated design process [40][48].

Group 6: Future Directions
- Future research will focus on integrating all components of QiMeng and establishing a self-evolving framework that enhances its capabilities for automated processor chip design [12][22].
- The development roadmap includes transitioning from top-down to bottom-up approaches, ultimately creating a system that can adapt to increasingly complex design scenarios [21][22].
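The dual feedback mechanism described for the hardware design agent can be pictured as a generate-verify-optimize loop: a generator proposes a design, a functional check feeds correctness errors back, and a performance check feeds optimization hints back until both are satisfied. The sketch below is a hypothetical illustration only; every function name and the stub feedback logic are invented, and this is not QiMeng's actual implementation:

```python
# Hypothetical sketch of a dual-feedback design loop (illustration only,
# not QiMeng's code): functional-correctness feedback and performance
# feedback each drive a revision until both channels pass.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Design:
    spec: str
    revision: int = 0
    notes: list = field(default_factory=list)

def generate(spec: str) -> Design:
    """Stub for an LLM-based design generator."""
    return Design(spec=spec)

def functional_feedback(d: Design) -> Optional[str]:
    """Stub: a real agent would run simulation/formal verification here."""
    return None if d.revision >= 1 else "adder carry-chain mismatch"

def performance_feedback(d: Design) -> Optional[str]:
    """Stub: a real agent would run synthesis and timing analysis here."""
    return None if d.revision >= 2 else "critical path too long"

def design_loop(spec: str, max_iters: int = 10) -> Design:
    d = generate(spec)
    for _ in range(max_iters):
        # Correctness feedback takes priority over performance feedback.
        err = functional_feedback(d) or performance_feedback(d)
        if err is None:
            return d                  # both feedback channels satisfied
        d.notes.append(err)           # feed the error back into a revision
        d.revision += 1
    raise RuntimeError("no converged design within iteration budget")

result = design_loop("32-bit RISC core")
print(result.revision, result.notes)
```

The point of the structure is that the two feedback channels are ordered: a design must first be functionally correct before performance feedback is allowed to drive optimization, mirroring the correctness-then-performance priority the article describes for LPCM's reasoning mechanism.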
World's Top Mathematicians Stunned to Find in Tests That an AI Model Is Approaching Mathematical Genius
36Kr· 2025-06-08 23:49
Core Insights
- The AI reasoning model o4-mini has demonstrated capabilities close to those of a mathematical genius, impressing researchers at a secret math conference in Berkeley, California [1][5][7]
- o4-mini, developed by OpenAI, is a lightweight and flexible large language model (LLM) that has undergone specialized training, allowing it to tackle complex mathematical problems more effectively than traditional LLMs [1][2]
- The ongoing FrontierMath project aims to evaluate o4-mini's performance on a range of mathematical problems, with initial results showing it can solve approximately 20% of undergraduate- to research-level challenges [3][4]

Group 1
- The secret math conference gathered 30 renowned mathematicians to test the capabilities of o4-mini, which was able to solve some of the world's most challenging problems [1]
- o4-mini was trained on specialized datasets and refined with reinforcement learning from humans, enhancing its ability to reason through complex mathematical issues [1][2]
- The FrontierMath project, initiated by Epoch AI, will assess o4-mini's performance on new mathematical problems across various difficulty levels [3][4]

Group 2
- During the conference, mathematicians were surprised by o4-mini's ability to solve a problem considered an open question in number theory, showcasing its advanced reasoning skills [5][6]
- The AI solves problems far faster than human experts, completing in minutes tasks that would take professionals weeks or months [6]
- Concerns were raised about potential over-reliance on AI results, since o4-mini's confident assertions could lead to misplaced trust in its conclusions [6][7]

Group 3
- Conference discussions covered the future role of mathematicians in light of AI advances, suggesting a shift toward collaboration with AI to explore new mathematical truths [6][7]
- Ken Ono said that the performance of large language models such as o4-mini has surpassed that of many top graduate students, indicating a significant leap in AI capabilities [7]
Nvidia, Far Ahead
半导体芯闻· 2025-06-05 10:04
Core Insights
- The latest MLPerf benchmark results indicate that Nvidia's GPUs continue to dominate the market, particularly in pre-training of the Llama 3.1 405B large language model, despite AMD's recent advances [1][2][3]
- AMD's Instinct MI325X GPU has shown performance comparable to Nvidia's H200 in popular LLM fine-tuning benchmarks, marking a significant improvement over its predecessor [3][6]
- The MLPerf suite includes six benchmarks targeting various machine learning tasks, reflecting the industry's trend toward larger models and more resource-intensive pre-training [1][2]

Benchmark Performance
- Pre-training is the most resource-intensive task; the latest iteration uses Meta's Llama 3.1 405B, more than twice the size of GPT-3 and with a four-times-larger context window [2]
- Nvidia's Blackwell GPU achieved the fastest training times across all six benchmarks, and its first large-scale deployment is expected to raise performance further [2][3]
- In the LLM fine-tuning benchmark, Nvidia submitted a system with 512 B200 processors, highlighting the importance of efficient GPU interconnects for scaling performance [6][9]

GPU Utilization and Efficiency
- The latest pre-training submissions used between 512 and 8,192 GPUs, with performance scaling approaching linearity at 90% of ideal [9]
- Despite the heavier pre-training workloads, the maximum GPU count in submissions fell from over 10,000 in earlier rounds, attributed to improvements in GPU technology and interconnect efficiency [12]
- Companies are exploring the integration of multiple AI accelerators on a single large wafer to minimize network-related losses, as demonstrated by Cerebras [12]

Power Consumption
- MLPerf also includes power-consumption tests; Lenovo was the only company to submit results this round, indicating a need for more submissions in future tests [13]
- Fine-tuning an LLM on two Blackwell GPUs consumed 6.11 gigajoules, roughly the energy required to heat a small house for a winter [13]
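The scaling and energy figures above are easy to sanity-check: 90% of ideal scaling means the measured speedup over N times more GPUs reaches 0.9 × N, and 6.11 gigajoules converts to roughly 1,700 kWh. A quick check (the example training times fed to the scaling function are illustrative, not MLPerf data):

```python
# Sanity checks on the MLPerf figures above.
# 1 kWh = 3.6e6 J, so 1 GJ = 1e9 / 3.6e6 ≈ 277.8 kWh.
GJ_TO_KWH = 1e9 / 3.6e6

energy_kwh = 6.11 * GJ_TO_KWH
print(f"6.11 GJ ≈ {energy_kwh:,.0f} kWh")       # ≈ 1,697 kWh

def scaling_efficiency(t_base: float, n_base: int,
                       t_big: float, n_big: int) -> float:
    """Fraction of ideal (linear) speedup achieved when scaling GPU count."""
    ideal = n_big / n_base          # perfect scaling: 16x GPUs -> 16x faster
    actual = t_base / t_big         # measured speedup from training times
    return actual / ideal

# Illustrative times: scaling 512 -> 8,192 GPUs (16x) yields a 14.4x speedup.
print(f"{scaling_efficiency(100.0, 512, 100.0 / 14.4, 8192):.0%}")  # 90%
```

At US residential electricity prices of roughly $0.15/kWh, that 1,700 kWh corresponds to an energy bill on the order of a few hundred dollars per fine-tuning run, which is why power results are worth reporting alongside training times.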
Just Announced: The New ACM Doctoral Dissertation Award
机器之心· 2025-06-05 07:14
机器之心 report. Editors: Zhang Qian, +0

The new ACM Doctoral Dissertation Award has recently been announced. The award is presented annually to the author of the best doctoral dissertation in computer science and engineering. This year's honors are for 2024 and comprise one Doctoral Dissertation Award and two honorable mentions.

The winning dissertation is highly practical: it tackles the growing prevalence of mental health problems amid a shortage of professional therapists.

Since AI models such as DeepSeek took off, many people have turned to AI as a stand-in therapist. Yet in many cases AI cannot provide the professional guidance a real psychotherapist can. Perhaps "human-AI collaboration" is a more realistic middle path.

In the dissertation, award winner Ashish Sharma explores multiple approaches to better human-AI collaboration. His approach works roughly as follows:

The AI-assisted mental health tools he recently developed have been publicly released and are used by more than 160,000 people, most of them low-income; over 50% of users come from households with annual incomes below $40,000.

Two other dissertations received honorable mentions: one studies "using pseudorandom distributions to reveal the inherent computational limitations of low-complexity models of computation"; the other focuses on "how large language models make use of the massive text data they learn from during training".

As mental health problems surge worldwide, healthcare ...