Transformer Models
Is GPT Imitating Humans? Nature Finds the Brain Was the Earliest Transformer
36Kr · 2025-12-11 10:48
[Introduction] We thought language was grammar, rules, structure. But a new Nature study tears away that illusion: GPT's layer hierarchy turns out to match, one for one, the "temporal imprint" inside the human brain. As shallow, middle, and deep layers light up in the brain in sequence, we see for the first time that understanding language may never have been parsing at all; it may be prediction.

We have long believed that the human brain understands language through a rigorous apparatus of rules, grammar, and structural analysis, something complex and unique. That has been the "consensus" for decades. But a disruptive study recently published in Nature Communications turns this old belief upside down.

Paper: https://www.nature.com/articles/s41467-025-65518-0

The researchers had participants listen to a 30-minute story while millisecond-resolution EEG captured the brain's response to every word. They then fed the same story text to large language models such as GPT-2 and Llama-2 and extracted each layer's internal representation of the text.

The result was striking: GPT's seemingly cold layer hierarchy found a precise temporal counterpart in the human brain. We used to assume GPT was imitating humans, but this experiment offers an earth-shaking hint: perhaps our brains are naturally shaped like "GPT." GPT's structure can, in the ...
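The model-side pipeline the study describes (feed the story to a language model, read out every layer's representation) can be sketched roughly as follows. This is a minimal illustration assuming the Hugging Face transformers API; the variable names and preprocessing are illustrative, not taken from the paper.

```python
# Minimal sketch: extract per-layer GPT-2 representations for a story,
# in the spirit of the study's model-side pipeline. Assumes the
# Hugging Face `transformers` library; names are illustrative.
import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

story = "Once upon a time ..."  # stand-in for the 30-minute story transcript
inputs = tokenizer(story, return_tensors="pt", truncation=True)

with torch.no_grad():
    out = model(**inputs)

# out.hidden_states is a tuple: the embedding output plus one tensor per
# layer, each of shape (batch, seq_len, hidden_dim). Layer depth is the
# quantity the study relates to EEG response latency per word.
layer_reps = torch.stack(out.hidden_states)  # (n_layers + 1, 1, seq, dim)
print(layer_reps.shape)
```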
NeurIPS 2025 | DePass: Unified Feature Attribution via Decomposition in a Single Forward Pass
机器之心· 2025-12-01 04:08
Co-first authors: Hong Xiangyu, a senior undergraduate in the Department of Electronic Engineering at Tsinghua University, recipient of the Tsinghua Jiang Nanxiang Scholarship, with papers at top venues including NeurIPS, EMNLP, and NAACL; and Jiang Che, a third-year PhD student in the same department, whose research focuses on LLM interpretability and LLM agents, with papers at NeurIPS, ICML, EMNLP, and NAACL.

As large language models demonstrate remarkable generation and reasoning abilities across tasks, precisely tracing model outputs back to their internal computation has become an important direction in AI interpretability research. Existing methods, however, are often computationally expensive and struggle to reveal information flow through intermediate layers; moreover, attribution at different granularities (tokens, model components, or representation subspaces) typically relies on separate, specialized methods, lacking a unified and efficient analysis framework.

To address this, a research team from Tsinghua University and Shanghai AI Lab proposes DePass (Decomposed Forward Pass), a new unified feature-attribution framework. The method decomposes every hidden state in the forward pass into multiple additive sub-states and propagates them layer by layer while keeping attention weights and MLP activations fixed, achieving a lossless decomposition of the information flow inside the Transformer and precise attribution. With DePass, researchers can attribute to input tokens, ...
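To make the additivity idea concrete, here is a toy sketch (not the authors' code): a hidden state is split into additive sub-states, and with the attention weights frozen from the original forward pass the layer becomes linear in its input, so the propagated sub-states sum back exactly to the full output.

```python
# Toy illustration of the DePass principle: with attention weights A held
# fixed, propagation is linear, so an additive decomposition of the
# hidden states survives the layer without loss.
import torch

torch.manual_seed(0)
seq, dim, k = 4, 8, 3                 # sequence length, width, # sub-states

h = torch.randn(seq, dim)             # original hidden states
rand = torch.randn(k - 1, seq, dim)
parts = torch.cat([rand, (h - rand.sum(0))[None]])  # sub-states summing to h

A = torch.softmax(torch.randn(seq, seq), dim=-1)    # frozen attention weights
W_v = torch.randn(dim, dim)                          # value projection

def propagate(x):
    # With A fixed, x -> A @ (x @ W_v) is linear in x.
    return A @ (x @ W_v)

full = propagate(h)
decomposed = sum(propagate(p) for p in parts)
print(torch.allclose(full, decomposed, atol=1e-5))   # True: lossless
```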
Injecting Long-Term Memory into Transformers: The Memo Framework "Learns to Summarize" to Tackle a Core Challenge of Embodied AI
机器人大讲堂· 2025-10-29 10:03
Core Insights
- The article discusses the limitations of Transformer models in handling long-term memory tasks and introduces Memo, a new architecture designed to enhance memory efficiency in long-sequence reinforcement learning tasks [1][3][18]

Group 1: Memo Framework
- Memo mimics human note-taking by letting the model autonomously generate and store summaries of past experiences, enabling efficient retrieval of long-term memory with minimal memory overhead [3][5]
- The framework processes long input sequences in segments and generates a fixed number of optimized summary tokens at the end of each segment [4][5]

Group 2: Technical Implementation
- Memo employs a special attention masking mechanism so the model can access past information only through summary tokens, creating a deliberate information bottleneck [6]
- It uses flexible positional encoding to help the model track the temporal position of observations and summaries, which is crucial for causal reasoning [6]
- Randomizing segment lengths during training improves the model's adaptability to varying task rhythms [6] (a minimal sketch of the segment-and-summarize loop follows this entry)

Group 3: Experimental Validation
- Memo was tested in two embodied-intelligence scenarios, the ExtObjNav task and the Dark-Key-To-Door task, against baselines including the Full Context Transformer (FCT) and the Recurrent Memory Transformer (RMT) [7][11]
- In ExtObjNav, Memo performed best, cutting context-token usage 8-fold while retaining strong reasoning beyond the training sequence length [9]
- In Dark-Key-To-Door, Memo consistently remembered the locations of the key and the door, while FCT degraded sharply after a certain number of steps, highlighting the limits of full-context models [11]

Group 4: Key Findings from Ablation Studies
- Memo's cumulative memory mechanism outperforms fixed-size memory, accumulating knowledge over time rather than relying solely on recent experience [14]
- Long-range gradient propagation is essential for effective memory use; restricting gradients to short-term memory significantly degrades performance [17]
- A summary length of 32 tokens balances information compression against retention; longer summaries introduce redundancy and noise [17]

Group 5: Conclusion and Future Directions
- Memo is a significant step toward more efficient and intelligent long-horizon reasoning, letting models manage their own attention and memory [18]
- The memory mechanism has broad applications, from autonomous navigation robots to personalized systems that track long-term user preferences [18]
- Future research will focus on the adaptability and interpretability of memory mechanisms and on balancing memory stability against flexibility [18]
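The segment-and-summarize loop referenced above can be sketched in a few lines. This is not the Memo implementation: the encoder, the learned summary slots, and the omission of the paper's attention mask and positional-encoding scheme are all simplifying assumptions.

```python
# Hypothetical sketch of a Memo-style loop: process a long sequence in
# segments, write a fixed number of summary tokens per segment, and let
# later segments see the past only through the accumulated summaries.
import torch
import torch.nn as nn

dim, n_summary, seg_len = 64, 32, 128
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
    num_layers=2,
)
summary_query = nn.Parameter(torch.randn(1, n_summary, dim))  # learned slots

def run(segments):
    memory = torch.zeros(1, 0, dim)                 # cumulative summary bank
    for seg in segments:                             # seg: (1, seg_len, dim)
        x = torch.cat([memory, seg, summary_query], dim=1)
        y = encoder(x)                               # Memo's special mask would go here
        new_summary = y[:, -n_summary:]              # read out the summary slots
        memory = torch.cat([memory, new_summary], dim=1)  # cumulative memory
    return memory

segments = [torch.randn(1, seg_len, dim) for _ in range(3)]
print(run(segments).shape)  # (1, 3 * n_summary, dim)
```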
Nature Sub-journal: Qi Yuan / Cao Fenglei / Xu Licheng's Team at the Shanghai Academy of AI for Science Develops a New AI Model for Chemical Reaction Performance Prediction and Synthesis Planning
生物世界· 2025-08-24 08:30
Core Viewpoint
- Artificial Intelligence (AI) has significantly transformed precise organic synthesis, showing great potential for reaction performance prediction and synthesis planning through data-driven methods, including machine learning and deep learning [2][3]

Group 1: Research Overview
- A recent study published in Nature Machine Intelligence introduces RXNGraphormer, a unified pre-trained deep learning framework that integrates Graph Neural Networks (GNN) and Transformer models to bridge the methodological gap between reaction performance prediction and synthesis planning [3][5] (an illustrative architecture sketch follows this entry)
- RXNGraphormer is designed to handle both reaction performance prediction and synthesis planning through a single unified pre-training approach [5][7]

Group 2: Performance and Training
- The model was trained on 13 million chemical reactions and achieved state-of-the-art (SOTA) performance across eight benchmark datasets covering reaction activity/selectivity prediction and forward/retrosynthesis planning, as well as on three external real-world datasets [5][7]
- Notably, the chemical feature embeddings produced by the model cluster by reaction type without supervision [5]
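The summary names the ingredients (a GNN for molecular structure, a Transformer on top) but not the wiring. The sketch below shows one plausible composition, purely for illustration; every module choice is an assumption, and none of it is RXNGraphormer's actual code.

```python
# Illustrative only: a GNN encoder for intramolecular structure feeding a
# Transformer that models interactions, with a regression head for a
# reaction-level property such as yield or selectivity.
import torch
import torch.nn as nn

class GNNLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.lin = nn.Linear(dim, dim)
    def forward(self, x, adj):
        # aggregate neighbor features via the adjacency matrix, then map
        return torch.relu(self.lin(adj @ x))

class GraphThenTransformer(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.gnn = GNNLayer(dim)
        self.tf = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.head = nn.Linear(dim, 1)          # hypothetical property head
    def forward(self, atom_feats, adj):
        h = self.gnn(atom_feats, adj)          # intramolecular message passing
        h = self.tf(h.unsqueeze(0))            # atom-level attention on top
        return self.head(h.mean(dim=1))        # one score per reaction

model = GraphThenTransformer()
atoms, adj = torch.randn(10, 64), torch.eye(10)
print(model(atoms, adj).shape)  # (1, 1)
```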
CICC: A GRU Model Combined with a Self-Attention Mechanism
中金点睛· 2025-07-14 23:39
Core Viewpoint
- The article discusses the evolution and optimization of time series models, focusing on GRU and Transformer architectures, and introduces a new model, AttentionGRU(Res), that combines the strengths of both [1][6][49]

Group 1: Time Series Models Overview
- Time series models such as LSTM, GRU, and Transformer are designed for analyzing and predicting sequential data, handling long-term dependencies through specialized gating mechanisms [1][8]
- GRU, an optimized variant, improves computational efficiency while retaining long-term memory, making it suitable for real-time prediction scenarios [2][4]
- The Transformer reshaped sequence modeling through self-attention and positional encoding, with clear advantages on multi-dimensional time series data [2][4]

Group 2: Performance Comparison of Factors
- A systematic test of 159 cross-sectional factors and 158 time series factors found that while cross-sectional factors generally outperform time series factors, the latter showed better out-of-sample performance when used in RNN, LSTM, and GRU models [4][21]
- The average ICIR (Information Coefficient Information Ratio) of time series factors was higher than that of cross-sectional factors, indicating better predictive performance despite a more dispersed distribution [4][20]
- On returns, cross-sectional factors delivered a long-short excess return of 11%, versus only 1% for time series factors [4][20]

Group 3: Model Optimization Strategies
- The article explores optimization strategies for time series models, including reversing the propagation direction of the sequence, optimizing gating structures, and combining overall architectures [5][27]
- BiGRU and GLU variants showed limited improvement over the standard GRU, while the Transformer performed strongly in sample but overfit out of sample [5][28]
- The proposed AttentionGRU(Res) model combines a simplified self-attention mechanism with a GRU, balancing performance and stability and achieving an annualized excess return of over 30% across the full market [6][40][41] (a minimal architecture sketch follows this entry)

Group 4: AttentionGRU(Res) Model Performance
- In rolling samples over the past five years, AttentionGRU(Res) achieved a near 12.6% annualized excess return, indicating robustness across market conditions [6][49]
- Its generalization was validated within the CSI 1000 universe, where it earned a 10.8% annualized excess return, outperforming plain GRU and Transformer structures [6][46][49]
- The integration of residual connections and a simplified self-attention structure significantly improved training stability and predictive performance [35][40]
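A minimal PyTorch sketch of an attention-plus-GRU block with a residual connection, in the spirit of AttentionGRU(Res). The exact wiring in the CICC report is not public here; this layout (attention front-end, residual around it, GRU back-end) is an assumption for illustration.

```python
# Sketch: self-attention reweights the factor time series, a residual
# connection stabilizes training, and a GRU provides gated temporal memory.
import torch
import torch.nn as nn

class AttentionGRURes(nn.Module):
    def __init__(self, dim=32, hidden=64):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.gru = nn.GRU(dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)
    def forward(self, x):                 # x: (batch, time, dim) factor series
        a, _ = self.attn(x, x, x)         # simplified self-attention
        x = x + a                         # residual connection
        h, _ = self.gru(x)                # gated recurrence over time
        return self.head(h[:, -1])        # e.g. next-period return forecast

model = AttentionGRURes()
print(model(torch.randn(8, 20, 32)).shape)  # (8, 1)
```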
A Novel Ultra-Large-Scale Optoelectronic Hybrid Compute-in-Memory Scheme
半导体行业观察· 2025-06-29 01:51
Core Viewpoint
- The article discusses the development of a novel 2T1M optoelectronic hybrid computing architecture that addresses the IR-drop issue in traditional CIM architectures, enabling larger array sizes and improved performance for deep learning applications, particularly large-scale Transformer models [1][2][9]

Group 1: Architecture Design and Working Principle
- The 2T1M architecture integrates electronic and photonic technologies to mitigate IR drop, combining two transistors and a modulator in each storage unit [2]
- The architecture employs FeFETs for multiplication operations, which exhibit low static power consumption and excellent linearity in the subthreshold region [2]
- FeFETs demonstrate a sub-pA cutoff current and are expected to maintain performance over 10 years and more than 10 million cycles [2]

Group 2: Optoelectronic Conversion and Lossless Summation
- The architecture uses lithium niobate (LN) modulators to convert electrical signals to optical signals, leveraging the Pockels effect to impose phase shifts on the light [4][6]
- Integrating multiple 2T1M units in a Mach-Zehnder interferometer accumulates these phase shifts, enabling lossless summation of vector-matrix multiplication results [4][6] (a simplified model of this summation follows this entry)

Group 3: Transformer Application
- Experimental results show the 2T1M architecture achieves 93.3% inference accuracy running the ALBERT model, far above the 48.3% of traditional CIM architectures under the same conditions [9]
- The 2T1M architecture supports array sizes up to 3750 kb, more than 150 times larger than traditional CIM architectures, which IR drop limits to 256 kb [9]
- Power efficiency is reported at 164 TOPS/W, a 37-fold improvement over state-of-the-art traditional CIM architectures, which is crucial for energy efficiency in edge computing and data centers [9]
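A simplified model of the lossless summation, under the assumption (made here for illustration, not stated in the summary) that each 2T1M cell imposes an optical phase shift proportional to the product of its input and stored weight:

```latex
% Assumption: the phase shift of cell i is linear in its local product x_i w_i
\phi_i = \kappa \, x_i w_i, \qquad
\phi_{\mathrm{total}} = \sum_{i=1}^{N} \phi_i = \kappa \sum_{i=1}^{N} x_i w_i
% The Mach-Zehnder interferometer output intensity then encodes the dot product:
I_{\mathrm{out}} \propto \cos^{2}\!\left(\frac{\phi_{\mathrm{total}}}{2}\right)
```

Because the phases add along the optical path rather than as currents on a shared wire, the summation itself draws no accumulating electrical current, which is why IR drop no longer bounds the array size.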
In an Age of Information Overload, How Do You Truly "Get" LLMs? Start with 50 Interview Questions Shared by MIT
机器之心· 2025-06-18 06:09
Core Insights
- The article discusses the rapid evolution and widespread adoption of Large Language Models (LLMs) in less than a decade, enabling millions globally to engage in creative and analytical tasks through natural language [2][3]

Group 1: LLM Development and Mechanisms
- LLMs have transformed from basic models into advanced intelligent agents capable of executing tasks autonomously, presenting both opportunities and challenges [2]
- Tokenization is a crucial process in LLMs, breaking text into smaller units (tokens) for efficient processing, which improves computational speed and model effectiveness [7][9]
- The attention mechanism in Transformer models allows LLMs to assign varying importance to different tokens, improving contextual understanding [10][12] (see the sketch after this entry)
- Context windows define the number of tokens an LLM can process simultaneously, bounding its ability to generate coherent outputs [13]
- Sequence-to-sequence models convert input sequences into output sequences, as in machine translation and chatbots [15]
- Embeddings represent tokens in a continuous space that captures semantic features, and are typically initialized from pre-trained models [17]
- LLMs handle out-of-vocabulary words through subword tokenization, ensuring effective language understanding [19]

Group 2: Training and Fine-tuning Techniques
- LoRA and QLoRA are fine-tuning methods that adapt LLMs efficiently with minimal memory requirements, suiting resource-constrained environments [34]
- Catastrophic forgetting during fine-tuning can be reduced with rehearsal and elastic weight consolidation, preserving prior knowledge [37][43]
- Model distillation enables smaller models to replicate the performance of larger ones, facilitating deployment on limited hardware [38]
- Overfitting can be mitigated through methods like rehearsal and modular architecture, ensuring robust generalization to unseen data [40][41]

Group 3: Output Generation and Evaluation
- Beam search improves text generation by tracking multiple candidate sequences, yielding more coherent output than greedy decoding [51]
- Temperature settings control the randomness of token selection during generation, balancing predictability and creativity [53]
- Prompt engineering is essential for optimizing LLM performance, as well-defined prompts yield more relevant outputs [56]
- Retrieval-Augmented Generation (RAG) improves answer accuracy by integrating relevant document retrieval with generation [58]

Group 4: Challenges and Ethical Considerations
- LLMs face deployment challenges including high computational demands, potential biases, and issues with interpretability and privacy [116][120]
- Addressing biases in LLM outputs involves improving data quality, enhancing reasoning capabilities, and refining training methodologies [113]
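Two of the list's core concepts, scaled dot-product attention and temperature-controlled sampling, fit in a few lines of textbook code. This is a generic illustration, not code from the MIT post.

```python
# Scaled dot-product attention and temperature sampling, in plain NumPy.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # each query token attends to every key token, weighted by similarity
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

def sample_with_temperature(logits, T=1.0, rng=np.random.default_rng(0)):
    # T < 1 sharpens the distribution (predictable); T > 1 flattens it (creative)
    return rng.choice(len(logits), p=softmax(logits / T))

Q = K = V = np.random.default_rng(1).normal(size=(5, 8))
print(attention(Q, K, V).shape)                               # (5, 8)
print(sample_with_temperature(np.array([2.0, 1.0, 0.1]), T=0.5))
```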
A New Harvard Paper Shows Transformer Models and the Human Brain "Struggling in Sync": Does AI Also Hesitate and Backtrack?
36Kr · 2025-05-12 00:22
Core Insights
- A recent study from researchers at Harvard University, Brown University, and the University of Tübingen explores the similarities between the processing dynamics of Transformer models and real-time human cognition [1][2]

Group 1: Research Objectives
- The study investigates the internal processing of AI models, particularly how it compares to human cognitive processes, rather than just the final outputs [2][4]
- The authors analyze the processing dynamics at each layer of the Transformer to see whether they align with real-time information processing in the human brain [4][24]

Group 2: Methodology and Findings
- The researchers recorded the outputs and changes at each layer of the Transformer, introducing a series of "processing load" metrics to track how confidence in an answer evolves through the layers [7][24] (a probe in this spirit is sketched after this entry)
- Both AI and humans exhibited similar patterns of hesitation and correction when faced with challenging questions, suggesting a shared cognitive dynamic [11][18][23]

Group 3: Specific Examples
- In the "capital killer question," both AI and humans initially leaned toward the wrong answer (Chicago) before correcting to the right one (Springfield) [13][15]
- In animal classification tasks, both showed a tendency to misclassify at first (e.g., treating whales as fish) before arriving at the correct category [18][19]
- In logical reasoning tasks, both can be misled by common misconceptions, following similar paths of confusion before reaching the correct conclusion [21][24]

Group 4: Implications
- Understanding the internal "thought processes" of AI could provide insights into human cognitive challenges and help guide experimental design in cognitive science [24]
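One common way to read out per-layer "leanings" of the kind the study describes is a logit-lens-style probe: decode every layer's hidden state through the model's output head and watch the top answer and its confidence change with depth. Whether the paper's "processing load" metrics are computed exactly this way is an assumption; the sketch below uses GPT-2 via Hugging Face transformers.

```python
# Logit-lens-style probe: project each layer's last hidden state through
# the final layer norm and (tied) output head, then track the top token.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2", output_hidden_states=True)
lm.eval()

prompt = "The capital of Illinois is"   # the study's "capital killer question" domain
ids = tok(prompt, return_tensors="pt")

with torch.no_grad():
    out = lm(**ids)

for layer, h in enumerate(out.hidden_states):
    logits = lm.lm_head(lm.transformer.ln_f(h[:, -1]))
    p, tok_id = torch.softmax(logits, -1).max(-1)
    print(f"layer {layer:2d}: {tok.decode(tok_id)!r}  confidence={p.item():.2f}")
```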
Why Is Every AI Product Manager Losing Money Right Now?
36Kr · 2025-05-06 01:50
Core Insights
- The current landscape of AI product management is characterized by a focus on iterative improvements rather than creating products from scratch, leading to instability and financial losses for AI product managers [1][21]
- The Transformer model, while popular, is not necessarily the best architecture for AI applications, as it struggles with issues like hallucination and high training costs [2][5]
- The emergence of alternative models, such as diffusion models and the Yan model, indicates a shift in the AI landscape, with potential implications for product design and functionality [3][5]

Group 1: AI Product Management Challenges
- AI product managers are primarily engaged in API integration rather than developing proprietary models, limiting their ability to innovate and compete [6][8]
- The high costs of model fine-tuning and infrastructure, including servers and operations, create significant barriers to profitability [9][10]
- User acquisition for AI products still relies on traditional internet marketing strategies, which may not be enough to differentiate AI offerings in a crowded market [10][12]

Group 2: User Perception and Market Dynamics
- The transition of AI from novelty to necessity is incomplete, as the productivity gains from AI tools remain unclear [15][20]
- Despite AI's potential to assist in various tasks, the need for human oversight and correction limits the efficiency gains users experience [17][21]
- Users' willingness to pay for AI services is low, as many seek free alternatives or hesitate to invest in tools without demonstrated value [21][22]
In Depth | NVIDIA's Jensen Huang: The GPU Is a Time Machine That Lets People See the Future; Over the Next Decade, AI Will Surpass Humans in Some Domains While Empowering Them
Z Potentials· 2025-03-01 03:53
Core Insights
- NVIDIA has rapidly become one of the world's most valuable companies thanks to its pioneering role in transforming computing through innovative chip and software design, particularly in the AI era [2][3]

Group 1: Historical Context
- NVIDIA's founding insight was that a small portion of a program's code handles the majority of processing and can be executed in parallel, which led to the development of the first modern GPU [3][4]
- The focus on video games was strategic: the gaming market was identified as both a driver of technological advancement and a significant entertainment market [5][6]

Group 2: Technological Innovations
- The introduction of CUDA allowed programmers to use familiar languages to harness GPU power, significantly broadening access to parallel processing [7][9]
- The success of AlexNet in 2012 marked a pivotal moment in AI, demonstrating the potential of GPUs for training deep learning models and initiating a profound transformation of the field [11][12]

Group 3: Current Developments
- Major breakthroughs in computer vision, speech recognition, and language understanding in recent years showcase the rapid advance of AI capabilities [14][15]
- NVIDIA is focusing on applying AI across fields including digital biology, climate science, and robotics, indicating a shift toward practical applications [21][38]

Group 4: Future Vision
- The future of automation is anticipated to encompass everything that moves, with robots and autonomous systems becoming commonplace in daily life [26][27]
- Ongoing projects such as Omniverse and Cosmos aim to create advanced generative systems that will significantly impact robotics and physical systems [37][38]

Group 5: Energy Efficiency and Limitations
- The company emphasizes energy efficiency in computing, citing a 10,000-fold improvement in the energy efficiency of AI computation since 2016 [32][33]
- Current physical limits of computing are acknowledged, with improving energy efficiency seen as the main lever for further computational capability [31][32]