Transformer Models

Nature sub-journal: The 漆远 / 曹风雷 / 徐丽成 team at 上海科学智能研究院 develops a new AI model for chemical reaction performance prediction and synthesis planning
生物世界· 2025-08-24 08:30
Written by 王聪 | Edited by 王多鱼 | Layout by 水成文
Artificial intelligence (AI) has transformed the field of precision organic synthesis. Data-driven approaches, including machine learning and deep learning, have shown great promise in predicting reaction performance and in synthesis planning. However, there is an inherent methodological divergence between reaction performance prediction, which is driven by numerical regression, and synthesis planning, which is driven by sequence generation, and this poses a major challenge to building a unified deep learning architecture. Professor 漆远 of 上海科学智能研究院 / 复旦大学人工智能创新与产业研究院, together with researchers 曹风雷 and 徐丽成 of 上海科学智能研究院, published a paper in the Nature sub-journal Nature Machine Intelligence titled "A unified pre-trained deep learning framework for cross-task reaction performance prediction and synthesis planning". By integrating graph neural networks (GNNs) with Transformer models, the study developed RXNGraphormer, a unified pre-trained deep learning framework for cross-task reaction performance prediction and synthesis planning, providing a general-purpose tool for chemical reaction prediction and synthesis design. By orienting toward intramolecular pattern recogni ...
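For readers unfamiliar with this kind of hybrid design, the following is a minimal PyTorch sketch of the general pattern the article describes: a GNN encoding intramolecular structure, a Transformer modeling interactions across reaction components, and a task head on top. It is not the RXNGraphormer implementation; the message-passing scheme, module names, dimensions, and the single yield-regression head are assumptions made purely for illustration.

```python
# Illustrative sketch only: GNN per molecule -> Transformer across reaction
# components -> regression head. NOT the actual RXNGraphormer code; all names,
# dimensions, and the message-passing scheme are assumptions.
import torch
import torch.nn as nn

class SimpleGNNEncoder(nn.Module):
    """Mean-aggregation message passing over an atom graph (illustrative only)."""
    def __init__(self, atom_dim: int, hidden_dim: int, num_layers: int = 3):
        super().__init__()
        self.input_proj = nn.Linear(atom_dim, hidden_dim)
        self.layers = nn.ModuleList(
            [nn.Linear(hidden_dim, hidden_dim) for _ in range(num_layers)]
        )

    def forward(self, atom_feats, adj):
        # atom_feats: (num_atoms, atom_dim), adj: (num_atoms, num_atoms)
        h = self.input_proj(atom_feats)
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)
        for layer in self.layers:
            h = torch.relu(layer((adj @ h) / deg) + h)  # neighbor average + residual
        return h.mean(dim=0)  # one embedding per molecule

class ReactionModelSketch(nn.Module):
    def __init__(self, atom_dim: int = 32, hidden_dim: int = 128):
        super().__init__()
        self.gnn = SimpleGNNEncoder(atom_dim, hidden_dim)
        enc_layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=4, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.yield_head = nn.Linear(hidden_dim, 1)  # e.g. reaction-yield regression

    def forward(self, molecules):
        # molecules: list of (atom_feats, adjacency) pairs, one per reaction component
        mol_embs = torch.stack([self.gnn(x, a) for x, a in molecules]).unsqueeze(0)
        ctx = self.transformer(mol_embs)          # cross-component interactions
        return self.yield_head(ctx.mean(dim=1))   # pooled -> predicted performance

# Toy usage with random "molecules" (atom features + adjacency matrices).
mols = [(torch.randn(n, 32), (torch.rand(n, n) > 0.7).float()) for n in (9, 12)]
print(ReactionModelSketch()(mols).shape)  # torch.Size([1, 1])
```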
中金 (CICC): A GRU model that incorporates a self-attention mechanism
中金点睛· 2025-07-14 23:39
Core Viewpoint
- The article discusses the evolution and optimization of time series models, particularly focusing on GRU and Transformer architectures, and introduces a new model called AttentionGRU(Res) that combines the strengths of both [1][6][49].

Group 1: Time Series Models Overview
- Time series models, such as LSTM, GRU, and Transformer, are designed for analyzing and predicting sequential data, effectively addressing long-term dependencies through specialized gating mechanisms [1][8].
- GRU, as an optimized variant, enhances computational efficiency while maintaining long-term memory capabilities, making it suitable for real-time prediction scenarios [2][4].
- The Transformer model revolutionizes sequence modeling through self-attention mechanisms and position encoding, demonstrating significant advantages in analyzing multi-dimensional time series data [2][4].

Group 2: Performance Comparison of Factors
- A systematic test of 159 cross-sectional factors and 158 time series factors revealed that while cross-sectional factors generally outperform time series factors, the latter showed better out-of-sample performance when used in RNN, LSTM, and GRU models [4][21].
- The average ICIR (Information Coefficient Information Ratio) of time series factors was found to be higher than that of cross-sectional factors, indicating better predictive performance despite a more dispersed distribution [4][20].
- In terms of returns, cross-sectional factors yielded a long-short excess return of 11%, compared to only 1% for time series factors, highlighting the differences in performance metrics [4][20].

Group 3: Model Optimization Strategies
- The article explores various optimization strategies for time series models, including adjustments to the propagation direction of the time series, optimization of gating structures, and overall structural combinations [5][27].
- Testing of BiGRU and GLU models showed limited improvement over the standard GRU model, while the Transformer model exhibited strong in-sample performance but suffered from overfitting in out-of-sample tests [5][28].
- The proposed AttentionGRU(Res) model combines a simplified self-attention mechanism with GRU (see the sketch after this list), achieving a balance between performance and stability and delivering an annualized excess return of over 30% in the full market [6][40][41].

Group 4: AttentionGRU(Res) Model Performance
- The AttentionGRU(Res) model demonstrated strong performance, achieving nearly a 12.6% annualized excess return over the past five years in rolling samples, indicating robustness across market conditions [6][49].
- The model's generalization ability was validated within the CSI 1000 stock range, yielding an annualized excess return of 10.8% and outperforming traditional GRU and Transformer structures [6][46][49].
- The integration of residual connections and simplified self-attention structures in the AttentionGRU(Res) model significantly improved training stability and predictive performance [35][40].
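As a concrete illustration of the combination named above, here is a minimal PyTorch sketch of a GRU followed by a simplified (single-head) self-attention block with a residual connection and layer normalization. The layer sizes, the placement of attention after the GRU, and the prediction head are assumptions for illustration only, not the report's actual AttentionGRU(Res) architecture.

```python
# Illustrative "GRU + simplified self-attention + residual" block, in the
# spirit of the AttentionGRU(Res) idea described above. Architecture choices
# here (single head, attention after the GRU, residual around the attention
# sublayer) are assumptions, not CICC's actual model.
import torch
import torch.nn as nn

class AttentionGRUSketch(nn.Module):
    def __init__(self, feat_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        # "Simplified" self-attention: a single head, no feed-forward sublayer.
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads=1, batch_first=True)
        self.norm = nn.LayerNorm(hidden_dim)
        self.head = nn.Linear(hidden_dim, 1)  # e.g. next-period return forecast

    def forward(self, x):
        # x: (batch, seq_len, feat_dim) of time-series factor values
        h, _ = self.gru(x)             # (batch, seq_len, hidden_dim)
        a, _ = self.attn(h, h, h)      # self-attention over time steps
        h = self.norm(h + a)           # residual connection + layer norm
        return self.head(h[:, -1])     # predict from the last time step

# Toy usage: 8 stocks, 20 trading days, 16 factors per day.
model = AttentionGRUSketch(feat_dim=16)
print(model(torch.randn(8, 20, 16)).shape)  # torch.Size([8, 1])
```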
A novel ultra-large-scale optoelectronic hybrid compute-in-memory scheme
半导体行业观察· 2025-06-29 01:51
Core Viewpoint
- The article discusses the development of a novel 2T1M optoelectronic hybrid computing architecture that addresses the IR drop issue in traditional CIM architectures, enabling larger array sizes and improved performance for deep learning applications, particularly large-scale Transformer models [1][2][9].

Group 1: Architecture Design and Working Principle
- The 2T1M architecture integrates electronic and photonic technologies to mitigate IR drop issues, utilizing a combination of two transistors and a modulator in each storage unit [2].
- The architecture employs FeFETs for multiplication operations, which exhibit low static power consumption and excellent linear characteristics in the subthreshold region [2].
- FeFETs demonstrate a sub-pA cutoff current and are expected to maintain performance over 10 years with over 10 million cycles [2].

Group 2: Optoelectronic Conversion and Lossless Summation
- The architecture utilizes lithium niobate (LN) modulators for converting electrical signals to optical signals, leveraging the Pockels effect to achieve phase shifts in light signals [4][6].
- The integration of multiple 2T1M units in a Mach-Zehnder interferometer allows for effective accumulation of phase shifts, enabling lossless summation of vector-matrix multiplication results [4][6] (see the toy model after this list).

Group 3: Transformer Application
- Experimental results indicate that the 2T1M architecture achieves 93.3% inference accuracy when running the ALBERT model, significantly outperforming traditional CIM architectures, which achieve only 48.3% accuracy under the same conditions [9].
- The 2T1M architecture supports an array size of up to 3750kb, over 150 times larger than traditional CIM architectures, which are limited to 256kb by IR drop constraints [9].
- The architecture's power efficiency is reported to be 164 TOPS/W, a 37-fold improvement over state-of-the-art traditional CIM architectures, which is crucial for enhancing energy efficiency in edge computing and data centers [9].
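The lossless-summation idea can be illustrated with a toy numerical model: each cell's electrical product is mapped to a small optical phase shift, the phase shifts add along one arm of a Mach-Zehnder interferometer, and the accumulated phase (read out as intensity) encodes the dot product without any cumulative voltage drop. The linear current-to-phase mapping, the scaling constant, and the ideal cosine-squared readout below are simplifying assumptions for illustration, not parameters from the paper.

```python
# Toy numerical model (not from the paper) of summing products in the optical
# domain: each 2T1M-style cell contributes a phase shift proportional to
# input * weight, the phases add along one Mach-Zehnder arm, and the output
# intensity encodes the accumulated dot product. The linear current-to-phase
# mapping and the ideal readout are simplifying assumptions.
import numpy as np

rng = np.random.default_rng(0)
inputs = rng.uniform(0.0, 1.0, size=256)    # word-line inputs (normalized)
weights = rng.uniform(0.0, 1.0, size=256)   # stored FeFET weights (kept non-negative
                                            # in this toy so the phase sign is unambiguous)

K_PHASE = 0.01  # assumed rad per unit (input * weight); a free scaling constant

# Each cell: electrical multiplication, then electro-optic phase modulation.
per_cell_phase = K_PHASE * inputs * weights

# Phases accumulate along the arm, so the summation happens optically,
# with no cumulative voltage (IR) drop across the array.
total_phase = per_cell_phase.sum()

# Ideal Mach-Zehnder readout: output intensity depends on the phase difference.
intensity = np.cos(total_phase / 2.0) ** 2

# Recover the dot product from the readout (ideal-device, single-branch regime).
recovered = 2.0 * np.arccos(np.sqrt(intensity)) / K_PHASE
print(f"true dot product      : {np.dot(inputs, weights):.4f}")
print(f"recovered from optics : {recovered:.4f}")
```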
In an age of information overload, how do you truly "understand" LLMs? Start with the 50 interview questions shared by MIT
机器之心· 2025-06-18 06:09
Core Insights
- The article discusses the rapid evolution and widespread adoption of Large Language Models (LLMs) in less than a decade, enabling millions globally to engage in creative and analytical tasks through natural language [2][3].

Group 1: LLM Development and Mechanisms
- LLMs have transformed from basic models to advanced intelligent agents capable of executing tasks autonomously, presenting both opportunities and challenges [2].
- Tokenization is a crucial process in LLMs, breaking down text into smaller units (tokens) for efficient processing, which enhances computational speed and model effectiveness [7][9].
- The attention mechanism in Transformer models allows LLMs to assign varying importance to different tokens, improving contextual understanding [10][12] (a short sketch of this, together with temperature sampling, follows this list).
- Context windows define the number of tokens LLMs can process simultaneously, impacting their ability to generate coherent outputs [13].
- Sequence-to-sequence models convert input sequences into output sequences, applicable in tasks like machine translation and chatbots [15].
- Embeddings represent tokens in a continuous space, capturing semantic features, and are initialized using pre-trained models [17].
- LLMs handle out-of-vocabulary words through subword tokenization methods, ensuring effective language understanding [19].

Group 2: Training and Fine-tuning Techniques
- LoRA and QLoRA are fine-tuning methods that allow efficient adaptation of LLMs with minimal memory requirements, making them suitable for resource-constrained environments [34].
- Techniques to prevent catastrophic forgetting during fine-tuning include rehearsal and elastic weight consolidation, ensuring LLMs retain prior knowledge [37][43].
- Model distillation enables smaller models to replicate the performance of larger models, facilitating deployment on devices with limited resources [38].
- Overfitting can be mitigated through methods like rehearsal and modular architecture, ensuring robust generalization to unseen data [40][41].

Group 3: Output Generation and Evaluation
- Beam search improves text generation by considering multiple candidate sequences, enhancing coherence compared to greedy decoding [51].
- Temperature settings control the randomness of token selection during text generation, balancing predictability and creativity [53].
- Prompt engineering is essential for optimizing LLM performance, as well-defined prompts yield more relevant outputs [56].
- Retrieval-Augmented Generation (RAG) enhances answer accuracy by integrating relevant document retrieval with generation [58].

Group 4: Challenges and Ethical Considerations
- LLMs face challenges in deployment, including high computational demands, potential biases, and issues with interpretability and privacy [116][120].
- Addressing biases in LLM outputs involves improving data quality, enhancing reasoning capabilities, and refining training methodologies [113].
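Two of the recurring concepts above, scaled dot-product attention and temperature-controlled sampling, are compact enough to show directly. The sketch below is a generic textbook formulation in PyTorch, not code from any particular LLM.

```python
# (1) Scaled dot-product attention: weights tokens by relevance.
# (2) Temperature-scaled sampling: trades off predictability vs. creativity.
# Generic textbook formulations, not code from any specific model.
import torch

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (seq_len, d). Scores are scaled by sqrt(d) to keep softmax stable.
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5     # (seq_len, seq_len)
    weights = torch.softmax(scores, dim=-1)         # each row sums to 1
    return weights @ v                              # context-mixed values

def sample_with_temperature(logits, temperature=1.0):
    # Lower temperature sharpens the distribution (more predictable choices);
    # higher temperature flattens it (more diverse / "creative" choices).
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

# Toy usage: 5 tokens with 16-dimensional representations, vocabulary of 4.
x = torch.randn(5, 16)
print(scaled_dot_product_attention(x, x, x).shape)       # torch.Size([5, 16])
logits = torch.tensor([2.0, 1.0, 0.5, 0.1])
print(sample_with_temperature(logits, temperature=0.7))  # index of sampled token
```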
New Harvard paper reveals how Transformer models and the human brain "struggle in sync": does AI also hesitate and change its mind?
36Kr· 2025-05-12 00:22
Core Insights
- A recent study from researchers at Harvard University, Brown University, and the University of Tübingen explores the similarities between the processing dynamics of Transformer models and real-time human cognition [1][2].

Group 1: Research Objectives
- The study aims to investigate the internal processing of AI models, particularly how it compares to human cognitive processes, rather than focusing only on the final outputs [2][4].
- The authors propose to analyze the processing dynamics at each layer of the Transformer model to see whether they align with real-time information processing in the human brain [4][24].

Group 2: Methodology and Findings
- The researchers recorded the outputs and changes at each layer of the Transformer model, introducing a series of "processing load" metrics to understand how confidence in answers evolves through the layers [7][24] (an illustrative layer-wise probe follows this list).
- The study found that both AI and humans exhibit similar patterns of hesitation and correction when faced with challenging questions, indicating a shared cognitive process [11][18][23].

Group 3: Specific Examples
- In the "capital killer question," both AI and humans initially leaned towards incorrect answers (e.g., Chicago) before correcting themselves to the right answer (Springfield) [13][15].
- In animal classification tasks, both AI and humans showed a tendency to initially misclassify (e.g., thinking whales are fish) before arriving at the correct classification [18][19].
- The study also highlighted that in logical reasoning tasks, both AI and humans can be misled by common misconceptions, leading to similar paths of confusion before reaching the correct conclusion [21][24].

Group 4: Implications
- The findings suggest that understanding the internal "thought processes" of AI could provide insights into human cognitive challenges and potentially guide experimental design in cognitive science [24].
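The paper's "processing load" metrics are its own and are not reproduced here, but the general idea of watching an answer emerge layer by layer can be illustrated with a logit-lens-style probe: project each layer's hidden state through the final layer norm and LM head and compare the probabilities of candidate answers. The sketch below uses GPT-2 from the Hugging Face transformers library as a stand-in model; the prompt and candidate tokens are illustrative only.

```python
# Logit-lens-style probe: see how the model's preference among candidate
# answers shifts from layer to layer. Uses GPT-2 as a stand-in; this is NOT
# the paper's own "processing load" metrics.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The capital of Illinois is"
candidates = [" Chicago", " Springfield"]            # intuitive vs. correct answer
cand_ids = [tok(c)["input_ids"][0] for c in candidates]  # first sub-token of each

with torch.no_grad():
    out = model(**tok(prompt, return_tensors="pt"), output_hidden_states=True)

# out.hidden_states: the embedding layer plus one hidden state per block.
for layer, h in enumerate(out.hidden_states):
    logits = model.lm_head(model.transformer.ln_f(h[:, -1]))   # last position
    probs = torch.softmax(logits, dim=-1)[0]
    line = ", ".join(f"{c.strip()}={probs[i].item():.3f}"
                     for c, i in zip(candidates, cand_ids))
    print(f"layer {layer:2d}: {line}")
```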
In Depth | NVIDIA's Jensen Huang: The GPU is a time machine that lets people see the future; over the next decade, AI will surpass humans in some areas while empowering them
Z Potentials· 2025-03-01 03:53
Core Insights
- NVIDIA has rapidly evolved into one of the world's most valuable companies due to its pioneering role in transforming computing through innovative chip and software designs, particularly in the AI era [2][3].

Group 1: Historical Context
- The inception of NVIDIA was driven by the observation that a small portion of code in software could handle the majority of processing through parallel execution, leading to the development of the first modern GPU [3][4].
- The choice to focus on video games was strategic, as the gaming market was identified as both a driver of technological advancement and a significant entertainment market [5][6].

Group 2: Technological Innovations
- The introduction of CUDA allowed programmers to use familiar programming languages to harness GPU power, significantly broadening access to parallel processing capabilities [7][9].
- The success of AlexNet in 2012 marked a pivotal moment in AI, demonstrating the potential of GPUs for training deep learning models and initiating a profound transformation of the AI landscape [11][12].

Group 3: Current Developments
- Major breakthroughs in computer vision, speech recognition, and language understanding have been achieved in recent years, showcasing the rapid advancement of AI capabilities [14][15].
- NVIDIA is focusing on the application of AI in fields including digital biology, climate science, and robotics, indicating a shift towards practical applications of AI technology [21][38].

Group 4: Future Vision
- The future of automation is anticipated to encompass all moving entities, with robots and autonomous systems becoming commonplace in daily life [26][27].
- NVIDIA's ongoing projects, such as Omniverse and Cosmos, aim to create advanced generative systems that will significantly impact robotics and physical systems [37][38].

Group 5: Energy Efficiency and Limitations
- The company emphasizes the importance of energy efficiency in computing, having achieved a remarkable 10,000-fold increase in energy efficiency for AI computations since 2016 [32][33].
- Current physical limitations in computing are acknowledged, with a focus on improving energy efficiency to enhance computational capabilities [31][32].