Transformer Models
Nature sub-journal: The team of Qi Yuan, Cao Fenglei, and Xu Licheng at the Shanghai Academy of AI for Science develops a new AI model for chemical reaction performance prediction and synthesis planning
生物世界· 2025-08-24 08:30
Core Viewpoint
- Artificial Intelligence (AI) has significantly transformed the field of precise organic synthesis, showing immense potential for predicting reaction performance and planning syntheses through data-driven methods, including machine learning and deep learning [2][3].

Group 1: Research Overview
- A recent study published in Nature Machine Intelligence introduces a unified pre-trained deep learning framework called RXNGraphormer, which integrates Graph Neural Networks (GNNs) and Transformer models to bridge the methodological gap between reaction performance prediction and synthesis planning [3][5].
- The RXNGraphormer framework is designed to handle both reaction performance prediction and synthesis planning tasks jointly through a unified pre-training approach [5][7].

Group 2: Performance and Training
- The RXNGraphormer model was trained on 13 million chemical reactions and achieved state-of-the-art (SOTA) performance across eight benchmark datasets in reaction activity/selectivity prediction and forward/retro-synthesis planning, as well as on three external real-world datasets [5][7].
- Notably, the chemical feature embeddings generated by the model cluster by reaction type in an unsupervised manner [5].
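The GNN-plus-Transformer pairing described above can be illustrated with a toy sketch: a message-passing layer encodes each molecule's graph, atom embeddings are pooled into per-molecule vectors, and a self-attention layer then relates the molecules participating in a reaction. This is a minimal NumPy illustration of the general idea only, not the actual RXNGraphormer architecture; all function names, dimensions, and weights here are made up.

```python
import numpy as np

def chain_adjacency(n):
    """Adjacency matrix for a toy chain-shaped molecule of n atoms."""
    adj = np.zeros((n, n))
    for i in range(n - 1):
        adj[i, i + 1] = adj[i + 1, i] = 1.0
    return adj

def gnn_message_pass(node_feats, adj, W):
    """One round of neighborhood aggregation (a minimal GNN layer)."""
    agg = adj @ node_feats                       # sum neighbor features
    return np.maximum(0.0, (node_feats + agg) @ W)

def self_attention(x):
    """Single-head self-attention over a sequence of embeddings."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                # queries = keys, for brevity
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x

def encode_reaction(molecule_graphs, rng):
    """Encode each molecule with a GNN, then relate molecules with attention."""
    d = 8
    W = rng.standard_normal((d, d)) * 0.1
    mol_embeddings = []
    for node_feats, adj in molecule_graphs:
        h = gnn_message_pass(node_feats, adj, W)
        mol_embeddings.append(h.mean(axis=0))    # mean-pool atoms -> molecule
    return self_attention(np.stack(mol_embeddings))

rng = np.random.default_rng(0)
# Two toy "molecules": (atom features, adjacency matrix)
mols = [(rng.standard_normal((3, 8)), chain_adjacency(3)),
        (rng.standard_normal((4, 8)), chain_adjacency(4))]
out = encode_reaction(mols, rng)
print(out.shape)  # (2, 8): one contextualized embedding per molecule
```

The division of labor mirrors the summary: the GNN captures intra-molecular structure, while attention captures cross-molecular interactions within a reaction.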
CICC: A GRU Model Combined with a Self-Attention Mechanism
中金点睛· 2025-07-14 23:39
Core Viewpoint
- The article discusses the evolution and optimization of time series models, focusing on GRU and Transformer architectures, and introduces a new model called AttentionGRU(Res) that combines the strengths of both [1][6][49].

Group 1: Time Series Models Overview
- Time series models such as LSTM, GRU, and Transformer are designed for analyzing and predicting sequential data, addressing long-term dependencies through specialized gating mechanisms [1][8].
- GRU, as an optimized variant, improves computational efficiency while retaining long-term memory capability, making it suitable for real-time prediction scenarios [2][4].
- The Transformer model reshaped sequence modeling through self-attention mechanisms and positional encoding, showing clear advantages on multi-dimensional time series data [2][4].

Group 2: Performance Comparison of Factors
- A systematic test of 159 cross-sectional factors and 158 time series factors showed that while cross-sectional factors generally outperform time series factors, the latter performed better out of sample when used in RNN, LSTM, and GRU models [4][21].
- The average ICIR (Information Coefficient Information Ratio) of the time series factors was higher than that of the cross-sectional factors, indicating better predictive performance despite a more dispersed distribution [4][20].
- In terms of returns, cross-sectional factors yielded a long-short excess return of 11%, versus only 1% for time series factors, highlighting the gap in performance metrics [4][20].

Group 3: Model Optimization Strategies
- The article explores several optimization strategies for time series models, including adjusting the propagation direction of the series, optimizing gating structures, and combining overall architectures [5][27].
- Tests of BiGRU and GLU models showed limited improvement over the standard GRU, while the Transformer model performed strongly in sample but overfit out of sample [5][28].
- The proposed AttentionGRU(Res) model combines a simplified self-attention mechanism with GRU, balancing performance and stability and achieving an annualized excess return of over 30% in the full market [6][40][41].

Group 4: AttentionGRU(Res) Model Performance
- In rolling out-of-sample tests, the AttentionGRU(Res) model achieved a nearly 12.6% annualized excess return over the past five years, indicating robustness across market conditions [6][49].
- The model's generalization was validated on the CSI 1000 universe, yielding an annualized excess return of 10.8% and outperforming standard GRU and Transformer structures [6][46][49].
- Integrating residual connections and a simplified self-attention structure significantly improved training stability and predictive performance [35][40].
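The design pattern the report describes, a GRU backbone followed by a simplified self-attention layer with a residual connection, can be sketched generically. This NumPy sketch shows the pattern only, not CICC's actual AttentionGRU(Res) model; the gate equations are the standard GRU formulation, and all shapes and weights are arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, Wz, Wr, Wh):
    """One GRU cell update: update gate z, reset gate r, candidate state."""
    xh = np.concatenate([x, h])
    z = sigmoid(Wz @ xh)                         # how much state to renew
    r = sigmoid(Wr @ xh)                         # how much past to expose
    h_cand = np.tanh(Wh @ np.concatenate([x, r * h]))
    return (1.0 - z) * h + z * h_cand

def simple_attention(H):
    """Simplified self-attention: scores straight from hidden-state dot products."""
    d = H.shape[1]
    scores = H @ H.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ H

def attention_gru_res(seq, d_hidden, rng):
    """Run a GRU over the sequence, then add attention with a residual connection."""
    d_in = seq.shape[1]
    Wz = rng.standard_normal((d_hidden, d_in + d_hidden)) * 0.1
    Wr = rng.standard_normal((d_hidden, d_in + d_hidden)) * 0.1
    Wh = rng.standard_normal((d_hidden, d_in + d_hidden)) * 0.1
    h = np.zeros(d_hidden)
    states = []
    for x in seq:
        h = gru_step(x, h, Wz, Wr, Wh)
        states.append(h)
    H = np.stack(states)
    return H + simple_attention(H)               # residual connection

rng = np.random.default_rng(1)
seq = rng.standard_normal((20, 5))               # 20 time steps, 5 features
out = attention_gru_res(seq, d_hidden=8, rng=rng)
print(out.shape)  # (20, 8)
```

The residual add is the key stabilizer: if the attention branch contributes little, the model degrades gracefully to a plain GRU rather than overfitting like a full Transformer.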
A Novel Ultra-Large-Scale Optoelectronic Hybrid Compute-in-Memory Scheme
半导体行业观察· 2025-06-29 01:51
Core Viewpoint
- The article discusses the development of a novel 2T1M optoelectronic hybrid computing architecture that addresses the IR-drop problem of traditional compute-in-memory (CIM) architectures, enabling much larger array sizes and better performance for deep learning workloads, particularly large Transformer models [1][2][9].

Group 1: Architecture Design and Working Principle
- The 2T1M architecture integrates electronic and photonic technologies to mitigate IR drop, combining two transistors and one modulator in each storage cell [2].
- The architecture uses FeFETs for the multiplication operations; they exhibit low static power consumption and excellent linearity in the subthreshold region [2].
- The FeFETs demonstrate sub-pA off-state current and are expected to retain performance over 10 years and more than 10 million cycles [2].

Group 2: Optoelectronic Conversion and Lossless Summation
- The architecture uses lithium niobate (LN) modulators to convert electrical signals into optical ones, exploiting the Pockels effect to impose phase shifts on the light [4][6].
- Integrating multiple 2T1M cells along a Mach-Zehnder interferometer lets the phase shifts accumulate, enabling lossless summation of vector-matrix multiplication results [4][6].

Group 3: Transformer Application
- Experimental results show that the 2T1M architecture reaches 93.3% inference accuracy running the ALBERT model, far above the 48.3% achieved by traditional CIM architectures under the same conditions [9].
- The 2T1M architecture supports array sizes up to 3750 kb, more than 150 times larger than the 256 kb limit that IR drop imposes on traditional CIM architectures [9].
- Its power efficiency is reported at 164 TOPS/W, a 37-fold improvement over state-of-the-art traditional CIM architectures, which matters for energy efficiency in edge computing and data centers [9].
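The "lossless summation" idea can be illustrated numerically: if each cell imprints a phase shift proportional to x_i * w_i on the light in one interferometer arm, the phases simply add along the waveguide, so the accumulated phase encodes the exact dot product without the resistive-wire (IR drop) losses of electrical current summation. A toy sketch, with a made-up phase-per-unit constant and an idealized Mach-Zehnder transfer function:

```python
import numpy as np

def mzi_output_intensity(total_phase):
    """Idealized Mach-Zehnder interferometer: output intensity is set by
    the accumulated phase difference between the two arms."""
    return np.cos(total_phase / 2.0) ** 2

def optical_dot_product(x, w, phase_per_unit=1e-3):
    """Each cell contributes a phase shift proportional to x_i * w_i;
    the shifts accumulate along the waveguide (lossless summation)."""
    total_phase = phase_per_unit * np.dot(x, w)
    return total_phase, mzi_output_intensity(total_phase)

x = np.array([1.0, 0.5, 2.0])   # input activations
w = np.array([0.2, 0.4, 0.1])   # stored weights
phase, intensity = optical_dot_product(x, w)
# The accumulated phase encodes the exact dot product: no IR drop,
# because summation happens in the optical domain, not on a current line.
print(round(phase / 1e-3, 3))   # recovers x . w = 0.6
```

This is why array size decouples from wire resistance: adding more cells lengthens the phase accumulation path rather than loading a shared bit line.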
In the Age of Information Overload, How Do You Truly Understand LLMs? Start with 50 Interview Questions Shared by MIT
机器之心· 2025-06-18 06:09
Core Insights
- The article discusses the rapid evolution and widespread adoption of Large Language Models (LLMs) in under a decade, enabling millions of people worldwide to perform creative and analytical tasks through natural language [2][3].

Group 1: LLM Development and Mechanisms
- LLMs have grown from basic language models into advanced intelligent agents capable of executing tasks autonomously, presenting both opportunities and challenges [2].
- Tokenization is a foundational step in LLMs: it breaks text into smaller units (tokens) for efficient processing, improving computational speed and model effectiveness [7][9].
- The attention mechanism in Transformer models lets LLMs assign different weights to different tokens, improving contextual understanding [10][12].
- The context window defines how many tokens an LLM can process at once, which affects its ability to generate coherent output [13].
- Sequence-to-sequence models map input sequences to output sequences, applicable to tasks such as machine translation and chatbots [15].
- Embeddings represent tokens in a continuous vector space that captures semantic features, and are often initialized from pre-trained models [17].
- LLMs handle out-of-vocabulary words through subword tokenization, preserving effective language understanding [19].

Group 2: Training and Fine-tuning Techniques
- LoRA and QLoRA are fine-tuning methods that adapt LLMs efficiently with minimal memory requirements, making them suitable for resource-constrained environments [34].
- Techniques to prevent catastrophic forgetting during fine-tuning include rehearsal and elastic weight consolidation, which help LLMs retain prior knowledge [37][43].
- Model distillation lets smaller models replicate the behavior of larger ones, enabling deployment on devices with limited resources [38].
- Overfitting can be mitigated through methods such as rehearsal and modular architectures, ensuring robust generalization to unseen data [40][41].

Group 3: Output Generation and Evaluation
- Beam search improves text generation by tracking multiple candidate sequences, yielding more coherent output than greedy decoding [51].
- The temperature setting controls the randomness of token selection during generation, balancing predictability and creativity [53].
- Prompt engineering is essential for getting the most out of an LLM: well-specified prompts yield more relevant outputs [56].
- Retrieval-Augmented Generation (RAG) improves answer accuracy by combining retrieval of relevant documents with generation [58].

Group 4: Challenges and Ethical Considerations
- LLMs face deployment challenges including high computational demands, potential biases, and issues with interpretability and privacy [116][120].
- Addressing bias in LLM outputs involves improving data quality, strengthening reasoning capabilities, and refining training methodology [113].
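Of the mechanisms listed above, temperature is the easiest to make concrete: dividing the logits by T before the softmax sharpens the distribution when T < 1 and flattens it when T > 1, trading determinism for diversity. A minimal sketch (illustrative only; real decoders combine this with sampling strategies such as top-k or nucleus sampling):

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Scale logits by 1/T before softmax: T < 1 sharpens the distribution
    (more deterministic), T > 1 flattens it (more random)."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()                  # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = [2.0, 1.0, 0.1]          # toy next-token scores
cold = softmax_with_temperature(logits, temperature=0.5)
hot = softmax_with_temperature(logits, temperature=2.0)
# Temperature never changes the ranking, only the sharpness of the
# distribution: low T concentrates probability mass on the top token.
print(cold.argmax() == hot.argmax())  # True
print(cold[0] > hot[0])               # True
```

At T → 0 this approaches greedy decoding; very high T approaches uniform sampling over the vocabulary.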
New Harvard Paper Reveals How Transformer Models and the Human Brain "Struggle in Sync": Does AI Also Hesitate and Second-Guess Itself?
36Kr· 2025-05-12 00:22
Core Insights
- A recent study from researchers at Harvard University, Brown University, and the University of Tübingen explores the similarities between the processing dynamics of Transformer models and real-time human cognition [1][2].

Group 1: Research Objectives
- The study aims to investigate the internal processing of AI models, particularly how it compares to human cognitive processes, rather than focusing only on final outputs [2][4].
- The authors analyze the processing dynamics at each layer of the Transformer model to see whether they align with real-time information processing in the human brain [4][24].

Group 2: Methodology and Findings
- The researchers recorded the outputs and changes at each layer of the Transformer model, introducing a series of "processing load" metrics to track how confidence in an answer evolves through the layers [7][24].
- The study found that both AI and humans exhibit similar patterns of hesitation and correction when faced with challenging questions, suggesting a shared cognitive dynamic [11][18][23].

Group 3: Specific Examples
- In the "capital killer question," both AI and humans initially leaned toward an incorrect answer (e.g., Chicago) before correcting themselves to the right one (Springfield) [13][15].
- In animal classification tasks, both AI and humans tended to misclassify at first (e.g., treating whales as fish) before arriving at the correct classification [18][19].
- In logical reasoning tasks, both AI and humans can be misled by common misconceptions, following similar paths of confusion before reaching the correct conclusion [21][24].

Group 4: Implications
- The findings suggest that understanding the internal "thought processes" of AI could shed light on human cognitive challenges and help guide experimental design in cognitive science [24].
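The layer-by-layer confidence readout described above resembles the well-known "logit lens" idea: project each layer's hidden state through the output head and watch the probability assigned to the eventual answer evolve. This is a toy NumPy sketch of that readout with fabricated hidden states that drift toward the answer direction, not the paper's actual models or metrics.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def layerwise_confidence(hidden_states, unembed, answer_id):
    """Project each layer's hidden state through a shared output head and
    track the probability of the eventual answer token at every layer."""
    return [softmax(unembed @ h)[answer_id] for h in hidden_states]

rng = np.random.default_rng(2)
d, vocab, n_layers = 16, 50, 6
unembed = rng.standard_normal((vocab, d))        # toy output head
# Fabricated trajectory: hidden states drifting toward a direction that
# favors token 7, mimicking confidence building up across layers.
target = unembed[7] / np.linalg.norm(unembed[7])
hidden_states = [rng.standard_normal(d) * 0.1 + alpha * 4.0 * target
                 for alpha in np.linspace(0.0, 1.0, n_layers)]
trace = layerwise_confidence(hidden_states, unembed, answer_id=7)
print(f"P(answer): layer 0 = {trace[0]:.3f}, last layer = {trace[-1]:.3f}")
```

A "hesitation" like the Chicago-to-Springfield example would show up in such a trace as a competing token's probability peaking in the middle layers before the correct answer takes over.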
Why Are Today's AI Product Managers All Losing Money?
36Kr· 2025-05-06 01:50
These days, nearly every AI product manager I know is iterating on features of existing AI products; very few are building an AI product from zero to one.

A while ago I wrote an article arguing that AI product managers essentially work within just two product frameworks: one where users come to the AI, and one where the AI comes to the users.

The biggest difference between the two is that in the former, after users register and log in, the core functionality is not AI; in the latter, once users register and log in, basic operations are completed on top of the AI model's capabilities, so users never need to care where the AI feature's entry point is.

Regrettably, whichever framework they work in, AI product managers today are losing money.

There are also non-Transformer models in China. For example, I recently came across a team that worked on search in China's early days; they had arrived at the Transformer themselves, but because of its hallucination problem and high training cost, they turned to a different model, which they call the Yan model. Its architecture requires remarkably few resources, making it well suited to running on mobile phones and other endpoint devices.

The reason I say the Transformer is not AI's best architecture is that, when it comes to solving large-model hallucination, the common approach of using reinforcement learning (RL) feedback never reaches 100%, and is in fact the wrong path. As AI pioneer Yann LeCun has put it: "It's like endlessly repainting a worn-out car: we only polish the surface while ignoring what's inside, and that will never fix it."

Based on this, ...
In Depth | NVIDIA's Jensen Huang: The GPU Is a Time Machine That Lets People See the Future; Over the Next Decade, AI Will Surpass Humans in Some Fields While Empowering Them
Z Potentials· 2025-03-01 03:53
Core Insights
- NVIDIA has rapidly evolved into one of the world's most valuable companies thanks to its pioneering role in transforming computing through innovative chip and software design, particularly in the AI era [2][3].

Group 1: Historical Context
- NVIDIA's inception was driven by the observation that a small portion of a program's code can handle the majority of its processing through parallel execution, which led to the development of the first modern GPU [3][4].
- The choice to focus on video games was strategic: the gaming market was identified as both a driver of technological advancement and a significant entertainment market [5][6].

Group 2: Technological Innovations
- The introduction of CUDA allowed programmers to use familiar programming languages to harness GPU power, greatly broadening access to parallel processing [7][9].
- The success of AlexNet in 2012 marked a pivotal moment in AI, demonstrating the potential of GPUs for training deep learning models and initiating a profound transformation of the AI landscape [11][12].

Group 3: Current Developments
- Major breakthroughs in computer vision, speech recognition, and language understanding have been achieved in recent years, showcasing the rapid advance of AI capabilities [14][15].
- NVIDIA is focusing on applying AI in fields including digital biology, climate science, and robotics, indicating a shift toward practical applications of the technology [21][38].

Group 4: Future Vision
- The future of automation is expected to encompass everything that moves, with robots and autonomous systems becoming commonplace in daily life [26][27].
- NVIDIA's ongoing projects, such as Omniverse and Cosmos, aim to build advanced generative systems that will significantly impact robotics and physical systems [37][38].

Group 5: Energy Efficiency and Limitations
- The company emphasizes energy efficiency in computing, having achieved a remarkable 10,000-fold increase in energy efficiency for AI computation since 2016 [32][33].
- Current physical limits of computing are acknowledged, with the focus on improving energy efficiency to extend computational capability [31][32].