Transformer Models
Joe Tsai Reviews Alibaba's AI Journey: Starting "Early" Doesn't Mean Leading
36Kr · 2026-02-07 02:22
Core Insights
- Alibaba Chairman Joe Tsai acknowledged that the company began working on Transformer models in 2019 but did not allocate sufficient resources to their development until the launch of Tongyi Qianwen in 2023, marking its significant entry into the AI race [1][5][24]

Group 1: Adoption
- Tsai's first key point is that AI must be used in practical scenarios to generate real value, not merely developed as models [6][7]
- The Tongyi App is central to Alibaba's AI strategy, serving not only as a user interface but also as a test of the AI's capabilities in real-world applications [8][11]
- Distinctive features of the Chinese market, such as lower acceptance of enterprise-software payment models than in the U.S., require alternative paths for AI adoption, making the Tongyi App a vital attempt to ensure real usage of the models [9][10]

Group 2: Scale
- Tsai noted that AI investment is shifting its focus from training to inference, with major tech companies increasing annual capital expenditures from $60-80 billion to $120-150 billion [12]
- Inference is identified as the main battleground for AI costs, since it is a daily requirement for users and businesses, unlike training, which occurs far less frequently [13][14]
- The ability to handle high concurrency and stay stable under load is crucial for scaling AI models; Alibaba chose to deploy its models on its own cloud infrastructure to control performance and throughput [15][16]

Group 3: Open Source
- Tsai frames open source as a practical choice rather than an idealistic one, driven by the commercial landscape and market conditions in China [17][18]
- The primary value of open source is not cost but sovereignty: companies and developers gain full control over their models [18][20]
- Alibaba's strategy is to open-source Tongyi Qianwen while encouraging users to run training and inference on Alibaba Cloud, creating a commercial loop in which infrastructure usage generates revenue [22][23]
Harvard Dropouts Land $500 Million in Funding: Not Building GPUs, but Still "Bypassing" Nvidia
是说芯语· 2026-01-15 23:37
Core Insights
- Etched, an AI chip company founded by Harvard dropouts, has raised nearly $500 million in a new funding round, reaching a $5 billion valuation and total funding of close to $1 billion [1][12]
- The company aims to optimize the cost-performance ratio of AI computing by running Transformer models more efficiently, rather than competing head-on with Nvidia's general-purpose GPUs [1][4]

Market Context
- Nvidia dominates the GPU market, with projected data center sales exceeding $500 billion by the end of 2026 [3]
- Etched's analysis indicates that computational density has improved by only about 15% over the past few years, highlighting the need for more efficient solutions [3]

Product Overview
- Etched has developed a custom chip named Sohu, designed specifically for the Transformer architecture and claimed to be the "fastest AI chip ever" [3][10]
- Under specific testing conditions, Sohu can process over 500,000 tokens per second running the Llama 70B model, outperforming Nvidia's Blackwell GB200 GPU by an order of magnitude [3][4]

Competitive Advantage
- A server of eight Sohu chips can reportedly replace 160 H100 GPUs, offering a more economical, efficient, and environmentally friendly option for enterprises that need specialized chips [5]
- Sohu's design focuses on cutting energy consumption while running Transformer models more efficiently, distinguishing it from general-purpose GPUs [5][10]

Financial Implications
- Training costs for AI models exceed $1 billion, and inference spending could surpass $10 billion; even a 1% performance improvement can justify a custom chip project costing $50-100 million [5][7]

Future Prospects
- Etched's chip is manufactured on TSMC's 4nm process and integrated with HBM memory and server hardware to support production [10]
- The company plans to expand its technology beyond text generation to image and video generation, as well as protein folding simulations [16]

Industry Landscape
- Other companies, such as Meta and Amazon, are also developing specialized AI chips, but Etched focuses solely on Transformer models, avoiding unnecessary hardware components and software overhead [10][17]
- Etched's success hinges on the continued relevance of Transformer models; a shift away from this architecture would force a reevaluation of its strategy [18]
Is GPT Imitating Humans? Nature Finds the Brain Was the First Transformer
36Kr · 2025-12-11 10:48
[Introduction] We think of language as grammar, rules, and structure. But a new Nature study tears away that illusion: GPT's layer hierarchy turns out to match the "temporal imprint" inside the human brain. As shallow, middle, and deep layers light up in the brain one after another, we see for the first time that understanding language may never have been parsing at all, but prediction.

We have long been convinced that the human brain understands language through a rigorous system of rules, grammar, and structural analysis, complex and unlike anything else. That has been the "consensus" for decades. But a disruptive study recently published in Nature Communications has turned this long-held belief upside down.

Paper: https://www.nature.com/articles/s41467-025-65518-0

The researchers had participants listen to a 30-minute story while millisecond-resolution EEG captured the brain's response to each word. They then fed the same story text into large language models such as GPT-2 and Llama-2, extracting each layer's internal representation of the text. A startling result emerged: GPT's seemingly cold layer hierarchy found a precise temporal correspondence in the human brain. We used to assume GPT was imitating humans, but this experiment offers a stunning hint: perhaps our brains are naturally built like "GPT". GPT's structure, in the ...
NeurIPS 2025 | DePass: Unified Feature Attribution via Single-Forward-Pass Decomposition
机器之心· 2025-12-01 04:08
Core Viewpoint
- The article introduces DePass, a new unified feature attribution framework that aims to enhance the interpretability of large language models (LLMs) by precisely attributing model outputs to internal computations [3][11]

Group 1: Introduction of DePass
- DePass is a novel framework developed by a research team from Tsinghua University and Shanghai AI Lab, designed to address the shortcomings of existing attribution methods, which are often computationally expensive and lack a unified analysis framework [3][6]
- The framework decomposes the hidden states of the forward pass into additive components, enabling precise attribution of model behavior without modifying the model structure [7][11]

Group 2: Implementation Details
- In the attention module, DePass freezes the attention scores and applies the linear transformations to the decomposed hidden states, accurately distributing the information flow [8]
- For the MLP module, it treats the neurons as a key-value store, effectively partitioning the contributions of different components to the same token [9]

Group 3: Experimental Validation
- DePass has been validated through experiments demonstrating its effectiveness in token-level, model-component-level, and subspace-level attribution tasks [11][13]
- In token-level experiments, removing the most critical tokens identified by DePass significantly decreased model output probabilities, showing that it captures the essential evidence driving predictions [11][14]

Group 4: Comparison with Existing Methods
- Existing attribution methods, such as noise ablation and gradient-based approaches, struggle to provide fine-grained explanations and often incur high computational costs [12]
- DePass outperforms traditional importance metrics in identifying significant components, with higher sensitivity and completeness in its attribution results [15]

Group 5: Applications and Future Potential
- DePass can track the contributions of specific input tokens to particular semantic subspaces, improving the model's controllability and interpretability [13][19]
- The framework is expected to serve as a general-purpose tool in mechanistic interpretability research, facilitating exploration across various tasks and models [23]
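The additive-decomposition idea can be sketched in a few lines: once attention scores are frozen, the map from value vectors to the attention output is linear, so per-source-token contributions sum exactly to the output. The sketch below is an illustrative toy of that linearity argument, not DePass's actual code; all names and numbers are ours:

```python
# Toy sketch: with attention scores frozen, the output is a weighted sum of
# value vectors, so per-token contributions add up to the output exactly.

def attention_output(scores, values):
    """Output at one query position: weighted sum of value vectors."""
    dim = len(values[0])
    out = [0.0] * dim
    for a, v in zip(scores, values):
        for i in range(dim):
            out[i] += a * v[i]
    return out

def decomposed_contributions(scores, values):
    """Per-source-token contribution vectors. Because the frozen-score map
    values -> output is linear, these components reconstruct the output."""
    return [[a * vi for vi in v] for a, v in zip(scores, values)]

scores = [0.7, 0.2, 0.1]                       # frozen attention weights
values = [[1.0, 0.0], [0.0, 2.0], [4.0, 4.0]]  # value vector per source token

out = attention_output(scores, values)
parts = decomposed_contributions(scores, values)
recon = [sum(p[i] for p in parts) for i in range(2)]
print(out, recon)  # the per-token parts sum back to the full output
```

The same argument is what lets an additive decomposition pass through frozen attention without altering the model's computation.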
Injecting Long-Term Memory into Transformers: The Memo Framework Tackles a Core Embodied-AI Challenge by "Learning to Summarize"
机器人大讲堂· 2025-10-29 10:03
Core Insights
- The article discusses the limitations of Transformer models in handling long-term memory tasks and introduces Memo, a new architecture designed to improve memory efficiency in long-sequence reinforcement learning tasks [1][3][18]

Group 1: Memo Framework
- Memo mimics human note-taking: the model autonomously generates and stores summaries of past experiences, enabling efficient retrieval of long-term memory with minimal memory overhead [3][5]
- The framework processes long input sequences in segments and generates a fixed number of optimized summary tokens at the end of each segment [4][5]

Group 2: Technical Implementation
- Memo employs a special attention masking mechanism so that the model accesses past information only through summary tokens, creating a deliberate information bottleneck [6]
- It uses flexible positional encoding to help the model understand the temporal position of observations and summaries, which is crucial for causal relationships [6]
- Randomizing segment lengths during training improves the model's adaptability to varying task rhythms [6]

Group 3: Experimental Validation
- Memo was tested in two embodied intelligence scenarios, the ExtObjNav task and the Dark-Key-To-Door task, against baselines including the Full Context Transformer (FCT) and the Recurrent Memory Transformer (RMT) [7][11]
- In ExtObjNav, Memo performed best, using 8 times fewer context tokens while maintaining strong reasoning beyond the training sequence length [9]
- In Dark-Key-To-Door, Memo consistently remembered the locations of the key and the door, while FCT's performance declined sharply after a certain number of steps, highlighting the limits of full-context models [11]

Group 4: Key Findings from Ablation Studies
- Memo's cumulative memory mechanism outperforms fixed-memory models, akin to accumulating wisdom rather than relying only on recent experience [14]
- Long-range gradient propagation is essential for effective memory use; limiting gradients to short-term memory significantly degrades performance [17]
- A summary length of 32 tokens best balances information compression and retention; excessive summary tokens introduce redundancy and noise [17]

Group 5: Conclusion and Future Directions
- Memo is a significant step toward more efficient, more intelligent long-term reasoning in AI, letting models autonomously manage their attention and memory [18]
- The memory mechanism has broad applications, from autonomous navigation robots to personalized systems that understand long-term user preferences [18]
- Future work will focus on the adaptability and interpretability of memory mechanisms, and on balancing memory stability with flexibility [18]
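The summary-token bottleneck described above can be illustrated with a toy attention mask: positions attend causally within their own segment, and reach earlier segments only through their summary tokens. This is our own minimal sketch of the idea, not the paper's implementation; the token kinds and function names are hypothetical:

```python
# Toy Memo-style attention mask (illustrative assumption, not the paper's
# code): causal attention inside a segment, and access to earlier segments
# only via their summary tokens, never their raw observations.

def memo_mask(segments):
    """segments: list of lists of token kinds, 'obs' or 'sum'.
    Returns mask[i][j] = True iff position i may attend to position j."""
    flat = [(seg_id, kind) for seg_id, seg in enumerate(segments) for kind in seg]
    n = len(flat)
    mask = [[False] * n for _ in range(n)]
    for i, (si, _) in enumerate(flat):
        for j, (sj, kj) in enumerate(flat):
            if j > i:
                continue              # causal: never attend to the future
            if sj == si:
                mask[i][j] = True     # same segment: normal causal attention
            elif kj == 'sum':
                mask[i][j] = True     # earlier segments: summaries only
    return mask

# Two segments of 2 observations each, each closed by 1 summary token.
segments = [['obs', 'obs', 'sum'], ['obs', 'obs', 'sum']]
mask = memo_mask(segments)
# Position 3 (first obs of segment 2) sees the segment-1 summary (index 2)
# but not segment-1 raw observations (indices 0 and 1).
print(mask[3][2], mask[3][0], mask[3][1])
```

The `False` entries on past raw observations are the "information bottleneck": everything the model keeps from a finished segment must fit into its summary tokens.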
Nature Sub-Journal: Qi Yuan / Cao Fenglei / Xu Licheng's Team at the Shanghai Academy of Artificial Intelligence for Science Develops a New AI Model for Chemical Reaction Performance Prediction and Synthesis Planning
生物世界· 2025-08-24 08:30
Core Viewpoint
- Artificial Intelligence (AI) has significantly transformed the field of precise organic synthesis, showcasing immense potential in predicting reaction performance and synthesis planning through data-driven methods, including machine learning and deep learning [2][3]

Group 1: Research Overview
- A recent study published in Nature Machine Intelligence introduces a unified pre-trained deep learning framework called RXNGraphormer, which integrates Graph Neural Networks (GNN) and Transformer models to address the methodological discrepancies between reaction performance prediction and synthesis planning [3][5]
- The RXNGraphormer framework is designed to collaboratively handle both reaction performance prediction and synthesis planning tasks through a unified pre-training approach [5][7]

Group 2: Performance and Training
- The RXNGraphormer model was trained on 13 million chemical reactions and achieved state-of-the-art (SOTA) performance across eight benchmark datasets in reaction activity/selectivity prediction and forward/retro-synthesis planning, as well as on three external real-world datasets [5][7]
- Notably, the chemical feature embeddings generated by the model can autonomously cluster by reaction type in an unsupervised manner [5]
CICC: A GRU Model Combined with a Self-Attention Mechanism
中金点睛· 2025-07-14 23:39
Core Viewpoint
- The article reviews the evolution and optimization of time series models, focusing on GRU and Transformer architectures, and introduces a new model, AttentionGRU(Res), that combines the strengths of both [1][6][49]

Group 1: Time Series Models Overview
- Time series models such as LSTM, GRU, and Transformer are designed for analyzing and predicting sequential data, addressing long-term dependencies through specialized gating mechanisms [1][8]
- GRU, an optimized variant, improves computational efficiency while retaining long-term memory, making it suitable for real-time prediction scenarios [2][4]
- The Transformer revolutionizes sequence modeling through self-attention and positional encoding, showing clear advantages in analyzing multi-dimensional time series data [2][4]

Group 2: Performance Comparison of Factors
- A systematic test of 159 cross-sectional factors and 158 time series factors showed that while cross-sectional factors generally outperform time series factors, the latter delivered better out-of-sample performance when used in RNN, LSTM, and GRU models [4][21]
- The average ICIR (information coefficient information ratio) of time series factors was higher than that of cross-sectional factors, indicating better predictive performance despite a more dispersed distribution [4][20]
- In terms of returns, cross-sectional factors yielded a long-short excess return of 11%, versus only 1% for time series factors, highlighting the gap in performance metrics [4][20]

Group 3: Model Optimization Strategies
- The article explores optimization strategies for time series models, including adjusting the propagation direction of the series, optimizing gating structures, and combining overall structures [5][27]
- Tests of BiGRU and GLU models showed limited improvement over the standard GRU, while the Transformer exhibited strong in-sample performance but overfit in out-of-sample tests [5][28]
- The proposed AttentionGRU(Res) model combines a simplified self-attention mechanism with GRU, balancing performance and stability and achieving an annualized excess return of over 30% in the full market [6][40][41]

Group 4: AttentionGRU(Res) Model Performance
- The model achieved a nearly 12.6% annualized excess return over the past five years in rolling samples, indicating robustness across market conditions [6][49]
- Its generalization was validated within the CSI 1000 stock universe, yielding a 10.8% annualized excess return and outperforming traditional GRU and Transformer structures [6][46][49]
- The integration of residual connections and a simplified self-attention structure significantly improved training stability and predictive performance [35][40]
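The general shape described, a simplified self-attention readout over GRU hidden states plus a residual connection, can be sketched in scalar form. CICC's actual architecture, dimensions, and weights are not reproduced in the article, so every parameter and function below is an illustrative assumption:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(h, x, p):
    """Minimal scalar GRU cell (1-D input and hidden) with parameter dict p."""
    z = sigmoid(p['wz'] * x + p['uz'] * h + p['bz'])        # update gate
    r = sigmoid(p['wr'] * x + p['ur'] * h + p['br'])        # reset gate
    h_tilde = math.tanh(p['wh'] * x + p['uh'] * (r * h) + p['bh'])
    return (1 - z) * h + z * h_tilde

def simplified_attention(hs):
    """Dot-product attention of the last hidden state over the sequence
    (scalar case), returning a pooled context value."""
    q = hs[-1]
    scores = [q * h for h in hs]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]                # stable softmax
    total = sum(exps)
    return sum((e / total) * h for e, h in zip(exps, hs))

def attention_gru_res(xs, p):
    """AttentionGRU(Res)-style readout: attention context plus a residual
    connection back to the final GRU hidden state."""
    h, hs = 0.0, []
    for x in xs:
        h = gru_step(h, x, p)
        hs.append(h)
    return hs[-1] + simplified_attention(hs)                # residual add

# Hypothetical toy parameters and inputs, for illustration only.
p = dict(wz=0.5, uz=0.1, bz=0.0, wr=0.5, ur=0.1, br=0.0, wh=1.0, uh=0.5, bh=0.0)
print(attention_gru_res([0.2, -0.1, 0.4], p))
```

The residual term is what the article credits with stabilizing training: even if the attention pooling learns poorly, the final GRU state still passes through unchanged.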
A Novel Ultra-Large-Scale Optoelectronic Hybrid Compute-in-Memory Scheme
半导体行业观察· 2025-06-29 01:51
Core Viewpoint
- The article discusses the development of a novel 2T1M optoelectronic hybrid computing architecture that addresses the IR drop issue in traditional CIM architectures, enabling larger array sizes and improved performance for deep learning applications, particularly in large-scale Transformer models [1][2][9]

Group 1: Architecture Design and Working Principle
- The 2T1M architecture integrates electronic and photonic technologies to mitigate IR drop issues, utilizing a combination of two transistors and a modulator in each storage unit [2]
- The architecture employs FeFETs for multiplication operations, which exhibit low static power consumption and excellent linear characteristics in the subthreshold region [2]
- FeFETs demonstrate a sub-pA cutoff current and are expected to maintain performance over 10 years with over 10 million cycles [2]

Group 2: Optoelectronic Conversion and Lossless Summation
- The architecture utilizes lithium niobate (LN) modulators for converting electrical signals to optical signals, leveraging the Pockels effect to achieve phase shifts in light signals [4][6]
- The integration of multiple 2T1M units in a Mach-Zehnder interferometer allows for effective accumulation of phase shifts, enabling lossless summation of vector-matrix multiplication results [4][6]

Group 3: Transformer Application
- Experimental results indicate that the 2T1M architecture achieves a 93.3% inference accuracy when running the ALBERT model, significantly outperforming traditional CIM architectures, which only achieve 48.3% accuracy under the same conditions [9]
- The 2T1M architecture supports an array size of up to 3750 kb, which is over 150 times larger than traditional CIM architectures limited to 256 kb due to IR drop constraints [9]
- The architecture's power efficiency is reported to be 164 TOPS/W, representing a 37-fold improvement over state-of-the-art traditional CIM architectures, which is crucial for enhancing energy efficiency in edge computing and data centers [9]
In the Age of Information Overload, How Do You Truly "Understand" LLMs? Start with 50 Interview Questions Shared by MIT
机器之心· 2025-06-18 06:09
Core Insights
- The article discusses the rapid evolution and widespread adoption of large language models (LLMs) in under a decade, which now let millions of people worldwide perform creative and analytical tasks through natural language [2][3]

Group 1: LLM Development and Mechanisms
- LLMs have evolved from basic models into advanced intelligent agents that execute tasks autonomously, presenting both opportunities and challenges [2]
- Tokenization, the process of breaking text into smaller units (tokens) for efficient processing, improves computational speed and model effectiveness [7][9]
- The attention mechanism in Transformer models lets LLMs assign varying importance to different tokens, improving contextual understanding [10][12]
- Context windows define how many tokens an LLM can process at once, shaping its ability to generate coherent outputs [13]
- Sequence-to-sequence models convert input sequences into output sequences, applicable to tasks such as machine translation and chatbots [15]
- Embeddings represent tokens in a continuous space, capturing semantic features, and are often initialized from pre-trained models [17]
- LLMs handle out-of-vocabulary words through subword tokenization methods, ensuring effective language understanding [19]

Group 2: Training and Fine-tuning Techniques
- LoRA and QLoRA are fine-tuning methods that adapt LLMs efficiently with minimal memory requirements, making them suitable for resource-constrained environments [34]
- Techniques for preventing catastrophic forgetting during fine-tuning include rehearsal and elastic weight consolidation, which help LLMs retain prior knowledge [37][43]
- Model distillation lets smaller models replicate the performance of larger ones, enabling deployment on devices with limited resources [38]
- Overfitting can be mitigated through methods such as rehearsal and modular architectures, supporting robust generalization to unseen data [40][41]

Group 3: Output Generation and Evaluation
- Beam search improves text generation by keeping multiple candidate sequences, yielding more coherent output than greedy decoding [51]
- The temperature setting controls the randomness of token selection during generation, trading off predictability against creativity [53]
- Prompt engineering is essential for optimizing LLM performance, since well-defined prompts yield more relevant outputs [56]
- Retrieval-Augmented Generation (RAG) improves answer accuracy by combining retrieval of relevant documents with generation [58]

Group 4: Challenges and Ethical Considerations
- Deploying LLMs faces challenges including high computational demands, potential biases, and issues with interpretability and privacy [116][120]
- Addressing bias in LLM outputs involves improving data quality, strengthening reasoning capabilities, and refining training methodologies [113]
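As a concrete illustration of the temperature setting mentioned in the list above, here is a generic softmax-with-temperature sketch, not tied to any particular model's API; the example logits are hypothetical:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; lower temperature sharpens the
    distribution toward the top token, higher temperature flattens it."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                 # hypothetical next-token logits
cold = softmax_with_temperature(logits, 0.5)
hot = softmax_with_temperature(logits, 2.0)
print(cold[0], hot[0])  # the top token's probability grows as temperature drops
```

At very low temperature the distribution approaches greedy decoding; at high temperature it approaches uniform sampling, which is the predictability-versus-creativity trade-off the question refers to.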
New Harvard Paper Reveals How Transformer Models and the Human Brain "Struggle in Sync": Does AI Also Hesitate and Second-Guess?
36Kr · 2025-05-12 00:22
Core Insights
- A recent study by researchers at Harvard University, Brown University, and the University of Tübingen explores the similarities between the processing dynamics of Transformer models and real-time human cognition [1][2]

Group 1: Research Objectives
- The study investigates the internal processing of AI models and how it compares with human cognitive processes, rather than focusing only on final outputs [2][4]
- The authors analyze the processing dynamics at each layer of a Transformer to see whether they align with real-time information processing in the human brain [4][24]

Group 2: Methodology and Findings
- The researchers recorded the outputs and changes at each layer of the Transformer, introducing a set of "processing load" metrics to track how confidence in an answer evolves through the layers [7][24]
- Both AI and humans exhibit similar patterns of hesitation and correction when faced with challenging questions, suggesting a shared processing profile [11][18][23]

Group 3: Specific Examples
- In the "capital trick question," both AI and humans initially leaned toward an incorrect answer (e.g., Chicago) before correcting to the right one (Springfield) [13][15]
- In animal classification tasks, both tended to misclassify at first (e.g., treating whales as fish) before arriving at the correct category [18][19]
- In logical reasoning tasks, both can be misled by common misconceptions, following similar paths of confusion before reaching the correct conclusion [21][24]

Group 4: Implications
- The findings suggest that understanding the internal "thought processes" of AI could provide insights into human cognitive challenges and help guide experimental design in cognitive science [24]