Energy-Based Transformers Arrive: Surpassing Mainstream Models Across the Board by 35%
量子位·2025-07-08 07:30

Core Viewpoint
- The article discusses the Energy-Based Transformers (EBT) architecture, introduced by a team from the University of Virginia, which surpasses the Transformer++ model across multiple dimensions, including data, parameters, computation, and model depth, through a novel energy mechanism [1][3][28].

Summary by Sections

EBT Architecture and Performance
- EBT achieves approximately 35% improvement over Transformer++ across dimensions such as data volume, batch size, parameter count, computation, and model depth [3].
- At inference time, EBT delivers a 29% performance gain over Transformer++ [7].
- EBT is designed to emulate human-like thinking by minimizing energy through gradient descent, which lets the model decide the number of "thinking steps" dynamically; a hedged sketch of such an inference loop appears after this summary [13][14].

Energy-Based Models (EBM)
- EBT builds on Energy-Based Models (EBM), which assign a scalar value to each input configuration through an energy function; the standard formulation is sketched after this summary [15][16].
- Lower energy indicates higher compatibility (and higher probability) among the input variables, while higher energy indicates lower compatibility [17][18].
- Large-scale training of EBMs remains an open problem, with two primary training approaches in use: contrastive methods and regularization methods [19][20].

Training and Scalability
- The research team recast EBM learning as an optimization problem, avoiding the curse of dimensionality and enabling scalable learning; a hedged training sketch also follows this summary [22].
- EBT includes two variants: bidirectional EBT, which is simpler to implement, and autoregressive EBT, which is more complex due to information-leakage issues [26].

Comparative Analysis
- EBT consistently outperforms Transformer++ across six dimensions, becoming the first model to achieve such multi-dimensional superiority without changing the tokenizer [27][28].
- As training time increases, EBT's thinking capability improves, with performance gains rising from 4%-8% to 10%-14% [28].
- EBT outperforms diffusion models on image denoising while reducing the required forward computation by 99% [32].

Implications and Future Directions
- EBT introduces a new approach to implementing System 2 thinking through an energy-based optimization mechanism, demonstrating strong scalability and generalization capabilities [34].
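For reference, the energy-to-probability relationship described in the EBM section is usually written in the Gibbs/Boltzmann form. The following is standard EBM notation rather than anything specific to the EBT paper: E_θ is the learned energy function, Z(θ) the (generally intractable) normalizing constant, and a prediction is obtained by minimizing the energy of a candidate y given a context x.

```latex
% Gibbs/Boltzmann form: lower energy E_theta(x) means higher probability
p_\theta(x) = \frac{\exp\!\big(-E_\theta(x)\big)}{Z(\theta)},
\qquad
Z(\theta) = \int \exp\!\big(-E_\theta(x)\big)\,dx

% Conditional prediction as energy minimization over candidates y
\hat{y} = \arg\min_{y} \, E_\theta(x, y)
```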
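To make the "thinking as energy minimization" idea concrete, below is a minimal, hypothetical PyTorch-style sketch of the inference loop described above: the prediction is treated as a variable and refined by gradient descent on the energy, with a simple convergence test standing in for dynamically chosen thinking steps. Names such as `energy_model`, `max_steps`, and `tol` are illustrative assumptions, not the paper's API.

```python
import torch

def think(energy_model, context, y_init, max_steps=16, lr=0.1, tol=1e-3):
    """Refine a candidate prediction by gradient descent on a learned energy.

    Hypothetical sketch of energy-minimization inference: the candidate `y`
    is updated until the energy stops improving, so the number of "thinking
    steps" varies per input instead of being fixed in advance.
    """
    y = y_init.detach().clone().requires_grad_(True)
    prev_energy = float("inf")
    steps_used = 0
    for _ in range(max_steps):
        energy = energy_model(context, y)        # scalar: lower = more compatible
        grad, = torch.autograd.grad(energy, y)   # d(energy)/d(prediction)
        with torch.no_grad():
            y -= lr * grad                       # one "thinking" step
        steps_used += 1
        if prev_energy - energy.item() < tol:    # stop once improvement is tiny
            break
        prev_energy = energy.item()
    return y.detach(), steps_used                # final prediction and steps taken
```

Because the loop length depends on a convergence test rather than a fixed count, harder inputs can consume more steps, which matches the article's description of dynamically determined thinking steps.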
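On recasting EBM learning as an optimization problem: one common way to make such training tractable is to unroll a few differentiable inner descent steps and supervise the final prediction, which avoids estimating the partition function Z(θ) altogether. The sketch below is an assumption about this general recipe, not the paper's exact objective; `energy_model`, `k_steps`, `inner_lr`, and the MSE loss (which presumes a continuous target such as an embedding) are all illustrative.

```python
import torch
import torch.nn.functional as F

def train_step(energy_model, optimizer, context, target, k_steps=4, inner_lr=0.1):
    """One hypothetical training step that unrolls the inner energy descent.

    Sketch of "learning as optimization": the prediction is refined for a
    fixed number of differentiable gradient steps, and the loss on the final
    prediction is backpropagated through those steps into the energy
    function's parameters.
    """
    y = torch.zeros_like(target, requires_grad=True)             # initial guess
    for _ in range(k_steps):
        energy = energy_model(context, y)                        # scalar energy
        grad, = torch.autograd.grad(energy, y, create_graph=True)
        y = y - inner_lr * grad                                  # differentiable update
    loss = F.mse_loss(y, target)                                 # supervise final guess
    optimizer.zero_grad()
    loss.backward()                                              # through the unrolled loop
    optimizer.step()
    return loss.item()
```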
