One brain for all modalities: Baidu publishes the ERNIE 5.0 technical report
量子位 (QbitAI) · 2026-02-10 05:33

Core Insights

The article covers the release of the technical report for Baidu's ERNIE 5.0 model, highlighting its architecture and performance [1][3].

Group 1: Model Architecture

- ERNIE 5.0 uses an ultra-sparse Mixture-of-Experts (MoE) architecture with a parameter count reaching the trillions, of which fewer than 3% are activated during inference, making it the first publicly available unified autoregressive model at this scale (see the routing sketch after this summary) [3].
- The model achieves native autoregressive unification across four modalities without relying on "splicing" separate components together: all modalities operate within the same Transformer backbone from the ground up (see the shared-backbone sketch below) [4].
- ERNIE 5.0 employs a modality-agnostic expert routing mechanism, breaking down barriers between data modalities and eliminating the need for pre-labeled data [7].

Group 2: Expert Pool and Specialization

- A shared expert pool lets data from all modalities flow freely through one massive parameter network [8].
- The model exhibits emergent specialization: experts autonomously take on roles such as visual expert or text-logic specialist without any predefined assignments [12][13].
- This implicit collaboration strengthens multimodal understanding and naturally extends the model's capabilities [14].

Group 3: Training Paradigm

- ERNIE 5.0 introduces a flexible training paradigm in which a single pre-training run yields multiple models, saving substantial time and compute [15].
- The model incorporates an Elastic Depth mechanism that randomly skips Transformer layers during training, so shallower sub-networks learn to compute on their own (see the elastic-depth sketch below) [17].
- It also supports dynamic adjustment of expert pool capacity and the number of active experts at inference, trading off between the full trillion-parameter model and lightweight deployments [18].

Group 4: Post-Training Optimization

- The model uses a unified multimodal reinforcement learning strategy to jointly optimize logical reasoning, instruction following, and multimodal generation [21].
- Techniques such as an unbiased replay buffer and multi-granularity importance sampling improve training efficiency and stability (see the importance-sampling sketch below) [23].
- Adaptive-hint reinforcement learning guides the model in the initial training phase, easing the transition to independent problem solving [23].

Group 5: Technical Details

- The report details modality-specific handling strategies, including positional-encoding variants for text, spatiotemporal patching for images and videos, and discrete coding for audio signals (see the patching sketch below) [24].
- It also discusses communication optimizations for the underlying PaddlePaddle framework on large clusters and efficient attention mechanisms for long contexts [24].
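To make the ultra-sparse activation concrete, here is a minimal, hypothetical sketch of top-k expert routing in PyTorch. The report (as summarized here) only states that fewer than 3% of parameters are active during inference; `num_experts`, `top_k`, and all layer sizes below are illustrative assumptions, not ERNIE 5.0 internals.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Hypothetical sparse MoE layer: each token activates only top_k of
    num_experts feed-forward experts, so roughly top_k/num_experts of the
    layer's parameters run per token."""
    def __init__(self, d_model=512, num_experts=128, top_k=2):
        super().__init__()
        self.top_k = top_k
        # The router sees only hidden states, never a modality label --
        # one way a modality-agnostic routing scheme can work.
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):  # x: (num_tokens, d_model)
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):  # only top_k experts run per token
            for e in idx[:, slot].unique().tolist():
                mask = idx[:, slot] == e
                w = weights[mask, slot].unsqueeze(-1)  # gate for tokens routed to e
                out[mask] += w * self.experts[e](x[mask])
        return out
```

The inference-time elasticity described in Group 3 can be emulated in this sketch by lowering `top_k` on a trained layer (e.g. `layer.top_k = 1` for a lightweight deployment), trading quality for latency.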
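The "one backbone, no splicing" claim amounts to mapping every modality into a shared embedding space before a single Transformer stack, rather than bolting separate encoders onto a language model. A hedged sketch, with all module names and sizes assumed for illustration:

```python
import torch
import torch.nn as nn

class UnifiedBackbone(nn.Module):
    """Hypothetical unified backbone: text tokens, image/video patches, and
    discrete audio codes are all projected into one embedding space and
    processed by the same Transformer stack. Causal masking, which a real
    autoregressive model needs, is omitted for brevity."""
    def __init__(self, d_model=256, vocab=50000, patch_dim=768, audio_codes=8192):
        super().__init__()
        self.text_emb = nn.Embedding(vocab, d_model)          # text: token ids
        self.patch_proj = nn.Linear(patch_dim, d_model)       # vision: flattened patches
        self.audio_emb = nn.Embedding(audio_codes, d_model)   # audio: codec ids
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)  # one shared stack

    def forward(self, text_ids, patches, audio_ids):
        seq = torch.cat([self.text_emb(text_ids),
                         self.patch_proj(patches),
                         self.audio_emb(audio_ids)], dim=1)   # one token stream
        return self.backbone(seq)  # no modality-specific branches past this point
```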
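Elastic Depth, as summarized here, resembles stochastic depth: whole Transformer layers are skipped at random during training so that shallow prefixes of the network remain usable on their own. A minimal sketch, assuming per-batch Bernoulli skipping (the report's exact schedule is not given in this summary):

```python
import torch
import torch.nn as nn

class ElasticDepthStack(nn.Module):
    """Hypothetical elastic-depth wrapper: each layer is independently
    skipped with probability 1 - keep_prob at train time, so the loss also
    supervises the shallower sub-networks that inference may later use."""
    def __init__(self, layers, keep_prob=0.8):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        self.keep_prob = keep_prob

    def forward(self, x):
        for layer in self.layers:
            if self.training and torch.rand(()) > self.keep_prob:
                continue  # skip this layer entirely for this batch
            x = layer(x)
        return x
```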
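"Multi-granularity importance sampling" is not specified further in this summary; one plausible reading is that importance ratios between the current policy and the stale policy that filled the replay buffer are computed at more than one granularity, e.g. per token and per sequence. A heavily hedged sketch of that interpretation:

```python
import torch

def importance_ratio(logp_new, logp_old, granularity="token"):
    """Hypothetical: importance weights for replayed samples.
    logp_new / logp_old: (batch, seq_len) token log-probs under the
    current policy and the behavior policy that generated the data."""
    if granularity == "sequence":
        # One ratio per trajectory; summed in log space for stability.
        return (logp_new - logp_old).sum(dim=-1, keepdim=True).exp()
    return (logp_new - logp_old).exp()  # per-token pi_new / pi_old
```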
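Spatiotemporal patching for video typically means cutting the (time, height, width) volume into small blocks and flattening each block into one token. A sketch with illustrative patch sizes; the shapes ERNIE 5.0 actually uses are not given in this summary:

```python
import torch

def spatiotemporal_patches(video, pt=2, ph=16, pw=16):
    """Hypothetical patching: split a (T, H, W, C) video into pt x ph x pw
    blocks and flatten each block into one token vector."""
    T, H, W, C = video.shape
    x = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    x = x.permute(0, 2, 4, 1, 3, 5, 6)      # gather each block's elements together
    return x.reshape(-1, pt * ph * pw * C)  # (num_tokens, token_dim)

tokens = spatiotemporal_patches(torch.randn(8, 224, 224, 3))  # -> (784, 1536)
```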
