Core Viewpoint
- The article discusses the potential of a new hierarchical network model, H-Net, which replaces traditional tokenization with a dynamic chunking process, suggesting a shift toward end-to-end language models without tokenizers [3][4][22].

Group 1: Tokenization and Its Limitations
- Tokenization is currently essential for language models, compressing and shortening sequences, but it suffers from poor interpretability and degraded performance on inputs such as Chinese text, code, and DNA sequences [5].
- Until now, no tokenizer-free end-to-end model has surpassed tokenizer-based models under equivalent computational budgets [6].

Group 2: H-Net Model Overview
- H-Net employs a hierarchical architecture that processes data in three stages: fine-grained processing, compression and abstraction, and output restoration [14][16].
- The core of H-Net is the dynamic chunking (DC) mechanism, which learns how to segment data end-to-end using standard differentiable optimization; an illustrative sketch follows this summary [18][19].
- H-Net outperforms strong BPE-tokenization-based Transformer baselines, achieving better data efficiency and robustness, especially in languages where tokenization is less effective [8][10][30].

Group 3: Experimental Results
- In experiments, H-Net showed significant improvements in character-level robustness and learned meaningful, data-dependent chunking strategies without heuristic rules or explicit supervision [9][10].
- H-Net's performance matches that of BPE-tokenized Transformers and can exceed it in certain settings, particularly in zero-shot accuracy across various downstream benchmarks [32][34].
- The model handled Chinese and code notably better than BPE-based Transformers, indicating its scalability and efficiency [36][39].
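The dynamic chunking bullet above describes boundary prediction only at a high level. Below is a minimal, hypothetical Python/NumPy sketch of how such a router might mark chunk boundaries from byte-level hidden states, assuming a simple cosine-similarity criterion between adjacent positions and mean pooling as a stand-in for the learned compression stage; the function name `dynamic_chunk_boundaries`, the threshold, and the pooling are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def dynamic_chunk_boundaries(hidden, threshold=0.5):
    """Hypothetical dynamic-chunking-style router: mark a position as a chunk
    boundary when its representation differs sharply from the previous one.
    Illustrative sketch only, not the H-Net paper's exact formulation."""
    prev, curr = hidden[:-1], hidden[1:]          # adjacent byte-level states
    cos = np.sum(prev * curr, axis=-1) / (
        np.linalg.norm(prev, axis=-1) * np.linalg.norm(curr, axis=-1) + 1e-8
    )
    p_boundary = 0.5 * (1.0 - cos)                # low similarity -> likely new chunk
    mask = np.concatenate([[True], p_boundary > threshold])  # position 0 always opens a chunk
    return mask

# Usage: compress a toy byte-level sequence into chunk vectors (mean pooling
# stands in for the compression/abstraction stage of the hierarchy).
rng = np.random.default_rng(0)
h = rng.standard_normal((16, 32))                 # (seq_len, d_model) encoder outputs
mask = dynamic_chunk_boundaries(h)
chunk_ids = np.cumsum(mask) - 1                   # assign each position to a chunk
chunks = np.stack([h[chunk_ids == i].mean(axis=0) for i in range(chunk_ids[-1] + 1)])
print(h.shape, "->", chunks.shape)                # e.g. (16, 32) -> (num_chunks, 32)
```

In the full model, the resulting chunk representations would feed a coarser main network, and an output stage would restore the original resolution for byte-level prediction.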
Is the Tokenizer-Free Era Really Here? The Mamba Author Publishes Another Disruptive Paper, Challenging the Transformer
机器之心·2025-07-12 04:50