Hugging Face Releases FinePDFs: A 3-Trillion-Token Dataset Built from PDF Documents

Core Insights
- Hugging Face has released FinePDFs, the largest publicly available corpus built entirely from PDFs. It covers 475 million documents in 1,733 languages, totaling approximately 3 trillion tokens [2]
- Compared with traditional HTML-based datasets, FinePDFs is stronger in high-quality, domain-specific content, particularly legal, academic, and technical writing [2]
- The processing pipeline combines Docling for text extraction with RolmOCR for GPU-based OCR [2]

Summary by Sections

Dataset Composition
- English accounts for over 1.1 trillion tokens, while Spanish, German, French, Russian, and Japanese each contribute over 100 billion tokens [3]
- Lower-resource languages are also represented: 978 languages contribute over 1 million tokens each [3]

Performance Evaluation
- Hugging Face trained a 1.67-billion-parameter model on a subset of FinePDFs and reported performance comparable to that of the state-of-the-art HTML dataset SmolLM-3 Web [3]
- Training on a combination of the two datasets significantly improved performance, which suggests that PDFs contribute knowledge complementary to web data [3]

Community Response and Transparency
- The evaluation results have prompted questions from the community about the assessment methodology and scoring [4]
- Hugging Face highlights the dataset's potential for long-context training, since PDF documents are typically much longer than web pages [4]
- The dataset is hosted on the Hugging Face Hub and is available for research and development under an open data license [4]
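
Because the dataset is hosted on the Hugging Face Hub and is pitched at long-context training, a reader might want to stream one language subset and keep only the longer documents. The following is a minimal sketch, not an official recipe. The repository id `HuggingFaceFW/finepdfs`, the config name `eng_Latn`, and the `text` field name are assumptions that should be checked against the dataset card; `long_documents` and `stream_finepdfs` are hypothetical helper names introduced here for illustration.

```python
from typing import Any, Dict, Iterable, Iterator


def long_documents(
    records: Iterable[Dict[str, Any]],
    min_chars: int,
    text_field: str = "text",  # assumed field name; verify on the dataset card
) -> Iterator[Dict[str, Any]]:
    """Yield only the records whose text has at least `min_chars` characters."""
    for record in records:
        # Treat a missing or None text field as an empty string.
        if len(record.get(text_field) or "") >= min_chars:
            yield record


def stream_finepdfs(config: str = "eng_Latn"):
    """Open one language subset as a stream instead of downloading it up front.

    Assumptions: the repo id and the config name below are illustrative.
    """
    from datasets import load_dataset  # third-party: pip install datasets

    return load_dataset(
        "HuggingFaceFW/finepdfs",  # assumed repo id
        name=config,
        split="train",
        streaming=True,
    )


# Offline demonstration on hand-made records (no network access needed).
sample = [
    {"id": "a", "text": "short"},
    {"id": "b", "text": "x" * 5000},
    {"id": "c", "text": None},
]
kept = [r["id"] for r in long_documents(sample, min_chars=1000)]
print(kept)  # ['b']
```

With network access, the same filter can be applied lazily to the stream, for example `itertools.islice(long_documents(stream_finepdfs(), 20_000), 5)` to inspect the first five long documents. Character count is used here only as a cheap stand-in for token count.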