Topping the Hugging Face Papers Trending List: LLMs Rewrite the Rules of Data Preparation

Core Insights
- The article discusses the significant challenges faced by data teams in enterprise systems, particularly the outdated data preparation processes that consume nearly 80% of their time and effort [2][4].
- Traditional data preparation methods rely heavily on manual rules and expert knowledge, leading to inefficiencies and vulnerabilities when data formats change or cross-domain integration is required [8][9].
- The introduction of Large Language Models (LLMs) is transforming data preparation from a "rule-driven" to a "semantic-driven" paradigm, allowing for better understanding and processing of data [8][9].

Data Preparation Tasks
- The article outlines three core tasks in data preparation enhanced by LLMs:
  - Data Cleaning: Involves error detection, format standardization, anomaly repair, and missing value imputation [10].
  - Data Integration: Focuses on entity matching, schema matching, and cross-source alignment [10].
  - Data Enrichment: Includes column type recognition, semantic labeling, and constructing table-level and database-level profiles [10].

Methodological Framework
- A task-centered classification framework is proposed, categorizing LLM-enhanced data preparation into three main methodological approaches:
  - Prompt-based Methods: Utilize structured prompts and contextual examples to guide models in standardization and labeling tasks, emphasizing flexibility and low development costs [12].
  - Retrieval-Augmented and Hybrid Methods: Combine retrieval-augmented generation (RAG) with model fine-tuning and traditional rule systems to balance cost, scale, and stability [12].
  - Agentic Workflow Methods: Position LLMs as coordinating hubs that call external tools and sub-models to build complex data processing workflows, exploring automation and autonomous decision-making [12].
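
To make the prompt-based approach more concrete, here is a minimal sketch of how a structured few-shot prompt for column type recognition might be assembled. It is an illustration only, not the method from the article: the names `FewShotExample`, `build_column_type_prompt`, and `parse_type_answer`, the prompt wording, and the type labels are all assumptions, and no real model is called.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class FewShotExample:
    """One in-context demonstration: sampled column values and their label."""
    values: Sequence[str]
    label: str


def build_column_type_prompt(
    values: Sequence[str],
    candidate_types: Sequence[str],
    examples: Sequence[FewShotExample],
) -> str:
    """Assemble a structured few-shot prompt (hypothetical wording)."""
    lines: List[str] = [
        "Task: assign exactly one semantic type to the column.",
        "Allowed types: " + ", ".join(candidate_types),
        "Answer with the type name only.",
        "",
    ]
    # Each demonstration shows sampled values followed by the expected label.
    for ex in examples:
        lines.append("Column values: " + " | ".join(ex.values))
        lines.append("Type: " + ex.label)
        lines.append("")
    # The query column uses the same layout, leaving the label open.
    lines.append("Column values: " + " | ".join(values))
    lines.append("Type:")
    return "\n".join(lines)


def parse_type_answer(answer: str, candidate_types: Sequence[str]) -> Optional[str]:
    """Accept the model's reply only if it names one of the allowed types."""
    normalized = answer.strip().lower()
    for candidate in candidate_types:
        if candidate.lower() == normalized:
            return candidate
    return None  # Anything outside the allowed set is rejected.
```

Restricting the reply to a closed label set and validating it afterwards is one simple way to limit the consistency problems that free-form model output can cause; the resulting prompt string would be sent to whichever LLM client a team already uses.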

Engineering Insights
- The article highlights several practical observations for engineering teams:
  - Prompt-based methods are suitable for small-scale, high-complexity tasks but struggle with cost and consistency in large-scale scenarios [19].
  - RAG and hybrid systems are becoming mainstream choices: LLMs focus on complex semantic decisions, while other systems handle high-frequency, low-difficulty tasks [19].
  - The agentic approach is still exploratory. It shows potential in complex workflows but faces challenges in stability and evaluability [19].

Evaluation and Benchmarking
- The article provides an overview of representative datasets and benchmarks used to evaluate LLM data preparation capabilities, covering various tasks and modalities [20][21].
- It notes that current benchmarks primarily focus on small to medium-sized structured data, which limits the ability to compare methods in real-world enterprise scenarios [21].

Challenges and Future Directions
- The article identifies key challenges in scaling LLM applications, including high inference costs, stability issues, and the need for a unified evaluation framework [23].
- It suggests that a more realistic approach is to integrate LLMs as "semantic coordinators" within existing data pipelines rather than completely replacing traditional systems [24].
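
The hybrid and "semantic coordinator" ideas above can be sketched as a simple router: deterministic rules handle the frequent, easy cases, and only the leftovers reach an LLM. This is a rough illustration under stated assumptions, not the article's implementation. The function names, the two date rules, the MM/DD/YYYY reading of slash dates, and the `fake_llm` stand-in are all hypothetical; a real system would substitute its own model client.

```python
import re
from typing import Callable, Dict, List, Optional, Sequence, Tuple

ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")
SLASH_DATE = re.compile(r"^(\d{1,2})/(\d{1,2})/(\d{4})$")


def normalize_with_rules(raw: str) -> Optional[str]:
    """Cheap deterministic path: return an ISO date, or None if no rule applies."""
    text = raw.strip()
    if ISO_DATE.match(text):
        return text  # Already in the target shape (value ranges are not checked).
    match = SLASH_DATE.match(text)
    if match:
        # Assumption: slash dates are US-style MM/DD/YYYY.
        month, day, year = match.groups()
        return f"{year}-{int(month):02d}-{int(day):02d}"
    return None


def normalize_column(
    values: Sequence[str],
    llm_normalize: Callable[[str], str],
) -> Tuple[List[str], Dict[str, int]]:
    """Route each value to the rules first, and to the LLM only as a fallback."""
    results: List[str] = []
    stats = {"rules": 0, "llm": 0}
    for value in values:
        fixed = normalize_with_rules(value)
        if fixed is not None:
            stats["rules"] += 1
        else:
            fixed = llm_normalize(value)  # Expensive path for hard cases only.
            stats["llm"] += 1
        results.append(fixed)
    return results, stats


def fake_llm(raw: str) -> str:
    """Stand-in for a real model call, used here only for demonstration."""
    return {"March 5th, 2024": "2024-03-05"}.get(raw, raw)


if __name__ == "__main__":
    cleaned, counts = normalize_column(
        ["2024-01-02", "3/5/2024", "March 5th, 2024"], fake_llm
    )
    print(cleaned)  # ['2024-01-02', '2024-03-05', '2024-03-05']
    print(counts)   # {'rules': 2, 'llm': 1}
```

Tracking how many values took each path gives a rough view of inference cost: in this toy run only one of three values needs the model, which is the kind of split the hybrid approach aims for.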