Feature Engineering

A One-Article Guide to Deep Tabular Data Representation Learning | Nanjing University
QbitAI (量子位) · 2025-06-25 00:33
Core Viewpoint
- The article emphasizes the growing importance of tabular data in AI applications across finance, healthcare, education, recommendation systems, and scientific research [1].

Group 1: Background and Importance of Tabular Data
- Tabular data is fundamentally a structured representation of information, with inherent advantages in organizing and expressing complex data relationships [3].
- Deep learning has driven significant advances in computer vision and natural language processing, which has made applying deep neural networks (DNNs) to tabular data a research hotspot [6].

Group 2: Deep Learning Approaches to Tabular Data
- The research groups deep learning methods for tabular data into three types: specialized methods, transferable methods, and general methods. The progression reflects the evolution of deep learning technology and the growing generalization capability of models [7][19].
- Specialized methods are the earliest and most widely used. They focus on obtaining high-quality representations at the feature level and the sample level [9].
- Transferable methods leverage pre-trained models to improve learning efficiency and reduce reliance on computational resources and data scale [12].
- General methods extend the generalization ability of pre-trained tabular models to heterogeneous downstream tasks without additional fine-tuning [19].

Group 3: Challenges in Tabular Data Learning
- Tabular data presents unique challenges, including feature heterogeneity, the lack of spatial or sequential structure, low-quality and missing data, and the importance of feature engineering [22][23][25][26].
- Class imbalance, present in many tabular datasets, can lead to biased predictions and calls for specific training strategies [27].
- Scaling to large datasets poses additional challenges, particularly as dimensionality increases and the risk of overfitting rises [28].
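
The class-imbalance point above can be made concrete with a small sketch. The article does not say which strategies it has in mind; one common option is to reweight classes by inverse frequency so that minority-class errors count more during training. The function name `balanced_class_weights` and the 90/10 toy labels are assumptions for illustration only. The formula is the same heuristic that scikit-learn uses for `class_weight="balanced"`.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return a weight per class: n_samples / (n_classes * class_count).

    Rare classes receive larger weights, frequent classes smaller ones.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Hypothetical toy labels: 90 negatives and 10 positives.
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
print(weights)
```

The resulting dictionary could be passed to any training routine that accepts per-class weights. Whether reweighting, resampling, or a different loss works best depends on the dataset.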
Group 4: Evaluation and Benchmarking
- The article discusses the importance of robust evaluation methods for tabular models and highlights the need for diverse benchmark datasets that assess performance across different tasks and feature types [36].
- Evaluation metrics for classification tasks include accuracy, AUC, and F1 score, while regression tasks typically use MSE, MAE, and R² [32][33].
- Recent research emphasizes the need for comprehensive benchmarks that include semantically rich datasets to improve the evaluation of tabular models [38][39].
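
The metrics named above (accuracy, AUC, F1, MSE, MAE, R²) can be sketched in plain Python. This is a minimal illustration using the standard textbook definitions, not the evaluation code of any benchmark mentioned in the article. All function names and the toy data are assumptions. AUC is computed here by pairwise comparison of positive and negative scores, which is simple but quadratic in the number of samples.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, scores, positive=1):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return 1 - ss_res / ss_tot


# Hypothetical toy data for illustration.
cls_true = [1, 0, 1, 1, 0, 0]
cls_pred = [1, 0, 0, 1, 0, 1]
cls_scores = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6]
reg_true = [3.0, 5.0, 7.0]
reg_pred = [2.5, 5.0, 8.0]

print(accuracy(cls_true, cls_pred), f1_score(cls_true, cls_pred), roc_auc(cls_true, cls_scores))
print(mse(reg_true, reg_pred), mae(reg_true, reg_pred), r2(reg_true, reg_pred))
```

Accuracy and F1 depend on a decision threshold, while AUC is computed from the raw scores, which is one reason benchmarks often report more than one classification metric.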
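Integrating Multi-Source Plant Transcriptome Data, Shandong University of Technology and Collaborators Build the PlantLncBoost Model, with Cross-Species lncRNA Prediction Accuracy of Up to 96%
36Kr · 2025-06-18 07:44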
Core Insights
- The research team, including Shandong University of Technology and several international institutions, developed the PlantLncBoost model to address the challenges of identifying plant lncRNAs [1][3][24].
- The model achieved an average prediction accuracy of 91.7% across 12 different plant datasets, outperforming existing tools by 18.2% [3][17].
- The study highlights the importance of lncRNAs in plant growth, development, and environmental adaptation, emphasizing their role in regulating flowering time and responding to climate change [1][2].

Model Development
- The PlantLncBoost model incorporates 219 novel sequence descriptors based on mathematical theories such as the Fourier transform and Shannon entropy [3][6].
- The model was trained on a dataset of 24,152 lncRNA sequences from nine angiosperm species, with strict quality control to ensure high reliability [4][7].
- Feature selection used recursive feature elimination (RFE) to identify three core parameters with cross-species discrimination capability [3][11].

Performance Evaluation
- The model was validated on two key test sets: a comprehensive test set covering 20 diverse plant species and a high-confidence experimental validation set [5][19].
- PlantLncBoost reached a sensitivity of 98.42%, a specificity of 94.93%, and an accuracy of 96.63%, significantly surpassing other mainstream models [22][21].
- The model's ROC curve achieved an AUC of 98.35%, indicating its effectiveness in prediction [19][22].

Feature Engineering
- A total of 1,662 features were extracted, including traditional sequence-based metrics and innovative mathematical features, enhancing the model's ability to identify lncRNAs [6][15].
- The model's performance peaked with a lightweight feature set, confirming the effectiveness of using a minimal number of key features [13][15].

Collaborative Efforts
- The research reflects a growing trend of collaboration between academic institutions and enterprises in advancing plant lncRNA research and applications [24][26].
- Innovations in lncRNA research are expected to contribute to sustainable agricultural practices and ecological balance [26][27].
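
As a rough illustration of what an entropy-based sequence descriptor might look like, the sketch below computes the Shannon entropy of a sequence's k-mer distribution. This is an assumption-laden example, not the PlantLncBoost feature definition: the summary only says that the descriptors draw on theories such as Shannon entropy, and the function name `kmer_shannon_entropy`, the default choice of `k`, and the toy sequences are all hypothetical.

```python
import math
from collections import Counter


def kmer_shannon_entropy(sequence, k=3):
    """Shannon entropy (in bits) of the overlapping k-mer distribution.

    Returns 0.0 when the sequence is shorter than k.
    """
    seq = sequence.upper()
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return 0.0
    counts = Counter(kmers)
    total = len(kmers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical toy sequences: a homopolymer has no uncertainty,
# while four equally frequent nucleotides give the 1-mer maximum of 2 bits.
low = kmer_shannon_entropy("AAAAAA", k=1)
high = kmer_shannon_entropy("ACGT", k=1)
print(low, high)
```

A descriptor like this yields one number per sequence and per `k`, so it can sit alongside other tabular features before a selection step such as RFE.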