Feature Engineering
A One-Article Guide to Deep Tabular Data Representation Learning | Nanjing University
量子位 (QbitAI) · 2025-06-25 00:33
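Contributed by Jiang Junpeng, PhD student at Nanjing University. 量子位 | WeChat official account QbitAI

Tabular data is becoming increasingly important in AI applications and is widely used in finance, healthcare, education, recommender systems, and scientific research. Deep neural networks (DNNs), with their strong representation learning ability, have shown remarkable potential for modeling tabular data.

A Nanjing University team gives a systematic introduction to the research field of tabular representation learning, dividing existing methods into three categories by generalization ability: specialized models, transferable models, and general models.

They also compare the strengths and weaknesses of DNNs against the traditional approach, tree-based models, analyze the core challenges in tabular data learning, and discuss extensions such as ensemble learning methods, tabular learning in open environments, and multimodal tabular tasks. Because the performance of methods varies markedly across datasets, the team further examines systematic strategies for dataset collection, evaluation, and analysis, aiming to establish a robust cross-dataset evaluation framework.

Background

Tabular data is essentially a structured way of representing information, with a natural advantage in organizing and expressing complex data relationships. The study focuses on supervised tabular machine learning tasks, chiefly the two common problems of classification and regression. Beyond its structured organization, tabular data usually also exhibits heterogeneity of attribute types, that is, it contains numerical, categorical, or mixed data types, and these ...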
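Integrating multi-source plant transcriptome data, Shandong University of Technology and collaborators build the PlantLncBoost model, with cross-species lncRNA prediction accuracy of up to 96%
36Kr · 2025-06-18 07:44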
Core Insights
- The research team, including Shandong University of Technology and several international institutions, developed the PlantLncBoost model to address the challenges of identifying plant lncRNA [1][3][24]
- The model achieved an average prediction accuracy of 91.7% across 12 different plant datasets, outperforming existing tools by 18.2% [3][17]
- The study highlights the importance of lncRNA in plant growth, development, and environmental adaptation, emphasizing its role in regulating flowering time and responding to climate change [1][2]

Model Development
- The PlantLncBoost model incorporates 219 novel sequence descriptors based on mathematical theories such as Fourier transform and Shannon entropy [3][6]
- The model was trained using a dataset of 24,152 lncRNA sequences from nine angiosperm species, ensuring high reliability through strict quality control [4][7]
- Feature selection involved recursive feature elimination (RFE) to identify three core parameters with cross-species discrimination capability [3][11]

Performance Evaluation
- The model was validated using two key test sets: a comprehensive test set covering 20 diverse plant species and a high-confidence experimental validation set [5][19]
- PlantLncBoost demonstrated superior performance with sensitivity at 98.42%, specificity at 94.93%, and accuracy at 96.63%, significantly surpassing other mainstream models [22][21]
- The model's ROC curve achieved an AUC of 98.35%, indicating its effectiveness in prediction [19][22]

Feature Engineering
- A total of 1,662 features were extracted, including traditional sequence-based metrics and innovative mathematical features, enhancing the model's ability to identify lncRNA [6][15]
- The model's performance peaked with a lightweight feature set, confirming the effectiveness of using a minimal number of key features [13][15]

Collaborative Efforts
- The research reflects a growing trend of collaboration between academic institutions and enterprises in advancing plant lncRNA research and applications [24][26]
- Innovations in lncRNA research are expected to contribute to sustainable agricultural practices and ecological balance [26][27]
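
To make the Shannon entropy descriptors mentioned under Model Development more concrete, the following is a minimal sketch of one way an entropy feature could be computed from a nucleotide sequence. It is an illustration only, not PlantLncBoost's actual feature definition: the function name `kmer_shannon_entropy`, the k-mer formulation, and the toy sequences are all assumptions introduced here.

```python
import math
from collections import Counter


def kmer_shannon_entropy(sequence: str, k: int = 1) -> float:
    """Shannon entropy (in bits) of the k-mer frequency distribution.

    Hypothetical helper for illustration; the summary above does not
    specify how PlantLncBoost defines its entropy-based descriptors.
    """
    seq = sequence.upper()
    if k < 1 or len(seq) < k:
        raise ValueError("k must be >= 1 and no longer than the sequence")

    # Slide a window of width k over the sequence and count each k-mer.
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(kmers)
    total = len(kmers)

    # H = sum over observed k-mers of -p * log2(p).
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


# Toy sequences (made up): four equally frequent bases give the maximum
# single-base entropy of 2 bits, while a homopolymer gives 0 bits.
balanced = kmer_shannon_entropy("ACGT" * 25)
homopolymer = kmer_shannon_entropy("A" * 100)
```

A low value indicates a repetitive, low-complexity sequence and a high value indicates a more evenly mixed composition, which is the kind of signal a classifier could use alongside other sequence features.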