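Nature Biotechnology: Teams led by Fajie Yuan (原发杰) and Xing Chang (常兴) at Westlake University, with collaborators, develop ProTrek to "navigate" the protein universe with natural language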
生物世界 (BioWorld) · 2025-10-03 01:00
Core Insights
- The article discusses the development of ProTrek, a novel trimodal protein language model that integrates amino acid sequences, three-dimensional structures, and natural language descriptions for advanced protein searches [3][9][20].

Group 1: Challenges and Opportunities in Protein Research
- Proteins are essential for life, and understanding the complex relationship between their sequences, structures, and functions is crucial for molecular science and pharmacology [6].
- Traditional tools such as BLAST and Foldseek compare proteins within a single modality (sequence against sequence, or structure against structure), which hinders the discovery of cross-modal relationships between sequences, structures, and functions [6][9].
- Approximately 30% of proteins in the UniProt database remain functionally unannotated because they are only distantly related to known homologs; the article likens them to the "dark matter" of the protein universe [6][9].

Group 2: ProTrek's Innovative Framework
- ProTrek employs a trimodal framework that unifies three core types of protein information: amino acid sequences (1D), three-dimensional structures (spatial), and natural language function descriptions (semantic) [9][20].
- The model uses a bidirectional alignment framework to establish strong correlations across the sequence-structure, structure-function, and function-sequence pairs [9][21].

Group 3: Performance and Experimental Validation
- On standard protein function retrieval benchmarks, ProTrek outperforms leading existing methods such as ProteinDT and ProtST by a factor of 30 to 60 [11][21].
- The model's global representation learning allows it to identify convergently evolved proteins, whose sequences and structures differ substantially but whose functions are similar [11][21].
- In experimental validation, ProTrek discovered new proteins functionally similar to human UDG (uracil-DNA glycosylase); these achieved higher editing efficiency and lower off-target effects than existing tools [15][23].
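
The bidirectional alignment and cross-modal search described under Group 2 can be illustrated with a small sketch. The article does not state ProTrek's training objective, so the code below assumes a CLIP-style symmetric contrastive (InfoNCE) loss applied to each of the three modality pairs, and cosine-similarity search in the resulting shared embedding space. All function names, the temperature value, and the toy embeddings are hypothetical and are not taken from ProTrek's code.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def bidirectional_contrastive_loss(a, b, temperature=0.07):
    """Symmetric InfoNCE loss; row i of `a` and row i of `b` describe the same protein.

    Assumption: a CLIP-style objective, not confirmed by the article.
    """
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature            # (N, N) pairwise similarities
    idx = np.arange(len(a))
    loss_a_to_b = -log_softmax(logits, axis=1)[idx, idx].mean()  # a retrieves b
    loss_b_to_a = -log_softmax(logits, axis=0)[idx, idx].mean()  # b retrieves a
    return 0.5 * (loss_a_to_b + loss_b_to_a)


def trimodal_loss(seq, struct, text, temperature=0.07):
    """Sum the bidirectional loss over the three modality pairs."""
    pairs = [(seq, struct), (struct, text), (text, seq)]
    return sum(bidirectional_contrastive_loss(x, y, temperature) for x, y in pairs)


def retrieve_top_k(query, database, k=3):
    """Return indices of the k database rows most similar to a single query vector."""
    scores = l2_normalize(database) @ l2_normalize(query)
    return np.argsort(-scores)[:k]


# Toy data: 8 proteins with 16-dimensional embeddings per modality (random stand-ins
# for the outputs of real sequence, structure, and text encoders).
rng = np.random.default_rng(0)
seq_emb = rng.normal(size=(8, 16))
struct_emb = seq_emb + 0.01 * rng.normal(size=(8, 16))
text_emb = seq_emb + 0.01 * rng.normal(size=(8, 16))

aligned = trimodal_loss(seq_emb, struct_emb, text_emb)
misaligned = trimodal_loss(seq_emb, struct_emb[::-1], text_emb)
hits = retrieve_top_k(text_emb[2], seq_emb, k=3)  # text query against sequences
```

In this toy setup the loss is lower when the three embeddings of each protein agree than when one modality is shuffled, which is the property a cross-modal search relies on: a text query can then be matched against sequence or structure embeddings by nearest-neighbor lookup.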

Group 4: Implications and Future Prospects
- ProTrek enhances the efficiency and depth of protein research, facilitating large-scale annotation of unknown protein functions and accelerating enzyme discovery and drug design [18][23].
- The model's integration of complex molecular data with intuitive natural language promotes a better understanding of the protein world [18][23].
- ProTrek's capabilities are expected to lead to new scientific discoveries across various fields of protein science [18][23].