The flood of protein structural Big Data is coming. With the belief that biotech researchers deserve powerful analysis engines to overcome the challenge of rapidly increasing computational demands, we are devoted to developing efficient protein structural alignment search algorithms to assist researchers as they push the frontiers of biological sciences and technology. Here, we present SARST2, an algorithm that integrates primary, secondary, and tertiary structural features with evolutionary statistics to perform accurate and rapid alignments. In large-scale benchmarks, SARST2 outperforms state-of-the-art methods in accuracy, while completing AlphaFold Database searches significantly faster and with substantially less memory than BLAST and Foldseek. It employs a filter-and-refine strategy enhanced by machine learning, a diagonal shortcut for word-matching, a weighted contact number-based scoring scheme, and a variable gap penalty based on substitution entropy. SARST2, implemented in Golang as standalone programs available at
SARST2 enables rapid exploration of protein structure space. In minutes, it scans the 214-million-entry AlphaFold Database on a personal computer, revealing homologs with higher accuracy and lower memory/disk usage than leading methods.
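The filter-and-refine strategy named in the abstract can be illustrated with a generic sketch. This is not SARST2's implementation: a plain shared-word count stands in for the machine-learning-enhanced filter, and a uniform match/mismatch score with a fixed linear gap penalty stands in for the weighted contact number-based scoring and the entropy-based variable gap penalty. All names here (Hit, sharedWords, localScore, search), the string encoding of structures, and the parameter values are assumptions made for illustration only.

```go
package main

import (
	"fmt"
	"sort"
)

// Hit is a database entry that passed the filter, with its refined score.
type Hit struct {
	ID    string
	Score int
}

// sharedWords counts the distinct length-k words of query that also occur
// in target. It is the cheap filter step of this sketch.
func sharedWords(query, target string, k int) int {
	if k <= 0 || len(query) < k || len(target) < k {
		return 0
	}
	words := make(map[string]bool)
	for i := 0; i+k <= len(target); i++ {
		words[target[i:i+k]] = true
	}
	seen := make(map[string]bool)
	n := 0
	for i := 0; i+k <= len(query); i++ {
		w := query[i : i+k]
		if words[w] && !seen[w] {
			seen[w] = true
			n++
		}
	}
	return n
}

// localScore returns the Smith-Waterman local alignment score of a and b
// with a linear gap penalty. It is the expensive refine step of this sketch.
func localScore(a, b string, match, mismatch, gap int) int {
	prev := make([]int, len(b)+1)
	cur := make([]int, len(b)+1)
	best := 0
	for i := 1; i <= len(a); i++ {
		cur[0] = 0
		for j := 1; j <= len(b); j++ {
			s := mismatch
			if a[i-1] == b[j-1] {
				s = match
			}
			v := prev[j-1] + s
			if up := prev[j] + gap; up > v {
				v = up
			}
			if left := cur[j-1] + gap; left > v {
				v = left
			}
			if v < 0 {
				v = 0
			}
			cur[j] = v
			if v > best {
				best = v
			}
		}
		prev, cur = cur, prev
	}
	return best
}

// search aligns only the targets that share at least minWords words with
// the query, then ranks the survivors by alignment score (ties by ID).
func search(query string, db map[string]string, k, minWords int) []Hit {
	var hits []Hit
	for id, target := range db {
		if sharedWords(query, target, k) < minWords {
			continue // filtered out: no alignment is computed
		}
		hits = append(hits, Hit{ID: id, Score: localScore(query, target, 2, -1, -2)})
	}
	sort.Slice(hits, func(i, j int) bool {
		if hits[i].Score != hits[j].Score {
			return hits[i].Score > hits[j].Score
		}
		return hits[i].ID < hits[j].ID
	})
	return hits
}

func main() {
	// Toy database of structures encoded as secondary-structure strings.
	db := map[string]string{
		"t1": "HHHHEEEELLHHHH",
		"t2": "EEEEEEEEEEEE",
		"t3": "HHHHEEELLLHHHH",
	}
	for _, h := range search("HHHHEEEELLHHHH", db, 4, 3) {
		fmt.Println(h.ID, h.Score)
	}
}
```

The point of the design is that the dynamic-programming step, which costs time proportional to the product of the two lengths, runs only on the entries that survive the much cheaper filter. In this toy database, t2 shares a single word with the query and is never aligned.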
Details
Warshel, Arieh 2; Lo, Chia-Hua 3; Choke, Chia Yee 4; Li, Yan-Jie 5; Yen, Shih-Chung 6; Yang, Jyun-Yi 4; Weng, Shih-Wen 7
1 Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu, Taiwan (ROR: https://ror.org/00se2k293) (GRID: grid.260539.b) (ISNI: 0000 0001 2059 7017); Department of Biological Science and Technology, National Yang Ming Chiao Tung University, Hsinchu, Taiwan (ROR: https://ror.org/00se2k293) (GRID: grid.260539.b) (ISNI: 0000 0001 2059 7017); Center for Intelligent Drug Systems and Smart Bio-devices (IDS2B), National Yang Ming Chiao Tung University, Hsinchu, Taiwan (ROR: https://ror.org/00se2k293) (GRID: grid.260539.b) (ISNI: 0000 0001 2059 7017); The Center for Bioinformatics Research, National Yang Ming Chiao Tung University, Hsinchu, Taiwan (ROR: https://ror.org/00se2k293) (GRID: grid.260539.b) (ISNI: 0000 0001 2059 7017)
2 Department of Chemistry, University of Southern California, Los Angeles, CA, USA (ROR: https://ror.org/03taz7m60) (GRID: grid.42505.36) (ISNI: 0000 0001 2156 6853)
3 Department of Biological Science and Technology, National Yang Ming Chiao Tung University, Hsinchu, Taiwan (ROR: https://ror.org/00se2k293) (GRID: grid.260539.b) (ISNI: 0000 0001 2059 7017); Institute of Bioinformatics and Structural Biology, National Tsing Hua University, Hsinchu, Taiwan (ROR: https://ror.org/00zdnkx70) (GRID: grid.38348.34) (ISNI: 0000 0004 0532 0580)
4 Department of Biological Science and Technology, National Yang Ming Chiao Tung University, Hsinchu, Taiwan (ROR: https://ror.org/00se2k293) (GRID: grid.260539.b) (ISNI: 0000 0001 2059 7017)
5 Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu, Taiwan (ROR: https://ror.org/00se2k293) (GRID: grid.260539.b) (ISNI: 0000 0001 2059 7017)
6 Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu, Taiwan (ROR: https://ror.org/00se2k293) (GRID: grid.260539.b) (ISNI: 0000 0001 2059 7017); Department of Chemistry, University of Southern California, Los Angeles, CA, USA (ROR: https://ror.org/03taz7m60) (GRID: grid.42505.36) (ISNI: 0000 0001 2156 6853)
7 Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu, Taiwan (ROR: https://ror.org/00se2k293) (GRID: grid.260539.b) (ISNI: 0000 0001 2059 7017); Department of Biological Science and Technology, National Yang Ming Chiao Tung University, Hsinchu, Taiwan (ROR: https://ror.org/00se2k293) (GRID: grid.260539.b) (ISNI: 0000 0001 2059 7017)