Description: |
We present GNNfam, a pipeline for predicting protein families from protein sequences. GNNfam aligns proteins using the pairwise sequence aligner LAST, builds a sparse graph from the alignment scores, and applies graph neural networks (GNNs) to predict protein families. Unlike alignment-free deep learning methods such as DeepFam, GNNfam can control the sparsity of the protein similarity graph to prune uninformative edges. We develop three pruning strategies that improve the prediction accuracy, convergence, and running time of the downstream GNNs. We also demonstrate that semi-supervised GNNs outperform traditional graph clustering-based methods by a large margin. When trained on three labeled sequence datasets from the SCOPe and COG databases, GNNfam achieves more than 90% test accuracy in predicting protein families and performs significantly better than clustering, embedding, and other deep learning methods. GNNfam is available at https://github.com/HipGraph/GNNfam. |