Convolutional Embedded Networks for Population Scale Clustering and Bio-Ancestry Inferencing

Authors: Ratnesh Sahay, Oya Beyan, Dietrich Rebholz-Schuhmann, Stefan Decker, Achille Zappa, Rezaul Karim, Michael Cochez
Contributors: Publica, Artificial intelligence, Network Institute, Artificial Intelligence (section level)
Language: English
Year of publication: 2022
Subject:
FOS: Computer and information sciences
Computer Science - Machine Learning
Computer science
0206 medical engineering
Rand index
Population
Machine Learning (stat.ML)
02 engineering and technology
Machine learning
computer.software_genre
Quantitative Biology - Quantitative Methods
Machine Learning (cs.LG)
Machine Learning
representation learning
Statistics - Machine Learning
Genetics
Cluster Analysis
Humans
1000 Genomes Project
Cluster analysis
education
Quantitative Methods (q-bio.QM)
education.field_of_study
business.industry
Applied Mathematics
genotype clustering
Autoencoder
deep neural networks
FOS: Biological sciences
Scalability
Artificial intelligence
Neural Networks
Computer
business
Population genomics
Feature learning
computer
Classifier (UML)
020602 bioinformatics
Algorithms
Biotechnology
bio-ancestry inference
Source: Karim, M R, Cochez, M, Zappa, A, Sahay, R, Rebholz-Schuhmann, D, Beyan, O & Decker, S 2022, 'Convolutional Embedded Networks for Population Scale Clustering and Bio-Ancestry Inferencing', IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 19, no. 1, pp. 369-382. https://doi.org/10.1109/TCBB.2020.2994649
IEEE/ACM Transactions on Computational Biology and Bioinformatics, 19(1), 369-382. Institute of Electrical and Electronics Engineers Inc.
ISSN: 1545-5963
Description: The study of genetic variants (GVs) can help find correlating population groups, which in turn helps identify cohorts that are predisposed to common diseases and explain differences in disease susceptibility and in how patients react to drugs. Machine learning algorithms are increasingly being applied to identify interacting GVs and to understand their complex phenotypic traits. Because the performance of a learning algorithm depends not only on the size and nature of the data but also on the quality of the underlying representation, deep neural networks (DNNs) are attractive here: they can learn non-linear mappings that transform GV data into representations better suited to clustering and classification than those obtained by manual feature selection. In this paper, we propose convolutional embedded networks, in which we combine two DNN architectures, convolutional embedded clustering (CEC) and a convolutional autoencoder (CAE) classifier, for clustering individuals and for predicting geographic ethnicity based on GVs, respectively. We employed CAE-based representation learning on 95 million GVs from the 1000 Genomes and Simons Genome Diversity projects. Quantitative and qualitative analyses with a focus on accuracy and scalability show that our approach outperforms state-of-the-art approaches such as VariantSpark and ADMIXTURE. In particular, CEC can cluster targeted population groups in 22 hours with an adjusted Rand index of 0.915, a normalized mutual information of 0.92, and a clustering accuracy of 89%. The CAE classifier, in turn, can predict the geographic ethnicity of unknown samples with an F1 score of 0.9004 and a Matthews correlation coefficient (MCC) of 0.8245. To provide interpretations of the predictions, we identify significant biomarkers using gradient boosted trees (GBT) and SHAP. Overall, our approach is transparent, faster than the baseline methods, and scalable from 5% to 100% of the full human genome.
This article is under review in IEEE/ACM Transactions on Computational Biology and Bioinformatics. It is based on a workshop paper discussed at the Extended Semantic Web Conference (ESWC'2017) workshop on Semantic Web Solutions for Large-scale Biomedical Data Analytics (SeWeBMeDA), Slovenia, May 28-29, 2017.
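For readers unfamiliar with the two clustering scores quoted in the description, the adjusted Rand index and normalized mutual information can both be computed from a pair of label assignments. The following is only a minimal pure-Python sketch, not the authors' evaluation code: the function names and the toy labels are hypothetical, and the arithmetic-mean normalization used for NMI is an assumption, since the record does not say which variant the paper uses.

```python
from collections import Counter
from math import comb, log


def adjusted_rand_index(truth, pred):
    """Adjusted Rand index between two labelings of the same samples."""
    total_pairs = comb(len(truth), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two samples: treat as trivially identical
    # Pairs of samples that share a cluster in both, in truth, and in pred.
    same_both = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    same_truth = sum(comb(c, 2) for c in Counter(truth).values())
    same_pred = sum(comb(c, 2) for c in Counter(pred).values())
    expected = same_truth * same_pred / total_pairs  # chance-level agreement
    max_index = (same_truth + same_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster on both sides
    return (same_both - expected) / (max_index - expected)


def normalized_mutual_info(truth, pred):
    """NMI with arithmetic-mean normalization (an assumed convention)."""
    n = len(truth)
    rows, cols = Counter(truth), Counter(pred)
    joint = Counter(zip(truth, pred))
    mutual_info = sum(
        (c / n) * log(c * n / (rows[t] * cols[p]))
        for (t, p), c in joint.items()
    )
    entropy_truth = -sum((c / n) * log(c / n) for c in rows.values())
    entropy_pred = -sum((c / n) * log(c / n) for c in cols.values())
    denom = (entropy_truth + entropy_pred) / 2
    if denom == 0:
        return 1.0  # both labelings consist of a single cluster
    return mutual_info / denom


if __name__ == "__main__":
    # Toy labels, invented for illustration only (not data from the paper).
    truth = [0, 0, 1, 1, 2, 2]
    pred = [1, 1, 0, 0, 2, 2]  # same partition, cluster ids permuted
    print(adjusted_rand_index(truth, pred))
    print(normalized_mutual_info(truth, pred))
```

Both scores depend only on the partition, not on the cluster ids, so a clustering that matches the reference up to a relabeling scores the same as an exact match.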
Database: OpenAIRE