Unsupervised learning analysis on the proteomes of Zika virus.

Autor: Lara-Ramírez EE; Laboratorio de Biotecnología Farmacéutica, Centro de Biotecnología Genómica, Instituto Politécnico Nacional, Reynosa, Tamaulipas, México., Rivera G; Laboratorio de Biotecnología Farmacéutica, Centro de Biotecnología Genómica, Instituto Politécnico Nacional, Reynosa, Tamaulipas, México., Oliva-Hernández AA; Laboratorio de Biotecnología Experimental, Centro de Biotecnología Genómica, Instituto Politécnico Nacional, Reynosa, Tamaulipas, México., Bocanegra-Garcia V; Laboratorio de Interacción Ambiente Microorganismo, Centro de Biotecnología Genómica, Instituto Politécnico Nacional, Reynosa, Tamaulipas, México., López JA; Laboratorio de microRNAs y Cáncer, Unidad Académica de Ciencias Biológicas, Universidad Autónoma de Zacatecas, Zacatecas, Zacatecas, México., Guo X; Laboratorio de Biotecnología Genómica, Centro de Biotecnología Genómica, Instituto Politécnico Nacional, Reynosa, Tamaulipas, México.
Jazyk: angličtina
Zdroj: PeerJ. Computer science [PeerJ Comput Sci] 2024 Nov 11; Vol. 10, pp. e2443. Date of Electronic Publication: 2024 Nov 11 (Print Publication: 2024).
DOI: 10.7717/peerj-cs.2443
Abstrakt: Background: The Zika virus (ZIKV), which is transmitted by mosquito vectors to nonhuman primates and humans, causes devastating outbreaks in the poorest tropical regions of the world. Molecular epidemiology, supported by clustering phylogenetic gold standard studies using sequence data, has provided valuable information for tracking and controlling the spread of ZIKV. Unsupervised learning (UL), a form of machine learning algorithm, can be applied on the datasets without the need of known information for training.
Methods: In this work, unsupervised Random Forest (URF), followed by the application of dimensional reduction algorithms such as principal component analysis (PCA), Uniform Manifold Approximation and Projection (UMAP), t-distributed stochastic neighbor embedding (t-SNE), and autoencoders were used to uncover hidden patterns from polymorphic amino acid sites extracted on the proteome ZIKV multi-alignments, without the need of an underlying evolutionary model.
Results: The four UL algorithms revealed specific host and geographical clustering patterns for ZIKV. Among the four dimensionality reduction (DR) algorithms, the performance was better for UMAP. The four algorithms allowed the identification of imported viruses for specific geographical clusters. The UL dimension coordinates showed a significant correlation with phylogenetic tree branch lengths and significant phylogenetic dependence in Abouheif's Cmean and Pagel's Lambda tests (p value < 0.01) that showed comparable performance with the phylogenetic method. This analytical strategy was generalizable to an external large dengue type 2 dataset.
Conclusion: These UL algorithms could be practical evolutionary analytical techniques to track the dispersal of viral pathogens.
Competing Interests: The authors declare there are no competing interests.
(©2024 Lara-Ramírez et al.)
Databáze: MEDLINE