Computational identification of ultra-conserved elements in the human genome: a hypothesis on homologous DNA pairing.

Authors: Crossley ER (Program of Bioinformatics and Proteomics/Genomics, University of Toledo, Toledo, OH 43606, USA); Fedorova L (CRI Genetics LLC, Santa Monica, CA 90404, USA); Mulyar OA (CRI Genetics LLC, Santa Monica, CA 90404, USA); Freeman R (CRI Genetics LLC, Santa Monica, CA 90404, USA); Khuder S (Program of Bioinformatics and Proteomics/Genomics, University of Toledo, Toledo, OH 43606, USA; Department of Medicine, University of Toledo, Toledo, OH 43606, USA); Fedorov A (Program of Bioinformatics and Proteomics/Genomics, University of Toledo, Toledo, OH 43606, USA; CRI Genetics LLC, Santa Monica, CA 90404, USA; Department of Medicine, University of Toledo, Toledo, OH 43606, USA)
Language: English
Source: NAR genomics and bioinformatics [NAR Genom Bioinform] 2024 Jul 02; Vol. 6 (3), pp. lqae074. Date of Electronic Publication: 2024 Jul 02 (Print Publication: 2024).
DOI: 10.1093/nargab/lqae074
Abstract: Thousands of prolonged sequences of human ultra-conserved non-coding elements (UCNEs) share only one common feature: peculiarities in the unique composition of their dinucleotides. Here we investigate whether the numerous weak signals emanating from these dinucleotide arrangements can be used for computational identification of UCNEs within the human genome. For this purpose, we analyzed 4272 UCNE sequences, encompassing 1 393 448 nucleotides, alongside equally sized control samples of randomly selected human genomic sequences. Our research identified nine different features of dinucleotide arrangements that enable differentiation of UCNEs from the rest of the genome. We employed these nine features, implementing three Machine Learning techniques - Support Vector Machine, Random Forest, and Artificial Neural Networks - to classify UCNEs, achieving an accuracy rate of 82-84%, with specific conditions allowing for over 90% accuracy. Notably, the strongest feature for UCNE identification was the frequency ratio between GpC dinucleotides and the sum of GpG and CpC dinucleotides. Additionally, we investigated the entire pool of 31 046 SNPs located within UCNEs for their representation in the ClinVar database, which catalogs human SNPs with known phenotypic effects. The presence of UCNE-associated SNPs in ClinVar aligns with the expectation of a random distribution, emphasizing the enigmatic nature of UCNE phenotypic manifestation.
(© The Author(s) 2024. Published by Oxford University Press on behalf of NAR Genomics and Bioinformatics.)
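Illustrative note: the abstract names the frequency ratio between GpC dinucleotides and the sum of GpG and CpC dinucleotides as the strongest feature. The following is a minimal sketch of how such a ratio could be computed for one sequence. It is not the authors' implementation: the function names are hypothetical, and counting overlapping dinucleotides on a single strand is an assumption the abstract does not specify.

```python
from collections import Counter
from typing import Optional


def dinucleotide_counts(seq: str) -> Counter:
    """Count overlapping dinucleotides on one strand (an assumption)."""
    seq = seq.upper()
    return Counter(seq[i:i + 2] for i in range(len(seq) - 1))


def gpc_ratio(seq: str) -> Optional[float]:
    """Return count(GpC) / (count(GpG) + count(CpC)).

    Returns None when the sequence has no GpG or CpC dinucleotides,
    because the ratio is undefined in that case.
    """
    counts = dinucleotide_counts(seq)
    denominator = counts["GG"] + counts["CC"]
    if denominator == 0:
        return None
    return counts["GC"] / denominator


if __name__ == "__main__":
    # Toy sequence for illustration only, not real UCNE data.
    print(gpc_ratio("GCGCGGCC"))
```

In a classification setting, a value like this would be one entry in a per-sequence feature vector, computed in the same way for UCNE and control sequences.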
Database: MEDLINE