Dante: genotyping of known complex and expanded short tandem repeats
Authors: | Tomáš Szemes, Jaroslav Budis, Jan Radvanszky, Andrej Ficek, Juraj Gazdarica, Broňa Brejová, František Ďuriš, Michaela Zrubcová, Marcel Kucharik |
---|---|
Year of publication: | 2018 |
Subject: | Statistics and Probability; Genotype; Computer science; Sequence analysis; Locus (genetics); Computational biology; Biochemistry; 03 medical and health sciences; Humans; Nucleotide; Allele; Repeated sequence; Molecular Biology; Genotyping; Gene; Alleles; Polymerase; 030304 developmental biology; Sequence (medicine); chemistry.chemical_classification; Protein coding; 0303 health sciences; Massive parallel sequencing; biology; 030302 biochemistry & molecular biology; High-Throughput Nucleotide Sequencing; Sequence Analysis, DNA; Computer Science Applications; Computational Mathematics; Computational Theory and Mathematics; chemistry; biology.protein; Microsatellite; Human genome; Microsatellite Repeats |
Source: | Bioinformatics. 35:1310-1317 |
ISSN: | 1367-4811 1367-4803 |
Description: | Motivation: Short tandem repeats (STRs) are stretches of repetitive DNA in which short sequences, typically made of 2–6 nucleotides, are repeated several times. Since STRs have many important biological roles and also belong to the most polymorphic parts of the human genome, they became utilized in several molecular-genetic applications. Precise genotyping of STR alleles, therefore, was of high relevance during the last decades. Despite this, massively parallel sequencing (MPS) still lacks the analysis methods to fully utilize the information value of STRs in genome scale assays. Results: We propose an alignment-free algorithm, called Dante, for genotyping and characterization of STR alleles at user-specified known loci based on sequence reads originating from STR loci of interest. The method accounts for natural deviations from the expected sequence, such as variation in the repeat count, sequencing errors, ambiguous bases and complex loci containing several different motifs. In addition, we implemented a correction for copy number defects caused by the polymerase induced stutter effect as well as a prediction of STR expansions that, according to the conventional view, cannot be fully captured by inherently short MPS reads. We tested Dante on simulated datasets and on datasets obtained by targeted sequencing of protein coding parts of thousands of selected clinically relevant genes. In both these datasets, Dante outperformed HipSTR and GATK genotyping tools. Furthermore, Dante was able to predict allele expansions in all tested clinical cases. Availability and implementation: Dante is open source software, freely available for download at https://github.com/jbudis/dante. Supplementary information: Supplementary data are available at Bioinformatics online. |
Database: | OpenAIRE |
External link: |