De Novo Assembly of Two Swedish Genomes Reveals Missing Segments from the Human GRCh38 Reference and Improves Variant Calling of Population-Scale Sequencing Data
Authors: | Johan Dahlberg, Adam Ameur, Susana Häggqvist, Jessica Nordlund, Ulf Gyllensten, Ida Höijer, Marcel Martin, Huiwen Che, Francesco Vezzi, Ignas Bunikis, Pall I Olason, Lars Feuk |
---|---|
Language: | English |
Year of publication: | 2018 |
Subject: |
0301 basic medicine; human reference genome; lcsh:QH426-470; Sequencing data; Population; Sequence assembly; Computational biology; SMRT sequencing; Biology; de novo assembly; Genome; DISEASE; Article; 03 medical and health sciences; Genetics; Genetik; education; human whole-genome sequencing; Genetics (clinical); Genetics & Heredity; Protein coding; education.field_of_study; Science & Technology; Autosome; GENETIC-VARIATION; Chromosome; Swedish population; lcsh:Genetics; 030104 developmental biology; population sequencing; GRCh38; Life Sciences & Biomedicine; Single molecule real time sequencing; Reference genome; Personal genomics |
Source: | Genes, Vol 9, Iss 10, p 486 (2018) |
Description: | The current human reference sequence (GRCh38) is a foundation for large-scale sequencing projects. However, recent studies have suggested that GRCh38 may be incomplete and give a suboptimal representation of specific population groups. Here, we performed a de novo assembly of two Swedish genomes that revealed over 10 Mb of sequences absent from the human GRCh38 reference in each individual. Around 6 Mb of these novel sequences (NS) are shared with a Chinese personal genome. The NS are highly repetitive, have an elevated GC-content, and are primarily located in centromeric or telomeric regions. Up to 1 Mb of NS can be assigned to chromosome Y, and large segments are also missing from GRCh38 at chromosomes 14, 17, and 21. Inclusion of NS into the GRCh38 reference radically improves the alignment and variant calling from short-read whole-genome sequencing data at several genomic loci. A re-analysis of a Swedish population-scale sequencing project yields > 75,000 putative novel single nucleotide variants (SNVs) and removes > 10,000 false positive SNV calls per individual, some of which are located in protein coding regions. Our results highlight that the GRCh38 reference is not yet complete and demonstrate that personal genome assemblies from local populations can improve the analysis of short-read whole-genome sequencing data. |
Database: | OpenAIRE |
External link: |