Anchored pseudo-de novo assembly of human genomes identifies extensive sequence variation from unmapped sequence reads
Authors: | Kim H. Brown, Joshua J. Faber-Hammond |
---|---|
Year of publication: | 2016 |
Subject: |
0301 basic medicine;
Sequence assembly; Genomics; Sequence alignment; Computational biology; Biology; Article; 03 medical and health sciences; 0302 clinical medicine; Genetics; RefSeq; Humans; 1000 Genomes Project; Genetics (clinical); Sequence (medicine); Contig; Genome, Human; Genetic Variation; High-Throughput Nucleotide Sequencing; Sequence Analysis, DNA; 030104 developmental biology; Human genome; Sequence Alignment; 030217 neurology & neurosurgery |
Source: | Human Genetics. 135:727-740 |
ISSN: | 1432-1203 0340-6717 |
DOI: | 10.1007/s00439-016-1667-5 |
Description: | The human genome reference (HGR) completion marked the genomics era beginning, yet despite its utility universal application is limited by the small number of individuals used in its development. This is highlighted by the presence of high-quality sequence reads failing to map within the HGR. Sequences failing to map generally represent 2–5 % of total reads, which may harbor regions that would enhance our understanding of population variation, evolution, and disease. Alternatively, complete de novo assemblies can be created, but these effectively ignore the groundwork of the HGR. In an effort to find a middle ground, we developed a bioinformatic pipeline that maps paired-end reads to the HGR as separate single reads, exports unmappable reads, de novo assembles these reads per individual and then combines assemblies into a secondary reference assembly used for comparative analysis. Using 45 diverse 1000 Genomes Project individuals, we identified 351,361 contigs covering 195.5 Mb of sequence unincorporated in GRCh38. 30,879 contigs are represented in multiple individuals with ~40 % showing high sequence complexity. Genomic coordinates were generated for 99.9 %, with 52.5 % exhibiting high-quality mapping scores. Comparative genomic analyses with archaic humans and primates revealed significant sequence alignments and comparisons with model organism RefSeq gene datasets identified novel human genes. If incorporated, these sequences will expand the HGR, but more importantly our data highlight that with this method low coverage (~10–20×) next-generation sequencing can still be used to identify novel unmapped sequences to explore biological functions contributing to human phenotypic variation, disease and functionality for personal genomic medicine. |
Database: | OpenAIRE |
External link: |