Gaps and complex structurally variant loci in phased genome assemblies

Autor: Ahmad Abou Tayoun, David Porubsky, Peter Ebert, William Harvey, Catherine Stober Brasseur, Jan Korbel, Ashley Sanders, Mitchell Vollger
Rok vydání: 2022
DOI: 10.1101/2022.07.06.498874
Popis: There has been tremendous progress in the production of phased genome assemblies by combining long-read data with parental information or linking read data. Nevertheless, a typical phased genome assembly generated by trio-hifiasm still generates more than ~140 gaps. We perform a detailed analysis of gaps, assembly breaks, and misorientations from 77 phased and assembled human genomes (154 unique haplotypes). We find that trio-based approaches using HiFi are the current gold standard although chromosome-wide phasing accuracy is comparable when using Strand-seq instead of parental data. We find two-thirds of defined contig ends cluster near the largest and most identical repeats [including segmental duplications (35.4%) or satellite DNA (22.3%) or to regions enriched in GA/AT rich DNA (27.4%)]. As a result, 1513 protein-coding genes overlap assembly gaps in at least one haplotype and 231 are recurrently disrupted or missing from five or more haplotypes. In addition, we estimate that 6-7 Mbp of DNA are incorrectly orientated per haplotype irrespective of whether trio-free or trio-based approaches are employed. 81% of such misorientations correspond tobona fidelarge inversion polymorphisms in the human species, most of which are flanked by large identical segmental duplications. In addition, we also identify large-scale alignment discontinuities consistent with an 11.9 Mbp deletion and 161.4 Mbp of insertion per human haploid genome. While 99% of this variation corresponds to satellite DNA, we identify 230 regions of the euchromatic DNA with frequent expansions and contractions, nearly half of which overlap with 197 protein-coding genes. Although not completely resolved, these regions include copy number polymorphic and biomedically relevant genic regions where complete resolution and a pangenome representation will be most useful, yet most challenging, to realize.
Databáze: OpenAIRE