DandD: Efficient measurement of sequence growth and similarity

Authors: Jessica K. Bonnie, Omar Y. Ahmed, Ben Langmead
Language: English
Publication year: 2024
Subject:
Source: iScience, Vol 27, Iss 3, Pp 109054- (2024)
Document type: article
ISSN: 2589-0042
DOI: 10.1016/j.isci.2024.109054
Description: Summary: Genome assembly databases are growing rapidly. The redundancy of sequence content between a new assembly and previous ones is neither conceptually nor algorithmically easy to measure. We introduce pertinent methods and DandD, a tool addressing how much new sequence is gained when a sequence collection grows. DandD can describe how much structural variation is discovered in each new human genome assembly and when discoveries will level off in the future. DandD uses a measure called δ (“delta”), developed initially for data compression and chiefly dependent on k-mer counts. DandD rapidly estimates δ using genomic sketches. We propose δ as an alternative to k-mer-specific cardinalities when computing the Jaccard coefficient, thereby avoiding the pitfalls of a poor choice of k. We demonstrate the utility of DandD’s functions for estimating δ, characterizing the rate of pangenome growth, and computing all-pairs similarities using k-independent Jaccard.
Database: Directory of Open Access Journals
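
Illustration: the δ measure mentioned in the description can be sketched in a few lines. This is a minimal sketch, not DandD's implementation. It assumes the usual definition of δ from the compression literature, δ = max over k of d_k / k, where d_k is the number of distinct k-mers, and it assumes that the k-independent Jaccard substitutes δ for cardinalities through inclusion-exclusion. It counts k-mers exactly rather than with sketches, ignores reverse complements, and caps k at a hypothetical `max_k`. The function names are illustrative only.

```python
def distinct_kmers(seqs, k):
    """Set of distinct k-mers across a collection of sequences (no canonicalization)."""
    return {s[i:i + k] for s in seqs for i in range(len(s) - k + 1)}


def delta(seqs, max_k=32):
    """Exact delta under the assumed definition: max over k of d_k / k."""
    best = 0.0
    for k in range(1, max_k + 1):
        d_k = len(distinct_kmers(seqs, k))
        if d_k == 0:  # every sequence is shorter than k, so larger k adds nothing
            break
        best = max(best, d_k / k)
    return best


def delta_jaccard(a, b, max_k=32):
    """Jaccard with delta in place of cardinalities (assumed inclusion-exclusion form).

    Both arguments are lists of sequences; at least one must be non-empty.
    """
    d_a = delta(a, max_k)
    d_b = delta(b, max_k)
    d_union = delta(a + b, max_k)  # delta of the combined collection
    return (d_a + d_b - d_union) / d_union


if __name__ == "__main__":
    print(delta(["ACGT"]))
    print(delta_jaccard(["ACGT"], ["ACGT"]))
    print(delta_jaccard(["AAAA"], ["CCCC"]))
```

Because δ takes the maximum over k, no single k has to be chosen in advance, which is the property the description relies on when it proposes δ for the Jaccard coefficient.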