Nongreedy Unbalanced Huffman Tree Compressor for Single and Multifasta Files
Autor: | Sultan Alyami, Chun-Hsi Huang |
---|---|
Rok vydání: | 2020 |
Předmět: |
Computer science
Chromosomes Human Pair 22 Genomic data computer.software_genre Huffman coding Saccharomyces symbols.namesake Genetics Humans Vibrio cholerae Molecular Biology Process (computing) Computational Biology High-Throughput Nucleotide Sequencing Sequence Analysis DNA Neurospora Computational Mathematics Tree (data structure) Computational Theory and Mathematics Modeling and Simulation symbols Data mining computer Gas compressor Algorithms Software Data compression |
Zdroj: | Journal of Computational Biology. 27:868-876 |
ISSN: | 1557-8666 |
DOI: | 10.1089/cmb.2019.0249 |
Popis: | Next-generation sequencing technologies are producing genomic data at ever-increasing rates. It has become a challenge to store, transmit, and process the massive quantity of data, creating a vital need for a tool that compresses genomic data produced in a lossless manner, thus reducing storage space and speeding up data transmission. Data centers are adopting either of the two general-purpose genomic data compressors: gzip or bzip2. Both these use Huffman encoding, although they implement it in different ways. However, neither of these two takes advantage of properties of DNA data, such as the presence of a small alphabet and many repeats. Huffman encoding compression can be improved by exploiting DNA characteristics. Recently, it has been shown that Huffman encoding compression can be improved by creating an unbalanced Huffman tree (UHT), which demonstrates significant advances in compression over a standard Huffman tree used in both gzip and bzip2. However, the UHT created is greedy. This article proposes an improved nongreedy UHT (NUHT), a lossless nonreference-based fasta and multifasta compressor. We compare our algorithm with two well-known general-purpose compressors, gzip and bzip2, as well as with UHT, a DNA-specific compressor based on Huffman tree. Our algorithm outperforms all three in terms of compression ratio and is seven times faster than UHT. |
Databáze: | OpenAIRE |
Externí odkaz: |