Nongreedy Unbalanced Huffman Tree Compressor for Single and Multifasta Files

Autor: Sultan Alyami, Chun-Hsi Huang
Rok vydání: 2020
Předmět:
Zdroj: Journal of Computational Biology. 27:868-876
ISSN: 1557-8666
DOI: 10.1089/cmb.2019.0249
Popis: Next-generation sequencing technologies are producing genomic data at ever-increasing rates. It has become a challenge to store, transmit, and process the massive quantity of data, creating a vital need for a tool that compresses genomic data produced in a lossless manner, thus reducing storage space and speeding up data transmission. Data centers are adopting either of the two general-purpose genomic data compressors: gzip or bzip2. Both these use Huffman encoding, although they implement it in different ways. However, neither of these two takes advantage of properties of DNA data, such as the presence of a small alphabet and many repeats. Huffman encoding compression can be improved by exploiting DNA characteristics. Recently, it has been shown that Huffman encoding compression can be improved by creating an unbalanced Huffman tree (UHT), which demonstrates significant advances in compression over a standard Huffman tree used in both gzip and bzip2. However, the UHT created is greedy. This article proposes an improved nongreedy UHT (NUHT), a lossless nonreference-based fasta and multifasta compressor. We compare our algorithm with two well-known general-purpose compressors, gzip and bzip2, as well as with UHT, a DNA-specific compressor based on Huffman tree. Our algorithm outperforms all three in terms of compression ratio and is seven times faster than UHT.
Databáze: OpenAIRE