Fully Online Grammar Compression in Constant Space

Autor: Maruyama, Shirou, Tabei, Yasuo
Rok vydání: 2014
Předmět:
Druh dokumentu: Working Paper
Popis: We present novel variants of fully online LCA (FOLCA), a fully online grammar compression that builds a straight line program (SLP) and directly encodes it into a succinct representation in an online manner. FOLCA enables a direct encoding of an SLP into a succinct representation that is asymptotically equivalent to an information theoretic lower bound for representing an SLP (Maruyama et al., SPIRE'13). The compression of FOLCA takes linear time proportional to the length of an input text and its working space depends only on the size of the SLP, which enables us to apply FOLCA to large-scale repetitive texts. Recent repetitive texts, however, include some noise. For example, current sequencing technology has significant error rates, which embeds noise into genome sequences. For such noisy repetitive texts, FOLCA working in the SLP size consumes a large amount of memory. We present two variants of FOLCA working in constant space by leveraging the idea behind stream mining techniques. Experiments using 100 human genomes corresponding to about 300GB from the 1000 human genomes project revealed the applicability of our method to large-scale, noisy repetitive texts.
Comment: This is an extended version of a proceeding accepted to Data Compression Conference (DCC), 2014
Databáze: arXiv