Making compression algorithms for Unicode text

Authors:	Adam Gleave, Christian Steinruecken
Year of publication:	2017
Subject:	FOS: Computer and information sciences; Computer science; Computer Science - Information Theory; Speech recognition; 050801 communication & media studies; Character encoding; 02 engineering and technology; Data_CODINGANDINFORMATIONTHEORY; computer.software_genre; ASCII; 0508 media and communications; Encoding (memory); 0202 electrical engineering, electronic engineering, information engineering; Binary Ordered Compression for Unicode; Lossless compression; business.industry; Information Theory (cs.IT); 05 social sciences; Byte; Unicode; ComputingMethodologies_DOCUMENTANDTEXTPROCESSING; 020201 artificial intelligence & image processing; Artificial intelligence; business; computer; Natural language processing; Data compression
Source:	DCC
DOI:	10.48550/arxiv.1701.04047
Description:	The majority of online content is written in languages other than English, and is most commonly encoded in UTF-8, the world's dominant Unicode character encoding. Traditional compression algorithms typically operate on individual bytes. While this approach works well for the single-byte ASCII encoding, it works poorly for UTF-8, where characters often span multiple bytes. Previous research has focused on developing Unicode compressors from scratch, and these often failed to outperform established algorithms such as bzip2. We develop a technique to modify byte-based compressors to operate directly on Unicode characters, and implement variants of LZW and PPM that apply this technique. We find that our method substantially improves compression effectiveness on a UTF-8 corpus, with our PPM variant outperforming the state-of-the-art PPMII compressor. On ASCII and binary files, our variants perform similarly to the original unmodified compressors. Comment: 10 pages
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::8b66965cef40367e9192f995d33d958e
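
The description contrasts compressors that read individual bytes with compressors that read whole Unicode characters. The following is a minimal sketch of that contrast only, not the authors' implementation: it runs a textbook LZW over the same text once as UTF-8 bytes and once as code points. The names `lzw_encode` and `lzw_decode`, the sample sentence, and the choice to seed the character dictionary with just the characters present in the text are all assumptions made for illustration; the record does not say how the paper handles the initial alphabet.

```python
def lzw_encode(symbols, alphabet):
    """Textbook LZW over any symbol sequence; returns a list of dictionary indices."""
    table = {(s,): i for i, s in enumerate(alphabet)}
    out, current = [], ()
    for s in symbols:
        candidate = current + (s,)
        if candidate in table:
            current = candidate              # keep extending the current match
        else:
            out.append(table[current])       # emit the longest known prefix
            table[candidate] = len(table)    # learn the new phrase
            current = (s,)
    if current:
        out.append(table[current])
    return out


def lzw_decode(codes, alphabet):
    """Inverse of lzw_encode; rebuilds the same dictionary while decoding."""
    table = [(s,) for s in alphabet]
    if not codes:
        return []
    prev = table[codes[0]]
    out = list(prev)
    for code in codes[1:]:
        if code < len(table):
            entry = table[code]
        else:
            # The code refers to the phrase that is about to be added.
            entry = prev + (prev[0],)
        out.extend(entry)
        table.append(prev + (entry[0],))
        prev = entry
    return out


# Hypothetical sample: a Czech sentence in which many characters need two UTF-8 bytes.
text = "Příliš žluťoučký kůň úpěl ďábelské ódy. " * 20
utf8_bytes = text.encode("utf-8")

# Byte-oriented LZW: the alphabet is all 256 byte values.
byte_codes = lzw_encode(utf8_bytes, range(256))

# Character-oriented LZW: the alphabet is (by assumption) the characters in the text.
char_alphabet = sorted(set(text))
char_codes = lzw_encode(text, char_alphabet)

print("characters:", len(text), "UTF-8 bytes:", len(utf8_bytes))
print("codes emitted (bytes):", len(byte_codes))
print("codes emitted (characters):", len(char_codes))
```

Comparing the two code counts gives only a rough feel for the effect, since a real compressor must also encode each index in bits and cope with characters that are not in the initial dictionary.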