MergedTrie: Efficient textual indexing
Autor: | Antonio Ferrández, Jesús Peral |
---|---|
Přispěvatelé: | Universidad de Alicante. Departamento de Lenguajes y Sistemas Informáticos, Universidad de Alicante. Departamento de Tecnología Informática y Computación, Procesamiento del Lenguaje y Sistemas de Información (GPLSI), Lucentia |
Rok vydání: | 2019 |
Předmět: |
Big Data
Theoretical computer science Abstracting and Indexing Computer science Science Information Storage and Retrieval 02 engineering and technology Social Networking Set (abstract data type) 020204 information systems Textual indexing Trie 0202 electrical engineering electronic engineering information engineering Multidisciplinary String (computer science) Search engine indexing Correction Data structure Hash table Deterministic finite automaton MergedTrie Lenguajes y Sistemas Informáticos Medicine 020201 artificial intelligence & image processing Arquitectura y Tecnología de Computadores Algorithms Word (computer architecture) Research Article |
Zdroj: | PLoS ONE, Vol 14, Iss 4, p e0215288 (2019) RUA. Repositorio Institucional de la Universidad de Alicante Universidad de Alicante (UA) PLoS ONE |
ISSN: | 1932-6203 |
DOI: | 10.1371/journal.pone.0215288 |
Popis: | The accessing and processing of textual information (i.e. the storing and querying of a set of strings) is especially important for many current applications (e.g. information retrieval and social networks), especially when working in the fields of Big Data or IoT, which require the handling of very large string dictionaries. Typical data structures for textual indexing are Hash Tables and some variants of Tries such as the Double Trie (DT). In this paper, we propose an extension of the DT that we have called MergedTrie. It improves the DT compression by merging both Tries into a single and by segmenting the indexed term into two fixed length parts in order to balance the new Trie. Thus, a higher overlapping of both prefixes and suffixes is obtained. Moreover, we propose a new implementation of Tries that achieves better compression rates than the Double-Array representation usually chosen for implementing Tries. Our proposal also overcomes the limitation of static implementations that does not allow insertions and updates in their compact representations. Finally, our MergedTrie implementation experimentally improves the efficiency of the Hash Tables, the DTs, the Double-Array, the Crit-bit, the Directed Acyclic Word Graphs (DAWG), and the Acyclic Deterministic Finite Automata (ADFA) data structures, requiring less space than the original text to be indexed. This study has been partially funded by the SEQUOIA-UA (TIN2015-63502-C3-3-R) and the RESCATA (TIN2015-65100-R) projects of the Spanish Ministry of Economy and Competitiveness (MINECO). |
Databáze: | OpenAIRE |
Externí odkaz: |