Accurate Cardinality Estimation of Co-occurring Words Using Suffix Trees

Authors:	Klemens Böhm, Jens Willkomm, Martin Schäler
Publication year:	2021
Subject:	050101 languages & linguistics; Computer science; Suffix tree; 05 social sciences; String (computer science); 02 engineering and technology; String searching algorithm; Query optimization; law.invention; Tree (data structure); law; 0202 electrical engineering, electronic engineering, information engineering; 020201 artificial intelligence & image processing; 0501 psychology and cognitive sciences; Cardinality (SQL statements); Pruning (decision trees); Suffix; Algorithm
Source:	Database Systems for Advanced Applications ISBN: 9783030731960 DASFAA (2)
Description:	Estimating the cost of a query plan is one of the hardest problems in query optimization. This includes cardinality estimates of string search patterns, of multi-word strings like phrases or text snippets in particular. At first sight, suffix trees address this problem. To curb the memory usage of a suffix tree, one often prunes the tree to a certain depth. But this pruning method “takes away” more information from long strings than from short ones. This problem is particularly severe with sets of long strings, the setting studied here. In this article, we propose respective pruning techniques. Our approaches remove characters with low information value. The various variants determine a character’s information value in different ways, e.g., by using conditional entropy with respect to previous characters in the string. Our experiments show that, in contrast to the well-known pruned suffix tree, our technique provides significantly better estimations when the tree size is reduced by 60% or less. Due to the redundancy of natural language, our pruning techniques yield hardly any error for tree-size reductions of up to 50%.
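The depth-pruning baseline mentioned in the description can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `build_pruned_counts` and `estimate`, the sample strings, and the depth of 3 are assumptions made for illustration, and a plain substring-count table stands in for an actual suffix tree. The sketch counts substring occurrences, which may differ from the cardinality definition used in the paper.

```python
from collections import Counter


def build_pruned_counts(strings, max_depth):
    """Count occurrences of every substring of length <= max_depth.

    This holds the same counts a suffix tree pruned at depth max_depth
    would hold (hypothetical stand-in, not a real suffix tree).
    """
    counts = Counter()
    for s in strings:
        for i in range(len(s)):
            # Only keep substrings up to max_depth characters long.
            for j in range(i + 1, min(i + max_depth, len(s)) + 1):
                counts[s[i:j]] += 1
    return counts


def estimate(counts, pattern, max_depth):
    """Estimate how often pattern occurs.

    Patterns within the depth limit get their exact occurrence count.
    Longer patterns get an upper bound: the rarest window of length
    max_depth inside the pattern (assumed fallback for illustration).
    """
    if len(pattern) <= max_depth:
        return counts[pattern]
    windows = (pattern[i:i + max_depth]
               for i in range(len(pattern) - max_depth + 1))
    return min(counts[w] for w in windows)


# Illustrative data, not taken from the paper.
strings = ["big data", "big deal", "bad data"]
counts = build_pruned_counts(strings, max_depth=3)
short_estimate = estimate(counts, "big", 3)       # within the depth limit
long_estimate = estimate(counts, "big data", 3)   # longer than the depth limit
```

In this toy data the phrase "big data" occurs once, but every 3-character window of it occurs twice, so the pruned structure overestimates the long pattern while still answering the short one exactly. This is the kind of information loss on long strings that the description attributes to depth-based pruning.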
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::c514c48a5f6629f1794e8fac78aba5b4 https://doi.org/10.1007/978-3-030-73197-7_50