Benchmarking top-k keyword and top-k document processing with T²K² and T²K²D²

Authors:	Jérôme Darmont, Florin Radulescu, Alexandru Boicea, Ciprian-Octavian Truica
Publication year:	2018
Subject:	Information retrieval; Computer networks and communications; Computer science; Business; Information storage and retrieval; Engineering and technology; Benchmarking; Document processing; Oracle; Weighting; Text mining; Okapi BM25; Hardware and architecture; Information systems; Electrical engineering, electronic engineering, information engineering; Benchmark (computing); Artificial intelligence and image processing; Relevance (information retrieval); Software
Source:	Future Generation Computer Systems. 85:60-75
ISSN:	0167-739X
DOI:	10.1016/j.future.2018.02.037
Abstract:	Top-k keyword and top-k document extraction are very popular text analysis techniques. Top-k keywords and documents are often computed on the fly, but they exploit weighted vocabularies that are costly to build. To compare competing weighting schemes and database implementations, benchmarking is customary. To the best of our knowledge, no benchmark currently addresses these problems. Hence, in this paper, we present T²K², a top-k keywords and documents benchmark, and its decision support-oriented evolution T²K²D². Both benchmarks feature a real tweet dataset and queries with various complexities and selectivities. They help evaluate weighting schemes and database implementations in terms of computing performance. To illustrate our benchmarks' relevance and genericity, we successfully ran performance tests on the TF-IDF and Okapi BM25 weighting schemes, on the one hand, and on different relational (Oracle, PostgreSQL) and document-oriented (MongoDB) database implementations, on the other hand.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::1b550c1201b1111d379b2aaa89736097 https://doi.org/10.1016/j.future.2018.02.037