On the efficiency of data collection for multiple Naïve Bayes classifiers

Autor:	Edoardo Manino, Nicholas R. Jennings, Long Tran-Thanh
Jazyk:	angličtina
Rok vydání:	2019
Předmět:	Linguistics and Language Data collection business.industry Computer science Sampling (statistics) 02 engineering and technology Random walk Machine learning computer.software_genre Language and Linguistics Exponential function QA76 Naive Bayes classifier Empirical research Artificial Intelligence 020204 information systems 0202 electrical engineering electronic engineering information engineering 020201 artificial intelligence & image processing Artificial intelligence business Representation (mathematics) computer Independence (probability theory)
ISSN:	0004-3702
Popis:	Many classification problems are solved by aggregating the output of a group of distinct predictors. In this respect, a popular choice is to assume independence and employ a Naïve Bayes classifier. When we have not just one but multiple classification problems at the same time, the question of how to assign the limited pool of available predictors to the individual classification problems arises. Empirical studies show that the policies we use to perform such assignments have a strong impact on the accuracy of the system. However, to date there is little theoretical understanding of this phenomenon. To help rectify this, in this paper we provide the first theoretical explanation of the accuracy gap between the most popular policies: the non-adaptive uniform allocation, and the adaptive allocation schemes based on uncertainty sampling and information gain maximisation. To do so, we propose a novel representation of the data collection process in terms of random walks. Then, we use this tool to derive new lower and upper bounds on the accuracy of the policies. These bounds reveal that the tradeoff between the number of available predictors and the accuracy has a different exponential rate depending on the policy used. By comparing them, we are able to quantify the advantage that the two adaptive policies have over the non-adaptive one for the first time, and prove that the probability of error of the former decays at more than double the exponential rate of the latter. Furthermore, we show in our analysis that this result holds both in the case where we know the accuracy of each individual predictor, and in the case where we only have access to a noisy estimate of it.\ud \ud
Databáze:	OpenAIRE
Externí odkaz:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::d4fe5b0caa0d51ae919de4c0fbf2e319 http://wrap.warwick.ac.uk/139664/1/WRAP-On-effiency-data-collection-multiple-Tran-Thanh-2019.pdf Zobrazit plný text záznamu Full Text from ScienceDirect