Naive Feature Selection: A Nearly Tight Convex Relaxation for Sparse Naive Bayes
Author: Armin Askari, Alexandre d'Aspremont, Laurent El Ghaoui
Year of publication: 2023
Source: Mathematics of Operations Research
ISSN: 0364-765X (print), 1526-5471 (online)
DOI: 10.1287/moor.2023.1356
Description: Because of its linear complexity, naive Bayes classification remains an attractive supervised learning method, especially in very large-scale settings. We propose a sparse version of naive Bayes, which can be used for feature selection. This leads to a combinatorial maximum-likelihood problem, for which we provide an exact solution in the case of binary data, or a bound in the multinomial case. We prove that our convex relaxation bounds become tight as the marginal contribution of additional features decreases, using a priori duality gap bounds derived from the Shapley–Folkman theorem. We show how to produce primal solutions satisfying these bounds. Both binary and multinomial sparse models are solvable in time almost linear in problem size, representing a very small extra cost relative to classical naive Bayes. Numerical experiments on text data show that the naive Bayes feature selection method is as statistically effective as state-of-the-art feature selection methods such as recursive feature elimination, l1-penalized logistic regression, and LASSO, while being orders of magnitude faster (a Python implementation can be found at https://github.com/aspremon/NaiveFeatureSelection).

Funding: A. d'Aspremont acknowledges support from the Fonds AXA pour la Recherche and Kamet Ventures [Machine Learning and Optimisation Joint Research Initiative], a Google Focused Award, and funding from the Agence Nationale de la Recherche [Grant ANR-19-P3IA-0001]. L. El Ghaoui acknowledges support from Berkeley Artificial Intelligence Research and the Tsinghua–Berkeley–Shenzhen Institute.
Database: OpenAIRE
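To make the binary case of the abstract concrete, here is a minimal, self-contained Python sketch of the kind of decoupled feature scoring the description refers to for binary data: each feature is scored by how much the class-conditional Bernoulli likelihood improves over a single pooled parameter, and the k highest-scoring features are kept. Because this objective decouples across features, a greedy top-k choice is exact. All names below are ours, and the per-feature gain is a simplified rendering of the decoupled objective rather than a transcription of the paper's formulas; the authors' actual implementation is at the GitHub link above.

```python
import numpy as np

def sparse_nb_binary_scores(X, y):
    """Per-feature log-likelihood gain from giving a feature its own
    class-conditional Bernoulli parameters instead of one shared
    (pooled) parameter. X: (n_samples, n_features) binary matrix,
    y: binary labels in {0, 1}. Illustrative sketch only."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    pos, neg = X[y == 1], X[y == 0]
    n_pos, n_neg = len(pos), len(neg)
    c_pos, c_neg = pos.sum(axis=0), neg.sum(axis=0)

    def bernoulli_ll(count, total):
        # Maximized Bernoulli log-likelihood; 0 * log(0) is treated as 0.
        p = count / total
        with np.errstate(divide="ignore", invalid="ignore"):
            ll = count * np.log(p) + (total - count) * np.log(1 - p)
        return np.nan_to_num(ll)

    separate = bernoulli_ll(c_pos, n_pos) + bernoulli_ll(c_neg, n_neg)
    pooled = bernoulli_ll(c_pos + c_neg, n_pos + n_neg)
    return separate - pooled  # nonnegative; larger = more discriminative

def select_features(X, y, k):
    """Keep the k features with the largest likelihood gain; in the
    binary case this top-k choice solves the decoupled combinatorial
    problem exactly."""
    scores = sparse_nb_binary_scores(X, y)
    return np.argsort(scores)[::-1][:k]

# Toy usage: 6 samples, 4 binary features.
X = np.array([[1, 0, 1, 0],
              [1, 0, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 0, 1],
              [0, 1, 0, 0],
              [0, 1, 1, 1]])
y = np.array([1, 1, 1, 0, 0, 0])
# Feature 0 ranks first; features 1 and 3 essentially tie for second.
print(select_features(X, y, k=2))
```

Each score is a likelihood-ratio statistic: how much better the data are explained when a feature is allowed class-specific parameters. Sorting n features takes O(n log n), which matches the abstract's claim that the sparse model costs little more than classical naive Bayes.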