Distribution-free tests for lossless feature selection in classification and regression
Author: | Györfi, László; Linder, Tamás; Walk, Harro |
---|---|
Year of publication: | 2023 |
Document type: | Working Paper |
Description: | We study the problem of lossless feature selection for a $d$-dimensional feature vector $X=(X^{(1)},\dots ,X^{(d)})$ and label $Y$ for binary classification as well as nonparametric regression. For an index set $S\subset \{1,\dots ,d\}$, consider the selected $|S|$-dimensional feature subvector $X_S=(X^{(i)}, i\in S)$. If $L^*$ and $L^*(S)$ stand for the minimum risk based on $X$ and $X_S$, respectively, then $X_S$ is called lossless if $L^*=L^*(S)$. For classification, the minimum risk is the Bayes error probability, while in regression, the minimum risk is the residual variance. We introduce nearest-neighbor-based test statistics to test the hypothesis that $X_S$ is lossless. For the threshold $a_n=\log n/\sqrt{n}$, the corresponding tests are proved to be consistent under conditions on the distribution of $(X,Y)$ that are significantly milder than in previous work. Also, our threshold is dimension-independent, in contrast to earlier methods where for large $d$ the threshold becomes too large to be useful in practice. Comment: 22 pages |
Database: | arXiv |
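
The description above names a threshold $a_n=\log n/\sqrt{n}$ and a nearest-neighbor-based statistic but does not spell the statistic out. The following is a minimal sketch for the regression case only, under stated assumptions: it uses the identity $L^*(S)-L^*=E[m(X)^2]-E[m_S(X_S)^2]$ for the regression functions $m$ and $m_S$, and estimates each second moment by averaging $Y_i$ times the label of the first nearest neighbor of $X_i$. This choice of statistic, and the names `threshold`, `nn_index`, `lossless_statistic` and `accept_lossless`, are illustrative assumptions and may differ from the construction in the paper.

```python
import numpy as np


def threshold(n):
    """Dimension-independent threshold a_n = log(n) / sqrt(n) from the description."""
    return float(np.log(n) / np.sqrt(n))


def nn_index(points):
    """Index of the first nearest neighbor of each row among the other rows.

    Brute force, O(n^2) memory; adequate for a small illustration only.
    """
    diff = points[:, None, :] - points[None, :, :]
    dist = np.einsum("ijk,ijk->ij", diff, diff)  # squared Euclidean distances
    np.fill_diagonal(dist, np.inf)               # a point is not its own neighbor
    return dist.argmin(axis=1)


def lossless_statistic(X, y, S):
    """Assumed 1-NN estimate of E[m(X)^2] - E[m_S(X_S)^2] (hypothetical statistic)."""
    full = nn_index(X)          # neighbors using all d features
    sub = nn_index(X[:, S])     # neighbors using only the features in S
    return float(np.mean(y * (y[full] - y[sub])))


def accept_lossless(X, y, S):
    """Accept the hypothesis 'X_S is lossless' when the statistic is at most a_n."""
    return bool(lossless_statistic(X, y, S) <= threshold(len(y)))
```

A call such as `accept_lossless(X, y, [0, 2])` with `X` of shape `(n, d)` and `y` of shape `(n,)` tests whether features 0 and 2 suffice. Note that the threshold depends only on the sample size `n`, not on `d`, which is the property the description highlights.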