Robust machine learning by median-of-means: theory and practice
Authors: | Matthieu Lerasle, Guillaume Lecué |
---|---|
Contributors: | Ecole Nationale de la Statistique et de l'Analyse Economique (ENSAE); Model selection in statistical learning (SELECT); Laboratoire de Mathématiques d'Orsay (LMO); Université Paris-Sud - Paris 11 (UP11); Université Paris-Saclay; Centre National de la Recherche Scientifique (CNRS); Inria Saclay - Ile de France; Institut National de Recherche en Informatique et en Automatique (Inria); Statistique mathématique et apprentissage (CELESTE) |
Language: | English |
Publication year: | 2020 |
Subject: | Statistics and Probability; Statistics, Probability and Uncertainty; Mathematics; Mathematics - Statistics Theory (math.ST); [STAT.TH] Statistics [stat]/Statistics Theory [stat.TH]; FOS: Mathematics; 01 natural sciences; 0101 mathematics; 010104 statistics & probability; 62G05; 62G08; 62G20; 62C20; 60K35; Lasso (statistics); Simple (abstract algebra); Estimator; Minimax; Proof of concept; Empirical processes; Outlier; Standard algorithms; High-dimensional statistics; Algorithm |
Source: | Annals of Statistics (Institute of Mathematical Statistics), vol. 48, no. 2 (2020), pp. 906-931, doi:10.1214/19-AOS1828 |
ISSN: | 0090-5364, 2168-8966 |
DOI: | 10.1214/19-AOS1828 |
Description: | We introduce new estimators for robust machine learning based on median-of-means (MOM) estimators of the mean of real-valued random variables. These estimators achieve optimal rates of convergence under minimal assumptions on the dataset. The dataset may also have been corrupted by outliers on which no assumption is made. We also analyze these new estimators with standard tools from robust statistics. In particular, we revisit the concept of breakdown point. We modify the original definition by studying the number of outliers that a dataset can contain without deteriorating the estimation properties of a given estimator. This new notion of breakdown number, which takes into account the statistical performance of the estimators, is non-asymptotic in nature and adapted to machine learning purposes. We prove that the breakdown number of our estimator is of the order of (number of observations) * (rate of convergence). For instance, the breakdown number of our estimators for the problem of estimating a d-dimensional vector with noise variance sigma^2 is sigma^2 d, and it becomes sigma^2 s log(d/s) when this vector has only s non-zero components. Beyond this breakdown point, we prove that the rate of convergence achieved by our estimator is (number of outliers) divided by (number of observations). Besides these theoretical guarantees, the major improvement brought by these new estimators is that they are easily computable in practice. In fact, essentially any algorithm used to approximate the standard Empirical Risk Minimizer (or its regularized versions) has a robust version approximating our estimators. As a proof of concept, we study many algorithms for the classical LASSO estimator. A byproduct of the MOM algorithms is a measure of depth of data that can be used to detect outliers. Comment: 48 pages, 6 figures |
Database: | OpenAIRE |
External link: |
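
The description builds on the median-of-means (MOM) estimator of the mean of a real-valued random variable. As a rough illustration only, the basic univariate estimator can be sketched as follows: split the sample into blocks, average each block, and take the median of the block means. The function name `median_of_means`, the random partition, and the choice of block count are assumptions made for this sketch; they are not taken from the paper, whose estimators for learning problems are more involved.

```python
import numpy as np


def median_of_means(x, n_blocks, rng=None):
    """Basic median-of-means estimate of the mean of a 1-D sample.

    Illustrative sketch: `n_blocks` is a user-chosen tuning parameter,
    and `rng` (seed or numpy Generator) controls the random partition.
    """
    x = np.asarray(x, dtype=float).ravel()
    if not 1 <= n_blocks <= x.size:
        raise ValueError("n_blocks must be between 1 and the sample size")
    generator = np.random.default_rng(rng)
    # Shuffle so that block membership does not depend on the data order.
    shuffled = generator.permutation(x)
    # Split into n_blocks groups of (nearly) equal size.
    blocks = np.array_split(shuffled, n_blocks)
    block_means = np.array([block.mean() for block in blocks])
    # The median ignores the minority of blocks spoiled by outliers.
    return float(np.median(block_means))


# Usage: 1000 clean standard-normal points plus 10 gross outliers.
sample_rng = np.random.default_rng(0)
sample = np.concatenate([sample_rng.normal(0.0, 1.0, 1000), np.full(10, 1e6)])
print("empirical mean :", sample.mean())
print("median of means:", median_of_means(sample, n_blocks=41, rng=1))
```

With 41 blocks and 10 outliers, at most 10 block means can be contaminated, so the median is taken over a majority of clean blocks. Using more than twice as many blocks as suspected outliers is a common rule of thumb for this reason, at the cost of noisier block means.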