Contributions à la calibration d'algorithmes d'apprentissage : Validation-croisée et détection de ruptures

Author: Celisse, Alain
Contributors: Laboratoire Paul Painlevé - UMR 8524 (LPP), Université de Lille, Centre National de la Recherche Scientifique (CNRS), MOdel for Data Analysis and Learning (MODAL), Inria Lille - Nord Europe, Institut National de Recherche en Informatique et en Automatique (Inria), Evaluation des technologies de santé et des pratiques médicales - ULR 2694 (METRICS), Centre Hospitalier Régional Universitaire [Lille] (CHRU Lille), École polytechnique universitaire de Lille (Polytech Lille), Université de Lille, Sciences et Technologies, Eric Moulines, Celisse, Alain
Language: English
Year of publication: 2018
Subject:
Source: Statistics [math.ST]. Université de Lille, 2018
Description: The present manuscript mainly focuses on cross-validation procedures (and in particular on leave-p-out (LpO)), describing their practical aspects as well as new strategies leading to non-asymptotic theoretical guarantees on their statistical performance (concentration inequalities, oracle inequalities). As a privileged application, cross-validation is also used to address the multiple change-point detection problem in the off-line context. This problem is then tackled in a more general framework by means of reproducing kernels and the model selection paradigm.

After introducing the cross-validation procedures in Chapter 1, strategies allowing us to efficiently compute cross-validation estimators are detailed in Chapter 2. In particular, several of them yield closed-form expressions for the LpO estimator, which considerably reduces the computational cost. Such closed-form expressions have already been derived in density estimation with projection and kernel estimators, and with k-nearest-neighbor estimators in the regression and binary classification contexts.

Chapter 3 discusses the statistical properties of the cross-validation estimators (used as risk estimators) in terms of bias, variance, and mean squared error. For instance, among cross-validation estimators, the LpO estimator is shown to enjoy the lowest variance for a given test-set cardinality. The leave-one-out (L1O) estimator is also proved to be asymptotically optimal in terms of mean squared error in density estimation with projection estimators.

Several approaches leading to concentration inequalities for the LpO estimator around its expectation are discussed in Chapter 4. A direct approach, relying on the combination of closed-form expressions with the classical concentration inequalities of Bernstein and Talagrand, is first presented in the density estimation context. A more general approach is then described, which exploits the link between the LpO estimator and U-statistics. Its main underlying idea is to deduce exponential concentration results for the LpO estimator from moment inequalities. The derivation of the preliminary results also involves the stability of the learning algorithm under consideration.

The important question of model/statistical algorithm selection is addressed in Chapter 5 in the particular case of density estimation. The optimality of the LpO-based model selection procedure is proved under some conditions both for estimation (by means of a non-asymptotic oracle inequality) and for identification (through a model consistency result).
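To fix notation for the summary above, the LpO estimator admits the following generic form; this is a standard presentation given here only for illustration, not a formula quoted from the manuscript. Given an i.i.d. sample X_1, ..., X_n, a learning rule and a contrast measuring the loss of a trained rule at a new observation, one may write

\[
  \widehat{R}_{p}(\widehat{s})
  \;=\;
  \binom{n}{p}^{-1}
  \sum_{e \in \mathcal{E}_p}
  \frac{1}{p} \sum_{i \in e}
  \gamma\!\left( \widehat{s}^{\,(\bar{e})},\, X_i \right),
\]

where \(\mathcal{E}_p\) denotes the collection of all test sets \(e \subset \{1, \dots, n\}\) of cardinality \(p\), \(\bar{e} = \{1, \dots, n\} \setminus e\) is the corresponding training set, \(\widehat{s}^{\,(\bar{e})}\) is the rule trained on the observations indexed by \(\bar{e}\), and \(\gamma\) is the contrast (loss). Taking \(p = 1\) recovers the leave-one-out estimator; the closed-form expressions of Chapter 2 make this quantity computable without enumerating the \(\binom{n}{p}\) test sets.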
Cross-validation is then used to tackle the multiple change-point detection problem in the off-line setting, where the variance is allowed to vary over time (heteroscedastic setting). Chapter 6 summarizes the conclusions drawn from theoretical as well as empirical results about the behavior of cross-validation procedures. In particular, these conclusions lead us to suggest new model selection procedures relying on cross-validation. At the price of a higher computational cost, these procedures automatically take into account changes arising in the variance, for instance, which improves the statistical performance. The more general question of detecting changes in the full distribution of the observations (and not only in the mean) is also addressed by means of reproducing kernels. A new model selection procedure is designed, based on a penalty derived in the reproducing kernel Hilbert space framework. Its non-asymptotic performance is quantified through an oracle inequality holding with high probability. Numerous aspects of the new procedure are also assessed in an empirical study; for instance, the results illustrate that the chosen kernel clearly influences the final performance.

Finally, the manuscript ends with Chapter 7, which highlights several challenging perspectives that could give rise to important improvements on both the practical and theoretical sides.
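As an illustration of the kernel-based approach mentioned above (the notation is schematic and not taken from the manuscript), kernel change-point procedures of this type typically select a segmentation by minimizing a penalized least-squares criterion computed in the reproducing kernel Hilbert space \(\mathcal{H}\) associated with a kernel \(k\):

\[
  \widehat{\tau} \in \operatorname*{arg\,min}_{\tau}
  \left\{
    \frac{1}{n} \sum_{\lambda \in \tau} \sum_{i \in \lambda}
    \big\| \Phi(X_i) - \widehat{\mu}_{\lambda} \big\|_{\mathcal{H}}^{2}
    \;+\; \mathrm{pen}(\tau)
  \right\},
  \qquad
  \widehat{\mu}_{\lambda} = \frac{1}{|\lambda|} \sum_{j \in \lambda} \Phi(X_j),
\]

where \(\tau\) ranges over segmentations of \(\{1, \dots, n\}\) into consecutive segments \(\lambda\), \(\Phi\) is the canonical feature map of \(k\), and \(\mathrm{pen}(\tau)\) is a penalty increasing with the number of segments \(D_\tau\) (penalties proportional to \(\tfrac{D_\tau}{n}\,(c_1 + c_2 \log\tfrac{n}{D_\tau})\) are typical in this literature; the exact penalty derived in the manuscript may differ). Since \(\| \Phi(X_i) - \widehat{\mu}_{\lambda} \|_{\mathcal{H}}^{2}\) expands solely in terms of kernel evaluations \(k(X_i, X_j)\), the criterion can be computed from the Gram matrix and minimized by dynamic programming, which is also where the influence of the chosen kernel on the final performance enters.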
Database: OpenAIRE