Variable selection for unsupervised high-dimensional classification.

Authors
  • MEYNET Caroline
  • MASSART Pascal
  • CELEUX Gilles
  • BACH Francis
  • BIERNACKI Christophe
  • BIAU Gérard
  • POURSAT Marie-Anne
Publication date
2012
Publication type
Thesis
Summary There are statistical modeling situations in which the classical problem of unsupervised classification (i.e., clustering without a priori information on the nature or number of classes to be formed) is coupled with the problem of identifying the variables that are truly relevant for determining the classification. This problem has become all the more pressing as so-called high-dimensional data, with many more variables than observations, have multiplied in recent years: gene expression data, curve classification, and so on.

We propose a variable selection procedure for unsupervised classification adapted to high-dimensional problems. We adopt a Gaussian mixture model approach, which allows us to recast variable selection and the choice of the number of classes as a single global model selection problem. We exploit the variable selection properties of l1 regularization to build efficiently, from the data, a collection of models that remains of reasonable size even in high dimension. Our procedure differs from classical l1-regularization variable selection procedures in its parameter estimation: in each model, instead of keeping the Lasso estimator, we compute the maximum likelihood estimator. We then select one of these maximum likelihood estimators by a non-asymptotic penalized criterion based on the slope heuristic introduced by Birgé and Massart.

From a theoretical point of view, we establish a model selection theorem for maximum likelihood density estimation over a random collection of models, and we apply it in our context to derive a minimal penalty shape for our penalized criterion. From a practical point of view, simulations are carried out to validate the procedure, in particular in the context of unsupervised curve classification.

The key idea of the procedure is to use l1 regularization only to build a restricted collection of models, and not to estimate the model parameters as well; that estimation step is performed by maximum likelihood. This hybrid procedure is motivated by a theoretical study carried out in the first part of the thesis, in which we establish l1 oracle inequalities for the Lasso in the frameworks of Gaussian regression and mixtures of Gaussian regressions; these differ from the traditionally established l0 oracle inequalities in that they require no assumptions at all.
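As a rough illustration of the model-building step described above, the sketch below runs an l1-penalized EM for a Gaussian mixture over a grid of penalty levels and collects the distinct (number of classes, active variable set) pairs that appear along the path. It is a minimal sketch under simplifying assumptions the thesis does not necessarily make (standardized data, unit variances, spherical components); every function and parameter name here is ours, for illustration only.

```python
# Illustrative model-collection step: l1-penalized EM on the component
# means of a spherical Gaussian mixture. On standardized data, a variable
# whose mean is zero in every component is declared irrelevant, so the
# set of nonzero-mean variables traced along a penalty grid yields a
# data-driven collection of candidate models.
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def l1_penalized_em(X, K, lam, n_iter=100, seed=0):
    """EM for a K-component unit-variance Gaussian mixture with an
    l1 penalty lam * sum_{k,j} |mu_kj| on the component means."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    mu = X[rng.choice(n, K, replace=False)]          # initial means
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: responsibilities under unit-variance Gaussians
        log_dens = np.stack([
            -0.5 * ((X - mu[k]) ** 2).sum(axis=1) + np.log(pi[k])
            for k in range(K)
        ], axis=1)
        log_dens -= log_dens.max(axis=1, keepdims=True)
        resp = np.exp(log_dens)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: mixing weights, then soft-thresholded means
        nk = resp.sum(axis=0)
        pi = nk / n
        for k in range(K):
            mu[k] = soft_threshold(resp[:, k] @ X, lam) / nk[k]
    return mu, pi

def model_collection(X, K_grid, lam_grid):
    """Distinct (K, active variable set) pairs along the penalty path."""
    models = set()
    for K in K_grid:
        for lam in lam_grid:
            mu, _ = l1_penalized_em(X, K, lam)
            active = tuple(np.where(np.abs(mu).sum(axis=0) > 1e-10)[0])
            if active:
                models.add((K, active))
    return sorted(models)
```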
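Continuing the sketch, each candidate model can then be re-estimated by plain maximum likelihood (here via scikit-learn's EM) and a single model selected by a penalized criterion calibrated with the slope heuristic: the slope of the log-likelihood against model dimension over the most complex models estimates the minimal penalty, and the final penalty is taken to be twice that. Following the abstract's framework, irrelevant variables are given a single Gaussian distribution common to all classes so that likelihoods remain comparable across variable subsets; the dimension formula and the factor 2 are the usual slope-heuristic recipe, not formulas quoted from the thesis.

```python
# Illustrative estimation and selection steps: maximum-likelihood refit
# of each candidate model, then slope-heuristic model selection.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_models(X, models):
    """ML refit of every (K, active set); irrelevant variables get one
    Gaussian per variable, shared by all classes."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    fits = []
    for K, active in models:
        act = list(active)
        inact = [j for j in range(p) if j not in set(act)]
        gmm = GaussianMixture(n_components=K, covariance_type='diag',
                              n_init=5, random_state=0).fit(X[:, act])
        loglik = gmm.score(X[:, act]) * n            # total log-likelihood
        for j in inact:                              # ML Gaussian fit per
            v = X[:, j].var()                        # irrelevant variable
            loglik += -0.5 * n * (np.log(2 * np.pi * v) + 1.0)
        dim = (K - 1) + 2 * K * len(act) + 2 * len(inact)
        fits.append((K, active, loglik, dim))
    return fits

def slope_heuristic_select(fits, frac=0.5):
    """Estimate the minimal-penalty slope on the most complex models,
    then minimize  -loglik + 2 * slope * dim."""
    fits = sorted(fits, key=lambda f: f[3])
    dims = np.array([f[3] for f in fits], dtype=float)
    logliks = np.array([f[2] for f in fits])
    big = dims >= np.quantile(dims, 1 - frac)        # largest models only
    slope = np.polyfit(dims[big], logliks[big], 1)[0]
    crit = -logliks + 2.0 * max(slope, 0.0) * dims
    return fits[int(np.argmin(crit))]
```

A typical call would chain the two sketches, e.g. fits = fit_models(X, model_collection(X, range(2, 6), np.logspace(-1, 2, 20))) followed by slope_heuristic_select(fits).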