Stochastic approximation in Hilbert spaces.

Authors
Publication date
2017
Publication type
Thesis
Summary
The goal of supervised learning is to infer relationships between a phenomenon we wish to predict and "explanatory" variables. To this end, we observe multiple realizations of the phenomenon, from which we propose a prediction rule. The recent emergence of very large-scale data sources, both in the number of observations (in image analysis, for example) and in the number of explanatory variables (in genetics), has brought two difficulties to light. On the one hand, it becomes hard to avoid overfitting when the number of explanatory variables is much larger than the number of observations. On the other hand, the computational aspect becomes decisive, since solving a linear system in the spaces involved can itself be a major obstacle. Algorithms derived from stochastic approximation methods address both difficulties at once: a stochastic method drastically reduces the computational cost without degrading the quality of the proposed prediction rule, while avoiding overfitting. This thesis focuses in particular on stochastic gradient methods.

The very popular parametric methods propose, as predictors, linear functions of a chosen set of explanatory variables. However, these methods often yield an imprecise approximation of the underlying statistical structure. In the non-parametric framework, which is one of the central themes of this thesis, the restriction to linear predictors is lifted: the class of functions in which the predictor is built depends on the observations. In practice, non-parametric methods are crucial for various applications, in particular for the analysis of non-vector data, which can be mapped to a vector in a function space through a positive definite kernel. This makes it possible to use algorithms designed for vector data, but requires understanding how these algorithms behave in the associated non-parametric space, the reproducing kernel Hilbert space. Moreover, the analysis of non-parametric estimation also sheds light on the parametric framework when the number of predictors greatly exceeds the number of observations.

The first contribution of this thesis is a detailed analysis of stochastic approximation in the non-parametric framework, in particular in reproducing kernel Hilbert spaces. This analysis yields optimal convergence rates for the averaged stochastic gradient descent algorithm. It applies to many settings, with particular attention to the use of minimal assumptions, and to settings where the number of observations is either known in advance or allowed to grow. A minimal sketch of the averaged stochastic gradient algorithm in this kernel setting is given below.

The second contribution is an algorithm, based on an acceleration principle, that converges at the optimal rate from both the optimization and the statistical point of view. In the non-parametric framework, this improves convergence to the optimal rate in some regimes where the first algorithm analyzed remained sub-optimal.

Finally, the third contribution extends the framework studied beyond the least-squares loss: the stochastic gradient descent algorithm is analyzed as a Markov chain. This approach yields an intuitive interpretation and highlights the differences between the quadratic setting and the general one. A simple method that substantially improves convergence is also proposed.
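As a rough illustration of the first setting described above, here is a minimal sketch of averaged stochastic gradient descent for least-squares regression in a reproducing kernel Hilbert space. The Gaussian kernel, the constant step size, and the single-pass loop are illustrative assumptions for this sketch, not the precise algorithm or constants studied in the thesis.

```python
import numpy as np

def gaussian_kernel(x, y, bandwidth=1.0):
    """Illustrative positive definite (Gaussian) kernel on R^d."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * bandwidth ** 2))

def averaged_kernel_sgd(X, Y, kernel=gaussian_kernel, step_size=0.5):
    """
    One pass of stochastic gradient descent for least-squares regression in the
    RKHS of `kernel`, with averaging of the iterates (Polyak-Ruppert style).

    The t-th iterate is stored through the coefficients of its kernel expansion
    f_t = sum_i alpha[i] * K(X[i], .); the function returns the coefficients of
    the averaged estimator.
    """
    n = len(X)
    alpha = np.zeros(n)        # coefficients of the current iterate f_t
    alpha_avg = np.zeros(n)    # coefficients of the averaged iterate
    for t in range(n):
        # Prediction of the current iterate at the new point X[t]:
        # f_t(X[t]) = sum_{i < t} alpha[i] * K(X[i], X[t])
        pred = sum(alpha[i] * kernel(X[i], X[t]) for i in range(t))
        # Unregularized SGD step: f_{t+1} = f_t - gamma * (f_t(X[t]) - Y[t]) K(X[t], .)
        # which only appends a new coefficient to the kernel expansion.
        alpha[t] = -step_size * (pred - Y[t])
        # Running average of the iterates.
        alpha_avg += (alpha - alpha_avg) / (t + 1)
    return alpha_avg

def predict(alpha, X_train, x, kernel=gaussian_kernel):
    """Evaluate the averaged estimator sum_i alpha[i] * K(X_train[i], x)."""
    return sum(a * kernel(xi, x) for a, xi in zip(alpha, X_train))
```

In this naive representation, one pass over n observations costs on the order of n^2 kernel evaluations; the step size and bandwidth would in practice be tuned to the problem, whereas the rates discussed in the abstract concern how the averaged estimator's error decreases with n.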