DIEULEVEUT Aymeric

Affiliations
  • 2014 - 2018
    Apprentissage statistique et parcimonie
  • 2018 - 2021
    Détermination de Formes Et Identification
  • 2016 - 2017
    École normale supérieure, Paris
  • 2016 - 2017
    Communauté d'universités et établissements Université de Recherche Paris Sciences et Lettres
  • 2014 - 2018
    Département d'Informatique de l'Ecole Normale Supérieure
  • 2016 - 2017
    Sciences mathématiques de Paris Centre
  • 2021
  • 2020
  • 2019
  • 2018
  • 2017
  • 2015
  • Super-Acceleration with Cyclical Step-sizes.

    Baptiste GOUJAUD, Damien SCIEUR, Aymeric DIEULEVEUT, Adrien TAYLOR, Fabian PEDREGOSA
    2021
    Cyclical step-sizes are becoming increasingly popular in the optimization of deep learning problems. Motivated by recent observations on the spectral gaps of Hessians in machine learning, we show that these step-size schedules offer a simple way to exploit them. More precisely, we develop a convergence rate analysis for quadratic objectives that provides optimal parameters and shows that cyclical learning rates can improve upon traditional lower complexity bounds. We further propose a systematic approach to design optimal first order methods for quadratic minimization with a given spectral structure. Finally, we provide a local convergence rate analysis beyond quadratic minimization for the proposed methods and illustrate our findings through benchmarks on least squares and logistic regression problems.
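The mechanism can be illustrated with a minimal sketch (the eigenvalues and step sizes below are illustrative choices of mine, not the paper's optimal parameters): gradient descent on a diagonal quadratic whose spectrum has two well-separated clusters, cycling between a large and a small step size.

```python
# Hypothetical sketch: gradient descent with a cyclical (alternating)
# step-size schedule on a quadratic with two eigenvalue clusters, the
# setting analyzed in the paper. Constants are illustrative only.

def grad(x, eigs):
    # gradient of f(x) = 0.5 * sum(eigs[i] * x[i]^2)
    return [l * xi for l, xi in zip(eigs, x)]

def cyclic_gd(x, eigs, steps, n_iter):
    for t in range(n_iter):
        g = grad(x, eigs)
        lr = steps[t % len(steps)]   # cycle through the step-size schedule
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x

eigs = [0.1, 0.2, 10.0, 11.0]        # two well-separated spectral clusters
x0 = [1.0, 1.0, 1.0, 1.0]
x = cyclic_gd(x0, eigs, steps=[0.18, 0.02], n_iter=400)
```

Over each two-step cycle, the large step makes progress on the small-eigenvalue cluster while the small step keeps the large-eigenvalue cluster contractive, so every coordinate shrinks.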
  • Federated Expectation Maximization with heterogeneity mitigation and variance reduction.

    Aymeric DIEULEVEUT, Gersende FORT, Eric MOULINES, Genevieve ROBIN
    2021
    The Expectation Maximization (EM) algorithm is the default algorithm for inference in latent variable models. As in any other field of machine learning, applications of latent variable models to very large datasets make the use of advanced parallel and distributed architectures mandatory. This paper introduces FedEM, which is the first extension of the EM algorithm to the federated learning context. FedEM is a new communication efficient method, which handles partial participation of local devices, and is robust to heterogeneous distributions of the datasets. To alleviate the communication bottleneck, FedEM compresses appropriately defined complete data sufficient statistics. We also develop and analyze an extension of FedEM to further incorporate a variance reduction scheme. In all cases, we derive finite-time complexity bounds for smooth non-convex problems. Numerical results are presented to support our theoretical findings, as well as an application to federated missing values imputation for biodiversity monitoring.
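The communication pattern can be caricatured as follows, under simplifying assumptions of mine (two clients, a rand-1 sparsifier as the unbiased compression operator, and no actual EM update): each client sends a compressed version of its complete-data sufficient statistic, and the server averages the messages.

```python
import random

# Caricature of FedEM's communication pattern (names and constants are
# hypothetical): clients send unbiased rand-1 compressions of their
# sufficient statistics; the server averages them.

def rand1_compress(v, rng):
    """Unbiased rand-1 sparsification: keep one coordinate, rescale by dim."""
    d = len(v)
    i = rng.randrange(d)
    out = [0.0] * d
    out[i] = v[i] * d
    return out

def server_round(client_stats, rng):
    d = len(client_stats[0])
    agg = [0.0] * d
    for s in client_stats:
        c = rand1_compress(s, rng)
        agg = [a + ci / len(client_stats) for a, ci in zip(agg, c)]
    return agg

rng = random.Random(0)
stats = [[1.0, 2.0], [3.0, 4.0]]     # toy per-client sufficient statistics
# averaging many rounds recovers the mean statistic [2.0, 3.0] in expectation
rounds = [server_round(stats, rng) for _ in range(20000)]
avg = [sum(r[j] for r in rounds) / len(rounds) for j in range(2)]
```

The compression is unbiased by construction, which is the property that lets the analysis control the extra variance it introduces.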
  • Debiasing Stochastic Gradient Descent to handle missing values.

    Aude SPORTISSE, Claire BOYER, Aymeric DIEULEVEUT, Julie JOSSE
    2020
    A major caveat of large-scale data is their incompleteness. We propose an averaged stochastic gradient algorithm handling missing values in linear models. This approach has the merit of being free of any data distribution modeling and of accounting for heterogeneous missing proportions. In both streaming and finite-sample settings, we prove that this algorithm achieves a convergence rate of O(1/n) at iteration n, the same as without missing values. We show the convergence behavior and the relevance of the algorithm not only on synthetic data but also on real data sets, including those collected from medical registries.
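A hedged sketch of the debiasing idea for least squares with covariates missing completely at random with a known observation probability p (variable names and constants are mine; see the paper for the exact algorithm): zero-impute, reweight by 1/p, subtract a diagonal correction so the gradient stays unbiased, and average the iterates.

```python
import random

# Sketch (assumptions mine): for x_tilde_j = mask_j * x_j / p, the naive
# least-squares gradient is biased because E[x_tilde_j^2] = x_j^2 / p;
# subtracting (1 - p) * x_tilde_j^2 * beta_j restores unbiasedness.

def debiased_grad(x_tilde, y, beta, p):
    pred = sum(xt * b for xt, b in zip(x_tilde, beta))
    g = [xt * (pred - y) for xt in x_tilde]
    # diagonal correction cancelling the inflated second moment
    return [gj - (1.0 - p) * xt * xt * bj
            for gj, xt, bj in zip(g, x_tilde, beta)]

rng = random.Random(0)
p, lr, beta_true = 0.7, 0.02, [1.0, -2.0]
beta, avg = [0.0, 0.0], [0.0, 0.0]
for n in range(1, 20001):
    x = [rng.gauss(0, 1), rng.gauss(0, 1)]
    y = sum(xi * bi for xi, bi in zip(x, beta_true)) + 0.1 * rng.gauss(0, 1)
    x_tilde = [xi / p if rng.random() < p else 0.0 for xi in x]  # MCAR mask
    g = debiased_grad(x_tilde, y, beta, p)
    beta = [b - lr * gj for b, gj in zip(beta, g)]
    avg = [a + (b - a) / n for a, b in zip(avg, beta)]  # Polyak averaging
```

Despite roughly 30% of the covariate entries being missing, the averaged iterate approaches the true coefficients.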
  • Unsupervised Scalable Representation Learning for Multivariate Time Series.

    Jean-Yves FRANCESCHI, Aymeric DIEULEVEUT, Martin JAGGI
    2019
    Time series constitute a challenging data type for machine learning algorithms, due to their highly variable lengths and sparse labeling in practice. In this paper, we tackle this challenge by proposing an unsupervised method to learn universal embeddings of time series. Unlike previous works, it is scalable with respect to their length and we demonstrate the quality, transferability and practicability of the learned representations with thorough experiments and comparisons. To this end, we combine an encoder based on causal dilated convolutions with a novel triplet loss employing time-based negative sampling, obtaining general-purpose representations for variable length and multivariate time series.
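The time-based negative sampling can be sketched as follows (the encoder below is a stand-in that returns the subseries mean; the paper uses causal dilated convolutions, and all window lengths are illustrative): the positive is a subseries of a reference window from the same series, and negatives are subseries drawn from other series or positions.

```python
import math
import random

# Sketch of the triplet loss with time-based negative sampling
# (placeholder encoder and lengths; assumptions are mine).

def encode(sub):
    return [sum(sub) / len(sub)]          # 1-d embedding, placeholder encoder

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def triplet_loss(series_batch, rng):
    anchor = rng.choice(series_batch)
    i = rng.randrange(len(anchor) - 4)
    ref, pos = anchor[i:i + 4], anchor[i + 1:i + 3]  # positive inside ref
    z_ref, z_pos = encode(ref), encode(pos)
    loss = -math.log(1 / (1 + math.exp(-dot(z_ref, z_pos))))
    for other in series_batch:            # negatives from other series/times
        j = rng.randrange(len(other) - 2)
        z_neg = encode(other[j:j + 2])
        loss += -math.log(1 / (1 + math.exp(dot(z_ref, z_neg))))
    return loss

rng = random.Random(0)
batch = [[rng.gauss(0, 1) for _ in range(20)] for _ in range(3)]
loss = triplet_loss(batch, rng)
```

Each term is a negative log-sigmoid, so the loss is always positive; training pushes embeddings of a window and its subwindows together and embeddings of unrelated windows apart.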
  • Unsupervised Scalable Representation Learning for Multivariate Time Series.

    Jean-Yves FRANCESCHI, Aymeric DIEULEVEUT, Martin JAGGI
    Thirty-third Conference on Neural Information Processing Systems | 2019
  • Bridging the Gap between Constant Step Size Stochastic Gradient Descent and Markov Chains.

    Aymeric DIEULEVEUT, Alain DURMUS, Francis BACH
    2018
    We consider the minimization of an objective function given access to unbiased estimates of its gradient through stochastic gradient descent (SGD) with constant step-size. While the detailed analysis is only performed for quadratic functions, we provide an explicit asymptotic expansion of the moments of the averaged SGD iterates that outlines the dependence on initial conditions, the effect of noise and the step-size, as well as the lack of convergence in the general (non-quadratic) case. For this analysis, we bring tools from Markov chain theory into the analysis of stochastic gradient descent. We then show that Richardson-Romberg extrapolation may be used to get closer to the global optimum, and we show empirical improvements of the new extrapolation scheme.
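The extrapolation idea can be illustrated on a toy non-quadratic problem (objective and all constants are mine, not the paper's): run averaged SGD with constant steps h and h/2, then combine the two averages as 2*avg(h/2) - avg(h) to cancel the leading O(h) bias of the averaged iterate.

```python
import math
import random

# Richardson-Romberg sketch on f(x) = log(1 + e^x) - 0.3*x, whose
# minimizer satisfies sigmoid(x*) = 0.3 (illustrative setup).

def averaged_sgd(h, n, rng):
    x, avg = 0.0, 0.0
    for k in range(1, n + 1):
        # stochastic gradient: sigmoid(x) - 0.3 plus additive noise
        g = 1.0 / (1.0 + math.exp(-x)) - 0.3 + rng.gauss(0, 1)
        x -= h * g
        avg += (x - avg) / k          # Polyak-Ruppert running average
    return avg

rng = random.Random(0)
a_h = averaged_sgd(0.1, 200000, rng)
a_h2 = averaged_sgd(0.05, 200000, rng)
richardson = 2 * a_h2 - a_h           # bias reduced from O(h) to O(h^2)
x_star = math.log(0.3 / 0.7)          # exact minimizer
```

Because the averaged iterate with constant step h concentrates around a point at distance O(h) from the optimum, the linear combination of the two runs cancels that first-order term.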
  • Stochastic approximation in Hilbert spaces.

    Aymeric DIEULEVEUT
    2017
    The goal of supervised machine learning is to infer relationships between a phenomenon one seeks to predict and “explanatory” variables. To that end, multiple occurrences of the phenomenon are observed, from which a prediction rule is constructed. The last two decades have witnessed the emergence of very large data-sets, both in terms of the number of observations (e.g., in image analysis) and in terms of the number of explanatory variables (e.g., in genetics). This has raised two challenges: first, avoiding the pitfall of over-fitting, especially when the number of explanatory variables is much higher than the number of observations; and second, dealing with computational constraints, such as when the mere resolution of a linear system becomes a difficulty of its own. Algorithms that take their roots in stochastic approximation methods tackle both of these difficulties simultaneously: these stochastic methods dramatically reduce the computational cost, without degrading the quality of the proposed prediction rule, and they can naturally avoid over-fitting. As a consequence, the core of this thesis is the study of stochastic gradient methods. The popular parametric methods give predictors which are linear functions of a set of explanatory variables. However, they often result in an imprecise approximation of the underlying statistical structure. In the non-parametric setting, which is paramount in this thesis, this restriction is lifted. The class of functions from which the predictor is chosen depends on the observations. In practice, these methods have multiple purposes, and are essential for learning with non-vectorial data, which can be mapped onto a vector in a functional space using a positive definite kernel. This makes it possible to use algorithms designed for vectorial data, but requires the analysis to be carried out in the associated non-parametric space: the reproducing kernel Hilbert space.
Moreover, the analysis of non-parametric regression also sheds some light on the parametric setting when the number of predictors is much larger than the number of observations. The first contribution of this thesis is to provide a detailed analysis of stochastic approximation in the non-parametric setting, precisely in reproducing kernel Hilbert spaces. This analysis proves optimal convergence rates for the averaged stochastic gradient descent algorithm. As we take special care in using minimal assumptions, it applies to numerous situations, and covers both the setting in which the number of observations is known a priori and situations in which the learning algorithm works in an on-line fashion. The second contribution is an algorithm based on acceleration, which converges at optimal speed, both from the optimization point of view and from the statistical one. In the non-parametric setting, this can improve the convergence rate up to optimality, even in particular regimes for which the first algorithm remains sub-optimal. Finally, the third contribution of the thesis consists in an extension of the framework beyond the least-squares loss. The stochastic gradient descent algorithm is analyzed as a Markov chain. This point of view leads to an intuitive and insightful interpretation that outlines the differences between the quadratic setting and the more general setting. A simple method resulting in provable improvements in the convergence is then proposed.
  • Stochastic approximation in Hilbert spaces.

    Aymeric DIEULEVEUT, Francis BACH, Stephane BOUCHERON, Arnak S. DALALYAN, Lorenzo ROSASCO, Francois GLINEUR
    2017
    The goal of supervised learning is to infer relationships between a phenomenon that we wish to predict and "explanatory" variables. To this end, we have observations of multiple realizations of the phenomenon, from which we propose a prediction rule. The recent emergence of very large-scale data sources, both in terms of the number of observations made (in image analysis, for example) and the large number of explanatory variables (in genetics), has brought to light two difficulties: on the one hand, it becomes difficult to avoid the pitfall of over-fitting when the number of explanatory variables is much greater than the number of observations; on the other hand, the algorithmic aspect becomes a determining factor, because the resolution of a linear system in the spaces involved can become a major difficulty. Algorithms derived from stochastic approximation methods offer a simultaneous answer to these two difficulties: the use of a stochastic method drastically reduces the algorithmic cost without degrading the quality of the proposed prediction rule, while avoiding over-fitting. In particular, the focus of this thesis is on stochastic gradient methods. The very popular parametric methods propose as predictors linear functions of a chosen set of explanatory variables. However, these methods often result in an imprecise approximation of the underlying statistical structure. In the non-parametric framework, which is one of the central themes of this thesis, the restriction to linear predictors is lifted. The class of functions in which the predictor is constructed depends on the observations. In practice, non-parametric methods are crucial for various applications, in particular for the analysis of non-vector data, which can be associated with a vector in a functional space via a positive definite kernel.
This allows the use of algorithms designed for vector data, but requires an understanding of these algorithms in the associated non-parametric space: the reproducing kernel Hilbert space. Moreover, the analysis of non-parametric estimation also provides a revealing insight into the parametric framework when the number of predictors greatly exceeds the number of observations. The first contribution of this thesis consists in a detailed analysis of stochastic approximation in the non-parametric framework, in particular in reproducing kernel Hilbert spaces. This analysis yields optimal convergence rates for the averaged stochastic gradient descent algorithm. The proposed analysis applies to many settings, and particular attention is paid to the use of minimal assumptions, as well as to the study of settings where the number of observations is known in advance or can evolve. The second contribution is an algorithm, based on an acceleration principle, which converges at an optimal speed from both the optimization and the statistical points of view. In the non-parametric framework, this improves the convergence up to the optimal rate in some regimes for which the first algorithm remained sub-optimal. Finally, the third contribution of the thesis consists in extending the studied framework beyond the least-squares loss: the stochastic gradient descent algorithm is analyzed as a Markov chain. This approach leads to an intuitive interpretation and highlights the differences between the quadratic framework and the general framework. A simple method that substantially improves the convergence is also proposed.
  • Harder, Better, Faster, Stronger Convergence Rates for Least-Squares Regression.

    Aymeric DIEULEVEUT, Nicolas FLAMMARION, Francis BACH
    Journal of Machine Learning Research | 2017
    We consider the optimization of a quadratic objective function whose gradients are only accessible through a stochastic oracle that returns the gradient at any given point plus a zero-mean finite-variance random error. We present the first algorithm that jointly achieves the optimal prediction error rates for least-squares regression, both in terms of forgetting of initial conditions in O(1/n^2), and in terms of dependence on the noise and dimension d of the problem, as O(d/n). Our new algorithm is based on averaged accelerated regularized gradient descent, and may also be analyzed through finer assumptions on initial conditions and the Hessian matrix, leading to dimension-free quantities that may still be small while the "optimal" terms above are large. In order to characterize the tightness of these new bounds, we consider an application to non-parametric regression and use the known lower bounds on the statistical performance (without computational limits), which happen to match our bounds obtained from a single pass on the data and thus show the optimality of our algorithm in a wide variety of particular trade-offs between bias and variance.
  • Non-parametric Stochastic Approximation with Large Step sizes.

    Aymeric DIEULEVEUT, Francis BACH
    2015
    We consider the random-design least-squares regression problem within the reproducing kernel Hilbert space (RKHS) framework. Given a stream of independent and identically distributed input/output data, we aim to learn a regression function within an RKHS $\mathcal{H}$, even if the optimal predictor (i.e., the conditional expectation) is not in $\mathcal{H}$. In a stochastic approximation framework where the estimator is updated after each observation, we show that the averaged unregularized least-mean-square algorithm (a form of stochastic gradient), given a sufficiently large step-size, attains optimal rates of convergence for a variety of regimes for the smoothness of the optimal prediction function and the functions in $\mathcal{H}$.
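The averaged, unregularized least-mean-squares rule in an RKHS can be sketched as follows (the Gaussian kernel, bandwidth, step size, and target function are illustrative choices of mine): each observation adds one kernel function centred at the new input, with a coefficient proportional to the current residual, and the iterates are averaged.

```python
import math
import random

# Sketch of averaged kernel least-mean-squares with a constant
# ("large") step size; all constants are illustrative.

def kernel(u, v):
    return math.exp(-8.0 * (u - v) ** 2)

rng = random.Random(0)
gamma = 0.5                            # constant step size
centers, weights, avg_w = [], [], []
for k in range(1, 1001):
    x = rng.random()
    y = math.sin(2 * math.pi * x) + 0.1 * rng.gauss(0, 1)
    pred = sum(w * kernel(c, x) for w, c in zip(weights, centers))
    centers.append(x)
    weights.append(gamma * (y - pred))  # SGD step in the RKHS
    # coefficient-wise Polyak averaging of the iterates
    avg_w = [a + (w - a) / k for a, w in zip(avg_w, weights)]
    avg_w.append(weights[-1] / k)

def predict(x):
    return sum(w * kernel(c, x) for w, c in zip(avg_w, centers))
```

Without any explicit regularization, the averaging of the iterates is what controls over-fitting in this scheme; the averaged predictor approximates the regression function sin(2*pi*x) on [0, 1].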
Affiliations are detected from the signatures of publications identified in scanR. An author can therefore appear to be affiliated with several structures or supervisors according to these signatures. The dates displayed correspond only to the dates of the publications found. For more information, see https://scanr.enseignementsup-recherche.gouv.fr