FLAMMARION Nicolas

Affiliations
  • 2020 - 2021
    École Polytechnique Fédérale de Lausanne
  • 2016 - 2017
    Communauté d'universités et établissements Université de Recherche Paris Sciences et Lettres
  • 2014 - 2017
    Apprentissage statistique et parcimonie
  • 2016 - 2017
    Sciences Mathématiques de Paris Centre
  • 2016 - 2017
    École normale supérieure, Paris
  • 2014 - 2017
    Département d'Informatique de l'École Normale Supérieure
  • A Continuized View on Nesterov Acceleration.

    Raphael BERTHIER, Francis BACH, Nicolas FLAMMARION, Pierre GAILLARD, Adrien TAYLOR
    2021
    We introduce the "continuized" Nesterov acceleration, a close variant of Nesterov acceleration whose variables are indexed by a continuous time parameter. The two variables continuously mix following a linear ordinary differential equation and take gradient steps at random times. This continuized variant benefits from the best of the continuous and the discrete frameworks: as a continuous process, one can use differential calculus to analyze convergence and obtain analytical expressions for the parameters, while a discretization of the continuized process can be computed exactly, with convergence rates similar to those of Nesterov's original acceleration. We show that the discretization has the same structure as Nesterov acceleration, but with random parameters.
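    As a rough illustration of the structure described in this abstract, the sketch below simulates two coupled variables that mix along a linear ODE between gradient steps taken at the arrival times of a rate-one Poisson process, applied to a synthetic least-squares objective. The mixing rate eta, the step size gamma and the simplified dynamics are illustrative assumptions, not the tuned parameters or the exact dynamics analyzed in the paper.

    ```python
    # Toy "continuized" structure: two variables mix along a linear ODE and take
    # gradient steps at Poisson arrival times. Heuristic parameters, NOT the paper's.
    import numpy as np

    rng = np.random.default_rng(0)

    # Least-squares objective f(x) = 0.5 * ||A x - b||^2
    A = rng.normal(size=(50, 10))
    b = rng.normal(size=50)
    L = np.linalg.norm(A, 2) ** 2          # smoothness constant (spectral norm squared)
    grad = lambda x: A.T @ (A @ x - b)

    x = z = np.zeros(10)
    t, T = 0.0, 200.0
    eta = 1.0          # mixing rate of the linear ODE (heuristic assumption)
    gamma = 1.0 / L    # gradient step size at Poisson times (standard 1/L step)

    while t < T:
        dt = rng.exponential(1.0)          # inter-arrival time of a rate-1 Poisson process
        # Between gradient times, x relaxes toward z following the simplified ODE
        #   dx/dt = eta (z - x),  dz/dt = 0   (assumption, solved in closed form)
        x = z + np.exp(-eta * dt) * (x - z)
        # At the Poisson arrival time, both variables take a gradient step at x
        g = grad(x)
        x = x - gamma * g
        z = z - gamma * g
        t += dt

    print("f(x) =", 0.5 * np.linalg.norm(A @ x - b) ** 2)
    ```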
  • Averaging Stochastic Gradient Descent on Riemannian Manifolds.

    Nilesh TRIPURANENI, Nicolas FLAMMARION, Francis BACH, Michael I. JORDAN
    Computational Learning Theory (COLT) | 2018
    We consider the minimization of a function defined on a Riemannian manifold $\mathcal{M}$ accessible only through unbiased estimates of its gradients. We develop a geometric framework to transform a sequence of slowly converging iterates generated from stochastic gradient descent (SGD) on $\mathcal{M}$ to an averaged iterate sequence with a robust and fast $O(1/n)$ convergence rate. We then present an application of our framework to geodesically-strongly-convex (and possibly Euclidean non-convex) problems. Finally, we demonstrate how these ideas apply to the case of streaming $k$-PCA, where we show how to accelerate the slow rate of the randomized power method (without requiring knowledge of the eigengap) into a robust algorithm achieving the optimal rate of convergence.
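    A rough sketch of the streaming $k$-PCA special case mentioned at the end of the abstract, with $k = 1$: Riemannian SGD on the unit sphere (an Oja-style update followed by renormalization) and a naive averaging step that normalizes the running Euclidean mean of the iterates. The sampling model, step sizes and this simple averaging rule are illustrative assumptions rather than the paper's exact geometric averaging scheme.

    ```python
    # Streaming top-eigenvector estimation (k = 1 PCA) with SGD on the unit sphere,
    # followed by a naive averaging step (normalize the running Euclidean mean).
    import numpy as np

    rng = np.random.default_rng(0)
    d, n = 20, 20000

    # Ground-truth covariance with a dominant direction u_true
    u_true = rng.normal(size=d); u_true /= np.linalg.norm(u_true)
    cov_sqrt = np.eye(d) + 2.0 * np.outer(u_true, u_true)

    w = rng.normal(size=d); w /= np.linalg.norm(w)   # iterate on the sphere
    w_bar = np.zeros(d)                              # running ambient-space average

    for i in range(1, n + 1):
        a = cov_sqrt @ rng.normal(size=d)            # sample with covariance cov_sqrt @ cov_sqrt
        g = (a @ w) * a                              # stochastic ascent direction for the Rayleigh quotient
        g = g - (w @ g) * w                          # project onto the tangent space at w
        w = w + (1.0 / np.sqrt(i)) * g               # Riemannian SGD step ...
        w /= np.linalg.norm(w)                       # ... followed by retraction to the sphere
        w_bar += (w - w_bar) / i                     # ambient running average of the iterates

    w_avg = w_bar / np.linalg.norm(w_bar)            # map the average back to the sphere
    print("alignment (last iterate):", abs(w @ u_true))
    print("alignment (averaged):    ", abs(w_avg @ u_true))
    ```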
  • Harder, Better, Faster, Stronger Convergence Rates for Least-Squares Regression.

    Aymeric DIEULEVEUT, Nicolas FLAMMARION, Francis BACH
    Journal of Machine Learning Research | 2017
    We consider the optimization of a quadratic objective function whose gradients are only accessible through a stochastic oracle that returns the gradient at any given point plus a zero-mean finite-variance random error. We present the first algorithm that achieves jointly the optimal prediction error rates for least-squares regression, both in terms of forgetting of initial conditions in $O(1/n^2)$, and in terms of dependence on the noise and dimension $d$ of the problem, as $O(d/n)$. Our new algorithm is based on averaged accelerated regularized gradient descent, and may also be analyzed through finer assumptions on initial conditions and the Hessian matrix, leading to dimension-free quantities that may still be small while the "optimal" terms above are large. In order to characterize the tightness of these new bounds, we consider an application to non-parametric regression and use the known lower bounds on the statistical performance (without computational limits), which happen to match our bounds obtained from a single pass on the data and thus show optimality of our algorithm in a wide variety of particular trade-offs between bias and variance.
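    To make the setting concrete, here is a minimal sketch of the averaging ingredient on synthetic least squares: constant step-size SGD with Polyak-Ruppert averaging over a single pass. The accelerated, regularized part of the paper's algorithm is deliberately omitted, and the data model and step-size constant are generic assumptions, not the paper's.

    ```python
    # Constant step-size SGD for least squares with Polyak-Ruppert averaging, one pass.
    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 100_000, 20
    theta_true = rng.normal(size=d)
    X = rng.normal(size=(n, d))
    y = X @ theta_true + 0.5 * rng.normal(size=n)      # y = <x, theta*> + noise

    R2 = np.mean(np.sum(X ** 2, axis=1))               # estimate of E||x||^2
    gamma = 1.0 / (4.0 * R2)                           # constant step size (generic choice)

    theta = np.zeros(d)
    theta_bar = np.zeros(d)                            # Polyak-Ruppert average

    for k in range(n):                                 # a single pass over the data
        x_k, y_k = X[k], y[k]
        theta -= gamma * (theta @ x_k - y_k) * x_k     # SGD step on 0.5*(<x,theta> - y)^2
        theta_bar += (theta - theta_bar) / (k + 1)     # running average of the iterates

    print("parameter error of averaged iterate:", np.linalg.norm(theta_bar - theta_true))
    ```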
  • Stochastic approximation and least-squares regression, with applications to machine learning.

    Nicolas FLAMMARION, Francis BACH, Alexandre d'ASPREMONT, Eric MOULINES, Jérôme BOLTE, Arnak S. DALALYAN
    2017
    Many problems in machine learning involve minimizing a smooth function on a Euclidean space. For supervised learning, this includes least-squares and logistic regression. While small-scale problems are efficiently solved by many optimization algorithms, large-scale problems require first-order methods derived from gradient descent. In this manuscript, we consider the special case of the quadratic loss. In the first part, we propose to minimize it with access only to a stochastic oracle. In the second part, we consider two of its applications to machine learning: clustering and shape-constrained estimation. The first contribution is a unified framework for the optimization of non-strongly convex quadratic functions, which includes accelerated gradient descent and averaged gradient descent. This new framework suggests an alternative algorithm that combines the positive aspects of averaging and acceleration. The second contribution is to obtain the optimal prediction error rate for least-squares regression, both in terms of dependence on the noise of the problem and of forgetting of the initial conditions. Our new algorithm is based on accelerated and averaged gradient descent. The third contribution deals with the minimization of composite functions, sums of the expectation of quadratic functions and a convex regularizer. We extend the existing results for least squares to any regularizer and to the different geometries induced by a Bregman divergence. In the fourth contribution, we consider the discriminative clustering problem. We propose its first theoretical analysis, a sparse extension, an extension to the multi-label case, and a new algorithm with better complexity than existing methods. The last contribution of this thesis considers the problem of seriation. We adopt a statistical approach in which the matrix is observed with noise, and we study the minimax estimation rates. We also propose a computationally efficient estimator.
  • Robust Discriminative Clustering with Sparse Regularizers.

    Nicolas FLAMMARION, Balamurugan PALANIAPPAN, Francis BACH
    Journal of Machine Learning Research | 2017
    Clustering high-dimensional data often requires some form of dimensionality reduction, where clustered variables are separated from "noise-looking" variables. We cast this problem as finding a low-dimensional projection of the data which is well-clustered. This yields a one-dimensional projection in the simplest situation with two clusters, and extends naturally to a multi-label scenario for more than two clusters. In this paper, (a) we first show that this joint clustering and dimension reduction formulation is equivalent to previously proposed discriminative clustering frameworks, thus leading to convex relaxations of the problem; (b) we propose a novel sparse extension, which is still cast as a convex relaxation and allows estimation in higher dimensions; (c) we propose a natural extension for the multi-label scenario; (d) we provide a new theoretical analysis of the performance of these formulations with a simple probabilistic model, leading to scalings of the form $d = O(\sqrt{n})$ for the affine-invariant case and $d = O(n)$ for the sparse case, where $n$ is the number of examples and $d$ the ambient dimension; and finally, (e) we propose an efficient iterative algorithm with running-time complexity proportional to $O(n d^2)$, improving on earlier algorithms which had quadratic complexity in the number of examples.
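    For intuition only, the toy sketch below runs a naive alternating-minimization heuristic for the two-cluster case: fit a sparse (l1-regularized) linear projection to the current ±1 labels, then re-assign labels by the sign of the projection, starting from a spectral initialization. The data model, the Lasso penalty strength and the initialization are illustrative assumptions; this is not the convex relaxation analyzed in the paper.

    ```python
    # Naive alternating heuristic for joint sparse projection + two-cluster labeling.
    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    n, d, d_useful = 200, 50, 3

    # Two clusters separated only along a few "useful" coordinates; the rest is noise.
    labels_true = rng.choice([-1.0, 1.0], size=n)
    X = rng.normal(size=(n, d))
    X[:, :d_useful] += 2.0 * labels_true[:, None]

    # Spectral initialization: sign of the projection onto the top right singular vector
    Xc = X - X.mean(axis=0)
    y = np.sign(Xc @ np.linalg.svd(Xc, full_matrices=False)[2][0])
    y[y == 0] = 1.0

    for _ in range(20):
        w = Lasso(alpha=0.1).fit(X, y).coef_   # sparse projection fitted to current labels
        y_new = np.sign(X @ w)
        y_new[y_new == 0] = 1.0
        if np.array_equal(y_new, y):
            break
        y = y_new

    accuracy = max(np.mean(y == labels_true), np.mean(y == -labels_true))
    print("clustering accuracy:", accuracy, "| nonzero coords in w:", int(np.sum(w != 0)))
    ```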
  • Stochastic Composite Least-Squares Regression with Convergence Rate O(1/n).

    Nicolas FLAMMARION, Francis BACH
    Proceedings of the 30th Conference on Learning Theory (COLT) | 2017
    We consider the optimization of a quadratic objective function whose gradients are only accessible through a stochastic oracle that returns the gradient at any given point plus a zero-mean finite-variance random error. We present the first algorithm that achieves jointly the optimal prediction error rates for least-squares regression, both in terms of forgetting of initial conditions in $O(1/n^2)$, and in terms of dependence on the noise and dimension $d$ of the problem, as $O(d/n)$. Our new algorithm is based on averaged accelerated regularized gradient descent, and may also be analyzed through finer assumptions on initial conditions and the Hessian matrix, leading to dimension-free quantities that may still be small while the "optimal" terms above are large. In order to characterize the tightness of these new bounds, we consider an application to non-parametric regression and use the known lower bounds on the statistical performance (without computational limits), which happen to match our bounds obtained from a single pass on the data and thus show optimality of our algorithm in a wide variety of particular trade-offs between bias and variance.
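    As a generic illustration of the composite setting named in the title (an expectation of quadratic losses plus a convex regularizer), the sketch below runs textbook stochastic proximal-gradient steps for l1-regularized least squares with averaging of the iterates. The step size, regularization strength and data model are assumptions for the demo; this is not the specific algorithm or rate analysis of the paper.

    ```python
    # Textbook stochastic proximal gradient for l1-regularized least squares, with averaging.
    import numpy as np

    def soft_threshold(v, tau):
        """Proximal operator of tau * ||.||_1."""
        return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

    rng = np.random.default_rng(0)
    n, d = 50_000, 30
    theta_true = np.zeros(d)
    theta_true[:5] = rng.normal(size=5)                 # sparse ground truth
    X = rng.normal(size=(n, d))
    y = X @ theta_true + 0.3 * rng.normal(size=n)

    lam = 0.01                                          # l1 regularization strength (assumption)
    gamma = 1.0 / (4.0 * np.mean(np.sum(X ** 2, axis=1)))   # constant step size

    theta = np.zeros(d)
    theta_bar = np.zeros(d)
    for k in range(n):
        x_k, y_k = X[k], y[k]
        g = (theta @ x_k - y_k) * x_k                   # stochastic gradient of the quadratic part
        theta = soft_threshold(theta - gamma * g, gamma * lam)   # proximal (soft-threshold) step
        theta_bar += (theta - theta_bar) / (k + 1)      # averaged iterate

    print("parameter error:", np.linalg.norm(theta_bar - theta_true))
    print("nonzeros in averaged iterate:", int(np.sum(np.abs(theta_bar) > 1e-3)))
    ```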
  • Stochastic approximation and least-squares regression, with applications to machine learning.

    Nicolas FLAMMARION
    2017
    Many problems in machine learning are naturally cast as the minimization of a smooth function defined on a Euclidean space. For supervised learning, this includes least-squares regression and logistic regression. While small problems are efficiently solved by classical optimization algorithms, large-scale problems are typically solved with first-order techniques based on gradient descent. In this manuscript, we consider the particular case of the quadratic loss. In the first part, we are interested in its minimization when its gradients are only accessible through a stochastic oracle. In the second part, we consider two applications of the quadratic loss in machine learning: clustering and estimation with shape constraints. In the first main contribution, we provide a unified framework for optimizing non-strongly convex quadratic functions, which encompasses accelerated gradient descent and averaged gradient descent. This new framework suggests an alternative algorithm that exhibits the positive behavior of both averaging and acceleration. The second main contribution aims at obtaining the optimal prediction error rates for least-squares regression, both in terms of dependence on the noise of the problem and of forgetting the initial conditions. Our new algorithm rests upon averaged accelerated gradient descent. The third main contribution deals with the minimization of composite objective functions composed of the expectation of quadratic functions and a convex function. We extend earlier results on least-squares regression to any regularizer and any geometry represented by a Bregman divergence. As a fourth contribution, we consider the discriminative clustering framework. We propose its first theoretical analysis, a novel sparse extension, a natural extension to the multi-label scenario, and an efficient iterative algorithm with better running-time complexity than existing methods. The fifth main contribution deals with the seriation problem. We propose a statistical approach to this problem where the matrix is observed with noise and study the corresponding minimax rate of estimation. We also suggest a computationally efficient estimator whose performance is studied both theoretically and experimentally.
  • Optimal Rates of Statistical Seriation.

    Nicolas FLAMMARION, Cheng MAO, Philippe RIGOLLET
    2016
    Given a matrix, the seriation problem consists in permuting its rows in such a way that all its columns have the same shape; for example, they are all monotone increasing. We propose a statistical approach to this problem where the matrix of interest is observed with noise and study the corresponding minimax rate of estimation. Specifically, when the columns are either unimodal or monotone, we show that the least-squares estimator is optimal up to logarithmic factors and adapts to matrices with a certain natural structure. Finally, we propose a computationally efficient estimator in the monotonic case and study its performance both theoretically and experimentally. Our work is at the intersection of shape-constrained estimation and recent work on permutation learning, such as graph denoising and ranking.
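    The small sketch below illustrates the monotone case in a simple way: order the rows of the noisy matrix by their sums, then apply isotonic regression column by column under that order. This pipeline and its parameters are illustrative guesses in the spirit of the efficient estimator mentioned above, not necessarily the paper's exact procedure.

    ```python
    # Simple seriation-style denoising: sort rows by row sums, then isotonic fit per column.
    import numpy as np
    from sklearn.isotonic import IsotonicRegression

    rng = np.random.default_rng(0)
    n_rows, n_cols = 60, 8

    # Ground truth: every column is monotone increasing in the (unknown) row order.
    base = np.sort(rng.uniform(0, 5, size=(n_rows, n_cols)), axis=0)
    perm = rng.permutation(n_rows)                     # unknown permutation of the rows
    Y = base[perm] + 0.5 * rng.normal(size=(n_rows, n_cols))   # noisy, shuffled observation

    # Step 1: estimate the row order by sorting on row sums.
    order = np.argsort(Y.sum(axis=1))
    # Step 2: isotonic regression column by column under the estimated order.
    iso = IsotonicRegression()
    est_sorted = np.column_stack(
        [iso.fit_transform(np.arange(n_rows), Y[order, j]) for j in range(n_cols)]
    )
    est = np.empty_like(est_sorted)
    est[order] = est_sorted                            # undo the sort to match Y's row indexing

    print("MSE of raw observation:", np.mean((Y - base[perm]) ** 2))
    print("MSE of seriation fit:  ", np.mean((est - base[perm]) ** 2))
    ```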
  • From Averaging to Acceleration, There is Only a Step-size.

    Nicolas FLAMMARION, Francis BACH
    Proceedings of the 28th Conference on Learning Theory (COLT) | 2015
    We show that accelerated gradient descent, averaged gradient descent and the heavy-ball method for non-strongly-convex problems may be reformulated as constant-parameter second-order difference equation algorithms, where stability of the system is equivalent to convergence at rate $O(1/n^2)$, where $n$ is the number of iterations. We provide a detailed analysis of the eigenvalues of the corresponding linear dynamical system, showing various oscillatory and non-oscillatory behaviors, together with a sharp stability result with explicit constants. We also consider the situation where noisy gradients are available, where we extend our general convergence result, which suggests an alternative algorithm (i.e., with different step sizes) that exhibits the good aspects of both averaging and acceleration.
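    In the spirit of this reformulation, the sketch below runs a constant-parameter second-order recursion, theta_{k+1} = theta_k + delta*(theta_k - theta_{k-1}) - gamma*grad f(theta_k), on a random quadratic and compares a momentum-free choice with a heavy-ball-style choice. The recursion form and the (delta, gamma) values are illustrative assumptions, not the paper's exact reformulation or tuned parameters.

    ```python
    # Constant-parameter second-order recursion on a quadratic: with and without momentum.
    import numpy as np

    rng = np.random.default_rng(0)
    d = 30
    H = rng.normal(size=(d, d))
    H = H.T @ H / d                                  # positive semi-definite Hessian
    L = np.linalg.eigvalsh(H).max()                  # smoothness constant
    theta_star = rng.normal(size=d)
    grad = lambda th: H @ (th - theta_star)          # gradient of 0.5*(th-th*)^T H (th-th*)

    def run(delta, gamma, n_iter=500):
        prev = curr = np.zeros(d)
        for _ in range(n_iter):
            # second-order difference equation with constant coefficients
            curr, prev = curr + delta * (curr - prev) - gamma * grad(curr), curr
        diff = curr - theta_star
        return 0.5 * diff @ H @ diff                 # final function value

    print("no momentum   (delta=0.0):", run(0.0, 1.0 / L))
    print("with momentum (delta=0.9):", run(0.9, 1.0 / L))
    ```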
Affiliations are detected from the signatures of publications identified in scanR. An author can therefore appear to be affiliated with several structures or supervisors according to these signatures. The dates displayed correspond only to the dates of the publications found. For more information, see https://scanr.enseignementsup-recherche.gouv.fr