GAIFFAS Stephane

Affiliations
  • 2014 - 2019
    Centre de mathématiques appliquées
  • 2004 - 2019
    Laboratoire de probabilités et modèles aléatoires
  • 2018 - 2020
    Département de mathématiques et applications de l'ENS
  • 2014 - 2019
    Détermination de Formes Et Identification
  • 2004 - 2005
    Université Paris Diderot
  • AMF: Aggregated Mondrian forests for online learning.

    Jaouad MOURTADA, Stephane GAIFFAS, Erwan SCORNET
    Journal of the Royal Statistical Society: Series B (Statistical Methodology) | 2021
    No summary available.
  • Minimax optimal rates for Mondrian trees and forests.

    Jaouad MOURTADA, Stephane GAIFFAS, Erwan SCORNET
    The Annals of Statistics | 2020
    No summary available.
  • AMF: Aggregated Mondrian Forests for Online Learning.

    Jaouad MOURTADA, Stephane GAIFFAS, Erwan SCORNET
    2020
    Random Forests (RF) is one of the algorithms of choice in many supervised learning applications, be it classification or regression. The appeal of such tree-ensemble methods comes from a combination of several characteristics: remarkable accuracy in a variety of tasks, a small number of parameters to tune, robustness with respect to feature scaling, a reasonable computational cost for training and prediction, and their suitability in high-dimensional settings. The most commonly used RF variants, however, are "offline" algorithms, which require the availability of the whole dataset at once. In this paper, we introduce AMF, an online random forest algorithm based on Mondrian Forests. Using a variant of the Context Tree Weighting algorithm, we show that it is possible to efficiently perform an exact aggregation over all prunings of the trees. In particular, this yields a truly online parameter-free algorithm which is competitive with the optimal pruning of the Mondrian tree, and thus adaptive to the unknown regularity of the regression function. Numerical experiments show that AMF is competitive with respect to several strong baselines on a large number of datasets for multi-class classification.
  • An improper estimator with optimal excess risk in misspecified density estimation and logistic regression.

    Jaouad MOURTADA, Stephane GAIFFAS
    2020
    We introduce a procedure for predictive conditional density estimation under logarithmic loss, which we call SMP (Sample Minmax Predictor). This estimator minimizes a new general excess risk bound for supervised statistical learning. On standard examples, this bound scales as $d/n$ with $d$ the model dimension and $n$ the sample size, and critically remains valid under model misspecification. Being an improper (out-of-model) procedure, SMP improves over within-model estimators such as the maximum likelihood estimator, whose excess risk degrades under misspecification. Compared to approaches reducing to the sequential problem, our bounds remove suboptimal $\log n$ factors, addressing an open problem from Grünwald and Kotlowski for the considered models, and can handle unbounded classes. For the Gaussian linear model, the predictions and risk bound of SMP are governed by leverage scores of covariates, nearly matching the optimal risk in the well-specified case without conditions on the noise variance or approximation error of the linear model. For logistic regression, SMP provides a non-Bayesian approach to calibration of probabilistic predictions relying on virtual samples, and can be computed by solving two logistic regressions. It achieves a non-asymptotic excess risk of $O((d + B^2 R^2)/n)$, where $R$ bounds the norm of features and $B$ that of the comparison parameter. By contrast, no within-model estimator can achieve a better rate than $\min(BR/\sqrt{n}, d e^{BR}/n)$ in general. This provides a computationally more efficient alternative to Bayesian approaches, which require approximate posterior sampling, thereby partly answering a question by Foster et al. (2018).
  • Machine Learning and Massive Health Data.

    Emmanuel BACRY, Stephane GAIFFAS
    Healthcare and Artificial Intelligence | 2020
    No summary available.
  • ZiMM: A deep learning model for long term and blurry relapses with non-clinical claims data.

    Anastasiia KABESHOVA, Yiyang YU, Bertrand LUKACS, Emmanuel BACRY, Stephane GAIFFAS
    Journal of Biomedical Informatics | 2020
    No summary available.
  • Sparse and low-rank multivariate Hawkes processes.

    Emmanuel BACRY, Martin BOMPAIRE, Stephane GAIFFAS, Jean francois MUZY
    Journal of Machine Learning Research | 2020
    We consider the problem of unveiling the implicit network structure of node interactions (such as user interactions in a social network), based only on high-frequency timestamps. Our inference is based on the minimization of the least-squares loss associated with a multivariate Hawkes model, penalized by the L1 and trace norms of the interaction tensor. We provide a first theoretical analysis for this problem, which includes sparsity and low-rank inducing penalizations. This result involves a new data-driven concentration inequality for matrix martingales in continuous time with observable variance, which is a result of independent interest with a broad range of possible applications, since it extends to matrix martingales earlier results restricted to the scalar case. A consequence of our analysis is the construction of sharply tuned L1 and trace-norm penalizations, which lead to a data-driven scaling of the variability of information available for each user. Numerical experiments illustrate the significant improvements achieved by the use of such data-driven penalizations.
  • SCALPEL3: A scalable open-source library for healthcare claims databases.

    Emmanuel BACRY, Stephane GAIFFAS, Fanny LEROY, Maryan MOREL, Dinh phong NGUYEN, Youcef SEBIAT, Dian SUN
    International Journal of Medical Informatics | 2020
    No summary available.
  • On the optimality of the Hedge algorithm in the stochastic regime.

    Jaouad MOURTADA, Stephane GAIFFAS
    Journal of Machine Learning Research | 2019
    In this paper, we study the behavior of the Hedge algorithm in the online stochastic setting. We prove that anytime Hedge with decreasing learning rate, which is one of the simplest algorithms for the problem of prediction with expert advice, is remarkably both worst-case optimal and adaptive to the easier stochastic and adversarial-with-a-gap problems. This shows that, in spite of its small, non-adaptive learning rate, Hedge possesses the same optimal regret guarantee in the stochastic case as recently introduced adaptive algorithms. Moreover, our analysis exhibits qualitative differences with other versions of the Hedge algorithm, such as the fixed-horizon variant (with constant learning rate) and the one based on the so-called "doubling trick", both of which fail to adapt to the easier stochastic setting. Finally, we determine the intrinsic limitations of anytime Hedge in the stochastic case, and discuss the improvements provided by more adaptive algorithms.
  • Sparse inference of the drift of a high-dimensional Ornstein–Uhlenbeck process.

    Stephane GAIFFAS, Gustaw MATULEWICZ
    Journal of Multivariate Analysis | 2019
    No summary available.
  • Statistical learning from non-uniformed categorical variables.

    Patricio CERDA REYES, Gael VAROQUAUX, Marc SCHOENAUER, Laurent CHARLIN, Stephane GAIFFAS, Charles BOUVEYRON, Patrick VALDURIEZ, Balazs KEGL
    2019
    Tabular data often contain categorical variables, which are considered non-numerical inputs with a fixed and limited number of unique elements, called categories. Many statistical learning algorithms require a numerical representation of the categorical variables. An encoding step is therefore necessary to transform these inputs into vectors. For this purpose, several strategies exist, the most common of which is one-hot encoding, which works well in the context of classical statistical analysis (in terms of predictive power and interpretation) when the number of categories remains low. However, non-uniformed categorical data present the risk of having a high cardinality and redundancies. Indeed, entries may share semantic and/or morphological information, and therefore, several entries may reflect the same entity. Without a cleaning or aggregation step beforehand, common encoding methods may lose efficiency due to an erroneous vector representation. Moreover, the risk of obtaining very high-dimensional vectors increases with the amount of data, which prevents their use in big data analysis. In this thesis, we study a series of encoding methods that allow us to work directly on high-cardinality categorical variables, without the need to process them in advance. Using experiments conducted on real and simulated data, we demonstrate that the proposed methods improve supervised learning, in part because of their ability to correctly capture morphological information from the inputs. Even with large data, these methods prove to be efficient, and in some cases, they generate vectors that are easily interpretable. Therefore, our methods can be applied to statistical machine learning (AutoML) without any human intervention.
  • Differentiating asthma from chronic obstructive pulmonary disease (COPD) in medico-economic databases: myth or reality?

    Milka MARAVIC, Raphael SIGOGNE, Arnaud BOURDIN, Nicolas ROCHE, Sara MOUNIR, Dejan MILIC, Morgan GEOFFROY, Stephane GAIFFAS, Emmanuel BACRY
    Epidemiology | 2019
    No summary available.
  • Machine learning based on Hawkes processes and stochastic optimization.

    Martin BOMPAIRE, Emmanuel BACRY, Stephane GAIFFAS, Alexandre GRAMFORT, Julien MAIRAL, Niels Richard HANSEN, Guillaume GARRIGOS
    2019
    The common thread of this thesis is the study of Hawkes processes. These point processes decipher the inter-causality that can occur between several series of events. Concretely, they determine the influence that the events of one series have on the future events of all other series. For example, in the context of social networks, they describe how likely a user's action, such as a Tweet, is to trigger reactions from others. The first chapter is a brief introduction to point processes, followed by a deeper look at Hawkes processes and in particular the properties of the most commonly used exponential kernel parameterization. In the next chapter, we introduce an adaptive penalty to model, with Hawkes processes, the propagation of information in social networks. This penalty is able to take into account a priori knowledge of the characteristics of these networks, such as sparse interactions between users or community structure, and reflect them in the estimated model. Our technique uses weighted penalties whose weights are determined by a fine-grained analysis of the generalization error. Next, we discuss convex optimization and the progress made with first-order stochastic methods with variance reduction. The fourth chapter is dedicated to the adaptation of these techniques to optimize the data-fitting term most commonly used with Hawkes processes. Indeed, this function does not satisfy the gradient-Lipschitz assumption usually made. Thus, we work with another regularity assumption, and obtain a linear convergence rate for a lagged version of Stochastic Dual Coordinate Ascent that improves the state of the art. Moreover, such functions have many linear constraints that are frequently violated by classical first-order algorithms, but in their dual version these constraints are much easier to satisfy. Thus, the robustness of our algorithm is more comparable to that of second-order methods, which are prohibitively expensive in high dimensions. Finally, the last chapter presents a new statistical learning library for Python 3 with a particular focus on temporal models. Called tick, this library relies on a C++ implementation and state-of-the-art optimization algorithms to perform very fast estimates in a multi-core environment. Published on GitHub, this library has been used throughout this thesis to perform experiments.
  • Self-exclusion in online poker gamblers: effect on time and money as compared to matched controls.

    Aline DUGRAVOT, Henri PANJO, Amine BENYAMINA, Stephane GAIFFAS, Emmanuel BACRY, Amandine LUQUIENS
    International Journal of Environmental Research and Public Health | 2019
    No summary available.
  • ConvSCCS: convolutional self-controlled case series model for lagged adverse event detection.

    Maryan MOREL, Emmanuel BACRY, Stephane GAIFFAS, Agathe GUILLOUX, Fanny LEROY
    Biostatistics | 2019
    With the increased availability of large electronic health records databases comes the chance of enhancing health risk screening. Most post-marketing detection of adverse drug reactions (ADR) relies on physicians' spontaneous reports, leading to under-reporting. To take up this challenge, we develop a scalable model to estimate the effect of multiple longitudinal features (drug exposures) on a rare longitudinal outcome. Our procedure is based on a conditional Poisson regression model also known as self-controlled case series (SCCS). To overcome the need for a precise specification of risk periods, we model the intensity of outcomes using a convolution between exposures and step functions, which are penalized using a combination of group-Lasso and total-variation. To our knowledge, this is the first SCCS model with flexible intensity able to handle multiple longitudinal features in a single model. We show that this approach improves the state of the art in terms of mean absolute error and computation time for the estimation of relative risks on simulated data. We apply this method to an ADR detection problem, using a cohort of diabetic patients extracted from the large French national health insurance database (SNIIRAM), a claims database containing medical reimbursements of more than 53 million people. This work has been done in the context of a research partnership between Ecole Polytechnique and CNAMTS (in charge of SNIIRAM).
  • Self-Exclusion among Online Poker Gamblers: Effects on Expenditure in Time and Money as Compared to Matched Controls.

    Amandine LUQUIENS, Aline DUGRAVOT, Henri PANJO, Amine BENYAMINA, Stephane GAIFFAS, Emmanuel BACRY
    International Journal of Environmental Research and Public Health | 2019
    No comparative data is available to report on the effect of online self-exclusion. The aim of this study was to assess the effect of self-exclusion in online poker gambling as compared to matched controls, after the end of the self-exclusion period. Methods: We included all gamblers who were first-time self-excluders over a 7-year period (n = 4887) on a poker website, and gamblers matched for gender, age and account duration (n = 4451). We report the effects over time of self-exclusion after it ended, on money (net losses) and time spent (session duration), using an analysis of variance procedure between mixed models with and without the interaction of time and self-exclusion. Analyses were performed on the whole sample, on the sub-groups that were the most heavily involved in terms of time or money (higher quartiles) and among short-duration self-excluders (<3 months). Results: Significant effects of self-exclusion and short-duration self-exclusion were found for money and time spent over 12 months. Among the gamblers that were the most heavily involved financially, no significant effect on the amount spent was found. Among the gamblers who were the most heavily involved in terms of time, a significant effect was found on time spent. Short-duration self-exclusions showed no significant effect on the most heavily involved gamblers. Conclusions: Self-exclusion seems efficient in the long term. However, the effect on money spent of self-exclusions and of short-duration self-exclusions should be further explored among the most heavily involved gamblers.
  • Differentiating asthma from chronic obstructive pulmonary disease (COPD) in medico-economic databases: Myth or reality.

    Milka MARAVIC, Raphael SIGOGNE, Nicolas ROCHE, Sarah MOUNIR, Dejan MILIC, Morgan GEOFFROY, Stephane GAIFFAS, Emmanuel BACRY, Arnaud BOURDIN
    ERS International Congress | 2019
    No summary available.
  • SCALPEL3: a scalable open-source library for healthcare claims databases.

    Emmanuel BACRY, Stephane GAIFFAS, Maryan MOREL, D.p. NGUYEN, Youcef SEBIAT, Dian SUN, Fanny LEROY
    2019
    No summary available.
  • ZiMM: a deep learning model for long term adverse events with non-clinical claims data.

    Emmanuel BACRY, Stephane GAIFFAS, Anastasiia KABESHOVA, Yiyang YU
    2019
    No summary available.
  • Dual Optimization for convex constrained objectives without the gradient-Lipschitz assumptions.

    Stephane GAIFFAS, Martin BOMPAIRE, Emmanuel BACRY
    2019
    No summary available.
  • ConvSCCS: convolutional self-controlled case-series model for lagged adverse event detection.

    Maryan MOREL, Emmanuel BACRY, Stephane GAIFFAS, Agathe GUILLOUX, Fanny LEROY
    Biostatistics | 2019
    No summary available.
  • C-mix: A high-dimensional mixture model for censored durations, with applications to genetic data.

    Simon BUSSY, Agathe GUILLOUX, Stephane GAIFFAS, Anne sophie JANNOT
    Statistical Methods in Medical Research | 2018
    We introduce a supervised learning mixture model for censored durations (C-mix) to simultaneously detect subgroups of patients with different prognoses and order them based on their risk. Our method is applicable in a high-dimensional setting, i.e. with a large number of biomedical covariates. Indeed, we penalize the negative log-likelihood by the Elastic-Net, which leads to a sparse parameterization of the model and automatically pinpoints the relevant covariates for the survival prediction. Inference is achieved using an efficient Quasi-Newton Expectation Maximization (QNEM) algorithm, for which we provide convergence properties. The statistical performance of the method is examined in an extensive Monte Carlo simulation study, and finally illustrated on three publicly available genetic cancer datasets with high-dimensional covariates. We show that our approach outperforms the state-of-the-art survival models in this context, namely both the CURE and Cox proportional hazards models penalized by the Elastic-Net, in terms of C-index, AUC(t) and survival prediction. Thus, we propose a powerful tool for personalized medicine in cancerology.
  • High-dimensional robust regression and outliers detection with slope.

    Alain VIROULEAU, Agathe GUILLOUX, Stephane GAIFFAS, Malgorzata BOGDAN
    2018
    The problems of outliers detection and robust regression in a high-dimensional setting are fundamental in statistics, and have numerous applications. Following a recent set of works providing methods for simultaneous robust regression and outliers detection, we consider in this paper a model of linear regression with individual intercepts, in a high-dimensional setting. We introduce a new procedure for simultaneous estimation of the linear regression coefficients and intercepts, using two dedicated sorted-$\ell_1$ penalizations, also called SLOPE [5]. We develop a complete theory for this problem: first, we provide sharp upper bounds on the statistical estimation error of both the vector of individual intercepts and regression coefficients. Second, we give an asymptotic control on the False Discovery Rate (FDR) and statistical power for support selection of the individual intercepts. As a consequence, this paper is the first to introduce a procedure with guaranteed FDR and statistical power control for outliers detection under the mean-shift model. Numerical illustrations, with a comparison to recent alternative approaches, are provided on both simulated and several real-world datasets. Experiments are conducted using open-source software written in Python and C++.
  • Uncovering Causality from Multivariate Hawkes Integrated Cumulants.

    Massil ACHAB, Emmanuel BACRY, Stephane GAIFFAS, Jean francois MUZY, Iacopo MASTROMATTEO
    Journal of Machine Learning Research | 2018
    We design a new nonparametric method that allows one to estimate the matrix of integrated kernels of a multivariate Hawkes process. This matrix not only encodes the mutual influences of each node of the process, but also disentangles the causality relationships between them. Our approach is the first that leads to an estimation of this matrix without any parametric modeling and estimation of the kernels themselves. As a consequence, it can give an estimation of causality relationships between nodes (or users), based on their activity timestamps (on a social network for instance), without knowing or estimating the shape of the activities lifetime. For that purpose, we introduce a moment matching method that fits the second-order and the third-order integrated cumulants of the process. A theoretical analysis allows us to prove that this new estimation technique is consistent. Moreover, we show, on numerical experiments, that our approach is indeed very robust with respect to the shape of the kernels and gives appealing results on the MemeTracker database and on financial order book data.
  • Description and assessment of trustability of motives for self-exclusion reported by online poker gamblers in a cohort using account-based gambling data.

    Amandine LUQUIENS, Delphine VENDRYES, Henri jean AUBIN, Amine BENYAMINA, Stephane GAIFFAS, Emmanuel BACRY
    BMJ Open | 2018
    No summary available.
  • High dimensional matrix estimation with unknown variance of the noise.

    Olga KLOPP, Stephane GAIFFAS
    Statistica Sinica | 2017
    We propose a new pivotal method for estimating high-dimensional matrices. Assume that we observe a small set of entries or linear combinations of entries of an unknown matrix $A_0$ corrupted by noise. We propose a new method for estimating $A_0$ which does not rely on the knowledge or an estimation of the standard deviation of the noise $\sigma$. Our estimator achieves, up to a logarithmic factor, optimal rates of convergence under the Frobenius risk and, thus, has the same prediction performance as previously proposed estimators which rely on the knowledge of $\sigma$. Our method is based on the solution of a convex optimization problem which makes it computationally attractive.
  • C-mix: a high dimensional mixture model for censored durations, with applications to genetic data.

    Simon BUSSY, Agathe GUILLOUX, Stephane GAIFFAS, Anne sophie JANNOT
    2017
    We introduce a supervised learning mixture model for censored durations (C-mix) to simultaneously detect subgroups of patients with different prognoses and order them based on their risk. Our method is applicable in a high-dimensional setting, i.e. with a large number of biomedical covariates. Indeed, we penalize the negative log-likelihood by the Elastic-Net, which leads to a sparse parameterization of the model and automatically pinpoints the relevant covariates for the survival prediction. Inference is achieved using an efficient Quasi-Newton Expectation Maximization (QNEM) algorithm, for which we provide convergence properties. The statistical performance of the method is examined in an extensive Monte Carlo simulation study, and finally illustrated on three publicly available genetic cancer datasets with high-dimensional covariates. We show that our approach outperforms the state-of-the-art survival models in this context, namely both the CURE and Cox proportional hazards models penalized by the Elastic-Net, in terms of C-index, AUC(t) and survival prediction. Thus, we propose a powerful tool for personalized medicine in cancerology.
  • Statistical inference of Ornstein-Uhlenbeck processes: generation of stochastic graphs, sparsity, applications in finance.

    Gustaw MATULEWICZ, Emmanuel GOBET, Stephane GAIFFAS, Mathieu ROSENBAUM, Mohamed BEN ALAYA, Sylvain DELATTRE, Marina KLEPTSYNA, Markus REISS
    2017
    The subject of this thesis is the statistical inference of multidimensional Ornstein-Uhlenbeck processes. In a first part, we introduce a model of stochastic graphs defined as binary observations of trajectories. We then show that it is possible to deduce the dynamics of the underlying trajectory from the binary observations. For this, we construct statistics from the graph and show new convergence properties in the context of long-time and high-frequency observation. We also analyze the properties of stochastic graphs from the point of view of evolving networks. In a second part, we work under the assumption of complete information and continuous time and add a sparsity assumption concerning the drift parameter of the Ornstein-Uhlenbeck process. We then show sharp oracle properties of the Lasso estimator, prove a lower bound on the estimation error in the minimax sense and show asymptotic optimality properties of the Adaptive Lasso estimator. We then apply these methods to estimate the mean-reversion speed of daily returns of US stocks as well as of the prices of dividend futures for the EURO STOXX 50 index.
  • Binarsity: a penalization for one-hot encoded features.

    Mokhtar z. ALAYA, Simon BUSSY, Stephane GAIFFAS, Agathe GUILLOUX
    2017
    This paper deals with the problem of large-scale linear supervised learning in settings where a large number of continuous features are available. We propose to combine the well-known trick of one-hot encoding of continuous features with a new penalization called binarsity. In each group of binary features coming from the one-hot encoding of a single raw continuous feature, this penalization uses total-variation regularization together with an extra linear constraint to avoid collinearity within groups. Non-asymptotic oracle inequalities for generalized linear models are proposed, and numerical experiments illustrate the good performance of our approach on several datasets. It is also noteworthy that our method has a numerical complexity comparable to standard L1 penalization.
  • Concentration inequalities for matrix martingales in continuous time.

    Emmanuel BACRY, Stephane GAIFFAS, Jean francois MUZY
    Probability Theory and Related Fields | 2017
    No summary available.
  • Statistical learning for event sequences using point processes.

    Massil ACHAB, Emmanuel BACRY, Stephane GAIFFAS, Nicolas VAYATIS, Vincent RIVOIRARD, Manuel GOMEZ RODRIGUEZ, Niels Richard HANSEN
    2017
    The goal of this thesis is to show that the arsenal of new optimization methods allows us to solve difficult estimation problems based on event models. These dated events are ordered chronologically and therefore cannot be considered as independent. This simple fact justifies the use of a particular mathematical tool called a point process to learn a certain structure from these events. Two point processes are considered. The first is the point process behind the Cox proportional hazards model: its conditional intensity defines the hazard ratio, a fundamental quantity in the survival analysis literature. The Cox regression model relates the time to the occurrence of an event, called a failure, to the covariates of an individual. This model can be reformulated using the point process framework. The second is the Hawkes process, which models the impact of past events on the probability of future events. The multivariate case makes it possible to encode a notion of causality between the different dimensions considered. The thesis is divided into three parts. The first part is concerned with a new optimization algorithm that we have developed. It estimates the parameter vector of the Cox regression when the number of observations is very large. Our algorithm is based on the SVRG (Stochastic Variance Reduced Gradient) algorithm and uses an MCMC (Markov chain Monte Carlo) method. We have proved convergence rates for our algorithm and have shown its numerical performance on simulated and real-world data sets. The second part shows that causality in the Hawkes sense can be estimated in a non-parametric way thanks to the integrated cumulants of the multivariate point process. We have developed two methods for estimating the integrals of the kernels of the Hawkes process, without making any assumption on the shape of these kernels. Our methods are faster and more robust, with respect to the shape of the kernels, compared to the state of the art. We have demonstrated the statistical consistency of the first method, and have shown that the second one can be applied to a convex optimization problem. The last part highlights the order book dynamics using the first non-parametric estimation method introduced in the previous part. We have used data from the EUREX futures market, developed new order book models (based on the previous work of Bacry et al.) and applied the estimation method to these point processes. The results obtained are very satisfactory and consistent with an economic analysis. Such work shows that the method we have developed makes it possible to extract a structure from data as complex as those from high-frequency finance.
  • Concentration for matrix martingales in continuous time and microscopic activity of social networks.

    Emmanuel BACRY, Stephane GAIFFAS, J. f. MUZY
    Probability Theory and Related Fields | 2017
    No summary available.
  • Counting Process Segmentation and Dynamic Models.

    Elmokhtar ezzahdi ALAYA, Stephane GAIFFAS, Agathe GUILLOUX, Pierre ALQUIER, Sylvain ARLOT, Gerard BIAU, Erwan LE PENNEC
    2016
    In the first part of this thesis, we aim at estimating the intensity of a counting process by statistical learning techniques in high dimension. We introduce an estimation procedure based on the total variation penalty with weights. A first set of results aims at studying the intensity under an a priori hypothesis of sparse segmentation. In a second part, we study the binarization technique for continuous explanatory variables, for which we construct a regularization specific to this problem. This regularization, called "binarsity", penalizes different values of a vector of parameters. In the third part, we focus on dynamic regression for Aalen and Cox models with high-dimensional coefficients and covariates, which can depend on time. For each of the proposed estimation procedures, we demonstrate non-asymptotic oracle inequalities in prediction. We finally use proximal algorithms to solve the underlying convex problems, and we illustrate our methods on simulated and real data.
  • Mean-field inference of Hawkes point processes.

    Emmanuel BACRY, Stephane GAIFFAS, Iacopo MASTROMATTEO, Jean francois MUZY
    Journal of Physics A: Mathematical and Theoretical | 2016
    We propose a fast and efficient estimation method that is able to accurately recover the parameters of a d-dimensional Hawkes point-process from a set of observations. We exploit a mean-field approximation that is valid when the fluctuations of the stochastic intensity are small. We show that this is notably the case in situations when interactions are sufficiently weak, when the dimension of the system is high or when the fluctuations are self-averaging due to the large number of past events they involve. In such a regime the estimation of a Hawkes process can be mapped onto a least-squares problem for which we provide an analytic solution. Though this estimator is biased, we show that its precision can be comparable to that of the Maximum Likelihood Estimator while its computation speed is shown to be improved considerably. We give a theoretical control on the accuracy of our new approach and illustrate its efficiency using synthetic datasets, in order to assess the statistical estimation error of the parameters.
  • Learning the intensity of time events with change-points.

    Mokhtar z. ALAYA, Stephane GAIFFAS, Agathe GUILLOUX
    2015
    We consider the problem of learning the inhomogeneous intensity of a counting process, under a sparse segmentation assumption. We introduce a weighted total-variation penalization, using data-driven weights that correctly scale the penalization along the observation interval. We prove that this leads to a sharp tuning of the convex relaxation of the segmentation prior, by stating oracle inequalities with fast rates of convergence, and consistency for change-point detection. This provides the first theoretical guarantees for segmentation with a convex proxy beyond the standard i.i.d. signal + white noise setting. We introduce a fast algorithm to solve this convex problem. Numerical experiments illustrate our approach on simulated data and on a high-frequency genomics dataset.
  • Learning the Intensity of Time Events With Change-Points.

    Mokhtar z. ALAYA, Stephane GAIFFAS, Agathe GUILLOUX
    IEEE Transactions on Information Theory | 2015
    We consider the problem of learning the inhomogeneous intensity of a counting process, under a sparse segmentation assumption. We introduce a weighted total-variation penalization, using data-driven weights that correctly scale the penalization along the observation interval. We prove that this leads to a sharp tuning of the convex relaxation of the segmentation prior, by stating oracle inequalities with fast rates of convergence, and consistency for change-point detection. This provides the first theoretical guarantees for segmentation with a convex proxy beyond the standard i.i.d. signal + white noise setting. We introduce a fast algorithm to solve this convex problem. Numerical experiments illustrate our approach on simulated data and on a high-frequency genomics dataset.
  • Regularization methods for prediction in dynamic graphs and e-marketing applications.

    Emile RICHARD, Nicolas VAYATIS, Francis BACH, Theodoros EVGENIOU, Stephane GAIFFAS, Michael irwin JORDAN, Thibaut MUNIER, Massimiliano PONTIL, Jean philippe VERT
    2012
    The prediction of connections between objects, based either on a noisy observation or on a sequence of observations, is a problem of interest for a number of applications, ranging from the design of recommendation systems in e-commerce and social networks to network inference in molecular biology. This work formulates the link prediction problem, in both static and temporal settings, as a regularized problem. In the static scenario, link prediction relies on the combination of two well-known norms, the L1-norm and the trace-norm, while in the dynamic case an autoregressive model on linear descriptors improves the quality of prediction. We study the nature of the solutions of the optimization problems in both statistical and algorithmic terms. Encouraging empirical results highlight the contribution of the adopted methodology.
  • Nonparametric regression and spatially inhomogeneous information.

    Stephane GAIFFAS, Marc HOFFMANN
    2005
    No summary available.
  • Nonparametric regression and spatially inhomogeneous information.

    Stephane GAIFFAS
    2005
    We study the nonparametric estimation of a signal based on inhomogeneous noisy data (the amount of data varies over the estimation domain). We consider the model of nonparametric regression with random design. Our aim is to understand the consequences of inhomogeneous data on the estimation problem in the minimax setup. Our approach is twofold: local and global. In the local setup, we want to recover the regression function at a point where there is little or much data. By translating this property into several assumptions on the design density, we obtain a large range of new minimax rates, including very slow and very fast rates. Then, we construct a smoothness-adaptive procedure, and we show that it converges at a minimax rate penalised by a minimal cost. In the global setup, we want to recover the regression function with sup-norm loss. We propose estimators converging at rates which are sensitive to the inhomogeneous behaviour of the information in the model. We prove the spatial optimality of these rates, which consists in a strengthening of the classical minimax lower bound for sup-norm loss. In particular, we construct an asymptotically sharp estimator over Hölder balls with any smoothness, and a confidence band whose width adapts to the local amount of data.
Affiliations are detected from the signatures of publications identified in scanR. An author can therefore appear to be affiliated with several structures or supervisors according to these signatures. The dates displayed correspond only to the dates of the publications found. For more information, see https://scanr.enseignementsup-recherche.gouv.fr