JOSSE Julie

Affiliations
  • 2016 - 2020
    Ecole Polytechnique
  • 2015 - 2020
    Centre de mathématiques appliquées
  • 2018 - 2020
    Centre de recherche Inria de Paris
  • 2015 - 2020
    Modélisation statistique pour les sciences du vivant
  • 2012 - 2016
    Institut de recherche mathématique de Rennes
  • 2019 - 2021
    Centre de recherche Inria Sophia Antipolis - Méditerranée
  • 2015 - 2020
    Détermination de Formes Et Identification
  • 2016 - 2017
    Institut national de recherche en informatique et en automatique
  • 2012 - 2016
    Institut national supérieur des sciences agronomiques, agroalimentaires, horticoles et du paysage
  • 2015 - 2016
    Sélection de modèles en apprentissage statistique
  • 2013 - 2015
    Institut national d'enseignement supérieur et de recherche agronomique et agroalimentaire de Rennes, Agrocampus Ouest
Publications
  • Trauma reloaded: Trauma registry in the era of data science.

    Jean denis MOYER, Sophie rym HAMADA, Julie JOSSE, Auliard OLIVIER, Tobias GAUSS
    Anaesthesia Critical Care & Pain Medicine | 2021
    No summary available.
  • What's a good imputation to predict with missing values?

    Marine LE MORVAN, Julie JOSSE, Erwan SCORNET, Gael VAROQUAUX
    2021
    How to learn a good predictor on data with missing values? Most efforts focus on first imputing as well as possible and second learning on the completed data to predict the outcome. Yet, this widespread practice has no theoretical grounding. Here we show that for almost all imputation functions, an impute-then-regress procedure with a powerful learner is Bayes optimal. This result holds for all missing-values mechanisms, in contrast with the classic statistical results that require missing-at-random settings to use imputation in probabilistic modeling. Moreover, it implies that perfect conditional imputation may not be needed for good prediction asymptotically. In fact, we show that on perfectly imputed data the best regression function will generally be discontinuous, which makes it hard to learn. Crafting instead the imputation so as to leave the regression function unchanged simply shifts the problem to learning discontinuous imputations. Rather, we suggest that it is easier to learn imputation and regression jointly. We propose such a procedure, adapting NeuMiss, a neural network capturing the conditional links across observed and unobserved variables whatever the missing-value pattern. Experiments with a finite number of samples confirm that joint imputation and regression through NeuMiss outperforms various two-step procedures.
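
    A minimal base-R sketch of the impute-then-regress baseline discussed above: impute missing entries (here with column means), then fit a learner on the completed table. Data and variable names are illustrative, and the joint NeuMiss procedure proposed in the paper is not reproduced.

        set.seed(1)
        n <- 500; p <- 4
        X <- matrix(rnorm(n * p), n, p)               # complete covariates
        y <- drop(X %*% c(1, -2, 0.5, 0)) + rnorm(n)  # linear response
        X[runif(n * p) < 0.2] <- NA                   # 20% of the entries go missing

        # Step 1: impute each column with its observed mean
        col_means <- colMeans(X, na.rm = TRUE)
        X_imp <- X
        for (j in seq_len(p)) X_imp[is.na(X[, j]), j] <- col_means[j]

        # Step 2: regress the outcome on the completed covariates
        fit <- lm(y ~ X_imp)
        summary(fit)$r.squared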
  • Debiasing Stochastic Gradient Descent to handle missing values.

    Aude SPORTISSE, Claire BOYER, Aymeric DIEULEVEUT, Julie JOSSE
    2020
    A major caveat of large-scale data is their incompleteness. We propose an averaged stochastic gradient algorithm handling missing values in linear models. This approach has the merit of requiring no modeling of the data distribution and of accounting for heterogeneous missing proportions. In both streaming and finite-sample settings, we prove that this algorithm achieves a convergence rate of O(1/n) at iteration n, the same as without missing values. We show the convergence behavior and the relevance of the algorithm not only on synthetic data but also on real data sets, including data collected from a medical registry.
  • Hydroxychloroquine with or without azithromycin and in-hospital mortality or discharge in patients hospitalized for COVID-19 infection: a cohort study of 4,642 in-patients in France.

    Julie JOSSE, Alexandre GRAMFORT, Thomas MOREAU, Gael VAROQUAUX, Marc LAVIELLE
    2020
    Objective To assess the clinical effectiveness of oral hydroxychloroquine (HCQ) with or without azithromycin (AZI) in preventing death or leading to hospital discharge. Design Retrospective cohort study. Setting An analysis of data from electronic medical records and administrative claim data from the French Assistance Publique - Hopitaux de Paris (AP-HP) data warehouse, in 39 public hospitals, Ile-de-France, France. Participants All adult inpatients with at least one PCR-documented SARS-CoV-2 RNA from a nasopharyngeal sample between February 1st, 2020 and April 6th, 2020 were eligible for analysis. The study population was restricted to patients who did not receive COVID-19 treatments assessed in ongoing trials, including antivirals and immunosuppressive drugs. End of follow-up was defined as the date of death, discharge home, day 28 after admission, whichever occurred first, or administrative censoring on May 4, 2020. Intervention Patients were further classified into 3 groups: (i) receiving HCQ alone, (ii) receiving HCQ together with AZI, and (iii) receiving neither HCQ nor AZI. Exposure to an HCQ/AZI combination was defined as a simultaneous prescription of the 2 treatments (within one day of each other). Main outcome measures The primary outcome was all-cause 28-day mortality as a time-to-event endpoint under a competing risks survival analysis framework. The secondary outcome was 28-day discharge home. Augmented inverse probability of treatment weighted (AIPTW) estimates of the average treatment effect (ATE) were computed to account for confounding. Results A total of 4,642 patients (mean age: 66.1 +/- 18; males: 2,738 (59%)) were included, of whom 623 (13.4%) received HCQ alone, 227 (5.9%) received HCQ plus AZI, and 3,792 (81.7%) received neither drug. Patients receiving "HCQ alone" or "HCQ plus AZI" were more likely to be younger, male and current smokers, and overall presented with slightly more co-morbidities (obesity, diabetes, any chronic pulmonary diseases, liver diseases), while no major difference was apparent in biological parameters. After accounting for confounding, no statistically significant difference was observed between the "HCQ" and "Neither drug" groups for 28-day mortality: the AIPTW absolute difference in ATE was +1.24% (-5.63 to 8.12), ratio in ATE 1.05 (0.77 to 1.33). 28-day discharge rates were statistically significantly higher in the "HCQ" group: AIPTW absolute difference in ATE +11.1% (3.30 to 18.9), ratio in ATE 1.25 (1.07 to 1.42). As for "HCQ+AZI" vs neither drug, trends toward significant differences and ratios in AIPTW ATE were found, suggesting higher mortality rates in the former group (difference in ATE +9.83% [-0.51 to 20.17], ratio in ATE 1.40 [0.98 to 1.81]; p=0.062). Conclusions Using a large non-selected population of inpatients hospitalized for COVID-19 infection in 39 hospitals in France and robust methodological approaches, we found no evidence for efficacy of HCQ or HCQ combined with AZI on 28-day mortality. Our results suggested a possible excess risk of mortality associated with HCQ combined with AZI, but not with HCQ alone. Significantly higher rates of discharge home were observed in patients treated by HCQ, a novel finding warranting further confirmation in replicative studies. Altogether, our findings further support the need to complete the ongoing randomized clinical trials.
  • Statistical inference with incomplete and high-dimensional data - modeling polytraumatized patients.

    Wei JIANG, Julie JOSSE, Marc LAVIELLE, Bertrand THIRION, Daniel YEKUTIELI, Adeline LECLERCQ SAMSON, Pierre NEUVIAL, Daniel YEKUTIELI, Adeline LECLERCQ SAMSON
    2020
    The problem of missing data has existed since the early days of data analysis, as missing values are related to the process of obtaining and preparing data. In applications of modern statistics and machine learning, where data collection is becoming increasingly complex and multiple sources of information are combined, large databases often have extraordinarily high numbers of missing values. These data therefore present significant methodological and technical challenges for analysis: from visualization to modeling, estimation, variable selection, predictive capabilities and implementation. Furthermore, although high-dimensional data with missing values are considered common difficulties in statistical analysis today, only a few solutions are available. The objective of this thesis is to develop new methodologies for performing statistical inference with missing data, in particular for high-dimensional data. The most important contribution is to propose a complete framework for dealing with missing values, from estimation to model selection, based on likelihood approaches. The proposed method does not rely on a specific missingness feature and allows a good balance between inference quality and efficient implementations. The contributions of the thesis are organized into three parts. In Chapter 2, we focus on logistic regression with missing values in a joint modeling framework, using a stochastic approximation of the EM algorithm. We study parameter estimation, variable selection and prediction for new incomplete observations. Through extensive simulations, we show that the estimators are unbiased and have good properties in terms of confidence interval coverage, outperforming the popular imputation-based approach. The method is then applied to pre-hospital data to predict the risk of hemorrhagic shock, in collaboration with medical partners, the Traumabase group of Paris hospitals. Indeed, the proposed model improves the prediction of the risk of bleeding compared to the prediction made by physicians. In Chapters 3 and 4, we focus on model selection issues for high-dimensional incomplete data, which aim in particular at controlling false discoveries. For linear models, the adaptive Bayesian version of SLOPE (ABSLOPE) that we propose in Chapter 3 addresses these issues by incorporating sorted l1 regularization in a Bayesian 'spike and slab' framework. In Chapter 4, which targets more general models than linear regression, we consider these issues in a so-called 'model-X' framework, where the conditional distribution of the response as a function of the covariates is not specified. To do so, we combine a 'knockoff' methodology with multiple imputations. Through a comprehensive simulation study, we demonstrate satisfactory performance in terms of power, FDR and estimation bias for a wide range of scenarios. In the application to the medical data set, we build a model to predict patients' platelet levels from pre-hospital and hospital data. Finally, we provide two open-source software packages with tutorials, in order to assist decision making in the medical field and users facing missing values.
  • Logistic regression with missing covariates—Parameter estimation, model selection and prediction within a joint-modeling framework.

    Wei JIANG, Julie JOSSE, Marc LAVIELLE
    Computational Statistics & Data Analysis | 2020
    Logistic regression is a common classification method in supervised learning. Surprisingly, there are very few solutions for performing it and selecting variables in the presence of missing values. We develop a complete approach, including the estimation of parameters and of the variance of the estimators, the derivation of confidence intervals and a model selection procedure, for cases where the missing values can be anywhere in the covariates. By carefully organizing the different patterns of missingness in each observation, we propose a stochastic approximation version of the EM algorithm based on Metropolis-Hastings sampling to perform statistical inference for logistic regression with incomplete data. We also tackle the problem of prediction for a new individual with missing values, which is never addressed. The methodology is computationally efficient, and its good coverage and variable selection properties are demonstrated in a simulation study where we contrast its performance with other methods. For instance, the popular multiple imputation by chained equations can lead to biased estimates while our method is unbiased. We then illustrate the method on a dataset of severely traumatized patients from Paris hospitals to predict the occurrence of hemorrhagic shock, a leading cause of early preventable death in severe trauma cases. The aim is to consolidate the current red flag procedure, a binary alert identifying patients with a high risk of severe hemorrhage. The methodology is implemented in the R package misaem.
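
    The abstract above refers to the R package misaem. A short usage sketch follows; the formula interface miss.glm() is assumed from the package documentation and may differ across versions, so treat it as a sketch rather than a definitive call.

        # install.packages("misaem")
        library(misaem)

        set.seed(1)
        n  <- 300
        x1 <- rnorm(n); x2 <- 0.5 * x1 + rnorm(n)
        y  <- rbinom(n, 1, plogis(1 + x1 - x2))
        x1[runif(n) < 0.3] <- NA                  # missing values in a covariate
        df <- data.frame(y = y, x1 = x1, x2 = x2)

        # SAEM-based logistic regression with incomplete covariates (assumed interface)
        fit <- miss.glm(y ~ x1 + x2, data = df)
        summary(fit)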
  • Linear predictor on linearly-generated data with missing values: non consistency and solutions.

    Marine LE MORVAN, Nicolas PROST, Julie JOSSE, Erwan SCORNET, Gael VAROQUAUX
    2020
    We consider building predictors when the data have missing values. We study the seemingly-simple case where the target to predict is a linear function of the fully-observed data and we show that, in the presence of missing values, the optimal predictor may not be linear. In the particular Gaussian case, it can be written as a linear function of multiway interactions between the observed data and the various missing-value indicators. Due to its intrinsic complexity, we study a simple approximation and prove generalization bounds with finite samples, highlighting regimes for which each method performs best. We then show that multilayer perceptrons with ReLU activation functions can be consistent, and can explore good trade-offs between the true model and approximations. Our study highlights the interesting family of models that are beneficial to fit with missing values depending on the amount of data available.
  • Causal inference methods for combining randomized trials and observational studies: a review.

    Benedicte COLNET, Imke MAYER, Guanhua CHEN, Awa DIENG, Ruohong LI, Gael VAROQUAUX, Jean philippe VERT, Julie JOSSE, Shu YANG
    2020
    With increasing data availability, treatment causal effects can be evaluated across different datasets, both randomized trials and observational studies. Randomized trials isolate the effect of the treatment from that of unwanted (confounding) co-occurring effects, but they may be applied to limited populations and thus lack external validity. By contrast, large observational samples are often more representative of the target population but can conflate confounding effects with the treatment of interest. In this paper, we review the growing literature on methods for causal inference on combined randomized trials and observational studies, striving for the best of both worlds. We first discuss identification and estimation methods that improve the generalizability of randomized controlled trials (RCTs) using the representativeness of observational data. Classical estimators include weighting, the difference between conditional outcome models, and doubly robust estimators. We then discuss methods that combine RCTs and observational data to improve (conditional) average treatment effect estimation, handling possible unmeasured confounding in the observational data. We also connect and contrast works developed in both the potential outcomes framework and the structural causal models framework. Finally, we compare the main methods using a simulation study and real-world data to analyse the effect of tranexamic acid on the mortality rate in major trauma patients. Code to implement many of the methods is provided.
  • Imputation and low-rank estimation with Missing Not At Random data.

    Aude SPORTISSE, Claire BOYER, Julie JOSSE
    Statistics and Computing | 2020
    Missing values challenge data analysis because many supervised and unsupervised learning methods cannot be applied directly to incomplete data. Matrix completion based on low-rank assumptions is a very powerful solution for dealing with missing values. However, existing methods do not consider the case of informative missing values, which are widely encountered in practice. This paper proposes matrix completion methods to recover Missing Not At Random (MNAR) data. Our first contribution is to suggest a model-based estimation strategy by modelling the missing-values mechanism distribution. An EM algorithm is then implemented, involving a Fast Iterative Soft-Thresholding Algorithm (FISTA). Our second contribution is to suggest a computationally efficient surrogate estimation by implicitly taking into account the joint distribution of the data and the missing mechanism: the data matrix is concatenated with the mask coding for the missing values, and a low-rank structure for exponential families is assumed on this new matrix in order to encode links between variables and missing mechanisms. The methodology, which has the great advantage of handling different missing-value mechanisms, is robust to model specification errors. The performance of our methods is assessed on real data collected from a trauma registry (TraumaBase) containing clinical information about over twenty thousand severely traumatized patients in France. The aim is then to predict whether doctors should administer tranexamic acid, which limits excessive bleeding, to patients with traumatic brain injury.
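
    The low-rank completion machinery that the entry above builds on can be illustrated with a generic soft-impute loop (fill, SVD, soft-threshold the singular values, repeat). This base-R sketch is only the standard ignorable-missingness version; the MNAR modelling that is the paper's actual contribution is not reproduced.

        soft_impute <- function(X, lambda, n_iter = 100) {
          miss <- is.na(X)
          X_hat <- X
          X_hat[miss] <- mean(X, na.rm = TRUE)    # crude initial fill
          for (it in seq_len(n_iter)) {
            s <- svd(X_hat)
            d <- pmax(s$d - lambda, 0)            # soft-threshold the singular values
            Z <- s$u %*% diag(d) %*% t(s$v)
            X_hat[miss] <- Z[miss]                # only the missing cells are updated
          }
          X_hat
        }

        set.seed(1)
        L <- matrix(rnorm(100 * 3), 100, 3) %*% matrix(rnorm(3 * 8), 3, 8)   # rank-3 signal
        X <- L + 0.1 * matrix(rnorm(100 * 8), 100, 8)
        X[runif(length(X)) < 0.2] <- NA
        X_completed <- soft_impute(X, lambda = 1)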
  • Some statistical learning problems in the presence of incomplete data.

    Maximilien BAUDRY, Christian yann ROBERT, Julie JOSSE, Christian yann ROBERT, Gerard BIAU, Anne laure FOUGERES, Thierry ARTIERES, Olivier LOPEZ
    2020
    Most statistical methods are not natively designed to work on incomplete data. The study of incomplete data is not new, and many results have been established to overcome the incompleteness before the statistical study. On the other hand, deep learning methods are generally applied to unstructured data such as images, text or audio, and few works have developed this type of approach for tabular data, even fewer for incomplete data. This thesis focuses on the use of machine learning algorithms applied to tabular data in the presence of incompleteness, in an insurance framework. Through the contributions gathered in this manuscript, we propose different ways to model complex phenomena in the presence of incompleteness patterns. We show that the proposed approaches give better results than the state of the art.
  • Multivariate analysis is sufficient for lesion-behaviour mapping.

    Lucas MARTIN, Julie JOSSE, Bertrand THIRION
    BrainLes 2020 | 2020
    Lesion-behaviour mapping aims at predicting individual behavioural deficits, given a certain pattern of brain lesions. It also brings fundamental insights on brain organization, as lesions can be understood as interventions on normal brain function. We focus here on the case of stroke. The most standard approach to lesion-behaviour mapping is mass-univariate analysis, but it is inaccurate due to correlations between the different brain regions induced by vascularisation. Recently, it has been claimed that multivariate methods are also subject to lesion-anatomical bias, and that a move towards a causal approach is necessary to eliminate that bias. In this paper, we reframe the lesion-behaviour brain mapping problem using classical causal inference tools. We show that, in the absence of additional clinical data and if only one region has an effect on the behavioural scores, suitable multivariate methods are sufficient to address lesion-anatomical bias. This is a commonly encountered situation when working with public datasets, which very often lack general health data. We support our claim with a set of simulated experiments using a publicly available lesion imaging dataset, on which we show that adequate multivariate models provide state-of-the-art results.
  • Adaptive Bayesian SLOPE—High-dimensional Model Selection with Missing Values.

    Wei JIANG, Malgorzata BOGDAN, Julie JOSSE, Blazej MIASOJEDOW, Veronika ROCKOVA
    2020
    We consider the problem of variable selection in high-dimensional settings with missing observations among the covariates. To address this relatively understudied problem, we propose a new synergistic procedure -- adaptive Bayesian SLOPE -- which effectively combines the SLOPE method (sorted l1 regularization) together with the Spike-and-Slab LASSO method. We position our approach within a Bayesian framework which allows for simultaneous variable selection and parameter estimation, despite the missing values. As with the Spike-and-Slab LASSO, the coefficients are regarded as arising from a hierarchical model consisting of two groups: (1) the spike for the inactive and (2) the slab for the active. However, instead of assigning independent spike priors for each covariate, here we deploy a joint "SLOPE" spike prior which takes into account the ordering of coefficient magnitudes in order to control for false discoveries. Through extensive simulations, we demonstrate satisfactory performance in terms of power, FDR and estimation bias under a wide range of scenarios. Finally, we analyze a real dataset consisting of patients from Paris hospitals who underwent a severe trauma, where we show excellent performance in predicting platelet levels. Our methodology has been implemented in C++ and wrapped into an R package ABSLOPE for public use.
  • Doubly robust treatment effect estimation with missing attributes.

    Imke MAYER, Erik SVERDRUP, Tobias GAUSS, Jean denis MOYER, Stefan WAGER, Julie JOSSE
    2020
    Missing attributes are ubiquitous in causal inference, as they are in most applied statistical work. In this paper, we consider various sets of assumptions under which causal inference is possible despite missing attributes and discuss corresponding approaches to average treatment effect estimation, including generalized propensity score methods and multiple imputation. Across an extensive simulation study, we show that no single method systematically outperforms the others. We find, however, that doubly robust modifications of standard methods for average treatment effect estimation with missing data repeatedly perform better than their non-doubly robust baselines; for example, doubly robust generalized propensity score methods beat inverse-weighting with the generalized propensity score. This finding is reinforced in an analysis of an observational study on the effect on mortality of tranexamic acid administration among patients with traumatic brain injury in the context of critical care management. Here, doubly robust estimators recover confidence intervals that are consistent with evidence from randomized trials, whereas non-doubly robust estimators do not.
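
    The doubly robust (augmented inverse-probability weighting) estimator that the entry above builds on can be sketched in a few lines of base R on fully observed covariates; the paper's handling of missing attributes (generalized propensity scores, multiple imputation) is not reproduced in this toy example.

        set.seed(1)
        n <- 2000
        x <- rnorm(n)
        w <- rbinom(n, 1, plogis(0.5 * x))        # treatment assignment
        y <- 1 + x + 0.8 * w + rnorm(n)           # true average treatment effect = 0.8

        e_hat  <- fitted(glm(w ~ x, family = binomial))                   # propensity model
        m1_hat <- predict(lm(y ~ x, subset = w == 1), data.frame(x = x))  # outcome model, treated
        m0_hat <- predict(lm(y ~ x, subset = w == 0), data.frame(x = x))  # outcome model, control

        aipw <- m1_hat - m0_hat +
          w * (y - m1_hat) / e_hat -
          (1 - w) * (y - m0_hat) / (1 - e_hat)
        mean(aipw)                                # doubly robust ATE estimate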
  • Robust Lasso-Zero for sparse corruption and model selection with missing covariates.

    Pascaline DESCLOUX, Claire BOYER, Julie JOSSE, Aude SPORTISSE, Sylvain SARDY
    2020
    We propose Robust Lasso-Zero, an extension of the Lasso-Zero methodology [Descloux and Sardy, 2018], initially introduced for sparse linear models, to the sparse corruptions problem. We give theoretical guarantees on the sign recovery of the parameters for a slightly simplified version of the estimator, called Thresholded Justice Pursuit. The use of Robust Lasso-Zero is showcased for variable selection with missing values in the covariates. In addition to not requiring the specification of a model for the covariates, nor estimating their covariance matrix or the noise variance, the method has the great advantage of handling missing not-at-random values without specifying a parametric model. Numerical experiments and a medical application underline the relevance of Robust Lasso-Zero in such a context, with few available competitors. The method is easy to use and is implemented in the R library lass0.
  • VARCLUST: clustering variables using dimensionality reduction.

    Piotr SOBCZYK, Malgorzata BOGDAN, Piotr GRACZYK, Julie JOSSE, Fabien PANLOUP, Valerie SEEGERS, Mateusz STANIAK, Stanislaw WILCZYNSKI
    2020
    The VARCLUST algorithm is proposed for clustering variables under the assumption that variables in a given cluster are linear combinations of a small number of hidden latent variables, corrupted by random noise. The entire clustering task is viewed as a problem of model selection, where the statistical model is defined by the number of clusters, the partition of variables into these clusters and the 'cluster dimensions', i.e. the vector of dimensions of the linear subspaces spanning each of the clusters. The 'optimal' model is selected using an approximate Bayesian criterion based on Laplace approximations and a non-informative uniform prior on the number of clusters. To solve the problem of searching over a huge space of possible models, we propose an extension of the ClustOfVar algorithm of [29, 7], which was dedicated to subspaces of dimension 1 only, and which is similar in structure to the K-centroid algorithm. We provide a complete methodology with theoretical guarantees, extensive numerical experiments, complete data analyses and an implementation. Our algorithm assigns variables to appropriate clusters based on the consistent Bayesian Information Criterion (BIC), and estimates the dimensionality of each cluster by the PEnalized SEmi-integrated Likelihood criterion (PESEL) of [24], whose consistency we prove. Additionally, we prove that each iteration of our algorithm leads to an increase of the Laplace approximation to the model posterior probability and provide a criterion for the estimation of the number of clusters. Numerical comparisons with other algorithms show that VARCLUST may outperform some popular machine learning tools for sparse subspace clustering. We also report the results of real data analyses, including TCGA breast cancer data and meteorological data, which show that the algorithm can lead to meaningful clustering. The proposed method is implemented in the publicly available R package varclust.
  • Neumann networks: differential programming for supervised learning with missing values.

    Marine LE MORVAN, Julie JOSSE, Thomas MOREAU, Erwan SCORNET, Gael VAROQUAUX
    2020
    The presence of missing values makes supervised learning much more challenging. Indeed, previous work has shown that even when the response is a linear function of the complete data, the optimal predictor is a complex function of the observed entries and the missingness indicator. As a result, the computational or sample complexities of consistent approaches depend on the number of missing patterns, which can be exponential in the number of dimensions. In this work, we derive the analytical form of the optimal predictor under a linearity assumption and various missing data mechanisms including Missing at Random (MAR) and self-masking (Missing Not At Random). Based on a Neumann series approximation of the optimal predictor, we propose a new principled architecture, named Neumann networks. Their originality and strength comes from the use of a new type of non-linearity: the multiplication by the missingness indicator. We provide an upper bound on the Bayes risk of Neumann networks, and show that they have good predictive accuracy with both a number of parameters and a computational complexity independent of the number of missing data patterns. As a result they scale well to problems with many features, and remain statistically efficient for medium-sized samples. Moreover, we show that, contrary to procedures using EM or imputation, they are robust to the missing data mechanism, including difficult MNAR settings such as self-masking.
  • Input noise injection for supervised machine learning, with applications on genomic and image data.

    Beyrem KHALFAOUI, Jean philippe VERT, Veronique STOVEN, Jean philippe VERT, Julien CHIQUET, Gael VAROQUAUX, Julie JOSSE
    2019
    Overfitting is a general problem that affects statistical learning algorithms in different ways and has been approached in different ways in the literature. We first illustrate a real case of this problem in the context of a collaborative work aiming at predicting the response of rheumatoid arthritis patients to anti-inflammatory treatments. We then focus on the input noise injection method in its generality as a regularization method. We give an overview of this method, its applications, insights, algorithms and some theoretical elements in the context of supervised learning. We then focus on the dropout method introduced in the context of deep learning and construct a new approximation allowing a new interpretation of this method in a general framework. We complement this study with experiments on simulations and real data. We then present a generalization of the noise injection method inspired by the noise inherent to certain types of data, which also allows variable selection. We present a new stochastic algorithm for this method, study its regularization properties and apply it to the context of single-cell RNA sequencing. Finally, we present another generalization of the noise injection method where the introduced noise follows a structure that is adaptively inferred from the model parameters, namely the covariance of the activations of the units to which it is applied. We study the theoretical properties of this new method, called ASNI, for linear models and multilayer neural networks. Finally, we show that ASNI improves the generalization performance of predictive models while improving the resulting representations.
  • Multiple imputation for mixed data by factor analysis.

    Vincent AUDIGIER, Francois HUSSON, Julie JOSSE, Matthieu RESCHE RIGON
    JdS2019 - 51es Journées de Statistique de la Société Française de Statistique | 2019
    Taking into account an ever-increasing amount of data makes their analysis increasingly complex. This complexity translates in particular into variables of different types, the presence of missing data, and a large number of variables and/or observations. The application of statistical methods in this context is generally delicate. The purpose of this presentation is to propose a new multiple imputation method based on factor analysis of mixed data (FAMD). FAMD is a factorial analysis method adapted to data sets containing both quantitative and qualitative variables, whose number may or may not exceed the number of observations. By virtue of its properties, the development of a multiple imputation method based on FAMD allows inference on incomplete quantitative and qualitative variables, in large and small dimensions. The proposed multiple imputation method uses a bootstrap approach to reflect the uncertainty on the principal components and eigenvectors of FAMD, used here to predict (impute) the data. Each bootstrap replication then provides a prediction for the incomplete entries of the data set. These predictions are then noised to reflect the distribution of the data. We thus obtain as many imputed tables as there are bootstrap replications. After recalling the principles of multiple imputation, we present our methodology. The proposed method is evaluated by simulation and compared to reference methods: sequential imputation by generalized linear models, imputation by mixture models and by the general location model. It provides unbiased point estimates of different parameters of interest as well as confidence intervals with the expected coverage rate. Moreover, it can be applied to data sets of various natures and dimensions, including cases where the number of observations is smaller than the number of variables.
  • Model-based clustering with missing not at random data. Missing mechanism.

    Fabien LAPORTE, Christophe BIERNACKI, Gilles CELEUX, Julie JOSSE
    Working Group on Model-Based Clustering Summer Session | 2019
    Since the 90s, model-based clustering has been widely used to classify data. Nowadays, with the increase of available data, missing values are more frequent. We defend the need to embed the missingness mechanism directly within the clustering modeling step. There exist three types of missing data: missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR). In all situations, logistic regression is proposed as a natural and flexible candidate model. In this unified context, standard model selection criteria can be used to select between such different missing data mechanisms, simultaneously with the number of clusters. The practical interest of our proposal is illustrated on data derived from medical studies suffering from many missing data.
  • Dealing with missing data in model-based clustering through a MNAR model.

    Christophe BIERNACKI, Gilles CELEUX, Julie JOSSE, Fabien LAPORTE
    CRoNos & MDA 2019 - Meeting and Workshop on Multivariate Data Analysis and Software | 2019
    Since the 90s, model-based clustering has been widely used to classify data. Nowadays, with the increase of available data, missing values are more frequent. Traditional ways to deal with them consist in obtaining a filled data set, either by discarding missing values or by imputing them. In the first case some information is lost; in the second case the final clustering purpose is not taken into account through the imputation step. Thus both solutions risk blurring the clustering estimation result. Alternatively, we defend the need to embed the missingness mechanism directly within the clustering modeling step. There exist three types of missing data: missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR). In all situations, logistic regression is proposed as a natural and flexible candidate model. In particular, its flexibility allows designing some meaningful parsimonious variants, such as dependency on the missing values or dependency on the cluster label. In this unified context, standard model selection criteria can be used to select between such different missing data mechanisms, simultaneously with the number of clusters. The practical interest of our proposal is illustrated on data derived from medical studies suffering from many missing data.
  • Unsupervised classification models with non-random missing data.

    Fabien LAPORTE, Christophe BIERNACKI, Gilles CELEUX, Julie JOSSE
    51e journées de statistique | 2019
    The difficulty of accounting for missing data is often contained by assuming that their occurrence is due to chance. In this paper, we consider that the absence of some data is not due to chance, in the context of unsupervised classification, and we propose logistic models to reflect the fact that this occurrence can be associated with the sought classification. We focus on different models that we estimate by maximum likelihood, and we analyze their characteristics through their application to hospital data.
  • Nonparametric Imputation by Data Depth.

    Pavlo MOZHAROVSKYI, Julie JOSSE, Francois HUSSON
    Journal of the American Statistical Association | 2019
    The presented methodology for single imputation of missing values borrows the idea from data depth, a measure of centrality defined for an arbitrary point of the space with respect to a probability distribution or a data cloud. It consists in the iterative maximization of the depth of each observation with missing values, and can be employed with any properly defined statistical depth function. On each single iteration, imputation narrows down to the optimization of a quadratic, linear, or quasiconcave function, solved analytically, by linear programming, or by the Nelder-Mead method, respectively. Being able to grasp the underlying data topology, the procedure is distribution free, imputes close to the data, preserves prediction possibilities in contrast to local imputation methods (k-nearest neighbors, random forest), and has attractive robustness and asymptotic properties under elliptical symmetry. It is shown that its particular case, when using the Mahalanobis depth, has a direct connection to well-known treatments for the multivariate normal model, such as iterated regression or regularized PCA. The methodology is extended to multiple imputation for data stemming from an elliptically symmetric distribution. Simulation and real data studies positively contrast the procedure with existing popular alternatives. The method has been implemented as an R package.
  • Biases in feature selection with missing data.

    Borja SEIJO PARDO, Amparo ALONSO BETANZOS, Kristin p. BENNETT, Veronica BOLON CANEDO, Julie JOSSE, Mehreen SAEED, Isabelle GUYON
    Neurocomputing | 2019
    No summary available.
  • Imputation of Mixed Data With Multilevel Singular Value Decomposition.

    Francois HUSSON, Julie JOSSE, Balasubramanian NARASIMHAN, Genevieve ROBIN
    Journal of Computational and Graphical Statistics | 2019
    Statistical analysis of large data sets offers new opportunities to better understand many processes. Yet, data accumulation often implies relaxing acquisition procedures or compounding diverse sources. As a consequence, such data sets often contain mixed data, i.e. both quantitative and qualitative variables, and many missing values. Furthermore, aggregated data present a natural multilevel structure, where individuals or samples are nested within different sites, such as countries or hospitals. Imputation of multilevel data has therefore drawn some attention recently, but current solutions are not designed to handle mixed data and suffer from important drawbacks, such as their computational cost. In this article, we propose a single imputation method for multilevel data, which can be used to complete either quantitative, categorical or mixed data. The method is based on multilevel singular value decomposition (SVD), which consists in decomposing the variability of the data into two components, the between-group and within-group variability, and performing SVD on both parts. We show in a simulation study that, in comparison to competitors, the method has the great advantages of handling data sets of various sizes and of being computationally faster. Furthermore, it is the first so far to handle mixed data. We apply the method to impute a medical data set resulting from the aggregation of several data sets coming from different hospitals. This application falls within the framework of a larger project on trauma patients. To overcome obstacles associated with the aggregation of medical data, we turn to distributed computation. The method is implemented in an R package.
  • Estimation and imputation in Probabilistic Principal Component Analysis with Missing Not At Random data.

    Aude SPORTISSE, Claire BOYER, Julie JOSSE
    2019
    Missing Not At Random (MNAR) values are considered to be non-ignorable and require defining a model for the missing-values mechanism, which involves strong a priori assumptions on the parametric form of the distribution and makes the inference or imputation tasks more complex. Methodologies to handle MNAR values also tend to focus on simple settings, assuming that only one variable (such as the outcome) has missing entries. Recent work of Mohan and Pearl based on graphical models and causality shows that specific MNAR settings enable recovering some aspects of the distribution without specifying the MNAR mechanism. We pursue this line of research. Considering a data matrix generated from a probabilistic principal component analysis (PPCA) model containing several MNAR variables, not necessarily under the same self-masked missing mechanism, we propose estimators for the means, variances and covariances of the variables and study their consistency. The estimators have the great advantage of being computed using only observed data. In addition, we propose an imputation method for the data matrix and an estimation of the PPCA loading matrix. We compare our proposal with results obtained for ignorable missing values based on the use of the expectation-maximization algorithm.
  • Main Effects and Interactions in Mixed and Incomplete Data Frames.

    Genevieve ROBIN, Olga KLOPP, Julie JOSSE, Eric MOULINES, Robert TIBSHIRANI
    Journal of the American Statistical Association | 2019
    A mixed data frame (MDF) is a table collecting categorical, numerical and count observations. The use of MDFs is widespread in statistics and the applications are numerous, from abundance data in ecology to recommender systems. In many cases, an MDF exhibits simultaneously main effects, such as row, column or group effects, and interactions, for which a low-rank model has often been suggested. Although the literature on low-rank approximations is very substantial, with few exceptions existing methods do not allow incorporating main effects and interactions while providing statistical guarantees. The present work fills this gap.
  • R-miss-tastic: a unified platform for missing values methods and workflows.

    Imke MAYER, Julie JOSSE, Nicholas TIERNEY, Nathalie VIALANEIX
    2019
    Missing values are unavoidable when working with data. Their occurrence is exacerbated as more data from different sources become available. However, most statistical models and visualization methods require complete data, and improper handling of missing data results in information loss, or biased analyses. Since the seminal work of Rubin (1976), there has been a burgeoning literature on missing values with heterogeneous aims and motivations. This has resulted in the development of various methods, formalizations, and tools (including a large number of R packages). However, for practitioners, it is challenging to decide which method is most suited for their problem, partially because handling missing data is still not a topic systematically covered in statistics or data science curricula. To help address this challenge, we have launched a unified platform: "R-miss-tastic", which aims to provide an overview of standard missing values problems, methods, how to handle them in analyses, and relevant implementations of methodologies. The objective is not only to collect, but also comprehensively organize materials, to create standard analysis workflows, and to unify the community. These overviews are suited for beginners, students, more advanced analysts and researchers.
  • On the consistency of supervised learning with missing values.

    Julie JOSSE, Nicolas PROST, Erwan SCORNET, Gael VAROQUAUX
    2019
    In many application settings, the data have missing features which make data analysis challenging. An abundant literature addresses missing data in an inferential framework: estimating parameters and their variance from incomplete tables. Here, we consider supervised-learning settings: predicting a target when missing values appear in both training and testing data. We show the consistency of two approaches in prediction. A striking result is that the widely-used method of imputing with the mean prior to learning is consistent when missing values are not informative. This contrasts with inferential settings where mean imputation is pointed at for distorting the distribution of the data. That such a simple approach can be consistent is important in practice. We also show that a predictor suited for complete observations can predict optimally on incomplete data, through multiple imputation. We analyze further decision trees. These can naturally tackle empirical risk minimization with missing values, due to their ability to handle the half-discrete nature of incomplete variables. After comparing theoretically and empirically different missing values strategies in trees, we recommend using the "missing incorporated in attribute" (MIA) method as it can handle both non-informative and informative missing values. [Figure: relative explained variance of the compared strategies: MIA, mean or Gaussian imputation with or without the missingness mask, block, and surrogate splits (rpart, ctree).]
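
    One of the simple strategies compared above, "impute the mean and add the missingness mask as extra features", takes a few lines of base R; the sketch below is illustrative and does not reproduce the MIA splitting rule, which requires support inside the tree-growing algorithm itself.

        add_mask_features <- function(X) {
          mask <- 1 * is.na(X)                    # 0/1 indicators of missingness
          colnames(mask) <- paste0(colnames(X), "_miss")
          X_imp <- apply(X, 2, function(col) { col[is.na(col)] <- mean(col, na.rm = TRUE); col })
          cbind(X_imp, mask)
        }

        set.seed(1)
        X <- matrix(rnorm(200 * 3), 200, 3, dimnames = list(NULL, c("a", "b", "c")))
        y <- drop(X %*% c(2, -1, 0)) + rnorm(200)
        X[runif(length(X)) < 0.25] <- NA
        fit <- lm(y ~ ., data = data.frame(add_mask_features(X), y = y))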
  • Low-rank model with covariates for count data with missing values.

    Genevieve ROBIN, Julie JOSSE, Eric MOULINES, Sylvain SARDY
    Journal of Multivariate Analysis | 2019
    No summary available.
  • Imputation and low-rank estimation with Missing Not At Random data.

    Aude SPORTISSE, Claire BOYER, Julie JOSSE
    2019
    Missing values challenge data analysis because many supervised and unsupervised learning methods cannot be applied directly to incomplete data. Matrix completion based on low-rank assumptions is a very powerful solution for dealing with missing values. However, existing methods do not consider the case of informative missing values, which are widely encountered in practice. This paper proposes matrix completion methods to recover Missing Not At Random (MNAR) data. Our first contribution is to suggest a model-based estimation strategy by modelling the missing-values mechanism distribution. An EM algorithm is then implemented, involving a Fast Iterative Soft-Thresholding Algorithm (FISTA). Our second contribution is to suggest a computationally efficient surrogate estimation by implicitly taking into account the joint distribution of the data and the missing mechanism: the data matrix is concatenated with the mask coding for the missing values, and a low-rank structure for exponential families is assumed on this new matrix in order to encode links between variables and missing mechanisms. The methodology, which has the great advantage of handling different missing-value mechanisms, is robust to model specification errors.
  • Logistic Regression with Missing Covariates -- Parameter Estimation, Model Selection and Prediction.

    Wei JIANG, Julie JOSSE, Marc LAVIELLE
    2018
    Logistic regression is a common classification method in supervised learning. Surprisingly, there are very few solutions for performing it and selecting variables in the presence of missing values. We develop a complete approach, including the estimation of parameters and of the variance of the estimators, the derivation of confidence intervals and a model selection procedure, for cases where the missing values can be anywhere in the covariates. By carefully organizing the different patterns of missingness in each observation, we propose a stochastic approximation version of the EM algorithm based on Metropolis-Hastings sampling to perform statistical inference for logistic regression with incomplete data. We also tackle the problem of prediction for a new individual with missing values, which is never addressed. The methodology is computationally efficient, and its good coverage and variable selection properties are demonstrated in a simulation study where we contrast its performance with other methods. For instance, the popular multiple imputation by chained equations can lead to biased estimates while our method is unbiased. We then illustrate the method on a dataset of severely traumatized patients from Paris hospitals to predict the occurrence of hemorrhagic shock, a leading cause of early preventable death in severe trauma cases. The aim is to consolidate the current red flag procedure, a binary alert identifying patients with a high risk of severe hemorrhage. The methodology is implemented in the R package misaem.
  • Low-rank Interaction with Sparse Additive Effects Model for Large Data Frames.

    Genevieve ROBIN, Hoi to WAI, Julie JOSSE, Olga KLOPP, Eric MOULINES
    32nd Conference on Neural Information Processing Systems (NeurIPS 2018) | 2018
    Many applications of machine learning involve the analysis of large data frames, that is, matrices collecting heterogeneous measurements (binary, numerical, counts, etc.) across samples, with missing values. Low-rank models, as studied by Udell et al. [30], are popular in this framework for tasks such as visualization, clustering and missing value imputation. Yet, available methods with statistical guarantees and efficient optimization do not allow explicit modeling of main additive effects such as row, column or covariate effects. In this paper, we introduce a low-rank interaction and sparse additive effects (LORIS) model which combines matrix regression on a dictionary and a low-rank design to estimate main effects and interactions simultaneously. We provide statistical guarantees in the form of upper bounds on the estimation error of both components. Then, we introduce a mixed coordinate gradient descent (MCGD) method which provably converges sub-linearly to an optimal solution and is computationally efficient for large-scale data sets. We show on simulated and survey data that the method has a clear advantage over current practices, which consist in dealing separately with additive effects in a preprocessing step.
  • Dealing with missing data in model-based clustering through a MNAR model.

    Christophe BIERNACKI, Gilles CELEUX, Julie JOSSE, Fabien LAPORTE
    CMStatistics 2018 - 11th International Conference of the ERCIM WG on Computational and Methodological Statistics | 2018
    No summary available.
  • R for statistics and data science.

    Francois HUSSON, Eric MATZNER LOBER, Arnaud GUYADER, Pierre andre CORNILLON, Julie JOSSE, Laurent ROUVIERE, Nicolas KLUTCHNIKOFF, Benoit THIEURMEL, Nicolas JEGOU, Erwann LE PENNEC
    2018
    No summary available.
  • Bayesian Dimensionality Reduction With PCA Using Penalized Semi-Integrated Likelihood.

    Piotr SOBCZYK, Malgorzata BOGDAN, Julie JOSSE
    Journal of Computational and Graphical Statistics | 2017
    No summary available.
  • Multiple correspondence analysis and the multilogit bilinear model.

    William FITHIAN, Julie JOSSE
    Journal of Multivariate Analysis | 2017
    No summary available.
  • Discussion of “50 Years of Data Science”.

    Susan HOLMES, Julie JOSSE
    Journal of Computational and Graphical Statistics | 2017
    No summary available.
  • 50 years of data sciences, discussion.

    Julie JOSSE, Susan HOLMES
    Journal of Computational and Graphical Statistics | 2017
    No summary available.
  • Empirical Bayes approaches to PageRank type algorithms for rating scientific journals.

    Jean louis FOULLEY, Gilles CELEUX, Julie JOSSE
    2017
    Following criticisms against the journal Impact Factor, new journal influence scores have been developed, such as the Eigenfactor or the Prestige Scimago Journal Rank. They are based on PageRank-type algorithms on the cross-citation transition matrix of the citing-cited network. The PageRank algorithm performs a smoothing of the transition matrix combining a random walk on the data network and a teleportation to all possible nodes with fixed probabilities (the damping factor being α = 0.85). We reinterpret this smoothing matrix as the mean of a posterior distribution of a Dirichlet-multinomial model in an empirical Bayes perspective. We suggest a simple yet efficient way to make a clear distinction between structural and sampling zeroes. This allows us to contrast cases where self-citations are included or excluded, to avoid the bias of overvalued journals. We estimate the model parameters by maximizing the marginal likelihood with a Majorize-Minimize algorithm. The procedure ends up with a score similar to the PageRank ones but with a damping factor depending on the journal at hand. The procedures are illustrated with an example about cross-citations among 47 statistical journals studied by Varin et al. (2016).
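
    The fixed-damping smoothing that the entry above revisits is the classical PageRank construction; a base-R sketch on a toy cross-citation matrix follows (the empirical Bayes, journal-specific damping proposed in the paper is not implemented here).

        page_rank <- function(C, alpha = 0.85, tol = 1e-10) {
          n <- nrow(C)
          P <- C / rowSums(C)                     # row-stochastic cross-citation matrix
          G <- alpha * P + (1 - alpha) / n        # teleportation to all journals
          r <- rep(1 / n, n)
          repeat {
            r_new <- drop(r %*% G)
            if (sum(abs(r_new - r)) < tol) break
            r <- r_new
          }
          r_new / sum(r_new)
        }

        # toy citation counts among three journals (a small constant avoids empty rows)
        C <- matrix(c(0, 3, 1,  2, 0, 4,  5, 1, 0), 3, 3, byrow = TRUE) + 1e-6
        page_rank(C)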
  • Some discussions on the Read Paper "Beyond subjective and objective in statistics" by A. Gelman and C. Hennig.

    Christian p. ROBERT, Gilles CELEUX, Jack JEWSON, Julie JOSSE, Jean michel MARIN
    2017
    This note is a collection of several discussions of the paper "Beyond subjective and objective in statistics", read by A. Gelman and C. Hennig to the Royal Statistical Society on April 12, 2017, and to appear in the Journal of the Royal Statistical Society, Series A.
  • Low-rank model with covariates for count data analysis.

    Genevieve ROBIN, Julie JOSSE, Eric MOULINES, Sylvain SARDY
    2017
    Count data are collected in many scientific and engineering tasks, including image processing, single-cell RNA sequencing and ecological studies. Such data sets often contain missing values, for example because some ecological sites cannot be reached in a certain year. In addition, in many instances, side information is also available, for example covariates about ecological sites or species. Low-rank methods are popular to denoise and impute count data, and benefit from a substantial theoretical background. Extensions accounting for covariates have been proposed, but to the best of our knowledge their theoretical and empirical properties have not been thoroughly studied, and little software is available for practitioners. We propose a complete methodology called LORI (Low-Rank Interaction), including a Poisson model, an algorithm, and automatic selection of the regularization parameter, to analyze count tables with covariates. We also derive an upper bound on the estimation error. We provide a simulation study with synthetic data, revealing empirically that LORI improves on state-of-the-art methods in terms of estimation and imputation of the missing values. We illustrate how the method can be interpreted through visual displays with the analysis of a well-known plant abundance data set, and show that the LORI outputs are consistent with known results. Finally, we demonstrate the relevance of the methodology by analyzing a waterbird abundance table from the French national agency for wildlife and hunting management (ONCFS). The method is available in the R package lori on the Comprehensive R Archive Network (CRAN).
  • Jan de Leeuw and the French School of Data Analysis.

    Julie JOSSE
    Journal of Statistical Software | 2016
    The Dutch and the French schools of data analysis differ in their approaches to the question: How does one understand and summarize the information contained in a data set? The commonalities and discrepancies between the schools are explored here with a focus on methods dedicated to the analysis of categorical data, which are known either as homogeneity analysis (HOMALS) or multiple correspondence analysis (MCA).
  • Regulatory T Cells in Melanoma Revisited by a Computational Clustering of FOXP3+ T Cell Subpopulations.

    Hiroko FUJII, Julie JOSSE, Miki TANIOKA, Yoshiki MIYACHI, Francois HUSSON, Masahiro ONO
    The Journal of Immunology | 2016
    No summary available.
  • Measuring multivariate association and beyond.

    Julie JOSSE, Susan HOLMES
    Statistics Surveys | 2016
    No summary available.
  • missMDA: A Package for Handling Missing Values in Multivariate Data Analysis.

    Julie JOSSE, Francois HUSSON
    Journal of Statistical Software | 2016
    No summary available.
  • MIMCA: multiple imputation for categorical variables with multiple correspondence analysis.

    Vincent AUDIGIER, Francois HUSSON, Julie JOSSE
    Statistics and Computing | 2016
    We propose a multiple imputation method to deal with incomplete categorical data. This method imputes the missing entries using the principal component method dedicated to categorical data: multiple correspondence analysis (MCA). The uncertainty concerning the parameters of the imputation model is reflected using a non-parametric bootstrap. Multiple imputation using MCA (MIMCA) requires estimating a small number of parameters, due to the dimensionality reduction property of MCA. It allows the user to impute a large range of data sets. In particular, a high number of categories per variable, a high number of variables or a small number of individuals are not an issue for MIMCA. Through a simulation study based on real data sets, the method is assessed and compared to the reference methods (multiple imputation using the loglinear model, multiple imputation by logistic regressions) as well as to the latest works on the topic (multiple imputation by random forests or by the Dirichlet process mixture of products of multinomial distributions model). The proposed method shows good performance in terms of bias and coverage for an analysis model such as a main-effects logistic regression model. In addition, MIMCA has the great advantage that it is substantially less time-consuming on data sets of high dimensions than the other multiple imputation methods.
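
    The MIMCA procedure described above is implemented in the missMDA R package; a short usage sketch follows. The example data set vnf and the output slot res$res.MI holding the completed tables are assumptions from memory of the package documentation, so check ?MIMCA before relying on them.

        # install.packages("missMDA")
        library(missMDA)

        data(vnf)                                  # incomplete categorical data shipped with missMDA (assumed)
        ncp <- estim_ncpMCA(vnf, ncp.max = 5)$ncp  # number of MCA dimensions for the imputation model
        res <- MIMCA(vnf, ncp = ncp, nboot = 50)   # 50 imputed data sets via bootstrap
        str(res$res.MI[[1]])                       # first completed table (assumed output slot)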
  • Jan de Leeuw and the French School of Data Analysis.

    Francois HUSSON, Julie JOSSE, Gilbert SAPORTA
    Journal of Statistical Software | 2016
    The Dutch and the French schools of data analysis differ in their approaches to the question: How does one understand and summarize the information contained in a data set? The commonalities and discrepancies between the schools are explored here with a focus on methods dedicated to the analysis of categorical data, which are known either as homogeneity analysis (HOMALS) or multiple correspondence analysis (MCA).
  • Bayesian dimensionality reduction with PCA using penalized semi-integrated likelihood.

    Piotr SOBCZYK, Malgorzata BOGDAN, Julie JOSSE
    2016
    We discuss the problem of estimating the number of principal components in Principal Components Analysis (PCA). Despite the importance of the problem and the multitude of solutions proposed in the literature, it comes as a surprise that there does not exist a coherent asymptotic framework that would justify different approaches depending on the actual size of the data set. In this paper we address this issue by presenting an approximate Bayesian approach based on the Laplace approximation and introducing a general method for building model selection criteria, called PEnalized SEmi-integrated Likelihood (PESEL). Our general framework encompasses a variety of existing approaches based on probabilistic models, such as the Bayesian Information Criterion for Probabilistic PCA (PPCA), and allows for the construction of new criteria, depending on the size of the data set at hand. Specifically, we define PESEL when the number of variables substantially exceeds the number of observations. We also report results of extensive simulation studies and real data analyses, which illustrate the good properties of our proposed criteria as compared to state-of-the-art methods and very recent proposals. In particular, these simulations show that PESEL-based criteria can be quite robust against deviations from the assumptions of the probabilistic model. Selected PESEL-based criteria for estimating the number of principal components are implemented in the R package varclust, available on GitHub (https://github.com/psobczyk/varclust).
  • Confidence Areas for Fixed-Effects PCA.

    Julie JOSSE, Stefan WAGER, Francois HUSSON
    Journal of Computational and Graphical Statistics | 2016
    PCA is often used to visualize data when the rows and the columns are both of interest. In such a setting there is a lack of inferential methods on the PCA output. We study the asymptotic variance of a fixed-effects model for PCA, and propose several approaches to assessing the variability of PCA estimates: a method based on a parametric bootstrap, a new cell-wise jackknife, as well as a computationally cheaper approximation to the jackknife. We visualize the confidence regions by Procrustes rotation. Using a simulation study, we compare the proposed methods and highlight the strengths and drawbacks of each method as we vary the number of rows, the number of columns, and the strength of the relationships between variables.
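    A minimal base-R sketch of the parametric bootstrap idea is given below: the low-rank signal and the noise variance are estimated once, bootstrap tables are simulated from this fitted model, and the bootstrap scores are Procrustes-rotated towards the reference configuration before drawing confidence areas. The noise-variance estimator here is a crude plug-in, not the exact estimator studied in the paper.

      # Parametric bootstrap for fixed-effects PCA (base R illustration only).
      set.seed(3)
      n <- 50; p <- 8; k <- 2
      X <- matrix(rnorm(n * p), n, p)
      s <- svd(X)
      Xhat   <- s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])   # rank-k signal estimate
      sigma2 <- sum(s$d[-(1:k)]^2) / (n * p)                      # crude noise-variance plug-in
      F0     <- s$u[, 1:k] %*% diag(s$d[1:k])                     # reference row coordinates
      boot <- replicate(200, {
        Xb <- Xhat + matrix(rnorm(n * p, sd = sqrt(sigma2)), n, p)
        sb <- svd(Xb)
        Fb <- sb$u[, 1:k] %*% diag(sb$d[1:k])
        pr <- svd(t(Fb) %*% F0)                                   # Procrustes rotation towards F0
        Fb %*% pr$u %*% t(pr$v)
      })
      # boot[i, , ] is a k x 200 cloud of aligned coordinates for individual i,
      # whose spread can be drawn as a confidence area around F0[i, ].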
  • denoiseR: A Package for Low Rank Matrix Estimation.

    Julie JOSSE, Sylvain SARDY, Stefan WAGER
    2016
    We present the R package denoiseR, dedicated to low-rank matrix estimation. First, we briefly review existing methods, including singular value shrinkage and a parametric bootstrap approach. Then, we discuss how to extend the methods to missing values and suggest a general iterative imputation algorithm. It includes an extension of the Stein Unbiased Risk Estimate to missing values for selecting tuning parameters. Finally, we compare and apply the methods in a range of experiments.
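    The snippet below illustrates how the package might typically be used on simulated data; the function names LRsim() and adashrink() are recalled from the package and the output is inspected only generically, so treat this as an indicative sketch rather than exact documentation.

      # Indicative denoiseR usage on a simulated noisy low-rank matrix.
      library(denoiseR)
      set.seed(4)
      sim <- LRsim(n = 100, p = 30, k = 3, SNR = 2)   # noisy matrix sim$X around a rank-3 signal
      fit <- adashrink(sim$X, method = "GSURE")       # adaptive singular value shrinkage, tuned by GSURE
      str(fit, max.level = 1)                         # denoised estimate and selected tuning parameters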
  • Multinomial Multiple Correspondence Analysis.

    Julie JOSSE, Patrick j. f. GROENEN
    2016
    Relations between categorical variables can be analyzed conveniently by multiple correspondence analysis (MCA), which is well suited to uncovering relations between the categories of different variables. The graphical representation of MCA results in so-called biplots makes it easy to interpret the most important associations. However, a major drawback of MCA is that it does not have an underlying probability model for an individual selecting a category on a variable. In this paper, we propose such a probability model, called multinomial multiple correspondence analysis (MMCA), which combines the underlying low-rank representation of MCA with maximum likelihood. An efficient majorization algorithm that uses an elegant bound for the second derivative is derived to estimate the parameters. The proposed model can easily lead to overfitting, causing some of the parameters to wander off to infinity. We add a nuclear norm penalty to counter this issue and discuss ways of selecting regularization parameters. The proposed approach is well suited to studying and visualizing the dependencies in high-dimensional data.
  • Multiple Correspondence Analysis & the Multilogit Bilinear Model.

    William FITHIAN, Julie JOSSE
    2016
    Multiple Correspondence Analysis (MCA) is a dimension reduction method that plays a large role in the analysis of tables with categorical nominal variables, such as survey data. Though it is usually motivated and derived using geometric considerations, we prove that it amounts to a single proximal Newton step of a natural bilinear exponential family model for categorical data, the multinomial logit bilinear model. We compare and contrast the behavior of MCA with that of the model on simulations and discuss new insights into the properties of both exploratory multivariate methods and their cognate models. One main conclusion is that the multilogit model parameters can be approximated using MCA: estimating the parameters of the model is not a trivial task, whereas MCA has the great advantage of being easily solved by singular value decomposition and scalable to large data.
  • Contribution to missing values & principal component methods.

    Julie JOSSE
    2016
    This manuscript was written for the Habilitation à Diriger des Recherches and describes my research activities. The first part is named "A missing values tour with principal components methods". It first focuses on performing exploratory principal components (PCA-based) methods despite missing values, i.e. estimating parameters (scores and loadings) to obtain biplot representations from an incomplete data set. It then presents the use of principal components methods for single and multiple imputation of both continuous and categorical data. The second part concerns "New practices in visualization with principal components methods". It presents regularized versions of the principal components methods in the complete case and their potential impact on the biplot graphical outputs. These contributions are part of the more general framework of low-rank matrix estimation methods. It then discusses notions of variability of the parameters, with confidence areas for fixed-effect PCA using either bootstrap or Bayesian approaches.
  • Adaptive shrinkage of singular values.

    Julie JOSSE, Sylvain SARDY
    Statistics and Computing | 2015
    To estimate a low rank matrix from noisy observations, truncated singular value decomposition has been extensively used and studied: empirical singular values are hard thresholded and empirical singular vectors remain untouched. Recent estimators not only truncate but also shrink the singular values. In the same vein, we propose a continuum of thresholding and shrinking functions that encompasses hard and soft thresholding. To avoid an unstable and costly cross-validation search of their thresholding and shrinking parameters, we propose new rules to select these two regularization parameters from the data. In particular we propose a generalized Stein unbiased risk estimation criterion that does not require knowledge of the variance of the noise and that is computationally fast. A Monte Carlo simulation reveals that our estimator outperforms the tested methods in terms of mean squared error and rank estimation.
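    To make the continuum concrete, the base-R function below implements one such family for illustration (not necessarily the paper's exact parametrization): each singular value d is multiplied by max(1 - (lambda/d)^gamma, 0), so that gamma = 1 gives soft thresholding and large gamma approaches hard thresholding. Here lambda and gamma are fixed by hand rather than selected by the data-driven criteria discussed above.

      # Thresholding/shrinking continuum applied to singular values (base R).
      shrink_svd <- function(X, lambda, gamma = 1) {
        s <- svd(X)
        d <- s$d * pmax(1 - (lambda / s$d)^gamma, 0)
        s$u %*% diag(d) %*% t(s$v)
      }
      set.seed(5)
      signal <- tcrossprod(matrix(rnorm(60 * 2), 60, 2), matrix(rnorm(15 * 2), 15, 2))
      X   <- signal + matrix(rnorm(60 * 15, sd = 0.5), 60, 15)
      est <- shrink_svd(X, lambda = 3, gamma = 2)
      mean((est - signal)^2)                 # reconstruction error for this choice of (lambda, gamma)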
  • Stable Autoencoding: A Flexible Framework for Regularized Low-rank Matrix Estimation.

    Julie JOSSE, Stefan WAGER
    Procedia Computer Science | 2015
    We develop a framework for low-rank matrix estimation that allows us to transform noise models into regularization schemes via a simple parametric bootstrap. Effectively, our procedure seeks an autoencoding basis for the observed matrix that is robust with respect to the specified noise model. In the simplest case, with an isotropic noise model, our procedure is equivalent to a classical singular value shrinkage estimator. For non-isotropic noise models, however, our method does not reduce to singular value shrinkage, and instead yields new estimators that perform well in experiments. Moreover, by iterating our stable autoencoding scheme, we can automatically generate low-rank estimates without specifying the target rank as a tuning parameter.
  • Multiple imputation for continuous variables using a Bayesian principal component analysis.

    Vincent AUDIGIER, Francois HUSSON, Julie JOSSE
    Journal of Statistical Computation and Simulation | 2015
    We propose a multiple imputation method to deal with incomplete continuous data based on principal component analysis (PCA). To reflect the uncertainty of the parameters from one imputation to the next, we use a Bayesian treatment of the PCA model. Using a simulation study, the method is compared to two classical approaches: multiple imputation based on joint modelling and on fully conditional modelling. Contrary to the others, the proposed method can be easily used on data sets where the number of individuals is smaller than the number of variables. In addition, it provides a good point estimate of the parameter of interest and a reliable estimate of the estimator's variability, while reducing the width of the confidence intervals.
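    In practice this Bayesian multiple imputation is exposed through missMDA; the call below is indicative (the method.mi argument name is recalled from the package documentation and should be verified locally).

      # Indicative Bayesian multiple imputation by PCA with missMDA.
      library(missMDA)
      data(orange)                                   # small continuous data set with missing values
      ncp <- estim_ncpPCA(orange, ncp.max = 4)$ncp   # cross-validated number of dimensions
      res.mi <- MIPCA(orange, ncp = ncp, method.mi = "Bayes", nboot = 100)
      plot(res.mi)                                   # visualize the between-imputation variability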
  • Different fluid intake patterns across the week can be identified in German adults.

    Isabelle GUELINCKX, Francois HUSSON, Erica PERRIER, Stella KEMGANG, Alexis KLEIN, Julie JOSSE
    FASEB Journal | 2014
    No summary available.
  • Principal component analysis with missing values: a comparative survey of methods.

    Stephane DRAY, Julie JOSSE
    Plant Ecology | 2014
    Principal component analysis (PCA) is a standard technique to summarize the main structures of a data table containing the measurements of several quantitative variables for a number of individuals. Here, we study the case where some of the data values are missing and propose a review of methods which accommodate PCA to missing data. In plant ecology, this statistical challenge relates to the current effort to compile global plant functional trait databases, which produces matrices with a large amount of missing values. We present several techniques to handle or estimate (impute) missing values in PCA and compare them using theoretical considerations. We carried out a simulation study to evaluate the relative merits of the different approaches in various situations (correlation structure, number of variables and individuals, and percentage of missing values) and also applied them to a real data set. Lastly, we discuss the advantages and drawbacks of these approaches, the potential pitfalls, and the challenges that need to be addressed in the future.
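    An iterative (EM-type) PCA imputation of the kind reviewed above can be written in a few lines of base R; the sketch below is for illustration only, and regularized implementations such as the one in missMDA are preferable on real data because the plain algorithm is prone to overfitting.

      # Minimal iterative PCA imputation (base R illustration).
      iterative_pca_impute <- function(X, k = 2, maxit = 200, tol = 1e-6) {
        miss <- is.na(X)
        Xhat <- X
        for (j in seq_len(ncol(X))) Xhat[miss[, j], j] <- mean(X[, j], na.rm = TRUE)
        for (it in seq_len(maxit)) {
          mu  <- colMeans(Xhat)
          s   <- svd(sweep(Xhat, 2, mu))
          fit <- s$u[, 1:k, drop = FALSE] %*% diag(s$d[1:k], k) %*% t(s$v[, 1:k, drop = FALSE])
          fit <- sweep(fit, 2, mu, "+")
          delta <- sum((fit[miss] - Xhat[miss])^2)     # change on the imputed cells
          Xhat[miss] <- fit[miss]
          if (delta < tol) break
        }
        Xhat
      }
      set.seed(6)
      X <- matrix(rnorm(100 * 6), 100, 6)
      X[sample(length(X), 60)] <- NA
      completed <- iterative_pca_impute(X, k = 2)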
  • Multiple correspondence analysis.

    Francois HUSSON, Julie JOSSE
    Visualization and Verbalization of Data | 2014
    No summary available.
  • Another Look at Bayesian Analysis of AMMI Models for Genotype-Environment Data.

    Julie JOSSE, Fred VAN EEUWIJK, Hans peter PIEPHO, Jean baptiste DENIS
    Journal of Agricultural, Biological, and Environmental Statistics | 2014
    Linear–bilinear models are frequently used to analyze two-way data such as genotype-by-environment data. A well-known example of this class of models is the additive main effects and multiplicative interaction effects model (AMMI). We propose a new Bayesian treatment of such models offering a proper way to deal with the major problem of overparameterization. The rationale is to ignore the issue at the prior level and apply an appropriate processing at the posterior level to be able to arrive at easily interpretable inferences. Compared to previous attempts, this new strategy has the great advantage of being directly implementable in standard software packages devoted to Bayesian statistics such as WinBUGS/OpenBUGS/JAGS. The method is assessed using simulated datasets and a real dataset from plant breeding. We discuss the benefits of a Bayesian perspective to the analysis of genotype-by-environment interactions, focusing on practical questions related to general and local adaptation and stability of genotypes. We also suggest a new solution to the estimation of the risk of a genotype not exceeding a given threshold.
  • A principal component method to impute missing values for mixed data.

    Vincent AUDIGIER, Francois HUSSON, Julie JOSSE
    Advances in Data Analysis and Classification | 2014
    We propose a new method to impute missing values in mixed data sets. It is based on a principal components method, factorial analysis for mixed data, which balances the influence of all the variables, continuous and categorical, in the construction of the dimensions of variability. Because the imputation uses the principal axes and components, the prediction of the missing values is based on the similarity between individuals and on the relationships between variables. The quality of the imputation is assessed through a simulation study and real data sets. The method is compared to a recent method based on random forests (Stekhoven and Bühlmann, 2011) and shows better performance, especially for the imputation of categorical variables and when there are highly linear relationships between continuous variables.
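    For reference, an implementation is available in missMDA as imputeFAMD(); the short call below on a synthetic mixed table is indicative of typical usage.

      # Indicative single imputation of mixed data with missMDA::imputeFAMD.
      library(missMDA)
      set.seed(7)
      don <- data.frame(
        height = rnorm(80, 170, 10),
        weight = rnorm(80, 70, 8),
        smoker = factor(sample(c("yes", "no"), 80, replace = TRUE))
      )
      don[sample(80, 8), "weight"] <- NA
      don[sample(80, 8), "smoker"] <- NA
      res <- imputeFAMD(don, ncp = 2)
      head(res$completeObs)                # completed table mixing continuous and categorical variables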
  • Handling missing values in multiple factor analysis.

    Francois HUSSON, Julie JOSSE
    Food Quality and Preference | 2013
    No summary available.
  • Measures of dependence between random vectors and tests of independence.

    Julie JOSSE, Susan HOLMES
    2013
    The simple correlation coefficient between two variables has been generalized several times to measures of association between two matrices. Coefficients such as the RV coefficient, the distance covariance (dCov) coefficient and the Hilbert-Schmidt Independence Criterion (HSIC) have been adopted by different communities. Scientists also use tests to assess whether two random variables are linked and then interpret the coefficients in context. Many branches of science currently need multiway measures of association. The aim of this paper is to provide a brief state of the art on measures of dependence between random vectors and tests of independence, and to show the links between different approaches. We document some interesting rediscoveries and the lack of interconnection between bodies of literature. The review starts with a short history of randomization tests using distance matrices and some motivating examples. We then provide definitions of the coefficients and the associated tests. Finally, we review a few recent modifications that provide improved properties and enhance ease of interpretation, as well as some prospective directions for future research.
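    As a concrete example of the simplest coefficient mentioned above, the RV coefficient between two column-centred matrices can be computed directly in base R (FactoMineR::coeffRV additionally provides a significance test).

      # RV coefficient between two sets of variables measured on the same individuals.
      rv_coefficient <- function(X, Y) {
        X <- scale(as.matrix(X), scale = FALSE)
        Y <- scale(as.matrix(Y), scale = FALSE)
        WX <- tcrossprod(X)                 # n x n inner-product configuration of the first block
        WY <- tcrossprod(Y)
        sum(WX * WY) / sqrt(sum(WX * WX) * sum(WY * WY))
      }
      set.seed(8)
      X <- matrix(rnorm(40 * 3), 40, 3)
      Y <- X[, 1:2] + matrix(rnorm(40 * 2, sd = 0.5), 40, 2)   # block of variables related to X
      rv_coefficient(X, Y)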
  • Missing values in multi-level simultaneous component analysis.

    Julie JOSSE, Marieke e. TIMMERMAN, Henk a.l. KIERS
    Chemometrics and Intelligent Laboratory Systems | 2013
    Component analysis of data with missing values is often performed with algorithms of iterative imputation. However, this approach is prone to overfitting problems. As an alternative, Josse et al. (2009) proposed a regularized algorithm in the framework of Principal Component Analysis (PCA). Here we use a similar approach to deal with missing values in multi-level simultaneous component analysis (MLSCA), a method dedicated to explore multivariate multilevel data (e.g., individuals nested within groups). We discuss the properties of the regularized algorithm, the expected behavior under the missing (completely) at random (M(C)AR) mechanisms and possible dysmonotony problems. We explain the importance of separating the deviations due to sampling fluctuations and due to missing data. On the basis of a comparative extensive simulation study, we show that the regularized method generally performs well and clearly outperforms an EM-type of algorithm.
  • Regularised PCA to denoise and visualise data.

    Marie VERBANCK, Julie JOSSE, Francois HUSSON
    Statistics and Computing | 2013
    Principal component analysis (PCA) is a well-established method commonly used to explore and visualise data. A classical PCA model is the fixed-effect model, where data are generated as a fixed structure of low rank corrupted by noise. Under this model, PCA does not provide the best recovery of the underlying signal in terms of mean squared error. Following the same principle as in ridge regression, we propose a regularised version of PCA that boils down to thresholding the singular values: each singular value is multiplied by a term which can be seen as the ratio of the signal variance over the total variance of the associated dimension. The regularisation term is derived analytically from asymptotic results and can also be justified by a Bayesian treatment of the model. Regularised PCA provides promising results in terms of recovery of the true signal and of the graphical outputs, in comparison with classical PCA and with a soft-thresholding estimation strategy. The gap between PCA and regularised PCA increases as the data become noisier.
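    The shrinkage described above can be sketched in a few lines of base R: keep the first k dimensions and multiply each retained singular value by an estimated signal-to-total variance ratio. The noise-variance estimator below is a simple plug-in (the mean of the discarded eigenvalues) rather than the degrees-of-freedom-corrected estimator used in the paper.

      # Sketch of regularised PCA via shrunken singular values (base R illustration).
      regularised_pca <- function(X, k) {
        Xc  <- scale(X, scale = FALSE)
        s   <- svd(Xc)
        lam <- s$d^2 / (nrow(X) - 1)                 # variance carried by each dimension
        sigma2 <- mean(lam[-(1:k)])                  # plug-in noise-variance estimate
        shrink <- pmax((lam[1:k] - sigma2) / lam[1:k], 0)
        s$u[, 1:k, drop = FALSE] %*% diag(s$d[1:k] * shrink, k) %*% t(s$v[, 1:k, drop = FALSE])
      }
      set.seed(9)
      signal <- tcrossprod(matrix(rnorm(50 * 2), 50, 2), matrix(rnorm(10 * 2), 10, 2))
      X <- signal + matrix(rnorm(50 * 10), 50, 10)
      denoised <- regularised_pca(X, k = 2)          # compare with the unshrunk rank-2 reconstruction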
Affiliations are detected from the signatures of publications identified in scanR. An author can therefore appear to be affiliated with several structures or supervisors according to these signatures. The dates displayed correspond only to the dates of the publications found. For more information, see https://scanr.enseignementsup-recherche.gouv.fr