Statistical inference with incomplete and high-dimensional data - modeling polytraumatized patients.

Authors
  • JIANG Wei
  • JOSSE Julie
  • LAVIELLE Marc
  • THIRION Bertrand
  • YEKUTIELI Daniel
  • LECLERCQ SAMSON Adeline
  • NEUVIAL Pierre
Publication date
2020
Publication type
Thesis
Summary
The problem of missing data has existed since the early days of data analysis, as missing values are tied to the process of obtaining and preparing data. In applications of modern statistics and machine learning, where data collection is becoming increasingly complex and multiple sources of information are combined, large databases often contain very high numbers of missing values. These data therefore present significant methodological and technical challenges for analysis: from visualization to modeling, estimation, variable selection, predictive capabilities and implementation. Furthermore, although high-dimensional data with missing values are now a common difficulty in statistical analysis, only a few solutions are available.
The objective of this thesis is to develop new methodologies for performing statistical inference with missing data, in particular for high-dimensional data. The most important contribution is a complete framework for dealing with missing values, from estimation to model selection, based on likelihood approaches. The proposed method does not rely on a specific pattern of missingness, and offers a good balance between inference quality and efficient implementation.
The contributions of the thesis fall into three parts. In Chapter 2, we focus on logistic regression with missing values in a joint modeling framework, using a stochastic approximation of the EM algorithm. We study parameter estimation, variable selection and prediction for new incomplete observations. Through extensive simulations, we show that the estimators are unbiased and have good properties in terms of confidence interval coverage, outperforming the popular imputation-based approach. The method is then applied to pre-hospital data to predict the risk of hemorrhagic shock, in collaboration with medical partners, the Traumabase group of Paris hospitals. The proposed model improves the prediction of the risk of bleeding compared with the prediction made by physicians.
In Chapters 3 and 4, we focus on model selection for high-dimensional incomplete data, with the particular aim of controlling false discoveries. For linear models, the adaptive Bayesian version of SLOPE (ABSLOPE) that we propose in Chapter 3 addresses these issues by incorporating sorted ℓ1 regularization in a Bayesian "spike and slab" framework. In Chapter 4, which targets more general models than linear regression, we consider these issues in a so-called "model-X" framework, where the conditional distribution of the response given the covariates is not specified. To do so, we combine a "knockoff" methodology with multiple imputation. Through a comprehensive simulation study, we demonstrate satisfactory performance in terms of power, FDR and estimation bias for a wide range of scenarios. In the application to the medical dataset, we build a model to predict patients' platelet levels from pre-hospital and hospital data.
Finally, we provide two open source software packages with tutorials, to support decision making in the medical field and to help users facing missing values.