Statistical learning from non-uniformed categorical variables.

Authors
  • CERDA REYES Patricio
  • VAROQUAUX Gael
  • SCHOENAUER Marc
  • CHARLIN Laurent
  • GAIFFAS Stephane
  • BOUVEYRON Charles
  • VALDURIEZ Patrick
  • KEGL Balazs
Publication date
2019
Publication type
Thesis
Summary
Tabular data often contain categorical variables, which are considered non-numerical inputs with a fixed and limited number of unique elements, called categories. Many statistical learning algorithms require a numerical representation of categorical variables, so an encoding step is needed to transform these inputs into vectors. Several strategies exist for this purpose. The most common is one-hot encoding, which works well in classical statistical analysis (in terms of predictive power and interpretation) when the number of categories remains low. However, non-uniformed categorical data, that is, entries that have not been cleaned or standardized, are prone to high cardinality and redundancy: entries may share semantic and/or morphological information, so several distinct entries may refer to the same entity. Without a prior cleaning or aggregation step, common encoding methods may lose effectiveness because the resulting vector representation is erroneous. Moreover, the dimension of the encoded vectors tends to grow with the amount of data, which hinders their use in big data analysis. In this thesis, we study a series of encoding methods that work directly on high-cardinality categorical variables, without the need to process them in advance. Using experiments conducted on real and simulated data, we demonstrate that the methods proposed in this thesis improve supervised learning, in part because of their ability to correctly capture morphological information from the inputs. Even with large data, these methods prove to be efficient, and in some cases they generate vectors that are easily interpretable. Therefore, our methods can be applied to automated machine learning (AutoML) without any human intervention.
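The contrast drawn in the summary can be illustrated with a minimal Python sketch. It is not taken from the thesis: the function names, the sample job titles, and the use of Jaccard similarity over character 3-grams are assumptions chosen only to show why one-hot encoding ignores morphological overlap between entries while a string-based comparison can capture it.

```python
def one_hot(values):
    """Give each distinct string its own indicator vector (one dimension per category)."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    return {
        c: [1 if i == index[c] else 0 for i in range(len(categories))]
        for c in categories
    }


def char_ngrams(text, n=3):
    """Set of character n-grams of a lowercased string padded with one space on each side."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two strings (illustrative choice)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        return 0.0
    return len(grams_a & grams_b) / len(union)


# Hypothetical uncleaned entries: the first two plausibly describe the same kind of job.
entries = ["Police Officer", "police officer II", "Firefighter"]

vectors = one_hot(entries)
# One-hot vectors of distinct entries are orthogonal, so the overlap is invisible.
dot = sum(x * y for x, y in zip(vectors["Police Officer"], vectors["police officer II"]))

# A morphological comparison ranks the near-duplicate above the unrelated entry.
sim_close = ngram_similarity("Police Officer", "police officer II")
sim_far = ngram_similarity("Police Officer", "Firefighter")
print(dot, sim_close > sim_far)
```

In this sketch every new spelling adds one dimension to the one-hot representation, which is the cardinality problem the summary describes, whereas the n-gram comparison operates on the raw strings without a prior cleaning step.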