Statistical learning from non-uniformed categorical variables.

Authors

CERDA REYES Patricio
VAROQUAUX Gael
SCHOENAUER Marc
CHARLIN Laurent
GAIFFAS Stephane
BOUVEYRON Charles
VALDURIEZ Patrick
KEGL Balazs

Publication date

2019

Publication type

Thesis

Summary

Tabular data often contain categorical variables: non-numerical inputs that take a fixed and limited number of unique values, called categories. Many statistical learning algorithms require a numerical representation of categorical variables, so an encoding step is needed to transform these inputs into vectors. Several strategies exist for this purpose. The most common is one-hot encoding, which works well in classical statistical analysis, in terms of both predictive power and interpretability, when the number of categories remains low. However, non-uniformed categorical data are prone to high cardinality and redundancy: entries may share semantic and/or morphological information, and several entries may therefore denote the same entity. Without a prior cleaning or aggregation step, common encoding methods may lose efficiency because the resulting vector representation is erroneous. Moreover, the risk of obtaining very high-dimensional vectors grows with the amount of data, which prevents their use in big data analysis. In this thesis, we study a series of encoding methods that work directly on high-cardinality categorical variables, without any prior processing. Using experiments on real and simulated data, we demonstrate that the proposed methods improve supervised learning, in part because they correctly capture morphological information from the inputs. Even with large data, these methods prove to be efficient, and in some cases they generate vectors that are easily interpretable. Our methods can therefore be applied to automated machine learning (AutoML) without any human intervention.
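
The contrast drawn in the summary, between one-hot encoding and an encoding that captures morphological information, can be illustrated with a minimal sketch. This is not the method of the thesis: it uses Jaccard similarity over character 3-grams as a simple stand-in, and the function names and sample strings are hypothetical, chosen only for illustration.

```python
def one_hot(values):
    """Give each distinct string its own indicator column."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    vectors = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        vectors.append(row)
    return categories, vectors


def char_ngrams(s, n=3):
    """Set of character n-grams of a lowercased, space-padded string."""
    padded = f" {s.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two strings."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    if not ga and not gb:
        return 1.0  # two strings with no n-grams are treated as identical
    return len(ga & gb) / len(ga | gb)


def similarity_encode(values, prototypes, n=3):
    """Encode each value by its similarity to a fixed list of prototypes."""
    return [[ngram_similarity(v, p, n) for p in prototypes] for v in values]


# Hypothetical dirty category column: three spellings of one entity, plus another.
jobs = ["police officer", "Police Officer II", "police oficer", "accountant"]

# One-hot: every spelling becomes a separate, mutually orthogonal column.
categories, onehot_vectors = one_hot(jobs)

# Similarity encoding against two assumed prototype categories.
prototypes = ["police officer", "accountant"]
sim_vectors = similarity_encode(jobs, prototypes)

for job, vec in zip(jobs, sim_vectors):
    print(job, [round(x, 2) for x in vec])
```

With one-hot encoding, the dimension equals the number of distinct strings and the spelling variants share nothing. With the similarity-based vectors, the dimension is fixed by the number of prototypes, and variants of the same entity land close to each other.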
