Inference and applications for topic models.

Authors
Publication date
2017
Publication type
Thesis
Summary
Most current recommendation systems rely on ratings (i.e., numbers between 0 and 5) to recommend content (a movie, a restaurant, etc.) to a user. The user often has the option of commenting on this content in free-text form in addition to rating it. It is difficult to extract information from raw text, while a single rating carries little information about the content and the user. In this thesis, we aim to suggest a personalized, readable text to the user to help him or her quickly form an opinion about a piece of content. More specifically, we first build a topic model that predicts a personalized movie description from textual reviews. Our model separates qualitative (i.e., opinion) topics from descriptive topics by combining textual reviews and numerical ratings in a joint probabilistic model. We evaluate the model on an IMDB dataset and illustrate its performance through topic comparisons.

We then study parameter inference in large-scale latent variable models, which include most topic models. We propose a unified treatment of online inference for latent variable models from non-canonical exponential families and make explicit the links between several previously proposed frequentist and Bayesian methods. We also propose a new inference method for frequentist parameter estimation that adapts MCMC methods to the online inference of latent variable models through the proper use of local Gibbs sampling. For the latent Dirichlet allocation (LDA) topic model, we provide an extensive set of experiments and comparisons with existing work, in which our new approach performs better than previously proposed methods.

Finally, we propose a new class of determinantal point processes (DPPs) that can be manipulated for parameter inference and learning in time potentially sub-linear in the number of objects. This class, based on a specific low-rank factorization of the marginal kernel, is particularly suited to a subclass of continuous DPPs and to DPPs defined over an exponential number of objects. We apply this class to model text documents as samples of a DPP over sentences and propose a conditional maximum-likelihood formulation for modeling topic proportions, which is possible without any approximation with our class of DPPs. We present an application to document summarization with a DPP over 2^500 objects, where the summaries are composed of readable sentences.
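As a point of reference for the local Gibbs sampling mentioned above, the sketch below shows a generic per-document Gibbs sweep for LDA, in which the topic assignment of each word is resampled given fixed topic-word probabilities. It is only a minimal illustration of what "local" sampling means here; the function name resample_document, the argument names, and the choice to hold the topics fixed are assumptions made for this example, not the online estimator proposed in the thesis.

import numpy as np

def resample_document(words, z, topics, alpha, rng):
    # One local Gibbs sweep over the topic assignments z of a single document.
    #   words  : array of word ids for this document
    #   z      : current topic assignment (integer) of each word, updated in place
    #   topics : (K, V) array of topic-word probabilities, held fixed during the sweep
    #   alpha  : symmetric Dirichlet prior on the document's topic proportions
    #   rng    : numpy random Generator, e.g. np.random.default_rng()
    K = topics.shape[0]
    counts = np.bincount(z, minlength=K).astype(float)  # per-document topic counts
    for n, w in enumerate(words):
        counts[z[n]] -= 1.0                      # remove word n from its current topic
        p = (counts + alpha) * topics[:, w]      # unnormalized p(z_n = k | rest)
        p /= p.sum()
        z[n] = rng.choice(K, p=p)                # resample the topic of word n
        counts[z[n]] += 1.0                      # put word n back with its new topic
    return z

In an online scheme, sweeps of this kind over successive documents or mini-batches would feed a global update of the topic parameters; that update rule is the object of the thesis and is not reproduced here.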
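For reference, the marginal kernel mentioned in the last part of the summary is the standard DPP object: a random subset Y of a ground set of N items is determinantal with marginal kernel K (an N x N matrix with eigenvalues in [0, 1]) when

\[
\mathbb{P}(A \subseteq Y) = \det(K_A) \qquad \text{for every } A \subseteq \{1, \dots, N\},
\]

where K_A is the submatrix of K indexed by A. A generic low-rank form K = V V^\top, with V of size N x r and r much smaller than N, is one standard way to avoid manipulating the full N x N kernel; the specific factorization and the sub-linear inference scheme studied in the thesis are not reproduced here.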