Latent Dirichlet Allocation for Linking User-Generated Content and e-Commerce Data

Automatic linking of online content improves navigation possibilities for end users. We focus on linking content generated by users to other relevant sites. In particular, we study the problem of linking information between different usages of the same language, e.g., colloquial and formal idioms or the language of consumers versus the language of sellers. The challenge is that the same items are described using very distinct vocabularies. As a case study, we investigate a new task of linking textual pins (colloquial) to online webshops (formal). We evaluate three different modeling paradigms based on probabilistic topic modeling: monolingual latent Dirichlet allocation (LDA), bilingual LDA (BiLDA) and a novel multi-idiomatic LDA model (MiLDA). We compare these to the unigram model with Dirichlet prior. Our results for all three topic models reveal the usefulness of modeling the hidden thematic structure of the data through topics. Our proposed MiLDA model is able to deal with intrinsic multi-idiomatic data by considering the shared vocabulary between the aligned document pairs.

Examples of items that may be linked between user-generated content and e-commerce data. We see the difference in language usage between social media and online products. On each row, the items are the same (or very similar), but the textual description differs. This difference in language makes it difficult to link the items as referring to related objects.
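To make the baseline named in the abstract concrete, the following is a minimal sketch of a unigram language model with a Dirichlet prior, used to rank product descriptions against a pin. It is not the authors' implementation: the function names, the toy data, and the smoothing value `mu` are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def dirichlet_log_likelihood(query_tokens, doc_tokens, coll_counts, coll_size, mu):
    """Log-likelihood of the query under a Dirichlet-smoothed unigram model of the document."""
    doc_counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    score = 0.0
    for word in query_tokens:
        p_coll = coll_counts.get(word, 0) / coll_size
        if p_coll == 0.0:
            # Words never seen in the product collection carry no evidence; skip them.
            continue
        # Dirichlet smoothing: interpolate document counts with the collection model.
        p_word = (doc_counts[word] + mu * p_coll) / (doc_len + mu)
        score += math.log(p_word)
    return score


def rank_products(pin_tokens, products, mu=10.0):
    """Rank product descriptions (name -> token list) by smoothed query likelihood."""
    coll_counts = Counter(tok for toks in products.values() for tok in toks)
    coll_size = sum(coll_counts.values())
    scored = [
        (name, dirichlet_log_likelihood(pin_tokens, toks, coll_counts, coll_size, mu))
        for name, toks in products.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: a colloquial pin and two formal product descriptions.
toy_products = {
    "boots": ["leather", "ankle", "boots", "women"],
    "lamp": ["desk", "lamp", "led", "metal"],
}
ranking = rank_products(["cute", "leather", "boots"], toy_products)
```

A model of this kind can only match words that appear on both sides, which is the vocabulary gap the abstract describes; the topic models are evaluated against it for that reason.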

Susana Zoghbi, Ivan Vulic, Sien Moens
Information Sciences, 2016
