We study the problem of linking information between different idiomatic usages of the same language, for example, colloquial and formal language. We propose a novel probabilistic topic model called multi-idiomatic LDA (MiLDA). Its modeling principles follow the intuition that certain words are shared between two idioms of the same language, while other words are non-shared. We demonstrate the ability of our model to learn relations between cross-idiomatic topics in a dataset containing product descriptions and reviews. We present the utility of the new MiLDA topic model in a recently proposed information retrieval task of linking Pinterest pins to online webshops . We show that our multi-idiomatic model outperforms the standard monolingual LDA model and the pure bilingual LDA model both in terms of perplexity and MAP scores in the IR task.
Ivan Vulic, Susana Zoghbi and Sien Moens
ACM SIGIR Conference on Research & Development in Information Retrieval (SIGIR ’14)