Cross-lingual Text Classification Using Topic-Dependent Word Probabilities

Authors: Akihiro Tamura, Kunihiko Sadamasa, Masaaki Tsuchida, Daniel Andrade
Year of publication: 2015
Subject:
Source: HLT-NAACL
DOI: 10.3115/v1/n15-1170
Description: Cross-lingual text classification is a major challenge in natural language processing, since often training data is available in only one language (target language), but not available for the language of the document we want to classify (source language). Here, we propose a method that only requires a bilingual dictionary to bridge the language gap. Our proposed probabilistic model allows us to estimate translation probabilities that are conditioned on the whole source document. The assumption of our probabilistic model is that each document can be characterized by a distribution over topics that help to solve the translation ambiguity of single words. Using the derived translation probabilities, we then calculate the expected word frequency of each word type in the target language. Finally, these expected word frequencies can be used to classify the source text with any classifier that was trained using only target language documents. Our experiments confirm the usefulness of our proposed method.
Database: OpenAIRE
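
Illustration (not part of the catalog record): the abstract's step of turning topic-dependent translation probabilities into expected target-language word frequencies can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' model: it assumes the document's topic distribution and the per-topic translation probabilities are already given (the paper estimates them), and it assumes the document-conditioned translation probability is obtained by simply marginalizing over topics. The function name, argument names, and the toy English-to-German dictionary are all hypothetical.

```python
from collections import Counter


def expected_target_counts(source_tokens, topic_probs, translation_probs):
    """Expected frequency of each target-language word for one source document.

    source_tokens:     list of source-language word tokens in the document
    topic_probs:       dict topic -> p(topic | document), assumed given
    translation_probs: dict (source_word, topic) ->
                       dict target_word -> p(target_word | source_word, topic)
    """
    expected = Counter()
    for word in source_tokens:
        for topic, p_topic in topic_probs.items():
            # Source words without a dictionary entry contribute nothing.
            candidates = translation_probs.get((word, topic), {})
            for target, p_trans in candidates.items():
                # Assumption: marginalize the translation choice over topics.
                expected[target] += p_topic * p_trans
    return dict(expected)


# Toy example: "bank" is ambiguous, and the document is mostly about nature.
doc = ["bank", "river"]
topics = {"finance": 0.2, "nature": 0.8}
dictionary = {
    ("bank", "finance"): {"Bank": 0.9, "Ufer": 0.1},
    ("bank", "nature"): {"Bank": 0.1, "Ufer": 0.9},
    ("river", "finance"): {"Fluss": 1.0},
    ("river", "nature"): {"Fluss": 1.0},
}
counts = expected_target_counts(doc, topics, dictionary)
```

In this toy case most of the probability mass for "bank" goes to "Ufer" because the document leans toward the nature topic. The resulting dictionary of expected counts could then be fed, as a bag-of-words feature vector, to any classifier trained on target-language documents.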