Cross-lingual Text Classification Using Topic-Dependent Word Probabilities

Authors:	Akihiro Tamura, Kunihiko Sadamasa, Masaaki Tsuchida, Daniel Andrade
Year of publication:	2015
Subject:	business.industry; Computer science; Bilingual dictionary; media_common.quotation_subject; Statistical model; Ambiguity; Translation (geometry); computer.software_genre; Word lists by frequency; Source text; Artificial intelligence; business; computer; Classifier (UML); Natural language processing; Word (computer architecture); media_common
Source:	HLT-NAACL
DOI:	10.3115/v1/n15-1170
Description:	Cross-lingual text classification is a major challenge in natural language processing, since training data is often available in only one language (the target language) but not in the language of the document we want to classify (the source language). Here, we propose a method that requires only a bilingual dictionary to bridge the language gap. Our proposed probabilistic model allows us to estimate translation probabilities that are conditioned on the whole source document. The model assumes that each document can be characterized by a distribution over topics, which helps to resolve the translation ambiguity of single words. Using the derived translation probabilities, we then calculate the expected word frequency of each word type in the target language. Finally, these expected word frequencies can be used to classify the source text with any classifier that was trained using only target-language documents. Our experiments confirm the usefulness of our proposed method.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::9a39e4fcc80cacbc8d25de1594f12591 https://doi.org/10.3115/v1/n15-1170
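
Illustrative sketch:	The description outlines a pipeline: estimate document-conditioned translation probabilities, then compute expected target-language word frequencies that a target-language classifier can consume. The Python below is a minimal sketch of that idea only, not the authors' model. It assumes that the document's topic distribution p(z | d) and per-topic target word probabilities p(t | z) are already given, and that p(t | s, z) is p(t | z) renormalized over the dictionary translations of s. The function name, parameter names, and toy German-English data are all hypothetical.

```python
from collections import Counter, defaultdict


def expected_target_counts(source_tokens, dictionary, topic_probs, word_given_topic):
    """Expected target-language word counts for one source document.

    source_tokens:     list of source-language tokens
    dictionary:        dict mapping a source word to its candidate translations
    topic_probs:       list with p(z | d) for each topic z (assumed given)
    word_given_topic:  list of dicts, word_given_topic[z][t] = p(t | z) (assumed given)
    """
    expected = defaultdict(float)
    for source_word, count in Counter(source_tokens).items():
        candidates = dictionary.get(source_word, [])
        if not candidates:
            continue  # words without a dictionary entry are skipped
        for z, p_z in enumerate(topic_probs):
            # Assumption: p(t | s, z) is p(t | z) renormalized over the candidates.
            weights = [word_given_topic[z].get(t, 0.0) for t in candidates]
            total = sum(weights)
            if total == 0.0:
                # No candidate is known under this topic: fall back to uniform.
                weights = [1.0] * len(candidates)
                total = float(len(candidates))
            for target_word, weight in zip(candidates, weights):
                expected[target_word] += count * p_z * weight / total
    return dict(expected)


# Toy data (hypothetical): German "Bank" is ambiguous between "bank" and "bench".
dictionary = {"Bank": ["bank", "bench"], "Geld": ["money"]}
word_given_topic = [
    {"bank": 0.30, "bench": 0.01, "money": 0.40},  # topic 0: finance
    {"bank": 0.01, "bench": 0.30},                 # topic 1: parks
]
topic_probs = [0.9, 0.1]  # the document is mostly about finance

counts = expected_target_counts(
    ["Bank", "Bank", "Geld"], dictionary, topic_probs, word_given_topic
)
```

The resulting dictionary of expected counts plays the role of a bag-of-words vector in the target language, so under these assumptions it could be passed to any classifier trained on target-language documents. Because the topic distribution sums to one, the expected counts sum to the number of source tokens that have a dictionary entry.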