Cross-lingual Text Classification Using Topic-Dependent Word Probabilities
Author: | Akihiro Tamura, Kunihiko Sadamasa, Masaaki Tsuchida, Daniel Andrade |
---|---|
Year of publication: | 2015 |
Subject: | business.industry, Computer science, Bilingual dictionary, media_common.quotation_subject, Statistical model, Ambiguity, Translation (geometry), computer.software_genre, Word lists by frequency, Source text, Artificial intelligence, business, computer, Classifier (UML), Natural language processing, Word (computer architecture), media_common |
Source: | HLT-NAACL |
DOI: | 10.3115/v1/n15-1170 |
Description: | Cross-lingual text classification is a major challenge in natural language processing because training data is often available in only one language (the target language) and not in the language of the document we want to classify (the source language). Here, we propose a method that requires only a bilingual dictionary to bridge the language gap. Our probabilistic model estimates translation probabilities that are conditioned on the whole source document. The model assumes that each document can be characterized by a distribution over topics, which helps resolve the translation ambiguity of single words. Using the derived translation probabilities, we then calculate the expected word frequency of each word type in the target language. Finally, these expected word frequencies can be used to classify the source text with any classifier trained only on target-language documents. Our experiments confirm the usefulness of the proposed method. |
Database: | OpenAIRE |
External link: |
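
The expected-word-frequency step mentioned in the description can be illustrated with a minimal sketch. It is not the authors' model: it assumes, purely for illustration, that the document's topic distribution and per-topic translation probabilities are already given, and that the document-conditioned translation probability is a simple mixture over topics. The function name `expected_target_counts`, the topic labels, and the toy dictionary entries are all hypothetical.

```python
from collections import Counter, defaultdict


def expected_target_counts(source_tokens, topic_probs, translation_probs):
    """Expected target-language word counts for one source document.

    Assumptions (not taken from the paper):
      topic_probs:       topic -> p(topic | document)
      translation_probs: (source_word, topic) -> {target_word: p(target | source, topic)}
    """
    source_counts = Counter(source_tokens)
    expected = defaultdict(float)
    for source_word, count in source_counts.items():
        for topic, p_topic in topic_probs.items():
            # Dictionary candidates for this source word under this topic.
            candidates = translation_probs.get((source_word, topic), {})
            for target_word, p_trans in candidates.items():
                # Each occurrence contributes p(topic | doc) * p(target | source, topic).
                expected[target_word] += count * p_topic * p_trans
    return dict(expected)


# Toy example: an ambiguous source word resolved by the document's topics.
tokens = ["banco", "banco", "dinero"]
topics = {"finance": 0.75, "park": 0.25}
table = {
    ("banco", "finance"): {"bank": 1.0},
    ("banco", "park"): {"bench": 1.0},
    ("dinero", "finance"): {"money": 1.0},
    ("dinero", "park"): {"money": 1.0},
}
counts = expected_target_counts(tokens, topics, table)
```

The resulting dictionary of expected counts plays the role of a bag-of-words feature vector in the target language, which could then be passed to a classifier trained on target-language documents.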