LINA: Identifying Comparable Documents from Wikipedia
Authors: | Florian Boudin, Elizaveta Loginova-Clouet, Emmanuel Morin, Amir Hazem |
---|---|
Contributors: | TALN, Laboratoire d'Informatique de Nantes Atlantique (LINA), Mines Nantes-Université de Nantes (UN)-Centre National de la Recherche Scientifique (CNRS), ANR-12-CORD-0020, CRISTAL, Contextes RIches en connaissanceS pour la TrAduction terminoLogique (2012) |
Language: | English |
Publication year: | 2015 |
Subject: | |
Source: | 8th Workshop on Building and Using Comparable Corpora (BUCC), Jul 2015, Beijing, China. BUCC@ACL/IJCNLP |
Description: | International audience; This paper describes the LINA system for the BUCC 2015 shared track. Following Enright and Kondrak (2007), our system identifies comparable documents by collecting counts of hapax words. We extend this method by filtering out document pairs that share target documents, using pigeonhole reasoning and cross-lingual information. |
Database: | OpenAIRE |
External link: |
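
The description names two steps: counting hapax words (words that occur exactly once in a document) and filtering pairs that share a target document. The record gives no implementation details, so the following is only a minimal sketch under stated assumptions: whitespace tokenization, matching hapax words by identical spelling across languages, and a pigeonhole filter that keeps the single highest-scoring source for each target. The function names `hapax_words`, `best_targets`, and `filter_shared_targets` are hypothetical and do not come from the paper, and the cross-lingual information mentioned in the description is not modeled.

```python
from collections import Counter


def hapax_words(text):
    """Return the set of words that occur exactly once in the text."""
    counts = Counter(text.lower().split())
    return {word for word, count in counts.items() if count == 1}


def best_targets(sources, targets):
    """Map each source id to (target id, score), where score is the
    number of hapax words shared with the best-matching target.
    Sources with no shared hapax word are left unpaired."""
    target_hapax = {tid: hapax_words(text) for tid, text in targets.items()}
    pairs = {}
    for sid, text in sources.items():
        src = hapax_words(text)
        score, tid = max((len(src & th), tid) for tid, th in target_hapax.items())
        if score > 0:
            pairs[sid] = (tid, score)
    return pairs


def filter_shared_targets(pairs):
    """Assumed pigeonhole filter: a target can match at most one source,
    so when several sources pick the same target, keep the best-scoring one."""
    best = {}
    for sid, (tid, score) in pairs.items():
        if tid not in best or score > best[tid][1]:
            best[tid] = (sid, score)
    return {sid: tid for tid, (sid, _) in best.items()}


# Toy data, invented for illustration only.
sources = {"s1": "paris 1889 eiffel tower tower", "s2": "paris 1889 louvre"}
targets = {"t1": "tour eiffel paris 1889", "t2": "musee louvre"}
pairs = best_targets(sources, targets)
aligned = filter_shared_targets(pairs)
```

In this toy example both sources initially select `t1`, and the filter keeps only `s1`, which shares more hapax words with it. A fuller system would presumably re-match the discarded source against the remaining targets, which this sketch does not attempt.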