LINA: Identifying Comparable Documents from Wikipedia

Authors: Florian Boudin, Elizaveta Loginova-Clouet, Emmanuel Morin, Amir Hazem
Contributors: TALN, Laboratoire d'Informatique de Nantes Atlantique (LINA), Mines Nantes (Mines Nantes)-Université de Nantes (UN)-Centre National de la Recherche Scientifique (CNRS), ANR-12-CORD-0020, CRISTAL, Contextes RIches en connaissanceS pour la TrAduction terminoLogique (2012)
Language: English
Publication year: 2015
Subject:
Source: 8th Workshop on Building and Using Comparable Corpora (BUCC), Jul 2015, Beijing, China
BUCC@ACL/IJCNLP
Description: International audience; This paper describes the LINA system for the BUCC 2015 shared track. Following Enright and Kondrak (2007), our system identifies comparable documents by collecting counts of hapax words. We extend this method by filtering out document pairs sharing target documents using pigeonhole reasoning and cross-lingual information.
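Illustration: the abstract gives no implementation details, so the following is only a minimal sketch of one plausible reading of the approach. It assumes that a hapax word is a token occurring exactly once in a document, that each source document is paired with the target document sharing the most hapax words, and that the pigeonhole step keeps only the highest-scoring source when several sources select the same target. The function names (`hapax_words`, `best_targets`, `resolve_shared_targets`) and the dictionary-based input format are hypothetical, and the cross-lingual filtering mentioned in the abstract is not covered.

```python
from collections import Counter


def hapax_words(tokens):
    """Return the set of tokens that occur exactly once in a document."""
    counts = Counter(tokens)
    return {word for word, count in counts.items() if count == 1}


def best_targets(source_docs, target_docs):
    """For each source document, pick the target sharing the most hapax words.

    Both arguments map a document id to its list of tokens (assumed format).
    Returns {source_id: (target_id, shared_hapax_count)}; sources that share
    no hapax word with any target are left out.
    """
    target_hapax = {tid: hapax_words(toks) for tid, toks in target_docs.items()}
    candidates = {}
    for sid, tokens in source_docs.items():
        src_hapax = hapax_words(tokens)
        best_tid, best_score = None, 0
        # Iterate in sorted order so that ties are broken deterministically.
        for tid in sorted(target_hapax):
            score = len(src_hapax & target_hapax[tid])
            if score > best_score:
                best_tid, best_score = tid, score
        if best_tid is not None:
            candidates[sid] = (best_tid, best_score)
    return candidates


def resolve_shared_targets(candidates):
    """Pigeonhole-style filter (assumed): a target keeps one source at most.

    When several sources selected the same target, only the pair with the
    highest shared-hapax count survives; the others are discarded.
    """
    winner = {}  # target_id -> (source_id, score)
    for sid in sorted(candidates):
        tid, score = candidates[sid]
        if tid not in winner or score > winner[tid][1]:
            winner[tid] = (sid, score)
    return {sid: tid for tid, (sid, _score) in winner.items()}
```

Raw token overlap across languages can only pick up strings that are written identically in both documents, such as proper names and numbers, which is the kind of signal this sketch relies on.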
Database: OpenAIRE