Cross-lingual text alignment for fine-grained plagiarism detection

Autor: Nava Ehsan, Azadeh Shakery, Frank Wm. Tompa
Rok vydání: 2018
Předmět:
Zdroj: Journal of Information Science. 45:443-459
ISSN: 1741-6485
0165-5515
Popis: Fast and easy access to a wide range of documents in various languages, in conjunction with the wide availability of translation and editing tools, has led to the need to develop effective tools for detecting cross-lingual plagiarism. Given a suspicious document, cross-lingual plagiarism detection comprises two main subtasks: retrieving documents that are candidate sources for that document and analysing those candidates one by one to determine their similarity to the suspicious document. In this article, we examine the second subtask, also called the detailed analysis subtask, where the goal is to align plagiarised fragments from source and suspicious documents in different languages. Our proposed approach has two main steps: the first step tries to find candidate plagiarised fragments and focuses on high recall, followed by a more precise similarity analysis based on dynamic text alignment that will filter the results by finding alignments between the identified fragments. With these two steps, the proximity of the terms will be considered in different levels of granularity. In both steps, our approach uses a dictionary to obtain translations of individual terms instead of using a machine translation system to convert longer passages from one language to another. We used a weighting scheme to distinct multiple translations of the terms. Experimental results show that our method outperforms the methods used by the systems that achieved the best results in the PAN-2012 and PAN-2014 competitions.
Databáze: OpenAIRE