Cross-lingual text alignment for fine-grained plagiarism detection

Authors:	Nava Ehsan, Azadeh Shakery, Frank Wm. Tompa
Year of publication:	2018
Subject:	Cross lingual; Computer science; business.industry; Text alignment; 05 social sciences; 02 engineering and technology; Library and Information Sciences; Document analysis; Translation (geometry); computer.software_genre; Conjunction (grammar); 0202 electrical engineering, electronic engineering, information engineering; 020201 artificial intelligence & image processing; Plagiarism detection; Artificial intelligence; 0509 other social sciences; 050904 information & library sciences; business; computer; Natural language processing; Information Systems; Range (computer programming)
Source:	Journal of Information Science. 45:443-459
ISSN:	1741-6485 0165-5515
Description:	Fast and easy access to a wide range of documents in various languages, together with the wide availability of translation and editing tools, has created a need for effective tools for detecting cross-lingual plagiarism. Given a suspicious document, cross-lingual plagiarism detection comprises two main subtasks: retrieving documents that are candidate sources for that document, and analysing those candidates one by one to determine their similarity to the suspicious document. In this article, we examine the second subtask, also called the detailed analysis subtask, where the goal is to align plagiarised fragments from source and suspicious documents in different languages. Our proposed approach has two main steps: the first finds candidate plagiarised fragments and focuses on high recall; the second is a more precise similarity analysis based on dynamic text alignment, which filters the results by finding alignments between the identified fragments. With these two steps, the proximity of terms is considered at different levels of granularity. In both steps, our approach uses a dictionary to obtain translations of individual terms instead of using a machine translation system to convert longer passages from one language to another. We use a weighting scheme to distinguish among multiple translations of a term. Experimental results show that our method outperforms the methods used by the systems that achieved the best results in the PAN-2012 and PAN-2014 competitions.
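The description names two ingredients, dictionary-based term translation with weights and alignment of term sequences, without giving details. The sketch below shows one way these could fit together; it is not the authors' method. It assumes a Smith-Waterman-style local alignment as a stand-in for "dynamic text alignment", and the names `TOY_DICT`, `term_similarity`, and `local_alignment`, the Spanish-English entries, the translation weights, and the gap and mismatch penalties are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical toy dictionary: source term -> {translation: weight}.
# The weights are invented values standing in for a translation weighting scheme.
TOY_DICT: Dict[str, Dict[str, float]] = {
    "gato": {"cat": 1.0},
    "negro": {"black": 0.8, "dark": 0.2},
    "duerme": {"sleeps": 0.7, "sleep": 0.3},
}


def term_similarity(src_term: str, susp_term: str) -> float:
    """Weight of susp_term as a translation of src_term, or 0.0 if unlisted."""
    return TOY_DICT.get(src_term, {}).get(susp_term, 0.0)


def local_alignment(
    src: List[str],
    susp: List[str],
    gap: float = 0.5,       # assumed penalty for skipping a term
    mismatch: float = 0.5,  # assumed penalty for pairing unrelated terms
) -> Tuple[float, int, int]:
    """Smith-Waterman-style local alignment over two term sequences.

    Returns (best score, end position in src, end position in susp),
    where positions count terms consumed (1-based ends).
    """
    rows, cols = len(src) + 1, len(susp) + 1
    score = [[0.0] * cols for _ in range(rows)]
    best = (0.0, 0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sim = term_similarity(src[i - 1], susp[j - 1])
            diag = score[i - 1][j - 1] + (sim if sim > 0 else -mismatch)
            cell = max(0.0, diag, score[i - 1][j] - gap, score[i][j - 1] - gap)
            score[i][j] = cell
            if cell > best[0]:
                best = (cell, i, j)
    return best


if __name__ == "__main__":
    source = ["el", "gato", "negro", "duerme"]
    suspicious = ["the", "black", "cat", "sleeps", "now"]
    print(local_alignment(source, suspicious))
```

Because each term is scored through dictionary lookups, no full machine translation of either passage is needed; a high-recall candidate step could use the same `term_similarity` function with a looser threshold before this stricter alignment is applied.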
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::fd02fde2068dfff755280356cd500cee https://doi.org/10.1177/0165551518787696