A monolingual approach to detection of text reuse in Russian-English collection

Authors:	Oleg Bakhteev, Rita Kuznetsova, Alexey Romanov, Anton Khritankov
Year of publication:	2015
Subject:	Information retrieval; Computer science; Cosine similarity; Sample (statistics); Snippet; Reuse; Similarity (network science); Metric (mathematics); Quality (business); Evaluation of machine translation; Artificial intelligence; Natural language processing
Source:	ResearcherID
Description:	In this paper we develop a method for cross-lingual (Russian-English) text reuse detection. The method follows a monolingual approach: texts are translated into one language, which reduces the task to a text similarity problem. We split texts into non-overlapping fragments and compare the fragments to each other using several metrics: BLEU(1–2), METEOR, cosine similarity between bag-of-words representations of each snippet, and cosine similarity between vectors obtained from a doc2vec-trained model. We explore the impact of the choice of metric on the quality of text reuse detection. We assess the quality of the method on a sample of one hundred scientific documents, originally in Russian, machine-translated into English. Preliminary findings demonstrate the feasibility of the approach.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::640a894a07e99a23f1f18f4ebeb03b85 https://doi.org/10.1109/ainl-ismw-fruct.2015.7382960
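
The fragment-comparison step described in the abstract can be illustrated with a minimal sketch. It covers only one of the listed metrics, cosine similarity between bag-of-words representations, and assumes both texts are already in English (the machine-translation step is not shown). The function names, the fragment size of 5 tokens, and the 0.8 threshold are hypothetical choices for illustration and are not taken from the paper.

```python
import math
import re
from collections import Counter


def split_fragments(text, size=5):
    """Split text into non-overlapping fragments of at most `size` lowercase word tokens."""
    tokens = re.findall(r"\w+", text.lower())
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def bow_cosine(a, b):
    """Cosine similarity between the bag-of-words count vectors of two token lists."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


def find_reuse(suspicious, source, size=5, threshold=0.8):
    """Return (i, j, score) for every fragment pair whose similarity reaches the threshold.

    `size` and `threshold` are illustrative assumptions, not values from the paper.
    """
    sus = split_fragments(suspicious, size)
    src = split_fragments(source, size)
    hits = []
    for i, frag_a in enumerate(sus):
        for j, frag_b in enumerate(src):
            score = bow_cosine(frag_a, frag_b)
            if score >= threshold:
                hits.append((i, j, score))
    return hits
```

Calling `find_reuse(translated_text, candidate_source)` returns index pairs of fragments that look reused. Swapping `bow_cosine` for another fragment-level scorer (for example a BLEU or embedding-based function) would follow the same pattern, since the abstract compares several metrics over the same fragment pairs.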