Developing an Arabic Plagiarism Detection Corpus

Authors: Imtiaz Hussain Khan, Salma Omar Elhaj, Kamal Mansoor Jambi, Muazzam Ahmed Siddiqui, Abobakr Ahmed Bagais
Year of publication: 2014
Subject:
Source: Computer Science & Information Technology (CS & IT).
Description: A corpus is a collection of documents. It is a valuable resource in linguistics research for performing statistical analysis and testing hypotheses about linguistic rules. An annotated corpus consists of documents or entities annotated with task-related labels such as part-of-speech tags or sentiment. One such task is plagiarism detection, which seeks to identify whether a given document is plagiarized. This paper describes our efforts to build a plagiarism detection corpus for Arabic. The corpus consists of about 350 plagiarized-source document pairs and more than 250 documents in which no plagiarism was found. The plagiarized documents consist of assignments submitted by students. For each plagiarized document, the source document was located on the Web and downloaded for further investigation. We report corpus statistics, including the number of documents, sentences, and tokens for each of the plagiarized and source categories.
Database: OpenAIRE