Developing an Arabic Plagiarism Detection Corpus
Author: | Imtiaz Hussain Khan, Salma Omar Elhaj, Kamal Mansoor Jambi, Muazzam Ahmed Siddiqui, Abobakr Ahmed Bagais |
---|---|
Year of publication: | 2014 |
Subject: |
Information retrieval; Computer science; Arabic; Part of speech; Task (project management); Corpus linguistics; Statistical analysis; Plagiarism detection; Artificial intelligence; Testing hypothesis; Natural language processing; ComputingMilieux_THECOMPUTINGPROFESSION; ComputingMilieux_LEGALASPECTSOFCOMPUTING; ComputingMilieux_COMPUTERSANDEDUCATION; ComputingMethodologies_DOCUMENTANDTEXTPROCESSING |
Source: | Computer Science & Information Technology (CS & IT). |
Description: | A corpus is a collection of documents. It is a valuable resource in linguistics research for performing statistical analysis and hypothesis testing for different linguistic rules. An annotated corpus consists of documents or entities annotated with task-related labels such as part-of-speech tags, sentiment, etc. One such task is plagiarism detection, which seeks to identify whether a given document is plagiarized or not. This paper describes our efforts to build a plagiarism detection corpus for Arabic. The corpus consists of about 350 plagiarized-source document pairs and more than 250 documents in which no plagiarism was found. The plagiarized documents consist of student-submitted assignments. For each plagiarized document, the source document was located on the Web and downloaded for further investigation. We report corpus statistics, including the number of documents, sentences, and tokens for each of the plagiarized and source categories. |
Database: | OpenAIRE |
External link: |