THE IMPACT OF TEXT REPRESENTATION AND PREPROCESSING ON AUTHOR IDENTIFICATION

Authors:	Serkan Gunal, Muhammet Yasin Pak
Contributors:	Anadolu Üniversitesi, Mühendislik Fakültesi, Bilgisayar Mühendisliği Bölümü, Günal, Serkan
Language:	English
Publication year:	2017
Subject:	Author identification; text classification; text preprocessing; text representation; Information retrieval; Computer science; Turkish; Bigram; Mühendislik; General Medicine; Parameter identification problem; Identification (information); Statistical classification; Engineering; lcsh:TA1-2040; lcsh:Technology (General); Preprocessor; Sequential minimal optimization; lcsh:T1-995; Ortak Disiplinler; Representation (mathematics); lcsh:Engineering (General). Civil engineering (General)
Source:	Anadolu University Journal of Science and Technology. A : Applied Sciences and Engineering, Vol 18, Iss 1, Pp 218-224 (2017)
ISSN:	2146-0205 1302-3160
Description:	Author identification, a popular topic in text classification and natural language processing, aims to determine the author of a given text through various analyses. The literature considers different text representation approaches and preprocessing steps for the author identification problem. This paper comprehensively examines the impact of text representation and preprocessing steps on author identification, specifically for the Turkish language. To this end, it investigates how all possible combinations of text representation approaches (unigram and bigram) and preprocessing tasks (stemming and stop-word removal) contribute to author identification performance. A new dataset was constructed for the experimental evaluation, and two classification algorithms, Multinomial Naive Bayes and Sequential Minimal Optimization, were employed. The experimental results reveal that bigram features should not be used alone. They also show that stop-words should be kept in the text, whereas the benefit of stemming depends on the classification algorithm.
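The combinations named in the description can be made concrete with a minimal Python sketch. It is not the authors' implementation: the stop-word list, the prefix-truncation stemmer, and the function names are hypothetical placeholders, and treating "unigram plus bigram" as a third representation is an assumption drawn from the remark about bigram features used alone. The classifiers (Multinomial Naive Bayes, Sequential Minimal Optimization) are left out; the sketch only builds the feature counts a classifier would consume.

```python
from collections import Counter
from itertools import product

# Hypothetical stop-word list for illustration; not the list used in the paper.
STOP_WORDS = {"ve", "bir", "bu", "da", "de"}

# Assumed representation options: unigram only, bigram only, or both together.
REPRESENTATIONS = [("unigram",), ("bigram",), ("unigram", "bigram")]


def toy_stem(token, prefix_len=5):
    """Placeholder stemmer: keep only the first prefix_len characters."""
    return token[:prefix_len]


def extract_features(text, ngrams=("unigram",), remove_stop_words=False, stem=False):
    """Return n-gram counts for one document under one configuration."""
    # Note: str.lower() does not apply Turkish casing rules for dotted/dotless i.
    tokens = text.lower().split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if stem:
        tokens = [toy_stem(t) for t in tokens]
    features = Counter()
    if "unigram" in ngrams:
        features.update(tokens)
    if "bigram" in ngrams:
        # Bigrams are pairs of adjacent tokens after preprocessing.
        features.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return features


def configurations():
    """Yield every representation/preprocessing combination in the grid."""
    for ngrams, remove_stop_words, stem in product(
        REPRESENTATIONS, (False, True), (False, True)
    ):
        yield {"ngrams": ngrams, "remove_stop_words": remove_stop_words, "stem": stem}


if __name__ == "__main__":
    sample = "bu bir deneme ve bu bir örnek"
    for config in configurations():
        print(config, dict(extract_features(sample, **config)))
```

Under these assumptions the grid has 3 x 2 x 2 = 12 configurations, and each one would be evaluated once per classifier.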
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::52ac353d4fc2605e7d7f64a4e599b19f http://dergipark.gov.tr/aubtda/issue/28283/270276?publisher=anadolu