Document Classification Using Enhanced Grid Based Clustering Algorithm

Authors: Mohamed Waleed Fakhr, Mohamed Ahmed Rashad, Hesham El-Deeb
Year of publication: 2014
Subject:
Source: Lecture Notes in Electrical Engineering ISBN: 9783319067636
DOI: 10.1007/978-3-319-06764-3_27
Description: Automated document clustering is an important text mining task, especially given the rapid growth in the number of online documents written in Arabic. Text clustering aims to automatically assign a text to a predefined cluster based on linguistic features. This research proposes an enhanced grid-based clustering algorithm whose main purpose is to divide the data space into clusters of arbitrary shape. Clusters are treated as dense regions of points in the data space, separated by low-density regions that represent noise. The algorithm also handles data sets with multiple densities and assigns noise and outliers to the closest category. This reduces the time complexity. Unclassified documents are preprocessed by removing stop words and extracting word roots, which reduces the dimensionality of the documents' feature vectors. Each document is then represented as a vector of words and their frequencies. Performance is reported in terms of time consumption and the percentage of successfully clustered instances. Experiments on an in-house collection of Arabic text demonstrate the effectiveness of the enhanced clustering algorithm, with an average accuracy of 89%.
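The abstract does not give the details of the enhanced algorithm, so the following is only a minimal sketch of generic grid-based density clustering, written to illustrate the ideas named above: dense cells form clusters of arbitrary shape, and leftover noise points are assigned to the closest cluster. The function name `grid_cluster`, the parameters `cell_size` and `min_pts`, the nearest-centroid rule for noise, and the 2-D toy points are all assumptions made for illustration, not taken from the paper. Real document vectors would be far higher-dimensional, and this naive neighbor enumeration would not scale to them.

```python
from collections import defaultdict, deque
from itertools import product
import math


def grid_cluster(points, cell_size, min_pts):
    """Generic grid-based density clustering (illustrative, not the paper's method)."""
    dim = len(points[0])

    # 1. Bin every point into the grid cell that contains it.
    cells = defaultdict(list)
    for idx, p in enumerate(points):
        key = tuple(math.floor(x / cell_size) for x in p)
        cells[key].append(idx)

    # 2. A cell is "dense" if it holds at least min_pts points (assumed rule).
    dense = {k for k, members in cells.items() if len(members) >= min_pts}

    # 3. Flood-fill adjacent dense cells so each connected group is one cluster.
    offsets = [o for o in product((-1, 0, 1), repeat=dim) if any(o)]
    cell_label = {}
    n_clusters = 0
    for start in sorted(dense):
        if start in cell_label:
            continue
        cell_label[start] = n_clusters
        queue = deque([start])
        while queue:
            cur = queue.popleft()
            for o in offsets:
                nb = tuple(c + d for c, d in zip(cur, o))
                if nb in dense and nb not in cell_label:
                    cell_label[nb] = n_clusters
                    queue.append(nb)
        n_clusters += 1

    # 4. Points in dense cells inherit the cell's label; the rest are noise (-1).
    labels = [-1] * len(points)
    for key, members in cells.items():
        if key in cell_label:
            for idx in members:
                labels[idx] = cell_label[key]

    # 5. Assign each noise point to the cluster with the nearest centroid
    #    (an assumed stand-in for "assigning noise to the closest category").
    if n_clusters:
        sums = [[0.0] * dim for _ in range(n_clusters)]
        counts = [0] * n_clusters
        for p, lab in zip(points, labels):
            if lab >= 0:
                counts[lab] += 1
                for j, x in enumerate(p):
                    sums[lab][j] += x
        centroids = [[s / c for s in row] for row, c in zip(sums, counts)]
        for i, p in enumerate(points):
            if labels[i] == -1:
                labels[i] = min(
                    range(n_clusters),
                    key=lambda c: math.dist(p, centroids[c]),
                )
    return labels


# Toy usage: two dense groups plus one stray point near the first group.
points = [(0.1, 0.1), (0.2, 0.3), (0.4, 0.2),
          (5.1, 5.2), (5.3, 5.4), (5.2, 5.1),
          (1.6, 1.5)]
labels = grid_cluster(points, cell_size=1.0, min_pts=2)
print(labels)  # [0, 0, 0, 1, 1, 1, 0]
```

In this toy run the stray point falls in a sparse cell, so it is first marked as noise and then attached to the nearer of the two clusters instead of being discarded.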
Database: OpenAIRE