Pre Processing Techniques for Arabic Documents Clustering

Author:	Alhanjouri, Mohammed A.
Year of publication:	2017
Subject:	ComputingMethodologies_PATTERNRECOGNITION; Arabic text mining; ComputingMethodologies_DOCUMENTANDTEXTPROCESSING; Arabic document clustering; term weighting; Arabic text preprocessing; vector space model (VSM); Arabic morphological analysis
Description:	Clustering of text documents is an important technique for document retrieval. It aims to organize documents into meaningful groups, or clusters. Text preprocessing plays a major role in improving the clustering of Arabic documents. This research examines and compares the effectiveness of text preprocessing techniques in Arabic document clustering: term pruning, term weighting using TF-IDF, morphological analysis (root-based stemming and light stemming, compared with raw text), and normalization. The experimental work examined the effect of the clustering algorithm by comparing K-means, the most widely used partitional algorithm, with another partitional clustering algorithm, Expectation Maximization (EM). The Euclidean and Manhattan distance functions were also compared to determine which produces the best document clustering results. Results were assessed by evaluating the clustered documents under many combinations of preprocessing techniques. The experimental results show that document clustering quality can be improved by applying term weighting (TF-IDF) and term pruning with a small value for the minimum term frequency. In morphological analysis, light stemming was found to be more appropriate than root-based stemming and raw text. Normalization also improved the clustering of Arabic documents and its evaluation scores.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=od______4294::ca34f7703221defc55e40fcda6bfc8e1 http://www.ijemr.net/DOC/PreProcessingTechniquesForArabicDocumentsClustering.PDF
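
The preprocessing steps named in the description (normalization, term pruning by minimum term frequency, TF-IDF weighting, and the Euclidean and Manhattan distance functions) can be illustrated with a minimal Python sketch. This is not the author's implementation. The normalization rules, the function names, and the `min_tf` parameter are assumptions chosen for illustration; the record does not specify them. Stemming and the K-means and EM clustering steps are omitted.

```python
import math
import re
from collections import Counter

# Assumed normalization rules (commonly used for Arabic text; not taken
# from the paper): strip diacritics and tatweel, unify letter variants.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")

def normalize(text):
    text = DIACRITICS.sub("", text)
    text = re.sub("[\u0622\u0623\u0625]", "\u0627", text)  # alef variants -> bare alef
    text = text.replace("\u0629", "\u0647")  # teh marbuta -> heh
    text = text.replace("\u0649", "\u064A")  # alef maksura -> yeh
    return text

def tokenize(text):
    # Whitespace tokenization after normalization; no stemming is applied.
    return normalize(text).split()

def build_vocabulary(docs_tokens, min_tf=1):
    # Term pruning: keep terms whose total corpus frequency is >= min_tf.
    totals = Counter(t for tokens in docs_tokens for t in tokens)
    return sorted(t for t, count in totals.items() if count >= min_tf)

def tfidf_vectors(docs_tokens, vocab):
    # One common TF-IDF variant: raw term count times log(N / document frequency).
    n_docs = len(docs_tokens)
    df = Counter(t for tokens in docs_tokens for t in set(tokens))
    vectors = []
    for tokens in docs_tokens:
        tf = Counter(tokens)
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vectors

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))
```

The resulting vectors could then be passed to any K-means or EM implementation, with either distance function used to compare documents.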