Improving Arabic Text Categorization using Normalization and Stemming Techniques
Author: | Hamdy M. Mousa, Rouhia M. Sallam, Mahmoud Hussein |
---|---|
Year of publication: | 2016 |
Subject: |
Normalization (statistics)
Arabic; Computer science; Pattern recognition; Engineering and technology; General chemistry; Natural sciences; Chemical sciences; Text categorization; Categorization; Electrical engineering, electronic engineering, information engineering; Artificial intelligence & image processing; Artificial intelligence; Data mining |
Source: | International Journal of Computer Applications. 135:38-43 |
ISSN: | 0975-8887 |
DOI: | 10.5120/ijca2016908328 |
Description: | Categorization is a technique for assigning documents, based on their contents, to one or more predefined categories. Achieving the highest categorization accuracy remains one of the major challenges, and it is also time-consuming. We propose an approach to tackle these challenges. The proposed approach uses the Frequency Ratio Accumulation Method (FRAM) as a classifier. Its features are represented using the bag-of-words technique, and an improved Term Frequency (TF) technique is used for feature selection. The proposed approach is tested on known datasets. The experiments are run without normalization or stemming, with only one of them, and with both. The results obtained with the proposed approach are generally better than those of existing techniques. Four performance measures of the proposed Arabic text categorization approach were considered: Accuracy, Recall, Precision, and F-measure (F1). Using normalization, their averages are 97.50%, 97.50%, 97.51%, and 97.49%, respectively. Keywords: text categorization, Frequency Ratio Accumulation Method (FRAM), Bag-of-Words (BOW), feature selection, term and document frequency. |
Database: | OpenAIRE |
External link: |
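
The pipeline summarized in the description (normalization, bag-of-words features, FRAM classification) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the record does not list the normalization rules, so the ones below are common Arabic normalization conventions, and the scoring follows one widely cited formulation of FRAM (per-category relative term frequency, divided by its sum over categories, accumulated over a document's terms). The function names `normalize`, `train_fram`, and `classify` are hypothetical, and the paper's improved TF feature selection and stemming steps are not reproduced.

```python
import re
from collections import Counter, defaultdict

# Assumed normalization rules (not taken from the paper):
# strip tashkeel diacritics and tatweel, unify alef variants,
# map alef maqsura to ya and ta marbuta to ha.
_MARKS = re.compile(r"[\u064B-\u0652\u0640]")
_ALEFS = re.compile(r"[\u0623\u0625\u0622]")


def normalize(text):
    text = _MARKS.sub("", text)
    text = _ALEFS.sub("\u0627", text)         # alef variants -> bare alef
    text = text.replace("\u0649", "\u064A")   # alef maqsura -> ya
    text = text.replace("\u0629", "\u0647")   # ta marbuta -> ha
    return text


def tokenize(text):
    # Bag-of-words: whitespace tokens of the normalized text.
    return normalize(text).split()


def train_fram(labeled_docs):
    """Build frequency ratios from an iterable of (category, text) pairs."""
    counts = defaultdict(Counter)
    for category, text in labeled_docs:
        counts[category].update(tokenize(text))

    # R(t, c): relative frequency of term t inside category c.
    rel = {}
    for category, term_counts in counts.items():
        total = sum(term_counts.values())
        rel[category] = {t: n / total for t, n in term_counts.items()}

    # Sum of R(t, c) over all categories, per term.
    term_totals = defaultdict(float)
    for ratios in rel.values():
        for term, r in ratios.items():
            term_totals[term] += r

    # FR(t, c) = R(t, c) / sum over categories of R(t, c').
    return {
        category: {t: r / term_totals[t] for t, r in ratios.items()}
        for category, ratios in rel.items()
    }


def classify(model, text):
    # Accumulate the frequency ratios of the document's terms per category
    # and return the category with the highest accumulated score.
    tokens = tokenize(text)
    scores = {c: sum(fr.get(t, 0.0) for t in tokens) for c, fr in model.items()}
    return max(scores, key=scores.get)
```

A toy usage example, with two invented one-line training documents, would be `model = train_fram([("sports", "كرة القدم مباراة"), ("economy", "البنك الاقتصاد أسعار")])` followed by `classify(model, "مباراة كرة")`. Running the experiment "without normalization" would correspond to replacing `normalize` with the identity function in this sketch.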