TOP-Rank: A Novel Unsupervised Approach for Topic Prediction Using Keyphrase Extraction for Urdu Documents
Authors: | Natash Ali Mian, Ahmad Amin, Tahir Alyas, Muhammad Waseem Iqbal, Toqir Ahmad Rana, Mohammad Tubishat, Abbas Khalid |
---|---|
Year of publication: | 2020 |
Subject: | Topic model; General Computer Science; Arabic; Computer science; InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL; Feature extraction; Urdu; positional ranking; 02 engineering and technology; computer.software_genre; Topic extraction; Set (abstract data type); 020204 information systems; 0202 electrical engineering, electronic engineering, information engineering; General Materials Science; Cluster analysis; business.industry; Rank (computer programming); General Engineering; keyphrase extraction; language.human_language; Identification (information); Ranking; top-rank; ComputingMethodologies_DOCUMENTANDTEXTPROCESSING; 020201 artificial intelligence & image processing; lcsh:Electrical engineering. Electronics. Nuclear engineering; Artificial intelligence; lcsh:TK1-9971; topic prediction; Natural language processing; Sentence |
Source: | IEEE Access, Vol 8, Pp 212675-212686 (2020) |
ISSN: | 2169-3536 |
Description: | In Natural Language Processing (NLP), topic modeling is a technique for extracting abstract information from documents that contain large amounts of text. This abstract information leads to the identification of the topics in a document. One way to retrieve topics from documents is keyphrase extraction. Keyphrases are a set of terms that give a high-level description of a document. Different keyphrase extraction techniques for topic prediction have been proposed for several languages, e.g., English and Arabic. However, this area still needs to be explored for other languages, such as Urdu. Therefore, this paper introduces a novel unsupervised approach to topic prediction for the Urdu language that is able to extract more significant information from documents. For this purpose, the proposed TOP-Rank system extracts keywords from the document and ranks them according to their position in a sentence. These keywords, along with their ranking scores, are used to generate keyphrases by applying syntactic rules to extract more meaningful topics. The keyphrases are ranked according to the keyword scores and then re-ranked with respect to their positions in the document. Finally, the proposed model identifies the top-ranked keyphrases as topically significant, and the keyphrase with the highest score is selected as the topic of the document. Experiments are performed on two different datasets, and the performance of the proposed system is compared with existing state-of-the-art techniques. The results show that the proposed system outperforms existing techniques and is able to produce more meaningful topics. |
Database: | OpenAIRE |
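
The pipeline outlined in the description (score keywords by their position in a sentence, build keyphrases, rank them by keyword scores, re-rank by position in the document, pick the top phrase) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the record does not give the scoring formulas or the syntactic rules, so the `1 / (position + 1)` weights, the stopword-delimited phrase rule, the tiny English `STOPWORDS` set, and all function names below are assumptions chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical stopword list; a real system would use an Urdu stopword list.
STOPWORDS = {"the", "of", "a", "is", "in", "and", "for", "to"}
PUNCT = ".,;:!?۔،"  # includes the Urdu full stop and comma


def tokenize(sentence):
    """Split on whitespace, strip punctuation, lowercase."""
    tokens = (w.strip(PUNCT).lower() for w in sentence.split())
    return [t for t in tokens if t]


def keyword_scores(sentences):
    """Score each non-stopword by its position in a sentence.

    Assumption: earlier words weigh more, using 1 / (position + 1).
    """
    scores = defaultdict(float)
    for sent in sentences:
        for pos, word in enumerate(tokenize(sent)):
            if word not in STOPWORDS:
                scores[word] += 1.0 / (pos + 1)
    return scores


def candidate_phrases(sentences):
    """Yield (sentence_index, phrase) for maximal runs of non-stopwords.

    Assumption: this stands in for the paper's syntactic rules.
    """
    for idx, sent in enumerate(sentences):
        run = []
        for word in tokenize(sent):
            if word in STOPWORDS:
                if run:
                    yield idx, tuple(run)
                run = []
            else:
                run.append(word)
        if run:
            yield idx, tuple(run)


def rank_keyphrases(sentences):
    """Rank phrases by summed keyword scores, discounted by sentence index."""
    kw = keyword_scores(sentences)
    best = {}
    for idx, phrase in candidate_phrases(sentences):
        # Assumed re-ranking: phrases in earlier sentences score higher.
        score = sum(kw[w] for w in phrase) / (idx + 1)
        if phrase not in best or score > best[phrase]:
            best[phrase] = score
    return sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))


def predict_topic(sentences):
    """Return the highest-scoring keyphrase as the topic, or None."""
    ranked = rank_keyphrases(sentences)
    return " ".join(ranked[0][0]) if ranked else None
```

Calling `predict_topic` on a list of sentence strings returns the single highest-scoring phrase, and `rank_keyphrases` exposes the full ranked list for inspection. Swapping in an Urdu stopword list and part-of-speech-based phrase rules would bring the sketch closer to the setting the description refers to.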