TOP-Rank: A Novel Unsupervised Approach for Topic Prediction Using Keyphrase Extraction for Urdu Documents

Authors: Natash Ali Mian, Ahmad Amin, Tahir Alyas, Muhammad Waseem Iqbal, Toqir Ahmad Rana, Mohammad Tubishat, Abbas Khalid
Year of publication: 2020
Subject:
Topic model
General Computer Science
Arabic
Computer science
InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL
Feature extraction
Urdu positional ranking
02 engineering and technology
computer.software_genre
Topic extraction
Set (abstract data type)
020204 information systems
0202 electrical engineering, electronic engineering, information engineering
General Materials Science
Cluster analysis
business.industry
Rank (computer programming)
General Engineering
keyphrase extraction
language.human_language
Identification (information)
Ranking
top-rank
ComputingMethodologies_DOCUMENTANDTEXTPROCESSING
language
020201 artificial intelligence & image processing
lcsh:Electrical engineering. Electronics. Nuclear engineering
Artificial intelligence
business
lcsh:TK1-9971
computer
topic prediction
Natural language processing
Sentence
Source: IEEE Access, Vol. 8, pp. 212675-212686 (2020)
ISSN: 2169-3536
Description: In Natural Language Processing (NLP), topic modeling is a technique for extracting abstract information from documents containing large amounts of text. This abstract information leads to the identification of the topics in a document. One way to retrieve topics from documents is keyphrase extraction. Keyphrases are a set of terms that give a high-level description of a document. Various keyphrase extraction techniques for topic prediction have been proposed for multiple languages, e.g., English and Arabic. However, this area remains to be explored for other languages such as Urdu. Therefore, this paper introduces a novel unsupervised approach to topic prediction for the Urdu language that is able to extract more significant information from documents. For this purpose, the proposed TOP-Rank system extracts keywords from the document and ranks them according to their position in a sentence. These keywords, along with their ranking scores, are used to generate keyphrases by applying syntactic rules to extract more meaningful topics. The keyphrases are ranked according to the keyword scores and re-ranked with respect to their positions in the document. Finally, the proposed model identifies the top-ranked keyphrases as topically significant, and the keyphrase with the highest score is selected as the topic of the document. Experiments are performed on two different datasets, and the performance of the proposed system is compared with existing state-of-the-art techniques. The results show that the proposed system outperforms existing techniques and is able to produce more meaningful topics.
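Illustration: the pipeline outlined in the description (position-based keyword scores, keyphrase generation, re-ranking by document position, selection of the top keyphrase) might be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' TOP-Rank implementation: the inverse-position weighting, the use of stopword-delimited runs in place of the paper's syntactic rules, the division by sentence index for re-ranking, the English placeholder stopword list, and all function names are hypothetical. Input is assumed to be already tokenized into sentences.

```python
from collections import defaultdict

# Hypothetical placeholder list; a real Urdu stopword list would replace it.
STOPWORDS = {"the", "of", "a", "is", "in", "and", "for", "to"}


def keyword_scores(sentences):
    """Score each non-stopword by its summed inverse position within a sentence.

    Assumption: words nearer the start of a sentence get a higher weight.
    """
    scores = defaultdict(float)
    for sent in sentences:
        for pos, tok in enumerate(sent, start=1):
            if tok not in STOPWORDS:
                scores[tok] += 1.0 / pos
    return dict(scores)


def candidate_phrases(sentences):
    """Yield (phrase, sentence_index) for maximal runs of non-stopwords.

    Assumption: this stands in for the syntactic rules mentioned in the abstract.
    """
    for idx, sent in enumerate(sentences):
        run = []
        for tok in list(sent) + [None]:  # None is a sentinel that flushes the last run
            if tok is not None and tok not in STOPWORDS:
                run.append(tok)
            elif run:
                yield tuple(run), idx
                run = []


def rank_keyphrases(sentences):
    """Rank phrases by summed keyword scores, discounted by sentence position."""
    kw = keyword_scores(sentences)
    ranked = {}
    for phrase, idx in candidate_phrases(sentences):
        base = sum(kw[t] for t in phrase)
        score = base / (idx + 1)  # assumption: earlier sentences weigh more
        ranked[phrase] = max(ranked.get(phrase, 0.0), score)
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(ranked.items(), key=lambda kv: (-kv[1], kv[0]))


def predict_topic(sentences):
    """Return the highest-scoring keyphrase as the topic, or None if there is none."""
    ranked = rank_keyphrases(sentences)
    return " ".join(ranked[0][0]) if ranked else None
```

Calling predict_topic on a list of token lists returns the single best-scoring phrase as a string; rank_keyphrases exposes the full ordering if more than one candidate topic is needed.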
Database: OpenAIRE