Clustering Analysis to Improve Web Search Ranking Using PCA and RMSE

Author:	Khalid Q. Shafal, Mohammed A. Ko’adan, Mohammed A. Bamatraf
Year of publication:	2020
Subject:	Similarity (network science); Computer science; Search engine indexing; Web page; Principal component analysis; Feature (machine learning); Data mining; Cluster analysis; Model building; Ranking (information retrieval)
Source:	Advances on Smart and Soft Computing ISBN: 9789811560477
DOI:	10.1007/978-981-15-6048-4_9
Description:	Classifying web pages is the first step of web page ranking (also called indexing). One of the most common ways to carry out indexing is to cluster the pages into groups by similarity; the fewer the misclassifications, the better the result. Clustering is a family of algorithms that divide data into groups of related items. We chose the Microsoft Learning to Rank dataset for analysis and model building; it is designed specifically for research in this field and holds a large and varied body of information about the ranking process. Because of the volume of data, this study uses only 16,015 observations chosen at random from MSLR-WEB30K_2 _ fold 1, in line with the capacity of our hardware and of the analysis algorithms: some of the algorithms used in the analysis (to determine the optimal number of clusters) cannot handle a very large number of observations. In this paper, we use clustering analysis to improve web search ranking, applying principal component analysis (PCA) as a feature reduction technique together with root mean square error (RMSE) to compute the error rate and the accuracy of the model and so find the best number of attributes. This process was carried out with a cross-validation approach, using the extreme gradient boosting algorithm as the training model to estimate the sum of errors during training.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::73c4d7588125c713320a8e424d10291e https://doi.org/10.1007/978-981-15-6048-4_9
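
The feature reduction step outlined in the description (PCA, with the number of retained components chosen by cross-validated RMSE) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: it runs on synthetic data rather than MSLR-WEB30K, ordinary least squares stands in for the extreme gradient boosting model, and every function name here is hypothetical.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean square error between two 1-D arrays."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def pca_fit(X, k):
    """Return the column means and the top-k principal axes of X."""
    mean = X.mean(axis=0)
    # Rows of vt are the principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k]


def pca_transform(X, mean, components):
    """Project X onto the given principal axes."""
    return (X - mean) @ components.T


def cv_rmse(X, y, k, n_folds=5, seed=0):
    """Mean held-out RMSE over n_folds when keeping k principal components.

    ASSUMPTION: a least-squares linear model stands in for the gradient
    boosting model named in the description.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    scores = []
    for i in range(n_folds):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        # Fit PCA on the training fold only, to avoid leaking test data.
        mean, comps = pca_fit(X[train_idx], k)
        Z_train = pca_transform(X[train_idx], mean, comps)
        Z_test = pca_transform(X[test_idx], mean, comps)
        # Prepend a column of ones so the model has an intercept.
        A_train = np.column_stack([np.ones(len(Z_train)), Z_train])
        A_test = np.column_stack([np.ones(len(Z_test)), Z_test])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        scores.append(rmse(y[test_idx], A_test @ coef))
    return float(np.mean(scores))


def make_synthetic(n=400, d=10, seed=0):
    """Synthetic data: d observed features driven by 3 latent factors."""
    rng = np.random.default_rng(seed)
    latent = rng.normal(size=(n, 3))
    mixing = rng.normal(size=(3, d))
    X = latent @ mixing + 0.05 * rng.normal(size=(n, d))
    y = latent @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=n)
    return X, y


X, y = make_synthetic()
# Cross-validated RMSE for each candidate number of components.
results = {k: cv_rmse(X, y, k) for k in range(1, X.shape[1] + 1)}
best_k = min(results, key=results.get)
print("RMSE by number of components:", results)
print("Number of components with lowest RMSE:", best_k)
```

To move closer to the setup named in the description, the least-squares fit inside `cv_rmse` could be replaced with a gradient boosting regressor, and `make_synthetic` with a loader for the sampled MSLR-WEB30K observations.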