Description: |
Classifying web pages is the first step of web page ranking (also called indexing). One of the most common ways to carry out indexing is to cluster pages into groups according to their similarity; the lower the misclassification rate, the better the result. Clustering refers to a family of algorithms that divide data into groups of related items. We chose the Microsoft Learning to Rank dataset for analysis and model building because it is designed specifically for research in this field and contains a large and varied body of information about the ranking process. Because of the volume of data, this study uses only 16,015 observations randomly sampled from MSLR-WEB30K_2, fold 1. The sample size reflects the limits of our hardware and of the analysis algorithms: some of those used to determine the optimal number of clusters cannot handle a very large number of observations. In this paper, we use clustering analysis to improve web search ranking. Principal component analysis (PCA) serves as the feature reduction technique, with root mean square error (RMSE) used to compute the error rate and the accuracy of the model output and thereby select the best number of attributes. This selection was carried out with a cross-validation approach, using the extreme gradient boosting (XGBoost) algorithm as the training model to estimate the sum of errors during training. |
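The selection step described above (reduce the features with PCA, then use cross-validated RMSE to pick the number of components) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names `pca_fit`, `cv_rmse`, and `best_n_components` are hypothetical, and an ordinary least-squares regressor stands in for the extreme gradient boosting model so that the example depends only on NumPy. In practice, the stand-in could be replaced by a gradient boosting regressor such as `xgboost.XGBRegressor`.

```python
import numpy as np

def pca_fit(X, k):
    """Return the training mean and the top-k principal directions."""
    mean = X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k]

def cv_rmse(X, y, k, n_folds=5, seed=0):
    """Cross-validated RMSE of a model trained on the first k components."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    squared_error = 0.0
    for i in range(n_folds):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        # Fit PCA on the training fold only, to avoid leaking test data.
        mean, components = pca_fit(X[train], k)
        z_train = (X[train] - mean) @ components.T
        z_test = (X[test] - mean) @ components.T
        # Assumption: least squares with an intercept stands in for XGBoost.
        a_train = np.column_stack([z_train, np.ones(len(train))])
        a_test = np.column_stack([z_test, np.ones(len(test))])
        weights, *_ = np.linalg.lstsq(a_train, y[train], rcond=None)
        squared_error += np.sum((a_test @ weights - y[test]) ** 2)
    return float(np.sqrt(squared_error / len(X)))

def best_n_components(X, y, candidates):
    """Return the candidate with the lowest cross-validated RMSE and all scores."""
    scores = {k: cv_rmse(X, y, k) for k in candidates}
    return min(scores, key=scores.get), scores
```

Calling `best_n_components(X, y, range(1, X.shape[1] + 1))` on a feature matrix `X` and a relevance label vector `y` returns the component count with the lowest held-out RMSE together with the score for each candidate. Fitting PCA inside each fold keeps the error estimate from being influenced by the held-out observations.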