Description: |
The uniform resource locator (URL) conveys essential information about a page’s topic, authority, and security, which significantly influences its ranking in search engine results. However, many existing URL classification methods are too costly in time and memory during the preprocessing, processing, and inference stages to serve real-time online inference well. In settings where quick decisions are crucial, such as cybersecurity or digital marketing, slow or resource-intensive classification limits the ability to prioritize and select URLs effectively. This study presents QuickCharNet, a novel URL classification architecture that combines character-level convolution with token-level representation to improve efficiency and generalizability for real-time URL inference. Its key contributions are the use of max and mean pooling to aggregate character embeddings into token embeddings, an exploration of sub-word tokenizers for optimal URL representation, and a comparative analysis of five models using accuracy, F1-score, and t-distributed Stochastic Neighbor Embedding (t-SNE) visualizations. By prioritizing user experience and safety, this research aims to improve the accuracy of URL-based topic classification as it relates to positioning in search engine results, and to evaluate the likelihood that a URL is spam. Ultimately, the findings help developers identify and address URL-related issues efficiently for improved search engine optimization. QuickCharNet was trained on a dataset developed for this study and on two benchmark datasets. Experiments revealed optimal settings for URL classification and spam detection, resulting in a 4.92% improvement in topic classification and a 1% improvement in spam detection.
These results emphasize the significance of URLs in search engine optimization: well-named URLs enable better topic classification, increasing the likelihood of appearing on the first page of search engine results by 4.92%. Conversely, URLs identified as spam are more likely to rank lower, which affects their first-page visibility by 1%. Source code: https://github.com/FardinRastakhiz/QuickCharNet. |
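
The pooling step mentioned in the abstract (aggregating character embeddings into token embeddings with max and mean pooling) can be illustrated with a minimal sketch. This is not the QuickCharNet implementation: the random character table stands in for learned embeddings, the delimiter-based `split_url` replaces the sub-word tokenizers the study explores, the character-level convolution is omitted, and concatenating the max-pooled and mean-pooled vectors is an assumption made here for illustration. All function names and the embedding size are hypothetical.

```python
import random
import re

EMBED_DIM = 4  # hypothetical embedding size, chosen only for illustration


def build_char_table(alphabet, dim=EMBED_DIM, seed=0):
    """Assign each character a random vector (stand-in for learned embeddings)."""
    rng = random.Random(seed)
    return {ch: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for ch in alphabet}


def split_url(url):
    """Split a URL into tokens on non-alphanumeric characters (a simplification)."""
    return [tok for tok in re.split(r"[^A-Za-z0-9]+", url.lower()) if tok]


def pool_token(token, table, dim=EMBED_DIM):
    """Return (max_pooled, mean_pooled) vectors over a token's character embeddings."""
    vectors = [table[ch] for ch in token if ch in table]
    if not vectors:
        # No known characters: fall back to zero vectors.
        return [0.0] * dim, [0.0] * dim
    columns = list(zip(*vectors))  # one tuple per embedding dimension
    max_pooled = [max(col) for col in columns]
    mean_pooled = [sum(col) / len(col) for col in columns]
    return max_pooled, mean_pooled


def url_token_embeddings(url, table):
    """Map a URL to (token, embedding) pairs; the embedding concatenates both poolings."""
    result = []
    for token in split_url(url):
        max_pooled, mean_pooled = pool_token(token, table)
        result.append((token, max_pooled + mean_pooled))  # assumed combination
    return result


ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"
TABLE = build_char_table(ALPHABET)
EXAMPLE = url_token_embeddings("https://example.com/best-running-shoes", TABLE)
```

Each token in `EXAMPLE` ends up with a fixed-size vector regardless of its length in characters, which is what lets a downstream classifier operate on a short token sequence rather than the full character sequence.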