Deconstructing Domain Names to Reveal Latent Topics

Author:	Cheryl J. Flynn, Kenneth E. Shirley, Wei Wang
Publication year:	2016
Subject:	Topic model; business.industry; Computer science; Text segmentation; Supervised learning; 02 engineering and technology; Machine learning; computer.software_genre; 01 natural sciences; Latent Dirichlet allocation; Domain (software engineering); 010104 statistics & probability; symbols.namesake; Categorization; 020204 information systems; 0202 electrical engineering, electronic engineering, information engineering; symbols; Unsupervised learning; Artificial intelligence; 0101 mathematics; Cluster analysis; business; computer; Natural language processing
Source:	DSAA
DOI:	10.1109/dsaa.2016.63
Description:	Measurement of the lexical properties of domain names enables many types of relatively fast, lightweight web mining analyses. These include unsupervised learning tasks, such as automatic categorization and clustering of websites, as well as supervised learning tasks, such as classifying websites as malicious or benign. In this paper we explore whether these tasks can be better accomplished by identifying semantically coherent groups of words in a large set of domain names using a combination of word segmentation and topic modeling methods. After segmenting domain names to generate a large set of new domain-level features, we compare three different unsupervised learning methods for identifying topics among domain name keywords: spherical k-means clustering (SKM), Latent Dirichlet Allocation (LDA), and the Biterm Topic Model (BTM). We successfully infer semantically coherent groups of words in two independent data sets, finding that BTM topics are quantitatively the most coherent. Using the BTM, we compare inferred topics across data sets and across time periods, and we also highlight instances of homophony within the topics. Finally, we show that the BTM topics can be used as features to improve the interpretability of a supervised learning model for the detection of malicious domain names. To our knowledge, this is the first large-scale empirical analysis of the co-occurrence patterns of words within domain names.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::67a539d1cd3af740912ea201e3d7eaf0 https://doi.org/10.1109/dsaa.2016.63
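
Illustration:	The description above mentions segmenting domain names into keywords and modeling word co-occurrence with the Biterm Topic Model. The following is a minimal sketch of those two preprocessing ideas, not the authors' implementation. The toy vocabulary, the fewest-tokens segmentation rule, and the function names segment, biterms, and keywords are all assumptions made for illustration; the record does not state which segmentation algorithm or word list the paper uses.

```python
from itertools import combinations

# Hypothetical toy vocabulary (assumption); a real system would use a large
# word list, typically with word frequencies.
VOCAB = {"cheap", "car", "cars", "insurance", "online", "pet", "food"}


def segment(name, vocab=VOCAB):
    """Split `name` into vocabulary words, preferring the fewest tokens.

    Returns None when no complete segmentation exists.
    """
    n = len(name)
    best = [None] * (n + 1)  # best[i] = shortest token list covering name[:i]
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            word = name[j:i]
            if best[j] is not None and word in vocab:
                candidate = best[j] + [word]
                if best[i] is None or len(candidate) < len(best[i]):
                    best[i] = candidate
    return best[n]


def biterms(tokens):
    """Return the unordered word pairs that co-occur in one domain name."""
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def keywords(domain):
    """Segment the first dot-separated label of `domain`.

    Simplification (assumption): subdomains such as "www" are not handled.
    """
    label = domain.lower().split(".")[0]
    return segment(label)
```

For example, keywords("cheapcarinsurance.com") yields ["cheap", "car", "insurance"], and biterms on that list yields the three word pairs a biterm-based model would count. A domain whose label cannot be fully covered by the vocabulary returns None, which a real pipeline would need to handle, for instance by falling back to a probabilistic segmenter.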