Deconstructing Domain Names to Reveal Latent Topics
Authors: | Cheryl J. Flynn, Kenneth E. Shirley, Wei Wang |
---|---|
Year of publication: | 2016 |
Subject: | Topic model; Computer science; Text segmentation; Supervised learning; Engineering and technology; Machine learning; Natural sciences; Latent Dirichlet allocation; Domain (software engineering); Statistics & probability; Categorization; Information systems; Electrical engineering, electronic engineering, information engineering; Unsupervised learning; Artificial intelligence; Mathematics; Cluster analysis; Natural language processing |
Source: | DSAA |
DOI: | 10.1109/dsaa.2016.63 |
Description: | Measurement of the lexical properties of domain names enables many types of relatively fast, lightweight web mining analyses. These include unsupervised learning tasks such as automatic categorization and clustering of websites, as well as supervised learning tasks, such as classifying websites as malicious or benign. In this paper we explore whether these tasks can be better accomplished by identifying semantically coherent groups of words in a large set of domain names using a combination of word segmentation and topic modeling methods. By segmenting domain names to generate a large set of new domain-level features, we compare three different unsupervised learning methods for identifying topics among domain name keywords: spherical k-means clustering (SKM), Latent Dirichlet Allocation (LDA), and the Biterm Topic Model (BTM). We successfully infer semantically coherent groups of words in two independent data sets, finding that BTM topics are quantitatively the most coherent. Using the BTM, we compare inferred topics across data sets and across time periods, and we also highlight instances of homophony within the topics. Finally, we show that the BTM topics can be used as features to improve the interpretability of a supervised learning model for the detection of malicious domain names. To our knowledge this is the first large-scale empirical analysis of the co-occurrence patterns of words within domain names. |
Database: | OpenAIRE |
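
The description above mentions segmenting domain names into words and then modeling word co-occurrence. As a rough illustration only, the following Python sketch shows one simple way such preprocessing could look: a dictionary-based segmentation that prefers the fewest words, followed by extraction of unordered word pairs (the "biterms" that the Biterm Topic Model takes as input). The vocabulary `VOCAB` and the function names `domain_label`, `segment`, and `biterms` are hypothetical and chosen for this example; the record does not state which segmentation algorithm the authors used.

```python
from itertools import combinations

# Hypothetical toy vocabulary (an assumption for this sketch); a real system
# would rely on a large word list, likely with word frequencies.
VOCAB = {"cheap", "flights", "flight", "s", "online", "on", "line",
         "car", "insurance"}


def domain_label(domain):
    """Return the leftmost label in lower case, e.g. 'cheapflights'.

    Assumes the name carries no subdomain prefix such as 'www.'.
    """
    return domain.lower().split(".")[0]


def segment(name, vocab=VOCAB):
    """Split `name` into the fewest vocabulary words, or None if impossible."""
    n = len(name)
    best = [None] * (n + 1)  # best[i]: shortest token list covering name[:i]
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            piece = name[start:end]
            if best[start] is not None and piece in vocab:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]


def biterms(tokens):
    """Return every unordered word pair from one segmented domain name."""
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


if __name__ == "__main__":
    for domain in ["CheapFlights.com", "onlinecarinsurance.net"]:
        words = segment(domain_label(domain))
        print(domain, words, biterms(words))
```

Preferring the fewest words is only one possible heuristic; a frequency-weighted segmenter would resolve ambiguous splits differently, and names that cannot be covered by the vocabulary simply return `None` here.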