Predicted max degree sampling: Sampling in directed networks to maximize node coverage through crawling
Autor: | Sucheta Soundarajan, Katchaguy Areekijseree, Ricky Laishram |
---|---|
Rok vydání: | 2016 |
Předmět: |
Degree (graph theory)
Computer science Node (networking) Real-time computing Social network analysis (criminology) Sampling (statistics) 02 engineering and technology Crawling computer.software_genre Telecommunications network Electronic mail symbols.namesake 020204 information systems 0202 electrical engineering electronic engineering information engineering symbols 020201 artificial intelligence & image processing Algorithm design Data mining Web crawler Baseline (configuration management) Algorithm Social network analysis computer Gibbs sampling |
Zdroj: | IEEE BigData INFOCOM Workshops |
DOI: | 10.1109/bigdata.2016.7841092 |
Popis: | When studying large-scale networks, including the Web and computer networks, it is first necessary to collect appropriate data. There is a large body of literature on web crawling and network sampling in general; however, this work typically assumes that a query on a node either reveals all edges incident to that node (in the case of an undirected graph) or all edges outgoing from that node (e.g., in the case of a web crawler). There is relatively little work on sampling directed networks where incoming and outgoing edges may be obtained with separate queries. This type of sampling is relevant to networks like Twitter, in which the ‘Friends’ and ‘Follower’ connections are reciprocal relationships (i.e., if A is a follower of B, then B is a friend of A). In this paper we present Predicted Max Degree Sampling (PMD), a new method of sampling a directed network, with the objective of maximizing the total number of nodes observed. We consider two types of crawling, corresponding to the scenarios in which there is or is not a limit on the number of nodes returned by a query. We compared PMD against three baseline algorithms with three real networks, and saw large improvements vs. baseline sampling algorithms: With a budget of 2000, PMD found 15% to 170.2% more nodes than the closest baseline algorithm for the different datasets considered. |
Databáze: | OpenAIRE |
Externí odkaz: |