Estimation of distance-based metrics for very large graphs with MinHash Signatures
Autor: | Gianluca Rossi, Giambattista Amati, Simone Angelini, Giorgio Gambosi, Paola Vocca |
---|---|
Rok vydání: | 2017 |
Předmět: |
Probabilistic Counters
Wilcoxon signed-rank test Computer science Approximate Estimation Probabilistic logic Approximation algorithm 02 engineering and technology MinHash HyperANF Settore ING-INF/01 - Elettronica Graph Harmonic analysis MinHash Signature Robustness (computer science) 020204 information systems Effective Diameter 0202 electrical engineering electronic engineering information engineering 020201 artificial intelligence & image processing Algorithm Statistical hypothesis testing Parametric statistics |
Zdroj: | IEEE BigData |
DOI: | 10.1109/bigdata.2017.8257969 |
Popis: | We propose a highly efficient and effective algorithm to estimate metrics on very large graphs that are based on the neighborhood function: examples of such metrics are the (effective) diameter and (effective) radius or the average distance. In order to efficiently provide good approximations to the size of the neighborhood set of any node, we refer to the MinHash Signatures approach to derive compressed representations of large sparse datasets but preserving similarity. The technique introduced here, named MinHash Signature Estimation (MHSE), exploits the similarity between signatures to estimate the size of the neighborhood function. We show that MHSE is as effective as HyperANF, which is considered the state of art approach for the estimation of the effective diameter of a very large graph. Indeed, performing both parametric (t-test) and non-parametric (Wilcoxon) statistical tests on residuals for average distance, effective diameter and number of connected pairs, the p-values show that MHSE tends to be more statistically significant than HyperANF. On the other hand, we show that MHSE is a very simple and easily distributable algorithm. In addition, by the property of the signatures to preserve similarity between neighborhoods of nodes, the algorithm can be suitably applied to allow to search and estimate the overlapping size of the most similar neighborhood at different distances. |
Databáze: | OpenAIRE |
Externí odkaz: |