Popis: |
Many real-world applications, such as bioinformatics, data mining, pattern recognition, and social network analysis, benefit from efficient solutions for the graph similarity search problem. Existing methods have limited scalability when they handle the large graph databases, for example, those with millions or billions of graphs that cannot fit in main memory. In this paper, we study the problem of graph similarity search under the graph edit distance constraint in external memory. We present an efficient framework for arbitrary q-gram-based representations of a graph. Specifically, we propose a q-gram matrix index stored in hybrid layout in external memory to achieve efficient query processing, by converting the q-gram counting filter into a sparse matrix-vector multiplication problem. Furthermore, we also boost the query performance by transforming the global filter to a 2-D query rectangle, which allows us to perform a query in a reduced region, significantly reducing the number of query I/Os in practice. Extensive experiments on real data sets confirm that 1) our method can compete with the state-of-the-art in-memory methods in index size and filtering ability, and outperform them on scalability of coping with the PubChem data set including 25 million chemical structure graphs and 2) compared with the popular q-gram-based external inverted index, our external index structure needs much fewer number of query I/Os on the PubChem data set. |