FSampleJoin: A Fixed-Sample-Based Method for String Similarity Joins Using MapReduce

Authors: Decai Sun, Xiaoxia Wang
Publication year: 2019
Subject:
Source: Lecture Notes in Computer Science ISBN: 9783030242640
ICAIS (2)
DOI: 10.1007/978-3-030-24265-7_22
Description: Data integration and data cleaning have received significant attention over the last three decades, and the similarity join is a basic operation in both areas. In this paper, a new fixed-sample-based algorithm, called FSampleJoin, is proposed to perform string similarity joins using MapReduce. The algorithm employs a filter-verify framework. In the filter stage, a fixed-sample partition scheme is adopted to generate high-quality signatures without losing any true pairs. In the verify stage, a secondary filter is employed to further eliminate dissimilar string pairs, and the remaining candidate pairs are verified with a length-aware verification method. Experimental results show that the algorithm outperforms state-of-the-art approaches, although the methods perform similarly when the edit-distance threshold is zero.
Database: OpenAIRE
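
Illustration: the filter-verify framework named in the description can be sketched on a single machine as follows. This is a minimal sketch under stated assumptions, not the FSampleJoin algorithm itself: the record does not specify the fixed-sample partition scheme, so the filter here substitutes a generic even-segment pigeonhole filter (if two strings are within edit distance tau, at least one of the tau + 1 segments of the shorter string occurs unchanged in the other, shifted by at most tau positions). The MapReduce distribution and the secondary filter are omitted, and all function names (`segments`, `edit_distance_within`, `similarity_self_join`) are hypothetical.

```python
from collections import defaultdict


def segments(s, tau):
    """Split s into tau + 1 contiguous segments as (index, start, text)."""
    k = tau + 1
    base, extra = divmod(len(s), k)
    out, start = [], 0
    for i in range(k):
        # The last `extra` segments get one additional character.
        length = base + (1 if i >= k - extra else 0)
        out.append((i, start, s[start:start + length]))
        start += length
    return out


def edit_distance_within(a, b, tau):
    """Return True if the edit distance between a and b is at most tau."""
    if abs(len(a) - len(b)) > tau:  # length filter
        return False
    big = tau + 1  # cap for any cell known to exceed tau
    prev = [j if j <= tau else big for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [big] * (len(b) + 1)
        if i <= tau:
            cur[0] = i
        # Only cells within tau of the diagonal can hold a value <= tau.
        for j in range(max(1, i - tau), min(len(b), i + tau) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost, prev[j] + 1, cur[j - 1] + 1, big)
        if min(cur) > tau:  # early termination: no path can recover
            return False
        prev = cur
    return prev[len(b)] <= tau


def similarity_self_join(strings, tau):
    """Return sorted id pairs (i, j), i < j, with edit distance <= tau."""
    index = defaultdict(set)  # (string length, segment index, text) -> ids
    layout = {}               # string length -> [(segment index, start, length)]
    for sid, s in enumerate(strings):
        segs = segments(s, tau)
        layout[len(s)] = [(i, st, len(t)) for i, st, t in segs]
        for i, _, text in segs:
            index[(len(s), i, text)].add(sid)

    # Filter stage: probe each string against segments of strings that are
    # no longer than it, allowing each segment to shift by at most tau.
    candidates = set()
    for rid, r in enumerate(strings):
        for length in range(max(0, len(r) - tau), len(r) + 1):
            for i, start, seg_len in layout.get(length, ()):
                first = max(0, start - tau)
                last = min(len(r) - seg_len, start + tau)
                for q in range(first, last + 1):
                    for sid in index.get((length, i, r[q:q + seg_len]), ()):
                        if sid != rid:
                            candidates.add((min(sid, rid), max(sid, rid)))

    # Verify stage: keep only candidates that pass the thresholded check.
    return sorted(
        p for p in candidates
        if edit_distance_within(strings[p[0]], strings[p[1]], tau)
    )
```

For example, `similarity_self_join(["kitten", "sitten", "sitting", "mitten", "banana"], 1)` pairs the three strings that differ by a single substitution and leaves "sitting" and "banana" unmatched. In a MapReduce setting, the segment index would typically become the shuffle key, but that mapping is an assumption here rather than something the record states.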