FSampleJoin: A Fixed-Sample-Based Method for String Similarity Joins Using MapReduce
Author: | Decai Sun, Xiaoxia Wang |
---|---|
Year of publication: | 2019 |
Subject: |
Computer science
String (computer science); Joins; 02 engineering and technology; computer.software_genre; Partition (database); Similarity (network science); Filter (video); 020204 information systems; 0202 electrical engineering, electronic engineering, information engineering; 020201 artificial intelligence & image processing; Edit distance; String metric; Algorithm; computer; Data integration |
Source: | Lecture Notes in Computer Science; ISBN: 9783030242640; ICAIS (2) |
DOI: | 10.1007/978-3-030-24265-7_22 |
Description: | Data integration and data cleaning have received significant attention over the last three decades, and the similarity join is a basic operation in these areas. In this paper, a new fixed-sample-based algorithm, called FSampleJoin, is proposed to perform string similarity joins using MapReduce. Our algorithm employs a filter-verify framework. In the filter stage, a fixed-sample partition scheme is adopted to generate high-quality signatures without losing any true pairs. In the verify stage, a secondary filter is employed to further eliminate dissimilar string pairs, and the remaining candidate pairs are verified with a length-aware verification method. Experimental results show that our algorithm outperforms state-of-the-art approaches, although their performance is similar when the edit-distance threshold is zero. |
Database: | OpenAIRE |
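
The filter-verify framework named in the abstract can be illustrated with a small single-machine sketch. This is not FSampleJoin itself: the abstract does not specify the fixed-sample partition scheme, the secondary filter, or the MapReduce layout, so the sketch swaps in a plain pigeonhole partition signature (each string is cut into tau + 1 segments, and any string within edit distance tau must contain at least one of them as a substring). The function names `segment_bounds`, `edit_distance_within`, and `similarity_self_join` are hypothetical and chosen only for this illustration.

```python
from collections import defaultdict


def segment_bounds(length, tau):
    """Split a string of the given length into tau + 1 contiguous segments.

    Returns (start, size) pairs. The split depends only on the length,
    so index and probe sides agree on segment sizes.
    """
    k = tau + 1
    base, rem = divmod(length, k)
    bounds, start = [], 0
    for i in range(k):
        size = base + (1 if i >= k - rem else 0)  # last `rem` segments are longer
        bounds.append((start, size))
        start += size
    return bounds


def edit_distance_within(a, b, tau):
    """Verify step: True if edit distance(a, b) <= tau.

    Uses a length check first, then row-by-row DP with early exit.
    """
    if abs(len(a) - len(b)) > tau:
        return False
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        if min(cur) > tau:  # no cell in this row can lead to a result <= tau
            return False
        prev = cur
    return prev[-1] <= tau


def similarity_self_join(strings, tau):
    """Return sorted index pairs (i, j), i < j, with edit distance <= tau."""
    # Filter step, part 1: index every segment as a signature.
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for i, (start, size) in enumerate(segment_bounds(len(s), tau)):
            index[(len(s), i, s[start:start + size])].append(sid)

    # Filter step, part 2: probe with substrings of matching size.
    candidates = set()
    for rid, r in enumerate(strings):
        for length in range(max(0, len(r) - tau), len(r) + tau + 1):
            for i, (_, size) in enumerate(segment_bounds(length, tau)):
                for p in range(len(r) - size + 1):
                    for sid in index.get((length, i, r[p:p + size]), ()):
                        if sid != rid:
                            candidates.add((min(rid, sid), max(rid, sid)))

    # Verify step: keep only candidates that pass the exact check.
    return sorted(
        (i, j) for i, j in candidates
        if edit_distance_within(strings[i], strings[j], tau)
    )


words = ["kitten", "sitten", "sitting", "mitten", "apple"]
print(similarity_self_join(words, 1))
```

Because at most tau edits can touch at most tau of the tau + 1 segments, the filter discards no true pair; it only admits extra candidates, which the verify step removes. In a MapReduce setting the signature key would presumably serve as the shuffle key, but that mapping is an assumption here and not taken from the record.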