PSH: A probabilistic signature hash method with hash neighborhood candidate generation for fast edit-distance string comparison on big data

Autor: Eduard C. Dragut, Joseph Jupin, Justin Y. Shi
Rok vydání: 2016
Předmět:
Zdroj: IEEE BigData
DOI: 10.1109/bigdata.2016.7840596
Popis: Approximate string matching is essential because data entry errors are unavoidable. Approximately 80% of data entry errors are a single edit distance from the correct entry. We introduce Probabilistic Signature Hashing (PSH), a hash-based filter and verify method to enhance the performance of edit distance comparison of relatively short strings with proven no loss to accuracy. Our experiments show that the proposed method is almost 6800 orders of magnitude faster than Damerau-Levenshtein (DL) edit distance and produces the same exact results. This method combines prefix pruning, string bit signatures and hashing to provide very fast edit-distance comparison with no loss of true positive matches. PSH will provide substantial performance gains as a string comparison metric when used in place of DL.
Databáze: OpenAIRE