PSH: A probabilistic signature hash method with hash neighborhood candidate generation for fast edit-distance string comparison on big data
Autor: | Eduard C. Dragut, Joseph Jupin, Justin Y. Shi |
---|---|
Rok vydání: | 2016 |
Předmět: |
Theoretical computer science
020205 medical informatics Computer science String (computer science) Hash function Probabilistic logic 02 engineering and technology Approximate string matching Prefix 03 medical and health sciences 0302 clinical medicine Metric (mathematics) 0202 electrical engineering electronic engineering information engineering Edit distance 030212 general & internal medicine String metric Algorithm Double hashing |
Zdroj: | IEEE BigData |
DOI: | 10.1109/bigdata.2016.7840596 |
Popis: | Approximate string matching is essential because data entry errors are unavoidable. Approximately 80% of data entry errors are a single edit distance from the correct entry. We introduce Probabilistic Signature Hashing (PSH), a hash-based filter and verify method to enhance the performance of edit distance comparison of relatively short strings with proven no loss to accuracy. Our experiments show that the proposed method is almost 6800 orders of magnitude faster than Damerau-Levenshtein (DL) edit distance and produces the same exact results. This method combines prefix pruning, string bit signatures and hashing to provide very fast edit-distance comparison with no loss of true positive matches. PSH will provide substantial performance gains as a string comparison metric when used in place of DL. |
Databáze: | OpenAIRE |
Externí odkaz: |