Description: |
Given a large data collection, entity resolution finds the records that refer to the same entity. A crucial step in entity resolution is computing the similarity between records. Without careful design, an algorithm may have to compare every character of two records only to obtain a small similarity value. In this paper, we propose a novel method based on the waves of records. A wave is a sequence of character frequencies, in which equal frequencies of different characters are treated as distinct. The Wave structure sharply reduces the number of comparisons needed to compute similarity through two techniques: filtering out record pairs whose waves are not similar, and estimating the maximum similarity that the remaining parts of the records can still contribute, so that the algorithm can terminate the computation early when this bound is too small, without introducing false negatives. We demonstrate the effectiveness of our algorithm through a thorough experimental evaluation on real-life data sets. |
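
The two pruning ideas named in the abstract, a cheap filter followed by early termination on an upper bound, can be illustrated with a minimal sketch. This is not the paper's Wave structure, whose exact definition the abstract does not give. As a stand-in, the sketch assumes a plain bag-of-characters overlap similarity (shared character count divided by the longer record's length); the names `char_profile` and `bounded_similarity` and the `threshold` parameter are hypothetical.

```python
from collections import Counter


def char_profile(record):
    """Character-frequency profile of a record (a stand-in, not the paper's wave)."""
    return Counter(record)


def bounded_similarity(a, b, threshold):
    """Return the overlap similarity of a and b if it reaches threshold, else None.

    Assumed similarity: sum over characters of min(count in a, count in b),
    divided by the length of the longer record.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty records are treated as identical

    # Technique 1 (filter): the overlap can never exceed the shorter length,
    # so a large length gap rules the pair out before any counting.
    if min(len(a), len(b)) / longest < threshold:
        return None

    fa, fb = char_profile(a), char_profile(b)
    overlap = 0
    remaining_a = len(a)  # characters of a not yet examined
    for ch, count in fa.items():
        overlap += min(count, fb.get(ch, 0))
        remaining_a -= count
        # Technique 2 (early termination): even if every unexamined character
        # of a matched, the similarity could not exceed this upper bound.
        if (overlap + remaining_a) / longest < threshold:
            return None

    similarity = overlap / longest
    return similarity if similarity >= threshold else None
```

Because the value tested inside the loop is always an upper bound on the final similarity, a pair is discarded only when it could not have reached the threshold, so under these assumptions the pruning produces no false negatives.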