Applying Cluster Refinement to Improve Crowd-Based Data Duplicate Detection Approach
Author: | Rui Xi, Lawrence Tandoh, Michael Y. Kpiebaareh, Moses J. Eghan, Maame G. Asante-Mensah, Mengshu Hou, Charles Roland Haruna, Barbie Eghan-Yartel |
---|---|
Publication year: | 2019 |
Subject: |
General Computer Science; Computer science; business.industry; General Engineering; minimization approach; Crowdsourcing; computer.software_genre; Duplicate detection; entity reconciliation; Cluster (physics); Cluster refinement; triangular split and merger operations; crowdsourcing; General Materials Science; lcsh:Electrical engineering. Electronics. Nuclear engineering; Data mining; business; lcsh:TK1-9971; computer |
Source: | IEEE Access, Vol 7, Pp 77426-77435 (2019) |
ISSN: | 2169-3536 |
DOI: | 10.1109/access.2019.2920667 |
Description: | In this paper, we extend a hybrid-based deduplication technique for entity reconciliation (ER) in two ways: first, by proposing an algorithm that builds clusters given a pre-specified number of clusters K, and second, by developing a crowd-based procedure for refining the clusters produced by the cluster generation phases. With the clusters refined, we aim to minimize the cost metric Λ'(R) of the solitary and compound cluster generation algorithms, to obtain a more efficient deduplication method, to increase accuracy in identifying duplicate records, and to further reduce the crowdsourcing overhead incurred. In the experiments, we use three datasets common in hybrid-based deduplication work: paper, product, and restaurant. The performance results and evaluations show that our work clearly outperforms the compared methods, offering low crowdsourcing cost and high deduplication accuracy, as well as better deduplication efficiency due to the refined clusters. |
Database: | OpenAIRE |
External link: |