Active Learning Based Similarity Filtering for Efficient and Effective Record Linkage

Authors: Thilina Ranbaduge, Charini Nanayakkara, Peter Christen
Year of publication: 2021
Subject:
Source: Advances in Knowledge Discovery and Data Mining ISBN: 9783030757649
PAKDD (2)
DOI: 10.1007/978-3-030-75765-6_26
Description: The limited analytical value of using individual databases on their own increasingly requires the integration of large and complex databases for advanced data analytics. Linking personal medical records with travel and immigration data, for example, will allow the effective management of pandemics such as the current COVID-19 outbreak by tracking potentially infected individuals and their contacts. One major challenge for the accurate linkage of large databases is the quadratic or even higher computational complexity of many advanced linkage algorithms. In this paper we present a novel approach that, based on the expected number of true matches between two databases, applies active learning to remove compared record pairs that are likely non-matches before a computationally expensive classification or clustering algorithm is employed to classify record pairs. Unlike blocking and indexing techniques that are used to reduce the number of record pairs to be compared, using recursive binning on a data dimension such as time or space, our approach removes likely non-matching record pairs in each bin after their comparison. Experiments on two real-world databases show that similarity filtering can substantially reduce run time and improve the precision of the final linkage results, at the cost of a small reduction in recall.
Database: OpenAIRE
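
The filtering idea in the description can be illustrated with a minimal sketch. This is not the authors' algorithm: the active learning step is replaced here by a plain top-k cutoff derived from the expected number of true matches, and the function name, the `slack` parameter, and the sample pairs are all hypothetical.

```python
import math
from typing import List, Tuple

# (record id in database A, record id in database B, similarity in [0, 1])
Pair = Tuple[str, str, float]


def filter_likely_non_matches(
    pairs: List[Pair], expected_matches: int, slack: float = 1.5
) -> List[Pair]:
    """Keep only the most similar compared pairs within one bin.

    Assumption: a fixed top-k cutoff stands in for the active learning
    step named in the abstract. `slack` (hypothetical) keeps somewhat more
    pairs than the expected number of true matches to protect recall.
    """
    keep = math.ceil(expected_matches * slack)
    ranked = sorted(pairs, key=lambda p: p[2], reverse=True)
    return ranked[:keep]


# Illustrative, made-up comparison results for a single bin.
pairs = [
    ("a1", "b1", 0.95),
    ("a1", "b2", 0.20),
    ("a2", "b2", 0.88),
    ("a3", "b1", 0.35),
    ("a3", "b3", 0.15),
]

# Only the surviving pairs would go on to the expensive classifier.
kept = filter_likely_non_matches(pairs, expected_matches=2, slack=1.5)
```

Dropping low-similarity pairs before the expensive classification step is what would reduce run time; a true match that falls below the cutoff is lost, which corresponds to the small recall reduction the description reports.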