Popis: |
Record Linkage (ReL) is the task of identifying records from a pair of databases referring to the same realworld entity. This has many applications in organisations of all sizes where related data often exist in silos leading to inefficiency in data engineering and analytics applications as well as ineffectiveness of business applications (e.g., unable to personalise marketing campaigns).State-of-the-art (SOTA) machine learning and deep learning based ReL techniques use adaptive similarity measures and learn their relative contributions based on labeled data. However, we report here that they do not work with similar efficacy on industrial data owing to its fundamental differences such as magnitude of schema heterogeneity, need for leveraging structure of the data, lack of training data etc. Through our proposed system ‘ReLink’, we carefully mitigate these challenges and demonstrate that it not only significantly outperforms SOTA baselines on industrial datasets but also on majority of research benchmarks. ReLink introduces the notion of complete-linkage over attributes as well as uses hybrid feature spaces on lexical and semantic similarity measures using pre-trained models such as BERT. Going beyond empirical demonstration, we provide insights and prescriptive guidance on choice of ReL techniques in industrial settings from our observations and lessons learnt from the experience of transferring and deploying for real use-cases in a large financial services organization. |