Magellan

Author: AnHai Doan, Pradap Konda, Kaushik Chandrasekhar, G C Paul Suganthan, Derek Paulsen, Philip Martinkus, Matthew Christie, Yash Govind
Year of publication: 2020
Subject:
Source: Communications of the ACM. 63:83-91
ISSN: 1557-7317, 0001-0782
DOI: 10.1145/3405476
Description: Entity matching (EM) finds data instances that refer to the same real-world entity. In 2015, we started the Magellan project at UW-Madison, jointly with industrial partners, to build EM systems. Most current EM systems are stand-alone monoliths. In contrast, Magellan borrows ideas from the field of data science (DS) to build a new kind of EM system: an ecosystem of interoperable tools for multiple execution environments, such as on-premise, cloud, and mobile. This paper describes Magellan, focusing on the system aspects. We argue that EM can be viewed as a special class of DS problems and can thus benefit from system-building ideas in DS. We discuss how these ideas have been adapted to build PyMatcher and CloudMatcher, which are, respectively, a sophisticated on-premise tool for power users and a self-service cloud tool for lay users. These tools exploit techniques from the fields of machine learning, big data scaling, efficient user interaction, databases, and cloud systems. They have been successfully used at 13 companies and domain science groups, have been pushed into production for many customers, and are being commercialized. We discuss the lessons learned and explore applying the Magellan template to other tasks in data exploration, cleaning, and integration.
Database: OpenAIRE