Magellan
Author: | AnHai Doan, Pradap Konda, Kaushik Chandrasekhar, G C Paul Suganthan, Derek Paulsen, Philip Martinkus, Matthew Christie, Yash Govind |
---|---|
Year of publication: | 2020 |
Subject: | Matching (statistics); General Computer Science; Computer science; business.industry; Big data; Interoperability; Cloud computing; 02 engineering and technology; Data science; Field (computer science); Domain (software engineering); 020204 information systems; 0202 electrical engineering, electronic engineering, information engineering; 020201 artificial intelligence & image processing; business |
Source: | Communications of the ACM. 63:83-91 |
ISSN: | 1557-7317 0001-0782 |
DOI: | 10.1145/3405476 |
Description: | Entity matching (EM) finds data instances that refer to the same real-world entity. In 2015, we started the Magellan project at UW-Madison, jointly with industrial partners, to build EM systems. Most current EM systems are stand-alone monoliths. In contrast, Magellan borrows ideas from the field of data science (DS) to build a new kind of EM system: an ecosystem of interoperable tools for multiple execution environments, such as on-premise, cloud, and mobile. This paper describes Magellan, focusing on the system aspects. We argue that EM can be viewed as a special class of DS problems and thus can benefit from system-building ideas in DS. We discuss how these ideas have been adapted to build PyMatcher and CloudMatcher, which are, respectively, sophisticated on-premise tools for power users and self-service cloud tools for lay users. These tools exploit techniques from the fields of machine learning, big data scaling, efficient user interaction, databases, and cloud systems. They have been successfully used in 13 companies and domain science groups, have been pushed into production for many customers, and are being commercialized. We discuss the lessons learned and explore applying the Magellan template to other tasks in data exploration, cleaning, and integration. |
Database: | OpenAIRE |