ProgressER
Autor: | Yasser Altowim, Sharad Mehrotra, Dmitri V. Kalashnikov |
---|---|
Rok vydání: | 2018 |
Předmět: |
Similarity (geometry)
General Computer Science Computer science Process (computing) Window (computing) 02 engineering and technology Resolution (logic) Object (computer science) computer.software_genre Constraint (information theory) Set (abstract data type) 020204 information systems 0202 electrical engineering electronic engineering information engineering 020201 artificial intelligence & image processing Data mining Latency (engineering) computer |
Zdroj: | ACM Transactions on Knowledge Discovery from Data. 12:1-45 |
ISSN: | 1556-472X 1556-4681 |
DOI: | 10.1145/3154410 |
Popis: | Entity resolution (ER) is the process of identifying which entities in a dataset refer to the same real-world object. In relational ER, the dataset consists of multiple entity-sets and relationships among them. Such relationships cause the resolution of some entities to influence the resolution of other entities. For instance, consider a relational dataset that consists of a set of research paper entities and a set of venue entities. In such a dataset, deciding that two research papers are the same may trigger the fact that their venues are also the same. This article proposes a progressive approach to relational ER, named ProgressER, that aims to produce the highest quality result given a constraint on the resolution budget, specified by the user. Such a progressive approach is useful for many emerging analytical applications that require low latency response (and thus cannot tolerate delays caused by cleaning the entire dataset) and/or in situations where the underlying resources are constrained or costly to use. To maximize the quality of the result, ProgressER follows an adaptive strategy that periodically monitors and reassesses the resolution progress to determine which parts of the dataset should be resolved next and how they should be resolved. More specifically, ProgressER divides the input budget into several resolution windows and analyzes the resolution progress at the beginning of each window to generate a resolution plan for the current window. A resolution plan specifies which blocks of entities and which entity pairs within blocks need to be resolved during the plan execution phase of that window. In addition, ProgressER specifies, for each identified pair of entities, the order in which the similarity functions should be applied on the pair. Such an order plays a significant role in reducing the overall cost because applying the first few functions in this order might be sufficient to resolve the pair. The empirical evaluation of ProgressER demonstrates its significant advantage in terms of progressiveness over the traditional ER techniques for the given problem settings. |
Databáze: | OpenAIRE |
Externí odkaz: |