Indeterministic Handling of Uncertain Decisions in Duplicate Detection

Author:	Panse, Fabian; van Keulen, Maurice; Ritter, Norbert
Contributors:	Databases (Former)
Year of publication:	2010
Subject:	IR-71703 METIS-270837 EWI-17967 DB-SDI: SCHEMA AND DATA INTEGRATION
Description:	In current research, duplicate detection is usually treated as a deterministic process in which tuples are either declared duplicates or not. Most often, however, it is not completely clear whether two tuples represent the same real-world entity. Deterministic approaches ignore this uncertainty, which in turn can lead to false decisions. In this paper, we present an indeterministic approach for handling uncertain decisions in a duplicate detection process by using a probabilistic target schema. Thus, instead of deciding between multiple possible worlds, all these worlds can be modeled in the resulting data. This approach minimizes the negative impact of false decisions. Furthermore, the duplicate detection process becomes almost fully automatic, and human effort can be reduced to a large extent. Unfortunately, a fully indeterministic approach is by definition too expensive (in time as well as in storage) and hence impractical. For that reason, we additionally introduce several semi-indeterministic methods for heuristically reducing the set of indeterministically handled decisions in a meaningful way.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=narcis______::57ce589e511d70d09740f79070b1890d https://research.utwente.nl/en/publications/indeterministic-handling-of-uncertain-decisions-in-duplicate-detection(5a9784d9-8612-483b-a148-0153c1e7ff47).html
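
The abstract's central idea, keeping uncertain duplicate decisions as alternative possible worlds instead of forcing a yes/no answer, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the function names `classify_pairs` and `possible_worlds`, the two thresholds `lower` and `upper`, the use of a pair's similarity score as its duplicate probability, and the treatment of pairs as independent (which ignores transitivity between matches). Narrowing the band between the thresholds is just one conceivable way to limit how many decisions are handled indeterministically; the paper's own semi-indeterministic methods may differ.

```python
from itertools import product


def classify_pairs(similarities, lower, upper):
    """Split candidate pairs into sure matches, sure non-matches, and uncertain pairs.

    Hypothetical rule: similarity >= upper is a match, <= lower is a non-match,
    and anything in between is kept as an uncertain decision.
    """
    matches, non_matches, uncertain = [], [], []
    for pair, sim in similarities.items():
        if sim >= upper:
            matches.append(pair)
        elif sim <= lower:
            non_matches.append(pair)
        else:
            uncertain.append((pair, sim))
    return matches, non_matches, uncertain


def possible_worlds(uncertain):
    """Enumerate every combination of duplicate / non-duplicate choices.

    Assumes (for illustration only) that the similarity is the probability of
    being a duplicate and that the pairs are independent of each other.
    Returns a list of (pairs_treated_as_duplicates, probability) tuples.
    """
    worlds = []
    for choices in product([True, False], repeat=len(uncertain)):
        probability = 1.0
        merged = []
        for (pair, sim), is_duplicate in zip(uncertain, choices):
            probability *= sim if is_duplicate else 1.0 - sim
            if is_duplicate:
                merged.append(pair)
        worlds.append((merged, probability))
    return worlds
```

With n uncertain pairs the sketch produces 2^n worlds, which makes the cost concern raised in the abstract concrete: moving `lower` and `upper` closer together shrinks the uncertain set, and with it the number of worlds that must be stored.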