Cleaning Data with Forbidden Itemsets
Author: | Joeri Rammelaere, Floris Geerts, Bart Goethals |
---|---|
Language: | English |
Year of publication: | 2017 |
Subject: | Computer. Automation; Dirty data; Lift (data mining); Computer science; InformationSystems_DATABASEMANAGEMENT; 02 engineering and technology; computer.software_genre; Maintenance engineering; 020204 information systems; Data quality; 0202 electrical engineering, electronic engineering, information engineering; 020201 artificial intelligence & image processing; Data mining; computer; Computer Science::Databases |
Source: | IEEE 33rd International Conference on Data Engineering (ICDE 2017), April 19-22, 2017, San Diego, CA |
ISSN: | 1084-4627 |
Description: | Methods for cleaning dirty data typically rely on additional information about the data, such as user-specified constraints that specify when a database is dirty. These constraints often involve domain restrictions and illegal value combinations. Traditionally, a database is considered clean if all constraints are satisfied. However, many real-world scenarios only have a dirty database available. In such a context, we adopt a dynamic notion of data quality, in which the data is clean if an error discovery algorithm does not find any errors. We introduce forbidden itemsets, which capture unlikely value co-occurrences in dirty data, and we derive properties of the lift measure to provide an efficient algorithm for mining low-lift forbidden itemsets. We further introduce a repair method which guarantees that the repaired database does not contain any low-lift forbidden itemsets. The algorithm uses nearest-neighbor imputation to suggest possible repairs. Optional user interaction can easily be integrated into the proposed cleaning method. Evaluation on real-world data shows that errors are typically discovered with high precision, while the suggested repairs are of good quality and do not introduce new forbidden itemsets, as desired. |
Database: | OpenAIRE |
External link: |
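
The abstract's idea of a low-lift value co-occurrence can be illustrated with a minimal sketch. It assumes the standard two-item lift from association-rule mining, lift(A, B) = P(A, B) / (P(A) P(B)), and a brute-force scan. The paper's own lift definition for larger itemsets, its pruning properties, and its repair step are not reproduced here. The function name `low_lift_pairs`, the threshold `max_lift`, and the toy rows are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


def low_lift_pairs(rows, max_lift):
    """Return {(item_a, item_b): lift} for co-occurring pairs with lift <= max_lift.

    Each row is a dict mapping attribute -> value, and an item is an
    (attribute, value) tuple. This is the plain pairwise lift, not the
    paper's algorithm (assumption made for illustration).
    """
    n = len(rows)
    item_count = Counter()
    pair_count = Counter()
    for row in rows:
        # Sort items so each pair is always counted under the same key.
        items = sorted(row.items())
        item_count.update(items)
        pair_count.update(combinations(items, 2))

    flagged = {}
    for (a, b), together in pair_count.items():
        # lift = P(a, b) / (P(a) * P(b)) = n * count(a, b) / (count(a) * count(b))
        lift = n * together / (item_count[a] * item_count[b])
        if lift <= max_lift:
            flagged[(a, b)] = lift
    return flagged


# Toy data (hypothetical): one row combines sex=M with pregnant=yes.
rows = (
    [{"sex": "M", "pregnant": "no"}] * 5
    + [{"sex": "F", "pregnant": "no"}] * 2
    + [{"sex": "F", "pregnant": "yes"}] * 2
    + [{"sex": "M", "pregnant": "yes"}]
)

flagged = low_lift_pairs(rows, max_lift=0.6)
for pair, lift in flagged.items():
    print(pair, round(lift, 3))
```

On this toy table the only pair at or below the threshold is the combination of `sex=M` with `pregnant=yes`, which occurs less often than its two values would suggest if they were independent. Whether such a pair is a real error still depends on the threshold and on the data, which is where the optional user interaction mentioned in the abstract would come in.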