A Practical Approach on Cleaning-Up Large Data Sets

Authors: Dragos Teodor Gavrilut, Dumitru Bogdan Prelipcean, Marius Barat
Publication year: 2014
Subject:
Source: SYNASC
DOI: 10.1109/synasc.2014.45
Description: In this paper we propose a noise detection system based on similarities between instances. Given a data set whose instances belong to multiple classes, a noise instance denotes a wrongly classified record. The similarity between differently labeled instances is determined by computing distances between them using several standard metrics. To ensure that this approach is computationally feasible for very large data sets, we compute distances only between pairs of differently labeled instances that have a certain degree of similarity. This speed-up is made possible by a new clustering method called BDT Clustering, presented in this paper, which is based on a supervised learning algorithm.
Database: OpenAIRE
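
The idea in the description (compare only differently labeled instances that are already known to be similar, and flag close pairs as possible noise) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not specify how BDT Clustering works, so the sketch assumes cluster assignments are supplied by the caller as a hypothetical `cluster_ids` list. The function name `suspect_pairs`, the `threshold` parameter, and the choice of Euclidean and Manhattan distance as the "standard metrics" are likewise assumptions made for illustration.

```python
import math
from collections import defaultdict
from itertools import combinations


def euclidean(a, b):
    """Euclidean (L2) distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Manhattan (L1) distance between two equal-length feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def suspect_pairs(points, labels, cluster_ids, threshold, metric=euclidean):
    """Return sorted index pairs (i, j) that may indicate label noise.

    A pair is reported when both instances fall in the same cluster,
    carry different labels, and lie within `threshold` of each other.
    `cluster_ids` is a hypothetical stand-in for the output of a
    clustering step such as BDT Clustering, which is not reproduced here.
    """
    # Group instance indices by cluster so that distances are computed
    # only inside each cluster instead of over all pairs.
    by_cluster = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        by_cluster[cid].append(idx)

    pairs = []
    for members in by_cluster.values():
        for i, j in combinations(members, 2):
            if labels[i] == labels[j]:
                continue  # only differently labeled pairs are of interest
            if metric(points[i], points[j]) <= threshold:
                pairs.append((i, j))
    return sorted(pairs)


# Toy data: instances 0 and 1 are nearly identical but labeled differently.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.0, 5.2), (9.0, 9.0)]
labels = ["A", "B", "A", "A", "B"]
cluster_ids = [0, 0, 1, 1, 2]

print(suspect_pairs(points, labels, cluster_ids, threshold=0.5))
```

Restricting comparisons to instances in the same cluster is what keeps the number of distance computations well below the quadratic cost of comparing every pair. A flagged pair only says that one of the two labels is suspect; deciding which one is wrong needs further evidence.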