A Practical Approach on Cleaning-Up Large Data Sets
Authors: | Dragos Teodor Gavrilut, Dumitru Bogdan Prelipcean, Marius Barat |
---|---|
Year of publication: | 2014 |
Subject: |
Clustering high-dimensional data;
Computer science; Correlation clustering; Constrained clustering; Pattern recognition; Data stream clustering; CURE data clustering algorithm; Canopy clustering algorithm; Artificial intelligence; Instance-based learning; Data mining; Cluster analysis |
Source: | SYNASC |
DOI: | 10.1109/synasc.2014.45 |
Description: | In this paper we propose a noise detection system based on similarities between instances. Given a data set whose instances belong to multiple classes, a noise instance is a wrongly classified record. The similarity between differently labeled instances is determined by computing distances between them using several standard metrics. To keep this approach computationally feasible for very large data sets, we compute distances only between pairs of differently labeled instances that have a certain degree of similarity. This speed-up is made possible by a new clustering method called BDT Clustering, presented in this paper, which is based on a supervised learning algorithm. |
Database: | OpenAIRE |
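
The core idea in the description (flagging differently labeled instances that lie very close to each other) can be illustrated with a minimal sketch. This is not the authors' system: it uses a brute-force pairwise comparison instead of BDT Clustering, which the record does not describe in detail, and the function name `find_noise_candidates`, the Euclidean metric, the toy data, and the threshold value are all assumptions made for illustration.

```python
from itertools import combinations
from math import dist  # Euclidean distance (Python 3.8+); the metric is an assumption


def find_noise_candidates(instances, labels, threshold):
    """Return index pairs (i, j) whose labels differ and whose distance is <= threshold.

    Hypothetical helper: a brute-force O(n^2) scan, shown only to illustrate
    the similarity test. The record states that the paper avoids this cost
    by restricting comparisons with BDT Clustering.
    """
    suspects = []
    for i, j in combinations(range(len(instances)), 2):
        if labels[i] != labels[j] and dist(instances[i], instances[j]) <= threshold:
            suspects.append((i, j))
    return suspects


# Toy data (invented): instance 4 carries label "B" but sits among the "A" points.
instances = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.05, 0.05)]
labels = ["A", "A", "B", "B", "B"]

suspects = find_noise_candidates(instances, labels, threshold=0.2)
print(suspects)
```

Each returned pair marks two near-identical records with conflicting labels, so at least one of them is a candidate for review. On very large data sets the quadratic scan is impractical, which is the motivation the description gives for comparing only pairs that a clustering step has already grouped as similar.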