FDR2-BD: A Fast Data Reduction Recommendation Tool for Tabular Big Data Classification Problems
Author: | María José Basgall, Marcelo Naiouf, Alberto Fernández |
---|---|
Year of publication: | 2021 |
Subject: |
Big data; TK7800-8360; Computer Networks and Communications; Computer science; Feature selection; 02 engineering and technology; computer.software_genre; Preprocessing techniques; Apache Spark; purl.org/becyt/ford/1 [https]; Reduction (complexity); Robustness (computer science); 020204 information systems; Spark (mathematics); 0202 electrical engineering, electronic engineering, information engineering; Electrical and Electronic Engineering; Data reduction; business.industry; Sampling (statistics); purl.org/becyt/ford/1.2 [https]; Classification; Hardware and Architecture; Control and Systems Engineering; Signal Processing; Scalability; 020201 artificial intelligence & image processing; Data mining; Electronics; business; computer |
Source: | Electronics, Vol. 10, Issue 15, p. 1757 (2021); CONICET Digital (CONICET), Consejo Nacional de Investigaciones Científicas y Técnicas; Digibug. Repositorio Institucional de la Universidad de Granada |
ISSN: | 2079-9292 |
Description: | This paper presents FDR2-BD, a methodological data condensation approach for reducing tabular big datasets in classification problems. The key to the proposal is to analyze the data in two directions (vertical and horizontal), combining feature selection, which generates dense clusters of data, with uniform sampling reduction, which keeps only a few representative samples from each problem area. Its main advantage is that the model’s predictive quality is kept within a range determined by a user-defined threshold. Its robustness rests on a hyper-parametrization process that takes all the data into consideration by following a k-fold procedure. The approach is also fast and scalable, since it uses fully optimized parallel operations provided by Apache Spark. An extensive experimental study is performed over 25 big datasets with different characteristics. In most cases, the reduction percentages obtained are above 95%, outperforming state-of-the-art solutions such as FCNN_MR, which barely reach 70%. The most promising outcome is that the representativeness of the original data is maintained, with predictive quality values within about 1% of the baseline. Affiliation: Basgall, María José. Universidad de Granada; Spain. Universidad Nacional de La Plata. Facultad de Informática. Instituto de Investigación en Informática Lidi; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - La Plata; Argentina. Affiliation: Naiouf, Ricardo Marcelo. Universidad Nacional de La Plata. Facultad de Informática. Instituto de Investigación en Informática Lidi; Argentina. Affiliation: Fernández, Alberto. Universidad de Granada; Spain |
Database: | OpenAIRE |
External link: |
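
The dual reduction described in the abstract (feature selection on the vertical axis, uniform sampling on the horizontal axis, and a user threshold on predictive quality) can be illustrated with a minimal sketch. This is not the authors' implementation: it uses plain Python rather than Apache Spark, it omits the k-fold hyper-parametrization, and every name in it (`reduce_dataset`, `reduction_percentage`, `within_threshold`) is hypothetical. The reduction formula (share of table cells removed) and the acceptance rule (absolute quality drop no larger than the threshold) are assumptions made for illustration; the record does not define either.

```python
import random
from collections import defaultdict


def reduce_dataset(rows, labels, keep_features, sample_fraction, seed=0):
    """Keep only the selected columns (vertical reduction), then draw a
    uniform random sample of rows within each class (horizontal reduction).

    Assumption: sampling is done per class so that every class keeps at
    least one representative row.
    """
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    # Uniformly sample the requested fraction of rows from each class.
    kept = []
    for indices in by_class.values():
        n_keep = max(1, round(len(indices) * sample_fraction))
        kept.extend(rng.sample(indices, n_keep))
    kept.sort()

    reduced_rows = [[rows[i][j] for j in keep_features] for i in kept]
    reduced_labels = [labels[i] for i in kept]
    return reduced_rows, reduced_labels


def reduction_percentage(n_rows, n_cols, n_rows_kept, n_cols_kept):
    """Percentage of table cells removed (an assumed definition)."""
    return 100.0 * (1 - (n_rows_kept * n_cols_kept) / (n_rows * n_cols))


def within_threshold(baseline_quality, reduced_quality, threshold):
    """Accept the reduction if quality drops by at most `threshold`
    (an assumed acceptance rule)."""
    return baseline_quality - reduced_quality <= threshold


if __name__ == "__main__":
    # Synthetic data: 1000 rows, 10 features, two imbalanced classes.
    rows = [[float(i + j) for j in range(10)] for i in range(1000)]
    labels = [0] * 700 + [1] * 300

    small_rows, small_labels = reduce_dataset(rows, labels, [0, 3, 7], 0.1)
    print(len(small_rows), len(small_rows[0]))
    print(reduction_percentage(1000, 10, len(small_rows), len(small_rows[0])))
```

In a Spark setting, the per-class sampling step would more naturally be expressed with a stratified sampling operation on a distributed DataFrame; the sketch above only shows the logic on in-memory lists.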