DQLearn : A Toolkit for Structured Data Quality Learning
Author: | Nianjun Zhou, Shrey Shrivastava, Arun Iyengar, Dhaval Patel, Anuradha Bhamidipaty |
---|---|
Year of publication: | 2020 |
Subject: | business.industry; Process (engineering); Computer science; 020209 energy; Big data; 02 engineering and technology; Data science; Automation; Workflow; 020401 chemical engineering; Data quality; Data integrity; Benchmark (surveying); 0202 electrical engineering, electronic engineering, information engineering; Data analysis; 0204 chemical engineering; business |
Source: | IEEE BigData |
Description: | Data Quality (DQ) has been one of the key focuses as the fields of Data Analytics and Artificial Intelligence (AI) continue to grow. Yet data quality analysis has mostly been a disjointed, ad hoc, and cumbersome process in the overall data analysis workflow. There have been ongoing attempts to formalize this process, but the resulting solutions are not universally applicable. Most of the proposed solutions address the problem of data quality from a limited perspective and successfully address only a subset of the challenges. These solutions fail to translate to other domains due to a lack of structure. In this paper, we present DQLearn, a toolkit for structured data quality learning. We start by presenting the core principle on which we build our library and introduce the four components that provide a solid base for addressing the needs of the data quality problem. Then, we showcase our automation structure, "Workflows", and the two optimization techniques that accompany it, which help users structure their learning problem easily. Next, we discuss four important scenarios of the DQ Workflows in the overall life cycle. Finally, we demonstrate the utility of the proposed toolkit with public datasets and show benchmark results from optimization experiments. |
Database: | OpenAIRE |
External link: |