Diftong : a tool for validating big data workflows

Author: Erik Zeitler, Johan Petrini, Raya Rizk, Steve McKeever
Language: English
Year of publication: 2019
Subject:
Information Systems and Management
Correctness
lcsh:Computer engineering. Computer hardware
Computer Networks and Communications
Computer science
Process (engineering)
Library and information science
Big data
Human error
Data validation
lcsh:TK7885-7895
02 engineering and technology
Turnaround time
lcsh:QA75.5-76.95
Information Studies
020204 information systems
0202 electrical engineering, electronic engineering, information engineering
Big data workflow
lcsh:T58.5-58.64
business.industry
lcsh:Information technology
Data quality
Big data validation tool
Workflow
Hardware and Architecture
020201 artificial intelligence & image processing
lcsh:Electronic computers. Computer science
Scenario testing
Big data validation process
Software engineering
business
Information Systems
Data testing
Source: Journal of Big Data, Vol 6, Iss 1, Pp 1-27 (2019)
Description: Data validation is about verifying the correctness of data. When organisations update and refine their data transformations to meet evolving requirements, it is imperative to ensure that the new version of a workflow still produces the correct output. We motivate the need for workflows and describe the implementation of a validation tool called Diftong. This tool compares two tabular databases resulting from different versions of a workflow to detect and prevent potential unwanted alterations. Row-based and column-based statistics are used to quantify the results of the database comparison. Diftong was shown to provide accurate results in test scenarios, bringing benefits to companies that need to validate the outputs of their workflows. By automating this process, the risk of human error is also eliminated. Compared to the more labour-intensive manual alternative, it has the added benefit of improved turnaround time for the validation process. Together this allows for a more agile way of updating data transformation workflows.
Database: OpenAIRE
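
Illustration: the description mentions comparing two tabular outputs and quantifying the differences with row-based and column-based statistics. The sketch below shows one way such a comparison could look for small in-memory tables. It is not Diftong's implementation: the function name `compare_tables`, the `key` parameter, and the sample data are hypothetical, and it assumes every row carries a unique key column.

```python
from collections import Counter


def compare_tables(old_rows, new_rows, key):
    """Compare two tables (lists of dicts) that share a unique key column.

    Returns (row_stats, column_stats). This is an illustrative sketch,
    not the method used by Diftong.
    """
    old = {row[key]: row for row in old_rows}
    new = {row[key]: row for row in new_rows}

    common = old.keys() & new.keys()
    column_stats = Counter()
    changed = 0
    for k in common:
        # Columns whose values differ between the two versions of this row.
        columns = old[k].keys() | new[k].keys()
        differing = [c for c in columns if old[k].get(c) != new[k].get(c)]
        if differing:
            changed += 1
            column_stats.update(differing)

    row_stats = {
        "removed": len(old.keys() - new.keys()),
        "added": len(new.keys() - old.keys()),
        "changed": changed,
        "unchanged": len(common) - changed,
    }
    return row_stats, dict(column_stats)


# Hypothetical outputs of two versions of the same workflow.
v1 = [
    {"id": 1, "name": "a", "amount": 10},
    {"id": 2, "name": "b", "amount": 20},
    {"id": 3, "name": "c", "amount": 30},
]
v2 = [
    {"id": 1, "name": "a", "amount": 10},
    {"id": 2, "name": "b", "amount": 25},
    {"id": 4, "name": "d", "amount": 40},
]

row_stats, column_stats = compare_tables(v1, v2, key="id")
print(row_stats)
print(column_stats)
```

Here the row statistics count how many keyed rows were removed, added, changed, or left unchanged, and the column statistics count how many matched rows differ in each column. A tool working at big data scale would presumably push this comparison into the database engine rather than load rows into memory.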