A Machine Learning Based Framework for Verification and Validation of Massive Scale Image Data
Authors: | Xin-Hua Hu, Venkat N. Gudivada, Junhua Ding |
---|---|
Year of publication: | 2021 |
Subject: |
Information Systems and Management
business.industry; Computer science; Active learning (machine learning); Big data; Online machine learning; Confusion matrix; 020207 software engineering; 02 engineering and technology; Machine learning; computer.software_genre; Software; Computational learning theory; 0202 electrical engineering, electronic engineering, information engineering; 020201 artificial intelligence & image processing; Software verification and validation; Data mining; Metamorphic testing; Artificial intelligence; business; computer; Information Systems |
Source: | IEEE Transactions on Big Data. 7:451-467 |
ISSN: | 2372-2096 |
DOI: | 10.1109/tbdata.2017.2680460 |
Description: | Big data validation and system verification are crucial for ensuring the quality of big data applications. However, a rigorous technique for such tasks has yet to emerge. During the past decade, we have developed a big data system called CMA for investigating the classification of biological cells based on cell morphology, which is captured in diffraction images. CMA includes a collection of scientific software tools, machine learning algorithms, and a large-scale cell image repository. To ensure the quality of CMA, we developed a framework for rigorously validating the massive scale image data and for adequately verifying both the software tools and the machine learning algorithms. The big data is validated by iteratively selecting the data using a machine learning approach. The framework introduces an experimental approach, guided by a feature selection algorithm, to select an optimal feature set for improving machine learning performance. Verification of the software and algorithms is built on an iterative metamorphic testing approach, because the software and algorithms are non-testable. A machine learning approach is introduced for developing test oracles iteratively to ensure the adequacy of the test coverage criteria. The performance of the machine learning algorithm is evaluated with stratified N-fold cross-validation and a confusion matrix. We describe the design of the proposed big data verification and validation framework with CMA as the case study, and demonstrate its effectiveness by verifying and validating the dataset, the software, and the algorithms in CMA. |
Database: | OpenAIRE |