Stress testing reveals gaps in clinic readiness of image-based diagnostic artificial intelligence models
Authors: | Jacob Pfau, Nhat Anh Cao, Albert T. Young, Max Y. von Franque, Benjamin V. Wu, Kristen Fernandez, Rasika Reddy, Andrew Tam, Rachel R. Wu, Arjun Johal, Raj P. Fadadu, Michael J. Keiser, Maria L. Wei, Jennifer Y. Chen, Juan A. Vasquez |
---|---|
Year of publication: | 2021 |
Subject: | Computer science; Computer applications to medicine. Medical informatics; R858-859.7; Medicine (miscellaneous); Bioengineering; Health Informatics; Stress testing (software); Convolutional neural network; Article; Cancer screening; 030207 dermatology & venereal diseases; 03 medical and health sciences; 0302 clinical medicine; Health Information Management; Robustness (computer science); Diagnosis; Generalizability theory; Melanoma; 030304 developmental biology; 0303 health sciences; Receiver operating characteristic; Contextual image classification; business.industry; Translational research; Computer Science Applications; Test (assessment); Medical imaging; Artificial intelligence; business; Performance metric |
Source: | npj Digital Medicine, Vol 4, Iss 1, Pp 1-8 (2021) |
ISSN: | 2398-6352 |
DOI: | 10.1038/s41746-020-00380-6 |
Description: | Artificial intelligence models match or exceed dermatologists in melanoma image classification. Less is known about their robustness against real-world variations, and clinicians may incorrectly assume that a model with an acceptable area under the receiver operating characteristic curve or related performance metric is ready for clinical use. Here, we systematically assessed the performance of dermatologist-level convolutional neural networks (CNNs) on real-world non-curated images by applying computational “stress tests”. Our goal was to create a proxy environment in which to comprehensively test the generalizability of off-the-shelf CNNs developed without training or evaluation protocols specific to individual clinics. We found inconsistent predictions on images captured repeatedly in the same setting or subjected to simple transformations (e.g., rotation). Such transformations resulted in false positive or negative predictions for 6.5–22% of skin lesions across test datasets. Our findings indicate that models meeting conventionally reported metrics need further validation with computational stress tests to assess clinic readiness. |
Database: | OpenAIRE |
External link: |
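
The description mentions that simple transformations such as rotation changed the predicted class for a share of skin lesions. As a rough illustration only, and not the authors' actual protocol, a rotation stress test could be sketched as below. The names `rot90`, `flip_rate`, and `toy_predict` are hypothetical, images are stood in for by small grayscale grids, and the 0.5 decision threshold is an assumption.

```python
def rot90(img):
    """Rotate a 2D grid (list of rows) 90 degrees counterclockwise."""
    return [list(row) for row in zip(*img)][::-1]


def flip_rate(predict, images, threshold=0.5):
    """Fraction of images whose binary call changes under a 90, 180 or 270 degree rotation.

    `predict` is a hypothetical callable that maps an image to a score in [0, 1];
    it stands in for any trained classifier.
    """
    flipped = 0
    for img in images:
        base_call = predict(img) >= threshold
        rotated = img
        for _ in range(3):  # successive 90 degree rotations
            rotated = rot90(rotated)
            if (predict(rotated) >= threshold) != base_call:
                flipped += 1
                break  # count each image at most once
    return flipped / len(images)


def toy_predict(img):
    """Deliberately orientation-sensitive toy model: mean of the top half of the grid."""
    top = img[: len(img) // 2]
    values = [v for row in top for v in row]
    return sum(values) / len(values)


# Four toy "images": only the first is not rotation-invariant.
images = [
    [[1, 1], [0, 0]],
    [[1, 1], [1, 1]],
    [[0, 0], [0, 0]],
    [[0.9, 0.9], [0.9, 0.9]],
]
rate = flip_rate(toy_predict, images)
```

A model could score well on a conventional metric and still show a non-zero flip rate here, which is the kind of gap the description attributes to stress testing.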