Multi-Institutional Assessment and Crowdsourcing Evaluation of Deep Learning for Automated Classification of Breast Density
Author: Etta D. Pisano, Sheela Agarwal, Katharina Hoebel, Jay B. Patel, Praveer Singh, Keith J. Dreyer, Ken Chang, Bibb Allen, Nathan Gaw, Andrew Beers, Laura Brink, Laura Coombs, Nishanth Thumbavanam Arun, Jayashree Kalpathy-Cramer, Meesam Shah, Mike Tilkin
Year of publication: 2020
Subject: Digital mammography; Mammography; Breast neoplasms; Breast density; BI-RADS; Machine learning; Deep learning; Artificial intelligence; Artificial neural networks; Crowdsourcing; Generalizability theory; Sampling (statistics); Data sets; Computer science; Radiology, nuclear medicine and imaging; Humans; Female
Source: Journal of the American College of Radiology. 17:1653-1662
ISSN: 1546-1440
DOI: 10.1016/j.jacr.2020.05.015
Description:
Objective: We developed deep learning algorithms to automatically assess BI-RADS breast density.
Methods: Using a large multi-institution patient cohort of 108,230 digital screening mammograms from the Digital Mammographic Imaging Screening Trial, we investigated the effect of data, model, and training parameters on overall model performance and obtained a crowdsourced evaluation from attendees of the ACR 2019 Annual Meeting.
Results: Our best-performing algorithm achieved good agreement with radiologists who were qualified interpreters of mammograms, with a four-class κ of 0.667. When training was performed with images sampled randomly from the data set, rather than with an equal number of images from each density category, model predictions were biased away from low-prevalence categories such as extremely dense breasts. The net result was higher sensitivity and lower specificity for predicting dense breasts with equal-class sampling than with random sampling. We also found that model performance degrades when evaluated on digital mammography data formats that differ from the one trained on, underscoring the importance of multi-institutional training sets. Lastly, we showed that crowdsourced annotations, including those from attendees who routinely read mammograms, agreed more closely with our algorithm than with the original interpreting radiologists.
Conclusion: We demonstrated the parameters that can influence model performance and how crowdsourcing can be used for evaluation. This study was performed in tandem with the development of the ACR AI-LAB, a platform for democratizing artificial intelligence.
Database: OpenAIRE
External link:
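The Results above report agreement between the algorithm and radiologists as a four-class κ of 0.667. As a minimal sketch (not the authors' implementation, and with hypothetical toy labels), an unweighted Cohen's kappa over the four BI-RADS density categories (A-D) can be computed like this:

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa: chance-corrected agreement between two raters."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed proportion of exact agreement.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b.get(c, 0) for c in counts_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical BI-RADS density ratings (A = fatty ... D = extremely dense).
radiologist = ["A", "B", "B", "C", "C", "C", "D", "B"]
model       = ["A", "B", "C", "C", "C", "B", "D", "B"]
print(round(cohen_kappa(radiologist, model), 3))  # → 0.636
```

The paper does not state whether the reported κ is weighted; for ordinal categories such as breast density, a quadratically weighted variant (which penalizes A-vs-D disagreements more than A-vs-B) is a common alternative.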