Demographic Reporting in Publicly Available Chest Radiograph Data Sets: Opportunities for Mitigating Sex and Racial Disparities in Deep Learning Models

Authors: Paul H. Yi, Tae Kyung Kim, Eliot Siegel, Noushin Yahyavi-Firouz-Abadi
Year of publication: 2022
Subject:
Source: Journal of the American College of Radiology. 19:192-200
ISSN: 1546-1440
Description: Data sets with demographic imbalances can introduce bias in deep learning models and potentially amplify existing health disparities. We evaluated the reporting of demographics and potential biases in publicly available chest radiograph (CXR) data sets. We reviewed publicly available CXR data sets available on February 1, 2021, with at least 100 CXRs and performed a thorough search of various repositories, including Radiopaedia and Kaggle. For each data set, we recorded the total number of images and whether the data set reported demographic variables (age, race or ethnicity, sex, insurance status) in aggregate and on an image-level basis. Twenty-three CXR data sets were included (range, 105-371,858 images). Most data sets reported demographics in some form (19 of 23; 82.6%) and on an image level (17 of 23; 73.9%). The majority reported age (19 of 23; 82.6%) and sex (18 of 23; 78.2%), but a minority reported race or ethnicity (2 of 23; 8.7%) and insurance status (1 of 23; 4.3%). Of the 13 data sets with sex distribution readily available, the average breakdown was 55.2% male subjects, ranging from 47.8% to 69.7% male representation. Of these, 8 (61.5%) overrepresented male subjects and 5 (38.5%) overrepresented female subjects. Although most publicly available CXR data sets report age and sex on an image-level basis, few report race or ethnicity and insurance status. Furthermore, these data sets frequently underrepresent one of the sexes, more frequently the female sex. We recommend that data sets report standard demographic variables and, when possible, balance demographic representation to mitigate bias. For researchers using these data sets, we further recommend paying attention to balancing demographic labels in addition to disease labels, as well as developing training methods that can account for these imbalances.
Database: OpenAIRE