Systematic misestimation of machine learning performance in neuroimaging studies of depression

Authors: Daniel Emden, Xiaoyi Jiang, Tilo Kircher, David M. A. Mehler, Volker Arolt, Axel Krug, Ronny Redlich, Scott R. Clark, Tim Hahn, Ramona Leenings, Igor Nenadic, Simon B. Eickhoff, Udo Dannlowski, Bernhard T. Baune, Micah Cearns, Nils Opel, Claas Flint, Nils R. Winter
Year of publication: 2021
Subject:
FOS: Computer and information sciences
Computer Vision and Pattern Recognition (cs.CV)
Population
Computer Science - Computer Vision and Pattern Recognition
Neuroimaging
Sample (statistics)
Machine learning
computer.software_genre
Article
Machine Learning
FOS: Electrical engineering, electronic engineering, information engineering
medicine
Humans
ddc:610
education
Depression (differential diagnoses)
Pharmacology
Depressive Disorder, Major
education.field_of_study
Depression
business.industry
Image and Video Processing (eess.IV)
Diagnostic markers
Small sample
Electrical Engineering and Systems Science - Image and Video Processing
Translational research
Predictive analytics
medicine.disease
Magnetic Resonance Imaging
Psychiatry and Mental health
Sample size determination
FOS: Biological sciences
Quantitative Biology - Neurons and Cognition
Major depressive disorder
Neurons and Cognition (q-bio.NC)
Artificial intelligence
business
Psychology
computer
Source: Neuropsychopharmacology
Neuropsychopharmacology 46(8), 1510-1517 (2021). doi:10.1038/s41386-021-01020-7
ISSN: 1740-634X, 0893-133X
Description: We currently observe a disconcerting phenomenon in machine learning studies in psychiatry: while we would expect larger samples to yield better results due to the availability of more data, larger machine learning studies consistently show much weaker performance than the numerous small-scale studies. Here, we systematically investigated this effect, focusing on one of the most heavily studied questions in the field, namely the classification of patients suffering from Major Depressive Disorder (MDD) and healthy controls based on neuroimaging data. Drawing upon structural MRI data from a balanced sample of N = 1868 MDD patients and healthy controls from our recent international Predictive Analytics Competition (PAC), we first trained and tested a classification model on the full dataset, which yielded an accuracy of 61%. Next, we mimicked the process by which researchers would draw samples of various sizes (N = 4 to N = 150) from the population and showed a strong risk of misestimation. Specifically, for small sample sizes (N = 20), we observed accuracies of up to 95%. For medium sample sizes (N = 100), accuracies of up to 75% were found. Importantly, further investigation showed that sufficiently large test sets effectively protect against performance misestimation, whereas larger datasets per se do not. While these results question the validity of a substantial part of the current literature, we outline the relatively low-cost remedy of larger test sets, which is readily available in most cases.
Database: OpenAIRE
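
The description's central claim, that small test sets can produce accuracies far from the true value, can be illustrated with a minimal simulation. This is a sketch under simplifying assumptions and not the authors' procedure: it treats each test prediction as an independent draw that is correct with probability 0.61 (the full-dataset accuracy reported above), whereas the study subsampled real MRI data and trained a model on each subsample. The function names, the test-set sizes, and the number of simulated studies are hypothetical choices made for illustration.

```python
import random


def simulate_observed_accuracies(true_accuracy, test_size, n_studies, seed=0):
    """Simulate the accuracy measured by many hypothetical studies.

    Assumption: each test prediction is an independent Bernoulli trial
    that is correct with probability `true_accuracy`.
    """
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_studies):
        # Count how many of the test_size predictions happen to be correct.
        correct = sum(rng.random() < true_accuracy for _ in range(test_size))
        accuracies.append(correct / test_size)
    return accuracies


def summarize(accuracies):
    """Return the minimum, mean, and maximum of the observed accuracies."""
    return {
        "min": min(accuracies),
        "mean": sum(accuracies) / len(accuracies),
        "max": max(accuracies),
    }


if __name__ == "__main__":
    # 0.61 is the full-dataset accuracy from the description; the test-set
    # sizes and the number of simulated studies are illustrative assumptions.
    for test_size in (20, 100, 1000):
        stats = summarize(simulate_observed_accuracies(0.61, test_size, 2000))
        print(
            f"test size {test_size:>4}: "
            f"min={stats['min']:.2f} mean={stats['mean']:.2f} max={stats['max']:.2f}"
        )
```

Under these assumptions the mean observed accuracy stays close to 0.61 at every test-set size, while the gap between the lowest and highest observed accuracy narrows as the test set grows. This is consistent with the description's point that larger test sets protect against misestimation.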