Popis: |
Deep learning methodologies have shown outstanding success in different image analysis applications. They rely on an abundance of labelled observations to build the model. However, gathering labelled observations is frequently expensive, which makes the use of deep learning models impractical. Practical examples of this challenge can be found in the analysis of medical images. For instance, labelling images to solve medical imaging problems requires expensive effort, as experts (e.g., radiologists) are needed to produce reliable labels. Semi-supervised learning is an increasingly popular alternative approach for dealing with small labelled datasets and increasing model test accuracy by leveraging unlabelled data. However, in real-world usage settings, an unlabelled dataset might follow a different distribution than the labelled dataset (e.g., the labelled dataset was sampled from a target clinic and the unlabelled dataset from a source clinic). There are different causes for a distribution mismatch between the labelled and the unlabelled dataset: a prior probability shift, a set of observations from classes unseen in the labelled dataset, and a covariate shift of the features. In this work, we assess the impact of these phenomena on the state-of-the-art semi-supervised model known as MixMatch. We evaluate the impact of both label and feature distribution mismatch on MixMatch in a real-world application: the classification of chest X-ray images for COVID-19 detection. We also test the performance gain of using MixMatch for malignant cancer detection using mammograms. For both case studies, we built new datasets from a private clinic in Costa Rica. We propose different approaches to address the different causes of a distribution mismatch between the labelled and unlabelled datasets. First, regarding the prior probability shift, we propose a simple model-oriented approach.
According to our experiments, the proposed method yielded accuracy gains of up to 14%, with statistical significance. For more challenging distribution mismatch settings, caused by a covariate shift in the feature space and by sampling unseen classes in the unlabelled dataset, we propose a data-oriented approach. As an assessment tool, we propose a set of dataset dissimilarity metrics designed to measure how much performance benefit a semi-supervised training regime can get from using one specific unlabelled dataset over another. We also propose two techniques that score each unlabelled observation according to how much accuracy gain its inclusion in the unlabelled dataset might bring to semi-supervised training. These scores can be used to discard harmful unlabelled observations. The novel methods use a generic feature extractor to build a feature space where the metrics and scores are computed. The dataset dissimilarity metrics yielded a linear correlation of up to 90% with the performance of the state-of-the-art MixMatch semi-supervised training algorithm, suggesting that such metrics can be used to assess the quality of an unlabelled dataset. As for the scoring methods for unlabelled data, according to our tests, using them to discard harmful unlabelled data increased the performance of MixMatch by around 20%, in the context of medical image analysis applications. |