Description: |
Microbiome research has become a ubiquitous component of contemporary clinical research, with the potential to uncover associations between microbiome composition and disease. As microbiome data become more prevalent, understanding how to analyse them is increasingly important. One complicating property of microbiome data is that they are inherently compositional and thus constrained to the simplex; because of this, they cannot be analysed directly using conventional statistical methods. In this paper, we transform the compositional data in order to lift the simplex constraint, and then investigate the viability of applying conventional statistical methods to the transformed data. Using a high-dimensional data set containing gut-microbiome samples from patients with Parkinson's disease and from control individuals, we first transform the raw data to the centred log-ratio scale, and then use permutational multivariate analysis of variance (PERMANOVA) to test whether the two groups differ with respect to bacterial abundances. We then employ three machine learning classifiers (logistic regression, XGBoost, and random forest) and evaluate their performance on the transformed data. The PERMANOVA results indicate that gut-microbiome composition in the patients with Parkinson's disease does indeed differ from that in the control individuals. Random forest achieves the highest classification accuracy, followed by XGBoost, while logistic regression performs poorly, which calls into question its viability for the analysis of high-dimensional compositional microbiome data. We find four bacterial species of high importance for the classification: Prevotella copri, Prevotella sp. CAG 520, Akkermansia muciniphila, and Butyricimonas virosa, the first three of which have previously been mentioned in the Parkinson's literature. |
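The centred log-ratio transformation named in the abstract can be illustrated with a minimal sketch. The function name `clr_transform`, the toy count matrix, and the pseudocount of 0.5 used to handle zero counts are assumptions made for illustration only; the abstract does not state how zeros were treated or which software was used.

```python
import numpy as np

def clr_transform(counts, pseudocount=0.5):
    """Centred log-ratio transform of a (samples x taxa) count matrix.

    The pseudocount is an illustrative assumption to avoid log(0);
    it is not taken from the abstract.
    """
    x = np.asarray(counts, dtype=float) + pseudocount
    log_x = np.log(x)
    # Subtracting each sample's mean log-abundance is equivalent to
    # dividing by the sample's geometric mean before taking logs.
    return log_x - log_x.mean(axis=1, keepdims=True)

# Hypothetical toy data: 2 samples, 3 taxa.
counts = np.array([[10, 0, 90],
                   [5, 5, 40]])
clr = clr_transform(counts)
```

Each row of `clr` sums to zero, so the values lie in an unconstrained real space centred on the sample's geometric mean rather than on the simplex, which is what allows methods such as PERMANOVA or standard classifiers to be applied afterwards.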