binomialRF: Interpretable combinatoric efficiency of random forests to identify biomarker interactions

Authors: Liam Wilson, Hao Helen Zhang, Joanne Berghout, Yves A. Lussier, Samir Rachid Zaim, Colleen Kenost, Wesley Chiu
Language: English
Publication year: 2019
Subject:
Source: BMC Bioinformatics, Vol 21, Iss 1, Pp 1-22 (2020)
BMC Bioinformatics
DOI: 10.1101/681973
Description: Background: In this era of data science-driven bioinformatics, machine learning research has focused on feature selection, as users want more interpretation and post-hoc analyses for biomarker detection. However, when a study has more features (e.g., transcripts) than samples (e.g., mice or human samples), biomarker detection poses major statistical challenges because traditional statistical techniques are underpowered in high dimensions. Second- and third-order interactions of these features pose a substantial combinatoric dimensional challenge. In computational biology, random forest [1] (RF) classifiers are widely used [2–7] due to their flexibility, powerful performance, robustness to “P predictors ≫ subjects N” difficulties, and ability to rank features. We propose binomialRF, a feature selection technique in RFs that provides an alternative interpretation for features using a correlated binomial distribution and scales efficiently to analyze multiway interactions.
Methods: binomialRF treats each tree in an RF as a correlated but exchangeable binary trial. It determines importance by constructing a test statistic based on a feature’s selection frequency, from which it computes the feature’s rank, nominal p-value, and multiplicity-adjusted q-value using a one-sided hypothesis test with a correlated binomial distribution. A distributional adjustment addresses the co-dependencies among trees, which arise because the trees subsample from the same dataset. The proposed algorithm efficiently identifies multiway nonlinear interactions by generalizing the test statistic to count sub-trees.
Results: In simulations and in the Madelon benchmark dataset studies, binomialRF showed computational gains (30 to 600 times faster) while maintaining competitive variable precision and recall in identifying biomarkers’ main effects and interactions. In two clinical studies, the binomialRF algorithm prioritized previously published, relevant pathological molecular mechanisms (features) with high classification precision and recall, both when using features alone and when using their statistical interactions alone.
Conclusion: binomialRF extends previous methods for identifying interpretable features in RFs and brings them together under a correlated binomial distribution to create an efficient hypothesis testing algorithm that identifies biomarkers’ main effects and interactions. Preliminary results in simulations demonstrate computational gains while retaining competitive model selection and classification accuracies. Future work will extend this framework to incorporate ontologies that provide pathway-level feature selection from gene expression input data.
Availability: GitHub: https://github.com/SamirRachidZaim/binomialRF
Supplementary information: Supplementary analyses and results are available at https://github.com/SamirRachidZaim/binomialRF_simulationStudy
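Illustrative note: the selection-frequency test described in the abstract can be sketched roughly as follows. This is a simplified illustration and not the authors' implementation. It assumes that trees are independent (an ordinary binomial stands in for the correlated binomial), that under the null every feature is equally likely to be chosen at a tree's root (p0 = 1/P), and that q-values come from a Benjamini-Hochberg adjustment. The function names and the toy counts are hypothetical.

```python
from math import comb


def binom_upper_tail(k, n, p):
    """One-sided p-value P(X >= k) for X ~ Binomial(n, p).

    Assumption: trees are treated as independent trials; the abstract
    describes a correlated binomial, which is not modeled here.
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in input order.

    Assumption: the abstract only says "multiplicity-adjusted q-value";
    BH is used here purely for illustration.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def selection_frequency_test(root_counts, n_trees):
    """Test each feature's root-selection count against a uniform null.

    root_counts[j] is the number of trees whose root split uses feature j.
    Assumption: null selection probability is 1 / (number of features).
    """
    p0 = 1.0 / len(root_counts)
    pvals = [binom_upper_tail(k, n_trees, p0) for k in root_counts]
    return pvals, bh_adjust(pvals)


# Hypothetical toy data: 100 trees, 8 features, feature 0 dominates the roots.
root_counts = [40, 12, 9, 11, 8, 10, 4, 6]
pvals, qvals = selection_frequency_test(root_counts, n_trees=100)
for j, (k, p, q) in enumerate(zip(root_counts, pvals, qvals)):
    print(f"feature {j}: count={k:3d}  p={p:.3g}  q={q:.3g}")
```

In this toy setting only feature 0 is selected far more often than the uniform null predicts, so it is the only one with a small q-value. Ignoring the correlation among trees would generally make such p-values too optimistic, which is the issue the distributional adjustment in the abstract is meant to address.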
Database: OpenAIRE