Accurate prediction of B-form/A-form DNA conformation propensity from primary sequence: A machine learning and free energy handshake
Author: | Mandar Kulkarni, Arnab Mukherjee, Abhijit Gupta |
---|---|
Language: | English |
Publication year: | 2021 |
Subject: | Hyperparameter; Handshake; business.industry; Computer science; Generalization; nested cross-validation; DNA sequence; DNA conformation; General Decision Sciences; Overfitting; Machine learning; computer.software_genre; Article; LightGBM; machine learning; Artificial intelligence; Gradient boosting; business; computer; Host (network); genome; Predictive modelling; Interpretability |
Source: | Patterns |
ISSN: | 2666-3899 |
Description: | Summary: DNA carries the genetic code of life, with different conformations associated with different biological functions. Predicting the conformation of DNA from its primary sequence, although desirable, is a challenging problem owing to the polymorphic nature of DNA. We have deployed a host of machine learning algorithms, including the popular state-of-the-art LightGBM (a gradient boosting model), for building prediction models. We used a nested cross-validation strategy to address the issues of “overfitting” and selection bias. This strategy simultaneously provides an unbiased estimate of the generalization performance of a machine learning algorithm and allows us to tune the hyperparameters optimally. Furthermore, we built a secondary model based on SHAP (SHapley Additive exPlanations) that offers crucial insight into model interpretability. Our detailed model-building strategy and robust statistical validation protocols tackle the formidable challenge of working with small datasets, as is often the case with biological and medical data.
Highlights: • A robust machine learning model to predict A- or B-DNA conformation • The outcome of the machine learning model is explained with free energy values • Our approach works well under class imbalance and limited data constraints
The bigger picture: The sequence in the genome of an organism encodes all the information of life. We combine a data-driven approach using machine learning (ML) with the results of free energy calculations to offer a fresh perspective on the long-standing problem of predicting DNA conformation (A or B) from the sequence. We trained our ML model using sophisticated state-of-the-art algorithms such as LightGBM, along with a nested cross-validation strategy, to overcome the common problems of data bias and overfitting when constrained by limited data size.
Our study will serve the broader interest of researchers who are not only seeking accurate and reliable predictive models but also want to understand the physical and chemical origins behind the predictions. We have developed a robust machine learning-based model to accurately predict DNA conformation (A or B) from just the DNA sequence. Unlike a black-box model, our approach offers key chemical and thermodynamic insights into the predictions made by our models. |
Database: | OpenAIRE |
External link: |
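
The nested cross-validation strategy named in the description can be illustrated with a minimal sketch. It is not the authors' pipeline. A simple k-nearest-neighbour classifier stands in for LightGBM, the only feature is GC fraction, and the sequences and labels are synthetic toy data with no biological meaning. All function names (`make_toy_data`, `nested_cv`, and so on) are hypothetical. The point is the structure: the inner loop tunes a hyperparameter using only the outer-training data, and the outer loop scores the tuned model on data that played no part in tuning.

```python
import random
from statistics import mean


def gc_fraction(seq):
    """Fraction of G/C bases in a sequence (illustrative feature only)."""
    return sum(base in "GC" for base in seq) / len(seq)


def make_toy_data(n=60, length=12, seed=1):
    """Synthetic (feature, label) pairs; labels are arbitrary, NOT real A/B-DNA data."""
    rng = random.Random(seed)
    data = []
    for i in range(n):
        label = i % 2
        p_gc = 0.7 if label else 0.3
        seq = "".join(
            rng.choice("GC") if rng.random() < p_gc else rng.choice("AT")
            for _ in range(length)
        )
        data.append((gc_fraction(seq), label))
    return data


def k_folds(n, k, rng):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x (ties go to class 0)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0


def accuracy(train, test, k):
    return mean(1.0 if knn_predict(train, x, k) == y else 0.0 for x, y in test)


def split(data, fold):
    """Return (train, test) where test holds the rows indexed by fold."""
    held = set(fold)
    train = [row for j, row in enumerate(data) if j not in held]
    test = [data[j] for j in fold]
    return train, test


def cross_val_score(data, k, n_folds, rng):
    """Plain k-fold cross-validation accuracy for one hyperparameter value."""
    folds = k_folds(len(data), n_folds, rng)
    return mean(accuracy(*split(data, fold), k) for fold in folds)


def nested_cv(data, grid, outer_folds=5, inner_folds=3, seed=0):
    rng = random.Random(seed)
    outer_scores, chosen = [], []
    for fold in k_folds(len(data), outer_folds, rng):
        train, test = split(data, fold)
        # Inner loop: pick the hyperparameter using the outer-training data only.
        best_k = max(grid, key=lambda k: cross_val_score(train, k, inner_folds, rng))
        chosen.append(best_k)
        # Outer loop: evaluate the tuned model on the untouched outer-test fold.
        outer_scores.append(accuracy(train, test, best_k))
    return outer_scores, chosen


data = make_toy_data()
scores, chosen = nested_cv(data, grid=(1, 3, 5))
print("outer-fold accuracy:", [round(s, 2) for s in scores])
print("selected k per fold:", chosen)
```

The mean of the outer-fold scores is the generalization estimate. Because each outer-test fold is excluded from hyperparameter selection, that estimate is not inflated by the tuning step, which is the selection bias the description refers to.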