Prediction of Protein Expression and Growth Rates by Supervised Machine Learning

Autor: Simiao Zhao
Rok vydání: 2021
Předmět:
Zdroj: Natural Science. 13:301-330
ISSN: 2150-4105
2150-4091
Popis: The DNA sequences of an organism play an important influence on its transcription and translation process, thus affecting its protein production and growth rate. Due to the com-plexity of DNA, it was extremely difficult to predict the macroscopic characteristics of or-ganisms. However, with the rapid development of machine learning in recent years, it be-comes possible to use powerful machine learning algorithms to process and analyze biolog-ical data. Based on the synthetic DNA sequences of a specific microbe, E. coli, I designed a process to predict its protein production and growth rate. By observing the properties of a data set constructed by previous work, I chose to use supervised learning regressors with encoded DNA sequences as input features to perform the predictions. After comparing different encoders and algorithms, I selected three encoders to encode the DNA sequences as inputs and trained seven different regressors to predict the outputs. The hy-per-parameters are optimized for three regressors which have the best potential prediction performance. Finally, I successfully predicted the protein production and growth rates, with the best R2 score 0.55 and 0.77, respectively, by using encoders to catch the potential fea-tures from the DNA sequences.
Databáze: OpenAIRE