Prediction of Protein Expression and Growth Rates by Supervised Machine Learning
Autor: | Simiao Zhao |
---|---|
Rok vydání: | 2021 |
Předmět: | |
Zdroj: | Natural Science. 13:301-330 |
ISSN: | 2150-4105 2150-4091 |
Popis: | The DNA sequences of an organism play an important influence on its transcription and translation process, thus affecting its protein production and growth rate. Due to the com-plexity of DNA, it was extremely difficult to predict the macroscopic characteristics of or-ganisms. However, with the rapid development of machine learning in recent years, it be-comes possible to use powerful machine learning algorithms to process and analyze biolog-ical data. Based on the synthetic DNA sequences of a specific microbe, E. coli, I designed a process to predict its protein production and growth rate. By observing the properties of a data set constructed by previous work, I chose to use supervised learning regressors with encoded DNA sequences as input features to perform the predictions. After comparing different encoders and algorithms, I selected three encoders to encode the DNA sequences as inputs and trained seven different regressors to predict the outputs. The hy-per-parameters are optimized for three regressors which have the best potential prediction performance. Finally, I successfully predicted the protein production and growth rates, with the best R2 score 0.55 and 0.77, respectively, by using encoders to catch the potential fea-tures from the DNA sequences. |
Databáze: | OpenAIRE |
Externí odkaz: |