Cross-species prediction of essential genes in insects through machine learning and sequence-based attributes

Authors: de Castro GM, Monteiro TAS, Hastenreiter Z, Francisco Pereira Lobo
Year of publication: 2021
Subject:
DOI: 10.1101/2021.03.15.433440
Description: Insects are organisms with a vast phenotypic diversity and key ecological roles. Several insect species also have medical, agricultural and veterinary importance as parasites and vectors of diseases. Therefore, strategies to identify potential essential genes in insects may reduce the resources needed to find molecular players in central processes of insect biology. Furthermore, the detection of essential genes that occur only in certain groups within insects, such as lineages containing insect pests and vectors, may provide a more rational approach to selecting essential genes for the development of insecticides with fewer off-target effects. However, most machine learning predictors of essential genes in multicellular eukaryotes rely on gene features derived from experimental data that are expensive and laborious to obtain, such as gene expression profiles or protein-protein interactions. This information is not available for the vast majority of insect species, which prevents this strategy from being used effectively to survey genomic data from non-model insect species for candidate essential genes. Here we present a general machine learning strategy to predict essential genes in insects using only sequence-based attributes (statistical and physicochemical data). We validate our strategy using genomic data for the two insect species where large-scale gene essentiality data are available: Drosophila melanogaster (fruit fly, Diptera) and Tribolium castaneum (red flour beetle, Coleoptera). We used publicly available databases plus a thorough literature review to obtain datasets of essential and non-essential genes for D. melanogaster and T. castaneum, and then computed sequence-based attributes that were used to train statistical models (Random Forest and Gradient Boosting Trees) to predict essential genes for each species. Both models are capable of distinguishing essential from non-essential genes significantly better than zero-rule classifiers.
Furthermore, models trained on one insect species are also capable of predicting essential genes in the other species significantly better than expected by chance. The Random Forest D. melanogaster model can also distinguish between essential and non-essential T. castaneum genes with no known homologs in the fly significantly better than a zero-rule model, demonstrating that it is possible to use our models to predict lineage-specific essential genes in a phylogenetically distant insect order. To the best of our knowledge, this is the first report of the development and validation of a general predictor of essential genes in insects using sequence-based attributes that can, in principle, be computed for any insect species where genomic information is available. The code and data used to predict essential genes in insects are freely available at https://github.com/g1o/GeneEssentiality/.
Database: OpenAIRE
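
The abstract compares trained models against zero-rule classifiers, which always predict the most frequent class in the training data. The following Python sketch illustrates that baseline together with one simple sequence-based attribute set (protein length and amino acid composition). It is not the authors' implementation: the function names, the choice of attributes, and the toy labels are all assumptions made for illustration, and the record does not specify which statistical and physicochemical attributes the authors actually computed.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so feature vectors line up.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sequence_attributes(protein):
    """Return [length, freq(A), freq(C), ...] for a protein sequence.

    Hypothetical attribute set chosen for illustration only.
    """
    protein = protein.upper()
    if not protein:
        raise ValueError("empty sequence")
    counts = Counter(protein)
    n = len(protein)
    return [float(n)] + [counts[aa] / n for aa in AMINO_ACIDS]


def zero_rule_fit(labels):
    """Return the majority class, which a zero-rule classifier always predicts."""
    return Counter(labels).most_common(1)[0][0]


def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the actual label."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Toy, made-up labels: 7 non-essential genes and 3 essential genes.
toy_labels = ["non-essential"] * 7 + ["essential"] * 3
majority = zero_rule_fit(toy_labels)
baseline_accuracy = accuracy([majority] * len(toy_labels), toy_labels)
print(majority, baseline_accuracy)
```

With imbalanced classes, the zero-rule baseline can reach a high accuracy without using any gene features, so a trained model is only informative if it scores significantly above this value.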