iRNA-m5C_NB: A Novel Predictor to Identify RNA 5-Methylcytosine Sites Based on the Naive Bayes Classifier

Author: Lei Xu, Xiaoling Li, Lijun Dou, Huaikun Xiang, Hui Ding
Year of publication: 2020
Subject:
Source: IEEE Access. 8:84906-84917
ISSN: 2169-3536
DOI: 10.1109/access.2020.2991477
Description: As a widespread RNA post-transcriptional modification (PTCM), 5-methylcytosine (m5C) is vital to a better understanding of basic biological mechanisms and to the treatment of major diseases. Traditional high-throughput experimental approaches for finding m5C sites are usually expensive and laborious. Given the large number of RNA sequences to be analyzed, accurate computational methods that distinguish m5C from non-m5C sites offer an efficient alternative. Here we introduce a novel predictor, iRNA-m5C_NB, that identifies m5C sites in Homo sapiens using the Naive Bayes (NB) algorithm. In this method, the unbalanced dataset Met935 is first processed with the hybrid-sampling strategy SMOTEENN. The top 57 features are then selected by ANOVA F-value from four well-performing feature extraction techniques: Bi-profile Bayes (BPB), enhanced nucleic acid composition (ENAC), electron-ion interaction pseudopotentials (EIIP), and mMGap_1. In the jackknife test, recall on the unbalanced training dataset Met935 reaches 82.81% with an MCC of 0.63. On the independent dataset Test1157, the predictor still achieves a recall of 70.06% and an MCC of 0.34. It is the first m5C predictor constructed from an unbalanced dataset, and its recall is 19.82% and 59.23% higher than that of the latest tool, RNAm5CPred, in the jackknife and independent tests, respectively. These results show that iRNA-m5C_NB outperforms other state-of-the-art models, and we hope it will serve as an efficient and reliable method for identifying m5C sites.
Database: OpenAIRE
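
Note on the reported metrics: the abstract gives recall together with MCC (Matthews correlation coefficient) because recall alone can look strong on an unbalanced dataset while many negatives are misclassified. The following is a minimal sketch of how the two metrics are computed from confusion-matrix counts. The function names and the counts in the example are hypothetical and chosen only for illustration; they are not taken from the paper or its datasets.

```python
import math


def recall(tp: int, fn: int) -> float:
    """Fraction of true positive sites (e.g. m5C sites) that are recovered."""
    total = tp + fn
    return tp / total if total else 0.0


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts.

    Ranges from -1 to 1; returns 0.0 when any marginal sum is zero,
    which is a common convention for the undefined case.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical counts for an unbalanced test set (100 positives, 1000 negatives).
# These numbers are illustrative only and do not come from the paper.
tp, fn, fp, tn = 70, 30, 200, 800

example_recall = recall(tp, fn)
example_mcc = mcc(tp, tn, fp, fn)

print(f"recall = {example_recall:.4f}")
print(f"MCC    = {example_mcc:.4f}")
```

With counts like these, most positives are found, yet the MCC stays moderate because false positives outnumber true positives. This is one way a high recall and a lower MCC, as reported for the independent test, can occur together.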