Popis: |
Distinguishing coding sequences (CDS) from non-coding RNA (ncRNA) sequences is challenging, and several machine learning models have been developed for this task. Because curated CDS outnumber curated ncRNAs many-fold, we devised a novel approach that works with the complete datasets from fifteen diverse species. In our proposed binary approach, we replaced every 'A' and 'T' with '0' and every 'G' and 'C' with '1' to obtain binary forms of the CDS and ncRNAs. A k-mer analysis of these binary sequences revealed that the frequencies of binary patterns differ between CDS and ncRNAs and can serve as features to distinguish them. Using these distinguishing frequencies, we used a k-nearest neighbor classifier to separate the two classes. Our strategy is not only time-efficient but also yields significantly higher performance in terms of Matthews Correlation Coefficient (MCC), accuracy, F1 score, precision, recall and AUC-ROC for species such as P. paniscus, M. mulatta, M. lucifugus, G. gallus, C. japonica, C. abingdonii, A. carolinensis, D. melanogaster and C. elegans, compared with the conventional ATGC approach. Additionally, we show that the performance obtained when diverse species were tested on the model based on H. sapiens correlated with the geological evolutionary timeline, further strengthening our approach. We therefore propose that CDS and ncRNAs can be classified efficiently using "2-character" binary frequencies rather than the "4-character" frequencies of the ATGC approach, and that our highly efficient binary approach can successfully replace the more complex ATGC approach. |
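The binary encoding and k-mer frequency features described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `binarize` and `binary_kmer_frequencies`, the default `k = 3`, the treatment of 'U' as equivalent to 'T', and the dropping of ambiguous symbols such as 'N' are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product

# Assumption: 'U' is mapped like 'T' so RNA-alphabet input is also handled.
_TO_BINARY = str.maketrans({"A": "0", "T": "0", "U": "0", "G": "1", "C": "1"})


def binarize(seq):
    """Map A/T (and U) to '0' and G/C to '1'.

    Assumption: any other symbol (e.g. 'N') is simply dropped.
    """
    translated = seq.upper().translate(_TO_BINARY)
    return "".join(ch for ch in translated if ch in "01")


def binary_kmer_frequencies(seq, k=3):
    """Return relative frequencies of all 2**k binary k-mers in a fixed order."""
    bits = binarize(seq)
    # Fixed feature order: '000', '001', ..., '111' for k = 3.
    kmers = ["".join(p) for p in product("01", repeat=k)]
    # Count overlapping windows of length k.
    counts = Counter(bits[i:i + k] for i in range(len(bits) - k + 1))
    total = sum(counts.values())
    return [counts[km] / total if total else 0.0 for km in kmers]


if __name__ == "__main__":
    example = "ATGGCGTACGCTTAA"  # hypothetical sequence, for illustration only
    print(binarize(example))
    print(binary_kmer_frequencies(example, k=3))
```

Each sequence yields a vector of length 2^k, compared with 4^k for the ATGC alphabet at the same k. Such vectors could then be passed to a k-nearest neighbor classifier, for example scikit-learn's `KNeighborsClassifier`; the specific library is likewise an assumption, since the abstract does not name one.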