D-sORF: Accurate Ab Initio Classification of Experimentally Detected Small Open Reading Frames (sORFs) Associated with Translational Machinery.

Author: Perdikopanis, Nikos; Giannakakis, Antonis; Kavakiotis, Ioannis; Hatzigeorgiou, Artemis G.
Subject:
Source: Biology (2079-7737); Aug2024, Vol. 13 Issue 8, p563, 19p
Abstract: Simple Summary: Small open reading frames (sORFs; fewer than 300 nucleotides or fewer than 100 amino acids) are short DNA sequences that can regulate cellular processes or produce functional peptides. Identifying these sORFs, especially in non-genic regions, remains challenging despite advances in sequencing technology. To address this, we developed D-sORF, a machine-learning framework that predicts coding sORFs using the nucleotide context and motifs around start codons. D-sORF achieves 94.74% precision and 92.37% accuracy, matching or outperforming experimental methods such as ribosome sequencing (Ribo-Seq) in identifying peptide-producing transcripts and in filtering out false positives. Unlike traditional conservation-based methods, D-sORF can recognize sORFs with low sequence similarity, which is its main advantage. Its robust prediction capabilities make it a valuable tool for researchers. It can enhance our understanding of the roles of sORFs, potentially leading to discoveries in gene regulation and to new therapeutic targets. By accurately distinguishing small coding sequences from non-coding ones, D-sORF contributes to genomic research and its applications in medicine and biotechnology.
Small open reading frames (sORFs; <300 nucleotides or <100 amino acids) are widespread across all genomes, and an increasing variety of them appear to be translated from non-genic regions. Over the past few decades, peptides produced from sORFs have been identified as functional in various organisms, from bacteria to humans. Despite recent advances in next-generation sequencing and proteomics, accurate annotation and classification of sORFs remain a rate-limiting step toward reliable, high-throughput detection of small proteins from non-genic regions. Computational methods based on machine learning cost less than biological experiments and can be used to detect sORFs, laying the groundwork for subsequent biological experiments.
We present D-sORF, a machine-learning framework that integrates the statistical nucleotide context and motif information around the start codon to predict coding sORFs. D-sORF scores directly for coding identity and requires only the underlying genomic sequence, without incorporating parameters such as conservation, which, for sORFs, may increase the dispersion of scores within the significantly less conserved non-genic regions. D-sORF achieves 94.74% precision and 92.37% accuracy for small ORFs (using the medium-length 99 nt window). When D-sORF is applied to sORFs associated with ribosomes, its identification of transcripts producing peptides (annotated by Ensembl IDs) is similar or superior to that of experimental methodologies based on ribosome-sequencing (Ribo-Seq) profiling. In parallel, its recognition of putative negative data, such as intron-containing transcripts that associate with ribosomes, remains remarkably low, indicating that D-sORF could be applied efficiently to filter out false-positive sORFs in Ribo-Seq data that arise from non-productive ribosomal binding or from noise inherent in these protocols. [ABSTRACT FROM AUTHOR]
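As a rough illustration of the kind of input the abstract describes (sORFs shorter than 300 nucleotides and the sequence context around their start codon), the following Python sketch enumerates candidate sORFs on one strand and cuts a 99 nt window around each start codon. It is not the D-sORF implementation. The function names `find_sorfs` and `start_codon_window`, the restriction to ATG start codons, counting the stop codon toward the length limit, and centring the window on the start codon are all assumptions made for the example; the abstract does not specify how the 99 nt window is positioned.

```python
# Hypothetical sketch, not the D-sORF code: candidate sORF enumeration and
# start-codon context extraction from a plain DNA string.

STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code


def find_sorfs(seq, max_nt=300):
    """Return (start, end) pairs for ATG..stop ORFs shorter than max_nt.

    Assumptions: forward strand only, ATG as the only start codon, and the
    stop codon is counted toward the ORF length.
    """
    seq = seq.upper()
    results = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        # Walk in-frame codons until the first stop codon.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                end = pos + 3
                if end - start < max_nt:
                    results.append((start, end))
                break
    return results


def start_codon_window(seq, start, window=99):
    """Return a window-nt slice centred on the start codon (an assumed layout).

    Returns None when the window would extend past either end of seq.
    """
    flank = (window - 3) // 2  # nucleotides on each side of the start codon
    lo, hi = start - flank, start + 3 + flank
    if lo < 0 or hi > len(seq):
        return None
    return seq[lo:hi]


if __name__ == "__main__":
    # Toy sequence: one 9 nt ORF embedded in flanking filler.
    toy = "C" * 50 + "ATGGCCTGA" + "C" * 50
    for start, end in find_sorfs(toy):
        print(start, end, start_codon_window(toy, start))
```

Windows extracted this way could then be passed to any sequence classifier; how D-sORF itself derives its statistical context and motif features is described in the full article, not in this abstract.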
Database: Complementary Index