Benchmarking and Testing Machine Learning Approaches with BARRA:CuRDa, a Curated RNA-Seq Database for Cancer Research
Author: | Bruno César Feltes, Joice de Faria Poloni, Márcio Dorn |
---|---|
Year of publication: | 2021 |
Subject: | Database; Process (engineering); Computer science; Pseudogene; Feature selection; RNA-Seq; Benchmarking; Machine learning; Field (computer science); Data set; Computational Mathematics; Computational Theory and Mathematics; Modeling and Simulation; Genetics; Benchmark (computing); Cancer research; Artificial intelligence; Molecular Biology |
Source: | Journal of Computational Biology. 28:931-944 |
ISSN: | 1557-8666 |
DOI: | 10.1089/cmb.2020.0463 |
Description: | RNA-seq is gradually becoming the dominant technique for assessing global gene expression in biological samples, allowing more flexible protocols and more robust analysis. However, the nature of RNA-seq results imposes new data-handling challenges for computational analysis. With the increasing use of machine learning (ML) techniques in the biomedical sciences, databases that provide curated data sets, treated with state-of-the-art approaches and already adapted to ML protocols, become essential for testing new algorithms. In this study, we present the Benchmarking of ARtificial intelligence Research: Curated RNA-seq Database (BARRA:CuRDa). BARRA:CuRDa was built exclusively for cancer research and is composed of 17 handpicked RNA-seq data sets for Homo sapiens that were gathered from the Gene Expression Omnibus using rigorous filtering criteria. All data sets were individually submitted to sample quality analysis, removal of low-quality bases and artifacts from the experimental process, removal of ribosomal RNA, and estimation of transcript-level abundance. Moreover, all data sets were tested using standard approaches in the field, which allows them to be used as benchmarks for new ML approaches. A feature selection analysis was also performed on each data set to investigate the biological accuracy of basic techniques. Results include genes already related to their specific tumoral tissue, a large number of long noncoding RNAs, and pseudogenes. BARRA:CuRDa is available at http://sbcb.inf.ufrgs.br/barracurda. |
Database: | OpenAIRE |
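
The description mentions a feature selection analysis on each data set but does not name the techniques used. As a purely illustrative sketch, and not the authors' method, the Python snippet below ranks genes by the absolute value of Welch's t statistic between two sample classes. The function names (`welch_t`, `rank_features`), the gene names, and the expression values are all hypothetical and are not taken from BARRA:CuRDa.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two samples (each needs at least 2 values)."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def rank_features(expression, labels, positive="tumor"):
    """Rank genes by |Welch t| between the positive class and the rest.

    expression: dict mapping gene name -> list of per-sample values
    labels: list of class labels, one per sample, in the same order
    """
    scores = {}
    for gene, values in expression.items():
        pos = [v for v, y in zip(values, labels) if y == positive]
        neg = [v for v, y in zip(values, labels) if y != positive]
        scores[gene] = abs(welch_t(pos, neg))
    # Highest score first: the genes that best separate the two classes
    return sorted(scores, key=scores.get, reverse=True)


# Toy, made-up values for illustration only (assumption, not real data)
labels = ["tumor", "tumor", "tumor", "normal", "normal", "normal"]
expression = {
    "GENE_A": [9.1, 8.7, 9.4, 2.0, 2.3, 1.8],  # clearly different between classes
    "GENE_B": [5.0, 5.2, 4.9, 5.1, 5.0, 4.8],  # roughly constant
    "GENE_C": [3.0, 6.5, 4.1, 3.9, 5.8, 4.4],  # noisy, weak difference
}
ranking = rank_features(expression, labels)
print(ranking)
```

On real transcript-level abundance tables, a ranking like this would typically be computed inside a cross-validation loop so that the selected genes are not chosen using the held-out samples.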