A comparison of survival analysis methods for cancer gene expression RNA-Sequencing data
Author: | Chao Wu, Komal S. Rathi, Jessica C. Mar, Jeremy Leipzig, Mahdi Sarmady, Pichai Raman, Aydin Tozeren, Deanne Taylor, Laurence de Torrenté, Samuel E. Zimmerman |
---|---|
Year of publication: | 2019 |
Subject: |
Cancer Research
Percentile; Kaplan-Meier Estimate; Computational biology; Biology; 03 medical and health sciences; 0302 clinical medicine; Neoplasms; Biomarkers, Tumor; Genetics; Humans; Biomarker discovery; Molecular Biology; Categorical variable; Gene; Survival analysis; Proportional Hazards Models; Base Sequence; Receiver operating characteristic; Sequence Analysis, RNA; Proportional hazards model; Gene Expression Profiling; Prognosis; Survival Analysis; 3. Good health; Gene Expression Regulation, Neoplastic; ROC Curve; Data Interpretation, Statistical; 030220 oncology & carcinogenesis; Biomarker (medicine) |
Source: | Cancer Genetics. :1-12 |
ISSN: | 2210-7762 |
DOI: | 10.1016/j.cancergen.2019.04.004 |
Description: | Identifying genetic biomarkers of patient survival remains a major goal of large-scale cancer profiling studies. Using gene expression data to predict the outcome of a patient's tumor makes biomarker discovery a compelling tool for improving patient care. As genomic technologies expand, multiple data types may serve as informative biomarkers, and bioinformatic strategies have evolved around these different applications. For categorical variables such as a gene's mutation status, identifying biomarkers that predict survival time is straightforward. However, for continuous variables like gene expression, the available methods generate highly variable results, and studies on best practices are lacking. We investigated the performance of eight methods that deal specifically with continuous data. K-means, Cox regression, concordance index (C-index), D-index, 25th–75th percentile split, median split, distribution-based splitting, and KaplanScan were applied to four RNA-sequencing (RNA-seq) datasets from The Cancer Genome Atlas (TCGA). The reliability of the eight methods was assessed by splitting each dataset into two groups and comparing the overlap of the results. Gene sets that had been identified from the literature for a specific tumor type served as positive controls to assess the accuracy of each biomarker using receiver operating characteristic (ROC) curves. Artificial RNA-seq data were generated to test the robustness of these methods under fixed levels of gene expression noise. Our results show that methods based on dichotomization tend to perform consistently poorly, while C-index, D-index, and k-means perform well in most settings. Overall, the Cox regression method had the strongest performance based on tests of accuracy, reliability, and robustness. |
Database: | OpenAIRE |
External link: |
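
The description contrasts methods that dichotomize expression (such as the median split) with methods that use the continuous values (such as the concordance index). The sketch below illustrates that contrast on synthetic data only; it is not the authors' pipeline and does not reproduce their results. The function names `concordance_index` and `median_split`, the effect size of 0.8, and the absence of censoring are all assumptions made for this example.

```python
import math
import random


def concordance_index(times, events, scores):
    """Harrell's C-index: among comparable pairs, the fraction in which the
    sample with the higher score has the shorter survival time.
    Tied scores count as half-concordant."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when sample i has an observed event
            # strictly before sample j's recorded time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if scores[i] > scores[j]:
                    concordant += 1.0
                elif scores[i] == scores[j]:
                    concordant += 0.5
    return concordant / comparable


def median_split(values):
    """Dichotomize: label a sample 1 if its value is above the median, else 0."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        median = ordered[mid]
    else:
        median = 0.5 * (ordered[mid - 1] + ordered[mid])
    return [1 if v > median else 0 for v in values]


# Synthetic cohort (assumption): one gene whose expression raises the hazard.
rng = random.Random(0)
expression = [rng.gauss(0.0, 1.0) for _ in range(200)]
# Exponential survival times with rate exp(0.8 * expression).
times = [rng.expovariate(math.exp(0.8 * x)) for x in expression]
events = [1] * len(times)  # every event observed; no censoring in this sketch

c_continuous = concordance_index(times, events, expression)
c_dichotomized = concordance_index(times, events, median_split(expression))

print("C-index, continuous expression:", round(c_continuous, 3))
print("C-index, median-split groups:  ", round(c_dichotomized, 3))
```

In this toy setting the median split discards the ordering of samples within each group, so all within-group pairs become ties and the dichotomized score carries less ranking information than the continuous one. Real RNA-seq cohorts include censored observations, which this sketch deliberately leaves out.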