Popis: |
Background: The emerging discipline of data science demands reproducibility, which is vital in science and remains a significant challenge in high-throughput genomics. To further complicate matters, large and complex projects require collaboration by multiple investigators who examine and analyze massive data sets from multiple genomic modalities from different perspectives. Today, researchers are rarely able to reproduce published genomic studies, for reasons that include: i) differences between the software versions used, ii) lack of detail regarding software parameters, iii) lack of data access, and iv) unavailable source code. Here we combine our data infrastructure approach with a molecular infrastructure and apply it to the exploration of a multimodality genomic analysis of a patient with a pulmonary pneumocytoma. Methods: Open-source tools are used, including the SQLite database together with R and Python packages and custom code. Results: Our approach is able to generate a wide variety of plots and tables for the purposes of exploratory data analysis (EDA) and/or other user-specific analyses, such as finding differentially expressed genes (DEGs). In addition to traditional EDA plots, the R library RCircos is used to visualize multiple NGS studies (e.g., 7) in a single plot. Differential gene expression (DGE) analysis takes normalized RNA-based read count data and performs a statistical analysis to find quantitative changes in expression levels between different experimental groups. A DGE analysis report is routinely generated. An abbreviated copy number variation (CNV) report derived from an ultra-low-pass whole-genome (tumor/germline) NGS approach is also generated. A Python/Jupyter Notebook utilizing a library from scikit-learn is used to generate a clustergram plot. This approach is used as part of finding the optimal number of clusters for a K-Means analysis.
RNA-seq data normalized across three sample types using DESeq2 were used in this example. Finally, advanced pathway analysis is performed for the identification of activated and deactivated molecular pathways. Conclusion: The next evolution in oncology research and cancer care is being driven by data science. In the field of genomic data science, accuracy and reproducibility remain a considerable challenge due to the sheer size, complexity, and dynamic nature of the experimental data, plus the relative inventiveness of the quantitative biology approaches. The accuracy and reproducibility challenge does not just block the path to new scientific discoveries; more importantly, it may lead to a scenario where critical findings used for medical decision-making are found to be incorrect. Our approach has been developed to address the unmet need to improve accuracy and reproducibility in genomic data science. Specific findings related to the rare pneumocytoma tumor will be presented. Citation Format: Li Ma, Erich Peterson, Mathew Steliga, Jason Muesse, Katy Marino, Konstantinos Arnaoutakis, Ikjae Shin, Donald J. Johann. Applying reproducible genomic data science methods for the analysis of a rare tumor type [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2022; 2022 Apr 8-13. Philadelphia (PA): AACR; Cancer Res 2022;82(12_Suppl):Abstract nr 5038.
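
The Background names unrecorded software versions and parameters as two common barriers to reproducing genomic studies, and the Methods mention an SQLite database used with Python. As a minimal sketch of how those two points could meet, the Python example below logs the tool, version, and parameters of each analysis step in an SQLite table. It is an illustration under stated assumptions, not the authors' implementation: the table `analysis_run`, its columns, the helper functions, the version string `x.y.z`, and the `alpha` parameter are all hypothetical.

```python
import json
import platform
import sqlite3


def open_provenance_db(path=":memory:"):
    """Open (or create) an SQLite database holding a hypothetical provenance table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS analysis_run (
            run_id         INTEGER PRIMARY KEY,
            step           TEXT NOT NULL,  -- e.g. "dge", "cnv"
            tool           TEXT NOT NULL,  -- software used for the step
            tool_version   TEXT NOT NULL,  -- exact version string
            parameters     TEXT NOT NULL,  -- JSON-encoded parameter dict
            python_version TEXT NOT NULL   -- interpreter that logged the run
        )
        """
    )
    return conn


def record_run(conn, step, tool, tool_version, parameters):
    """Insert one analysis run and return its row id."""
    cur = conn.execute(
        "INSERT INTO analysis_run "
        "(step, tool, tool_version, parameters, python_version) "
        "VALUES (?, ?, ?, ?, ?)",
        (
            step,
            tool,
            tool_version,
            json.dumps(parameters, sort_keys=True),  # stable, diff-friendly text
            platform.python_version(),
        ),
    )
    conn.commit()
    return cur.lastrowid


def runs_for_step(conn, step):
    """Return (tool, version, parameters) for every logged run of a step."""
    rows = conn.execute(
        "SELECT tool, tool_version, parameters FROM analysis_run "
        "WHERE step = ? ORDER BY run_id",
        (step,),
    ).fetchall()
    return [(tool, version, json.loads(params)) for tool, version, params in rows]


# Illustrative usage: the version and parameter values are placeholders.
conn = open_provenance_db()
run_id = record_run(conn, "dge", "DESeq2", "x.y.z", {"alpha": 0.05})
print(runs_for_step(conn, "dge"))
```

Storing the parameters as sorted JSON text keeps each record human-readable and easy to compare between runs, and a file path can be passed in place of the in-memory default to persist the log alongside the analysis outputs.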