Using tandem repeat genomic features for cancer signal detection across multiple cancer types
Author: | Victor Solovyev, Sidney Tobias, Edmund Wong, Shun H Yip, Cheuk Ying Tang |
---|---|
Publication year: | 2022 |
Subject: | |
Source: | Journal of Clinical Oncology. 40:e13586-e13586 |
ISSN: | 1527-7755 0732-183X |
Description: | e13586 Background: Next-generation sequencing methods enable the identification of molecular signatures predictive of cancer. Large-scale cancer genomics projects such as The Cancer Genome Atlas (TCGA) have molecularly characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer types. The resulting data provide an opportunity to catalog recurrent genomic aberrations, such as mutations, amplifications, insertions, and deletions, that can be used in machine learning models for detecting multiple cancer types. Previous research has demonstrated significant variability in the stability of tandem repeat sequences between different cancer types, with consequences for gene expression, which is consistent with known oncogenic mechanisms. Methods: Our approach identifies a set of tandem repeats to analyze from whole-exome sequencing data and computes how they differ from the reference genome in each cancer patient's sample. For each tandem repeat sequence (hereafter TRS), we count how often the TRS occurs in its reference state, and how often each of its specific variations (deletions/insertions) occurs, in each cancer patient's exome sequence. These TRS read counts are the features for our cancer type prediction model. By filtering out features of lower significance, we reduce our training and testing datasets from half a million possible features per sample to only thousands of features common to all samples. We then rapidly train multi-class one-vs-all logistic regression models and optimize our feature selection and preprocessing to maximize predictive accuracy. Results: As shown in the table below, our logistic regression model trained on TRS features predicts cancer type with high accuracy in TCGA patient samples (compared with random prediction accuracy, which would be approximately 10%).
Conclusions: This approach lays the groundwork for novel TRS-based genetic tests for early detection and diagnosis of multiple types of cancer. [Table: see text] [Table: see text] |
Database: | OpenAIRE |
External link: |