Author: |
Manivannan, Manimozhi, Flynn, James, Sahu, Sombeet, Wang, Shu, Kim, Dong, Gulati, Saurabh, Parikh, Saurabh, Girard, Isabelle, Martin, Roy |
Language: |
English |
Publication year: |
2020 |
Subject: |
|
Source: |
J Biomol Tech |
Description: |
With advances in single-cell sequencing technologies, it is now possible to interrogate thousands of cells in a single experiment to study genetic variability. Single-cell DNA platforms such as Tapestri are susceptible to errors, primarily from PCR and sequencing, at rates ranging from 0.5% to 2%. This makes variant calling and minimal residual disease (MRD) detection challenging. To address these challenges, we developed a novel consensus sequence-based method that corrects errors, reduces false-positive rates, and predicts true variants. First, we build a consensus sequence from several reads to predict the correct sequence. The initial layers of the network learn motifs and local sequence contexts to classify the patterns. The output of this network is a probability distribution over the possible bases, and the prediction is the base with the highest probability. The bases in the reads are then corrected to the base predicted by this first-step model. After error-correcting the reads, we feed the variants called by the Genome Analysis Toolkit into a multi-class classifier network. Our features consist of the percentage of cells mutated and genotype features including the depth, allele frequency (AF), and quality of each variant in these cells. The truth labels are generated on the Tapestri instrument from multiple experiments with known truth. We trained the network on over 200k cells from 13 samples and tested it on a larger set of samples. Class imbalance was handled by upsampling the truth data. Our training samples are diverse: they include cell mixtures at various dilutions down to 0.1% and clinical samples, processed on the Tapestri instrument and sequenced on a range of sequencers including MiSeq and NovaSeq. With our two-step error correction and variant prediction model, we improved our median positive predictive value (PPV) 2- to 3-fold at a 0.5% limit of detection (LOD). This will enable researchers to find rare subclones when characterizing MRD. |
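The first step described above (a probability distribution over bases, prediction by highest probability, then correction of the reads) can be illustrated with a minimal sketch. It is not the authors' model: the neural network is replaced here by plain per-position base frequencies, and the function names, the equal-length pre-aligned reads, and the way reads are grouped are all assumptions made for illustration.

```python
from collections import Counter

BASES = "ACGT"


def base_distribution(column):
    """Empirical probability of each base in one alignment column.

    Stand-in for the network output; characters outside ACGT are ignored.
    """
    counts = Counter(b for b in column if b in BASES)
    total = sum(counts.values())
    if total == 0:
        return {b: 0.0 for b in BASES}
    return {b: counts[b] / total for b in BASES}


def consensus(reads):
    """Choose the most probable base at each position.

    Assumes the reads are already aligned and of equal length.
    Ties are broken by the order of BASES.
    """
    result = []
    for column in zip(*reads):
        dist = base_distribution(column)
        result.append(max(BASES, key=lambda b: dist[b]))
    return "".join(result)


def correct_reads(reads):
    """Replace each read with the consensus sequence (hypothetical helper)."""
    cons = consensus(reads)
    return [cons for _ in reads]


if __name__ == "__main__":
    group = ["ACGT", "ACGA", "ACGT"]  # toy reads, one with an error at the last base
    print(consensus(group))
    print(correct_reads(group))
```

In this toy version every read in a group collapses to the same sequence, so the choice of which reads are grouped together determines whether a real low-frequency variant survives correction; the abstract does not specify that grouping.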
Database: |
OpenAIRE |
External link: |
|