Machine Learning Random Forest for predicting onco-somatic variants NGS analysis

Authors: Eric Pellegrino, Coralie Jacques, Nathalie Beaufils, Isabelle Nanni, Antoine Carlioz, Philippe Metellus, L’Houcine Ouafik
Year of publication: 2021
DOI: 10.21203/rs.3.rs-426134/v1
Description: Motivation: Since 2017, we have been using the IonTorrent NGS platform in our hospital to diagnose cancer and select treatment. Analyzing the variants from each run takes us a long time, and we still struggle with some variants that look correct at first glance at their metrics but prove to be negative on closer inspection. Can a Machine Learning algorithm help us classify NGS variant calls? This question led us to investigate which ML algorithm fits our NGS data and to develop a tool that can be used in routine practice to help biologists.
Introduction: One of the challenges of medicine today is processing large amounts of data. This is particularly true in molecular biology, with the advent of Next Generation Sequencing (NGS) for molecular tumor profiling and treatment selection. In addition to bioinformatics pipelines, Artificial Intelligence (AI) can offer valuable help in the analysis. Generating sequencing data from patient DNA samples has become easy in clinical trials. But analyzing the huge amount of genomic or transcriptomic data and extracting the key biomarkers associated with a clinical response to a specific therapy requires a formidable combination of scientific expertise, biomolecular skill and a panel of bioinformatics and biostatistics tools, in which artificial intelligence is now a success factor in developing future routine diagnostics. However, cancer genome complexity and technical artifacts make the identification of real variants a challenge. We present a Machine Learning method to classify pathogenic Single Nucleotide Variants (SNVs), Single Nucleotide Polymorphisms (SNPs), Multiple Nucleotide Variants (MNVs), insertions and deletions detected by NGS in tumor specimens from colorectal cancer, melanoma, lung cancer and glioma.
Methods: We evaluated different machine learning algorithms on our NGS data using 10-fold cross-validation, together with neural networks (Deep Learning), in order to measure their performance and determine which one is a valid model for confirming NGS variant calls in cancer diagnostics. We trained the models on 70% of our data samples, extracted from our local database (each record had 7 parameters: chromosome, position, exon, variant allele frequency, minor allele frequency, coverage and protein description), and validated them on the remaining 30%. The model with the best accuracy was chosen and implemented in the routine NGS analysis. The software was developed in the R scripting language, version 3.6.0.
Results: We trained our model on 102011 variants. The lowest error rate (0.22%) was obtained with Random Forest (ntree=500 and mtry=4), with an AUC of 0.99. Neural networks also achieved good scores: the final trained neural network reached an accuracy of 98% and a ROC-AUC of 0.99 on the validation data. We tested our RF model on more than 2000 variants from our NGS database: 20 variants were misclassified (error rate
Conclusion: Our RF model shows excellent results for onco-somatic NGS interpretation and could easily be implemented in other molecular biology laboratories. AI is taking an increasingly important place in molecular biomedical analysis and could be very helpful in processing large amounts of medical data. Neural networks showed a good capacity for classifying variants and may in the future be useful for predicting more complex variants.
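Illustrative note: the evaluation protocol named in the Methods (a 70/30 hold-out split, 10-fold cross-validation and an error rate) can be sketched as follows. This is a minimal sketch in standard-library Python, not the authors' R code. The function names, the toy data and the two toy features are hypothetical, and the majority-class model is only a stand-in where the abstract's Random Forest (ntree=500, mtry=4) would be fitted.

```python
import random
from collections import Counter

def holdout_split(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of rows and split it into training and validation parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

def k_folds(rows, k=10):
    """Yield (train, test) pairs; every row lands in exactly one test fold."""
    for i in range(k):
        test = rows[i::k]
        train = [row for j, row in enumerate(rows) if j % k != i]
        yield train, test

def error_rate(predict, rows):
    """Fraction of rows whose predicted label differs from the true label."""
    wrong = sum(1 for features, label in rows if predict(features) != label)
    return wrong / len(rows)

def fit_majority(rows):
    """Stand-in model (NOT a Random Forest): always predicts the most common label."""
    most_common = Counter(label for _, label in rows).most_common(1)[0][0]
    return lambda features: most_common

# Hypothetical toy data, not real variant calls:
# features = (variant allele frequency, coverage), label = call status.
rng = random.Random(42)
data = [
    ((rng.random(), rng.randint(10, 2000)),
     "true_variant" if rng.random() < 0.8 else "artifact")
    for _ in range(1000)
]

# 70% for training, 30% held out for validation.
train, valid = holdout_split(data, train_fraction=0.7)

# 10-fold cross-validation on the training part only.
cv_errors = [error_rate(fit_majority(tr), te) for tr, te in k_folds(train, k=10)]
mean_cv_error = sum(cv_errors) / len(cv_errors)

# Final check on the held-out 30%.
holdout_error = error_rate(fit_majority(train), valid)
print(f"mean CV error: {mean_cv_error:.3f}, hold-out error: {holdout_error:.3f}")
```

Any classifier exposing the same fit-then-predict shape could replace `fit_majority`; the split and fold logic would stay unchanged.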
Database: OpenAIRE