Developing Gene-Specific Meta-Predictor of Variant Pathogenicity

Authors:	Anna Rychkova, Carlos Bustamante, Carlos Milla, Curt Scharfe, Justin I. Odegaard, Iris Schrijver, Martina I. Lefterova, MyMy C. Buu
Year of publication:	2017
Subject:	0303 health sciences; Stability (learning theory); Contrast (statistics); Computational biology; Biology; computer.software_genre; Genome; DNA sequencing; 3. Good health; Random forest; 03 medical and health sciences; 0302 clinical medicine; Feature (machine learning); Data mining; Set (psychology); Allele frequency; computer; 030217 neurology & neurosurgery; 030304 developmental biology
DOI:	10.1101/115956
Description:	Rapid, accurate, and inexpensive genome sequencing promises to transform medical care. However, a critical hurdle to enabling personalized genomic medicine is predicting the functional impact of novel genomic variation. Various methods for predicting the pathogenicity of missense variants have already been developed. Here we present a new strategy for developing a pathogenicity predictor of improved accuracy by training and applying a supervised machine learning model in a gene-specific manner. Our meta-predictor combines the outputs of various existing predictors, supplements them with an extended set of stability and structural features of the protein as well as its physicochemical properties, and adds allele frequency information from various datasets. We used this supervised gene-specific meta-predictor approach to train the model on the CFTR gene and to predict the pathogenicity of about 1,000 variants of unknown significance that we collected from various publicly available and internal resources. Our CFTR-specific meta-predictor, based on the Random Forest model, performs better than the other machine learning algorithms that we tested and also outperforms other available tools, such as CADD, MutPred, SIFT, and PolyPhen-2. Our predicted pathogenicity probability correlates well with clinical measures of Cystic Fibrosis patients and with experimental functional measures of mutated CFTR proteins. Training the model on one gene, in contrast to taking a genome-wide approach, makes it possible to take into account structural features specific to a particular protein, thus increasing the overall accuracy of the predictor. Collecting data from several separate resources, on the other hand, makes it possible to accumulate allele frequency information, which our approach ranks as the most important feature, for a larger set of variants.
Finally, our predictor will be hosted on the ClinGen Consortium database to make it available to CF researchers and to serve as a feasibility pilot study for other Mendelian diseases.
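The gene-specific meta-predictor idea described in the abstract can be illustrated with a minimal sketch. It assumes scikit-learn and NumPy are available and uses purely synthetic data. The feature names, the toy data generator, and the hyperparameters are hypothetical placeholders and are not taken from the preprint; the authors' actual feature set and training pipeline are not described in this record.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical feature columns for variants of a single gene:
# scores from existing predictors, one stability feature, one allele frequency.
FEATURES = ["cadd_score", "sift_score", "polyphen2_score",
            "stability_ddg", "allele_frequency"]


def make_synthetic_variants(n, pathogenic):
    """Draw n synthetic variants (illustration only, not real CFTR data)."""
    shift = 1.0 if pathogenic else 0.0
    X = rng.normal(loc=shift, scale=1.0, size=(n, len(FEATURES)))
    # Toy assumption: pathogenic variants tend to be rarer in the population.
    upper = 0.001 if pathogenic else 0.05
    X[:, 4] = rng.uniform(0.0, upper, size=n)
    return X


# Labeled training set for one gene: 1 = pathogenic, 0 = benign.
X_train = np.vstack([make_synthetic_variants(100, True),
                     make_synthetic_variants(100, False)])
y_train = np.array([1] * 100 + [0] * 100)

# Random Forest as the meta-model that combines all features.
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Score a few stand-ins for variants of unknown significance.
X_vus = make_synthetic_variants(5, True)
p_pathogenic = model.predict_proba(X_vus)[:, 1]  # column 1 = class label 1

# Impurity-based importance of each input feature.
importances = dict(zip(FEATURES, model.feature_importances_))
```

In this sketch, `p_pathogenic` plays the role of the predicted pathogenicity probability, and `importances` shows how a feature ranking (such as the prominence of allele frequency reported in the abstract) could be read off a fitted model.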
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::448abbd9ef0a9a87299de10b9558a2b6 https://doi.org/10.1101/115956