Development and Application of a Data-Driven Reaction Classification Model: Comparison of an Electronic Lab Notebook and Medicinal Chemistry Literature.

Authors: Ghiandoni GM; Information School, University of Sheffield, Regent Court, 211 Portobello, Sheffield S1 4DP, United Kingdom., Bodkin MJ; Evotec (U.K.) Ltd., 114 Innovation Drive, Milton Park, Abingdon OX14 4RZ, United Kingdom., Chen B; Chemistry Department, University of Sheffield, Dainton Building, Brook Hill, Sheffield S3 7HF, United Kingdom., Hristozov D; Evotec (U.K.) Ltd., 114 Innovation Drive, Milton Park, Abingdon OX14 4RZ, United Kingdom., Wallace JEA; Evotec (U.K.) Ltd., 114 Innovation Drive, Milton Park, Abingdon OX14 4RZ, United Kingdom., Webster J; Information School, University of Sheffield, Regent Court, 211 Portobello, Sheffield S1 4DP, United Kingdom., Gillet VJ; Information School, University of Sheffield, Regent Court, 211 Portobello, Sheffield S1 4DP, United Kingdom.
Language: English
Source: Journal of chemical information and modeling [J Chem Inf Model] 2019 Oct 28; Vol. 59 (10), pp. 4167-4187. Date of Electronic Publication: 2019 Sep 26.
DOI: 10.1021/acs.jcim.9b00537
Abstract: Reaction classification has often been considered an important task for many different applications and has traditionally been accomplished using hand-coded rule-based approaches. However, the availability of large collections of reactions enables data-driven approaches to be developed. We present the development and validation of a 336-class machine learning-based classification model integrated within a Conformal Prediction (CP) framework to associate reaction class predictions with confidence estimations. We also propose a data-driven approach for "dynamic" reaction fingerprinting to maximize the effectiveness of reaction encoding, and we develop a novel reaction classification system that organizes labels into four hierarchical levels (SHREC: Sheffield Hierarchical REaction Classification). We show that the performance of the CP-augmented model can be improved by defining confidence thresholds to detect predictions that are less likely to be false. For example, the external validation of the model reports 95% of predictions as correct by filtering out less than 15% of the uncertain classifications. The application of the model is demonstrated by classifying two reaction data sets: one extracted from an industrial ELN and the other from the medicinal chemistry literature. We show how confidence estimations and class compositions across different levels of information can be used to gain immediate insights into the nature of reaction collections and hidden relationships between reaction classes.
Database: MEDLINE
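
The abstract describes attaching confidence estimates to class predictions through Conformal Prediction and then filtering on a confidence threshold. The following is a minimal sketch of that general idea, assuming a plain inductive conformal predictor on top of class probabilities from any classifier. It is not the authors' implementation: the nonconformity measure (one minus the class probability), all function names, and the class labels are illustrative assumptions.

```python
from typing import Dict, List, Sequence, Tuple

Prediction = Tuple[str, float, float]  # (label, confidence, credibility)


def nonconformity(prob_of_class: float) -> float:
    """Assumed nonconformity measure: a low class probability is 'strange'."""
    return 1.0 - prob_of_class


def calibrate(cal_probs: Sequence[Dict[str, float]],
              cal_labels: Sequence[str]) -> List[float]:
    """Score a held-out calibration set using the probability of its true labels."""
    return [nonconformity(p[y]) for p, y in zip(cal_probs, cal_labels)]


def p_values(probs: Dict[str, float],
             cal_scores: Sequence[float]) -> Dict[str, float]:
    """For each candidate class, the fraction of calibration examples
    that are at least as nonconforming as the test example."""
    n = len(cal_scores)
    result = {}
    for label, p in probs.items():
        alpha = nonconformity(p)
        at_least_as_strange = sum(1 for s in cal_scores if s >= alpha)
        result[label] = (at_least_as_strange + 1) / (n + 1)
    return result


def predict_with_confidence(probs: Dict[str, float],
                            cal_scores: Sequence[float]) -> Prediction:
    """Return the class with the largest p-value, with
    confidence = 1 - second largest p-value and credibility = largest p-value."""
    ranked = sorted(p_values(probs, cal_scores).items(),
                    key=lambda kv: kv[1], reverse=True)
    best_label, best_p = ranked[0]
    second_p = ranked[1][1] if len(ranked) > 1 else 0.0
    return best_label, 1.0 - second_p, best_p


def filter_confident(predictions: Sequence[Prediction],
                     threshold: float) -> List[Prediction]:
    """Keep only predictions whose confidence reaches the threshold."""
    return [p for p in predictions if p[1] >= threshold]


if __name__ == "__main__":
    # Hypothetical two-class toy data; the labels are placeholders.
    cal_probs = [
        {"amide_coupling": 0.95, "suzuki_coupling": 0.05},
        {"amide_coupling": 0.20, "suzuki_coupling": 0.80},
        {"amide_coupling": 0.50, "suzuki_coupling": 0.50},
        {"amide_coupling": 0.60, "suzuki_coupling": 0.40},
    ]
    cal_labels = ["amide_coupling", "suzuki_coupling",
                  "amide_coupling", "suzuki_coupling"]
    scores = calibrate(cal_probs, cal_labels)
    preds = [
        predict_with_confidence({"amide_coupling": 0.90, "suzuki_coupling": 0.10}, scores),
        predict_with_confidence({"amide_coupling": 0.55, "suzuki_coupling": 0.45}, scores),
    ]
    print(filter_confident(preds, threshold=0.7))
```

Raising the threshold discards more of the uncertain predictions in exchange for a higher share of correct ones among those kept, which is the trade-off the abstract reports for the external validation set.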