On the use of real-world datasets for reaction yield prediction.
Authors: | Saebi M; Department of Computer Science and Engineering and Lucy Family Institute for Data and Society, University of Notre Dame Notre Dame IN 46556 USA nchawla@nd.edu., Nan B; Department of Chemistry and Biochemistry, University of Notre Dame Notre Dame IN 46556 USA owiest@nd.edu., Herr JE; Department of Chemistry and Biochemistry, University of Notre Dame Notre Dame IN 46556 USA owiest@nd.edu., Wahlers J; Department of Chemistry and Biochemistry, University of Notre Dame Notre Dame IN 46556 USA owiest@nd.edu., Guo Z; Department of Computer Science and Engineering and Lucy Family Institute for Data and Society, University of Notre Dame Notre Dame IN 46556 USA nchawla@nd.edu., Zurański AM; Department of Chemistry, Princeton University Princeton New Jersey 08544 USA., Kogej T; Molecular AI, Discovery Sciences, R&D, AstraZeneca Pepparedsleden 1, SE-431 83 Mölndal Gothenburg Sweden., Norrby PO; Data Science and Modelling, Pharmaceutical Sciences, R&D, AstraZeneca Pepparedsleden 1, SE-431 83 Mölndal Gothenburg Sweden., Doyle AG; Department of Chemistry, Princeton University Princeton New Jersey 08544 USA.; Department of Chemistry and Biochemistry, University of California Los Angeles California 90095 USA., Chawla NV; Department of Computer Science and Engineering and Lucy Family Institute for Data and Society, University of Notre Dame Notre Dame IN 46556 USA nchawla@nd.edu., Wiest O; Department of Chemistry and Biochemistry, University of Notre Dame Notre Dame IN 46556 USA owiest@nd.edu. |
---|---|
Language: | English |
Source: | Chemical science [Chem Sci] 2023 Mar 13; Vol. 14 (19), pp. 4997-5005. Date of Electronic Publication: 2023 Mar 13 (Print Publication: 2023). |
DOI: | 10.1039/d2sc06041h |
Abstract: | The lack of publicly available, large, and unbiased datasets is a key bottleneck for the application of machine learning (ML) methods in synthetic chemistry. Data from electronic laboratory notebooks (ELNs) could provide less biased, large datasets, but no such datasets have been made publicly available. The first real-world dataset from the ELNs of a large pharmaceutical company is disclosed and its relationship to high-throughput experimentation (HTE) datasets is described. For chemical yield predictions, a key task in chemical synthesis, an attributed graph neural network (AGNN) performs as well as or better than the best previous models on two HTE datasets for the Suzuki-Miyaura and Buchwald-Hartwig reactions. However, training the AGNN on an ELN dataset does not lead to a predictive model. The implications of using ELN data for training ML-based models are discussed in the context of yield predictions. Competing Interests: There are no conflicts to declare. (This journal is © The Royal Society of Chemistry.) |
Database: | MEDLINE |