Machine Learning on DNA-Encoded Library Count Data Using an Uncertainty-Aware Probabilistic Loss Function.

Authors: Lim KS; Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States.; Department of Biology, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States., Reidenbach AG; Chemical Biology and Therapeutics Science Program, Broad Institute, 415 Main Street, Cambridge, Massachusetts 02142, United States., Hua BK; Chemical Biology and Therapeutics Science Program, Broad Institute, 415 Main Street, Cambridge, Massachusetts 02142, United States.; Department of Chemistry and Chemical Biology, Harvard University, 12 Oxford Street, Cambridge, Massachusetts 02138, United States., Mason JW; Chemical Biology and Therapeutics Science Program, Broad Institute, 415 Main Street, Cambridge, Massachusetts 02142, United States.; Novartis Institutes for BioMedical Research, Cambridge, Massachusetts 02139, United States., Gerry CJ; Chemical Biology and Therapeutics Science Program, Broad Institute, 415 Main Street, Cambridge, Massachusetts 02142, United States.; Department of Chemistry and Chemical Biology, Harvard University, 12 Oxford Street, Cambridge, Massachusetts 02138, United States., Clemons PA; Chemical Biology and Therapeutics Science Program, Broad Institute, 415 Main Street, Cambridge, Massachusetts 02142, United States., Coley CW; Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States.; Chemical Biology and Therapeutics Science Program, Broad Institute, 415 Main Street, Cambridge, Massachusetts 02142, United States.; Department of Chemical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States.
Language: English
Source: Journal of chemical information and modeling [J Chem Inf Model] 2022 May 23; Vol. 62 (10), pp. 2316-2331. Date of Electronic Publication: 2022 May 10.
DOI: 10.1021/acs.jcim.2c00041
Abstract: DNA-encoded library (DEL) screening and quantitative structure-activity relationship (QSAR) modeling are two techniques used in drug discovery to find novel small molecules that bind a protein target. Applying QSAR modeling to DEL selection data can facilitate the selection of compounds for off-DNA synthesis and evaluation. Such a combined approach has been done recently by training binary classifiers to learn DEL enrichments of aggregated "disynthons" in order to accommodate the sparse and noisy nature of DEL data. However, a binary classification model cannot distinguish between different levels of enrichment, and information is potentially lost during disynthon aggregation. Here, we demonstrate a regression approach to learning DEL enrichments of individual molecules, using a custom negative-log-likelihood loss function that effectively denoises DEL data and introduces opportunities for visualization of learned structure-activity relationships. Our approach explicitly models the Poisson statistics of the sequencing process used in the DEL experimental workflow under a frequentist view. We illustrate this approach on a DEL dataset of 108,528 compounds screened against carbonic anhydrase IX (CAIX), and a dataset of 5,655,000 compounds screened against soluble epoxide hydrolase (sEH) and SIRT2. Due to the treatment of uncertainty in the data through the negative-log-likelihood loss used during training, the models can ignore low-confidence outliers. While our approach does not demonstrate a benefit for extrapolation to novel structures, we expect our denoising and visualization pipeline to be useful in identifying structure-activity trends and highly enriched pharmacophores in DEL data.
Further, this approach to uncertainty-aware regression modeling is applicable to other sparse or noisy datasets where the nature of stochasticity is known or can be modeled; in particular, the Poisson enrichment ratio metric we use can apply to other settings that compare sequencing count data between two experimental conditions.
Database: MEDLINE
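
Illustrative note: the abstract describes a negative-log-likelihood loss built on Poisson sequencing statistics, but this record does not give the loss itself. The Python sketch below is therefore not the authors' formulation. It swaps in a standard result: if target and control counts are independent Poisson variables whose rates differ by an enrichment ratio R, then, conditional on their sum, the target count is binomial. All function and parameter names (`enrichment_nll`, `k_target`, `n_control`, and so on) are hypothetical, and the library totals `n_target` and `n_control` are assumed to be known sequencing depths.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def enrichment_nll(log_ratio, k_target, k_control, n_target, n_control):
    """Negative log-likelihood of a predicted log enrichment ratio.

    Assumption (not taken from the source record):
      k_target  ~ Poisson(n_target * R * rate)
      k_control ~ Poisson(n_control * rate)
    Conditioning on k_target + k_control gives
      k_target ~ Binomial(k_target + k_control, p),
      p = R * n_target / (R * n_target + n_control).
    The binomial coefficient is dropped because it does not depend on R.
    """
    # logit(p) = log R + log(n_target / n_control)
    logit = log_ratio + math.log(n_target / n_control)
    # -log p = softplus(-logit) and -log(1 - p) = softplus(logit)
    return k_target * softplus(-logit) + k_control * softplus(logit)


def max_likelihood_log_ratio(k_target, k_control, n_target, n_control):
    """Closed-form minimizer of enrichment_nll; needs both counts > 0."""
    return math.log((k_target / n_target) / (k_control / n_control))


def penalty(k_target, k_control, shift=1.0, n_target=1000, n_control=1000):
    """Extra loss paid for predicting `shift` log units above the optimum."""
    best = max_likelihood_log_ratio(k_target, k_control, n_target, n_control)
    return (enrichment_nll(best + shift, k_target, k_control, n_target, n_control)
            - enrichment_nll(best, k_target, k_control, n_target, n_control))


# Same observed ratio, very different evidence:
low_count_penalty = penalty(2, 1)       # sparse, noisy observation
high_count_penalty = penalty(200, 100)  # well-sampled observation
```

Under these assumptions the loss is linear in the counts, so a compound observed with 200 and 100 reads penalizes the same prediction error 100 times more strongly than one observed with 2 and 1 reads. This is one simple way a likelihood-based loss can down-weight low-confidence outliers, as the abstract describes in general terms.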