UnCorrupt SMILES: a novel approach to de novo design

Authors: Linde Schoenmaker, Olivier Béquignon, Willem Jespers, Gerard van Westen
Year of publication: 2022
DOI: 10.26434/chemrxiv-2022-x3zng
Description: Generative deep learning models have emerged as a powerful approach for de novo drug design, as they aid researchers in finding new molecules with desired properties. Despite continuous improvements in the field, a subset of the outputs that sequence-based de novo generators produce cannot be progressed due to errors. Here, we propose to fix these invalid outputs post hoc. Transformer models from the field of natural language processing have been shown to be very effective in similar tasks. We therefore trained this type of model to translate invalid Simplified Molecular-Input Line-Entry System (SMILES) strings into valid representations. The performance of this SMILES corrector was evaluated on four representative methods of de novo generation: a recurrent neural network (RNN), a target-directed RNN, a generative adversarial network (GAN), and a variational autoencoder (VAE). We found that the percentage of invalid outputs from these specific generative models ranges between 4 and 89 %, with different models having different error type distributions. Post hoc correction of SMILES increases model validity, with the SMILES corrector fixing 35 to 80 % of invalid model outputs. While corrector models trained with one error per input sequence alter 60 to 90 % of invalid inputs, higher performance was obtained for transformer models trained with multiple errors per input. In this case, the best model was able to correct 60 to 95 % of invalid generator outputs. Further analysis showed that these fixed molecules are comparable to the correct molecules from the de novo generators with regard to novelty and similarity. Additionally, the SMILES corrector can be used to expand the number of interesting new molecules within the targeted chemical space. Introducing different errors into existing molecules yields novel analogs with a uniqueness of 39 % and a novelty of approximately 20 %.
The results of this research demonstrate that SMILES correction is a viable post hoc extension and can enhance the search for better drug candidates.
Database: OpenAIRE
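
Illustrative sketch (not part of the record): the description mentions training a corrector on SMILES with introduced errors and reporting the percentage of invalid outputs. The Python below is a minimal, standard-library-only illustration of that idea, not the authors' implementation. The function names `has_balanced_syntax`, `introduce_error`, `make_training_pairs`, and `validity_rate` are hypothetical, the single error type (deleting one parenthesis or ring-closure digit) is an assumption, and the check is a crude syntactic one that ignores valence, aromaticity, and other chemical rules. A real pipeline would typically rely on a cheminformatics toolkit for validity checking.

```python
import random


def has_balanced_syntax(smiles: str) -> bool:
    """Crude syntax check: balanced (), balanced [], paired ring-closure digits.

    Assumption: this stands in for a real validity check; it does not
    test valence or aromaticity, so it accepts many invalid molecules.
    """
    depth = 0            # current nesting depth of branch parentheses
    open_rings = set()   # ring-closure digits seen an odd number of times
    in_bracket = False   # True while inside a [...] atom specification
    for ch in smiles:
        if ch == "[":
            if in_bracket:
                return False
            in_bracket = True
        elif ch == "]":
            if not in_bracket:
                return False
            in_bracket = False
        elif in_bracket:
            continue  # digits inside brackets are isotopes, charges, H counts
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch.isdigit():
            open_rings ^= {ch}  # toggle: opening adds the digit, closing removes it
    return depth == 0 and not open_rings and not in_bracket


def introduce_error(smiles: str, rng: random.Random) -> str:
    """Delete one randomly chosen parenthesis or digit (hypothetical error type)."""
    positions = [i for i, ch in enumerate(smiles) if ch in "()" or ch.isdigit()]
    if not positions:
        return smiles  # nothing to corrupt in this string
    i = rng.choice(positions)
    return smiles[:i] + smiles[i + 1:]


def make_training_pairs(valid_smiles, seed=0):
    """Build (corrupted, original) pairs, one introduced error per input."""
    rng = random.Random(seed)
    return [(introduce_error(s, rng), s) for s in valid_smiles]


def validity_rate(smiles_list) -> float:
    """Percentage of strings that pass the crude syntax check."""
    passed = sum(has_balanced_syntax(s) for s in smiles_list)
    return 100.0 * passed / len(smiles_list)


if __name__ == "__main__":
    molecules = ["c1ccccc1", "CC(=O)O", "CC(C)Cc1ccc(cc1)C(C)C(=O)O"]
    for corrupted, original in make_training_pairs(molecules, seed=1):
        print(f"{corrupted!r} -> {original!r}")
```

In this sketch, each pair maps a corrupted string to its original, mirroring the invalid-to-valid translation framing in the description. Applying `introduce_error` several times per string would approximate the multiple-errors-per-input setting.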