Popis: |
The entire body of chemical-synthetic knowledge created from the days of Lavoisier to the present can be represented as a complex network (Figure 1a) comprising millions of compounds and reactions. While it is simply beyond the cognition of any individual human to understand and analyze all of this collective chemical knowledge, modern computers have become powerful enough to perform suitable network analyses within reasonable timescales. In this context, a problem that is both fundamentally interesting and practically important is the identification of optimal synthetic pathways leading to desired, known molecules from commercially available substrates. In both manual searches and semiautomated search tools, such as Reaxys, this is done by back-tracking the possible syntheses step by step. Such “manual” methods, however, give virtually no chance of finding an optimal pathway, as the number of possible syntheses to consider is very large (for example, ca. 10 within five steps). Moreover, the problem becomes dramatically more complex when one aims to optimize the syntheses of multiple substances simultaneously, as when, for example, a company producing N products strives to design synthetic pathways that share many common substrates/intermediates and minimize the overall synthetic cost (Figure 1a). As we show herein, however, a judicious combination of combinatorial optimization with network search algorithms allows the parallel optimization of tens to thousands of syntheses. The algorithms we describe traverse the network of organic chemistry (henceforth, NOC or simply the network), probing different synthetic paths according to a cost criterion defined as a combination of labor cost and the cost of starting materials. In a specific case study, we show that our optimization can reduce the synthetic costs of an existing company (here, ProChimia Surfaces) by almost 50%.
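To make the cost criterion concrete, the following is a minimal sketch, not the algorithm used in this work. It assumes a small acyclic toy network, a flat labor cost per reaction step, and unit quantities with no yields. The substance names, prices, and the helper `min_cost` are all hypothetical. Each substance is either bought at its price or made by the cheapest known reaction, whose cost is the labor cost plus the cost of obtaining every reactant.

```python
from functools import lru_cache

# Hypothetical toy data (arbitrary cost units); not taken from the NOC.
PRICES = {"A": 1.0, "B": 2.0, "C": 10.0, "T": 50.0}  # purchasable substances
REACTIONS = {                # product -> list of alternative reactant sets
    "C": [("A", "B")],
    "T": [("C", "A")],
}
LABOR_COST = 3.0             # assumed flat labor cost per reaction step


@lru_cache(maxsize=None)
def min_cost(substance):
    """Cheapest way to obtain `substance`: buy it or make it (acyclic network only)."""
    options = []
    if substance in PRICES:
        options.append(PRICES[substance])          # option 1: purchase
    for reactants in REACTIONS.get(substance, []):
        # option 2: run a reaction; every reactant must itself be obtained
        options.append(LABOR_COST + sum(min_cost(r) for r in reactants))
    if not options:
        raise ValueError(f"{substance} can be neither bought nor synthesized")
    return min(options)


print(min_cost("C"))  # 6.0: making C from A and B beats buying it for 10.0
print(min_cost("T"))  # 10.0: two-step synthesis beats the purchase price of 50.0
```

This recursion prices each target in isolation. It does not capture the savings from intermediates shared among several targets, which is the harder, combinatorial part of the problem described above.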
Overall, this communication is the first instance in which synthetic optimizations are based on the entire body of synthetic knowledge as stored in the NOC and combined with economic descriptors (that is, prices). While each of the individual reactions in the NOC is known, the network search algorithms create new chemical knowledge in the form of near-optimal reaction sequences; notably, the syntheses that are optimal for making any molecule individually can be different from those optimizing the synthesis of this and other molecules simultaneously. Our analyses are based on a network of about 7 million reactions and about 7 million substances derived as described in the first communication in this series (also see Refs. [1, 2]). While in our earlier analyses of the NOC the simple dot–arrow representation was typically sufficient, the analysis of specific syntheses involving multiple substrates and/or products requires the so-called bipartite-graph representation with two types of nodes: those corresponding to specific substances (blue dots in Figure 1b), and those representing the reactions (black dots in Figure 1b). This representation of the NOC captures the causal synthetic dependencies and accounts for the fact that a viable synthesis (see the Supporting Information, Section 2) cannot proceed without all of the necessary reactants, which must either be synthesized by another suitable reaction or purchased. Also, as our network searches are intended to compare the actual costs of syntheses, we have linked the NOC to a test
Figure 1. The network of organic chemistry and its bipartite wiring plan. a) Small fraction of the network (ca. 0.025%) centered on six target compounds (red). Computational methods described herein allow for the identification of near-optimal synthesis plans (inset) despite the size and complexity of the network. b) Illustration of the mapping from a list of chemical reactions to a directed, bipartite network. |
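The bipartite representation described above can be sketched as follows. This is an illustrative toy, not the data structure used in this work: the reaction list, the node labels, and the helper names `build_bipartite` and `reachable` are hypothetical. Substance nodes point to the reactions that consume them, and reaction nodes point to their products. A simple forward pass then enforces the viability rule that a reaction can proceed only when all of its reactants are available.

```python
from collections import defaultdict


def build_bipartite(reactions):
    """Map a list of (reactants, products) pairs to a directed bipartite graph.

    Nodes are ("substance", name) or ("reaction", index); edges run
    substance -> reaction (consumed by) and reaction -> substance (produces).
    """
    edges = defaultdict(set)
    for i, (reactants, products) in enumerate(reactions):
        rnode = ("reaction", i)
        for s in reactants:
            edges[("substance", s)].add(rnode)
        for p in products:
            edges[rnode].add(("substance", p))
    return edges


def reachable(reactions, purchasable):
    """Substances obtainable when a reaction fires only if ALL reactants are available."""
    available = set(purchasable)
    changed = True
    while changed:                      # repeat until no reaction adds anything new
        changed = False
        for reactants, products in reactions:
            if set(reactants) <= available and not set(products) <= available:
                available |= set(products)
                changed = True
    return available


# Hypothetical reaction list: A + B -> C, C + D -> E, C + A -> F
toy_reactions = [(("A", "B"), ("C",)), (("C", "D"), ("E",)), (("C", "A"), ("F",))]
graph = build_bipartite(toy_reactions)
obtainable = reachable(toy_reactions, purchasable={"A", "B"})
print(sorted(obtainable))  # E is missing because D can be neither bought nor made
```

A plain substance-to-substance graph would link C to E directly and hide the fact that E also requires D. Keeping reactions as their own node type makes this all-reactants constraint explicit.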