Popis: |
Cloud and aerosol interactions remain a large source of uncertainty in current climate models (IPCC), which makes them of special interest for atmospheric science. It is estimated that more than 70% of all cloud condensation nuclei originate from so-called New Particle Formation, the process in which gaseous precursors cluster together in the atmosphere and subsequently grow into particles and aerosols. After the initial clustering, this growth is driven strongly by the condensation of low-volatility organic compounds (LVOC), that is, molecules with saturation vapor pressures (pSat) below 10^-6 mbar [1]. These originate from organic molecules that are emitted by vegetation and then rapidly oxidized in the air, so-called Biogenic LVOC (BLVOC). We have created a large data set of BLVOC using high-throughput computing and Density Functional Theory (DFT), and use it to train Machine Learning models that predict the pSat of previously unseen BLVOC. Figure 1 illustrates some sample molecules from the data.
Figure 1: Sample molecules of small, medium, and large size.
Figure 2: Histogram of the calculated saturation vapor pressures.
Initially, the chemical mechanism GECKO-A provides possible BLVOC molecules in the form of SMILES strings. In a first step, the COSMOconf program finds and optimizes the structures of possible conformers and provides their liquid-phase energies at the DFT level of theory. After an additional calculation of the gas-phase energies with Turbomole, COSMOtherm calculates thermodynamic properties, such as the pSat, using the COSMO-RS model [2]. We combined all these computations into a highly parallelised high-throughput workflow to calculate 32k BLVOC, which comprise over 7 million molecular conformers. Figure 2 shows a histogram of the calculated pSat.
We use the calculated pSat to train a Gaussian Process Regression (GPR) machine learning model with the Topological Fingerprint as the descriptor for molecular structures.
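To make the regression step concrete, the following is a minimal sketch of Gaussian Process Regression on fingerprint-like descriptors, written in plain Python. It is not the model used in this work. The Tanimoto kernel, the noise level, the representation of a fingerprint as a set of on-bit indices, the helper names (`gpr_fit`, `gpr_predict`), and the toy log10(pSat) values are all illustrative assumptions. In practice the fingerprints would be computed from SMILES strings with a cheminformatics toolkit.

```python
import math

def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def solve_lower(L, b):
    """Forward substitution: solve L y = b."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y

def solve_upper_t(L, y):
    """Back substitution: solve L^T x = y."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gpr_fit(fps, y, noise=1e-2):
    """Fit a GP with a Tanimoto kernel; `noise` is an assumed noise variance."""
    mean = sum(y) / len(y)  # constant prior mean taken from the training targets
    K = [[tanimoto(a, b) + (noise if i == j else 0.0)
          for j, b in enumerate(fps)] for i, a in enumerate(fps)]
    L = cholesky(K)
    alpha = solve_upper_t(L, solve_lower(L, [v - mean for v in y]))
    return {"fps": fps, "L": L, "alpha": alpha, "mean": mean}

def gpr_predict(model, fp):
    """Return the predictive mean and standard deviation for one fingerprint."""
    k = [tanimoto(fp, a) for a in model["fps"]]
    mu = model["mean"] + sum(ki * ai for ki, ai in zip(k, model["alpha"]))
    v = solve_lower(model["L"], k)
    var = max(tanimoto(fp, fp) - sum(vi * vi for vi in v), 0.0)
    return mu, math.sqrt(var)

# Toy data (hypothetical): on-bit index sets and made-up log10(pSat / mbar) values.
train_fps = [{1, 2, 3}, {2, 3, 4}, {1, 5, 6, 7}, {5, 6, 7, 8}]
train_y = [-4.0, -5.0, -8.0, -9.0]
model = gpr_fit(train_fps, train_y)

for name, fp in [("seen", {1, 2, 3}), ("similar", {2, 3, 5}), ("unrelated", {20, 21})]:
    mu, sd = gpr_predict(model, fp)
    print(f"{name}: predicted log10(pSat) = {mu:.2f} +/- {sd:.2f}")
```

In this sketch, a fingerprint that shares no bits with any training molecule falls back to the prior mean with the full prior standard deviation, while a molecule already in the training set gets a much smaller uncertainty.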
The GPR accounts for noise in the data and outputs an uncertainty for each pSat prediction. These uncertainties, together with data clustering techniques, allow us to actively choose which molecules to include in the training data, so-called Active Learning. Further, we explore the SLISEMAP [3] explainable AI method to correlate the Machine Learning predictions and the high-dimensional descriptors with human-readable properties, such as functional groups.
[1] Metzger, A. et al. Evidence for the role of organics in aerosol particle formation under atmospheric conditions. Proc. Natl. Acad. Sci. 107, 6646–6651 (2010). https://doi.org/10.1073/pnas.0911330107
[2] Klamt, A. & Schüürmann, G. COSMO: a new approach to dielectric screening in solvents with explicit expressions for the screening energy and its gradient. J. Chem. Soc., Perkin Trans. 2, 799–805 (1993). https://doi.org/10.1039/P29930000799
[3] Björklund, A., Mäkelä, J. & Puolamäki, K. SLISEMAP: supervised dimensionality reduction through local explanations. Mach. Learn. (2022). https://doi.org/10.1007/s10994-022-06261-1 |