Did the input data inflation via regression-based levelling of data from various analytical protocols affect the performance of geochemical predictive models?

Authors: Jan Skála, Daniel Žížala, Robert Minařík
Year of publication: 2023
DOI: 10.5194/egusphere-egu23-11283
Description: Geochemical predictive models based on environmental correlations have been shown to return reliable predictions when the covariates’ feature space is representatively covered by samples. It may therefore be efficient to combine data from different sampling campaigns. Classical analytical methods of wet mineralisation of trace elements followed by spectrometric determination are both costly and time-consuming. For trace elements, the most common routine is aqua regia extraction of samples sieved to < 2 mm. In the Czech Republic, however, partial extraction with cold 2 mol/L nitric acid was in common use before the recent practice of aqua regia mineralisation and left rich legacy datasets from long-term soil monitoring. We tested several models (simple linear models, stepwise multiple linear models and generalised additive models) to find a reliable and parsimonious model for recalculating data between the extractions, based on parallel analysis of 6,000 representative soil samples. Since all the regression models left highly spatially autocorrelated residuals, we also tested several spatial autoregressive models, among which geographically weighted regression was found useful. Finally, we tested the predictive models using a quantile regression forest (QRF) model in which the environmental covariates for lithological sources (parent material classification combined with airborne geophysical data) and human-induced sources (night-time lights data, density of mining dumps, density of traffic routes, element deposition rates) were combined with remotely sensed surface characterisation data (Sentinel-2), a multiscale representation of terrain (Gaussian pyramids), and the spatial autoregressive structure of the target features (quantile-based buffer distances).
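The simplest of the levelling models named above, a simple linear model fitted on paired analyses, can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the data are synthetic, the helper names `fit_levelling` and `level` are hypothetical, and the abstract does not say whether concentrations were transformed before fitting.

```python
import numpy as np


def fit_levelling(hno3, aqua_regia):
    """Fit aqua_regia = intercept + slope * hno3 by ordinary least squares.

    Returns the slope, the intercept and R^2 as a goodness-of-fit measure.
    """
    x = np.asarray(hno3, dtype=float)
    y = np.asarray(aqua_regia, dtype=float)
    # np.polyfit returns coefficients from the highest degree down.
    slope, intercept = np.polyfit(x, y, deg=1)
    predicted = intercept + slope * x
    ss_res = np.sum((y - predicted) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r_squared = 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared


def level(hno3, slope, intercept):
    """Recalculate nitric-acid results to aqua regia equivalents."""
    return intercept + slope * np.asarray(hno3, dtype=float)


# Synthetic paired analyses (assumed values, for illustration only).
rng = np.random.default_rng(0)
hno3_paired = rng.uniform(5.0, 50.0, size=200)
aqua_regia_paired = 3.0 + 1.8 * hno3_paired + rng.normal(0.0, 2.0, size=200)

slope, intercept, r_squared = fit_levelling(hno3_paired, aqua_regia_paired)

# Legacy samples measured only with the nitric acid extraction.
legacy_hno3 = np.array([10.0, 25.0, 40.0])
legacy_levelled = level(legacy_hno3, slope, intercept)

print(f"slope={slope:.3f} intercept={intercept:.3f} R2={r_squared:.3f}")
print(legacy_levelled)
```

A value such as `r_squared` is one example of a goodness-of-fit criterion that could accompany the recalculated samples as a weight; which criteria were actually used is not stated in the abstract.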
We trained several QRF models at high resolution (20 x 20 m) to examine the effects of using true measured data (3,300 samples) versus a dataset inflated with regression-recalculated data (11,000 recalculated samples from global and local regression-based levelling), as well as the potential effects of using goodness-of-fit criteria from the various regression-based recalculations between methods as weights for the final QRF predictive models. The research has been supported by the Technology Agency of the Czech Republic under the research project No. SS03010364.
Database: OpenAIRE