Stochastic versus Stepwise Strategies for Quantitative Structure−Activity Relationship Generation: How Much Effort May the Mining for Successful QSAR Models Take?

Authors: Alexandre Varnek, Dragos Horvath, Vitaly P. Solov'ev, Fanny Bonachéra, Cédric Gaudin
Contributors: Chimie de la matière complexe (CMC), Université de Strasbourg (UNISTRA)-Institut de Chimie du CNRS (INC)-Centre National de la Recherche Scientifique (CNRS), Institut de Chimie de Strasbourg, Centre National de la Recherche Scientifique (CNRS)-Université Louis Pasteur - Strasbourg I-Institut de Chimie du CNRS (INC)
Language: English
Year of publication: 2007
Subject:
Source: Journal of Chemical Information and Modeling
Journal of Chemical Information and Modeling, American Chemical Society, 2007, 47 (3), pp.927-939. ⟨10.1021/ci600476r⟩
ISSN: 1549-9596, 1549-960X
DOI: 10.1021/ci600476r
Description: Descriptor selection in QSAR typically relies on a set of upfront working hypotheses in order to boil down the initial descriptor set to a tractable size. Stepwise regression, computationally cheap and therefore widely used in spite of its potential caveats, is most aggressive in reducing the effectively explored problem space by adopting a greedy variable pick strategy. This work explores an antipodal approach, incarnated by an original Genetic Algorithm (GA)-based Stochastic QSAR Sampler (SQS) that favors unbiased model search over computational cost. Independent of a priori descriptor filtering and, most importantly, not limited to linear models, it was benchmarked against the ISIDA Stepwise Regression (SR) tool. SQS was run under various premises, varying the training/validation set splitting scheme, the nonlinearity policy, and the descriptors used. With the three anti-HIV compound sets considered, repeated SQS runs sometimes generate poorly overlapping but nevertheless equally well validating model sets. Enabling SQS to apply nonlinear descriptor transformations increases the problem space; nevertheless, nonlinear models tend to be more robust validators. Model validation benchmarking showed SQS to match the performance of SR, or to outperform it in cases when the upfront simplifications of SR "backfire", even though the robust SR got trapped in local minima only once in six cases. Consensus models from large SQS model sets validate well, but not outstandingly better than SR consensus equations. SQS is thus a robust QSAR building tool according to standard validation tests against external sets of compounds (of the same families as used for training), but many of its benefits and drawbacks may not yet be revealed by such tests.
SQS results are a challenge to the traditional way to interpret and exploit QSAR: how to deal with thousands of well validating models that nonetheless provide potentially diverging applicability ranges and predicted values for external compounds? SR does not impose such a burden on the user, but is "betting" on a single equation or a narrow consensus model to behave properly in virtual screening a sound strategy? By posing these questions, this article will hopefully act as an incentive for the long-haul studies needed to get them answered.
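As an illustration of the "greedy variable pick strategy" the description attributes to stepwise regression, the following is a minimal sketch of forward stepwise selection on toy data. It is not the ISIDA SR tool or the SQS algorithm; the function names `fit_rss` and `forward_stepwise`, the stopping rule (a fixed number of terms), and the synthetic descriptors are all assumptions made for this example.

```python
import random

def fit_rss(X, y, cols):
    """Least-squares fit of y on an intercept plus the chosen columns.
    Returns the residual sum of squares (inf if the system is singular)."""
    rows = [[1.0] + [row[c] for c in cols] for row in X]
    p = len(rows[0])
    # Augmented normal equations [R^T R | R^T y]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(p)]
    # Gaussian elimination with partial pivoting
    for k in range(p):
        piv = max(range(k, p), key=lambda i: abs(A[i][k]))
        if abs(A[piv][k]) < 1e-12:
            return float("inf")  # collinear columns: reject this subset
        A[k], A[piv] = A[piv], A[k]
        for i in range(k + 1, p):
            f = A[i][k] / A[k][k]
            for j in range(k, p + 1):
                A[i][j] -= f * A[k][j]
    # Back substitution for the coefficients
    b = [0.0] * p
    for k in reversed(range(p)):
        b[k] = (A[k][p] - sum(A[k][j] * b[j] for j in range(k + 1, p))) / A[k][k]
    return sum((t - sum(bi * ri for bi, ri in zip(b, r))) ** 2
               for r, t in zip(rows, y))

def forward_stepwise(X, y, max_terms):
    """Greedy pick: at each step add the single descriptor that lowers
    the training RSS the most; earlier picks are never revisited."""
    selected, remaining = [], list(range(len(X[0])))
    while remaining and len(selected) < max_terms:
        best = min(remaining, key=lambda c: fit_rss(X, y, selected + [c]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data (hypothetical): 5 random descriptors, activity depends on columns 1 and 3.
random.seed(0)
X = [[random.gauss(0, 1) for _ in range(5)] for _ in range(60)]
y = [2.0 * r[1] - 3.0 * r[3] + random.gauss(0, 0.1) for r in X]
selected = forward_stepwise(X, y, max_terms=2)
print(sorted(selected))
```

Because each step commits to one descriptor and never backtracks, only a narrow path through the space of descriptor subsets is explored. This is the reduction of the problem space that the stochastic sampling approach described above is meant to avoid, at a higher computational cost.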
Database: OpenAIRE