Title
Estimation of the applicability domain of kernel-based machine learning models for virtual screening.
Abstract
The virtual screening of large compound databases is an important application of structure-activity relationship models. Because these data sets are structurally highly diverse, machine learning based QSAR models, which rely on a specific training set, cannot give reliable results for all compounds. It is therefore important to consider the subset of chemical space in which the model is applicable. Most approaches to this problem published so far use vectorial descriptor representations to define this domain of applicability. Unfortunately, these cannot be extended easily to structured kernel-based machine learning models. For this reason, we propose three approaches to estimate the domain of applicability of a kernel-based QSAR model.
We evaluated the three kernel-based applicability domain estimations using three different structured kernels on three virtual screening tasks. Each experiment consisted of training a kernel-based QSAR model using support vector regression and ranking a disjoint screening data set according to the predicted activity. For each prediction, the applicability of the model to the respective compound was quantified by a score obtained from an applicability domain formulation. The suitability of each applicability domain estimation was evaluated by comparing the model performance on subsets of the screening data sets obtained with different thresholds on the applicability scores. This comparison indicates that it is possible to separate the part of chemical space in which the model gives reliable predictions from the part consisting of structures too dissimilar to the training set for the model to be applied successfully. A closer inspection reveals that the virtual screening performance of the model improves considerably if the half of the molecules with the lowest applicability scores is omitted from the screening.
The proposed applicability domain formulations for kernel-based QSAR models can successfully identify compounds for which no reliable predictions can be expected from the model. The resulting reduction of the search space and the elimination of some of the active compounds should not be considered a drawback, because the results indicate that, in most cases, the model would not have found these omitted ligands anyway.
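The thresholding step described in the abstract can be illustrated with a small sketch. The abstract does not define the three applicability domain formulations, so the score used here (mean kernel similarity to the k most similar training compounds) is an assumption made for illustration only, as are the function names, the toy kernel values, and the toy predictions. In the described experiments the predictions come from support vector regression; here they are simply given as a list.

```python
from typing import List, Sequence


def applicability_scores(kernel_rows: Sequence[Sequence[float]], k: int = 2) -> List[float]:
    """Hypothetical applicability score (an assumption, not the paper's formulation).

    kernel_rows[i][j] is a normalized kernel value between screening
    compound i and training compound j. The score is the mean similarity
    to the k most similar training compounds.
    """
    scores = []
    for row in kernel_rows:
        nearest = sorted(row, reverse=True)[:k]
        scores.append(sum(nearest) / len(nearest))
    return scores


def keep_most_applicable(scores: Sequence[float], keep_fraction: float = 0.5) -> List[int]:
    """Return the indices of the compounds with the highest scores."""
    n_keep = max(1, int(len(scores) * keep_fraction))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:n_keep])


def rank_by_prediction(indices: Sequence[int], predicted: Sequence[float]) -> List[int]:
    """Rank the retained compounds by predicted activity, highest first."""
    return sorted(indices, key=lambda i: predicted[i], reverse=True)


# Toy data: 4 screening compounds, 3 training compounds (made-up values).
kernel_rows = [
    [0.9, 0.8, 0.1],
    [0.2, 0.1, 0.3],
    [0.7, 0.6, 0.5],
    [0.1, 0.1, 0.1],
]
predicted = [5.1, 9.0, 6.3, 8.7]  # stand-in for model predictions

scores = applicability_scores(kernel_rows, k=2)
kept = keep_most_applicable(scores, keep_fraction=0.5)
ranking = rank_by_prediction(kept, predicted)
print(kept, ranking)
```

In this toy example, compounds 1 and 3 have the highest predicted activities but the lowest similarity to the training set, so they are removed before ranking. This mirrors the idea of omitting the half of the screening set with the lowest applicability scores.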
Year
2010
DOI
10.1186/1758-2946-2-2
Venue
J. Cheminformatics
Keywords
support vector regression, machine learning, search space, virtual screening, bioinformatics, structure activity relationship, biomedical research
Field
Data mining, Quantitative structure–activity relationship, Data set, Computer science, Polynomial kernel, Artificial intelligence, Chemical space, Applicability domain, Virtual screening, Kernel (linear algebra), Bioinformatics, Mixture model, Machine learning
DocType
Journal
Volume
2
Issue
1
ISSN
1758-2946
Citations
6
PageRank
0.53
References
34
Authors
4
Name | Order | Citations | PageRank
Nikolas Fechner | 1 | 103 | 8.38
Andreas Jahn | 2 | 56 | 2.93
Georg Hinselmann | 3 | 96 | 8.12
Andreas Zell | 4 | 1419 | 137.58