Title
Fusing Dual-Event Data Sets for Machine Learning Models and Their Evaluation.
Abstract
The search for new tuberculosis treatments continues, as we need molecules that act more quickly, can be accommodated in multidrug regimens, and overcome ever-increasing levels of drug resistance. Multiple large-scale phenotypic high-throughput screens against Mycobacterium tuberculosis (Mtb) have generated dose-response data, enabling the generation of machine learning models. These models also incorporated cytotoxicity data and were recently validated with a large external data set. A cheminformatics data-fusion approach followed by Bayesian machine learning, Support Vector Machine, or Recursive Partitioning (RP) model development (based on publicly available Mtb screening data) was used to compare individual data sets and subsequent combined models. A set of 1924 commercially available molecules with promising antitubercular activity (and a lack of relative cytotoxicity to Vero cells) was used to evaluate the predictive nature of the models. We demonstrate that combining three data sets incorporating antitubercular and Vero cell cytotoxicity data from our previous screens results in an external validation receiver operating characteristic (ROC) value of 0.83 (Bayesian or RP Forest). Models that do not have the highest 5-fold cross-validation ROC scores can outperform other models in a test-set-dependent manner. We demonstrate with predictions for a recently published set of Mtb leads from GlaxoSmithKline that no single machine learning model may be enough to identify compounds of interest. Data set fusion represents a further useful strategy for machine learning model construction, as illustrated with Mtb. Coverage of chemistry and Mtb target spaces may also be limiting factors for the whole-cell screening data generated to date.
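As a rough illustration of the ideas in the abstract, the sketch below shows one way a dual-event label (active against Mtb and not cytotoxic to Vero cells), a fusion of several screening data sets, and a ROC score could be expressed. It is not the authors' pipeline: the record layout, the function names `dual_event_label`, `fuse`, and `roc_auc`, and the "first occurrence wins" rule for duplicate compounds are all assumptions made for the example, and the boolean activity and cytotoxicity flags stand in for cutoffs that the abstract does not state.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical record layout: (compound_id, mtb_active, vero_cytotoxic).
Record = Tuple[str, bool, bool]


def dual_event_label(mtb_active: bool, vero_cytotoxic: bool) -> int:
    """Return 1 only when a compound is Mtb-active AND not cytotoxic."""
    return int(mtb_active and not vero_cytotoxic)


def fuse(datasets: Iterable[List[Record]]) -> Dict[str, int]:
    """Merge data sets keyed on compound id.

    Assumption: when a compound appears more than once, the first
    occurrence is kept.
    """
    fused: Dict[str, int] = {}
    for data in datasets:
        for compound_id, active, toxic in data:
            fused.setdefault(compound_id, dual_event_label(active, toxic))
    return fused


def roc_auc(labels: List[int], scores: List[float]) -> float:
    """Area under the ROC curve by pairwise comparison (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data, invented for the example.
set_a = [("c1", True, False), ("c2", True, True)]
set_b = [("c2", True, True), ("c3", False, False)]
fused = fuse([set_a, set_b])

# Scores from any model (Bayesian, SVM, RP) could be evaluated this way.
example_auc = roc_auc([1, 0, 0, 1], [0.9, 0.8, 0.3, 0.6])
```

The same `roc_auc` function can be applied to held-out folds or to an external test set, which is how a model with a lower cross-validation score can still rank higher on a particular test set.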
Year: 2013
DOI: 10.1021/ci400480s
Venue: Journal of Chemical Information and Modeling
Keywords: bayes theorem, roc curve, cytotoxins, decision trees, support vector machines, vero cells, artificial intelligence
Field: Data mining, Decision tree, Data set, Mycobacterium tuberculosis, Chemistry, Artificial intelligence, Cheminformatics, Bayes' theorem, Support vector machine, Recursive partitioning, Bioinformatics, Machine learning, Bayesian probability
DocType: Journal
Volume: 53
Issue: 11
ISSN: 1549-9596
Citations: 4
PageRank: 0.53
References: 21
Authors: 3
Name, Order, Citations, PageRank
Sean Ekins, 1, 68, 8.76
Joel S Freundlich, 2, 20, 2.55
Robert C Reynolds, 3, 10, 1.64