Title
Batch Sample Design from Databases for Logistic Regression
Abstract
The prevalence of large observational databases offers potential for identifying predictive relationships among variables of interest, although observational data are generally far less informative and less reliable than experimental data. We consider the problem of selecting a subset of records from a large observational database, for the purpose of designing a small but powerful experiment involving the selected records. It is assumed that the database contains the predictor variables but is missing the response variable, and that the purpose is to fit a logistic regression model after the response is obtained via the experiment. Active learning methods, which treat a similar problem, usually select records sequentially and focus on the single objective of classification accuracy. In contrast, many emerging applications require batch sample designs and have a variety of objectives that may include classification accuracy or accuracy of the estimated parameters, the latter being more in line with the optimal design of experiments (DOE) paradigm. The aim of this paper is to explore batch sampling from databases from a DOE perspective, particularly regarding the configuration, performance, and robustness of the designs that result from the different criteria. Through extensive simulation, we show that DOE-based batch sampling methods can substantially outperform random sampling and the entropy method that is popular in active learning. We also provide insight and guidelines for selecting appropriate design criteria and modeling assumptions. Copyright (C) 2016 John Wiley & Sons, Ltd.
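The abstract does not spell out the design criteria or the selection algorithm, so the following is only a minimal sketch of one common DOE-style approach: greedy, locally D-optimal batch selection for logistic regression. It assumes a prior guess of the parameters (`beta_guess`) and a small ridge term to keep the information matrix invertible; the function names, the greedy strategy, and the ridge value are illustrative assumptions, not the authors' method.

```python
import numpy as np

def logistic_weights(X, beta):
    """Fisher-information weights p(1 - p) of each record under a guessed beta."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    return p * (1.0 - p)

def greedy_d_optimal_batch(X, beta_guess, batch_size, ridge=1e-3):
    """Greedily pick rows of X that most increase det of the information matrix.

    Assumption: M = ridge * I + sum_i w_i x_i x_i^T over the selected rows,
    with w_i evaluated at beta_guess (a locally optimal design).
    """
    n, d = X.shape
    w = logistic_weights(X, beta_guess)
    M_inv = np.eye(d) / ridge          # inverse of the starting matrix ridge * I
    available = np.ones(n, dtype=bool)
    chosen = []
    for _ in range(batch_size):
        # Matrix determinant lemma: det(M + w x x^T) = det(M) * (1 + w x^T M^-1 x),
        # so the best next record maximizes w_i * x_i^T M^-1 x_i.
        gain = w * np.einsum("ij,jk,ik->i", X, M_inv, X)
        gain[~available] = -np.inf
        i = int(np.argmax(gain))
        chosen.append(i)
        available[i] = False
        # Sherman-Morrison rank-one update of the inverse.
        u = M_inv @ X[i]
        M_inv -= w[i] * np.outer(u, u) / (1.0 + w[i] * (X[i] @ u))
    return chosen

def log_det_information(X, beta, idx, ridge=1e-3):
    """log det of the ridge-regularized information matrix for rows idx."""
    Xs = X[idx]
    w = logistic_weights(Xs, beta)
    M = (Xs * w[:, None]).T @ Xs + ridge * np.eye(X.shape[1])
    return np.linalg.slogdet(M)[1]

# Hypothetical database: 500 records, an intercept plus two predictors.
rng = np.random.default_rng(0)
X_pool = np.column_stack([np.ones(500), rng.normal(size=(500, 2))])
beta0 = np.array([0.0, 1.0, -1.0])     # assumed prior guess of the parameters
batch = greedy_d_optimal_batch(X_pool, beta0, batch_size=10)
random_batch = rng.choice(500, size=10, replace=False)
print("greedy log det:", log_det_information(X_pool, beta0, batch))
print("random log det:", log_det_information(X_pool, beta0, random_batch))
```

Comparing the two printed log-determinants gives a rough sense of how much information a designed batch can carry relative to a random one under the assumed parameters. A different criterion or a poor parameter guess would change the selected records.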
Year
2017
DOI
10.1002/qre.1992
Venue
QUALITY AND RELIABILITY ENGINEERING INTERNATIONAL
Keywords
optimal design of experiment, active learning, sampling from databases, logistic regression
Field
Econometrics, Data mining, Observational study, Sampling design, Experimental data, Computer science, Robustness (computer science), Logistic regression, Optimal design of experiments, Active learning, Sampling (statistics), Statistics, Database
DocType
Journal
Volume
33
Issue
1
ISSN
0748-8017
Citations
0
PageRank
0.34
References
7
Authors
3
Name
Order
Citations
PageRank
Liwen Ouyang100.34
Daniel W Apley27112.66
Sanjay Mehrotra352177.18