Title
Generalization Methods in Bioinformatics
Abstract
Protein secondary structure prediction and high-throughput drug screen data mining are two important applications in bioinformatics. The data is represented in sparse feature spaces and can be unrepresentative of future data. Supervised learners in this context will display their inherent bias toward certain solutions, generally solutions that fit the training set well. In this paper, we first describe an ensemble approach using subsampling that scales well with dataset size. A sufficient number of ensemble members using subsamples of the data can yield a more accurate classifier than a single classifier using the entire dataset. Experiments on several datasets demonstrate the effectiveness of the approach. We report results from the KDD Cup 2001 drug discovery dataset in which our approach yields a higher weighted accuracy than the winning entry. We then extend our ensemble approach to create an over-generalized classifier for prediction by reducing the individual subsample size. The ensemble strategy using small subsamples has the effect of averaging over a wider range of hypotheses. We show that both protein secondary structure prediction and drug discovery prediction can be improved by the use of over-generalization, specifically through the use of ensembles of small subsamples.
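The ensemble-of-small-subsamples idea in the abstract can be illustrated with a minimal sketch. The abstract does not specify the base learner, the combination rule, or any parameter values, so everything below is an assumption made for illustration: a nearest-centroid classifier stands in for the base learner, members are combined by majority vote, and the function names (`train_subsample_ensemble`, `predict_ensemble`) and defaults (`n_members`, `subsample_size`) are hypothetical. It is not the authors' implementation.

```python
import random
from collections import Counter


def fit_centroids(X, y):
    """Stand-in base learner (assumption): one mean vector per class."""
    sums, counts = {}, Counter(y)
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def predict_centroids(model, row):
    """Return the class whose centroid is closest to `row`."""
    def sqdist(c):
        return sum((a - b) ** 2 for a, b in zip(row, model[c]))
    return min(model, key=sqdist)


def train_subsample_ensemble(X, y, n_members=25, subsample_size=10, seed=0):
    """Train each member on its own small random subsample of the data."""
    rng = random.Random(seed)
    k = min(subsample_size, len(X))
    members = []
    for _ in range(n_members):
        # Sample row indices without replacement for this member.
        chosen = rng.sample(range(len(X)), k)
        members.append(
            fit_centroids([X[i] for i in chosen], [y[i] for i in chosen])
        )
    return members


def predict_ensemble(members, row):
    """Combine member predictions by simple majority vote (assumption)."""
    votes = Counter(predict_centroids(m, row) for m in members)
    return votes.most_common(1)[0][0]
```

Under these assumptions, shrinking `subsample_size` while keeping `n_members` large is the knob that corresponds to the over-generalization described in the abstract: each member sees less of the training set, and the vote averages over a wider range of hypotheses.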
Year: 2002
Venue: BIOKDD
Keywords: data mining, feature space, high throughput, drug discovery
Field: Training set, Protein secondary structure prediction, Data mining, Computer science, Artificial intelligence, Bioinformatics, Classifier (linguistics), Machine learning
DocType: Conference
Citations: 4
PageRank: 0.55
References: 13
Authors: 3
Name | Order | Citations | PageRank
Steven Eschrich | 1 | 89 | 10.81
Nitesh Chawla | 2 | 7257 | 345.79
Lawrence O. Hall | 3 | 5543 | 335.87