Abstract | ||
---|---|---|
Protein secondary structure prediction and high-throughput drug screen data mining are two important applications in bioinformatics. The data is represented in sparse feature spaces and can be unrepresentative of future data. Supervised learners in this context will display their inherent bias toward certain solutions, generally solutions that fit the training set well. In this paper, we first describe an ensemble approach using subsampling that scales well with dataset size. A sufficient number of ensemble members using subsamples of the data can yield a more accurate classifier than a single classifier using the entire dataset. Experiments on several datasets demonstrate the effectiveness of the approach. We report results from the KDD Cup 2001 drug discovery dataset in which our approach yields a higher weighted accuracy than the winning entry. We then extend our ensemble approach to create an over-generalized classifier for prediction by reducing the individual subsample size. The ensemble strategy using small subsamples has the effect of averaging over a wider range of hypotheses. We show that both protein secondary structure prediction and drug discovery prediction can be improved by the use of over-generalization, specifically through the use of ensembles of small subsamples. |
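
The ensemble-of-small-subsamples idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the nearest-centroid base learner, sampling without replacement, unweighted majority voting, and all function names (`train_centroid`, `predict_centroid`, `train_ensemble`, `predict_ensemble`) are assumptions chosen only to keep the example self-contained.

```python
import random
from collections import Counter


def train_centroid(samples):
    """Hypothetical base learner: store one mean feature vector per class."""
    sums, counts = {}, Counter()
    for x, y in samples:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] += 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict_centroid(model, x):
    """Return the class whose centroid is closest to x (squared Euclidean)."""
    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(x, centroid))
    return min(model, key=lambda y: dist(model[y]))


def train_ensemble(data, n_members, subsample_size, seed=0):
    """Train each member on its own small random subsample of the data.

    Assumption: subsamples are drawn without replacement; the abstract
    does not say how they are drawn.
    """
    rng = random.Random(seed)
    return [
        train_centroid(rng.sample(data, subsample_size))
        for _ in range(n_members)
    ]


def predict_ensemble(ensemble, x):
    """Combine member predictions by unweighted majority vote (an assumption)."""
    votes = Counter(predict_centroid(member, x) for member in ensemble)
    return votes.most_common(1)[0][0]
```

Here `data` is a list of `(feature_tuple, label)` pairs. Lowering `subsample_size` while raising `n_members` corresponds to the regime the abstract describes, where many members trained on small subsamples average over a wider range of hypotheses.
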
Year | Venue | Keywords |
---|---|---|
2002 | BIOKDD | data mining,feature space,high throughput,drug discovery |
Field | DocType | Citations |
Training set,Protein secondary structure prediction,Data mining,Computer science,Artificial intelligence,Bioinformatics,Classifier (linguistics),Machine learning | Conference | 4 |
PageRank | References | Authors |
0.55 | 13 | 3 |
Name | Order | Citations | PageRank |
---|---|---|---|
Steven Eschrich | 1 | 89 | 10.81 |
Nitesh Chawla | 2 | 7257 | 345.79 |
Lawrence O. Hall | 3 | 5543 | 335.87 |