Generalization Methods in Bioinformatics - Citegraph

Paper Info

Title
Generalization Methods in Bioinformatics

Abstract
Protein secondary structure prediction and high-throughput drug screen data mining are two important applications in bioinformatics. The data is represented in sparse feature spaces and can be unrepresentative of future data. Supervised learners in this context will display their inherent bias toward certain solutions, generally solutions that fit the training set well. In this paper, we first describe an ensemble approach using subsampling that scales well with dataset size. A sufficient number of ensemble members using subsamples of the data can yield a more accurate classifier than a single classifier using the entire dataset. Experiments on several datasets demonstrate the effectiveness of the approach. We report results from the KDD Cup 2001 drug discovery dataset in which our approach yields a higher weighted accuracy than the winning entry. We then extend our ensemble approach to create an over-generalized classifier for prediction by reducing the individual subsample size. The ensemble strategy using small subsamples has the effect of averaging over a wider range of hypotheses. We show that both protein secondary structure prediction and drug discovery prediction can be improved by the use of over-generalization, specifically through the use of ensembles of small subsamples.
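
The ensemble-of-small-subsamples idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the nearest-centroid base learner, the unweighted majority vote, and every function and parameter name here (train_ensemble, predict_ensemble, n_members, subsample_size) are assumptions chosen to keep the example self-contained. The abstract does not say which base classifier or combination rule the paper uses.

```python
import random
from collections import Counter


def train_centroid(samples):
    """Fit a nearest-centroid classifier (an assumed stand-in base learner)."""
    sums, counts = {}, Counter()
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] += 1
    # One centroid (mean feature vector) per class seen in the subsample.
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict_centroid(model, features):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: dist(model[label]))


def train_ensemble(data, n_members, subsample_size, seed=0):
    """Train each member on its own random subsample drawn without replacement."""
    rng = random.Random(seed)
    return [train_centroid(rng.sample(data, subsample_size)) for _ in range(n_members)]


def predict_ensemble(models, features):
    """Combine member predictions by unweighted majority vote (an assumption)."""
    votes = Counter(predict_centroid(m, features) for m in models)
    return votes.most_common(1)[0][0]
```

Here data is a list of (feature_vector, label) pairs. Lowering subsample_size while raising n_members corresponds to the over-generalization setting the abstract describes, in which each member sees only a small part of the training set and the vote averages over a wider range of hypotheses.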

Year	Venue	Keywords
2002	BIOKDD	data mining, feature space, high throughput, drug discovery
Field	DocType	Citations
Training set, Protein secondary structure prediction, Data mining, Computer science, Artificial intelligence, Bioinformatics, Classifier (linguistics), Machine learning	Conference	4
PageRank	References	Authors
0.55	13	3

Authors (3 rows)

Name	Order	Citations	PageRank
Steven Eschrich	1	89	10.81
Nitesh Chawla	2	7257	345.79
Lawrence O. Hall	3	5543	335.87
