Title
Detection of Underrepresented Biological Sequences using Class-Conditional Distribution Models
Abstract
A labeled sequence data set related to a given biological property is often biased and therefore does not fully capture that property's diversity in nature. To reduce this sampling bias, a data mining procedure is proposed for detecting underrepresented relevant sequences. The procedure is aimed at helping domain experts achieve a cost-effective qualitative enlargement of knowledge through an in-depth study of a small number of statistically underrepresented and functionally interesting sequences. It consists of: (i) learning a class-conditional distribution model on each class of labeled data; (ii) applying the models to select statistically underrepresented unlabeled sequences; and (iii) automatically evaluating their interestingness. The approach is illustrated on the important problem of enlarging the data set of confirmed disordered proteins. The obtained results demonstrate the promise of the proposed approach for efficiently reducing sampling bias in biological databases.
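Steps (i) and (ii) of the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the paper's method: sequences are represented by amino-acid composition, each class is modeled by a diagonal Gaussian, and "statistically underrepresented" is read as "low log density under every class model". The function names (`composition`, `fit_gaussian`, `log_density`, `underrepresented`) and the variance floor are hypothetical. Step (iii), the interestingness evaluation, is not specified in the abstract and is left out.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def composition(seq):
    """Amino-acid frequency vector of one sequence (assumed feature choice)."""
    counts = Counter(seq)
    n = max(len(seq), 1)
    return [counts.get(a, 0) / n for a in ALPHABET]


def fit_gaussian(seqs, var_floor=1e-4):
    """Step (i): fit a diagonal Gaussian to one class of labeled sequences.

    The diagonal Gaussian and the variance floor are illustrative assumptions.
    """
    X = [composition(s) for s in seqs]
    d = len(ALPHABET)
    mean = [sum(x[j] for x in X) / len(X) for j in range(d)]
    var = [
        max(sum((x[j] - mean[j]) ** 2 for x in X) / len(X), var_floor)
        for j in range(d)
    ]
    return mean, var


def log_density(seq, model):
    """Log density of a sequence's composition under one class model."""
    mean, var = model
    x = composition(seq)
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def underrepresented(unlabeled, models, k):
    """Step (ii): return the k unlabeled sequences that every class model
    explains worst, i.e. those with the lowest best-class log density."""
    scored = [
        (max(log_density(s, m) for m in models.values()), s) for s in unlabeled
    ]
    scored.sort(key=lambda pair: pair[0])
    return [s for _, s in scored[:k]]


if __name__ == "__main__":
    # Toy data, invented purely to exercise the functions.
    labeled = {
        "ordered": ["ACACACAC", "ACCAACCA", "AACCAACC"],
        "disordered": ["SPSPSPSP", "SSPPSSPP", "PSPSSPSP"],
    }
    models = {label: fit_gaussian(seqs) for label, seqs in labeled.items()}
    print(underrepresented(["ACACCACA", "SPPSSPSP", "WWWWWWWW"], models, k=1))
```

A richer density model (for example a mixture model, or features beyond composition) would slot into `fit_gaussian` and `log_density` without changing the selection step.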
Year
2003
Venue
SIAM Proceedings Series
Keywords
conditional distribution, data mining
Field
Small number, Data mining, Distribution model, Conditional probability distribution, Pattern recognition, Computer science, Sampling bias, Biological database, Artificial intelligence, Data sequences, Labeled data
DocType
Conference
Citations
2
PageRank
0.49
References
5
Authors
4
Name | Order | Citations | PageRank
Slobodan Vucetic | 1 | 637 | 56.38
Dragoljub Pokrajac | 2 | 264 | 19.89
Hongbo M Xie | 3 | 137 | 12.22
Zoran Obradovic | 4 | 1110 | 137.41