Title
Efficient classification of massive, unsegmented datastreams
Abstract
We report on the development and application of an efficient unsupervised learning procedure for the classification of an unsegmented datastream, given a set of probabilistic binary similarity judgments between regions in the stream. The presence of noise in the similarity judgments and in the extent of similar regions is taken into account. Our method is effective on very large datastreams, and produces both a classified collection of segments and a set of frequency matrices that define patterns for the induced classes. We applied this method to the problem of finding the sequence-level building blocks of proteins. We first tested the clusterer on synthetic protein data with known evolutionary history. We then applied the method to a large protein sequence database (a datastream of more than elements) and found about 10,000 protein sequence classes, which can be divided into definitions of protein families (classes of whole proteins) and definitions of protein building blocks (patterns that appear in otherwise unrelated proteins). These motifs are of significant biological interest.
Year: 1992
DOI: 10.1016/B978-1-55860-247-2.50034-6
Venue: ML
Keywords: unsegmented datastreams, efficient classification
Field: Protein family, Sequence database, Pattern recognition, Protein sequencing, Computer science, Matrix (mathematics), Synthetic protein, Unsupervised learning, Artificial intelligence, Probabilistic logic, Machine learning, Binary number
DocType: Conference
Issue: 1
ISBN: 1-55860-247-X
Citations: 4
PageRank: 1.21
References: 3
Authors: 3
Name | Order | Citations | PageRank
Lawrence Hunter | 1 | 174 | 25.76
Nomi Harris | 2 | 56 | 9.88
David J. States | 3 | 551 | 106.06