Efficient classification of massive, unsegmented datastreams - Citegraph

Paper Info

Title
Efficient classification of massive, unsegmented datastreams

Abstract
We report on the development and application of an efficient unsupervised learning procedure for the classification of an unsegmented datastream, given a set of probabilistic binary similarity judgments between regions in the stream. The presence of noise in the similarity judgments and in the extent of similar regions is taken into account. Our method is effective on very large datastreams, and produces both a classified collection of segments and a set of frequency matrices that define patterns for the induced classes. We applied this method to the problem of finding the sequence-level building blocks of proteins. We first tested the clusterer on synthetic protein data with known evolutionary history. We then applied the method to a large protein sequence database (a datastream of more than elements) and found about 10,000 protein sequence classes which can be divided into definitions of protein families (classes of whole proteins) and definitions of protein building blocks (patterns that appear in otherwise unrelated proteins). These motifs are of significant biological interest.

Year	DOI	Venue
1992	10.1016/B978-1-55860-247-2.50034-6	ML
Keywords	Field	DocType
unsegmented datastreams, efficient classification	Protein family, Sequence database, Pattern recognition, Protein sequencing, Computer science, Matrix (mathematics), Synthetic protein, Unsupervised learning, Artificial intelligence, Probabilistic logic, Machine learning, Binary number	Conference
Issue	ISBN	Citations
1	1-55860-247-X	4
PageRank	References	Authors
1.21	3	3

Authors (3 rows)

Name	Order	Citations	PageRank
Lawrence Hunter	1	174	25.76
Nomi Harris	2	56	9.88
David J. States	3	551	106.06
