Abstract | ||
---|---|---|
We report on the development and application of an efficient unsupervised learning procedure for the classification of an unsegmented datastream, given a set of probabilistic binary similarity judgments between regions in the stream. The presence of noise in the similarity judgments and in the extent of similar regions is taken into account. Our method is effective on very large datastreams, and produces both a classified collection of segments and a set of frequency matrices that define patterns for the induced classes. We applied this method to the problem of finding the sequence-level building blocks of proteins. We first tested the clusterer on synthetic protein data with known evolutionary history. We then applied the method to a large protein sequence database (a datastream of more than elements) and found about 10,000 protein sequence classes, which can be divided into definitions of protein families (classes of whole proteins) and definitions of protein building blocks (patterns that appear in otherwise unrelated proteins). These motifs are of significant biological interest. |
Year | DOI | Venue |
---|---|---|
1992 | 10.1016/B978-1-55860-247-2.50034-6 | ML |
Keywords | Field | DocType |
unsegmented datastreams,efficient classification | Protein family,Sequence database,Pattern recognition,Protein sequencing,Computer science,Matrix (mathematics),Synthetic protein,Unsupervised learning,Artificial intelligence,Probabilistic logic,Machine learning,Binary number | Conference |
Issue | ISBN | Citations |
1 | 1-55860-247-X | 4 |
PageRank | References | Authors |
1.21 | 3 | 3 |
Name | Order | Citations | PageRank |
---|---|---|---|
Lawrence Hunter | 1 | 174 | 25.76 |
Nomi Harris | 2 | 56 | 9.88 |
David J. States | 3 | 551 | 106.06 |