Fast embedding methods for clustering tens of thousands of sequences. - Citegraph

Paper Info

Title
Fast embedding methods for clustering tens of thousands of sequences.

Abstract
Most sequence clustering methods require a full distance matrix to be computed between all pairs of sequences. This requires computer memory and time proportional to N^2 for N sequences. For small N, say up to 10,000 or so, this can be accomplished in reasonable time for sequences of moderate length. For very large N, however, it becomes increasingly prohibitive. In this paper, we test variations on a class of published embedding methods designed for clustering large numbers of complex objects where the individual distance calculations are expensive. These methods embed the sequences in a space where the similarities within a set of sequences can be closely approximated without having to compute all pairwise distances. We show how this approach greatly reduces computation time and memory requirements for clustering large numbers of sequences, and we demonstrate the quality of the clusterings by benchmarking them as guide trees for multiple alignments. Source code is available on request from the authors.
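
To make the idea in the abstract concrete, here is a minimal sketch of one simple embedding scheme. It is an illustration under stated assumptions, not necessarily any of the variants tested in the paper: each sequence is represented by its edit distances to a small set of k randomly chosen reference sequences, so only N*k distances are computed instead of N(N-1)/2. The function names (`edit_distance`, `embed`, `euclidean`), the random choice of references, and the use of plain Levenshtein distance are all hypothetical choices made for this example.

```python
import random


def edit_distance(a, b):
    """Levenshtein distance between two strings (two-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def embed(sequences, num_refs, seed=0):
    """Map each sequence to a vector of distances to num_refs reference sequences.

    Assumption for this sketch: references are sampled uniformly at random
    from the input list. Cost is len(sequences) * num_refs distance calls.
    """
    rng = random.Random(seed)
    refs = rng.sample(sequences, num_refs)
    return [[edit_distance(s, r) for r in refs] for s in sequences]


def euclidean(u, v):
    """Euclidean distance between two embedded vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v)) ** 0.5


if __name__ == "__main__":
    seqs = ["ACGTACGT", "ACGTACGA", "TTTTGGGG", "TTTTGGGA"]
    vectors = embed(seqs, num_refs=2)
    # Distances between embedded vectors stand in for the expensive
    # pairwise edit distances when clustering.
    print(euclidean(vectors[0], vectors[1]), euclidean(vectors[0], vectors[2]))
```

The embedded vectors can then be passed to any standard vector clustering routine. How well the approximation preserves the original distances depends on how many references are used and how they are chosen, which this sketch does not attempt to tune.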

Year	DOI	Venue
2008	10.1016/j.compbiolchem.2008.03.005	Computational Biology and Chemistry
Keywords	Field	DocType
Clustering,Edit distance,Multiple sequence alignment,Data embedding	Sequence clustering,Edit distance,Fuzzy clustering,Data stream clustering,Embedding,Correlation clustering,Algorithm,Theoretical computer science,Distance matrix,Cluster analysis,Genetics,Mathematics	Journal
Volume	Issue	ISSN
32	4	1476-9271
Citations	PageRank	References
3	0.49	6
Authors
5

Authors (5 rows)

Name	Order	Citations	PageRank
Gordon Blackshields	1	460	33.48
Mark Larkin	2	412	31.10
Iain M. Wallace	3	484	34.62
Andreas Wilm	4	571	37.26
Desmond G. Higgins	5	1263	383.91
