Interpolative multidimensional scaling techniques for the identification of clusters in very large sequence sets. - Citegraph

Paper Info

Title
Interpolative multidimensional scaling techniques for the identification of clusters in very large sequence sets.

Abstract
Modern pyrosequencing techniques make it possible to study complex bacterial populations, for example through their 16S rRNA genes, directly from environmental or clinical samples without the need for laboratory purification. Alignment of sequences across the resulting large data sets (100,000+ sequences) is of particular interest for identifying potential gene clusters and families, but such analysis is a daunting computational task. The aim of this work is to develop an efficient pipeline for clustering large sets of sequence reads.
Pairwise alignment techniques are used here to calculate genetic distances between sequence pairs. These methods are pleasingly parallel and have been shown to reflect genetic distances in highly variable regions of rRNA genes more accurately than traditional multiple sequence alignment (MSA) approaches. By combining Needleman-Wunsch (NW) pairwise alignment with novel implementations of interpolative multidimensional scaling (MDS), we have developed an effective method for visualizing massive biosequence data sets and quickly identifying potential gene clusters.
This study demonstrates the use of interpolative MDS to obtain clustering results that are qualitatively similar to those obtained through full MDS, but at substantially lower cost. In particular, the wall-clock time required to cluster a set of 100,000 sequences was reduced from seven hours to less than one hour through the use of interpolative MDS.
Although work remains to be done in selecting the optimal training set size for interpolative MDS, the substantial savings in computational cost will allow us to cluster much larger sequence sets in the future.
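The two building blocks named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the scoring scheme, the distance definition, or the interpolation update, so everything here is an assumption chosen for illustration. The function names (`nw_align`, `nw_distance`, `interpolate_point`), the match/mismatch/gap scores, the use of "1 minus fraction of identical aligned columns" as the distance, and the single-point majorization update are all hypothetical.

```python
import math


def nw_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty.

    Returns (identical_columns, alignment_length). Scores are assumed values.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back one optimal path, counting columns and identical columns.
    i, j, matches, length = n, m, 0, 0
    while i > 0 or j > 0:
        length += 1
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                matches += a[i - 1] == b[j - 1]
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1  # gap in b
        else:
            j -= 1  # gap in a
    return matches, length


def nw_distance(a, b):
    """Assumed distance: 1 - (identical columns / alignment length)."""
    matches, length = nw_align(a, b)
    return 1.0 - matches / length if length else 0.0


def interpolate_point(neighbors, deltas, iterations=100):
    """Place one out-of-sample point among fixed in-sample neighbors.

    neighbors: coordinates of the k nearest already-embedded points.
    deltas:    original (sequence) distances from the new point to each one.
    Uses a majorization-style update that never increases the stress
    sum((||x - p_i|| - delta_i)^2); the embedded points are not moved.
    """
    k, dim = len(neighbors), len(neighbors[0])
    centroid = [sum(p[d] for p in neighbors) / k for d in range(dim)]
    x = list(centroid)  # start at the neighbors' centroid
    for _ in range(iterations):
        step = [0.0] * dim
        for p, delta in zip(neighbors, deltas):
            dist = math.dist(x, p)
            if dist == 0.0:
                continue  # direction undefined; skip this neighbor
            for d in range(dim):
                step[d] += delta / dist * (x[d] - p[d])
        x = [centroid[d] + step[d] / k for d in range(dim)]
    return x
```

In a pipeline of this shape, full MDS would be run only on a smaller in-sample (training) set. Each remaining read would then be compared against in-sample reads with `nw_distance`, and its coordinates estimated with `interpolate_point` from its nearest neighbors. Because every out-of-sample point is placed independently, this step parallelizes in the same way as the pairwise alignments.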

Year	DOI	Venue
2012	10.1186/1471-2105-13-S2-S9	BMC Bioinformatics
Keywords	Field	DocType
multidimensional scaling,genes,algorithms,pyrosequencing,cluster analysis,metagenomics,sequence alignment	Pairwise comparison,Data set,Alignment-free sequence analysis,Biosequence,Multidimensional scaling,Biology,Bioinformatics,Genetics,Cluster analysis,Multiple sequence alignment,Sequence analysis	Journal
Volume	Issue	ISSN
13 Suppl 2	S-2	1471-2105
Citations	PageRank	References
11	0.50	3
Authors
8

Authors (8 rows)

Name	Order	Citations	PageRank
Adam Hughes	1	81	3.90
Yang Ruan	2	112	6.26
Saliya Ekanayake	3	90	9.34
Seung-Hee Bae	4	571	31.67
Qunfeng Dong	5	504	34.86
Mina Rho	6	34	2.62
Judy Qiu	7	743	43.25
Geoffrey Fox	8	4070	575.38
