Title
Making automated multiple alignments of very large numbers of protein sequences.
Abstract
Recent developments in sequence alignment software have made possible multiple sequence alignments (MSAs) of >100 000 sequences in reasonable times. At present, there are no systematic analyses concerning the scalability of the alignment quality as the number of aligned sequences is increased. We benchmarked a wide range of widely used MSA packages using a selection of protein families with some known structures and found that the accuracy of such alignments decreases markedly as the number of sequences grows. This is more or less true of all packages and protein families. The phenomenon is mostly due to the accumulation of alignment errors, rather than problems in guide-tree construction. This is partly alleviated by using iterative refinement or selectively adding sequences. The average accuracy of progressive methods by comparison with structure-based benchmarks can be improved by incorporating information derived from high-quality structural alignments of sequences with solved structures. This suggests that the availability of high quality curated alignments will have to complement algorithmic and/or software developments in the long-term. Benchmark data used in this study are available at http://www.clustal.org/omega/homfam-20110613-25.tar.gz and http://www.clustal.org/omega/bali3fam-26.tar.gz. Supplementary data are available at Bioinformatics online.
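The abstract measures accuracy by comparing a large automated alignment against a structure-based reference that covers only a few of the sequences. The sketch below shows one common way such a comparison can be scored: a sum-of-pairs score restricted to the reference sequences. The abstract does not say which score the authors used, so the choice of score is an assumption. The function names, and the representation of an alignment as a dict of gapped strings, are hypothetical.

```python
from itertools import combinations


def aligned_pairs(row_a, row_b, gap="-"):
    """Return the set of (i, j) residue-index pairs that share a column."""
    pairs = set()
    i = j = 0  # running residue indices (gaps are not counted)
    for a, b in zip(row_a, row_b):
        if a != gap and b != gap:
            pairs.add((i, j))
        if a != gap:
            i += 1
        if b != gap:
            j += 1
    return pairs


def sum_of_pairs_score(reference, test):
    """Fraction of residue pairs aligned in `reference` that are also
    aligned in `test`, over the sequences present in both.

    Both arguments are dicts mapping sequence name -> gapped row.
    `test` may contain many more sequences than `reference`; the extra
    rows are ignored, which mimics scoring a large alignment on its
    small structure-based core.
    """
    names = sorted(set(reference) & set(test))
    total = found = 0
    for x, y in combinations(names, 2):
        ref_pairs = aligned_pairs(reference[x], reference[y])
        test_pairs = aligned_pairs(test[x], test[y])
        total += len(ref_pairs)
        found += len(ref_pairs & test_pairs)
    return found / total if total else 0.0


# Toy data (invented for illustration, not taken from the benchmark sets).
reference = {"s1": "AC-GT", "s2": "ACAGT"}
test = {"s1": "ACG-T", "s2": "ACAGT", "s3": "ACAGT"}
score = sum_of_pairs_score(reference, test)
```

Because pairs are compared by residue index rather than by column number, extra gap columns introduced by the additional sequences in the large alignment do not affect the score.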
Year
2013
DOI
10.1093/bioinformatics/btt093
Venue
Bioinformatics
Keywords
protein family, possible multiple sequence alignment, high quality, curated alignment, benchmark data, large number, sequence alignment software, automated multiple alignment, alignment error, average accuracy, protein sequence, high-quality structural alignment, alignment quality, dna sequencing, sequence analysis
Field
Sequence alignment, Iterative refinement, Protein family, Data mining, Computer science, Software, Bioinformatics, Multiple sequence alignment, Sequence analysis, Scalability
DocType
Journal
Volume
29
Issue
8
ISSN
1367-4811
Citations
19
PageRank
1.05
References
22
Authors
4
Name
Order
Citations
PageRank
Fabian Sievers 1785.49
David Dineen 2241.59
Andreas Wilm 357137.26
Desmond G. Higgins 41263383.91