Title
Improved BLAST searches using longer words for protein seeding
Abstract
Motivation: The blastp and tblastn modules of BLAST are widely used methods for searching protein queries against protein and nucleotide databases, respectively. One heuristic used in BLAST is to consider only database sequences that contain a high-scoring match of length at most 5 to the query.
Results: We implemented the capability to use words of length 6 or 7. We demonstrate an improved trade-off between running time and retrieval accuracy, controlled by the score threshold used for short word matches. For example, the running time can be reduced by 20-30% while achieving ROC (receiver operator characteristic) scores similar to those obtained with current default parameters.
Availability: The option to use long words is in the NCBI C and C++ toolkit code for BLAST, starting with version 2.2.16 of blastall. A Linux executable used to produce the results herein is available at: ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/protein_longwords
Contact: richa@helix.nih.gov
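The word-seeding heuristic described in the abstract can be illustrated with a minimal sketch. It assumes a toy match/mismatch score in place of a real substitution matrix such as BLOSUM62, and it compares every query word with every subject word by brute force instead of using a lookup table. The names `word_score` and `find_seeds` and the constants `MATCH` and `MISMATCH` are hypothetical and are not taken from the NCBI toolkit.

```python
# Toy scoring scheme (an assumption for illustration only; BLAST scores
# words with a substitution matrix, which is not reproduced here).
MATCH, MISMATCH = 5, -2


def word_score(a: str, b: str) -> int:
    """Sum the per-position scores of two equal-length words."""
    return sum(MATCH if x == y else MISMATCH for x, y in zip(a, b))


def find_seeds(query: str, subject: str, w: int, threshold: int):
    """Return (query_pos, subject_pos) pairs whose length-w words
    score at least `threshold` against each other."""
    seeds = []
    for i in range(len(query) - w + 1):
        q_word = query[i:i + w]
        for j in range(len(subject) - w + 1):
            if word_score(q_word, subject[j:j + w]) >= threshold:
                seeds.append((i, j))
    return seeds


if __name__ == "__main__":
    query, subject = "MKVLAAG", "TTMKVLSAGTT"
    # A strict threshold keeps only exact 3-letter matches.
    print(find_seeds(query, subject, w=3, threshold=15))
    # A looser threshold also admits words with one mismatch.
    print(find_seeds(query, subject, w=3, threshold=8))
```

In this sketch the word length `w` and the score `threshold` together decide how many seeds survive, which is the kind of trade-off between running time and retrieval accuracy that the abstract refers to.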
Year: 2007
DOI: 10.1093/bioinformatics/btm479
Venue: Bioinformatics
Keywords: receiver operator characteristic, nucleotides
Field: Data mining, File Transfer Protocol, Heuristic, Receiver operating characteristic, Computer science, Theoretical computer science, Bioinformatics, Seeding, Executable
DocType: Journal
Volume: 23
Issue: 21
ISSN: 1367-4803
Citations: 5
PageRank: 1.19
References: 4
Authors: 4
Name                    Order  Citations  PageRank
Sergey A. Shiryev       1      5          1.19
Jason S. Papadopoulos   2      355        20.72
Alejandro A. Schäffer   3      827        136.66
Richa Agarwala          4      310        58.02