Title
Choosing Non-redundant Representative Subsets Of Protein Sequence Data Sets Using Submodular Optimization.
Abstract
Selecting a non-redundant representative subset of sequences is a common step in many bioinformatics workflows, such as the creation of non-redundant training sets for sequence and structural models or selection of "operational taxonomic units" from metagenomics data. A representative subset is a subset of sequences from the original data set that (1) minimizes the redundancy in the representative sequences, and (2) maximizes the representativeness of the subset; that is, every sequence in the full data set has at least one representative that is similar to it. The selected representative subset is then used in downstream analysis in place of the full data set. Previous methods for this task, such as CD-HIT, PISCES and UCLUST, apply a heuristic threshold-based algorithm that has no theoretical guarantees. These sequence selection methods are very widely used---for example, the CD-HIT papers have been cited a total of >3,000 times (Google Scholar)---and are a standard preprocessing step applied to data sets of protein sequences, cDNA sequences and microbial DNA. In this work, we propose a principled framework, Repset, for representative protein sequence subset selection using submodular optimization. Submodular optimization, a discrete analogue to continuous convex optimization, has been used with great success for other representative set selection problems. Our approach involves defining a submodular objective function that quantifies the desirable properties of a given subset of sequences, and then applying a submodular optimization algorithm to choose a representative subset that maximizes this function. Framing this task as an optimization problem has two benefits. First, it allows us to leverage a large existing literature on submodular optimization. This led to the development of a method that is computationally efficient, empirically outperforms other methods, and, in contrast to all existing solutions to this problem, is backed by theoretical guarantees of its performance. In particular, Repset outperforms threshold-based methods on two measures: (1) representative subsets produced by Repset have lower redundancy, as measured by the pairwise similarity of sequences in the set, and (2) these subsets have greater structural diversity, as measured using the SCOPe library of protein domain structures. Second, the optimization-based framework gives the method great flexibility. The user can select one of a variety of objective functions to optimize according to their needs. For example, the user can minimize the redundancy of sequences in the subset, maximize the representativeness of the subset of the full set, or some combination of the two. The user can also choose to prefer some sequences over others, such as preferring long sequences over shorter ones. More broadly, this paper demonstrates the utility of submodular optimization for computational biology. Applying submodular optimization to a new problem has two simple steps: (1) devise a submodular objective function, and (2) apply a standard optimization algorithm to this objective. Therefore, we believe that the strategy we employ here will have analogous applications to hundreds of other problems in computational biology.
Year
DOI
Venue
2018
10.1002/prot.25461
BCB
Keywords
Field
DocType
discrete optimization,diversity,protein sequence analysis,redundancy,representative subsets,submodular maximization
Data set,Combinatorics,Crystallography,Protein sequencing,Chemistry,Submodular set function
Conference
Volume
Issue
ISSN
86.0
4.0
0887-3585
ISBN
Citations 
PageRank 
978-1-4503-5794-4
1
0.35
References 
Authors
19
3
Name
Order
Citations
PageRank
Maxwell Libbrecht151.84
Jeff A. Bilmes227816.88
William Stafford Noble32907203.56