Title
Compositional adjustment of Dirichlet mixture priors.
Abstract
Dirichlet mixture priors provide a Bayesian formalism for scoring alignments of protein profiles to individual sequences, which can be generalized to constructing scores for multiple-alignment columns. A Dirichlet mixture is a probability distribution over multinomial space, each of whose components can be thought of as modeling a type of protein position. Applied to the simplest case of pairwise sequence alignment, a Dirichlet mixture is equivalent to an implied symmetric substitution matrix. For alphabets of even size L, Dirichlet mixtures with L/2 components and symmetric substitution matrices have an identical number of free parameters. Although this suggests the possibility of a one-to-one mapping between the two formalisms, we show that there are some symmetric matrices no Dirichlet mixture can imply, and others implied by many distinct Dirichlet mixtures. Dirichlet mixtures are derived empirically from curated sets of multiple alignments. They imply "background" amino acid frequencies characteristic of these sets, and should thus be non-optimal for comparing proteins with non-standard composition. Given a mixture Theta, we seek an adjusted Theta' that implies the desired composition, but that minimizes an appropriate relative-entropy-based distance function. To render the problem tractable, we fix the mixture parameter as well as the sum of the Dirichlet parameters for each component, allowing only its center of mass to vary. This linearizes the constraints on the remaining parameters. An approach to finding Theta' may be based on small consecutive parameter adjustments. The relative entropy of two Dirichlet distributions separated by a small change in their parameter values implies a quadratic cost function for such changes. For a small change in implied background frequencies, this function can be minimized using the Lagrange-Newton method. 
We have implemented this method, and can compositionally adjust to good precision a 20-component Dirichlet mixture prior for proteins in under half a second on a standard workstation.
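The quantities the abstract refers to can be illustrated with a minimal sketch. It assumes the standard Dirichlet moment formulas and takes the implied pair frequencies to be the mixture-weighted second moments E[p_i p_j]. The function names, the 4-letter alphabet, and the toy parameter values are hypothetical and are not taken from the paper.

```python
import numpy as np

def implied_background(weights, alphas):
    """Background frequencies implied by a Dirichlet mixture.

    weights: shape (M,), mixture weights summing to 1.
    alphas:  shape (M, L), Dirichlet parameters of each component.
    """
    weights = np.asarray(weights, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    # Center of mass of each component: alpha / sum(alpha).
    centers = alphas / alphas.sum(axis=1, keepdims=True)
    # With the weights and per-component sums fixed, the background
    # is linear in the centers of mass.
    return weights @ centers

def implied_pair_frequencies(weights, alphas):
    """Symmetric pair frequencies q[i, j] = sum_k m_k * E_k[p_i * p_j]."""
    weights = np.asarray(weights, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    n_letters = alphas.shape[1]
    q = np.zeros((n_letters, n_letters))
    for m, a in zip(weights, alphas):
        s = a.sum()
        # Dirichlet second moments: a_i * a_j / (s * (s + 1)) off the diagonal,
        # a_i * (a_i + 1) / (s * (s + 1)) on the diagonal.
        q += m * (np.outer(a, a) + np.diag(a)) / (s * (s + 1.0))
    return q

# Toy two-component mixture over a 4-letter alphabet (illustrative values only).
weights = [0.6, 0.4]
alphas = [[2.0, 1.0, 0.5, 0.5],
          [0.3, 0.3, 1.4, 2.0]]

background = implied_background(weights, alphas)
q = implied_pair_frequencies(weights, alphas)
# Log-odds scores of the implied symmetric substitution matrix, in bits.
scores = np.log2(q / np.outer(background, background))
```

In this sketch the row sums of q reproduce the implied background frequencies, so changing the target composition means changing the mixture itself, which is the adjustment problem the abstract describes.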
Year
2010
DOI
10.1089/cmb.2010.0117
Venue
Journal of Computational Biology
Keywords
algorithms, combinatorics, linear programming, machine learning, statistics
Field
Dirichlet-multinomial distribution, Hierarchical Dirichlet process, Combinatorics, Multinomial distribution, Generalized Dirichlet distribution, Artificial intelligence, Dirichlet distribution, Concentration parameter, Substitution matrix, Mathematics, General Dirichlet series, Machine learning
DocType
Journal
Volume
17
Issue
12
ISSN
1066-5277
Citations
4
PageRank
0.60
References
7
Authors
3
Name                Order  Citations  PageRank
Xugang Ye           1      20         4.40
Yi-Kuo Yu           2      140        14.43
Stephen F Altschul  3      180        26.55