Title
Tandem machine learning for the identification of genes regulated by transcription factors.
Abstract
The identification of promoter regions that are regulated by a given transcription factor has traditionally relied upon the identification and distributions of binding sites recognized by the factor. In this study, we have developed a tandem machine learning approach for the identification of regulatory target genes based on these parameters and on the corresponding binding site information contents that measure the affinities of the factor for these cognate elements.

This method has been validated using models of DNA binding sites recognized by the xenobiotic-sensitive nuclear receptor, PXR/RXRalpha, for target genes within the human genome. An information theory-based weight matrix was first derived and refined from known PXR/RXRalpha binding sites. The promoter region of candidate genes was scanned with the weight matrix. A novel information density-based clustering algorithm was then used to identify clusters of information rich sites. Finally, transformed data representing metrics of location, strength and clustering of binding sites were used for classification of promoter regions using an ensemble approach involving neural networks, decision trees and Naïve Bayesian classification. The method was evaluated on a set of 24 known target genes and 288 genes known not to be regulated by PXR/RXRalpha. We report an average accuracy (proportion of correctly classified promoter regions) of 71%, sensitivity of 73%, and specificity of 70%, based on multiple cross-validation and the leave-one-out strategy. The performance on a test set of 13 genes showed that 10 were correctly classified.

We have developed a machine learning approach for the successful detection of gene targets for transcription factors with high accuracy. The method has been validated for the transcription factor PXR/RXRalpha and has the potential to be extended to other transcription factors.
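The weight-matrix scanning step described in the abstract can be illustrated with a minimal sketch. It assumes a simplified individual-information weight of 2 + log2 f(b, l) bits per position, with a pseudocount and without the small-sample correction or the refinement step the authors mention. The function names, the pseudocount, the threshold, and the toy sequences are all hypothetical; they are not the PXR/RXRalpha sites or the implementation used in the study.

```python
import math

BASES = "ACGT"

def build_weight_matrix(sites, pseudocount=0.5):
    """Return per-position weights w[l][b] = 2 + log2 f(b, l), in bits.

    Assumption: simplified information weights with a pseudocount;
    no small-sample correction is applied.
    """
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    matrix = []
    for pos in range(length):
        # Count each base at this position, starting from the pseudocount.
        counts = {b: pseudocount for b in BASES}
        for s in sites:
            counts[s[pos]] += 1
        total = sum(counts.values())
        matrix.append({b: 2.0 + math.log2(counts[b] / total) for b in BASES})
    return matrix

def scan(sequence, matrix, threshold=0.0):
    """Return (offset, bits) for every window scoring above the threshold."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        bits = sum(matrix[i][base] for i, base in enumerate(window))
        if bits > threshold:
            hits.append((start, bits))
    return hits

# Toy aligned sites and a toy promoter fragment (illustrative only).
toy_sites = ["AGGTCA", "AGTTCA", "AGGTCA", "GGGTCA"]
toy_promoter = "TTTAGGTCAGGG"

matrix = build_weight_matrix(toy_sites)
hits = scan(toy_promoter, matrix)
for offset, bits in hits:
    print(f"offset {offset}: {bits:.2f} bits")
```

On this toy input only the window at offset 3 scores above zero. In the approach the abstract outlines, per-site scores of this kind would then feed the clustering and classification stages, which this sketch does not attempt to reproduce.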
Year: 2005
DOI: 10.1186/1471-2105-6-204
Venue: BMC Bioinformatics
Keywords: cross validation, information theory, machine learning, information content, candidate gene, gene expression profiling, transcription factor, binding sites, artificial intelligence, cluster analysis, decision tree, nuclear receptor, gene regulation, algorithms, gene targeting, human genome, transcription factors, bayesian classification, bioinformatics, neural network, nucleic acids, microarrays, binding site
Field: Tandem, Gene, Binding site, Biology, Artificial intelligence, Bioinformatics, Genetics, Affinities, Gene expression profiling, Transcription factor, DNA microarray, Machine learning
DocType: Journal
Volume: 6
Issue: 1
ISSN: 1471-2105
Citations: 3
PageRank: 0.38
References: 8
Authors: 5
Name                      Order  Citations  PageRank
Deendayal Dinakarpandian  1      66         6.97
Venetia Raheja            2      3          0.38
Saumil Mehta              3      5          1.09
Erin G Schuetz            4      3          0.38
Peter K Rogan             5      40         3.14