Title
Identification of DNA-protein binding sites by bootstrap multiple convolutional neural networks on sequence information.
Abstract
Identification of DNA–protein binding sites in protein sequences plays an essential role in a wide variety of biological processes, and huge volumes of protein sequences have accumulated in the post-genomic era. In this study, we propose a new prediction approach suited to imbalanced DNA–protein binding site data. Specifically, motivated by the imbalance between DNA–protein binding and non-binding sites, we employ the Adaptive Synthetic Sampling (ADASYN) approach to over-sample the positive data and a Bootstrap strategy to under-sample the negative data, balancing the number of binding and non-binding samples. Furthermore, we employ three types of features to encode each protein residue from sequence: the position-specific scoring matrix, one-hot encoding and predicted solvent accessibility. In addition, we design an ensemble convolutional neural network classifier to handle the imbalance between binding and non-binding sites in protein sequences. Extensive experiments were conducted on three real DNA–protein binding site datasets, PDNA-543, PDNA-224 and PDNA-316, to validate the effectiveness of our method in predicting binding sites using ten-fold cross-validation. The experimental results demonstrate that our method achieves high prediction performance and outperforms the state-of-the-art sequence-based DNA–protein binding site predictors in terms of Sensitivity, Specificity, Accuracy, Precision and Matthews Correlation Coefficient (MCC). Our method obtains MCC values of 0.63, 0.48 and 0.67 on the PDNA-543, PDNA-224 and PDNA-316 datasets, respectively. Compared with the state-of-the-art prediction models, the MCC values for our method are increased by at least 0.24, 0.13 and 0.23 on the PDNA-543, PDNA-224 and PDNA-316 datasets, respectively.
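The balancing and evaluation steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function names `bootstrap_negative_subsets` and `binding_site_metrics` are hypothetical, and two points are assumptions rather than details stated in the abstract: that the Bootstrap strategy draws negatives with replacement, and that each ensemble member receives one balanced subset. ADASYN over-sampling and the CNN itself are left out.

```python
import math
import random


def bootstrap_negative_subsets(positives, negatives, n_subsets, seed=0):
    """Build balanced training subsets for an ensemble (illustrative only).

    Each subset keeps all positive (binding) samples and pairs them with an
    equally sized bootstrap sample of negative (non-binding) samples, drawn
    with replacement. Sampling with replacement is an assumption here.
    """
    rng = random.Random(seed)
    subsets = []
    for _ in range(n_subsets):
        sampled_negatives = [rng.choice(negatives) for _ in positives]
        subsets.append((list(positives), sampled_negatives))
    return subsets


def _safe_div(numerator, denominator):
    """Return 0.0 instead of raising when a denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binding_site_metrics(tp, tn, fp, fn):
    """Compute the five metrics listed in the abstract from a confusion matrix."""
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": _safe_div(tp, tp + fn),
        "specificity": _safe_div(tn, tn + fp),
        "accuracy": _safe_div(tp + tn, tp + tn + fp + fn),
        "precision": _safe_div(tp, tp + fp),
        "mcc": _safe_div(tp * tn - fp * fn, mcc_denominator),
    }


if __name__ == "__main__":
    # Toy residue identifiers; real inputs would be per-residue feature windows.
    binding = ["b1", "b2", "b3"]
    non_binding = ["n%d" % i for i in range(30)]
    for pos, neg in bootstrap_negative_subsets(binding, non_binding, n_subsets=2):
        print(len(pos), len(neg))
    print(binding_site_metrics(tp=90, tn=90, fp=10, fn=10))
```

MCC is the useful headline number for data like this because, unlike accuracy, it stays near zero for a classifier that labels every residue as non-binding.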
Year: 2019
DOI: 10.1016/j.engappai.2019.01.003
Venue: Engineering Applications of Artificial Intelligence
Keywords: DNA–protein binding sites, Bootstrap method, Convolutional neural networks, ADASYN sampling
Field: ENCODE, Plasma protein binding, Binding site, Protein sequencing, Convolutional neural network, Computer science, Artificial intelligence, Computational biology, Classifier (linguistics), Machine learning, Bootstrapping (electronics), Encoding (memory)
DocType: Journal
Volume: 79
ISSN: 0952-1976
Citations: 0
PageRank: 0.34
References: 0
Authors: 6
Name            Order  Citations  PageRank
Yongqing Zhang  1      7          2.67
Shaojie Qiao    2      201        25.93
Shengjie Ji     3      0          0.68
Nan Han         4      69         8.64
Dingxiang Liu   5      3          0.91
Jiliu Zhou      6      450        58.21