Title
Weakly-Supervised Convolutional Neural Network Architecture for Predicting Protein-DNA Binding.
Abstract
Although convolutional neural networks (CNNs) have outperformed conventional methods in predicting the sequence specificities of protein-DNA binding in recent years, they do not take full advantage of the weak supervision intrinsic to DNA sequences: a bound sequence may contain multiple transcription factor binding sites (TFBSs). Here, we propose a weakly-supervised convolutional neural network architecture (WSCNN), which combines multiple-instance learning (MIL) with CNN, to further improve the prediction of protein-DNA binding. WSCNN first divides each DNA sequence into multiple overlapping subsequences (instances) with a sliding window, then models each instance separately using a CNN, and finally fuses the predicted scores of all instances in the same bag using four fusion methods: Max, Average, Linear Regression, and Top-Bottom Instances. Experimental results on in vivo and in vitro datasets illustrate the performance of the proposed approach. Moreover, models built on in vitro data using WSCNN can predict in vivo protein-DNA binding with good accuracy.
In addition, through a series of experiments, we give a quantitative analysis of the importance of the reverse-complement mode in predicting in vivo protein-DNA binding, and explain why advanced pooling layers are not used directly to combine MIL with CNN.
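The pipeline described in the abstract (sliding-window instances, per-instance scoring, score fusion) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `toy_score` is a hypothetical stand-in for the per-instance CNN, the window and stride values are arbitrary, the form of `fuse_top_bottom` (mean of the k highest and k lowest scores) is an assumption about how Top-Bottom Instances might work, and Linear Regression fusion is omitted because it requires learned weights.

```python
from typing import Callable, List, Sequence

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string over A/C/G/T."""
    return seq.translate(_COMPLEMENT)[::-1]


def make_bag(seq: str, window: int, stride: int) -> List[str]:
    """Split one sequence (a bag) into overlapping subsequences (instances)."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if len(seq) < window:
        return [seq]  # assumption: a short sequence forms a single instance
    return [seq[i:i + window] for i in range(0, len(seq) - window + 1, stride)]


def fuse_max(scores: Sequence[float]) -> float:
    """Max fusion: the bag score is its highest instance score."""
    return max(scores)


def fuse_average(scores: Sequence[float]) -> float:
    """Average fusion: the bag score is the mean instance score."""
    return sum(scores) / len(scores)


def fuse_top_bottom(scores: Sequence[float], k: int = 1) -> float:
    """Assumed form: mean of the k highest and the k lowest instance scores."""
    if k <= 0:
        raise ValueError("k must be positive")
    ordered = sorted(scores)
    k = min(k, len(ordered))
    selected = ordered[:k] + ordered[-k:]
    return sum(selected) / len(selected)


def toy_score(instance: str) -> float:
    """Hypothetical stand-in for the CNN: the G/C fraction of the instance."""
    return sum(base in "GC" for base in instance) / len(instance)


def predict_bag(
    seq: str,
    score_fn: Callable[[str], float],
    fuse: Callable[[Sequence[float]], float],
    window: int,
    stride: int,
) -> float:
    """Score every instance in the bag, then fuse the scores into one value."""
    scores = [score_fn(instance) for instance in make_bag(seq, window, stride)]
    return fuse(scores)


if __name__ == "__main__":
    bag = make_bag("ACGTACGT", window=4, stride=2)
    print(bag)
    print(predict_bag("GGGGAAAA", toy_score, fuse_max, window=4, stride=2))
    print(reverse_complement("AACG"))
```

In a real model the scoring function would be a trained CNN shared across instances, and a reverse-complement mode could, for example, score both `seq` and `reverse_complement(seq)`; how the paper combines the two strands is not stated in the abstract.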
Year
2020
DOI
10.1109/TCBB.2018.2864203
Venue
IEEE/ACM Transactions on Computational Biology and Bioinformatics
Keywords
DNA, Proteins, Convolutional neural networks, Predictive models, In vivo, In vitro, Sequential analysis
DocType
Journal
Volume
17
Issue
2
ISSN
1545-5963
Citations
1
PageRank
0.35
References
0
Authors
4
Name | Order | Citations/PageRank
Qinhu Zhang | 1 | 40.75
Lin Zhu | 2 | 744.93
Wenzheng Bao | 3 | 2810.40
De-Shuang Huang | 4 | 5532357.50