Title
Insight into the protein solubility driving forces with neural attention.
Abstract
Author summary: The solubility of proteins is a crucial biophysical property for understanding many human diseases and for improving industrial protein production processes. Because of this relevance, computational methods have been devised to study and possibly optimize protein solubility. In this work we apply a deep-learning technique called neural attention to predict protein solubility while "opening" the model itself to interpretation, even though Machine Learning models are usually considered black boxes. Thanks to the attention mechanism, we show that i) our model implicitly learns complex patterns related to emergent, protein folding-related aspects, such as recognizing beta-amyloidosis regions, and that ii) the N- and C-termini are the regions with the highest signal for solubility prediction. When it comes to enhancing the solubility of proteins, we propose, for the first time, to investigate the synergistic effects of tandem mutations instead of "single" mutations, suggesting that this could minimize the number of mutations that need to be proposed.

Protein solubility is a key aspect of many biotechnological, biomedical and industrial processes, such as the production of active proteins and antibodies. In addition, understanding the molecular determinants of protein solubility may be crucial to shed light on the molecular mechanisms of diseases caused by aggregation processes, such as amyloidosis. Here we present SKADE, a novel Neural Network protein solubility predictor, and we show how it can provide novel insight into the mechanisms of protein solubility thanks to its neural attention architecture. First, we show that SKADE compares favorably with state-of-the-art tools while using just the protein sequence as input. Then, thanks to the neural attention mechanism, we use SKADE to investigate the patterns learned during training and we analyse its decision process.
We use this property to show that, while the attention profiles do not correlate with obvious sequence aspects such as the biophysical properties of the amino acids, they suggest that the N- and C-termini are the most relevant regions for solubility prediction, and they are predictive of complex emergent properties such as contact density and the aggregation-prone regions involved in beta-amyloidosis. Moreover, SKADE is able to identify mutations that increase or decrease the overall solubility of the protein, so it can be used for large-scale in-silico mutagenesis of proteins in order to maximize their solubility.
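The two ideas in the abstract, an attention profile over residues and an in-silico mutation scan, can be illustrated with a minimal sketch. This is not SKADE's architecture or its learned parameters: the tables TOY_SCORE and TOY_LOGIT, the function names, and the candidate alphabet are all hypothetical, chosen only to show how an attention-weighted prediction and a single-substitution scan fit together.

```python
import math

# Hypothetical per-residue tables: illustrative numbers only, NOT learned by SKADE.
TOY_SCORE = {"D": 1.0, "E": 1.0, "K": 0.8, "R": 0.6, "I": -1.0, "V": -0.8, "F": -1.0}
TOY_LOGIT = {"I": 1.0, "V": 1.0, "F": 1.0}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention_profile(seq):
    """Return one attention weight per residue; the weights sum to 1."""
    return softmax([TOY_LOGIT.get(aa, 0.0) for aa in seq])


def predict(seq):
    """Toy solubility score: attention-weighted sum of per-residue scores."""
    weights = attention_profile(seq)
    return sum(w * TOY_SCORE.get(aa, 0.0) for w, aa in zip(weights, seq))


def best_single_mutation(seq, alphabet="DEKRIVFAG"):
    """Scan every single substitution and keep the highest-scoring one.

    Returns (score, position, new_residue); position and new_residue are
    None when no substitution improves on the wild-type score.
    """
    best = (predict(seq), None, None)
    for i, old in enumerate(seq):
        for new in alphabet:
            if new == old:
                continue
            mutant = seq[:i] + new + seq[i + 1:]
            score = predict(mutant)
            if score > best[0]:
                best = (score, i, new)
    return best
```

Calling attention_profile("AIVFA") gives a per-residue weight vector that can be inspected like the attention profiles discussed above, and best_single_mutation("AIVFA") returns the substitution that most increases the toy score. A tandem-mutation scan would extend the same loop to pairs of positions.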
Year
2020
DOI
10.1371/journal.pcbi.1007722; 10.1371/journal.pcbi.1007722.r001; 10.1371/journal.pcbi.1007722.r002; 10.1371/journal.pcbi.1007722.r003; 10.1371/journal.pcbi.1007722.r004
Venue
PLOS COMPUTATIONAL BIOLOGY
DocType
Journal
Volume
16
Issue
4
ISSN
1553-734X
Citations
0
PageRank
0.34
References
0
Authors
4
Name              Order  Citations  PageRank
Daniele Raimondi  1      0          0.34
Gabriele Orlando  2      0          0.68
Piero Fariselli   3      851        96.03
Yves Moreau       4      1202       105.05