Title
MSCAN: Multimodal Self-and-Collaborative Attention Network for image aesthetic prediction tasks
Abstract
With the ever-expanding volume of visual images on the Internet, automatic image aesthetic prediction is becoming increasingly important in the computer vision field. Because image aesthetic assessment is a highly subjective and complex task, some researchers resort to user comments to aid aesthetic prediction. However, these methods achieve only limited success because 1) they rely heavily on convolution to extract visual features, which makes it difficult to capture the spatial interactions of visual elements in image composition; and 2) they treat image feature extraction and textual feature extraction as two distinct tasks, ignoring the inter-relationships between the two modalities. We address these challenges by proposing a Multimodal Self-and-Collaborative Attention Network (MSCAN). More specifically, the self-attention module computes the response at a position by attending to all positions in the image, so it can effectively encode the spatial interactions of visual elements. To model the complex relations between image and textual features, a co-attention module jointly performs textual-guided visual attention and visual-guided textual attention. The attended multimodal features are then aggregated and fed into a two-layer MLP to obtain the aesthetic values. Extensive experiments on two large benchmarks demonstrate that the proposed MSCAN outperforms state-of-the-art methods by a large margin on unified aesthetic prediction tasks.
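To make the pipeline described in the abstract concrete, the following is a minimal PyTorch-style sketch of the three components it mentions: non-local self-attention over spatial positions, bidirectional co-attention between visual and textual features, and a two-layer MLP regressor. This is not the authors' implementation; the module names, feature dimensions, head counts, and the multi-head co-attention formulation are illustrative assumptions.

# Minimal sketch of the attention blocks described in the abstract.
# NOT the authors' released code; shapes and formulations are assumptions.
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Non-local style self-attention: each spatial position attends to all positions."""
    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x):                        # x: (B, N, D) visual features flattened over space
        q, k, v = self.query(x), self.key(x), self.value(x)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
        return x + attn @ v                      # residual connection

class CoAttention(nn.Module):
    """Textual-guided visual attention and visual-guided textual attention."""
    def __init__(self, dim, heads=4):            # dim must be divisible by heads
        super().__init__()
        self.t2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2t = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vis, txt):                 # vis: (B, N, D), txt: (B, T, D)
        vis_att, _ = self.t2v(vis, txt, txt)     # visual features attended under textual guidance
        txt_att, _ = self.v2t(txt, vis, vis)     # textual features attended under visual guidance
        return vis_att.mean(1), txt_att.mean(1)  # pooled multimodal features

class MSCANHead(nn.Module):
    """Aggregate attended features and regress aesthetic values with a two-layer MLP."""
    def __init__(self, dim):
        super().__init__()
        self.self_attn = SelfAttention(dim)
        self.co_attn = CoAttention(dim)
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, vis, txt):
        vis = self.self_attn(vis)
        v, t = self.co_attn(vis, txt)
        return self.mlp(torch.cat([v, t], dim=-1))

# Example usage with a 7x7 visual feature map (49 positions) and 20 text tokens:
# scores = MSCANHead(256)(torch.randn(2, 49, 256), torch.randn(2, 20, 256))  # -> (2, 1)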
Year: 2021
DOI: 10.1016/j.neucom.2020.10.046
Venue: Neurocomputing
Keywords: Photo aesthetic assessment, Multimodal learning, Self-attention mechanism, Co-attention mechanism
DocType: Journal
Volume: 430
ISSN: 0925-2312
Citations: 0
PageRank: 0.34
References: 0
Authors: 4
Name            Order   Citations   PageRank
Xiaodan Zhang   1       2           3.41
Xinbo Gao       2       5534        344.56
Lihuo He        3       179         19.11
Wen Lu          4       25          3.35