Abstract |
---|
Recently, there has been a surge of interest in image-text multimodal representation learning, and many neural-network-based models have been proposed to capture the interaction between the two modalities with different forms of interaction functions. Despite their success, a potential limitation of these methods is that a single set of static parameters is insufficient to model all kinds of interactions. To alleviate this problem, we present a dynamic interaction network, in which the parameters of the interaction function are dynamically generated by a meta network. Additionally, to provide the multimodal features that the meta network needs, we propose a new neural module called the Multimodal Transformer. Experimentally, we not only conduct a comprehensive quantitative evaluation on four image-text tasks, but also present interpretable analyses of our models, revealing the internal working mechanism of dynamic parameter learning. |
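The abstract names the core mechanism (a meta network generating the parameters of the interaction function) but this metadata page contains no architectural details. The PyTorch sketch below illustrates the general hypernetwork-style idea under stated assumptions: the class name `DynamicInteraction`, the hidden size, the feature dimensions, and the choice of a per-example linear interaction are all hypothetical and not taken from the paper.

```python
import torch
import torch.nn as nn

class DynamicInteraction(nn.Module):
    """Minimal sketch of a dynamic interaction layer: a meta network predicts
    the weights of the interaction function from the multimodal inputs, rather
    than relying on a single static weight matrix. Dimensions are illustrative."""

    def __init__(self, img_dim: int, txt_dim: int, out_dim: int):
        super().__init__()
        self.out_dim = out_dim
        # Meta network: maps the fused input to a flat parameter vector,
        # which is reshaped into the weights of a per-example linear map.
        self.meta = nn.Sequential(
            nn.Linear(img_dim + txt_dim, 128),
            nn.ReLU(),
            nn.Linear(128, (img_dim + txt_dim) * out_dim),
        )

    def forward(self, img_feat: torch.Tensor, txt_feat: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([img_feat, txt_feat], dim=-1)                # (B, img+txt)
        # Dynamically generated weights: one weight matrix per example.
        w = self.meta(fused).view(-1, self.out_dim, fused.size(-1))    # (B, out, img+txt)
        # Apply the per-example interaction function: y_b = W_b x_b.
        return torch.bmm(w, fused.unsqueeze(-1)).squeeze(-1)           # (B, out)

# Usage with a batch of 8 image/text feature pairs (hypothetical dims).
img = torch.randn(8, 512)   # e.g., visual features from an image encoder
txt = torch.randn(8, 300)   # e.g., textual features from a text encoder
layer = DynamicInteraction(img_dim=512, txt_dim=300, out_dim=256)
print(layer(img, txt).shape)  # torch.Size([8, 256])
```

The distinguishing property, per the abstract, is that the interaction weights `w` are a function of the input pair rather than fixed learned parameters, so each image-text pair is processed by its own interaction function.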
Year | DOI | Venue
---|---|---
2020 | 10.1016/j.neucom.2019.10.103 | Neurocomputing

Keywords | DocType | Volume
---|---|---
Multimodal learning, Dynamic parameters prediction, Deep neural networks | Journal | 379

ISSN | Citations | PageRank
---|---|---
0925-2312 | 1 | 0.36

References | Authors
---|---
0 | 4
Name | Order | Citations | PageRank |
---|---|---|---
Wenshan Wang | 1 | 24 | 9.00 |
Pengfei Liu | 2 | 58 | 7.83 |
Su Yang | 3 | 110 | 14.58 |
Weishan Zhang | 4 | 396 | 52.57 |