| Abstract |
| --- |
| A novel hierarchical multimodal attention-based model is developed in this paper to generate more accurate and descriptive captions for images. Our model is an "end-to-end" neural network consisting of three related sub-networks: a deep convolutional neural network that encodes the image content, a recurrent neural network that identifies the objects in an image sequentially, and a multimodal attention-based recurrent neural network that generates the image caption. The main contribution of our work is that the hierarchical structure and the multimodal attention mechanism are applied together, so that each caption word is generated with multimodal attention on both the intermediate semantic objects and the global visual content. Our experiments on two benchmark datasets have obtained very positive results. |
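The multimodal attention step summarized in the abstract (each caption word attends jointly over the intermediate semantic objects and the global visual content) can be sketched minimally in NumPy. Every dimension, variable name, and the simple dot-product scoring used here is an illustrative assumption, not the paper's actual parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 8            # hypothetical embedding size
num_objects = 5  # intermediate semantic objects from the object-recognition RNN

# Hypothetical inputs: object embeddings, one global visual vector,
# and the caption decoder's hidden state for the current word.
objects = rng.normal(size=(num_objects, d))
global_visual = rng.normal(size=(d,))
word_state = rng.normal(size=(d,))

# Candidate attention contexts: the semantic objects plus the global content.
contexts = np.vstack([objects, global_visual])

# Dot-product attention scores, softmax-normalized into weights.
scores = contexts @ word_state
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# Attended multimodal context vector fed to the caption decoder.
context = weights @ contexts
```

The attention weights form a distribution over all candidate contexts, so the decoder can emphasize either a specific detected object or the global scene when emitting each word.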
| Year | DOI | Venue |
| --- | --- | --- |
| 2017 | 10.1145/3077136.3080671 | SIGIR |
| Keywords | Field | DocType |
| --- | --- | --- |
| Image Captioning, Multimodal Attention, Hierarchical Recurrent Neural Network, Long Short-Term Memory Model | ENCODE, Closed captioning, Convolutional neural network, Computer science, Recurrent neural network, Speech recognition, Time delay neural network, Artificial neural network | Conference |
| ISBN | Citations | PageRank |
| --- | --- | --- |
| 978-1-4503-5022-8 | 5 | 0.38 |
| References | Authors |
| --- | --- |
| 6 | 6 |
| Name | Order | Citations | PageRank |
| --- | --- | --- | --- |
| Yong Cheng | 1 | 21 | 5.17 |
| Huang Fei | 2 | 17 | 4.28 |
| Lian Zhou | 3 | 34 | 5.77 |
| Cheng Jin | 4 | 78 | 14.92 |
| Yuejie Zhang | 5 | 127 | 25.82 |
| Tao Zhang | 6 | 422 | 100.57 |