What Convnets Make for Image Captioning? - Citegraph

Paper Info

Title
What Convnets Make for Image Captioning?

Abstract
A common pipeline for image captioning combines image representations based on convolutional neural networks (CNNs) with sequence modeling based on recurrent neural networks (RNNs). Because captioning performance depends closely on the discriminative capacity of CNNs, our work investigates the effects of different Convnets (CNN models) on image captioning. We train three Convnets on different classification tasks (single-label, multi-label, and multi-attribute) and then feed visual representations from these Convnets into a Long Short-Term Memory (LSTM) network to model the sequence of words. Since the three Convnets focus on different visual content in an image, we propose aggregating them to generate a richer visual representation. Furthermore, during testing, we use an efficient multi-scale augmentation approach based on fully convolutional networks (FCNs). Extensive experiments on the MS COCO dataset provide significant insights into the effects of Convnets. Finally, we achieve results comparable to the state of the art on both caption generation and image-sentence retrieval.
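
The abstract does not say how the three Convnet representations are aggregated or how the multi-scale outputs are combined at test time. The sketch below is only one plausible reading, under two stated assumptions: aggregation is taken to be concatenation of L2-normalized feature vectors, and multi-scale testing is taken to be an element-wise average of the features extracted at each scale. The function names (`l2_normalize`, `aggregate`, `multi_scale_average`) are hypothetical and do not come from the paper.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit L2 norm; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def aggregate(features: Sequence[Sequence[float]]) -> List[float]:
    """Fuse per-Convnet features into one vector.

    ASSUMPTION: fusion is concatenation of L2-normalized vectors.
    The abstract only says the representations are "aggregated".
    """
    fused: List[float] = []
    for feat in features:
        fused.extend(l2_normalize(feat))
    return fused


def multi_scale_average(per_scale: Sequence[Sequence[float]]) -> List[float]:
    """Combine features of one image extracted at several input scales.

    ASSUMPTION: scales are combined by an element-wise mean.
    """
    n_scales = len(per_scale)
    dim = len(per_scale[0])
    return [sum(vec[i] for vec in per_scale) / n_scales for i in range(dim)]


if __name__ == "__main__":
    # Toy stand-ins for single-label, multi-label, and multi-attribute features.
    single_label = [3.0, 4.0]
    multi_label = [0.0, 2.0]
    multi_attribute = [1.0, 0.0]
    visual = aggregate([single_label, multi_label, multi_attribute])
    print(len(visual), visual)
```

In a full pipeline, a fused vector like `visual` would be the image input to the LSTM that models the word sequence; that part is omitted here because the abstract gives no details of the language model's configuration.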

Year	2017
DOI	10.1007/978-3-319-51811-4_34
Venue	Lecture Notes in Computer Science
Keywords	Image captioning, Convolutional neural networks, Aggregation module, Long short-term memory, Multi-scale testing
Field	Closed captioning, Pattern recognition, Convolutional neural network, Computer science, Long short term memory, Recurrent neural network, Artificial intelligence, Sequence modeling, Discriminative model
DocType	Conference
Volume	10132
ISSN	0302-9743
Citations	3
PageRank	0.40
References	24
Authors	3

Authors (3 rows)

Name	Order	Citations	PageRank
Yu Liu	1	198	25.45
Yanming Guo	2	128	13.06
Michael S. Lew	3	2742	166.02
