Title
Learning to Describe Differences Between Pairs of Similar Images
Abstract
In this paper, we introduce the task of automatically generating text to describe the differences between two similar images. We collect a new dataset by crowd-sourcing difference descriptions for pairs of image frames extracted from video-surveillance footage. Annotators were asked to succinctly describe all the differences in a short paragraph. As a result, our novel dataset provides an opportunity to explore models that align language and vision, and capture visual salience. The dataset may also be a useful benchmark for coherent multi-sentence generation. We perform a first-pass visual analysis that exposes clusters of differing pixels as a proxy for object-level differences. We propose a model that captures visual salience by using a latent variable to align clusters of differing pixels with output sentences. We find that, for both single-sentence and multi-sentence generation, the proposed model outperforms models that use attention alone.
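The pixel-level analysis mentioned in the abstract can be illustrated with a short, hypothetical sketch: threshold the per-pixel difference between two aligned frames, then group the changed pixels into connected components, each component serving as a rough proxy for an object-level difference. The function name, threshold, minimum cluster size, and use of connected components below are illustrative assumptions, not the authors' exact pipeline.

import numpy as np
from scipy import ndimage

def difference_clusters(frame_a, frame_b, threshold=0.1, min_size=50):
    """Cluster differing pixels between two aligned frames.

    frame_a, frame_b: H x W x 3 float arrays in [0, 1], assumed aligned.
    threshold: change magnitude above which a pixel counts as differing
               (an assumed value, not taken from the paper).
    min_size: smallest cluster, in pixels, that is kept (assumed, to
              suppress sensor noise).
    Returns a list of (top, left, bottom, right) bounding boxes.
    """
    # Per-pixel change magnitude, averaged over color channels.
    diff = np.abs(frame_a - frame_b).mean(axis=-1)
    # Binary mask of pixels that changed noticeably.
    mask = diff > threshold
    # Connected components group adjacent changed pixels into clusters.
    labels, _ = ndimage.label(mask)
    sizes = np.bincount(labels.ravel())
    boxes = []
    # find_objects returns one bounding slice per label (label i -> index i-1).
    for idx, region in enumerate(ndimage.find_objects(labels), start=1):
        if region is None or sizes[idx] < min_size:
            continue
        ys, xs = region
        boxes.append((ys.start, xs.start, ys.stop, xs.stop))
    return boxes

# Tiny usage example with one synthetic changed region.
a = np.zeros((240, 320, 3))
b = a.copy()
b[100:130, 150:200] = 0.8  # simulate an object-level change
print(difference_clusters(a, b))  # -> [(100, 150, 130, 200)]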
Year: 2018
DOI: 10.18653/v1/d18-1436
Venue: EMNLP
DocType: Conference
Volume: abs/1808.10584
Citations: 0
PageRank: 0.34
References: 21
Authors: 2
Name                      Order  Citations  PageRank
Harsh Jhamtani            1      19         6.51
Taylor Berg-Kirkpatrick   2      554        35.93