Abstract
---
In this paper, we present a model which takes as input a corpus of images with relevant spoken captions and finds a correspondence between the two modalities. We employ a pair of convolutional neural networks to model visual objects and speech signals at the word level, and tie the networks together with an embedding and alignment model which learns a joint semantic space over both modalities. We evaluate our model using image search and annotation tasks on the Flickr8k dataset, which we augmented by collecting a corpus of 40,000 spoken captions using Amazon Mechanical Turk.
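The abstract describes the architecture only at a high level. As a rough illustration of the idea, the following is a minimal PyTorch sketch, not the paper's implementation: two small convolutional encoders map images and speech spectrograms into a shared unit-norm embedding space, and a margin-based ranking loss pushes matched image/caption pairs above mismatched ones. The layer shapes, the 512-dimensional embedding, the 40 mel bands, and the 0.2 margin are all illustrative assumptions rather than values from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageEncoder(nn.Module):
    """Toy stand-in for the paper's image CNN branch."""
    def __init__(self, embed_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, embed_dim)

    def forward(self, images):                    # images: (B, 3, H, W)
        h = self.conv(images).flatten(1)          # (B, 64)
        return F.normalize(self.fc(h), dim=1)     # unit-norm embeddings

class SpeechEncoder(nn.Module):
    """Toy stand-in for the speech CNN branch, over log-mel spectrograms."""
    def __init__(self, n_mels=40, embed_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv1d(64, 128, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.fc = nn.Linear(128, embed_dim)

    def forward(self, spectrograms):              # (B, n_mels, T)
        h = self.conv(spectrograms).flatten(1)    # (B, 128)
        return F.normalize(self.fc(h), dim=1)

def ranking_loss(img_emb, spk_emb, margin=0.2):
    """Margin ranking loss: matched image/caption pairs should score
    at least `margin` higher than every mismatched pair in the batch."""
    sims = img_emb @ spk_emb.t()                  # (B, B) cosine similarities
    pos = sims.diag()                             # matched-pair scores
    cost_c = F.relu(margin + sims - pos.unsqueeze(1))  # rank captions per image
    cost_i = F.relu(margin + sims - pos.unsqueeze(0))  # rank images per caption
    eye = torch.eye(sims.size(0), dtype=torch.bool, device=sims.device)
    cost_c = cost_c.masked_fill(eye, 0.0)         # ignore the matched pairs
    cost_i = cost_i.masked_fill(eye, 0.0)
    return (cost_c.sum() + cost_i.sum()) / sims.size(0)

# Smoke test on random data: a batch of 8 images and 8 spoken captions.
imgs = torch.randn(8, 3, 128, 128)
spec = torch.randn(8, 40, 200)
loss = ranking_loss(ImageEncoder()(imgs), SpeechEncoder()(spec))
print(loss.item())
```

Because both embeddings are L2-normalized, the dot product is a cosine similarity, which keeps the ranking margin on a fixed scale across the two modalities.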
Year | DOI | Venue
---|---|---
2015 | 10.1109/ASRU.2015.7404800 | 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU)
Keywords | Field | DocType
---|---|---
Neural networks, multimodal semantic embeddings | Modalities, Visual Objects, Embedding, Annotation, Pattern recognition, Computer science, Convolutional neural network, Speech recognition, Natural language processing, Artificial intelligence, Machine learning, Semantic space | Journal
Volume | Citations | PageRank
---|---|---
abs/1511.03690 | 16 | 0.73
References | Authors
---|---
13 | 2
Name | Order | Citations | PageRank |
---|---|---|---
David F. Harwath | 1 | 63 | 8.34 |
James Glass | 2 | 3123 | 413.63 |