Title
Deep multimodal semantic embeddings for speech and images
Abstract
In this paper, we present a model which takes as input a corpus of images with relevant spoken captions and finds a correspondence between the two modalities. We employ a pair of convolutional neural networks to model visual objects and speech signals at the word level, and tie the networks together with an embedding and alignment model which learns a joint semantic space over both modalities. We evaluate our model using image search and annotation tasks on the Flickr8k dataset, which we augmented by collecting a corpus of 40,000 spoken captions using Amazon Mechanical Turk.
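The joint-embedding idea the abstract describes (two modality-specific CNN encoders tied into one semantic space, trained so matching image/caption pairs score higher than mismatched ones) can be illustrated with a short sketch. The following is a minimal, hypothetical PyTorch sketch, not the authors' implementation: the layer sizes, the mean-pooled spectrogram encoder, and the margin ranking loss are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageEncoder(nn.Module):
    """Toy CNN mapping an RGB image to a d-dim embedding (sizes illustrative)."""
    def __init__(self, embed_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, embed_dim)

    def forward(self, x):                       # x: (B, 3, H, W)
        h = self.conv(x).flatten(1)             # (B, 64)
        return F.normalize(self.fc(h), dim=1)   # unit-norm embedding

class SpeechEncoder(nn.Module):
    """Toy 1-D CNN over a spectrogram (frequency bins as channels)."""
    def __init__(self, n_freq=40, embed_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_freq, 64, 5, padding=2), nn.ReLU(),
            nn.Conv1d(64, 128, 5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.fc = nn.Linear(128, embed_dim)

    def forward(self, s):                       # s: (B, n_freq, T)
        h = self.conv(s).flatten(1)             # (B, 128)
        return F.normalize(self.fc(h), dim=1)   # same semantic space as images

def margin_ranking_loss(img, spc, margin=0.2):
    """Push each matched image/caption pair above mismatched pairs by a margin."""
    sim = img @ spc.t()                         # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)               # matched-pair scores
    # Hinge cost in both retrieval directions (image->speech, speech->image).
    cost = F.relu(margin + sim - pos) + F.relu(margin + sim - pos.t())
    mask = torch.eye(sim.size(0), dtype=torch.bool)
    return cost.masked_fill(mask, 0.0).mean()   # zero out the positive pairs

if __name__ == "__main__":
    img_enc, spc_enc = ImageEncoder(), SpeechEncoder()
    images = torch.randn(8, 3, 224, 224)        # dummy image batch
    speech = torch.randn(8, 40, 300)            # dummy spectrogram batch
    loss = margin_ranking_loss(img_enc(images), spc_enc(speech))
    loss.backward()                             # gradients flow into both encoders
    print(f"loss = {loss.item():.4f}")
```

Because both encoders emit unit-norm vectors in the same space, image search and annotation both reduce to ranking the opposite modality by dot-product similarity, which is the evaluation setup the abstract mentions.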
Year: 2015
DOI: 10.1109/ASRU.2015.7404800
Venue: 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU)
Keywords: Neural networks, multimodal semantic embeddings
Field: Modalities, Visual Objects, Embedding, Annotation, Pattern recognition, Computer science, Convolutional neural network, Speech recognition, Natural language processing, Artificial intelligence, Machine learning, Semantic space
DocType: Journal
Volume: abs/1511.03690
Citations: 16
PageRank: 0.73
References: 13
Authors: 2
Name              Order  Citations  PageRank
David F. Harwath  1      63         8.34
James Glass       2      31234      13.63