Abstract
---
In this paper, we present a model which takes as input a corpus of images with relevant spoken captions and finds a correspondence between the two modalities. We employ a pair of convolutional neural networks to model visual objects and speech signals at the word level, and tie the networks together with an embedding and alignment model which learns a joint semantic space over both modalities. We evaluate our model using image search and annotation tasks on the Flickr8k dataset, which we augmented by collecting a corpus of 40,000 spoken captions using Amazon Mechanical Turk.
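The abstract describes the architecture only at a high level. As a rough illustration of the idea, the following is a minimal PyTorch sketch, not the paper's implementation: two small convolutional encoders map images and speech spectrograms into a shared unit-norm embedding space, and a margin-based ranking loss pushes matched image/caption pairs above mismatched ones. The layer shapes, the 512-dimensional embedding, the 40 mel bands, and the 0.2 margin are all illustrative assumptions rather than values from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageEncoder(nn.Module):
    """Toy stand-in for the paper's image CNN branch."""
    def __init__(self, embed_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, embed_dim)

    def forward(self, images):                    # images: (B, 3, H, W)
        h = self.conv(images).flatten(1)          # (B, 64)
        return F.normalize(self.fc(h), dim=1)     # unit-norm embeddings

class SpeechEncoder(nn.Module):
    """Toy stand-in for the speech CNN branch, over log-mel spectrograms."""
    def __init__(self, n_mels=40, embed_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv1d(64, 128, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.fc = nn.Linear(128, embed_dim)

    def forward(self, spectrograms):              # (B, n_mels, T)
        h = self.conv(spectrograms).flatten(1)    # (B, 128)
        return F.normalize(self.fc(h), dim=1)

def ranking_loss(img_emb, spk_emb, margin=0.2):
    """Margin ranking loss: matched image/caption pairs should score
    at least `margin` higher than every mismatched pair in the batch."""
    sims = img_emb @ spk_emb.t()                  # (B, B) cosine similarities
    pos = sims.diag()                             # matched-pair scores
    cost_c = F.relu(margin + sims - pos.unsqueeze(1))  # rank captions per image
    cost_i = F.relu(margin + sims - pos.unsqueeze(0))  # rank images per caption
    eye = torch.eye(sims.size(0), dtype=torch.bool, device=sims.device)
    cost_c = cost_c.masked_fill(eye, 0.0)         # ignore the matched pairs
    cost_i = cost_i.masked_fill(eye, 0.0)
    return (cost_c.sum() + cost_i.sum()) / sims.size(0)

# Smoke test on random data: a batch of 8 images and 8 spoken captions.
imgs = torch.randn(8, 3, 128, 128)
spec = torch.randn(8, 40, 200)
loss = ranking_loss(ImageEncoder()(imgs), SpeechEncoder()(spec))
print(loss.item())
```

Because both embeddings are L2-normalized, the dot product is a cosine similarity, which keeps the ranking margin on a fixed scale across the two modalities.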
Year | DOI | Venue
---|---|---
2015 | 10.1109/ASRU.2015.7404800 | 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU)
Keywords | Field | DocType
---|---|---
Neural networks, multimodal semantic embeddings | Modalities, Visual Objects, Embedding, Annotation, Pattern recognition, Computer science, Convolutional neural network, Speech recognition, Natural language processing, Artificial intelligence, Machine learning, Semantic space | Journal
Volume | Citations | PageRank
---|---|---
abs/1511.03690 | 16 | 0.73
References | Authors
---|---
13 | 2
Name | Order | Citations | PageRank |
---|---|---|---
David F. Harwath | 1 | 63 | 8.34 |
James Glass | 2 | 3123 | 413.63 |