Title
Trilingual Semantic Embeddings of Visually Grounded Speech with Self-Attention Mechanisms
Abstract
We propose a trilingual semantic embedding model that associates visual objects in images with segments of speech signals corresponding to spoken words in an unsupervised manner. Unlike existing models, our model incorporates three different languages, namely English, Hindi, and Japanese. To build the model, we used existing English and Hindi datasets and collected a new corpus of Japanese speech captions. These spoken captions are spontaneous descriptions by individual speakers, rather than readings of prepared transcripts. We therefore introduce a self-attention mechanism into the model to better map spoken captions associated with the same image into the embedding space. We expect the self-attention mechanism to efficiently capture relationships between widely separated word-like segments. Experimental results show that the introduction of a third language improves …
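The abstract's core mechanism, self-attention over a sequence of word-like speech segments, can be sketched minimally as scaled dot-product attention. The paper does not publish its exact architecture, so the function below is an illustrative, generic sketch: the feature dimension, projection matrices (`Wq`, `Wk`, `Wv`), and toy inputs are all assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of
    segment-level speech features X of shape (T, d).

    Every output position is a weighted mix of ALL positions,
    so distant word-like segments can interact directly."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (T, T) pairwise affinities
    A = softmax(scores, axis=-1)      # each row sums to 1
    return A @ V                      # context-weighted features

# Toy example: 5 hypothetical word-like segments, 8-dim features.
rng = np.random.default_rng(0)
T, d = 5, 8
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (0.1 * rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 8): one contextualized vector per segment
```

Because the attention weights connect every segment pair regardless of distance, this is one plausible way such a layer could relate widely separated segments, as the abstract describes.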
Year
2020
DOI
10.1109/ICASSP40776.2020.9053428
Venue
ICASSP
DocType
Conference
Citations
0
PageRank
0.34
References
0
Authors
6
Name                 Order  Citations  PageRank
Yasunori Ohishi      1      3          66.45
Akisato Kimura       2      244        28.03
Takahito Kawanishi   3      34         11.04
Kunio Kashino        4      285        68.41
David F. Harwath     5      63         8.34
James Glass          6      31234      13.63