Title |
---|
Trilingual Semantic Embeddings of Visually Grounded Speech with Self-Attention Mechanisms |
Abstract |
---|
We propose a trilingual semantic embedding model that associates visual objects in images with the segments of speech signals corresponding to spoken words, in an unsupervised manner. Unlike existing models, ours incorporates three languages: English, Hindi, and Japanese. To build the model, we used existing English and Hindi datasets and collected a new corpus of Japanese speech captions. These spoken captions are spontaneous descriptions by individual speakers rather than readings of prepared transcripts. We therefore introduce a self-attention mechanism into the model to better map spoken captions associated with the same image into the embedding space, expecting it to efficiently capture relationships between widely separated word-like segments. Experimental results show that the introduction of a third language improves … |
Year | DOI | Venue |
---|---|---|
2020 | 10.1109/ICASSP40776.2020.9053428 | ICASSP |
DocType | Citations | PageRank |
---|---|---|
Conference | 0 | 0.34 |
References | Authors |
---|---|
0 | 6 |
Name | Order | Citations | PageRank |
---|---|---|---|
Yasunori Ohishi | 1 | 36 | 6.45 |
Akisato Kimura | 2 | 244 | 28.03 |
Takahito Kawanishi | 3 | 34 | 11.04 |
Kunio Kashino | 4 | 285 | 68.41 |
David F. Harwath | 5 | 63 | 8.34 |
James Glass | 6 | 3123 | 413.63 |
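The abstract describes self-attention applied to speech-frame embeddings so that widely separated word-like segments can influence one another before a caption-level embedding is formed. A minimal single-head sketch of that idea is shown below; it is not the authors' implementation, and the projection names (`W_q`, `W_k`, `W_v`) and the final mean pooling are illustrative assumptions.

```python
import numpy as np

def self_attention_pool(frames, W_q, W_k, W_v):
    """Single-head self-attention over a sequence of frame embeddings,
    mean-pooled into one fixed-size caption embedding.

    frames: (T, d) array of frame/segment embeddings.
    W_q, W_k, W_v: (d, d) projection matrices (hypothetical).
    """
    Q, K, V = frames @ W_q, frames @ W_k, frames @ W_v
    d = frames.shape[1]
    scores = Q @ K.T / np.sqrt(d)                 # (T, T) pairwise affinities
    scores -= scores.max(axis=1, keepdims=True)   # softmax numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True) # rows sum to 1
    attended = weights @ V                        # each frame mixes in distant frames
    return attended.mean(axis=0)                  # (d,) caption-level vector

# Toy usage: 50 frames of 16-dim embeddings.
rng = np.random.default_rng(0)
T, d = 50, 16
frames = rng.standard_normal((T, d))
W_q, W_k, W_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
caption_emb = self_attention_pool(frames, W_q, W_k, W_v)
```

Because the attention weights connect every frame to every other frame, information from segments far apart in time can contribute to the pooled embedding, which is the property the abstract highlights for spontaneous (untranscribed) captions.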