Title
Trilingual Semantic Embeddings of Visually Grounded Speech with Self-Attention Mechanisms
Abstract
We propose a trilingual semantic embedding model that associates visual objects in images with segments of speech signals corresponding to spoken words in an unsupervised manner. Unlike existing models, our model incorporates three different languages, namely English, Hindi, and Japanese. To build the model, we used existing English and Hindi datasets and collected a new corpus of Japanese speech captions. These spoken captions are spontaneous descriptions by individual speakers, rather than readings of prepared transcripts. We therefore introduce a self-attention mechanism into the model to better map spoken captions associated with the same image into the embedding space. We expect the self-attention mechanism to efficiently capture relationships between widely separated word-like segments. Experimental results show that the introduction of a third language improves …
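The abstract's core mechanism, self-attention over a sequence of word-like speech segments, can be sketched minimally as scaled dot-product attention. The paper does not publish its exact architecture, so the function below is an illustrative, generic sketch: the feature dimension, projection matrices (`Wq`, `Wk`, `Wv`), and toy inputs are all assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of
    segment-level speech features X of shape (T, d).

    Every output position is a weighted mix of ALL positions,
    so distant word-like segments can interact directly."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (T, T) pairwise affinities
    A = softmax(scores, axis=-1)      # each row sums to 1
    return A @ V                      # context-weighted features

# Toy example: 5 hypothetical word-like segments, 8-dim features.
rng = np.random.default_rng(0)
T, d = 5, 8
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (0.1 * rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 8): one contextualized vector per segment
```

Because the attention weights connect every segment pair regardless of distance, this is one plausible way such a layer could relate widely separated segments, as the abstract describes.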
Year
2020
DOI
10.1109/ICASSP40776.2020.9053428
Venue
ICASSP
DocType
Conference
Citations
0
PageRank
0.34
References
0
Authors
6
Name                 Order  Citations  PageRank
Yasunori Ohishi      1      3          66.45
Akisato Kimura       2      244        28.03
Takahito Kawanishi   3      34         11.04
Kunio Kashino        4      285        68.41
David F. Harwath     5      63         8.34
James Glass          6      31234      13.63