Title
Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos
Abstract
Multimodal self-supervised learning is attracting increasing attention, as it makes it possible not only to train large networks without human supervision but also to search for and retrieve data across modalities. In this context, this paper proposes a self-supervised training framework that learns a common multimodal embedding space which, in addition to sharing representations across different modalities, enforces a grouping of semantically similar instances. To this end, we extend the concept of instance-level contrastive learning with a multimodal clustering step in the training pipeline to capture semantic similarities across modalities. The resulting embedding space enables retrieval of samples across all modalities, even from unseen datasets and different domains. To evaluate our approach, we train our model on the HowTo100M dataset and evaluate its zero-shot retrieval capabilities in two challenging domains, namely text-to-video retrieval and temporal action localization, showing state-of-the-art results on four different datasets.
Year
2021
DOI
10.1109/ICCV48922.2021.00791
Venue
ICCV
DocType
Conference
Citations
0
PageRank
0.34
References
0
Authors
13
Name                  Order  Citations  PageRank
Brian Chen            1      0          1.35
Andrew Rouditchenko   2      0          2.03
Kevin Duarte          3      0          0.34
Hilde Kuehne          4      0          1.01
Samuel Thomas         5      5364       6.88
Angie Boggust         6      0          0.68
Rameswar Panda        7      851        4.02
B. Kingsbury          8      41753      35.43
Rogério Feris         9      15298      9.95
David F. Harwath      10     63         8.34
James Glass           11     31234      13.63
Michael Picheny       12     14619      20.15
Shih-Fu Chang         13     0          0.68