Title
Once and for All: Self-supervised Multi-modal Co-training on One-billion Videos at Alibaba
Abstract
Video has grown to be one of the largest mediums on the Internet. E-commerce platforms like Alibaba need to process millions of videos every day, spanning multiple modalities (e.g., visual, audio, image, and text) and a variety of tasks (e.g., retrieval, tagging, and summarization). In this work, we aim to develop a once-and-for-all pretraining technique for diverse modalities and downstream tasks. To achieve this, we make the following contributions: (1) We propose a self-supervised multi-modal co-training framework. It takes cross-modal pseudo-label consistency as the supervision and can jointly learn representations of multiple modalities. (2) We introduce several novel techniques (e.g., sliding-window subset sampling, coarse-to-fine clustering, fast spatial-temporal convolution, and parallel data transmission and processing) to optimize the training process, making stable billion-scale training feasible. (3) We construct a large-scale multi-modal dataset consisting of 1.4 billion videos (~0.5 PB) and train our framework on it. The training takes only 4.6 days on an in-house 256-GPU cluster, and it simultaneously produces pretrained video, audio, image, motion, and text networks. (4) Fine-tuning from our pretrained models, we obtain significant performance gains and faster convergence on diverse multimedia tasks at Alibaba. Furthermore, we also validate the learned representations on public datasets. Despite the domain gap between our commodity-centric pretraining data and the action-centric evaluation data, we show superior results against state-of-the-art methods.
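The cross-modal pseudo-label consistency objective described in contribution (1) can be illustrated with a minimal sketch: cluster one modality's embeddings to obtain pseudo-labels, then train the other modality's classifier head to predict them, and vice versa. This is a toy reconstruction under stated assumptions, not the authors' implementation; the encoder architectures, dimensions, cluster count, and the plain k-means step below are all illustrative placeholders.

```python
# Minimal sketch (assumed, not the paper's code) of cross-modal pseudo-label
# consistency: pseudo-labels from clustering one modality supervise the other.
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.cluster import KMeans

NUM_CLUSTERS = 16   # hypothetical number of pseudo-classes
EMBED_DIM = 128     # hypothetical embedding size


class Encoder(nn.Module):
    """Stand-in encoder; a real model would be a video/audio backbone."""
    def __init__(self, in_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                 nn.Linear(256, EMBED_DIM))
        self.head = nn.Linear(EMBED_DIM, NUM_CLUSTERS)  # pseudo-label classifier

    def forward(self, x):
        z = F.normalize(self.net(x), dim=-1)
        return z, self.head(z)


def pseudo_labels(embeddings):
    """Cluster embeddings offline to obtain pseudo-labels (plain k-means here)."""
    km = KMeans(n_clusters=NUM_CLUSTERS, n_init=10).fit(embeddings.cpu().numpy())
    return torch.as_tensor(km.labels_, dtype=torch.long)


# Toy batch of paired video/audio features (random stand-ins).
video_feats = torch.randn(512, 2048)   # e.g., pooled 3D-CNN features
audio_feats = torch.randn(512, 1024)   # e.g., log-mel statistics

video_enc, audio_enc = Encoder(2048), Encoder(1024)
opt = torch.optim.Adam(
    list(video_enc.parameters()) + list(audio_enc.parameters()), lr=1e-3)

with torch.no_grad():                   # pseudo-labels are computed without gradients
    zv, _ = video_enc(video_feats)
    za, _ = audio_enc(audio_feats)
labels_from_video = pseudo_labels(zv)   # supervise the audio branch
labels_from_audio = pseudo_labels(za)   # supervise the video branch

_, video_logits = video_enc(video_feats)
_, audio_logits = audio_enc(audio_feats)
loss = F.cross_entropy(audio_logits, labels_from_video) + \
       F.cross_entropy(video_logits, labels_from_audio)
opt.zero_grad()
loss.backward()
opt.step()
print(f"cross-modal consistency loss: {loss.item():.3f}")
```

In practice, the pseudo-labels would be refreshed periodically over the whole dataset rather than per batch; the abstract's sliding-window subset sampling and coarse-to-fine clustering address exactly that scaling step at the billion-video level.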
Year: 2021
DOI: 10.1145/3474085.3481541
Venue: International Multimedia Conference
DocType: Conference
Citations: 0
PageRank: 0.34
References: 0
Authors: 9
Name           | Order | Citations | PageRank
Lianghua Huang | 1     | 0         | 1.69
Yu Liu         | 2     | 0         | 0.34
Xiangzeng Zhou | 3     | 0         | 0.34
Ansheng You    | 4     | 0         | 0.68
Ming Li        | 5     | 0         | 0.34
Bin Wang       | 6     | 0         | 0.34
Yingya Zhang   | 7     | 0         | 0.68
Pan Pan        | 8     | 3         | 4.16
Yinghui Xu     | 9     | 172       | 20.23