Title
EXPLORING VISUAL-AUDIO COMPOSITION ALIGNMENT NETWORK FOR QUALITY FASHION RETRIEVAL IN VIDEO
Abstract
Fashion retrieval in video suffers from imperfect visual representation and low-quality search results in e-commerce settings. Previous works generally focus on finding identical images from a visual perspective only, and do not leverage multi-modal information to retrieve high-quality commodities. Because this is a cross-domain problem, instructional or exhibiting audio carries rich semantic information that can facilitate the video-to-shop task. In this paper, we present a novel Visual-Audio Composition Alignment Network (VACANet) for quality fashion retrieval in video. First, we introduce a visual-audio composition module in VACANet that aims to distinguish attentive and residual entities by learning semantic embeddings from both the visual and audio streams. Second, we design a quality alignment training scheme that combines quality-aware triplet mining with a domain alignment constraint for video-to-image adaptation. Finally, extensive experiments on challenging video datasets demonstrate the scalable effectiveness of our model for quality fashion retrieval.
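The abstract names triplet-based training over composed visual and audio embeddings but gives no formulas. As a rough illustration only, the sketch below shows a generic triplet margin loss applied to a fused video embedding. It is not the VACANet implementation: the fusion by element-wise sum, the `quality` weight, the margin value, and all function names are assumptions introduced here, and the visual and audio vectors are assumed to be already projected to the same dimension as the shop-image embeddings.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def compose(visual, audio):
    """Hypothetical fusion: element-wise sum of two equal-length embeddings,
    followed by L2 normalization. The paper's composition module is learned;
    this is only a stand-in."""
    if len(visual) != len(audio):
        raise ValueError("visual and audio embeddings must have the same length")
    return l2_normalize([v + a for v, a in zip(visual, audio)])


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quality_weighted_triplet_loss(anchor, positive, negative,
                                  quality=1.0, margin=0.2):
    """Standard triplet margin loss scaled by an assumed per-triplet quality
    weight. `anchor` is the fused video embedding; `positive` and `negative`
    are shop-image embeddings."""
    violation = (squared_distance(anchor, positive)
                 - squared_distance(anchor, negative)
                 + margin)
    return quality * max(0.0, violation)


# Usage: a video clip whose fused embedding should match one shop image.
video = compose([1.0, 0.0], [1.0, 0.0])
matching_image = [1.0, 0.0]
other_image = [0.0, 1.0]
loss = quality_weighted_triplet_loss(video, matching_image, other_image)
```

In this sketch the loss is zero whenever the matching image is already closer to the video than the non-matching one by at least the margin, so only violating triplets contribute to training.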
Year
2021
DOI
10.1109/ICASSP39728.2021.9413617
Venue
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021)
Keywords
Fashion Retrieval, Visual-Audio Embedding, Multi-modal Learning, Cross-domain Alignment
DocType
Conference
Citations
0
PageRank
0.34
References
0
Authors
8
Name            Order   Citations   PageRank
Yanhao Zhang    1       180         13.90
Jianmin Wu      2       110         9.91
Xiong Xiong     3       0           0.34
Dangwei Li      4       0           0.34
Chenwei Xie     5       0           1.35
Yun Zheng       6       59          11.91
Pan Pan         7       10          4.29
Yinghui Xu      8       0           0.34