Paper Info

Title
Joint Learning for Relationship and Interaction Analysis in Video with Multimodal Feature Fusion

Abstract
To comprehend long-duration videos, the deep video understanding (DVU) task was proposed: recognize interactions at the scene level and relationships at the movie level, and answer questions at both levels. In this paper, we propose a solution to the DVU task that combines joint learning of interaction and relationship prediction with multimodal feature fusion. Our solution handles the DVU task through three jointly learned sub-tasks: scene sentiment classification, scene interaction recognition, and super-scene video relationship recognition. All three use text, visual, and audio features and predict representations in a semantic space. Because sentiment, interaction, and relationship are related to one another, we train them in a unified framework with joint learning. We then answer the DVU video analysis questions from the results of the three sub-tasks. We conduct experiments on the HLVU dataset to evaluate the effectiveness of our method.
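
The abstract names three ingredients: fusion of text, visual, and audio features; prediction in a semantic space; and joint training of the three sub-tasks. The following is a minimal Python sketch of how these pieces could fit together, not the authors' implementation. The abstract does not specify any of the mechanics, so concatenation as the fusion scheme, cosine similarity for label lookup, the equally weighted loss sum, and every function name here (`fuse`, `nearest_label`, `joint_loss`) are assumptions made for illustration.

```python
import math


def fuse(text_feat, visual_feat, audio_feat):
    """Fuse three modality vectors by concatenation (assumed scheme)."""
    return list(text_feat) + list(visual_feat) + list(audio_feat)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_label(pred_vec, label_embeddings):
    """Return the label whose embedding is most similar to the prediction.

    Illustrates "predicting a representation in semantic space": the model
    outputs a vector, and the label is read off by similarity (assumption).
    """
    return max(label_embeddings,
               key=lambda name: cosine(pred_vec, label_embeddings[name]))


def joint_loss(task_losses, weights=None):
    """Weighted sum of per-task losses; equal weights by default (assumption)."""
    if weights is None:
        weights = {name: 1.0 for name in task_losses}
    return sum(weights[name] * loss for name, loss in task_losses.items())


# Hypothetical toy inputs, not data from the paper.
fused = fuse([0.1, 0.2], [0.3], [0.4, 0.5])
labels = {"talks to": [1.0, 0.0], "hugs": [0.0, 1.0]}
predicted = nearest_label([0.9, 0.1], labels)
total = joint_loss({"sentiment": 0.5, "interaction": 1.0, "relationship": 1.5})
```

Summing the three losses lets a single backward pass update shared parameters for all sub-tasks, which is one common way to exploit the relatedness of sentiment, interaction, and relationship that the abstract mentions.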

Year	DOI	Venue
2021	10.1145/3474085.3479214	International Multimedia Conference

DocType	Citations	PageRank
Conference	0	0.34

References	Authors
0	5

Authors (5 rows)

Name	Order	Citations	PageRank
Beibei Zhang	1	33	7.20
Fan Yu	2	0	2.03
Yanxin Gao	3	0	0.34
Tongwei Ren	4	328	30.22
Gang-Shan Wu	5	27	6.75
