Audio-visual Recognition of Overlapped speech for the LRS2 dataset - Citegraph

Paper Info

Title
Audio-visual Recognition of Overlapped speech for the LRS2 dataset

Abstract
Automatic recognition of overlapped speech remains a highly challenging task to date. Motivated by the bimodal nature of human speech perception, this paper investigates the use of audio-visual technologies for overlapped speech recognition. Three issues associated with the construction of audio-visual speech recognition (AVSR) systems are addressed. First, the basic architecture designs of AVSR systems, i.e. end-to-end and hybrid, are investigated. Second, purposefully designed modality fusion gates are used to robustly integrate the audio and visual features. Third, in contrast to a traditional pipelined architecture containing explicit speech separation and recognition components, a streamlined and integrated AVSR system optimized consistently using the lattice-free MMI (LF-MMI) discriminative criterion is also proposed. The proposed LF-MMI time-delay neural network (TDNN) system establishes the state of the art for the LRS2 dataset. Experiments on overlapped speech simulated from the LRS2 dataset suggest that the proposed AVSR system outperforms the audio-only baseline LF-MMI DNN system by up to 29.98% absolute word error rate (WER) reduction, and produces recognition performance comparable to that of a more complex pipelined system. Consistent performance improvements of 4.89% absolute WER reduction over the baseline AVSR system using feature fusion are also obtained.
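
The abstract reports its gains as absolute WER reductions. As a minimal sketch of what that metric means, the Python below computes WER as word-level edit distance over reference length, and contrasts absolute with relative reduction. The function names and the toy inputs are illustrative assumptions and are not taken from the paper or its evaluation tooling.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: word-level edit distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return 100.0 * prev[-1] / len(ref)


def absolute_reduction(baseline_wer: float, system_wer: float) -> float:
    """Difference in WER percentage points (the kind of figure the abstract quotes)."""
    return baseline_wer - system_wer


def relative_reduction(baseline_wer: float, system_wer: float) -> float:
    """Reduction expressed as a percentage of the baseline WER."""
    return 100.0 * (baseline_wer - system_wer) / baseline_wer


if __name__ == "__main__":
    # Toy transcripts, purely hypothetical.
    print(word_error_rate("a b c d", "a x c d"))
    # Hypothetical WERs, not results from the paper.
    print(absolute_reduction(40.0, 10.0), relative_reduction(40.0, 10.0))
```

An absolute reduction is a difference in percentage points, so it depends on how high the baseline WER is; the same absolute figure corresponds to a different relative reduction for a different baseline.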

Year	DOI	Venue
2020	10.1109/ICASSP40776.2020.9054127	ICASSP
DocType	Citations	PageRank
Conference	0	0.34
References	Authors
0	10

Authors (10 rows)

Name	Order	Citations	PageRank
Jianwei Yu	1	8	10.92
Shixiong Zhang	2	107	9.34
Wu Jian	3	119	8.59
Shahram Ghorbani	4	2	1.71
Wu Bo	5	0	0.34
Shiyin Kang	6	150	15.05
Shansong Liu	7	2	5.77
Xunying Liu	8	330	52.46
Helen M. Meng	9	1078	172.82
Dong Yu	10	6264	475.73
