Abstract |
---|
Visual speech recognition (VSR), also known as lipreading, is the task of recognizing a word or phrase from a video clip of lip movement. Traditional VSR methods are limited in that they rely mostly on frontal views of facial movement. This limitation should be relaxed to include lip movement from arbitrary angles. In this paper, we propose a pose-invariant network that can recognize words spoken in input captured from any arbitrary view. An architecture combining a convolutional neural network (CNN) with a bidirectional long short-term memory (LSTM) network is trained in a multi-task manner such that the pose and the spoken word are jointly classified, with pose classification treated as the auxiliary task. The performance of the proposed multi-task learning is evaluated on the OuluVS2 benchmark dataset. The experimental results show that the deep model trained with the proposed multi-task learning method outperforms both previous single-view VSR methods and previous multi-view lipreading methods, achieving a recognition accuracy of 95.0% on OuluVS2. |
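The architecture described in the abstract (a per-frame CNN feeding a bidirectional LSTM, with word classification as the main task and pose classification as the auxiliary task) can be sketched as follows. This is a minimal illustrative sketch, not the paper's exact model: the layer sizes, the number of word/pose classes, and the auxiliary loss weight of 0.5 are all assumptions.

```python
import torch
import torch.nn as nn

class MultiTaskLipNet(nn.Module):
    """CNN + bidirectional LSTM with two heads: word classification
    (main task) and pose classification (auxiliary task).
    Layer sizes are illustrative assumptions, not the paper's."""
    def __init__(self, num_words=10, num_poses=5, feat_dim=64, hidden=32):
        super().__init__()
        # Per-frame CNN feature extractor, applied to every video frame.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, feat_dim),
        )
        # Bidirectional LSTM over the sequence of frame features.
        self.lstm = nn.LSTM(feat_dim, hidden,
                            batch_first=True, bidirectional=True)
        self.word_head = nn.Linear(2 * hidden, num_words)  # main task
        self.pose_head = nn.Linear(2 * hidden, num_poses)  # auxiliary task

    def forward(self, clips):          # clips: (batch, time, 1, H, W)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1)).view(b, t, -1)
        seq, _ = self.lstm(feats)
        last = seq[:, -1]              # summary at the final time step
        return self.word_head(last), self.pose_head(last)

# Joint multi-task loss: word loss plus a weighted auxiliary pose loss.
model = MultiTaskLipNet()
clips = torch.randn(2, 6, 1, 32, 32)   # 2 dummy clips, 6 frames each
word_logits, pose_logits = model(clips)
loss = nn.functional.cross_entropy(word_logits, torch.tensor([0, 1])) \
     + 0.5 * nn.functional.cross_entropy(pose_logits, torch.tensor([2, 3]))
```

Backpropagating the joint loss trains the shared CNN and LSTM on both objectives, which is the mechanism by which the auxiliary pose task encourages pose-invariant word features.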
Year | Venue | Keywords |
---|---|---|
2017 | 2017 24th IEEE International Conference on Image Processing (ICIP) | lipreading, multi view, multi task, pose-invariant, Visual Speech Recognition
Field | DocType | ISSN
---|---|---
Multi-task learning, Task analysis, Pattern recognition, Visualization, Convolutional neural network, Computer science, Phrase, Speech recognition, Artificial intelligence, Facial movement | Conference | 1522-4880
Citations | PageRank | References
---|---|---
0 | 0.34 | 0
Authors |
---|
3 |
Name | Order | Citations | PageRank |
---|---|---|---|
HouJeung Han | 1 | 0 | 0.34 |
Sunghun Kang | 2 | 5 | 2.00 |
Chang D. Yoo | 3 | 375 | 45.88 |