Decoding visemes: improving machine lipreading (PhD thesis). - Citegraph

Paper Info

Title
Decoding visemes: improving machine lipreading (PhD thesis).

Abstract
Machine lipreading (MLR) is speech recognition from visual cues and a niche research problem in speech processing & computer vision. Current challenges fall into two groups: the content of the video, such as rate of speech; or the parameters of the video recording, e.g., video resolution. We show that HD video is not needed to successfully lipread with a computer. The term viseme is used in machine lipreading to represent a visual cue or gesture which corresponds to a subgroup of phonemes where the phonemes are visually indistinguishable. A phoneme is the smallest sound one can utter. Because there are several phonemes per viseme, maps between the units show a many-to-one relationship. Many maps have been presented; we compare these and our results show Lee's is best. We propose a new method of speaker-dependent phoneme-to-viseme maps and compare these to Lee's. Our results show the sensitivity of phoneme clustering and we use our new knowledge to augment a conventional MLR system. It has been observed in MLR that classifiers need training on test subjects to achieve accuracy. Thus machine lipreading is highly speaker-dependent. Conversely, speaker independence is robust classification of non-training speakers. We investigate the dependence of phoneme-to-viseme maps between speakers and show there is not a high variability of visemes, but there is high variability in trajectory between visemes of individual speakers with the same ground truth. This implies a dependency upon the number of visemes within each set for each individual. We show that prior phoneme-to-viseme maps rarely have enough visemes and the optimal size, which varies by speaker, ranges from 11 to 35. Finally we decode from visemes back to phonemes and into words. Our novel approach uses the optimum range of visemes within hierarchical training of phoneme classifiers and demonstrates a significant increase in classification accuracy.
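
The many-to-one relationship between phonemes and visemes described in the abstract can be illustrated with a minimal Python sketch. The grouping below is a hypothetical toy example (the viseme labels, the phoneme symbols, and the helper names `to_visemes` and `invert` are all assumptions made for illustration); it is not Lee's map, nor any map derived in the thesis.

```python
from collections import defaultdict

# Hypothetical toy grouping, for illustration only.
# It is NOT Lee's map or a speaker-dependent map from the thesis.
PHONEME_TO_VISEME = {
    "p": "V_bilabial", "b": "V_bilabial", "m": "V_bilabial",
    "f": "V_labiodental", "v": "V_labiodental",
    "t": "V_alveolar", "d": "V_alveolar",
    "ae": "V_open",
}


def to_visemes(phonemes):
    """Map a phoneme sequence to its viseme sequence (many-to-one)."""
    return [PHONEME_TO_VISEME[p] for p in phonemes]


def invert(mapping):
    """Group phonemes by viseme: the candidates a decoder must choose between."""
    groups = defaultdict(list)
    for phoneme, viseme in mapping.items():
        groups[viseme].append(phoneme)
    return dict(groups)


# Under this toy map, "bat" and "mad" collapse to the same viseme sequence.
bat = to_visemes(["b", "ae", "t"])
mad = to_visemes(["m", "ae", "d"])
same_on_the_lips = bat == mad
```

Inverting the map shows why decoding from visemes back to phonemes is ambiguous: each viseme stands for a set of candidate phonemes, and the size of those sets depends on how many visemes the map contains.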

Year	Venue	Field
2017	arXiv: Computer Vision and Pattern Recognition	Sensory cue, Speech processing, Pattern recognition, Display resolution, Viseme, Gesture, Computer science, Speech recognition, Ground truth, Artificial intelligence, Decoding methods, Cluster analysis
DocType	Volume	Citations
Journal	abs/1710.01288	0
PageRank	References	Authors
0.34	39	1

Authors (1 row)

Name	Order	Citations	PageRank
Helen L. Bear	1	30	7.10

Cited by (0 rows)

References (39 rows)
