Abstract | ||
---|---|---|
We present and evaluate a new Visual Voice Activity Detection (VVAD) method based on spatiotemporal Gabor filters (STem-VVAD). Because spatiotemporal Gabor filters are dynamic, they offer an attractive way to separate speech from non-speech frames in video, yet they have not been used for this purpose before. We evaluate our method on two datasets, which differ in the ratio of speech to non-speech frames (high versus low) and in the head orientation of the speakers (frontal versus profile). We compare models applied to different regions (the mouth, the head, or the entire video frame), and do so for both speaker-dependent, individual models and speaker-independent, generic models. In general, the best performance is obtained by speaker-dependent STem-VVAD applied to the mouth region when information from different speeds is combined. In all these cases, the system outperforms two reference systems, which rely on frame differencing and static Gabor filters respectively, showing that spatiotemporal Gabor filters are indeed beneficial for visual voice activity detection. |
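
The abstract names frame differencing as one of the two reference systems. As a rough illustration of what such a baseline might look like, the sketch below computes the mean absolute intensity change between consecutive grayscale frames and thresholds it. The function names `frame_difference_energy` and `label_speech`, the `(T, H, W)` array layout, and the fixed threshold are all assumptions made for illustration; the record does not describe the reference systems in this detail, and this is not the authors' implementation.

```python
import numpy as np


def frame_difference_energy(frames: np.ndarray) -> np.ndarray:
    """Mean absolute intensity change between consecutive frames.

    frames: grayscale video of shape (T, H, W) (assumed layout), e.g. a
    cropped mouth region. Returns an array of length T - 1.
    """
    frames = frames.astype(np.float64)  # avoid unsigned-integer wraparound
    diffs = np.abs(np.diff(frames, axis=0))  # per-pixel change, shape (T-1, H, W)
    return diffs.mean(axis=(1, 2))  # one motion score per frame transition


def label_speech(energy: np.ndarray, threshold: float) -> np.ndarray:
    """Hypothetical decision rule: flag transitions whose motion exceeds a threshold."""
    return energy > threshold
```

A static measure like this responds to any change in the region, whereas the spatiotemporal filters described in the abstract are evaluated at different speeds, which is the distinction the reported comparison turns on.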
Year | DOI | Venue |
---|---|---|
2015 | 10.1007/s12193-015-0187-2 | J. Multimodal User Interfaces |
Keywords | Field | DocType |
Visual voice activity detection, Facial movements, Spatiotemporal Gabor filters | Computer vision, Mouth region, Computer science, Voice activity detection, Speech recognition, Artificial intelligence, Facial movement | Journal |
Volume | Issue | ISSN |
9 | 3 | 1783-7677 |
Citations | PageRank | References |
2 | 0.50 | 22 |
Authors | ||
3 |
Name | Order | Citations | PageRank |
---|---|---|---|
Bart Joosten | 1 | 7 | 2.35 |
Eric O. Postma | 2 | 195 | 27.10 |
Emiel Krahmer | 3 | 866 | 110.30 |