Abstract |
---|
Visual voice activity detection (V-VAD) uses visual features to predict whether a person is speaking or not. V-VAD is useful whenever audio VAD (A-VAD) is ineffective, either because the acoustic signal is difficult to analyze or because it is simply missing. We propose two deep architectures for V-VAD, one based on facial landmarks and one based on optical flow. Moreover, the available datasets used for training and testing V-VAD lack content variability. We introduce a novel methodology to automatically create and annotate very large in-the-wild datasets - WildVVAD - based on combining A-VAD with face detection and tracking. A thorough empirical evaluation shows the advantage of training the proposed deep V-VAD models with this dataset. |
Year | DOI | Venue
---|---|---
2020 | 10.1109/ICPR48806.2021.9412884 | 2020 25th International Conference on Pattern Recognition (ICPR)

DocType | ISSN | Citations
---|---|---
Conference | 1051-4651 | 0

PageRank | References | Authors
---|---|---
0.34 | 0 | 4
Name | Order | Citations | PageRank |
---|---|---|---|
Sylvain Guy | 1 | 0 | 0.34 |
Stéphane Lathuilière | 2 | 33 | 5.98 |
Pablo Mesejo | 3 | 16 | 3.01 |
Radu Horaud | 4 | 2776 | 261.99 |