Abstract |
---|
We present a novel framework to automatically generate natural gesture motions accompanying speech from audio utterances. Based on a Bi-Directional Long Short-Term Memory (Bi-LSTM) network, our deep network learns speech-gesture relationships with both backward and forward consistencies over a long period of time. Our network regresses a full 3D skeletal pose of a human from perceptual features extracted from the input audio at each time step. Then, we apply combined temporal filters to smooth out the generated pose sequences. We utilize a speech-gesture dataset recorded with a headset and marker-based motion capture to train our network. We validated our approach with a subjective evaluation and compared it against "original" human gestures and "mismatched" human gestures taken from a different utterance. The evaluation result shows that our generated gestures are significantly better than the "mismatched" gestures with respect to time consistency. The generated gestures also show a marginally significant improvement in semantic consistency compared to the "mismatched" gestures. |
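The abstract describes smoothing the per-frame pose regressions with "combined temporal filters", without specifying the filters used. As a minimal illustrative sketch (not the paper's actual post-processing), a moving-average filter over a `(frames, joints * 3)` pose sequence could look like this; the window size and edge-padding strategy here are assumptions:

```python
import numpy as np

def smooth_pose_sequence(poses, window=5):
    """Apply a moving-average temporal filter to a (T, D) pose sequence.

    Illustrative stand-in for the paper's unspecified "combined temporal
    filters". Edge frames are padded by repeating the boundary frames so
    the output keeps the same length as the input.
    """
    half = window // 2
    padded = np.pad(poses, ((half, half), (0, 0)), mode="edge")
    kernel = np.ones(window) / window
    smoothed = np.empty_like(poses, dtype=float)
    for d in range(poses.shape[1]):
        smoothed[:, d] = np.convolve(padded[:, d], kernel, mode="valid")
    return smoothed

# Example: a noisy 100-frame sequence of 57 values (e.g. 19 joints x 3 coords)
rng = np.random.default_rng(0)
seq = np.cumsum(rng.normal(size=(100, 57)), axis=0)
out = smooth_pose_sequence(seq)
```

Smoothing of this kind trades a small amount of motion detail for visibly less frame-to-frame jitter, which is the usual motivation for a temporal filtering pass after per-frame regression.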
Year | DOI | Venue
---|---|---
2018 | 10.1145/3267851.3267878 | 18th ACM International Conference on Intelligent Virtual Agents (IVA '18)
Keywords | Field | DocType
---|---|---
gesture generation, deep learning, neural networks, long short-term memory | Headset, Motion capture, Computer science, Gesture, Time consistency, Utterance, Speech recognition, Artificial intelligence, Deep learning, Artificial neural network, Perception, Multimedia | Conference

Citations | PageRank | References
---|---|---
6 | 0.44 | 14
Authors |
---|
5 |

Name | Order | Citations | PageRank
---|---|---|---
Dai Hasegawa | 1 | 26 | 7.62 |
Naoshi Kaneko | 2 | 12 | 2.23 |
Shinichi Shirakawa | 3 | 83 | 11.70 |
Hiroshi Sakuta | 4 | 18 | 6.18 |
Kazuhiko Sumi | 5 | 192 | 24.84 |