Abstract
Embodied human communication encompasses both verbal (speech) and non-verbal information (e.g., gesture and head movements). Recent advances in machine learning have substantially improved the technologies for generating synthetic versions of both of these types of data: on the speech side, text-to-speech systems are now able to generate highly convincing, spontaneous-sounding speech using unscripted speech audio as the source material; on the motion side, probabilistic motion-generation methods can now synthesise vivid and lifelike speech-driven 3D gesticulation. In this paper, we put these two state-of-the-art technologies together in a coherent fashion for the first time. Concretely, we demonstrate a proof-of-concept system trained on a single-speaker audio and motion-capture dataset that is able to generate both speech and full-body gestures together from text input. In contrast to previous approaches for joint speech-and-gesture generation, we generate full-body gestures from speech synthesis trained on recordings of spontaneous speech from the same person as the motion-capture data. We illustrate our results by visualising gesture spaces and text-speech-gesture alignments, and through a demonstration video at https://simonalexanderson.github.io/IVA2020.
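The abstract describes a two-stage architecture: text is first rendered to spontaneous-sounding speech, and that speech then drives full-body gesture synthesis. The sketch below is a minimal, hypothetical illustration of such a chained pipeline, not the authors' implementation; the class and function names (`SpontaneousTTS`, `SpeechDrivenGestures`, `text_to_speech_and_gesture`) and all parameter values are assumptions introduced purely for illustration.

```python
# Hypothetical sketch of a text -> speech -> gesture pipeline (not the paper's code).
import numpy as np


class SpontaneousTTS:
    """Placeholder for a TTS model trained on unscripted single-speaker audio."""

    def synthesise(self, text: str, sample_rate: int = 22050) -> np.ndarray:
        # Return a mono waveform; a real system would run an acoustic model + vocoder.
        return np.zeros(sample_rate, dtype=np.float32)


class SpeechDrivenGestures:
    """Placeholder for a probabilistic speech-to-gesture model trained on mocap
    of the same speaker as the TTS voice."""

    def synthesise(self, audio: np.ndarray, sample_rate: int = 22050,
                   fps: int = 60, n_joints: int = 69) -> np.ndarray:
        # Return a (frames, joints * 3) pose sequence time-aligned to the audio.
        n_frames = int(len(audio) / sample_rate * fps)
        return np.zeros((n_frames, n_joints * 3), dtype=np.float32)


def text_to_speech_and_gesture(text: str):
    """Generate speech audio and time-aligned full-body gestures from text input."""
    audio = SpontaneousTTS().synthesise(text)            # text -> speech waveform
    motion = SpeechDrivenGestures().synthesise(audio)    # speech -> 3D motion
    return audio, motion


if __name__ == "__main__":
    wav, poses = text_to_speech_and_gesture("Hello, nice to meet you!")
    print(wav.shape, poses.shape)
```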
Year | DOI | Venue |
---|---|---|
2020 | 10.1145/3383652.3423874 | IVA |
DocType | ISSN | Citations
---|---|---|
Conference | Proceedings of the 20th ACM International Conference on Intelligent Virtual Agents (IVA '20), 2020, 3 pages | 0
PageRank | References | Authors
---|---|---|
0.34 | 0 | 5
Name | Order | Citations | PageRank |
---|---|---|---|
Simon Alexanderson | 1 | 25 | 4.66 |
Éva Székely | 2 | 8 | 3.29 |
Gustav Eje Henter | 3 | 37 | 11.40 |
Taras Kucherenko | 4 | 4 | 3.77 |
Jonas Beskow | 5 | 668 | 96.64 |