Title
Do Prosody and Embodiment Influence the Perceived Naturalness of Conversational Agents’ Speech?
Abstract
For conversational agents’ speech, either all possible sentences have to be prerecorded by voice actors or the required utterances can be synthesized. While synthesizing speech is more flexible and economical to produce, it can also reduce the perceived naturalness of the agents, among other reasons because of mistakes at various linguistic levels. In this article, we examine the impact of adequate and inadequate prosody, here particularly in terms of accent placement, on the perceived naturalness and aliveness of the agents. We compare (1) inadequate prosody, as generated by off-the-shelf text-to-speech (TTS) engines with synthetic output; (2) the same inadequate prosody imitated by trained human speakers; and (3) adequate prosody produced by those speakers. The speech was presented either as audio-only or by embodied, anthropomorphic agents, to investigate a potential masking effect of a simultaneous visual representation of those virtual agents. To this end, we conducted an online study with 40 participants who listened to four different dialogues, each presented in the three Speech levels and the two Embodiment levels. Results confirmed that adequate prosody in human speech is perceived as more natural (and the agents are perceived as more alive) than inadequate prosody in both human (2) and synthetic speech (1). Thus, it is not sufficient to just use a human voice for an agent’s speech to be perceived as natural; it is decisive whether the prosodic realisation is adequate or not. Furthermore, and surprisingly, we found no masking effect of speaker embodiment, since neither a human voice with inadequate prosody nor a synthetic voice was judged as more natural when a virtual agent was visible than in the audio-only condition. On the contrary, the human voice was even judged as less “alive” when accompanied by a virtual agent.
In sum, our results emphasize, on the one hand, the importance of adequate prosody for perceived naturalness, especially in terms of accents being placed on important words in the phrase, while showing, on the other hand, that the embodiment of virtual agents plays a minor role in the naturalness ratings of voices.
Year: 2021
DOI: 10.1145/3486580
Venue: ACM Transactions on Applied Perception
Keywords: Embodied conversational agents (ECAs), virtual acoustics, prosody, accentuation, speech, text-to-speech, audio, embodiment
DocType: Journal
Volume: 18
Issue: 4
ISSN: 1544-3558
Citations: 0
PageRank: 0.34
References: 0
Authors: 7
Name            Order  Citations  PageRank
J Ehret         1      0          0.34
Andrea Bönsch   2      8          5.90
L Aspöck        3      0          0.34
CT Röhr         4      0          0.34
Stefan Baumann  5      4          0.84
Martine Grice   6      131        24.80
Janina Fels     7      2          1.11