Title
Predicting Conversation Outcomes Using Multimodal Transformer
Abstract
Analyzing communication effectiveness is an important task for understanding business outcomes. Prior research has shown that voice data can be used to predict communication effectiveness. However, to our knowledge, no existing studies have used both vocal and verbal cues to predict conversation outcomes in naturally occurring, dyadic business interactions. We use recorded audio calls, collected from a partnering Fortune 500 firm, that capture conversations between inside salespeople and business customers. We transcribe these audio files and segment each conversation into customer and salesperson speaker turns, enabling the extraction of audio features and text embeddings for each turn. The speaker turns of a conversation form a time series that can be modeled by temporal models such as LSTMs or transformers. In this paper we propose a multimodal transformer network (MTN) that captures the importance of different speaker turns and effectively predicts the outcome of a call from both audio and text features. The proposed model outperforms the current state of the art, and our results reveal that text features offer superior outcome prediction compared to audio features.
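The abstract's pipeline (fuse per-turn audio features and text embeddings, then apply self-attention across the turn sequence) can be illustrated with a minimal sketch. This is not the authors' MTN implementation: the dimensions, random weights, and pooling/readout choices below are all made-up stand-ins for learned parameters, shown only to make the turn-level attention idea concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

n_turns, d_audio, d_text = 6, 4, 8   # 6 speaker turns in one hypothetical call
d_model = d_audio + d_text            # fused per-turn representation

# One row per speaker turn: [audio features | text embedding], both random
# here in place of real extracted features (e.g. prosody stats, sentence vectors).
turns = np.concatenate(
    [rng.normal(size=(n_turns, d_audio)),
     rng.normal(size=(n_turns, d_text))],
    axis=1)

# Random projections stand in for the learned query/key/value matrices.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

def self_attention(x):
    """Scaled dot-product self-attention across speaker turns."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / np.sqrt(x.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # softmax over turns
    return weights @ v, weights

contextual, attn = self_attention(turns)

# Mean-pool the attended turns and apply a (random) linear head plus sigmoid
# to obtain a call-outcome probability; a trained model learns all weights.
w_out = rng.normal(size=d_model)
score = 1 / (1 + np.exp(-(contextual.mean(axis=0) @ w_out)))
```

In the paper's setting the attention weights over turns are what let the model weigh some customer or salesperson turns more heavily than others when predicting the call outcome.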
Year
2021
DOI
10.1109/IJCNN52387.2021.9533935
Venue
2021 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN)
Keywords
conversation, communication, multimodal, self-attention, transformer, sentiment analysis
DocType
Conference
ISSN
2161-4393
Citations
0
PageRank
0.34
References
0
Authors
5
Name | Order | Citations | PageRank
Can Li | 1 | 8 | 3.13
Wenbo Wang | 2 | 3 | 5.48
Bitty Balducci | 3 | 0 | 0.34
Detelina Marinova | 4 | 0 | 0.68
Yi Shang | 5 | 1383 | 104.53