Title
Predicting Conversation Outcomes Using Multimodal Transformer
Abstract
Analyzing communication effectiveness is an important task for understanding business outcomes. Prior research has shown that voice data can be used to predict communication effectiveness. However, to our knowledge, no existing studies have used both vocal and verbal cues to predict conversation outcomes in naturally occurring, dyadic business interactions. We use recorded audio calls, collected from a partnering Fortune 500 firm, that capture conversations between inside salespeople and business customers. We transcribe these audio files and segment each conversation into customer and salesperson speaker turns, enabling the extraction of audio features and text embeddings for each turn. The speaker turns of a conversation form a time series that can be modeled by temporal models such as LSTMs or transformers. In this paper we propose a multimodal transformer network (MTN) that captures the importance of different speaker turns and effectively predicts the outcome of a call from both audio and text features. The proposed model outperforms the current state of the art, and our results reveal that text features offer superior outcome prediction compared to audio features.
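The abstract's pipeline (fuse per-turn audio features and text embeddings, then apply self-attention across the turn sequence) can be illustrated with a minimal sketch. This is not the authors' MTN implementation: the dimensions, random weights, and pooling/readout choices below are all made-up stand-ins for learned parameters, shown only to make the turn-level attention idea concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

n_turns, d_audio, d_text = 6, 4, 8   # 6 speaker turns in one hypothetical call
d_model = d_audio + d_text            # fused per-turn representation

# One row per speaker turn: [audio features | text embedding], both random
# here in place of real extracted features (e.g. prosody stats, sentence vectors).
turns = np.concatenate(
    [rng.normal(size=(n_turns, d_audio)),
     rng.normal(size=(n_turns, d_text))],
    axis=1)

# Random projections stand in for the learned query/key/value matrices.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

def self_attention(x):
    """Scaled dot-product self-attention across speaker turns."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / np.sqrt(x.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # softmax over turns
    return weights @ v, weights

contextual, attn = self_attention(turns)

# Mean-pool the attended turns and apply a (random) linear head plus sigmoid
# to obtain a call-outcome probability; a trained model learns all weights.
w_out = rng.normal(size=d_model)
score = 1 / (1 + np.exp(-(contextual.mean(axis=0) @ w_out)))
```

In the paper's setting the attention weights over turns are what let the model weigh some customer or salesperson turns more heavily than others when predicting the call outcome.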
Year
2021
DOI
10.1109/IJCNN52387.2021.9533935
Venue
2021 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN)
Keywords
conversation, communication, multimodal, self-attention, transformer, sentiment analysis
DocType
Conference
ISSN
2161-4393
Citations
0
PageRank
0.34
References
0
Authors
5
Name | Order | Citations | PageRank
Can Li | 1 | 8 | 3.13
Wenbo Wang | 2 | 3 | 5.48
Bitty Balducci | 3 | 0 | 0.34
Detelina Marinova | 4 | 0 | 0.68
Yi Shang | 5 | 1383 | 104.53