Title
Casting to Corpus: Segmenting and Selecting Spontaneous Dialogue for TTS with a CNN-LSTM Speaker-Dependent Breath Detector
Abstract
This paper considers utilising breaths to create improved spontaneous-speech corpora for conversational text-to-speech from found audio recordings such as dialogue podcasts. Breaths are of interest since they relate to prosody and speech planning and are independent of language and transcription. Specifically, we propose a semisupervised approach where a fraction of coarsely annotated data is used to train a convolutional and recurrent speaker-specific breath detector operating on spectrograms and zero-crossing rate. The classifier output is used to find target-speaker breath groups (audio segments delineated by breaths) and subsequently select those that constitute clean utterances appropriate for a synthesis corpus. An application to 11 hours of raw podcast audio extracts 1969 utterances (106 minutes), 87% of which are clean and correctly segmented. This outperforms a baseline that performs integrated VAD and speaker attribution without accounting for breaths.
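The abstract defines breath groups as audio segments delineated by breaths. As a minimal sketch of that segmentation step only, the following assumes frame-wise breath decisions are already available from some detector. The function name `breath_groups`, the `min_frames` threshold, and the rule that a group must be bounded by a breath on both sides are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Tuple


def breath_groups(is_breath: List[bool], min_frames: int = 1) -> List[Tuple[int, int]]:
    """Return (start, end) frame index pairs of non-breath runs that are
    bounded by a breath on both sides; end is exclusive.

    Assumption (hypothetical, for illustration): is_breath holds one
    boolean per analysis frame, True where the detector found a breath.
    """
    groups: List[Tuple[int, int]] = []
    start = None          # first frame of the current candidate group
    seen_breath = False   # a group may only open after a breath
    for i, breath in enumerate(is_breath):
        if breath:
            # A breath closes the current candidate group, if any.
            if start is not None and i - start >= min_frames:
                groups.append((start, i))
            start = None
            seen_breath = True
        elif seen_breath and start is None:
            # First non-breath frame after a breath opens a new group.
            start = i
    # A trailing run with no closing breath is deliberately dropped.
    return groups


# Toy frame labels: leading audio, breath, long run, breath, short run, breath, tail.
labels = [False] * 5 + [True] * 3 + [False] * 50 + [True] * 2 + [False] * 10 + [True] * 2 + [False] * 4
print(breath_groups(labels, min_frames=25))
```

Frame indices convert to seconds by multiplying by the frame hop of whatever front end produced the labels. Selecting which groups are clean enough for a synthesis corpus is a separate step that this sketch does not attempt.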
Year: 2019
DOI: 10.1109/icassp.2019.8683846
Venue: 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP)
Keywords: Spontaneous speech, found data, speech synthesis corpora, breath detection, computational paralinguistics
Field: Prosody, Market segmentation, Pattern recognition, Computer science, Spectrogram, Speech recognition, Artificial intelligence, Classifier (linguistics), Detector
DocType: Conference
ISSN: 1520-6149
Citations: 0
PageRank: 0.34
References: 0
Authors: 3
Name; Order; Citations; PageRank
Éva Székely; 1; 19; 4.96
Gustav Eje Henter; 2; 37; 11.40
Joakim Gustafson; 3; 392; 58.37