Casting to Corpus: Segmenting and Selecting Spontaneous Dialogue for TTS with a CNN-LSTM Speaker-Dependent Breath Detector

Paper Info

Title
Casting to Corpus: Segmenting and Selecting Spontaneous Dialogue for TTS with a CNN-LSTM Speaker-Dependent Breath Detector

Abstract
This paper considers utilising breaths to create improved spontaneous-speech corpora for conversational text-to-speech from found audio recordings such as dialogue podcasts. Breaths are of interest since they relate to prosody and speech planning and are independent of language and transcription. Specifically, we propose a semi-supervised approach where a fraction of coarsely annotated data is used to train a convolutional and recurrent speaker-specific breath detector operating on spectrograms and zero-crossing rate. The classifier output is used to find target-speaker breath groups (audio segments delineated by breaths) and subsequently select those that constitute clean utterances appropriate for a synthesis corpus. An application to 11 hours of raw podcast audio extracts 1969 utterances (106 minutes), 87% of which are clean and correctly segmented. This outperforms a baseline that performs integrated VAD and speaker attribution without accounting for breaths.
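
As a rough illustration of the "breath group" idea in the abstract, the sketch below takes frame-level breath decisions and returns the segments delineated by breaths. It is not the authors' implementation: the function name `breath_groups`, the binary per-frame input (assumed to be a thresholded detector output), and the 10 ms default frame shift are all assumptions made for this example.

```python
from typing import List, Tuple


def breath_groups(
    is_breath: List[bool],
    frame_shift_s: float = 0.01,  # assumed frame shift; not stated in the abstract
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds of non-breath runs that are
    bounded by a breath on both sides."""
    groups = []
    start = None         # index of the first non-breath frame after a breath
    seen_breath = False  # ignore audio before the first detected breath
    for i, breath in enumerate(is_breath):
        if breath:
            # A breath closes any open group.
            if seen_breath and start is not None:
                groups.append((start * frame_shift_s, i * frame_shift_s))
            seen_breath = True
            start = None
        elif seen_breath and start is None:
            # First non-breath frame after a breath opens a new group.
            start = i
    # A trailing run with no closing breath is dropped.
    return groups


if __name__ == "__main__":
    frames = [False, False, True, True, False, False, False, True, False, True, False]
    print(breath_groups(frames, frame_shift_s=1.0))
```

Under these assumptions, leading and trailing audio without a breath on both sides is discarded; selecting which groups are clean, single-speaker utterances would be a separate step.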

Year	2019
DOI	10.1109/icassp.2019.8683846
Venue	2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Keywords	Spontaneous speech, found data, speech synthesis corpora, breath detection, computational paralinguistics
Field	Prosody, Market segmentation, Pattern recognition, Computer science, Spectrogram, Speech recognition, Artificial intelligence, Classifier (linguistics), Detector
DocType	Conference
ISSN	1520-6149
Citations	0
PageRank	0.34
References	0
Authors	3

Authors (3 rows)

Name	Order	Citations	PageRank
Éva Székely	1	19	4.96
Gustav Eje Henter	2	37	11.40
Joakim Gustafson	3	392	58.37