The SI TEDx-UM speech database: a new Slovenian Spoken Language Resource. - Citegraph

Paper Info

Title
The SI TEDx-UM speech database: a new Slovenian Spoken Language Resource.

Abstract
This paper presents a new Slovenian spoken language resource built from TEDx Talks. The speech database contains 242 talks with a total duration of 54 hours. The annotation and transcription of the acquired spoken material were generated automatically by applying acoustic segmentation and automatic speech recognition. The development and evaluation subset was also manually transcribed following the guidelines specified for the Slovenian GOS corpus. The manual transcriptions were used to evaluate the quality of the unsupervised transcriptions. The average word error rate for the SI TEDx-UM evaluation subset was 50.7%, with an out-of-vocabulary rate of 24% and a language model perplexity of 390. The unsupervised transcriptions contain 372k tokens, of which 32k are distinct.

Year	Venue	Keywords
2016	LREC 2016 - TENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION	spoken language resource,under-resourced language,unsupervised transcriptions
Field	DocType	Citations
Speech corpus,Computer science,Speech recognition,Natural language processing,Artificial intelligence,Linguistics,Spoken language	Conference	0
PageRank	References	Authors
0.34	0	3

Authors (3 rows)

Name	Order	Citations	PageRank
Andrej Zgank	1	89	11.55
Mirjam Sepesy Maučec	2	506	26.34
Darinka Verdonik	3	16	4.76
