Overcoming Data Sparsity in Automatic Transcription of Dictated Medical Findings - Citegraph

Paper Info

Title
Overcoming Data Sparsity in Automatic Transcription of Dictated Medical Findings

Abstract
This paper presents a method for introducing class n-gram language models as a means for overcoming data sparsity in the training of an automatic speech recognition (ASR) system aimed at transcription of dictated medical findings composed predominantly in the Serbian language, including occasional phrases in Latin. The classes used by the model are defined with the specific aim of avoiding the need of identifying an appropriate orthographic expansion of each abbreviation, number or other non-orthographic element in a particular context. Generated language models are decoded in Kaldi using token passing, and lattices generated in this way are rescored using recurrent neural network language models (RNNLM). Although the proposed approach requires extensive effort for initial definition of classes based on existing text corpora of medical findings, it improves the quality of the model and increases the degree of automation in the processing of future training corpora. As such, the proposed method is particularly suitable for training on noisy data, full of misspellings and other errors, such as medical findings. The feasibility of the approach has been tested on a corpus of medical findings in the domain of radiology, where a perplexity score of 59.55 and word error rate of 1.4% have been achieved.
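As a rough illustration of the class-based idea described in the abstract, the sketch below maps every numeric token to a single class before counting bigrams, so a number never seen in training is scored exactly like one that was. This is not the authors' implementation: the `<NUM>` class, the function names, the add-alpha smoothing, and the English sample phrases are all assumptions made for the example, and only the class-sequence part of a class n-gram model is shown (the within-class term P(word | class) is left out).

```python
import math
import re
from collections import Counter

NUM_CLASS = "<NUM>"  # hypothetical class token standing in for any number


def to_class(token):
    """Map a token to its class; here only plain numbers are grouped."""
    return NUM_CLASS if re.fullmatch(r"\d+([.,]\d+)?", token) else token


def class_sequence(sentence):
    """Class-mapped tokens with sentence boundary markers."""
    return ["<s>"] + [to_class(t) for t in sentence.split()] + ["</s>"]


def train_class_bigram(sentences):
    """Count class bigrams and their histories over class-mapped text."""
    bigrams, histories, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        classes = class_sequence(sentence)
        for prev, cur in zip(classes, classes[1:]):
            bigrams[(prev, cur)] += 1
            histories[prev] += 1
            vocab.add(cur)
    return bigrams, histories, len(vocab)


def class_bigram_prob(prev, cur, model, alpha=1.0):
    """Add-alpha smoothed estimate of P(class(cur) | class(prev))."""
    bigrams, histories, vocab_size = model
    p, c = to_class(prev), to_class(cur)
    return (bigrams[(p, c)] + alpha) / (histories[p] + alpha * vocab_size)


def perplexity(sentence, model):
    """Perplexity of the class sequence of one sentence."""
    classes = class_sequence(sentence)
    pairs = list(zip(classes, classes[1:]))
    log_prob = sum(math.log(class_bigram_prob(p, c, model)) for p, c in pairs)
    return math.exp(-log_prob / len(pairs))


# Invented sample phrases, not taken from the paper's corpus.
model = train_class_bigram(["lesion 12 mm", "lesion 7 mm"])
seen = perplexity("lesion 12 mm", model)
unseen = perplexity("lesion 35 mm", model)  # 35 never occurred in training
```

Because both "12" and "35" collapse to the same class, `seen` and `unseen` are identical, which is the sparsity benefit in miniature: the model needs no separate statistics, and no spelled-out form, for each individual number.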

Year	Venue	Keywords
2022	2022 30th European Signal Processing Conference (EUSIPCO)	the Serbian language, language modeling, class-based LM, code switching
DocType	ISSN	ISBN
Conference	2219-5491	978-1-6654-6799-5
Citations	PageRank	References
0	0.34	11
Authors
5

Authors (5 rows)

Name	Order	Citations	PageRank
Edvin Pakoci	1	0	0.34
Darko Pekar	2	0	0.34
Branislav M. Popovic	3	96	17.13
Milan Sečujski	4	2	0.92
Vlado Delić	5	52	12.26
