Title
Overcoming Data Sparsity in Automatic Transcription of Dictated Medical Findings
Abstract
This paper presents a method for introducing class n-gram language models to overcome data sparsity in the training of an automatic speech recognition (ASR) system for transcribing dictated medical findings written predominantly in Serbian, with occasional phrases in Latin. The classes used by the model are defined with the specific aim of avoiding the need to identify an appropriate orthographic expansion of each abbreviation, number or other non-orthographic element in a particular context. Decoding with the generated language models is performed in Kaldi using token passing, and the resulting lattices are rescored using recurrent neural network language models (RNNLM). Although the proposed approach requires extensive effort for the initial definition of classes based on existing text corpora of medical findings, it improves the quality of the model and increases the degree of automation in the processing of future training corpora. As such, the proposed method is particularly suitable for training on noisy data full of misspellings and other errors, such as medical findings. The feasibility of the approach has been tested on a corpus of medical findings in the domain of radiology, where a perplexity of 59.55 and a word error rate of 1.4% have been achieved.
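The idea of replacing non-orthographic elements with classes can be illustrated with a minimal sketch. It is not the authors' implementation: the class labels `<NUM>` and `<ABBR>`, the abbreviation list, the helper names and the toy English corpus are all assumptions made for illustration, and the estimates are unsmoothed maximum-likelihood class bigrams rather than the models built for Kaldi.

```python
import re
from collections import Counter

# Hypothetical class inventory; the abstract does not list the paper's classes.
ABBREVIATIONS = {"mm", "cm"}

def to_class(token):
    """Map a number or known abbreviation to a class label; keep other tokens."""
    if re.fullmatch(r"\d+([.,]\d+)?", token):
        return "<NUM>"
    if token.lower().rstrip(".") in ABBREVIATIONS:
        return "<ABBR>"
    return token

def train_class_bigram(sentences):
    """Count history unigrams and bigrams over class-mapped token sequences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        seq = ["<s>"] + [to_class(t) for t in sentence.split()] + ["</s>"]
        unigrams.update(seq[:-1])          # every position that acts as a history
        bigrams.update(zip(seq, seq[1:]))  # adjacent class pairs
    return unigrams, bigrams

def transition_prob(prev_token, token, unigrams, bigrams):
    """Unsmoothed maximum-likelihood P(class(token) | class(prev_token))."""
    prev_c, cur_c = to_class(prev_token), to_class(token)
    if unigrams[prev_c] == 0:
        return 0.0
    return bigrams[(prev_c, cur_c)] / unigrams[prev_c]

# Toy corpus (assumed data, not taken from the paper).
corpus = ["lesion 12 mm", "lesion 7 mm", "nodule 3 cm"]
unigrams, bigrams = train_class_bigram(corpus)

# "45" never occurs in the corpus, but it shares the <NUM> class statistics.
p_unseen_number = transition_prob("lesion", "45", unigrams, bigrams)
```

Because every number maps to the same class, a value never seen in training still receives the probability mass learned for numbers in that context, which is the sparsity benefit the abstract refers to. A full class-based model would also need a within-class distribution over members, which this sketch leaves out.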
Year
2022
Venue
2022 30th European Signal Processing Conference (EUSIPCO)
Keywords
the Serbian language, language modeling, class-based LM, code switching
DocType
Conference
ISSN
2219-5491
ISBN
978-1-6654-6799-5
Citations
0
PageRank
0.34
References
11
Authors
5
Name | Order | Citations | PageRank
Edvin Pakoci | 1 | 0 | 0.34
Darko Pekar | 2 | 0 | 0.34
Branislav M. Popovic | 3 | 96 | 17.13
Milan Sečujski | 4 | 2 | 0.92
Vlado Delić | 5 | 52 | 12.26