Title
Automatic Estimation Of Language Model Parameters For Unseen Words Using Morpho-Syntactic Contextual Information
Abstract
Various information sources naturally contain new words that appear on a daily basis and are not present in the vocabulary of the speech recognition system, yet are important for applications such as closed-captioning or information dissemination. To be recognized, those words need to be included in the vocabulary and the language model (LM) parameters updated. In this context, we propose a new method that allows new words to be included in the vocabulary even if no well-suited training data is available, as is the case for archived documents, and without the need for LM retraining. It uses morpho-syntactic information about an in-domain corpus and part-of-speech word classes to define a new LM unigram distribution associated with the updated vocabulary. Experiments were carried out for a European Portuguese Broadcast News transcription system. Results showed a relative reduction of 4% in word error rate, with 78% of the occurrences of those newly included words being correctly recognized.
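The abstract does not give the estimation formula, so the following is only a minimal sketch of the general idea of deriving unigram probabilities for new words from part-of-speech classes. It assumes that each new word takes an equal part of a fixed fraction of the unigram mass of its POS class, and that existing words of that class are discounted so the distribution still sums to one. The function name `extend_unigram`, the `share` parameter, and the toy probabilities are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def extend_unigram(unigram, pos_of, new_words, share=0.05):
    """Return a unigram distribution extended with new words.

    unigram:   dict word -> probability for the existing vocabulary
    pos_of:    dict word -> POS class for the existing vocabulary
    new_words: dict new word -> POS class
    share:     assumed fraction of each class's mass given to its new words
    """
    # Total unigram mass of each POS class in the existing vocabulary.
    class_mass = defaultdict(float)
    for word, prob in unigram.items():
        class_mass[pos_of[word]] += prob

    # Group the new words by POS class.
    new_by_class = defaultdict(list)
    for word, tag in new_words.items():
        if word in unigram:
            raise ValueError(f"{word!r} is already in the vocabulary")
        if class_mass.get(tag, 0.0) == 0.0:
            raise ValueError(f"no existing mass for POS class {tag!r}")
        new_by_class[tag].append(word)

    updated = {}
    # Discount existing words only in classes that receive new words.
    for word, prob in unigram.items():
        scale = 1.0 - share if pos_of[word] in new_by_class else 1.0
        updated[word] = prob * scale

    # Split the freed mass of each class equally among its new words.
    for tag, words in new_by_class.items():
        for word in words:
            updated[word] = share * class_mass[tag] / len(words)
    return updated


# Toy example with made-up probabilities.
base = {"o": 0.5, "governo": 0.3, "lisboa": 0.2}
tags = {"o": "DET", "governo": "NOUN", "lisboa": "PROPN"}
extended = extend_unigram(base, tags, {"obama": "PROPN"})
```

Because mass is only moved within a class, the extended distribution stays normalized without retraining the rest of the model, which matches the motivation stated in the abstract, although the actual weighting used by the authors may differ.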
Year
2008
Venue
INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5
Keywords
morpho-syntactic analysis, POS tags, class-based language models, broadcast news, transcription systems
Field
European Portuguese, Broadcasting, Computer science, Word error rate, Speech recognition, Natural language processing, Artificial intelligence, Information dissemination, Syntax, Vocabulary, Language model, Retraining
DocType
Conference
Citations
3
PageRank
0.50
References
7
Authors
3
Name
Order
Citations
PageRank
Ciro Martins110011.90
António J. S. Teixeira215235.26
João Paulo Neto329132.69