Title
Enhancing Low Resource Keyword Spotting With Automatically Retrieved Web Documents
Abstract
Keyword Spotting (KWS) systems developed for low resource languages with very little transcribed audio suffer due to a small vocabulary (high out-of-vocabulary (OOV) rate) and a weak language model. In this paper, we propose to augment such systems using automatically retrieved web documents. Our procedure can find large volumes of web documents similar to a small pool of training transcriptions within a few hours, by querying a search engine with automatically generated query terms. We then use simple language identification to extract high-confidence text for lexicon expansion and language modeling. Experiments using six very limited language packs (VLLP) from the IARPA-Babel program show web documents can cut the OOV rate by half on the development set, and on average improve keyword spotting performance by 2.8 points absolute measured by the Actual Term Weighted Value (ATWV). In particular, we find most of the gains (8.7 points on average) are from keywords that were OOV in the baseline system, and are converted into in-vocabulary (IV) through lexicon expansion. These gains are obtained even after using subword units (unsupervised syllable-like units and sequences of phones), which are known to greatly enhance OOV keyword search performance.
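The abstract's central measurement, the drop in OOV rate after lexicon expansion, can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names `oov_rate` and `expand_lexicon`, the `min_count` threshold, and the toy word lists are hypothetical, and the sketch assumes the OOV rate is counted over running tokens and that the web text has already passed language identification.

```python
from collections import Counter


def oov_rate(tokens, lexicon):
    """Fraction of running tokens that are missing from the lexicon."""
    if not tokens:
        return 0.0
    missing = sum(1 for tok in tokens if tok not in lexicon)
    return missing / len(tokens)


def expand_lexicon(lexicon, web_tokens, min_count=1):
    """Add web-derived words seen at least min_count times (hypothetical threshold)."""
    counts = Counter(web_tokens)
    return set(lexicon) | {w for w, c in counts.items() if c >= min_count}


# Toy data, invented for illustration only.
base_lexicon = {"the", "cat", "sat"}
web_tokens = ["the", "dog", "sat", "dog", "mat"]
dev_tokens = ["the", "dog", "sat", "on", "the", "mat"]

baseline = oov_rate(dev_tokens, base_lexicon)
expanded_lexicon = expand_lexicon(base_lexicon, web_tokens)
expanded = oov_rate(dev_tokens, expanded_lexicon)
print(f"baseline OOV rate: {baseline:.3f}, after expansion: {expanded:.3f}")
```

In this toy case the words `dog` and `mat` move from OOV to in-vocabulary, which is the same kind of conversion the abstract credits for most of the ATWV gains.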
Year
2015
Venue
16TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2015), VOLS 1-5
Keywords
web document retrieval, keyword spotting, language modeling
Field
Keyword density, Information retrieval, Computer science, Speech recognition, Keyword spotting
DocType
Conference
Citations
7
PageRank
0.47
References
6
Authors
6
Name | Order | Citations | PageRank
Le Zhang | 1 | 268 | 32.16
Damianos Karakos | 2 | 221 | 19.35
William Hartmann | 3 | 64 | 10.66
Roger Hsiao | 4 | 57 | 3.32
Richard M. Schwartz | 5 | 2839 | 717.76
Stavros Tsakalidis | 6 | 213 | 13.83