Title
Improving Speech Recognition And Keyword Search For Low Resource Languages Using Web Data
Abstract
We describe the use of text data scraped from the web to augment language models for automatic speech recognition and keyword search in low-resource languages. We scrape text from multiple genres, including blogs, online news, translated TED talks, and subtitles. Using linearly interpolated language models, we find that blogs and movie subtitles are more relevant than the other genres for language modeling of conversational telephone speech, and that they yield large reductions in out-of-vocabulary keywords. Furthermore, we show that the web data can improve term error rate by 3.8% absolute and Maximum Term-Weighted Value in keyword search by 0.0076 to 0.1059 absolute points. Much of the gain comes from the reduction of out-of-vocabulary items.
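The linear interpolation and out-of-vocabulary (OOV) reduction mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes add-alpha smoothed unigram models instead of the n-gram models a real system would use, the toy sentences are invented, and the names `unigram_lm`, `interpolate`, `perplexity`, and `oov_rate` are hypothetical helpers introduced here.

```python
import math
from collections import Counter


def unigram_lm(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(t for t in tokens if t in vocab)
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def interpolate(lm_a, lm_b, lam):
    """Linear interpolation: P(w) = lam * P_a(w) + (1 - lam) * P_b(w)."""
    return {w: lam * lm_a[w] + (1.0 - lam) * lm_b[w] for w in lm_a}


def perplexity(lm, tokens):
    """Perplexity of an LM on held-out tokens (all tokens must be in the LM)."""
    log_prob = sum(math.log(lm[t]) for t in tokens)
    return math.exp(-log_prob / len(tokens))


def oov_rate(keywords, vocab):
    """Fraction of keywords that are missing from the vocabulary."""
    return sum(1 for k in keywords if k not in vocab) / len(keywords)


# Toy data, invented for illustration only.
in_domain = "yes i will call you later yes".split()
web_text = "call me later tonight i will watch the movie".split()
held_out = "i will call you tonight".split()

# Adding web text grows the vocabulary shared by both component models.
vocab = set(in_domain) | set(web_text)
lm_in = unigram_lm(in_domain, vocab)
lm_web = unigram_lm(web_text, vocab)

# Grid search for the interpolation weight with the lowest held-out perplexity.
weights = [i / 10 for i in range(11)]
best_lam = min(
    weights,
    key=lambda lam: perplexity(interpolate(lm_in, lm_web, lam), held_out),
)

# Keyword OOV rate before and after adding the web vocabulary.
keywords = ["tonight", "movie", "later"]
oov_before = oov_rate(keywords, set(in_domain))
oov_after = oov_rate(keywords, vocab)
```

In this toy setup the interpolation weight is tuned on held-out text, and the drop from `oov_before` to `oov_after` mirrors, in miniature, the kind of OOV keyword reduction the abstract reports.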
Year
2015
Venue
16th Annual Conference of the International Speech Communication Association (INTERSPEECH 2015), Vols 1-5
Keywords
web resources, web scraping, keyword search, low-resource languages
Field
Computer science, Word error rate, Keyword search, Speech recognition, Natural language processing, Artificial intelligence, Augment, Language model
DocType
Conference
Citations
7
PageRank
0.49
References
15
Authors
8
Name               Order  Citations  PageRank
Gideon Mendels     1      11         1.65
Erica Cooper       2      51         4.19
Victor Soto        3      8          1.55
Julia Hirschberg   4      2982       448.62
Mark J. F. Gales   5      3905       367.45
Kate Knill         6      249        28.02
Anton Ragni        7      98         9.06
Haipeng Wang       8      40         4.25