Title
IMPROVED DATA SELECTION FOR DOMAIN ADAPTATION IN ASR
Abstract
Automatic speech recognition (ASR) systems are highly sensitive to train-test domain mismatch. However, because transcription is often prohibitively expensive, it is important to be able to make use of available transcribed out-of-domain data. We address the problem of domain adaptation with semi-supervised training (SST). In contrast to prior work on in-domain SST, we find significant performance improvement even with just one hour of target-domain data, though the selection of that data is critical. We show that minimum phone error rate is a good oracle measure for selection, and we approximate this measure using the average phone confidence of an utterance. With larger domain shifts, we also find that deletions and low lexical diversity are serious issues, which we address by incorporating phone rate into our selection metric. With our proposed selection criterion, we see relative improvements of up to 57% over the out-of-domain baseline model. Furthermore, this selection method generalizes well, matching or outperforming word-level confidence selection across six separate domain shift conditions.
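The selection idea in the abstract can be illustrated with a minimal sketch. The abstract does not state how phone confidence and phone rate are combined, so the additive score, the `rate_weight` parameter, the `Utterance` fields, and the greedy duration budget below are all assumptions made for illustration, not the authors' method.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    utt_id: str
    phone_confidences: List[float]  # one confidence per decoded phone, in [0, 1]
    duration_sec: float             # audio duration in seconds


def avg_phone_confidence(utt: Utterance) -> float:
    """Mean per-phone confidence; 0.0 if no phones were decoded."""
    if not utt.phone_confidences:
        return 0.0
    return sum(utt.phone_confidences) / len(utt.phone_confidences)


def phone_rate(utt: Utterance) -> float:
    """Decoded phones per second; a low rate may indicate deletions."""
    if utt.duration_sec <= 0:
        return 0.0
    return len(utt.phone_confidences) / utt.duration_sec


def selection_score(utt: Utterance, rate_weight: float = 0.05) -> float:
    # ASSUMPTION: a simple weighted sum. The paper's actual combination
    # of confidence and phone rate is not given in the abstract.
    return avg_phone_confidence(utt) + rate_weight * phone_rate(utt)


def select(utts: List[Utterance], budget_sec: float,
           rate_weight: float = 0.05) -> List[Utterance]:
    """Greedily pick the highest-scoring utterances within a duration budget."""
    ranked = sorted(utts, key=lambda u: selection_score(u, rate_weight),
                    reverse=True)
    chosen, total = [], 0.0
    for utt in ranked:
        if total + utt.duration_sec > budget_sec:
            continue  # skip utterances that would exceed the budget
        chosen.append(utt)
        total += utt.duration_sec
    return chosen
```

In this sketch, setting `rate_weight=0.0` reduces the score to plain average phone confidence, which makes it easy to compare the two rankings on the same pool of automatically transcribed utterances.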
Year: 2021
DOI: 10.1109/ICASSP39728.2021.9413869
Venue: 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021)
Keywords: Domain adaptation, data selection, semi-supervised training
DocType: Conference
Citations: 0
PageRank: 0.34
References: 0
Authors: 4
Name | Order | Citations | PageRank
Shannon Wotherspoon | 1 | 0 | 0.34
William Hartmann | 2 | 64 | 10.66
Matthew Snover | 3 | 0 | 0.34
Owen Kimball | 4 | 83 | 17.82