Abstract |
---|
Most speech recognition systems rely on pronunciation dictionaries to provide accurate transcriptions. Typically, some pronunciations are crafted manually, but many are produced by pronunciation learning algorithms. Successful algorithms must be able to generate rich pronunciation variants, e.g. to accommodate words of foreign origin, while being robust to artifacts of the training data, e.g. noise in the acoustic segments from which pronunciations are learned when the method uses acoustic signals. We propose a general finite-state transducer (FST) framework to describe such algorithms. This representation is flexible enough to accommodate a wide variety of pronunciation learning algorithms, including approaches that rely on the availability of acoustic data and methods that rely only on the spelling of the target words. In particular, we show that the pronunciation FST can be built from a recurrent neural network (RNN) and tuned to provide rich yet constrained pronunciations. This new approach reduces the number of incorrect pronunciations learned from Google Voice traffic by up to 25% relative. |
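The core idea in the abstract, producing pronunciation variants that are rich but constrained, can be sketched with a toy weighted grapheme-to-phoneme lattice. This is only an illustration, not the paper's method: the arc weights here are made-up numbers, whereas in the paper they would be derived from an RNN and encoded in an FST.

```python
import heapq

# Hypothetical per-grapheme phoneme options with weights (negative log
# probabilities). In the paper's setting these weights would come from an
# RNN G2P model; the values below are illustrative only.
G2P_ARCS = {
    "c": [("k", 0.2), ("s", 1.6)],
    "a": [("ae", 0.3), ("ah", 1.2)],
    "t": [("t", 0.1)],
}

def pronunciation_lattice(word):
    """Build a simple sausage lattice: one slot of weighted arcs per grapheme."""
    return [G2P_ARCS[g] for g in word]

def n_best(lattice, n=3):
    """Enumerate the n lowest-cost phoneme sequences through the lattice."""
    paths = [([], 0.0)]
    for arcs in lattice:
        paths = [(phones + [p], cost + w)
                 for phones, cost in paths
                 for p, w in arcs]
        # Prune to a beam of size n: this is what keeps the variants
        # "constrained" while still allowing alternatives like "k" vs "s".
        paths = heapq.nsmallest(n, paths, key=lambda x: x[1])
    return [(" ".join(phones), round(cost, 2)) for phones, cost in paths]

print(n_best(pronunciation_lattice("cat")))
```

Running this prints the three cheapest pronunciations of "cat" with their costs, e.g. `k ae t` first; a real system would compose such a lattice with acoustic evidence or spelling constraints before selecting variants.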
Year | DOI | Venue |
---|---|---|
2017 | 10.21437/Interspeech.2017-47 | 18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION |
Keywords | Field | DocType |
---|---|---|
speech recognition, pronunciation learning | Pronunciation, Computer science, Speech recognition | Conference |
ISSN | Citations | PageRank |
---|---|---|
2308-457X | 0 | 0.34 |
References | Authors |
---|---|
7 | 5 |
Name | Order | Citations | PageRank |
---|---|---|---|
Antoine Bruguier | 1 | 6 | 3.50 |
Danushen Gnanapragasam | 2 | 0 | 0.34 |
Leif Johnson | 3 | 37 | 4.34 |
Kanishka Rao | 4 | 189 | 11.94 |
Françoise Beaufays | 5 | 27 | 2.84 |