Abstract |
---|
We present a state-of-the-art speech recognition system developed using end-to-end deep learning. Our architecture is significantly simpler than traditional speech systems, which rely on laboriously engineered processing pipelines; these traditional systems also tend to perform poorly when used in noisy environments. In contrast, our system does not need hand-designed components to model background noise, reverberation, or speaker variation, but instead directly learns a function that is robust to such effects. We do not need a phoneme dictionary, nor even the concept of a "phoneme." Key to our approach is a well-optimized RNN training system that uses multiple GPUs, as well as a set of novel data synthesis techniques that allow us to efficiently obtain a large amount of varied data for training. Our system, called Deep Speech, outperforms previously published results on the widely studied Switchboard Hub5'00, achieving 16.0% error on the full test set. Deep Speech also handles challenging noisy environments better than widely used, state-of-the-art commercial speech systems. |
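The abstract's claim that no phoneme dictionary is needed comes from training the RNN to emit characters directly and decoding with CTC. As an illustrative sketch only (the paper's actual decoder uses a language model and beam search; the blank symbol and function below are assumptions for illustration), greedy CTC decoding collapses repeated per-frame characters and strips the blank token:

```python
BLANK = "_"  # CTC blank token (symbol choice is an assumption)

def ctc_greedy_collapse(frames):
    """Collapse a per-frame best-path character sequence into text."""
    out = []
    prev = None
    for ch in frames:
        if ch != prev:        # drop consecutive repeats
            if ch != BLANK:   # drop blank symbols
                out.append(ch)
        prev = ch
    return "".join(out)

# Per-frame argmax characters spelling "cab":
print(ctc_greedy_collapse(["c", "c", "_", "a", "a", "_", "_", "b"]))  # cab
```

The blank token is what lets the model emit genuine double letters: a blank between two identical characters prevents them from being collapsed into one.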
Year | Venue | DocType
---|---|---
2014 | CoRR | Journal

Volume | Citations | PageRank
---|---|---
abs/1412.5567 | 185 | 8.06

References | Authors
---|---
21 | 11
Name | Order | Citations | PageRank |
---|---|---|---|
Awni Y. Hannun | 1 | 517 | 27.54 |
Carl Case | 2 | 437 | 16.75 |
Jared Casper | 3 | 824 | 34.12 |
Bryan C. Catanzaro | 4 | 1191 | 75.56 |
Gregory Frederick Diamos | 5 | 1117 | 51.07 |
Erich Elsen | 6 | 185 | 10.42 |
Ryan J. Prenger | 7 | 486 | 20.61 |
Sanjeev Satheesh | 8 | 5591 | 233.55 |
Shubho Sengupta | 9 | 505 | 19.84 |
Adam Coates | 10 | 2493 | 160.95 |
Andrew Y. Ng | 11 | 26065 | 1987.54 |