Title: Listen, Attend and Spell
Abstract: We present Listen, Attend and Spell (LAS), a neural network that learns to transcribe speech utterances to characters. Unlike traditional DNN-HMM models, this model learns all the components of a speech recognizer jointly. Our system has two components: a listener and a speller. The listener is a pyramidal recurrent network encoder that accepts filter bank spectra as inputs. The speller is an attention-based recurrent network decoder that emits characters as outputs. The network produces character sequences without making any independence assumptions between the characters. This is the key improvement of LAS over previous end-to-end CTC models. On a subset of the Google voice search task, LAS achieves a word error rate (WER) of 14.1% without a dictionary or a language model, and 10.3% with language model rescoring over the top 32 beams. By comparison, the state-of-the-art CLDNN-HMM model achieves a WER of 8.0%.
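The abstract's two components can be illustrated with a minimal numpy sketch. This is not the paper's implementation (which uses pyramidal BLSTM layers and an LSTM speller with learned parameters); it only demonstrates the shape mechanics the abstract describes: the pyramidal listener halving the time resolution of filter bank features, and the speller forming a context vector by content-based attention. All dimensions and function names here are illustrative assumptions.

```python
import numpy as np

def pyramidal_reduce(x):
    """One pyramid layer of the listener: halve the time axis by
    concatenating consecutive frame pairs (doubles the feature size).
    The real model additionally passes the result through a BLSTM."""
    T, d = x.shape
    if T % 2:                      # pad to an even number of frames
        x = np.vstack([x, np.zeros((1, d))])
    return x.reshape(-1, 2 * d)

def attend(query, keys):
    """One speller attention step: softmax-weighted average of the
    listener features, scored by dot product with a decoder query."""
    scores = keys @ query                       # one score per frame
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                    # softmax
    return weights @ keys                       # context vector

# 8 frames of 40-dimensional filter bank features (random stand-in)
feats = np.random.randn(8, 40)
h = pyramidal_reduce(pyramidal_reduce(feats))   # two pyramid layers: 8 -> 2 frames
context = attend(np.random.randn(h.shape[1]), h)
print(h.shape, context.shape)                   # (2, 160) (160,)
```

Two pyramid layers reduce 8 input frames to 2 higher-level features, which is what lets the attention in the speller scan a short sequence instead of every acoustic frame.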
Year: 2015
Venue: CoRR
Field: Computer science, Filter bank, Word error rate, Speech recognition, Artificial intelligence, Natural language processing, Encoder, Spell, Artificial neural network, Voice search, Machine learning, Language model
DocType: Journal
Volume: abs/1508.01211
Citations: 50
PageRank: 2.43
References: 10
Authors: 4
Name             Order  Citations  PageRank
William Chan     1      357        24.67
Navdeep Jaitly   2      2988       166.08
Quoc V. Le       3      8501       366.59
Oriol Vinyals    4      9419       418.45