Abstract |
---|
In order to efficiently transmit and store speech signals, speech codecs create a minimally redundant representation of the input signal which is then decoded at the receiver with the best possible perceptual quality. In this work we demonstrate that a neural network architecture based on VQ-VAE with a WaveNet decoder can be used to perform very low bit-rate speech coding with high reconstruction quality. A prosody-transparent and speaker-independent model trained on the LibriSpeech corpus coding audio at 1.6 kbps exhibits perceptual quality which is around halfway between the MELP codec at 2.4 kbps and the AMR-WB codec at 23.05 kbps. In addition, when training on high-quality recorded speech with the test speaker included in the training set, a model coding speech at 1.6 kbps produces output of similar perceptual quality to that generated by AMR-WB at 23.05 kbps. |
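The 1.6 kbps figure follows from the size of the VQ-VAE codebook and the rate at which the encoder emits latent vectors: only the codebook indices need to be transmitted, and the WaveNet decoder reconstructs the waveform from them. The sketch below illustrates this bottleneck and the bit-rate arithmetic; the codebook size, latent dimensionality, and latent frame rate are assumed illustrative values, not necessarily the configuration used in the paper.

```python
import numpy as np

# Minimal sketch of a VQ-VAE quantization bottleneck for speech coding.
# Each encoder output vector is replaced by the index of its nearest
# codebook entry, and only the index stream is transmitted.
# All sizes below are illustrative assumptions, not the paper's exact setup.

rng = np.random.default_rng(0)

codebook_size = 256        # assumed: 256 entries -> 8 bits per index
latent_dim = 64            # assumed dimensionality of each latent vector
frames_per_second = 200    # assumed latent frame rate after encoder downsampling

codebook = rng.normal(size=(codebook_size, latent_dim))

def quantize(latents: np.ndarray) -> np.ndarray:
    """Return the index of the nearest (L2) codebook entry for each latent vector."""
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

# One second of (stand-in) encoder output -> one second of transmitted indices.
latents = rng.normal(size=(frames_per_second, latent_dim))
indices = quantize(latents)

# The transmitted bit rate is bits-per-index times the latent frame rate.
bits_per_index = int(np.log2(codebook_size))
bitrate_bps = bits_per_index * frames_per_second
print(f"{bitrate_bps} bps ({bitrate_bps / 1000:.1f} kbps)")  # 1600 bps = 1.6 kbps

# At the receiver, the indices select codebook vectors, which would condition
# a WaveNet decoder to generate the output waveform.
reconstructed_latents = codebook[indices]
```

With these assumed values, 8 bits per index at 200 indices per second gives exactly 1.6 kbps, which is how a small discrete bottleneck yields a very low coding rate while the heavy lifting of reconstruction is left to the decoder.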
Year | DOI | Venue |
---|---|---|
2019 | 10.1109/icassp.2019.8683277 | 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Keywords | Field | DocType
---|---|---|
Speech coding, low bit-rate, generative models, WaveNet, VQ-VAE | Training set, Low bit rate, Speech coding, Pattern recognition, Computer science, Neural network architecture, Coding (social sciences), Artificial intelligence, Codec | Conference
ISSN | Citations | PageRank
---|---|---|
1520-6149 | 0 | 0.34
References | Authors
---|---|
0 | 7
Name | Order | Citations | PageRank |
---|---|---|---|
Cristina Garbacea | 1 | 4 | 1.10 |
Aäron van den Oord | 2 | 1585 | 64.43
Yazhe Li | 3 | 40 | 1.65 |
Felicia Lim | 4 | 35 | 5.70 |
Alejandro Luebs | 5 | 2 | 2.05 |
Oriol Vinyals | 6 | 9419 | 418.45 |
Thomas C. Walters | 7 | 0 | 0.68