9.8 A 25mm² SoC for IoT Devices with 18ms Noise-Robust Speech-to-Text Latency via Bayesian Speech Denoising and Attention-Based Sequence-to-Sequence DNN Speech Recognition in 16nm FinFET

Paper Info

Title
9.8 A 25mm² SoC for IoT Devices with 18ms Noise-Robust Speech-to-Text Latency via Bayesian Speech Denoising and Attention-Based Sequence-to-Sequence DNN Speech Recognition in 16nm FinFET

Abstract
Automatic speech recognition (ASR) using deep learning is essential for user interfaces on IoT devices. However, previously published ASR chips [4-7] do not consider realistic operating conditions, which are typically noisy and may include more than one speaker. Furthermore, several of these works implement only small-vocabulary tasks, such as keyword spotting (KWS), where context-blind deep neural network (DNN) algorithms are adequate. For large-vocabulary tasks (e.g., >100k words), the more complex bidirectional RNNs with an attention mechanism [1] provide context learning over long sequences, which improves ASR accuracy by up to 62% on the 200k-word LibriSpeech dataset compared to a simpler unidirectional RNN (Fig. 9.8.1). Attention-based networks emphasize the most relevant parts of the source sequence at each decoding time step. In doing so, the encoder sequence is treated as a soft-addressable memory whose positions are weighted based on the state of the decoder RNN. Bidirectional RNNs learn past and future temporal information by concatenating the forward and backward hidden states at each time step.
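
The soft-addressable memory idea in the abstract can be illustrated with a minimal sketch. It assumes plain dot-product scoring and toy hand-written hidden states; the abstract does not say which scoring function or dimensions the chip uses, and the names `bidirectional_states` and `attend` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def bidirectional_states(forward, backward):
    """Concatenate forward and backward hidden states per time step."""
    return [f + b for f, b in zip(forward, backward)]


def attend(decoder_state, encoder_states):
    """Weight every encoder position by its match with the decoder state.

    Dot-product scoring is an assumption made for illustration only.
    """
    scores = [
        sum(d * h for d, h in zip(decoder_state, state))
        for state in encoder_states
    ]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # The context vector is the attention-weighted sum of encoder states.
    context = [
        sum(w * state[i] for w, state in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context


# Toy data: 3 time steps, 2-dimensional forward and backward states.
forward = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
backward = [[0.5, 0.5], [1.0, 0.0], [0.0, 0.0]]
encoder_states = bidirectional_states(forward, backward)

decoder_state = [0.0, 2.0, 0.0, 0.0]
weights, context = attend(decoder_state, encoder_states)
```

With this toy decoder state, the second and third encoder positions score equally and receive more weight than the first, so the context vector is dominated by those two time steps.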

Year	DOI	Venue
2021	10.1109/ISSCC42613.2021.9366062	2021 IEEE International Solid-State Circuits Conference (ISSCC)
Keywords	DocType	Volume
SoC,IoT devices,bayesian speech denoising,sequence-to-sequence DNN speech recognition,FinFET,automatic speech recognition,deep learning,user interfaces,ASR chips,realistic operating conditions,small-vocabulary tasks,large-vocabulary tasks,complex bidirectional RNNs,attention mechanism,context learning,long sequences,ASR accuracy,200kwords LibriSpeech dataset,attention-based networks,source sequence,encoder sequence,context-blind deep neural network,noise-robust speech-to-text latency,bidirectional RNN,decoder RNN,soft-addressable memory,time 18.0 ms,size 16.0 nm	Conference	64
ISSN	ISBN	Citations
0193-6530	978-1-7281-9550-6	3
PageRank	References	Authors
0.43	0	10

Authors (10 rows)

Name	Order	Citations	PageRank
Thierry Tambe	1	18	3.43
En-Yu Yang	2	10	2.31
Glenn G. Ko	3	10	3.30
Yuji Chai	4	5	2.16
Coleman Hooper	5	7	1.17
Marco Donato	6	31	5.83
Paul N. Whatmough	7	147	20.59
Alexander M. Rush	8	1499	67.53
David Brooks	9	5518	422.08
Gu-Yeon Wei	10	1927	214.15
