Title | ||
---|---|---|
TitaNet: Neural Model for Speaker Representation with 1D Depth-Wise Separable Convolutions and Global Context |
Abstract | ||
---|---|---|
In this paper, we propose TitaNet, a novel neural network architecture for extracting speaker representations. We employ 1D depth-wise separable convolutions with Squeeze-and-Excitation (SE) layers with global context, followed by a channel-attention-based statistics pooling layer, to map variable-length utterances to a fixed-length embedding (t-vector). TitaNet is a scalable architecture and achieves state-of-the-art performance on the speaker verification task, with an equal error rate (EER) of 0.68% on the VoxCeleb1 trial file, and on speaker diarization tasks, with a diarization error rate (DER) of 1.73% on AMI-MixHeadset, 1.99% on AMI-Lapel, and 1.11% on CH109. Furthermore, we investigate various sizes of TitaNet and present a light TitaNet-S model with only 6M parameters that achieves near state-of-the-art results in diarization tasks. |
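
The abstract's step of mapping variable-length utterances to a fixed-length embedding can be illustrated with a minimal, dependency-free sketch of attention-weighted statistics pooling. This is not the authors' implementation: the function names are hypothetical, and the attention scores are passed in as plain inputs here, whereas in a real model they would presumably come from a small learned network.

```python
import math
from typing import List


def softmax(values: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_stats_pooling(
    features: List[List[float]],
    scores: List[List[float]],
    eps: float = 1e-8,
) -> List[float]:
    """Pool T x C frame-level features into a vector of length 2 * C.

    `scores` holds one unnormalized attention score per frame and channel
    (an assumption for this sketch). For each channel, the scores are
    normalized over time and used to compute a weighted mean and a
    weighted standard deviation.
    """
    num_frames = len(features)
    num_channels = len(features[0])
    means: List[float] = []
    stds: List[float] = []
    for c in range(num_channels):
        weights = softmax([scores[t][c] for t in range(num_frames)])
        mean = sum(weights[t] * features[t][c] for t in range(num_frames))
        var = sum(
            weights[t] * (features[t][c] - mean) ** 2 for t in range(num_frames)
        )
        means.append(mean)
        stds.append(math.sqrt(max(var, 0.0) + eps))
    # Concatenate the per-channel means and standard deviations.
    return means + stds
```

Whatever the number of frames, the output has length 2 * C, which is what lets a later linear layer produce a fixed-size embedding. With uniform scores the pooling reduces to the ordinary per-channel mean and standard deviation.
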
Year | DOI | Venue |
---|---|---|
2022 | 10.1109/ICASSP43922.2022.9746806 | IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) |
DocType | Citations | PageRank |
---|---|---|
Conference | 0 | 0.34 |
References | Authors | |
---|---|---|
0 | 3 |
Name | Order | Citations | PageRank |
---|---|---|---|
Nithin Rao Koluguri | 1 | 0 | 1.01 |
Taejin Park | 2 | 0 | 0.68 |
Boris Ginsburg | 3 | 75 | 8.77 |