Title |
---|
Two-pass Decoding and Cross-adaptation Based System Combination of End-to-end Conformer and Hybrid TDNN ASR Systems |
Abstract |
---|
Fundamental modelling differences between hybrid and end-to-end (E2E) automatic speech recognition (ASR) systems create large diversity and complementarity among them. This paper investigates multi-pass rescoring and cross-adaptation based system combination approaches for hybrid TDNN and Conformer E2E ASR systems. In multi-pass rescoring, a state-of-the-art hybrid LF-MMI trained CNN-TDNN system featuring speed perturbation, SpecAugment and Bayesian learning hidden unit contributions (LHUC) speaker adaptation was used to produce initial N-best outputs, which were then rescored by the speaker-adapted Conformer system using a 2-way cross-system score interpolation. In cross-adaptation, the hybrid CNN-TDNN system was adapted to the 1-best output of the Conformer system, or vice versa. Experiments on the 300-hour Switchboard corpus suggest that the combined systems derived using either of the two system combination approaches outperformed the individual systems. The best combined system, obtained using multi-pass rescoring, produced statistically significant word error rate (WER) reductions of 2.5% to 3.9% absolute (22.5% to 28.9% relative) over the standalone Conformer system on the NIST Hub5'00, Rt03 and Rt02 evaluation data. |
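
The multi-pass rescoring step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the interpolation formula or its weights, so the linear interpolation of log-domain scores, the `Hypothesis` type, the function names, the `conformer_weight` parameter and the toy scores below are all assumptions made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """One entry of an N-best list (hypothetical container for illustration)."""
    text: str
    hybrid_score: float     # log-domain score from the first-pass hybrid system
    conformer_score: float  # log-domain score from the second-pass Conformer


def interpolated_score(hyp, conformer_weight):
    """Linearly interpolate the two systems' log-domain scores (assumed form)."""
    return ((1.0 - conformer_weight) * hyp.hybrid_score
            + conformer_weight * hyp.conformer_score)


def rescore_nbest(nbest, conformer_weight=0.5):
    """Return the N-best list re-ranked by interpolated score, best first."""
    if not nbest:
        raise ValueError("N-best list is empty")
    if not 0.0 <= conformer_weight <= 1.0:
        raise ValueError("conformer_weight must lie in [0, 1]")
    return sorted(
        nbest,
        key=lambda hyp: interpolated_score(hyp, conformer_weight),
        reverse=True,
    )


# Toy N-best list with made-up scores: the hybrid system prefers the first
# hypothesis, the Conformer prefers the second.
nbest = [
    Hypothesis("i scream", hybrid_score=-10.0, conformer_score=-14.0),
    Hypothesis("ice cream", hybrid_score=-11.0, conformer_score=-9.0),
]
best = rescore_nbest(nbest, conformer_weight=0.5)[0]
print(best.text)
```

With `conformer_weight=0.0` the ranking reduces to the first-pass hybrid ordering, and with `1.0` it reduces to the Conformer ordering. In practice a weight like this would typically be tuned on held-out data.
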
Year | DOI | Venue |
---|---|---|
2022 | 10.21437/INTERSPEECH.2022-696 | Conference of the International Speech Communication Association (INTERSPEECH) |
DocType | Citations | PageRank |
---|---|---|
Conference | 0 | 0.34 |
References | Authors |
---|---|
0 | 10 |
Name | Order | Citations | PageRank |
---|---|---|---|
Mingyu Cui | 1 | 0 | 2.03 |
Jiajun Deng | 2 | 0 | 1.69 |
Shoukang Hu | 3 | 6 | 10.90 |
Xurong Xie | 4 | 6 | 8.57 |
Tianzi Wang | 5 | 0 | 2.03 |
Shujie Hu | 6 | 0 | 1.35 |
Mengzhe Geng | 7 | 1 | 5.42 |
Boyang Xue | 8 | 0 | 0.68 |
Xunying Liu | 9 | 330 | 52.46 |
Helen M. Meng | 10 | 1078 | 172.82 |