The IBM Rich Transcription 2007 Speech-to-Text Systems for Lecture Meetings - Citegraph

Paper Info

Title
The IBM Rich Transcription 2007 Speech-to-Text Systems for Lecture Meetings

Abstract
The paper describes the IBM systems submitted to the NIST Rich Transcription 2007 (RT07) evaluation campaign for the speech-to-text (STT) and speaker-attributed speech-to-text (SASTT) tasks on the lecture meeting domain. Three testing conditions are considered: multiple distant microphone (MDM), single distant microphone (SDM), and individual headset microphone (IHM), the last for the STT task only. The IBM system building process is similar to that employed last year for the STT Rich Transcription Spring 2006 evaluation (RT06s). However, a few technical advances have been introduced for RT07: (a) better speaker segmentation; (b) system combination via the ROVER approach applied over an ensemble of systems, some of which are built by randomized decision tree state-tying; and (c) development of a very large language model consisting of 152M n-grams, incorporating, among other sources, 525M words of web data, and used in conjunction with a dynamic decoder. These advances reduce STT word error rate (WER) in the MDM condition by 16% relative (8% absolute) over the IBM RT06s system, as measured on 17 lecture meeting segments of the RT06s evaluation test set, selected in this work as development data. In the RT07 evaluation campaign, both MDM and SDM systems perform competitively for the STT and SASTT tasks. For example, in the MDM condition, a 44.3% STT WER is achieved on the RT07 evaluation test set, excluding scoring of overlapped speech. When the STT transcripts are combined with speaker labels from speaker diarization, SASTT WER becomes 52.0%. For the STT IHM condition, the newly developed large language model is employed, but in conjunction with the RT06s IHM acoustic models. The latter are reused for lack of time to train new models on the additional close-talking microphone data available in RT07. As a result, the system achieves modest WERs of 31.7% and 33.4% when using manual and automatic segmentation, respectively.

Year	DOI	Venue
2007	10.1007/978-3-540-68585-2_40	CLEAR
Keywords	Field	DocType
ibm rich transcription,rt06s evaluation test set,speech-to-text systems,stt word error rate,stt transcript,lecture meetings,stt rich transcription spring,mdm condition,stt task,stt ihm condition,rt07 evaluation campaign,stt wer,large language model	Headset,Word error rate,Speech recognition,NIST,Speaker diarisation,Engineering,Microphone,Language model,Test set,Acoustic model	Conference
Volume	ISSN	Citations
4625	0302-9743	1
PageRank	References	Authors
0.44	12	5

Authors (5 rows)

Name	Order	Citations	PageRank
Jing Huang	1	2464	186.09
Etienne Marcheret	2	100	11.15
Karthik Visweswariah	3	400	38.22
Vit Libal	4	32	4.32
Gerasimos Potamianos	5	1113	113.80
