Title
The IBM Rich Transcription 2007 Speech-to-Text Systems for Lecture Meetings
Abstract
This paper describes the IBM systems submitted to the NIST Rich Transcription 2007 (RT07) evaluation campaign for the speech-to-text (STT) and speaker-attributed speech-to-text (SASTT) tasks in the lecture meeting domain. Three test conditions are considered: multiple distant microphone (MDM), single distant microphone (SDM), and individual headset microphone (IHM), the last for the STT task only. The system building process is similar to the one used for the Rich Transcription Spring 2006 (RT06s) STT evaluation, with several technical advances introduced for RT07: (a) better speaker segmentation; (b) system combination via the ROVER approach applied over an ensemble of systems, some of which are built by randomized decision tree state-tying; and (c) a very large language model of 152M n-grams that incorporates, among other sources, 525M words of web data and is used in conjunction with a dynamic decoder. These advances reduce STT word error rate (WER) in the MDM condition by 16% relative (8% absolute) over the IBM RT06s system, as measured on 17 lecture meeting segments of the RT06s evaluation test set, selected in this work as development data. In the RT07 evaluation campaign, both the MDM and SDM systems perform competitively on the STT and SASTT tasks. For example, in the MDM condition, a 44.3% STT WER is achieved on the RT07 evaluation test set, excluding scoring of overlapped speech. When the STT transcripts are combined with speaker labels from speaker diarization, the SASTT WER becomes 52.0%. For the STT IHM condition, the newly developed large language model is used in conjunction with the RT06s IHM acoustic models, which are reused because there was not enough time to train new models on the additional close-talking microphone data available in RT07. The resulting system therefore achieves modest WERs of 31.7% and 33.4% with manual and automatic segmentation, respectively.
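The abstract reports results as word error rates and as relative versus absolute reductions. The following is a minimal sketch of how those quantities relate, not the scoring pipeline used in the evaluation: it assumes whitespace-tokenized text with no normalization, and the function names `word_error_rate` and `relative_reduction` are hypothetical. The 50.0 and 42.0 figures in the usage lines are illustrative values chosen to match the rounded "16% relative (8% absolute)" statement; they are not results reported in the abstract.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER in percent: word-level edit distance divided by reference length.

    Assumption: plain whitespace tokenization, no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference
    # words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return 100.0 * prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction in percent of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# One substitution and one deletion against a six-word reference.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")

# Illustrative only: an 8-point absolute drop from a 50% baseline
# corresponds to a 16% relative reduction.
rel = relative_reduction(50.0, 42.0)
```

Because the relative figure is normalized by the baseline, the same absolute gain counts for more when the baseline WER is lower.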
Year
2007
DOI
10.1007/978-3-540-68585-2_40
Venue
CLEAR
Keywords
ibm rich transcription, rt06s evaluation test set, speech-to-text systems, stt word error rate, stt transcript, lecture meetings, stt rich transcription spring, mdm condition, stt task, stt ihm condition, rt07 evaluation campaign, stt wer, large language model
Field
Headset, Word error rate, Speech recognition, NIST, Speaker diarisation, Engineering, Microphone, Language model, Test set, Acoustic model
DocType
Conference
Volume
4625
ISSN
0302-9743
Citations
1
PageRank
0.44
References
12
Authors
5
Name
Order
Citations
PageRank
Jing Huang12464186.09
Etienne Marcheret210011.15
Karthik Visweswariah340038.22
Vit Libal4324.32
Gerasimos Potamianos51113113.80