Paper Info

Title
Advances in unsupervised audio classification and segmentation for the broadcast news and NGSW corpora

Abstract
Unsupervised audio classification and segmentation remains a challenging research problem that significantly affects automatic speech recognition (ASR) and spoken document retrieval (SDR) performance. This paper addresses novel advances in 1) audio classification for speech recognition and 2) audio segmentation for unsupervised multispeaker change detection. A new algorithm based on weighted GMM networks (WGN) is proposed for audio classification. Two new extended-time features, variance of the spectrum flux (VSF) and variance of the zero-crossing rate (VZCR), are used to preclassify the audio and supply weights to the output probabilities of the GMM networks. The classification is then implemented using the weighted GMM networks. Since historically there have been no features specifically designed for audio segmentation, we evaluate 16 potential features, including three newly proposed features, perceptual minimum variance distortionless response (PMVDR), smoothed zero-crossing rate (SZCR), and filterbank log energy coefficients (FBLC), in 14 noisy environments to determine the most robust features on average across these conditions. Next, a new distance metric, T2-mean, is proposed, which is intended to improve segmentation for short segment turns (i.e., 1-5 s). A new false alarm compensation procedure is implemented, which significantly reduces the false alarm rate at little cost to the miss rate. Evaluations on a standard data set, the Defense Advanced Research Projects Agency (DARPA) Hub4 Broadcast News 1997 evaluation data, show that the WGN classification algorithm achieves over a 50% improvement versus the GMM network baseline algorithm, and the proposed compound segmentation algorithm achieves 23%-10% improvement in all metrics versus the baseline Mel-frequency cepstral coefficients (MFCC) and traditional Bayesian information criterion (BIC) algorithm. The new classification and segmentation algorithms also obtain very satisfactory results on the more diverse and challenging National Gallery of the Spoken Word (NGSW) corpus.

Year	DOI	Venue
2006	10.1109/TSA.2005.858057	IEEE Transactions on Audio, Speech & Language Processing
Keywords	Field	DocType
audio segmentation,new algorithm,new distance metric,new false alarm compensation,broadcast news,new proposed feature,audio classification,ngsw corpus,zero-crossing rate,new classification,unsupervised audio classification,new extended-time feature,wgn classification algorithm,bayesian information criterion,spectrum,information retrieval,audio signal processing,unsupervised learning,classification algorithms,speaker recognition,filter bank,mel frequency cepstral coefficient,change detection,speech processing,automatic speech recognition,gaussian mixture model,speech recognition,feature analysis,robustness,broadcasting,gaussian processes,distance metric,false alarm rate	Mel-frequency cepstrum,False alarm,Pattern recognition,Computer science,Segmentation,Speech recognition,Speaker recognition,Unsupervised learning,Artificial intelligence,Constant false alarm rate,Audio signal processing,Cable television	Journal
Volume	Issue	ISSN
14	3	1558-7916
Citations	PageRank	References
35	1.73	24
Authors
2

Authors (2 rows)

Name	Order	Citations	PageRank
Rongqing Huang	1	141	10.27
J. H.L. Hansen	2	35	1.73
