Title
TITGT at TRECVID 2009 workshop
Abstract
First, we extract SIFT features from all the image frames in each shot. This multi-frame technique is expected to perform well, especially when objects are captured from different angles. We then model the SIFT features extracted in each shot with a GMM; we call the resulting models SIFT GMMs. They are expected to be more robust against the quantization errors introduced by hard-assignment clustering in the Bag-of-Keypoints approach, and they also retain the variance information of the SIFT features. The expectation-maximization (EM) algorithm is often used to estimate GMM parameters; however, a single shot may not contain enough SIFT features for precise estimation. Hence, we estimate the parameters of each SIFT GMM by a maximum a posteriori (MAP) adaptation technique in which the prior distribution is the SIFT GMM estimated from all of the videos. We classify shots using support vector machines (SVMs) with the radial basis function (RBF) kernel, where the distance between SIFT GMMs is defined as the weighted sum of the Mahalanobis distances between the corresponding mixture components.

As acoustic features, we extract mel-frequency cepstrum coefficients (MFCCs), which are widely used in speech recognition. We model each HLF using an ergodic hidden Markov model (HMM). We also build an HMM over all the HLFs as the universal background model (UBM) and use the likelihood ratio between the target HLF model and the UBM for detection.
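A minimal sketch of the per-shot visual modeling step described above, assuming OpenCV for SIFT and scikit-learn for the universal (prior) GMM. The helper names, mixture size, and relevance factor are illustrative assumptions, and only the means are adapted here; the paper's exact MAP formulation may differ.

```python
import cv2
import numpy as np
from sklearn.mixture import GaussianMixture

def extract_sift(frames):
    """Stack SIFT descriptors (128-D each) from all frames of one shot."""
    sift = cv2.SIFT_create()
    descs = []
    for frame in frames:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        _, d = sift.detectAndCompute(gray, None)
        if d is not None:
            descs.append(d)
    return np.vstack(descs)

def map_adapt_means(ubm: GaussianMixture, X, relevance=16.0):
    """Mean-only MAP adaptation of a GMM-UBM to one shot's descriptors.

    Standard relevance-MAP update: new_mu_k = alpha_k * xbar_k
    + (1 - alpha_k) * mu_k, with alpha_k = n_k / (n_k + r), where n_k
    is the soft count of descriptors assigned to component k.
    """
    post = ubm.predict_proba(X)                       # (N, K) responsibilities
    n_k = post.sum(axis=0)                            # soft counts per component
    xbar = (post.T @ X) / np.maximum(n_k[:, None], 1e-10)
    alpha = (n_k / (n_k + relevance))[:, None]
    return alpha * xbar + (1.0 - alpha) * ubm.means_

# Usage: fit the UBM once on descriptors pooled over all videos,
# then adapt its means to each shot.
# ubm = GaussianMixture(n_components=512, covariance_type='diag').fit(all_descs)
# shot_means = map_adapt_means(ubm, extract_sift(shot_frames))
```

Because each shot GMM is adapted from the same prior, its mixture components stay in one-to-one correspondence with the UBM's, which is what makes the component-wise distance below well defined. A sketch of that SVM step, under the assumption that weights and diagonal covariances are shared via the UBM; the RBF width `gamma` is a placeholder, and the squared Mahalanobis form is one common choice.

```python
import numpy as np
from sklearn.svm import SVC

def gmm_distance(means_a, means_b, weights, inv_covs):
    """Weighted sum of component-wise Mahalanobis distances.

    means_*: (K, D) adapted means; weights: (K,) UBM mixture weights;
    inv_covs: (K, D) inverse diagonal covariances shared via the UBM.
    """
    diff = means_a - means_b
    # Squared Mahalanobis distance per component under diagonal covariances.
    maha = np.sum(diff * diff * inv_covs, axis=1)
    return float(np.dot(weights, maha))

def rbf_kernel_matrix(shots_a, shots_b, weights, inv_covs, gamma=0.01):
    """RBF kernel over the GMM distance, as a precomputed Gram matrix."""
    K = np.empty((len(shots_a), len(shots_b)))
    for i, ma in enumerate(shots_a):
        for j, mb in enumerate(shots_b):
            K[i, j] = np.exp(-gamma * gmm_distance(ma, mb, weights, inv_covs))
    return K

# Usage with scikit-learn's precomputed-kernel SVM:
# K_train = rbf_kernel_matrix(train_means, train_means, ubm.weights_, 1.0 / ubm.covariances_)
# clf = SVC(kernel='precomputed').fit(K_train, labels)
# K_test = rbf_kernel_matrix(test_means, train_means, ubm.weights_, 1.0 / ubm.covariances_)
# scores = clf.decision_function(K_test)
```

For the acoustic branch, a minimal sketch assuming librosa for MFCC extraction and hmmlearn for the ergodic HMMs; the model sizes and training data layout are placeholders, not the paper's settings.

```python
import librosa
import numpy as np
from hmmlearn.hmm import GaussianHMM

def mfcc_frames(wav_path, n_mfcc=13):
    """Frame-level MFCCs for one shot's audio, shape (T, n_mfcc)."""
    y, sr = librosa.load(wav_path, sr=None)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

# Ergodic (fully connected) HMMs: one per high-level feature, plus a
# UBM trained on the audio of all shots pooled together.
hlf_model = GaussianHMM(n_components=8, covariance_type='diag')
ubm_model = GaussianHMM(n_components=8, covariance_type='diag')
# hlf_model.fit(np.vstack(hlf_feats), lengths=[len(f) for f in hlf_feats])
# ubm_model.fit(np.vstack(all_feats), lengths=[len(f) for f in all_feats])

def detection_score(feats):
    """Log-likelihood ratio between the target HLF model and the UBM."""
    return hlf_model.score(feats) - ubm_model.score(feats)
```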
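The decision threshold on this log-likelihood ratio would be tuned per HLF on held-out data; using the difference of log scores is the standard GMM/HMM-UBM detection recipe the abstract describes.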
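Note that sharing the UBM's covariances across shots, as in the sketches above, is an assumption made for illustration; it keeps the Mahalanobis metric identical for every shot pair and makes the kernel symmetric by construction.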
Year
2009
Venue
TRECVID
Keywords
Maximum a posteriori estimation, Hidden Markov model, Mahalanobis distance, Support vector machine, Cluster analysis, Scale-invariant feature transform, Kernel (linear algebra), TRECVID, Pattern recognition, Speech recognition, Engineering, Artificial intelligence
DocType
Conference
Citations
0
PageRank
0.34
References
0
Authors
6
Name              Order  Citations  PageRank
Nakamasa Inoue    1      9          5.02
Shanshan Hao      2      2          1.41
Tatsuhiko Saito   3      7          1.88
Koichi Shinoda    4      463        65.14
Ilseo Kim         5      100        7.68
Chin-Hui Lee      6      6101       852.71