Title
A speaker diarization method based on the probabilistic fusion of audio-visual location information
Abstract
This paper proposes a speaker diarization method for determining "who spoke when" in multi-party conversations, based on the probabilistic fusion of audio and visual location information. The audio and visual information is obtained from a compact system designed to analyze round-table multi-party conversations. The system consists of two cameras and a triangular array of three microphones, and can cover a spherical region. Speaker locations are estimated from audio and visual observations as azimuths relative to this recording system. Conventional speaker diarization methods use a cascade of speech activity detection, direction-of-arrival estimation, acoustic feature extraction, and information-criterion-based speaker segmentation. In contrast, the proposed method uses a small microphone setup to estimate the probability that multiple simultaneous speakers are present in a physical space. To estimate speaker presence more accurately, the speech presence probabilities in the physical space are integrated with probabilities estimated from the participants' face locations, which are obtained with a robust particle-filtering-based face tracker using two cameras equipped with fisheye lenses. Locations in the physical space where the integrated probability is high are then assigned to a certain number of speaker classes by on-line classification, which realizes speaker diarization. The probability calculations and speaker classification are both conducted on-line, so it is unnecessary to observe all the conversation data. An experiment using real casual conversations, which include more overlaps and short speech segments than formal meetings, showed the advantages of the proposed method.
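The fusion and on-line classification steps described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the product rule used to combine the audio and visual probabilities, the Gaussian model around each tracked face azimuth, the fixed threshold, the nearest-center classifier, and all names and parameter values (`face_probability`, `OnlineSpeakerClasses`, `sigma`, `tolerance`, `threshold`) are hypothetical.

```python
import math


def circular_diff(a, b):
    """Smallest absolute difference between two azimuths, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def face_probability(azimuth, face_azimuths, sigma=10.0):
    """Hypothetical visual model: a Gaussian bump around each tracked face."""
    if not face_azimuths:
        return 0.0
    return max(
        math.exp(-0.5 * (circular_diff(azimuth, f) / sigma) ** 2)
        for f in face_azimuths
    )


def fuse(speech_prob, face_azimuths):
    """Combine audio and visual evidence per azimuth (assumed product rule).

    speech_prob maps an azimuth in degrees to a speech presence probability.
    """
    return {
        az: p * face_probability(az, face_azimuths)
        for az, p in speech_prob.items()
    }


class OnlineSpeakerClasses:
    """Assumed on-line classifier: nearest fixed center, else a new class."""

    def __init__(self, tolerance=20.0):
        self.tolerance = tolerance
        self.centers = []  # one azimuth per speaker class, never updated

    def assign(self, azimuth):
        if self.centers:
            best = min(
                range(len(self.centers)),
                key=lambda i: circular_diff(azimuth, self.centers[i]),
            )
            if circular_diff(azimuth, self.centers[best]) <= self.tolerance:
                return best
        self.centers.append(azimuth)
        return len(self.centers) - 1


def diarize_frame(speech_prob, face_azimuths, classes, threshold=0.5):
    """Return the sorted speaker class indices judged active in one frame."""
    fused = fuse(speech_prob, face_azimuths)
    return sorted(
        {classes.assign(az) for az, p in fused.items() if p >= threshold}
    )


classes = OnlineSpeakerClasses()
frame1 = diarize_frame({30: 0.9, 200: 0.8, 120: 0.9}, [30, 200], classes)
frame2 = diarize_frame({35: 0.9}, [30, 200], classes)
```

In this toy run, the strong audio peak at 120 degrees in the first frame is suppressed because no face is tracked near it, while the peaks at 30 and 200 degrees open two speaker classes. Because each frame is processed as it arrives, the sketch never needs the whole recording, which mirrors the on-line property the abstract emphasizes.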
Year
2009
DOI
10.1145/1647314.1647327
Venue
ICMI
Keywords
audio-visual location information, speaker classification, speaker location, speaker presence, speaker segmentation, speaker diarization, multiple simultaneous speaker, probabilistic fusion, speaker diarization method, speaker class, physical space, feature extraction, system design, speech segmentation, particle filter, speech activity detection
Field
Computer vision, Voice activity detection, Computer science, Particle filter, Microphone array, Feature extraction, Speech recognition, Speaker recognition, Artificial intelligence, Speaker diarisation, Probabilistic logic, Microphone
DocType
Conference
Citations
6
PageRank
0.46
References
32
Authors
5
Name                 Order   Citations   PageRank
Kentaro Ishizuka     1       174         15.77
Shoko Araki          2       1726        158.79
Kazuhiro Otsuka      3       619         54.15
Tomohiro Nakatani    4       1327        139.18
Masakiyo Fujimoto    5       393         34.28