A speaker diarization method based on the probabilistic fusion of audio-visual location information - Citegraph

Paper Info

Title
A speaker diarization method based on the probabilistic fusion of audio-visual location information

Abstract
This paper proposes a speaker diarization method for determining "who spoke when" in multi-party conversations, based on the probabilistic fusion of audio and visual location information. The audio and visual information is obtained from a compact system designed to analyze round table multi-party conversations. The system consists of two cameras and a triangular microphone array with three microphones, and can cover a spherical region. Speaker locations are estimated from audio and visual observations in terms of azimuths from this recording system. Unlike conventional speech diarization methods, our proposed method estimates the probability of the presence of multiple simultaneous speakers in a physical space with a small microphone setup instead of using a cascade consisting of speech activity detection, direction of arrival estimation, acoustic feature extraction, and information criteria based speaker segmentation. To estimate the speaker presence more correctly, the speech presence probabilities in a physical space are integrated with the probabilities estimated from participants' face locations obtained with a robust particle filtering based face tracker with two cameras equipped with fisheye lenses. The locations in a physical space with highly integrated probabilities are then classified into a certain number of speaker classes by using on-line classification to realize speaker diarization. The probability calculations and speaker classifications are conducted on-line, making it unnecessary to observe all the conversation data. An experiment using real casual conversations, which include more overlaps and short speech segments than formal meetings, showed the advantages of the proposed method.
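
Illustrative sketch (not from the paper): the abstract describes fusing per-azimuth speech presence probabilities with face-location probabilities and then classifying high-probability azimuths on-line. The Python below is one minimal way such a pipeline could look. Everything specific in it is an assumption made for illustration: the 5-degree azimuth bins, the Gaussian-shaped face prior, the product fusion rule, the threshold, and the leader-follower clustering that stands in for the paper's on-line classifier. All function and class names are hypothetical.

```python
import math

N_BINS = 72  # assumption: 5-degree azimuth bins around the recording system


def circ_dist(a, b):
    """Smallest absolute angular difference between two azimuths, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def face_prior(face_azimuths, sigma=10.0, floor=0.05):
    """Per-bin face-presence probability (assumed Gaussian bump per tracked face)."""
    probs = []
    for i in range(N_BINS):
        az = i * 360.0 / N_BINS
        p = max(
            (math.exp(-0.5 * (circ_dist(az, f) / sigma) ** 2) for f in face_azimuths),
            default=0.0,
        )
        probs.append(max(p, floor))  # floor keeps bins without faces from being zeroed
    return probs


def fuse(audio_probs, visual_probs):
    """Assumed fusion rule: elementwise product of audio and visual probabilities."""
    return [a * v for a, v in zip(audio_probs, visual_probs)]


class OnlineSpeakerClasses:
    """Leader-follower clustering of azimuths; a stand-in for on-line classification."""

    def __init__(self, max_classes, radius=20.0, rate=0.1):
        self.max_classes = max_classes
        self.radius = radius
        self.rate = rate
        self.centroids = []

    def assign(self, azimuth):
        if self.centroids:
            k = min(
                range(len(self.centroids)),
                key=lambda j: circ_dist(azimuth, self.centroids[j]),
            )
            near = circ_dist(azimuth, self.centroids[k]) <= self.radius
            if near or len(self.centroids) >= self.max_classes:
                # Move the centroid toward the observation along the shortest arc.
                delta = (azimuth - self.centroids[k] + 180.0) % 360.0 - 180.0
                self.centroids[k] = (self.centroids[k] + self.rate * delta) % 360.0
                return k
        self.centroids.append(azimuth % 360.0)
        return len(self.centroids) - 1


def diarize_frame(audio_probs, face_azimuths, classes, threshold=0.5):
    """Return the sorted speaker-class indices judged active in one frame."""
    fused = fuse(audio_probs, face_prior(face_azimuths))
    active = set()
    for i, p in enumerate(fused):
        if p >= threshold:
            active.add(classes.assign(i * 360.0 / N_BINS))
    return sorted(active)
```

Because each frame is processed independently and the class centroids are updated incrementally, a loop calling `diarize_frame` once per frame never needs the whole recording, which mirrors the on-line property stated in the abstract. More than one class can be returned for a frame, so overlapping speech is representable.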

Year	DOI	Venue
2009	10.1145/1647314.1647327	ICMI
Keywords	Field	DocType
audio-visual location information,speaker classification,speaker location,speaker presence,speaker segmentation,speaker diarization,multiple simultaneous speaker,probabilistic fusion,speaker diarization method,speaker class,physical space,feature extraction,system design,speech segmentation,particle filter,speech activity detection	Computer vision,Voice activity detection,Computer science,Particle filter,Microphone array,Feature extraction,Speech recognition,Speaker recognition,Artificial intelligence,Speaker diarisation,Probabilistic logic,Microphone	Conference
Citations	PageRank	References	Authors
6	0.46	32	5

Authors (5 rows)

Name	Order	Citations	PageRank
Kentaro Ishizuka	1	174	15.77
Shoko Araki	2	1726	158.79
Kazuhiro Otsuka	3	619	54.15
Tomohiro Nakatani	4	1327	139.18
Masakiyo Fujimoto	5	393	34.28
