Title |
---|
A realtime multimodal system for analyzing group meetings by combining face pose tracking and speaker diarization |
Abstract |
---|
This paper presents a realtime system for analyzing group meetings that uses a novel omnidirectional camera-microphone system. The goal is to automatically discover the visual focus of attention (VFOA), i.e. "who is looking at whom", in addition to speaker diarization, i.e. "who is speaking and when". First, a novel tabletop sensing device for round-table meetings is presented; it consists of two cameras with two fisheye lenses and a triangular microphone array. Second, from high-resolution omnidirectional images captured with the cameras, the position and pose of people's faces are estimated by STCTracker (Sparse Template Condensation Tracker); it realizes realtime robust tracking of multiple faces by utilizing GPUs (Graphics Processing Units). The face position/pose data output by the face tracker is used to estimate the focus of attention in the group. Using the microphone array, robust speaker diarization is carried out by a VAD (Voice Activity Detection) and a DOA (Direction of Arrival) estimation followed by sound source clustering. This paper also presents new 3-D visualization schemes for meeting scenes and the results of an analysis. Using two PCs, one for vision and one for audio processing, the system runs at about 20 frames per second for 5-person meetings. |
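
The abstract states that the face position/pose output of the tracker is used to estimate the focus of attention, but it does not say how. As a rough illustration only, the sketch below uses a nearest-bearing heuristic: each person is assumed to look at whichever other participant lies closest to their head yaw direction, provided the angular offset stays under a threshold. The function name `estimate_vfoa`, the 2-D tabletop coordinates, and the 30-degree threshold are all assumptions made for this example; this is not the authors' method.

```python
import math

def estimate_vfoa(positions, yaws_deg, max_offset_deg=30.0):
    """Hypothetical nearest-bearing heuristic for "who is looking at whom".

    positions: list of (x, y) face positions in a common tabletop frame.
    yaws_deg:  head yaw of each person in degrees, measured in the same
               frame (0 = +x axis, counter-clockwise positive).
    Returns, for each person, the index of the estimated target,
    or None if nobody lies within max_offset_deg of the head direction.
    """
    targets = []
    for i, (xi, yi) in enumerate(positions):
        best, best_off = None, max_offset_deg
        for j, (xj, yj) in enumerate(positions):
            if i == j:
                continue
            # Bearing from person i to person j.
            bearing = math.degrees(math.atan2(yj - yi, xj - xi))
            # Absolute angular difference, wrapped into [0, 180].
            off = abs((bearing - yaws_deg[i] + 180.0) % 360.0 - 180.0)
            if off < best_off:
                best, best_off = j, off
        targets.append(best)
    return targets

# Three participants: 0 faces 1, while 1 and 2 both face 0.
seats = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
print(estimate_vfoa(seats, [0.0, 180.0, -90.0]))
```

A practical system would likely smooth these per-frame decisions over time and could combine them with the speaker diarization output; the heuristic here ignores both.
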
Year | DOI | Venue |
---|---|---|
2008 | 10.1145/1452392.1452446 | ICMI |
Keywords | Field | DocType |
high-resolution omnidirectional image,novel omnidirectional camera-microphone system,face position,speaker diarization,multiple face,microphone array,group meeting,realtime system,novel tabletop,realtime robust tracking,face tracker,realtime multimodal system,face tracking,frames per second,voice activity detection,high resolution | Computer vision,Computer science,Voice activity detection,Visualization,Microphone array,Speech recognition,Frame rate,Artificial intelligence,Speaker diarisation,Fisheye lens,Audio signal processing,Facial motion capture | Conference |
Citations | PageRank | References |
41 | 2.45 | 16 |
Authors |
---|
6 |
Name | Order | Citations | PageRank |
---|---|---|---|
Kazuhiro Otsuka | 1 | 619 | 54.15 |
Shoko Araki | 2 | 1726 | 158.79 |
Kentaro Ishizuka | 3 | 174 | 15.77 |
Masakiyo Fujimoto | 4 | 393 | 34.28 |
Martin Heinrich | 5 | 41 | 2.45 |
Junji Yamato | 6 | 1120 | 165.72 |