Title
Speech and Language Processing for Multimodal Human-Computer Interaction
Abstract
In this paper, we describe our recent work at Microsoft Research, in the project codenamed Dr. Who, aimed at developing enabling technologies for speech-centric multimodal human-computer interaction. In particular, we present in detail MiPad, the first Dr. Who application, which specifically addresses the mobile user interaction scenario. MiPad is a wireless mobile PDA prototype that enables users to accomplish many common tasks using a multimodal spoken language interface and wireless-data technologies. It fully integrates continuous speech recognition and spoken language understanding, and provides a novel solution to the prevailing problem of pecking with tiny styluses or typing on minuscule keyboards on today's PDAs and smart phones. Although the current implementation is incomplete, the user study reported in this paper shows that speech and pen have the potential to significantly improve the user experience. In this system-oriented paper we describe the main components of MiPad, with a focus on robust speech processing and spoken language understanding. The MiPad components discussed in detail include: distributed speech recognition considerations in the design of the speech processing algorithms; a stereo-based speech feature enhancement algorithm used for noise-robust front-end speech processing; Aurora2 evaluation results for this front-end processing; speech feature compression (source coding) and error protection (channel coding) for distributed speech recognition in MiPad; HMM-based acoustic modeling for continuous speech recognition decoding; a unified language model integrating a context-free grammar and an N-gram model for speech decoding; schema-based knowledge representation for MiPad's personal information management task; a unified statistical framework that integrates speech recognition, spoken language understanding and dialogue management; the robust natural language parser used in MiPad to process the speech recognizer's output; machine-aided grammar learning and development for spoken language understanding in the MiPad task; Tap & Talk multimodal interaction and user interface design; back-channel communication and MiPad's error repair strategy; and finally, user study results that demonstrate the superior throughput achieved by the Tap & Talk multimodal interaction over the existing pen-only PDA interface. These user study results highlight the crucial role played by speech in enhancing the overall user experience in MiPad-like human-computer interaction devices.
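The noise-robust front end named in the abstract and keywords is a stereo-based feature enhancement (SPLICE-style) algorithm: bias corrections are learned from time-aligned stereo pairs of clean and noisy cepstral frames and applied to noisy features at run time. The sketch below illustrates only this general stereo-based bias-removal idea and is not the authors' implementation; the function names, the mixture count, the synthetic stand-in data, and the use of scikit-learn's GaussianMixture are illustrative assumptions.

```python
# Minimal sketch of a SPLICE-style stereo-based feature enhancement step.
# Not the MiPad implementation; names, shapes, and the scikit-learn GMM
# are illustrative assumptions only.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_splice(noisy_cep, clean_cep, n_mixtures=64, seed=0):
    """Fit a GMM over noisy cepstra and learn one bias vector per mixture
    from time-aligned stereo (clean, noisy) training frames."""
    gmm = GaussianMixture(n_components=n_mixtures,
                          covariance_type="diag",
                          random_state=seed).fit(noisy_cep)
    post = gmm.predict_proba(noisy_cep)      # p(s | y) for each frame
    diff = clean_cep - noisy_cep             # x - y for each frame
    # Posterior-weighted average of (x - y) gives the correction r_s.
    bias = (post.T @ diff) / post.sum(axis=0, keepdims=True).T
    return gmm, bias

def enhance(noisy_cep, gmm, bias):
    """MMSE-style estimate: x_hat = y + sum_s p(s|y) * r_s."""
    post = gmm.predict_proba(noisy_cep)
    return noisy_cep + post @ bias

# Usage with random stand-in data (13-dimensional cepstral frames):
rng = np.random.default_rng(0)
clean = rng.normal(size=(2000, 13))
noisy = clean + rng.normal(scale=0.5, size=(2000, 13))
gmm, bias = train_splice(noisy, clean, n_mixtures=8)
denoised = enhance(noisy, gmm, bias)
```

Because the run-time correction reduces to a posterior-weighted sum of per-mixture bias vectors, this kind of enhancement stays cheap on the client side, which is consistent with the distributed (client-server) speech recognition design discussed in the paper.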
Year
2004
DOI
10.1023/B:VLSI.0000015095.19623.73
Venue
Journal of Signal Processing Systems
Keywords
speech-centric multimodal interface, human-computer interaction, robust speech recognition, SPLICE algorithm, denoising, online noise estimation, distributed speech processing, speech feature encoding, error protection, spoken language understanding, automatic grammar learning, semantic schema, user study
Field
Speech corpus, Speech processing, Multimodal interaction, Speech analytics, Computer science, Speech recognition, Human–computer interaction, User interface, Speech technology, Spoken language, Language model
DocType
Journal
Volume
36
Issue
2/3
ISSN
0922-5773
Citations
4
PageRank
0.42
References
10
Authors
9
Order  Name              Citations / PageRank
1      Deng, Li          9691728.14
2      Ye-Yi Wang        40.42
3      Kuansan Wang      131095.70
4      A. Acero          7511.56
5      Hsiao-Wuen Hon    1719354.37
6      James G. Droppo   40.42
7      C. Boulis         423.67
8      Milind Mahajan    344.13
9      Xuedong Huang     1390283.19