Multimodal Input

Motivation



A robust and accurate input interpretation engine that understands diverse user expressions in context is highly desirable. Since building a general-purpose interpretation engine is very difficult, we focus only on understanding user inputs for information-seeking applications. Our current multimodal input interpretation framework is called TAI-CHI (Two-Way Adaptative Interpretation for Context-sensitive Human-computer Interaction).

Approach



We employ three complementary strategies to enable robust and accurate interpretation of diverse user data requests in context. First, we use a context-driven approach that optimizes TAI-CHI's interpretation of multimodal inputs by exploiting various contexts simultaneously. Second, we provide system guidance in context so that users and TAI-CHI can adapt to each other's expressions over time. Third, we leverage the strengths of multiple modalities to achieve robust interpretation.

Context-driven interpretation


Currently, TAI-CHI focuses on user requests to databases. As we have observed (e.g., in our Wizard-of-Oz (WOZ) study), these requests vary substantially in syntax but share a common semantic structure. Based on this observation, we use a set of semantic constructs to model a user request. Specifically, a user request includes two top-level constructs: intention and attention. Intention encodes the user's information-seeking task (e.g., data access or comparison). Attention captures the data target of the intention and is made up of lower-level constructs, such as the data concepts or attributes to be retrieved and a set of constraints that the retrieved data must satisfy. A request also includes derived meta features that characterize its overall properties. These meta features are used to tailor TAI-CHI responses to the query context.

To interpret an input, TAI-CHI first identifies the semantic constructs using a lexicon that is largely derived automatically from the databases. TAI-CHI then resolves references and semantic ambiguities by uniformly modeling contextual cues, including conversation history, data semantics, and syntactic information derived from a syntactic parser, as a set of constraints. It then uses optimization-based approaches to derive the most probable interpretation of an input by maximizing the satisfaction of all constraints. Because all constraints are considered simultaneously, TAI-CHI can handle a wide range of user expressions regardless of their syntactic forms, from keywords (e.g., "colonials 3+ bedrooms") to full English sentences, all in context.

Such flexibility is much appreciated in a practical application, where TAI-CHI must accommodate various user linguistic styles and tolerate imperfect user inputs (e.g., abbreviated and ungrammatical expressions). Moreover, our approach helps minimize the effort of supporting new domains, since it requires neither a large training corpus nor a large set of syntactic rules.
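The constructs and the constraint-based selection described above can be illustrated with a small sketch. It is not TAI-CHI's implementation: the class names (`Request`, `Attention`, `Constraint`), the slot names, the cue weights, and the exhaustive weighted-sum search are all assumptions made for illustration, standing in for the optimization-based approaches described in the publications listed on this page.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from itertools import product
from typing import Callable


@dataclass(frozen=True)
class Constraint:
    attribute: str  # e.g. "bedrooms"
    operator: str   # e.g. ">="
    value: object   # e.g. 3


@dataclass
class Attention:
    concepts: list[str]  # data concepts/attributes to retrieve
    constraints: list[Constraint] = field(default_factory=list)


@dataclass
class Request:
    intention: str       # e.g. "data_access" or "comparison"
    attention: Attention


# A reading assigns one candidate value to each ambiguous slot.
Reading = dict[str, str]
# A cue scores how well a reading satisfies one contextual constraint (0..1).
Cue = Callable[[Reading], float]


def best_reading(
    candidates: dict[str, list[str]],
    cues: list[tuple[float, Cue]],
) -> tuple[Reading, float]:
    """Return the reading with the highest weighted cue satisfaction.

    Exhaustive search is used only because the example is tiny.
    """
    slots = list(candidates)
    best: Reading = {}
    best_score = float("-inf")
    for combo in product(*(candidates[s] for s in slots)):
        reading = dict(zip(slots, combo))
        score = sum(weight * cue(reading) for weight, cue in cues)
        if score > best_score:
            best, best_score = reading, score
    return best, best_score


# Hypothetical ambiguities for the keyword input "colonials 3+ bedrooms".
candidates = {
    "intention": ["data_access", "comparison"],
    "numeric_target": ["bedrooms", "bathrooms"],
}

# Hypothetical cues and weights, one per kind of context named in the text.
cues: list[tuple[float, Cue]] = [
    # Syntactic cue: "3+" is adjacent to "bedrooms" in the input.
    (1.0, lambda r: 1.0 if r["numeric_target"] == "bedrooms" else 0.0),
    # Conversation history: the previous request was a data access.
    (0.5, lambda r: 1.0 if r["intention"] == "data_access" else 0.0),
    # Data semantics: the target must be a numeric attribute in the database.
    (0.5, lambda r: 1.0 if r["numeric_target"] in {"bedrooms", "bathrooms"} else 0.0),
]

reading, score = best_reading(candidates, cues)

request = Request(
    intention=reading["intention"],
    attention=Attention(
        concepts=["house"],
        constraints=[
            Constraint("style", "=", "colonial"),
            Constraint(reading["numeric_target"], ">=", 3),
        ],
    ),
)
print(request)
```

Scoring every cue against every candidate reading at once, rather than applying the cues in a fixed order, is what lets a weak cue be overridden by stronger ones. A realistic system would need a more scalable search than the brute-force enumeration used here.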

Adaptive interpretation


Despite the efforts described above to achieve more accurate and robust interpretation, TAI-CHI’s interpretation capability may still be insufficient for our targeted real-world applications. Instead of improving TAI-CHI’s interpretation capability directly in a conventional way, we build a two-way adaptation engine that allows both users and TAI-CHI to dynamically adapt to each other’s expressions in the course of interaction [Pan-IUI05]. The adaptation enhances the usability of TAI-CHI by turning a novice user into a power user who can work effectively within TAI-CHI’s capability. Moreover, TAI-CHI improves its own interpretation capability through self-adaptation, which reduces the overall effort of developing an effective interaction system.

Leveraging GUIs and language inputs


Besides combining language inputs and deictic gestures as in other systems, we have explored the use of GUIs to complement language inputs, for two reasons. First, it is easier for users to express certain data requests through GUIs (e.g., using a slider for a dynamic data query). Second, GUI inputs are explicit and thus help TAI-CHI process the accompanying language input. By default, TAI-CHI interprets a user request in the context of previous requests. However, users may break from the previous conversation flow without explicitly signaling the break through language cues. While TAI-CHI is able to detect some of these breaks, it also lets users press a GUI button to explicitly signal the start of a new flow. In fact, users can use different GUI buttons to control a conversation flow, including interrupting a TAI-CHI response (barging in), starting over (wiping out the entire conversation




Publications



  • Joyce Chai, Shimei Pan and Michelle X. Zhou. MIND: A Context-based Multimodal Interpretation Framework in Conversation Systems. Natural, Intelligent and Effective Interaction in Multimodal Dialogue Systems, J. Kuppervelt, L. Dybkjaer and N. Bernsen (eds). Kluwer. 2005. To appear.
  • Shimei Pan, Siwei Shen, Michelle X. Zhou and Keith Houck. Two-Way Adaptation for Robust Input Interpretation for Practical Multimodal Interaction. Proceedings of ACM Conference on Intelligent User Interfaces (IUI), pages 25-32, 2005.
  • Joyce Chai, Pengyu Hong and Michelle X. Zhou. A Probabilistic Approach to Reference Resolution in Multimodal Interfaces. Proceedings of ACM Conference on Intelligent User Interfaces (IUI), pages 70-77, 2004.
  • Joyce Chai, Pengyu Hong, Michelle X. Zhou and Zahar Prasov. Optimization in Multimodal Interpretation. Proceedings of Association of Computational Linguistics (ACL), pages 1-8, 2004.
  • Keith Houck. Contextual Revision in Information Seeking Conversation Systems. Proceedings of International Conference on Spoken Language Processing (ICSLP), 2004.
  • Joyce Chai. Semantics-based Representation for Multimodal Interpretation in Conversational Systems. Proceedings of International Conference on Computational Linguistics (COLING), 2002.
  • Joyce Chai. Operations for Context-based Multimodal Interpretation. Proceedings of International Conference on Spoken Language Processing (ICSLP), 2002.
  • Joyce Chai, Shimei Pan and Michelle X. Zhou. MIND: A Semantics-based Multimodal Interpretation Framework for Conversation Systems. Proceedings of International CLASS Workshop on Natural, Intelligent and Effective Interaction in Multimodal Dialog Systems, 2002.
  • Joyce Chai, Shimei Pan, Michelle X. Zhou and Keith Houck. Context-based Multimodal Input Understanding in Conversational Systems. Proceedings of IEEE International Conference on Multimodal Interfaces (ICMI), pages 87-92, 2002.