Motivation
In a highly dynamic user-computer interaction system such as ours, it is difficult to predict how an interaction will unfold. It is thus impractical to plan in advance the content and form of every possible system response.
Approach
To tailor system responses to a user interaction context, we have developed an intelligent multimedia presentation framework called YOGA (Your Oration and Graphics Automation). YOGA consists of three key modules that together create a tailored, multimedia system output: 1) content selection, which dynamically decides the proper response content (e.g., which subset of data attributes to present), 2) media allocation, which assigns the most suitable presentation media, such as speech or graphics, to convey the intended content, and 3) media-specific designers, which design the most effective media-specific presentation form (e.g., graphical or verbal output). More importantly, we take practical issues into account when developing YOGA technologies so that they achieve the desired system coverage and extensibility. Specifically, we devise an optimization-based framework to select response content and allocate suitable media, and we combine machine learning with other approaches to dynamically synthesize verbal and visual responses. In addition to tailoring responses to individual requests, our approaches also leverage user input patterns to customize responses to a specific user interaction flow.
Optimization-based content and media selection
A user interaction context consists of a number of factors, including query expressions and conversation history. Even subtle variations in these factors, such as changes in data volume or query patterns, often call for different response content or presentation media. Because handling such diverse situations exhaustively may require a huge set of rules or plans, conventional rule/plan-based approaches are impractical. Instead, we have developed an optimization-based framework for content selection and media allocation, called IMPRESA (Intelligent Multimedia PRESentation Authoring) [Zhou-UIST04, Zhou-IUI05]. In this framework, we uniformly model all factors as presentation desirability/cost constraints (e.g., a presentation cost constraint derived from device properties). We then use optimization-based algorithms to maximize the satisfaction of all constraints. For example, we use a graph-matching algorithm to allocate media by maximizing the satisfaction of all allocation constraints [Zhou-IUI05]. As a result, our work optimizes content and media selection by dynamically balancing a comprehensive set of factors. Moreover, our approaches can be easily extended to cover new situations, since adding a new constraint does not require modifying the underlying algorithms.
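The following is a minimal Python sketch of this idea, not the IMPRESA implementation. Each candidate data attribute carries an illustrative desirability and presentation cost, constraints are plain predicate functions, and a brute-force search picks the feasible subset with the highest total desirability. The attribute names, the numbers, and the functions `within_budget`, `must_include`, and `select_content` are all assumptions made for illustration; the published algorithms are described in [Zhou-UIST04, Zhou-IUI05].

```python
from itertools import combinations

# Hypothetical attributes with made-up desirability (relevance to the request)
# and presentation cost (e.g., screen space or speaking time).
ATTRIBUTES = {
    "price":    {"desirability": 0.9, "cost": 2},
    "bedrooms": {"desirability": 0.8, "cost": 1},
    "style":    {"desirability": 0.6, "cost": 1},
    "year":     {"desirability": 0.4, "cost": 1},
    "photo":    {"desirability": 0.7, "cost": 4},
}


def within_budget(budget):
    """Cost constraint, e.g. derived from device properties (assumed form)."""
    def constraint(selection):
        return sum(ATTRIBUTES[a]["cost"] for a in selection) <= budget
    return constraint


def must_include(attribute):
    """Constraint that forces one attribute into the response (assumed form)."""
    def constraint(selection):
        return attribute in selection
    return constraint


def select_content(constraints):
    """Return the feasible attribute subset with the highest total desirability.

    Exhaustive search is used only to keep the sketch short; it stands in for
    the optimization algorithms of the real system.
    """
    names = sorted(ATTRIBUTES)
    best, best_score = (), -1.0
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            if all(check(subset) for check in constraints):
                score = sum(ATTRIBUTES[a]["desirability"] for a in subset)
                if score > best_score:
                    best, best_score = subset, score
    return set(best)


print(sorted(select_content([within_budget(5)])))
# ['bedrooms', 'price', 'style', 'year']
print(sorted(select_content([within_budget(5), must_include("photo")])))
# ['bedrooms', 'photo']
```

In this sketch, adding the `must_include` constraint changes the result without any change to `select_content`, which mirrors the extensibility argument made above.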
Example-based media-specific design
For reasons similar to those given above, it is also impractical to use a rule/plan-based approach to create media-specific outputs. Instead, we employ case-based learning to create visual and verbal responses from a set of graphics examples and English sentence examples, respectively.
In particular, we have developed a natural speech generation system, called SEGUE (Spoken English Generation Using Examples), which can automatically produce natural and coherent spoken utterances in a conversational setting [Pan-INLG04]. SEGUE uses an existing sentence corpus to dynamically decide proper words, utterance structures, and sentence boundaries [Pan-ACL05] (e.g., how to generate multiple sentences to convey the intended content).
Along the same lines as SEGUE, we have developed IMPROVISE*, an extension of the earlier planning-based graphics generation system IMPROVISE, to produce coherent, rich visual output. Unlike IMPROVISE, IMPROVISE* can cover far more dynamic user interaction situations and handle large, complex data sets by learning from and combining existing graphics examples [Zhou-IJCAI03].
Our learning engines for SEGUE and IMPROVISE* can not only reuse suitable examples but also compose new forms of output by dynamically combining different example fragments [Pan-INLG04, Zhou-IJCAI03]. As a result, we can cover a wide range of interaction situations using only a small number of examples. For example, IMPROVISE* uses about 20 visual examples and SEGUE uses around 200 sentence examples for our real estate application, which covers 25+ concepts, each with a number of attributes (e.g., a house has 40 attributes). A small example set makes it quick to set up a system. Moreover, we can easily extend a system's capability by adding new examples. Nonetheless, a case-based learning engine alone does not meet all our needs. For example, it is inefficient to use case-based learning to abstract sentence aggregation rules, since doing so would require a large number of examples. Similarly, case-based learning is inefficient at learning precise visual arrangements (e.g., exact positions and sizes). Thus, we use case-based learning to learn overall presentation structures (e.g., visual or sentence structure) and use other approaches to fine-tune presentation details (e.g., layout).
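To make the idea of composing output from example fragments concrete, here is a small Python sketch. It is not the SEGUE or IMPROVISE* algorithm: it substitutes a simple greedy cover, in which each stored example is reduced to the set of content attributes it expresses and the example that covers the most still-unexpressed attributes is picked repeatedly. The example names, attribute names, and the function `compose` are hypothetical.

```python
# Hypothetical example base: each stored example (a sentence or a graphic)
# is described only by the content attributes it is able to express.
EXAMPLES = {
    "ex_location":   {"city", "schools"},
    "ex_price_city": {"price", "city"},
    "ex_price_size": {"price", "size"},
    "ex_style_year": {"style", "year"},
}


def compose(request, examples):
    """Greedily cover the requested attributes with example fragments.

    Returns (plan, missing): plan lists (example name, attributes taken from
    it) in the order chosen; missing holds attributes no example can express.
    """
    remaining = set(request)
    plan = []
    while remaining:
        # Iterate in name order so that ties are broken deterministically.
        name, covered = max(
            ((n, attrs & remaining) for n, attrs in sorted(examples.items())),
            key=lambda item: len(item[1]),
        )
        if not covered:
            break  # nothing in the example base expresses what is left
        plan.append((name, sorted(covered)))
        remaining -= covered
    return plan, remaining


plan, missing = compose({"price", "size", "style"}, EXAMPLES)
print(plan)     # [('ex_price_size', ['price', 'size']), ('ex_style_year', ['style'])]
print(missing)  # set()
```

A request that no single example matches is still served by combining fragments from two examples, which is why a small example base can go a long way. The leftover `missing` set marks where a new example would have to be added.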
Context-sensitive output design
A better understanding of user input helps to create a more tailored response. In particular, YOGA leverages the fine-grained interpretation results produced by TAICHI to tailor its responses to a specific user interaction flow. Here we give two examples of how these results are used.
The feature followup is derived during TAICHI's input interpretation to signal whether a given user request is new or a continuation of a previous request. In Figure 1, U2 is a follow-on of U1, since it inherits certain data constraints specified in U1. To maintain semantic continuity between follow-on requests, the visual designer uses this feature to compute the amount of visual content overlap between two successive visual responses. In general, YOGA maximizes the overlap between follow-on requests, while reducing the overlap when a new flow starts [Wen-InfoVis05].
Figure 1. Sample conversation (U = user request, R = system response)
U1: 4 bedroom, 2 bathrm colonials
R1: I found 47 houses satisfying your criteria
U2: at least 2000 sq.ft.
R2: There are 26 houses
U3: built after 1990
R3: 1 house
R3’: Based on your request, I have narrowed down to 1 house
U4: what about any style
R4: I found 3 houses
U5: Tell me about the schools
Another derived input feature, navDirection, indicates a change of direction in the user's data navigation and also influences how YOGA creates a response. When exploring a data space, a user may change their data focus in several ways (Figure 1): filtering a data set (U3), expanding a data set (U4), or switching to a different data set (U5). Both our visual and verbal designers exploit this feature to tailor YOGA output to the user interaction flow. First, navDirection helps the visual designer decide how much visual context to maintain between displays. For example, if the system detects that a user is narrowing down a data set, it reduces visual content overlap across displays so that the user can focus on the filtered data set [Wen-InfoVis05]. Likewise, this feature helps the language designer decide how much information to repeat in successive verbal responses. To avoid repetition, the language designer generates progressively terser expressions, such as ellipses, in response to a series of similar requests (R3). It can also use this feature to generate more informative responses that confirm the current navigation direction (R3’).
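The verbal side of this behavior can be sketched in a few lines of Python. This is only an illustration built around the responses in Figure 1, not the language designer's actual method, which generates sentences from examples. The function `verbal_response` and its parameters `run_length` (how many similar requests have occurred in a row) and `confirm_direction` are assumptions.

```python
def verbal_response(count, run_length, confirm_direction=False):
    """Pick a response form that gets terser as similar requests repeat.

    count: number of houses matching the current request.
    run_length: position of this request in a run of similar requests
        (1 for the first one). Hypothetical parameter.
    confirm_direction: if True, produce the more informative variant that
        confirms the user is narrowing down the data set (as in R3').
    """
    noun = "house" if count == 1 else "houses"
    if confirm_direction:
        return f"Based on your request, I have narrowed down to {count} {noun}"
    if run_length <= 1:
        return f"I found {count} {noun} satisfying your criteria"
    if run_length == 2:
        verb = "is" if count == 1 else "are"
        return f"There {verb} {count} {noun}"
    return f"{count} {noun}"  # elliptical form for later requests in the run


print(verbal_response(47, 1))        # I found 47 houses satisfying your criteria
print(verbal_response(26, 2))        # There are 26 houses
print(verbal_response(1, 3))         # 1 house
print(verbal_response(1, 3, True))   # Based on your request, I have narrowed down to 1 house
```

The four calls reproduce R1, R2, R3, and R3’ from Figure 1. A fixed template table like this would not scale to the range of situations described above; it serves only to show how an input feature can drive how much is repeated.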
Publications
- Shimei Pan and James C. Shaw. Instance-based Sentence Boundary Determination by Optimization. Proceedings of Association for Computational Linguistics (ACL), 2005.
- Zhen Wen, Michelle X. Zhou and Vikram Aggarwal. An Optimization-based Approach to Dynamic Visual Context Management. Proceedings of IEEE Symposium on Information Visualization (InfoVis), 2005.
- Michelle X. Zhou, Zhen Wen and Vikram Aggarwal. A Graph-Matching Approach to Dynamic Media Allocation in Intelligent Multimedia Interfaces. Proceedings of ACM Conference on Intelligent User Interfaces (IUI), pages 114-121, 2005. Best paper award.
- Shimei Pan and James C. Shaw. SEGUE: A Hybrid Case-Based Surface Natural Language Generator. Proceedings of International Conference of Natural Language Generation (INLG), pages 130-140, 2004.
- Michelle X. Zhou and Vikram Aggarwal. An Optimization-based Approach to Dynamic Data Content Selection in Intelligent Multimedia Interfaces. Proceedings of the ACM Symposium on User Interface Software and Technology (UIST), pages 227-236, 2004.
- Michelle X. Zhou and Min Chen. Automated Generation of Graphical Sketches by Example. Proceedings of International Joint Conference on Artificial Intelligence (IJCAI), pages 65-74, 2003.
