This page describes conversational solutions for telematics and similar embedded applications. It illustrates the latest in Embedded Natural Language (Free-Form) Interaction and Dialog Management technologies from IBM. This system enables users to converse naturally with devices such as a music player and GPS navigation system to accomplish their objectives easily and efficiently. Users can speak in their own words without having to navigate through complex menus or remember specific keywords.
1. High Level Architecture
Figure 1 shows the high-level architecture of this system. It consists of the following core components:
• CIMA (Conversational Interaction Management Architecture)
• ASR (Automatic Speech Recognition) and TTS (Text-to-Speech) engines
• Generic interfaces for connecting to ASR and TTS
The following are application specific components:
• Application specific dialog strategy – dialog logic and prompts – that defines the call flow
• SLM (Statistical Language Model) and grammars for speech recognition
• MAC (Multi-level Action Classification) model for interpretation
• Custom interfaces to Music and GPS devices and the GUI
Figure 1
2. Natural Language Understanding
Natural language speech recognition is carried out using an SLM, with embedded grammars for variable items such as names of songs, artists, albums, and cities (Picheny et al., 2005). Some of these embedded grammars are updated dynamically during the dialog session. Interpretation is carried out using a statistical classifier that operates on the speech recognition output. This approach makes the system robust to minor recognition errors.
The following are key features of IBM’s NLU technology:
• Free-form recognition of multiple items (tokens) in a single request, for example, "I want to hear sonata number eleven by Mozart". Unlike other systems, there is no need to prefix tokens with a keyword, as in "I want to hear song sonata number eleven".
• Direct access to most functions, which avoids tedious menus and the need to remember confusing context-specific commands (McTear 2004).
• Free-form recognition of items from very large lists, for example, thousands of cities in a country. Users can simply say, "Take me to the center of Boston" without having to first specify a state and then the city.
• Recognition and disambiguation of partial names for streets, artists, etc., eliminating the need to remember official names. Creating embedded grammars within an SLM is similar to inducing finite-state grammars from data (Dupont). The trick is to overgenerate the grammar in ways consistent with how people actually speak the names (Pico).
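The idea of overgenerating partial names can be illustrated with a minimal Python sketch. It is not IBM's grammar-generation method: the function names are hypothetical, and taking every contiguous word sequence is only one simple assumption about how people shorten names. A real system would presumably prune variants that nobody says.

```python
def partial_name_variants(full_name, min_words=1):
    """Return every contiguous word sequence of a name, longest first."""
    words = full_name.split()
    variants = []
    for length in range(len(words), min_words - 1, -1):
        for start in range(len(words) - length + 1):
            variants.append(" ".join(words[start:start + length]))
    return variants


def build_variant_index(names):
    """Map each spoken variant (lowercased) to the official names it may refer to."""
    index = {}
    for name in names:
        for variant in partial_name_variants(name.lower()):
            index.setdefault(variant, set()).add(name)
    return index


# Illustrative street names, not taken from any real map data set.
index = build_variant_index(["Martin Luther King Boulevard", "King Street"])
print(sorted(index["king"]))         # ['King Street', 'Martin Luther King Boulevard']
print(sorted(index["king street"]))  # ['King Street']
```

A variant that maps to more than one official name, such as "king" here, is exactly the case that would be handed to the dialog manager for disambiguation.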
3. Conversational Interaction Management Architecture (CIMA)
CIMA provides a flexible architecture for multimodal Dialog Management and for component integration. CIMA includes a general-purpose state machine with a programmable interface for application-domain-specific dialog management. The Base Dialog Strategy serves as a template for common dialog management functions, which are customized by writing application-specific dialog logic. This is done using SCXML (State Chart XML) (SCXML07, SHALE07a, SHALE07b).
CIMA includes support for accepting asynchronous input from multiple devices and multiple modalities. For example, users can select an item from a disambiguation list using touch or a key press in addition to voice input. CIMA is designed so that applications can be ported easily to multiple languages by pointing them to the appropriate language-specific resources. CIMA also serves as the central hub that coordinates the various sources of input and output, the various devices, and any other needed interaction. SCXML was created for this kind of coordination and has been used in several multimodal architectures (Larson 2005, 2006, MMI06).
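The following minimal Python sketch shows the general pattern of a dialog state machine fed by an event queue that accepts input from more than one modality. It is an illustration under stated assumptions, not CIMA's interface: the system described here expresses its dialog logic in SCXML, and the state names, event names, and the `DialogStateMachine` class below are all hypothetical.

```python
from collections import deque

# Hypothetical transition table: (current state, event name) -> next state.
TRANSITIONS = {
    ("idle", "request"): "disambiguating",
    ("disambiguating", "select"): "confirmed",
    ("disambiguating", "restate"): "disambiguating",
    ("confirmed", "done"): "idle",
}


class DialogStateMachine:
    def __init__(self):
        self.state = "idle"
        self.queue = deque()   # events posted by any device or modality
        self.history = []      # (from_state, event, modality, to_state)

    def post(self, event, modality, payload=None):
        """Queue an event; the modality is recorded but does not affect routing."""
        self.queue.append((event, modality, payload))

    def run(self):
        """Process queued events in arrival order and return the final state."""
        while self.queue:
            event, modality, _payload = self.queue.popleft()
            next_state = TRANSITIONS.get((self.state, event))
            if next_state is None:
                continue  # event is not valid in the current state; ignore it
            self.history.append((self.state, event, modality, next_state))
            self.state = next_state
        return self.state


sm = DialogStateMachine()
sm.post("request", "voice", "play something by Mozart")
sm.post("select", "touch", 2)  # user taps the second list item
print(sm.run())  # confirmed
```

Because the transition table is keyed on the event rather than on the modality, a selection made by touch and one made by voice follow the same dialog path.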
Key dialog features of this system include:
• Support for extensive ambiguity resolution, e.g., disambiguating artist, song, and album names, city and street names, etc.
• Three types of ambiguity are considered in combination: ambiguity in a token's value, ambiguity in the combination of two or more tokens, and N-best results (including free-form results) when confidence is low. The collective ambiguity is presented to the user for resolution (LUMEN).
• Users can restate requests to recover from errors without always having to start over.
• List items can be selected in multiple ways: by speaking the item name, by using a selector phrase, by saying "that one" when the desired item is heard, or by using the GUI.
• User Profiles to enable user specific prompts, custom POIs, favorite lists, etc.
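How the three kinds of ambiguity listed above might be pooled into a single list can be sketched as follows. This is a simplified illustration, not the system's algorithm: the catalog records, the substring matching rule, the `confidence_threshold` value, and both function names are assumptions made for the example.

```python
def matching_records(tokens, catalog):
    """Records in which every spoken token value occurs in the matching field."""
    return [
        rec for rec in catalog
        if all(value.lower() in rec.get(field, "").lower()
               for field, value in tokens.items())
    ]


def collect_candidates(nbest, catalog, confidence_threshold=0.6):
    """Pool candidates for one user request.

    nbest: list of (confidence, {field: spoken value}), best hypothesis first.
    Only the top hypothesis is used unless its confidence is below the
    threshold, in which case every N-best hypothesis contributes candidates.
    """
    if not nbest:
        return []
    low_confidence = nbest[0][0] < confidence_threshold
    hypotheses = nbest if low_confidence else nbest[:1]
    candidates = []
    for _confidence, tokens in hypotheses:
        for rec in matching_records(tokens, catalog):
            if rec not in candidates:  # keep order, drop duplicates
                candidates.append(rec)
    return candidates


# Made-up catalog used only for illustration.
CATALOG = [
    {"artist": "Mozart", "song": "Sonata No. 11"},
    {"artist": "Mozart", "song": "Sonata No. 16"},
    {"artist": "Beethoven", "song": "Sonata No. 14"},
]

confident = [(0.9, {"artist": "mozart", "song": "sonata"})]
print(len(collect_candidates(confident, CATALOG)))  # 2

unsure = [(0.4, {"artist": "mozart", "song": "sonata"}),
          (0.3, {"artist": "beethoven", "song": "sonata"})]
print(len(collect_candidates(unsure, CATALOG)))  # 3
```

Matching all tokens of a hypothesis jointly covers both value ambiguity and combination ambiguity, and the low-confidence branch adds the N-best alternatives, so the user sees one combined list.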
4. Additional System Highlights
The following are additional system highlights:
• Handling dynamic content – dynamic loading and recognition of music tracks, location-specific POIs, etc.
• GPS data processing – normalizing GPS data and creating grammars from it. This required resolving issues with spelling and abbreviations, and automatically identifying reasonable partial expansions.
• Porting to an embedded platform – this required extensive tuning to run efficiently with less memory and fewer MIPS.
• Parallel development in multiple languages – dealing with language-specific issues using a single dialog strategy, for example, handling morphological problems not present in English, such as agreement of gender with ordinals, POI names, etc.
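As an illustration of the GPS data normalization mentioned above, the sketch below expands abbreviations in a raw street name before it is used to build a grammar. The abbreviation table and the function name are hypothetical; a real table would depend on the map data and the language, and some abbreviations are themselves ambiguous (for example, "St" may stand for "Street" or "Saint").

```python
import re

# Hypothetical abbreviation table for English street names.
ABBREVIATIONS = {
    "st": "street",
    "ave": "avenue",
    "blvd": "boulevard",
    "rd": "road",
    "n": "north",
}


def normalize_street_name(raw):
    """Lowercase a raw name, drop punctuation, and expand known abbreviations."""
    words = re.findall(r"[a-z0-9']+", raw.lower())
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)


print(normalize_street_name("N. Main St."))  # north main street
```

The normalized form could then be passed to a step that generates the partial variants people are likely to say.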
5. System Functionality and Screenshots
This section shows some screenshots and lists the main functions of the GPS and Music apps.
GPS navigation:
• Direct free-form access to all functions: free-form input of a city, and locating and plotting routes to addresses, generic POIs (hotels, restaurants, gas stations, etc.), and custom and predefined POIs.
• Controlling map display functions such as zoom, position of map, etc.
Music player:
• Free-form selection of music by various categories such as genre, artist, album, song name, or a combination of these.
Figure 2 – Screenshot: City disambiguation screen
Figure 3 – Screenshot: Album disambiguation screen
6. Demo
The following video is a demonstration of the 'Embedded Natural Language System' given at CES (Consumer Electronics Show) in Las Vegas, Jan. 2008.
CES.wmv (resolution: 320x240; file size: 19 MB; download time: >1 min.)
