IBM Research

Think Research


 


Featured Concept
The Human-Centered Interface
COVER STORY: Computers Come to Their Senses

By Emily Benedek

Rejecting the idea that people must adapt to machines, IBM scientists are enabling computers to listen to our speech, sense our gaze and read our body language.

In Brief:

Imagine using a computer without ever having to touch a keyboard or read a manual. Several "multimodal" systems are being developed to allow just that. Taking advantage of the human ability to communicate through a variety of channels at once, the new technologies aim to produce more natural forms of human-computer interaction.

Despite the technological distance that separates the stone ax and the flint knife from the modern-day computer, there is still something to be learned from those ancient implements. By necessity, the tools of the past were closely adapted to their human users. Their efficiency depended on a close match between their form and the strength and dexterity of those they served. Computers, on the other hand, have evolved quite differently. Originally designed for specialized use by highly trained users, they have only recently begun to be adapted to everyday life. That process is far from complete, but the goal has become clearer, thanks in large part to the efforts of researchers at IBM who are not content to let current technology dictate how humans and computers interact.

Consider, for instance, Mark Lucente, a staff member at IBM's Thomas J. Watson Research Center. He hates computer keyboards, and he's not particularly fond of desktop monitors. Moreover, he thinks it's asking too much to require people to learn awkward languages to get the information they need from a machine. In his view, computers should at least meet humans halfway, which for him means that they should have "the courtesy to respond to a person who is trying to communicate with them."

Lucente's research represents an emerging trend toward developing so-called multimodal interfaces that respond to voice, gesture and other forms of human-computer interaction. With similar ends in mind, John Vergo, manager of advanced human-computer interfaces at Watson, and advisory programmer Jennifer Lai are designing software that allows users to dictate and edit texts without the encumbrance of a keyboard. And a team at IBM's Almaden Research Center is creating technology that enables computers to determine the user's emotional and cognitive state by monitoring facial expressions, body gestures, speech and gaze (see "The Eyes Have It"). As such work progresses, the computer will gradually become a tool that adapts to humans instead of a tool that requires humans to adapt to it.

THE WORLD IN HIS HANDS

Judging from the stir Lucente created at the Comdex computer show in November with the unveiling of his Visualization Space, the world is ready for a more user-friendly form of human-computer interaction. In essence, what Lucente has in mind are machines that respond to the oldest means of communication: voice and gesture.

At his lab, Lucente repeats the demonstration that captured the interest of Comdex viewers. Standing before a wall-sized projection screen, Lucente commands, "Give me the world." Almost immediately, an image of the globe appears on the screen before him.

"Make it spin," says Lucente, and so it does. "Move it over here," he says, pointing to the lower right-hand corner of the screen. Dutifully, the globe follows.

A video camera, in this case embedded in the ceiling, serves as the eyes of a computer, an IBM Netfinity® 7000 PC running the Windows NT® operating system. Lucente and his late colleague Gert-Jan Zwart, collaborating with researchers at MIT, created machine-vision algorithms that find and track a user's head, hands and body. Lucente's system uses this spatial information to interpret the user's movements as meaningful instructions. The user's words, picked up by a small microphone in the shirt pocket, are recognized by IBM's ViaVoice™ Gold speech recognition software, introduced last summer. The system listens for any of a large number of commands.
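For readers who think in code, the way a spoken command and a tracked gesture combine can be sketched in a few lines of Python. This is only an illustration built on the three commands quoted in this article. The `Scene` class, the `handle` function and the pixel coordinates are hypothetical names invented for the sketch, and nothing here reflects how the Visualization Space itself is actually written.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Point = Tuple[float, float]  # assumed screen coordinates, in pixels


@dataclass
class Scene:
    """Toy scene state: one object, where it sits, and whether it spins."""
    obj: Optional[str] = None
    position: Point = (0.0, 0.0)
    spinning: bool = False


def handle(scene: Scene, utterance: str, pointed_at: Optional[Point]) -> Scene:
    """Apply one recognized utterance, using the tracked pointing target for 'here'."""
    text = utterance.lower().strip().rstrip(".")
    if text == "give me the world":
        scene.obj = "globe"
    elif text == "make it spin" and scene.obj:
        scene.spinning = True
    elif text == "move it over here" and scene.obj:
        # "Here" means nothing by itself; the gesture supplies the location.
        if pointed_at is not None:
            scene.position = pointed_at
    return scene
```

The point of the sketch is the last branch: speech says what to do, while the camera's estimate of where the hand is pointing says where to do it.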

The prototype turns the traditional notion of computer mastery on its head. Lucente believes it's time that computers made an effort to understand us, rather than the other way around. Applying this new approach, which he terms "natural interaction," Lucente can communicate the way he knows best, with everyday language. "I'm controlling this stuff," he says, gesturing toward the screen, "just by talking and pointing and walking around. The beauty is that I already know how to talk and walk around. There are billions of people in the world who don't currently use computers, many of whom say, 'I don't like computers. I don't know how to use them.' Natural interaction means that they already do know how to use them."

Lucente issues another command, "Let's look at some molecules." On the screen appear a pair of multicolored renderings of atoms and chemical bonds. When he moves his hands, the two molecules rotate and move -- each in a different direction -- on the screen.

"The reason I call this the Visualization Space," he says, "is that I created it for scientific visualization -- for example, for pharmaceutical researchers who want to look at the shapes of molecules and the spatial relationships among them. Natural interactivity provides a way to move them around, using both your hands, which is something you can't do on a regular computer."

Lucente also believes that friendlier, more natural computer interaction is perfect for teaching, as it allows students to interact with the subjects they are studying and even to embark on virtual travels in time or space. A request to see dinosaurs might call up scenes from Jurassic Park or a prehistoric world. And as students ask questions about the inhabitants of that world, answers can be flashed on omnipresent screens embedded in walls, tables and other objects.

Lucente came up with the idea of natural computing while working on his doctoral thesis at MIT. Along with a group at the Media Lab, he created the first-ever dynamic holographic display, which produced lifelike images using a three-dimensional computer graphic model. It occurred to him that, "since computer images have the potential to look as real as the real world, we should be able to interact with them like we interact with the real world."

So Lucente decided to dedicate himself to making human-machine interaction as similar as possible to human interaction in the real world. While you can't actually feel or pick up a computer-generated image -- even a holographic one -- the Visualization Space gives people who try it the strong sensation that they are manipulating objects. And what Lucente has thus far accomplished -- the result of two years' work -- is just a hint of what is to come. "I call this the can opener," says Lucente. "It's to open people's minds to think about how we now have the technology to start designing computers from the ground up, ideally suited to humans, and even better, to give the computer the ability to understand each individual person, so it can learn to interpret what we mean when we say things."

Lucente envisions the day when houses will contain a myriad of embedded screens, cameras, sensors and processors capable of becoming intimately familiar with a person's tastes, moods and habits. Imagine coming home from a grueling day and finding that your household computer, sensing the heaviness in your approaching footsteps, has already popped a soothing CD into the stereo.

TAKING DICTATION

For Vergo and Lai, improving the computer interface relies on a different multimodal approach: the researchers combine pointing with speech recognition technology that includes natural language understanding (NLU). Their latest project grew directly from their work on MedSpeak, the world's first large-vocabulary continuous-speech recognition product, which allows radiologists to dictate reports. Based on their experience with MedSpeak -- Vergo had been its architect and Lai had designed the user interface -- they came to realize that moving from the restricted domain of radiology to general English would require certain accommodations. While the arrival of ViaVoice Gold enabled computers to recognize a broad enough range of speech, Vergo points to two long-standing issues of speech recognition systems that needed to be addressed.

First, users must know what sorts of commands the system will recognize. Second, they must be able to navigate easily around the documents they have dictated in order to edit them. Navigation is important not only because transcription errors can arise, notes Vergo, but even more because most of us are imperfect dictators. Some basic questions, says Lai, are: "How does one tell the computer where to go -- which sentence to look at and work on? Does one use a mouse or point with one's finger? Must one move the cursor using directions, such as 'two rows up, three characters to the left,' or is there another, more user-friendly approach?"

The simplest solution to the first problem is to create a list of key commands -- such as "print" and "open new file" -- that the system will recognize. "But," says Vergo, "that is a very constrained approach." A less restrictive approach would be to allow more complex phrases such as "save it and print," but, again, if the phrases are not in the system's list, it can't comply. The most general, but also the most difficult, solution would be to design a system based on natural language understanding, a system that can interpret the meaning of words and phrases without having to "learn" them in advance. "That removes the onus from users of remembering a list of commands, letting them focus on the task they want to accomplish by talking to the computer in a normal way," says Vergo. The problem is that NLU technology is not yet ready to handle a completely unconstrained vocabulary.

"DELETE THIS SENTENCE"

Even so, Vergo and Lai have managed to endow their system with a form of understanding by basing it on an NLU "engine" being developed by Salim Roukos in Watson's speech recognition group (see "Talking to Machines," Research, Number 1, 1997). They have taught their prototype what they call "document structures" -- the concept of words, sentences, paragraphs, and beginnings and endings. On the basis of that knowledge, which is not simply embodied in a list, the system can understand and respond to such phrases as "Uppercase this word," "Delete this sentence," "Move this paragraph to the end of the file," and more subtle instructions, such as "Go to the word 'dog' and delete to the end of the sentence." It can also execute global commands such as "Change 'May 3' to 'May 5' throughout the file."

Of course, to refer to a particular word or sentence requires some way of pointing. The obvious tool would be a mouse, but Vergo and Lai would prefer a pointing mechanism that is more direct, as well as easier for people with limited dexterity to use. They have therefore designed a system that provides the option of using a pen-based LCD tablet, in which the user points a stylus directly at the word or line of interest. The graphical user interface, designed by Lai, sets aside regions of the screen for dictation, user commands and actions taken by the system.
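To make the idea of "document structures" concrete, here is a minimal Python sketch that resolves "this word" and "this sentence" from a pointer position and carries out three of the commands quoted above. It is emphatically not the researchers' system: it matches fixed phrases, which is the constrained approach Vergo describes, rather than performing natural language understanding. The function names, the character-offset pointer and the naive rule that a sentence ends at a period, question mark or exclamation point are all assumptions made for the example.

```python
import re


def word_span(text, pos):
    """Return (start, end) of the word containing character offset pos."""
    for m in re.finditer(r"\w+", text):
        if m.start() <= pos < m.end():
            return m.start(), m.end()
    raise ValueError("pointer is not on a word")


def sentence_span(text, pos):
    """Return (start, end) of the sentence containing pos (naive: ends at . ! ?)."""
    for m in re.finditer(r"[^.!?]+[.!?]?\s*", text):
        if m.start() <= pos < m.end():
            return m.start(), m.end()
    raise ValueError("pointer is not in a sentence")


def execute(text, command, pos=None):
    """Apply one spoken command; pos is where the stylus points (a character offset)."""
    spoken = command.strip().rstrip(".")
    cmd = spoken.lower()
    if cmd in ("uppercase this word", "delete this sentence"):
        if pos is None:
            raise ValueError("this command needs a pointer position")
        if cmd == "uppercase this word":
            s, e = word_span(text, pos)
            return text[:s] + text[s:e].upper() + text[e:]
        s, e = sentence_span(text, pos)
        return text[:s] + text[e:]
    # Global command: no pointing needed, so pos is ignored.
    m = re.fullmatch(r"change '(.+)' to '(.+)' throughout the file",
                     spoken, flags=re.IGNORECASE)
    if m:
        return text.replace(m.group(1), m.group(2))
    raise ValueError(f"unrecognized command: {command!r}")
```

A short usage example, with the stylus resting on the word "dog":

```python
doc = "The dog barked. We meet on May 3. Call me May 3."
print(execute(doc, "Uppercase this word", pos=4))
print(execute(doc, "Delete this sentence", pos=4))
print(execute(doc, "Change 'May 3' to 'May 5' throughout the file"))
```

Any phrase outside the three patterns raises an error, which is exactly the limitation that motivates the move to natural language understanding.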

The two researchers plan to have their application ready in about a year, but they expect it will grow in sophistication over time. Eventually, they hope to incorporate an intelligent search capability that will allow the user to describe passages conceptually rather than literally. "For example," Vergo explains, "you should be able to say, 'Go to the conclusion,' even if the word 'conclusion' never appears in the document. Or, if you're talking about a lawsuit, you could say, 'Go to the paragraph where we discuss the litigation of this case.' I think this is a very natural way for people to interact with their documents."

Further ahead, the researchers may even introduce a "social interface," in which an intelligent agent engages in dialogue with the user. "If the PC really evolves toward being an assistant that helps people accomplish their tasks," says Vergo, "then it should always be watching to understand when you're having problems and should try to make appropriate suggestions."

As these research ideas become available to users, the very notion of computing will change, evolving from a task we learn to one we intuit. Powerful tools will adapt to the strengths and weaknesses of their human users, regardless of their level of sophistication. Like Molière's Monsieur Jourdain, who was astonished to learn that he had been speaking prose his whole life, even complete novices may one day discover that they have known how to converse with computers all along.

FYI: http://www.research.ibm.com/imaging/vizspace.html


Emily Benedek is a writer based in New York City.


More Information:

What Your Computer Says About You

The Eyes Have It


What Your Computer Says About You

Interface design shapes our relationships not just with computers but also with other people, according to studies by the user ergonomics research group at IBM's Almaden Research Center. Psychologist Chris Dryer and IBM Fellow Ted Selker, who heads the group, have begun investigating the effects of human-computer interfaces on the way people judge one another during collaborative interactions. The results to date are intriguing.

In a recent study, the researchers invited people to the lab to collaborate in pairs on solving a problem, a task that involved the use of a handheld IBM PC 110 computer. In one case, they made it easy for both parties to see the display. In another case, they used a belt-worn version of the computer that was designed so that only the wearer could see the display. "The effects were striking and unexpected," says Dryer. "When the participants were asked to judge the personalities of the other person, the person wearing the device was consistently perceived to be less friendly, even though that person was in no way responsible for the restricted visibility of the display." None of the participants commented on the uncollaborative nature of the design of the device itself or gave any indication of being aware of the impact that the interface had on their social judgments.

Dryer believes that such studies of unconscious inferences might help to determine how technologies will be used. Designs that hinder human relations would simply not be adopted for the emerging field of collaborative computing -- which could range from gathering applicant information during an interview to using online information on a factory floor. If Dryer is right, the psychological implications of an interface design may come to be judged on an equal footing with ease of use.


The Eyes Have It

Speech and gestures are not the only promising avenues for communicating more naturally with computers. A team of scientists at IBM's Almaden Research Center is concentrating on extending the range of interactions. The project, known as Blue Eyes -- part of Almaden's 1997 Adventurous Systems and Software Research program (see "Adventures in Software," Research, Number 1, 1997) -- explores various ways of allowing people to operate computers without conscious effort. Gaze tracking seemed like a natural place to start.

The Almaden team is pursuing a novel approach called MAGIC (for manual acquisition with gaze-initiated cursor) pointing. "When people think of gaze, they typically think of replacing the mouse," says researcher Myron Flickner. "That's been tried more than once, and it's always failed." Instead, the eyes should complement the mouse. "By our method," says Shumin Zhai, "the cursor never moves until you touch the mouse or any other pointing device. Then the cursor jumps to the place on the screen that you were looking at." Current gaze-tracking technology is accurate to within about a degree -- half an inch or so on a typical screen -- which means the cursor usually lands close to its destination but requires manual control for the final approach.

But that's still a big improvement, according to Flickner. "By eliminating long cross-screen drags," he says, "one can save a considerable amount of time. In the ergonomics world, a 10 percent saving is considered good; our system saves closer to 50 percent. And that figure increases as displays get larger."

The beauty of MAGIC, Zhai says, is that it enables the hand and the eye to do what each does best. "Our approach takes advantage of the perceptual ability of the eye to reduce the effort required for manual pointing," he explains. "To the user, pointing is still done by the hand, the natural organ for manipulation, but the cursor always appears in about the right place, as if by magic." Previous efforts failed, says Zhai, because they tried to force the eye to perform a task better suited to the hand.
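The behavior Zhai describes, along with the "half an inch or so" figure, can be roughed out in Python. The sketch below is an illustration under stated assumptions, not the Almaden team's code. The `MagicCursor` class, its method names and the idea of signaling an idle mouse with a separate call are all hypothetical, and the 25-inch viewing distance used in the example is an assumed value for a typical desktop setup.

```python
import math


def gaze_error_inches(viewing_distance_in, error_deg=1.0):
    """Width on the screen subtended by error_deg of visual angle at this distance."""
    return 2.0 * viewing_distance_in * math.tan(math.radians(error_deg) / 2.0)


class MagicCursor:
    """Cursor that stays put until the pointing device is touched, then jumps to the gaze point."""

    def __init__(self, x=0.0, y=0.0):
        self.x, self.y = x, y
        self._in_motion = False  # True while the hand is in the middle of a move

    def on_mouse_move(self, dx, dy, gaze_xy):
        if not self._in_motion:
            # First touch of a new move: jump to where the user is looking.
            self.x, self.y = gaze_xy
            self._in_motion = True
        # The hand then makes the final approach with ordinary relative motion.
        self.x += dx
        self.y += dy

    def on_mouse_idle(self):
        """Call once the device has been still; the next touch jumps to the gaze again."""
        self._in_motion = False
```

At an assumed viewing distance of 25 inches, `gaze_error_inches(25)` comes out a little under half an inch, in line with the figure quoted above. In the cursor class, gaze alone never moves anything; it only chooses the starting point for a movement the hand initiates and completes.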




