IBM Research

Think Research


 


Featured Concept
Capture the moment

By Gary Taubes

Tracking down just the right video segment will soon be as easy as clicking a mouse.


The coming age of Internet video will be a feast for the eyes, with an endless array of entertainment, news, sports, information, training and education videos spread out for the choosing. The challenge is to make the feast digestible. Viewers and content providers will need to sort through the visual banquet for the videos they want, and then find the relevant portions. Somehow, digital images must be annotated and indexed automatically, allowing them to be searched with a simple query and a mouse click.

IBM researchers are working on technologies that do just that. A group at IBM's Thomas J. Watson Research Center is developing a system that will use sophisticated visual and audio analysis to automate the indexing of video content and make it easy for people to search videos in the high-definition television (HDTV) studios of the future. Other researchers at IBM's Almaden Research Center are automating the process of indexing and summarizing video content for distributed education and training.

A group at the Tokyo Research Laboratory, meanwhile, is working on a system that will turn the act of watching a soccer game into an interactive experience (see the sidebar "An eye for action"). In all cases, the researchers aim to make video as easy to peruse and search as text.

TRANSFORMING THE TV STUDIO

VideoVista, the HDTV project under way at Watson, began in 1995 as part of a project initiated by James Janniello and Ruud Bolle at IBM Research and funded by the National Institute of Standards and Technology to develop the components of the future HDTV studio. Bolle and his colleagues in the exploratory computer vision group set out to create a search and retrieval system that would use computer vision technology and sophisticated indexing techniques to extract a range of visual and textual information from any video segment. They wanted a system that could, say, identify who was speaking about what or annotate the slam dunks in a basketball game. The same system could then be used to store and retrieve videos, and make it easy to cull the highlights from a lengthy segment. Ideally, VideoVista could then be used to replace the teams of librarians that current-day TV studios use to find relevant clips.

The first step was to teach their system to recognize a human face when it sees one. This is done by looking for what Bolle calls "very simple information" -- for example, skin tones on elliptical shapes with horizontal features that would imply eyes, eyebrows and mouth. "Once you detect faces in the video," says Bolle, "the size of the face gives you the camera setting, whether it's a close-up, a medium shot or a long shot." Bolle and his colleagues are also developing ways to differentiate shots in which the camera is panning across the field of view from stationary shots where people or objects are moving.
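Bolle's point that face size reveals the camera setting can be illustrated with a minimal sketch. The function name and the ratio thresholds below are hypothetical, chosen only for illustration; the article does not say what values VideoVista uses.

```python
def classify_shot(face_height_px: float, frame_height_px: float,
                  close_up_min: float = 0.4, medium_min: float = 0.15) -> str:
    """Guess the camera setting from the detected face height.

    The thresholds are illustrative assumptions, not values from VideoVista.
    """
    if frame_height_px <= 0:
        raise ValueError("frame_height_px must be positive")
    # Fraction of the frame height that the face occupies.
    ratio = face_height_px / frame_height_px
    if ratio >= close_up_min:
        return "close-up"
    if ratio >= medium_min:
        return "medium shot"
    return "long shot"
```

A face that fills most of the frame is labeled a close-up; a small face in the same frame is labeled a long shot.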

The VideoVista system will also extract closed-caption text for indexing purposes, as well as any text on the screen that might help identify what's going on and be used for later search queries. "Not just printed characters on the screen," says Bob Mack, who is manager of the information design and access group at Watson and who collaborates on the VideoVista project, "but textual graphics burned into the video frame, like subtitles, or the number or name on a football jersey, or even the license plate number on a car." With this capability -- also developed in Bolle's group -- VideoVista applies optical character recognition to turn textual graphics into straight text. "It's a good way of automatically indexing the content of the video through the different sources of text associated with it," says Mack.

To enhance the indexing process, VideoVista makes use of a technology known as Textract, first developed by a Research team led by Roy Byrd and now sold as part of IBM's Intelligent Miner for Text product. Textract can index concepts as well as words. It analyzes the text in source documents or text extracted from video, and then identifies technical terms, expressions such as dates and prices, and names of people, places and organizations. It further recognizes variant names for the same concept, such as "William J. Clinton," "Bill Clinton," "Mr. Clinton" and "Bill." Textract recognizes when such concepts are related by using both linguistic and statistical analysis methods. "For example," says Mack, "you might notice in a collection of documents that 'Bill Clinton' occurs a lot with the concepts 'president' and 'United States.'" Using Textract and other VideoVista capabilities, Bolle says, it should be possible to say, "find me a video clip in which Bill Clinton and two other people are walking across the screen from right to left."

Yet Bolle and his colleagues realize that searching on names and camera movements is only the beginning for a useful system. "People are going to want to have queries like 'find me a slam dunk,'" says Bolle, "or 'give me all the slam dunks in the basketball game', or 'find me all the goals in yesterday's soccer games.'" This requires a capability that the researchers call event detection, which will be specific to the different domains in which VideoVista might be used. In effect, they will teach VideoVista to learn to identify a particular event, such as a slam dunk, by feeding it numerous examples and letting the system learn what parameters can be used to identify that event in the future. "That's the next step," Bolle says, "and we believe we can train the system to recognize those types of events and then label them."

LEARNING ON CUE

While VideoVista is developing its technology, at least in the short term, for an HDTV studio, the CueVideo project at Almaden, in a wide-ranging collaboration with researchers at Watson and the Haifa Research Laboratory, targets the burgeoning education and training market. U.S. corporations now spend some $60 billion a year on training and education, says Dragutin Petkovic, who leads Almaden's visual media management group. And that market will become ever more video-oriented as instruction shifts from classrooms and textbooks to the digital domains of distance learning and education on demand.

There are two key bottlenecks in making video as easy to process and use as text, says Petkovic. "In education and training you need dramatic reduction in the cost of preparing those videos, cataloging and creating content," he says. "And you need timely indexing. When students have attended a talk or lecture, they want to review it the same night. And they don't want to play the entire video; they want to go right to the area of the talk that they're interested in."

Today, indexing and annotating the text of a video is excruciatingly labor-intensive. An expert might spend days processing a single hour of video. "But with CueVideo," says Petkovic, "the only human input necessary will be in the minute it takes someone to type in the title, speaker, subject and date of the video. The system will do everything else automatically." Running on a PC, CueVideo will need no more than one hour to process each hour of video. That processed video can then be posted on the Web with CueVideo's full range of browsing and searching modes.

The first step in the CueVideo annotation process is to break the video into short segments for searching or browsing. CueVideo, as designed by Petkovic and his team, does this by looking for sequences in which the visual content is consistent -- shot from a single camera point of view, for example. Then CueVideo identifies a frame that best represents the particular segment, and it lines up a set of key frames into a storyboard. Users can then click on frames and either play the video from that point -- sans audio -- or view the frames as still images while listening to the accompanying audio. Once they have found what they want, the users can switch to full video and audio.
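The segmentation and key-frame steps described above can be sketched as follows. This is a minimal illustration, not CueVideo's actual method: it assumes each frame has already been reduced to a normalized color histogram, starts a new segment when consecutive histograms differ by more than a hypothetical threshold, and picks as key frame the frame closest to the segment's average.

```python
from typing import List, Tuple

# Hypothetical per-frame feature: a normalized color histogram.
Histogram = List[float]


def l1_distance(a: Histogram, b: Histogram) -> float:
    """Sum of absolute differences between two histograms."""
    return sum(abs(x - y) for x, y in zip(a, b))


def split_into_shots(frames: List[Histogram],
                     threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Return (start, end) frame index pairs, end exclusive.

    A new shot starts wherever consecutive frames differ by more
    than `threshold` (an assumed value, chosen for illustration).
    """
    if not frames:
        return []
    shots = []
    start = 0
    for i in range(1, len(frames)):
        if l1_distance(frames[i - 1], frames[i]) > threshold:
            shots.append((start, i))
            start = i
    shots.append((start, len(frames)))
    return shots


def pick_key_frame(frames: List[Histogram], start: int, end: int) -> int:
    """Index of the frame nearest to the shot's mean histogram."""
    count = end - start
    dims = len(frames[start])
    mean = [sum(frames[i][d] for i in range(start, end)) / count
            for d in range(dims)]
    return min(range(start, end), key=lambda i: l1_distance(frames[i], mean))


def build_storyboard(frames: List[Histogram],
                     threshold: float = 0.5) -> List[int]:
    """One key-frame index per detected shot, in playback order."""
    return [pick_key_frame(frames, s, e)
            for s, e in split_into_shots(frames, threshold)]
```

Calling `build_storyboard` on a list of histograms returns the frame indices that would appear in the storyboard; a real system would need features that are robust to lighting changes and gradual transitions.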

For searching and indexing, CueVideo separates the speech portion of the audio track from any background noise or music. It then employs IBM's ViaVoice speech recognition technology to create a transcript of the text. ViaVoice was originally designed to do office dictation, which it can achieve with above 90 percent accuracy, according to Mahesh Viswanathan, a member of Watson's human language technologies group, who collaborates on CueVideo. Transcribing a lecture, ViaVoice's accuracy is lower, but still sufficient for the task of creating a searchable index.

What comes out of the transcription process is flowing text, with no punctuation or paragraph breaks. CueVideo breaks this stream of words into manageable segments, each corresponding to roughly a minute of video. When a user enters a search query, CueVideo quickly returns segments that contain the search term and ranks them by relevance.
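As a rough sketch of this step, the snippet below groups a stream of recognized words into one-minute segments and ranks the segments for a query. It assumes the recognizer supplies a start time for each word and uses a simple count of the search term as the relevance score; both are assumptions, since the article does not describe CueVideo's data format or ranking method.

```python
from typing import Dict, List, Tuple

# Hypothetical recognizer output: (word, start time in seconds).
TimedWord = Tuple[str, float]
# A segment: (start time in seconds, lowercased words in that segment).
Segment = Tuple[float, List[str]]


def segment_transcript(words: List[TimedWord],
                       segment_seconds: float = 60.0) -> List[Segment]:
    """Group an unpunctuated word stream into fixed-length time segments."""
    buckets: Dict[int, List[str]] = {}
    for word, start_time in words:
        index = int(start_time // segment_seconds)
        buckets.setdefault(index, []).append(word.lower())
    return [(index * segment_seconds, buckets[index])
            for index in sorted(buckets)]


def search(segments: List[Segment], term: str) -> List[float]:
    """Return start times of matching segments, most relevant first.

    Relevance here is just how often the term occurs in the segment;
    ties are broken by playback order.
    """
    term = term.lower()
    hits = [(seg_words.count(term), start)
            for start, seg_words in segments if term in seg_words]
    hits.sort(key=lambda hit: (-hit[0], hit[1]))
    return [start for _, start in hits]
```

The start times returned by `search` are the points at which playback would begin for each hit.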

At the moment, the system uses Textract to enhance the search and query process. But Petkovic expects that CueVideo will soon be able to handle considerably more sophisticated searches. "Textract will extract words that exist, and also some combinations and phrases," he explains. "But customers will provide us with application-specific words, phrases and taxonomies that can help us extract higher-level concepts -- 'installation maintenance,' for example. It will be like having a table of contents as opposed to just an index."

JUST BROWSING

Once the user has posed a query, CueVideo is likely to find multiple video segments or entire videos that are possible answers, just as a Web search engine does with a text query. "Suppose you get 15 videos back," says Petkovic. "Are you going to play each one all the way through to find which is best? You'll probably want to browse first."

As an aid to browsing, CueVideo can speed up the audio track without making speakers sound like Mickey Mouse. "You can run through the entire movie and get the gist of what is going on," says Zohar Sivan, manager of the Audio and Video Technologies Group at Haifa, which adapted the "time scale modification" technology that enables speech to be smoothly compressed.

Perhaps the most novel feature of CueVideo is that it enables the viewer to search and browse videos by using the slides (Freelance Graphics® or PowerPoint®) displayed by the speaker who gave the lecture or taught the class. "You can search through the slides or foils, then click on the one that talks about the concept of interest and go directly to the segment of video or audio that refers to that slide," explains Petkovic. "This is possible even if the slide in the video appears behind the speaker on a whiteboard or screen." CueVideo automates this process by matching the text of slides with the words extracted by ViaVoice, or by matching the slide images with their appearance in the video.
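The text-based half of this matching can be sketched as a word-overlap comparison between each slide and each transcript segment. The Jaccard score and the function names below are illustrative assumptions; the article does not say how CueVideo scores a match, and the image-based matching is not covered here.

```python
import re
from typing import List, Optional, Set, Tuple


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def best_segment_for_slide(slide_text: str,
                           segments: List[Tuple[float, str]]) -> Optional[float]:
    """Return the start time of the segment that best matches the slide.

    `segments` holds (start time in seconds, transcript text) pairs.
    Returns None when no segment shares a word with the slide.
    """
    slide_words = tokenize(slide_text)
    best_start: Optional[float] = None
    best_score = 0.0
    for start, text in segments:
        segment_words = tokenize(text)
        union = slide_words | segment_words
        # Jaccard similarity: shared words divided by all distinct words.
        score = len(slide_words & segment_words) / len(union) if union else 0.0
        if score > best_score:
            best_start, best_score = start, score
    return best_start
```

Clicking a slide would then seek the video to the returned start time. Because speech transcripts contain recognition errors, a practical matcher would likely need to tolerate missing or misrecognized words.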

Both CueVideo and VideoVista are being fine-tuned and extended. Bolle's VideoVista team, for instance, is working on a technology that will identify faces using only the visual information, a tool that could eventually be used in CueVideo, as well. While Petkovic and his team have been giving demonstrations and talking to customers about how CueVideo can best serve their needs, they've also been developing "domain-specific" versions tailored to specific vocabularies and applications, whether for aircraft manufacturing or medicine or sales.

Both teams are confident that, as video proliferates, the technologies they're creating will find a wide range of uses. Already, CueVideo is being incorporated into several IBM offerings. For example, the recently announced developerWorks portal for software developers plans to license videos of talks by well-known industry figures. The videos will be indexed and posted on the portal using CueVideo. And the International Center for Advanced Internet Research, to which IBM belongs, is using CueVideo to index material available on its prototype Video Portal for the Internet2 community.

As corporations continue to turn to video-based education to reach an increasingly mobile and global workforce, the opportunities for CueVideo will only grow. IBM recently formed an organization to focus on distributed learning, and discussions are underway to exploit CueVideo technology for indexing education videos. "We also have a joint research project with Boeing, in which Boeing users are evaluating the usefulness of CueVideo in the areas of video-based training, collaboration and video production," says Petkovic. "And we expect other pilots will take place as well. That's because there will be more and more videos to watch, but we'll still only have so much time to view them. So the ability to rapidly find key segments will become essential."


Gary Taubes, a freelance writer who lives in Venice, California, is a frequent contributor to Think Research.


SIDEBAR: Visually speaking

In a noisy environment, lip reading can help people understand speech. Now IBM researchers at the Thomas J. Watson Research Center, in collaboration with IBM's India Solutions Research Center in New Delhi, are working on an audiovisual speech recognition program to give computers the same ability. The goal is not only to increase the accuracy of speech recognition but to enhance a computer's ability to detect a change of speaker and to recognize who is talking, for example, in a video.

These capabilities, says collaboration leader Chalapathy Neti of the human language technologies group at Watson, can serve as valuable descriptors for searching videos for particular scenes. While that can be done by audio speech recognition alone, it can be done even better with the addition of vision.

Applied to speech recognition, the technique uses algorithms that first identify the mouth of a speaker in a video image and then visually distinguish between mouth shapes, or visemes, formed while talking. "If you look at yourself speaking in the mirror," says Neti, "you'll see that 'pa,' 'ba' and 'ma' look almost identical, whereas 'na,' which sounds very similar to 'ma,' looks entirely different."

By combining the visual information derived from the visemes with the audio information, says Neti, the accuracy of speech recognition programs can be improved. This can be of particular value in transcribing speech when there is background noise or when several speakers are talking at once.

The system can also help speed up the identification of the speakers in a video. Current systems create an acoustic signature for each speaker and compare it with the audio track as a different person begins to speak. Because this process requires at least three seconds of audio signal, short utterances cannot be attributed. By using visual cues, however, the identification can be made with just milliseconds of video.

The researchers are also working on techniques using visual cues that allow a computer to recognize when a speaker has begun to speak, even before speech recognition kicks in. With that ability, a person using a speech recognition program for dictation can speak naturally without the need to continually switch the microphone on and off, as is currently necessary to avoid the transcription of background noise during pauses. In addition, the technology enables even faster detection of the active speaker in a video with more than one speaker.

The technologies developed by Neti and his collaborators could find their way into IBM products within a few years. They will be useful not only for indexing video, but for enhanced speech recognition in any noisy environment -- be it an automobile or a busy office or a conference room. And the required digital cameras, which are rapidly shrinking and coming down in price, could be put to work for other human-computer interactions as well (see "The Human-Centered Interface," IBM Research, Number 1 and 2, 1998).

SIDEBAR: An eye for action

A technology under development at IBM's Tokyo Research Laboratory aims to analyze and interpret the contents of a video, revealing patterns and events that may not be readily apparent. Known as Video Enrichment, the technology notes the speed, position and movements of objects on a screen. It then sorts similar actions in such a way that they can be retrieved during a search of the video. It also makes possible the gathering of statistics about objects in a scene based solely on information extracted from the video. Video Enrichment could be applied in any domain where objects are moving, from recording sports events to monitoring highway traffic flows.

With the spread of broadband networks, Video Enrichment could be used to bring a new level of interactivity to viewing. In sports, for example, it will enable viewers to search for specific events and to review the corresponding scenes from different perspectives that, combined with the extracted information, will reveal to the viewer a team's strategy.

Using soccer as a test bed, the system is able to recognize particular types of action (sprinting or kicking, for example) and interactions (passing a ball or scoring a goal). These are then combined into an "event structure," which includes all the necessary information to retrieve the event using a query. Consider a "through pass," in which an offensive player passes to a teammate who is cutting toward the goal through the line of defensive players. Video Enrichment defines it as an interaction in which a pass is made, several defensive players are present during the relevant time interval and the ball's trajectory crosses an imaginary line between the defensive players. The system searches through the game for the combination of a pass with the various walking, running and other actions that make up the event structure of a through pass. It then tags particular instances in its database.
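The through-pass definition above lends itself to a small geometric sketch. The code below treats the pass as a straight line segment on a two-dimensional pitch and checks whether it crosses the line between any two defenders. The coordinate representation and function names are assumptions for illustration, and the sketch ignores the time interval and player actions that the full event structure would include.

```python
from itertools import combinations
from typing import List, Tuple

# Hypothetical pitch coordinates (x, y), in any consistent unit.
Point = Tuple[float, float]


def _orientation(p: Point, q: Point, r: Point) -> float:
    """Cross product of (q - p) and (r - p); the sign says which side r is on."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])


def segments_cross(a: Point, b: Point, c: Point, d: Point) -> bool:
    """True if segment a-b properly crosses segment c-d."""
    return (_orientation(a, b, c) * _orientation(a, b, d) < 0
            and _orientation(c, d, a) * _orientation(c, d, b) < 0)


def is_through_pass(pass_start: Point, pass_end: Point,
                    defenders: List[Point]) -> bool:
    """A pass counts as a through pass if at least two defenders are
    present and the ball's path crosses the line between some pair."""
    if len(defenders) < 2:
        return False
    return any(segments_cross(pass_start, pass_end, p, q)
               for p, q in combinations(defenders, 2))
```

For example, a pass along the x axis from (0, 0) to (10, 0) splits defenders standing at (5, -2) and (5, 3), so it would be tagged; if both defenders stand on the same side of the ball's path, it would not.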

The Tokyo researchers are engaged in projects with the Communications Research Laboratory of the Japanese Ministry of Posts and Telecommunications, Korea's Electronics and Telecommunications Research Institute, and Princeton and Osaka universities.

