IBM Research

Think Research


 


Featured Concept
Words out of Characters

By Rowan Dordick

In Brief:

Using principles of speech recognition originally developed for English at the Thomas J. Watson Research Center, scientists at Watson and the China and Tokyo Research laboratories have developed the first continuous speech-recognition products for Chinese and Japanese. Chinese proved a particular challenge, because the language uses tones, while homonyms and multiple writing systems posed difficulties for Japanese. Both languages required the development of rules to segment text, which is written without spaces. The IBM researchers solved the problems by developing an algorithm that breaks up each syllable into a tone-independent part and a tone-bearing part, and by formulating rules for segmenting text into words.


At some time or other, probably everyone who types on a keyboard has lamented the fact that, to be expressed, our thoughts must take a detour through our fingers. Speech is quicker, easier and more natural. But try telling that to a machine. For years, scientists at IBM's Thomas J. Watson Research Center have been doing just that, and they have succeeded remarkably well.

Today, programs such as IBM's ViaVoice(TM) continuous speech-recognition products can turn one's personal computer into a reasonably good listener, albeit one that has no idea what it is being told (see "Speech Recognition Technology in a Nutshell"). But at this stage that is not a major concern. What matters is that speech-recognition technology has lowered the barrier to interacting with computers. While keyboards are far from being obsolete, they are no longer essential for every task.

Although first achieved for English, speech recognition is even more useful for the character-based languages, such as Chinese and Japanese, in which the barriers to keyboard-based text input are far higher. "There are more than a dozen methods of entering Chinese characters, and none of them are perfect," says Donald Tang, assistant director of the China Research Laboratory (CRL) and head of its speech-recognition group. Over the past 20 years, the sheer difficulty of keyboard input has stimulated numerous attempts in Asia and elsewhere to develop Chinese speech-recognition systems. The problem, however, seemed insuperable because of unique features of the language.

IBM's work on speech recognition for Chinese - specifically, Mandarin, the most widely spoken dialect - began only three years ago, in mid-1994. But by discovering how to build on earlier work for English, the researchers were able to announce a Chinese-language version of ViaVoice in September 1997, only one month after the introduction of the English-language product. And, in November, a Japanese version, ViaVoice Gold, was introduced. It was developed at the Tokyo Research Laboratory, with product integration carried out by the Yamato Development Laboratory. The creators of the Japanese product not only had to deal with characters, but also had to contend with the complication of two phonetic alphabets, among other difficulties peculiar to Japanese (see "The Unique Challenges of Japanese").

Tones and Tonemes

The most distinguishing feature of Chinese and Japanese is the system of ideograms, or characters. In Chinese, each character corresponds to a single (though not always unique) spoken syllable, and each word consists of one or more syllables. The number of characters in common use is hard to pin down but is generally agreed to be between 4,000 and 6,000. Because there are only some 400 phonetically distinct syllables in Chinese, many of the characters are homonyms or near-homonyms.

The actual number of homonyms is reduced, however, by the existence of tones, which increase the number of distinct syllables to about 1,400. The closest thing to tone in English is intonation, by which a speaker, for example, can modify the pitch of a word or phrase and thereby turn a statement into a question, or convey a notion such as skepticism or sarcasm. In Chinese, the five tones (high, rising, low, falling and untoned) play a more basic, semantic role. A one-syllable word such as "ma" can variously mean "mother," "rough," "horse," "curse," or the particle indicating a question, depending on its pitch contour.

While tones help overcome the homonym problem, they themselves pose a huge challenge, and the inability to handle tone stymied earlier attempts to master the recognition problem for Chinese. The favored approach, says Julian Chen - a former physicist whose linguistics background helped him make a midcareer switch into Watson's speech recognition group - "was to use two independent algorithms running in parallel. One would recognize the base syllable - that is, the consonants and vowels - and the other, the tone of the syllable, or the pitch contour."

The problem with such a two-track system, says Chen, is that, in continuous speech, the syllables tend to blur into one another. Moreover, the precise sound quality of the tone is determined by the sounds uttered before and after it, complicating the recognition task and reducing its accuracy.

The approach that Chen and his colleagues in the Watson speech group came up with was based on the realization that the tone could be treated in the same way as a phoneme in languages without tones. "Our method," says Chen, "involves a new algorithm that decomposes each syllable into two parts. The first part, called the preme, is independent of tone. The tone value of the syllable is completely contained in the second part, called the toneme, and the same vowel with different pitch contours are treated as different phonemes."

This approach necessitated creating a large phoneme set for Mandarin, with 161 phonemes, including 104 tonemes. While that was nearly three times as many phonemes as in IBM's English-language dictation systems, in the end it constituted a major simplification, because it allowed all the acoustic signal processing tools developed for English to be applied to Chinese.
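The decomposition can be pictured with a short Python sketch. It assumes syllables are written in Pinyin with a trailing tone digit (1 to 4, or 5 for the untoned case) and that the preme is simply the initial consonant; the article does not give the exact boundary IBM used, so the list of initials and the function name here are illustrative only.

```python
# Hypothetical illustration: Pinyin syllables with a trailing tone digit.
# Two-letter initials come first so that "zh" is tried before "z".
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")

def split_syllable(syllable: str) -> tuple[str, str]:
    """Split e.g. 'ma3' into a tone-independent preme and a tone-bearing toneme."""
    base, tone = syllable[:-1], syllable[-1]
    if tone not in "12345":
        raise ValueError(f"expected a trailing tone digit in {syllable!r}")
    for initial in INITIALS:
        if base.startswith(initial):
            # The tone is attached to the second part only.
            return initial, base[len(initial):] + tone
    # Syllable with no initial consonant, e.g. 'an4'.
    return "", base + tone

for s in ["ma1", "ma3", "zhong1", "an4"]:
    print(s, split_syllable(s))
```

Under these assumptions "ma1" and "ma3" share the preme "m" but yield the distinct tonemes "a1" and "a3", which is the sense in which the same vowel with different pitch contours becomes different phonemes.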

Sentences into Words

While tones and the characters themselves create difficulties by adding features not found in Western languages, written Chinese causes an added difficulty for speech recognition by what it leaves out - namely, the spaces between words. That is a problem for a computer doing speech recognition, because merely analyzing the sounds produced by a speaker generally results in several equally possible word sequences. The choice is narrowed down to a single candidate based on the frequencies of occurrence of individual words, which are derived from the language model.

In alphabetic languages, creating a language model is laborious but straightforward. One examines a large number of texts and records the statistics of word occurrences. Specifically, one notes, for each pair of words, the number of times a given word follows them. But in Chinese, because there is no word boundary in written text, it is not entirely obvious where one word begins and another ends.

While certain segmenting rules are in general use, such as representing one concept by one word, they tend to be ambiguous. So, once again, Chen was forced to develop new rules. One such rule for segmenting words is to treat a long compound word as two words, if the probability of each occurring separately is high.
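A rule of this kind might look like the following Python sketch. The threshold, the function name, and the word probabilities are all invented for illustration; the article does not say how "high" was defined or how candidate split points were found.

```python
def maybe_split(compound, parts, word_prob, threshold=1e-4):
    """Return the parts if each is frequent on its own, else keep the compound whole.

    word_prob maps a word to its (assumed) relative frequency in the corpus;
    threshold is an arbitrary cutoff chosen for this example.
    """
    if all(word_prob.get(p, 0.0) >= threshold for p in parts):
        return list(parts)
    return [compound]

# Made-up probabilities, for illustration only.
word_prob = {"北京": 2e-3, "大学": 3e-3, "葡": 1e-7, "萄": 1e-7}

print(maybe_split("北京大学", ("北京", "大学"), word_prob))  # both parts common: split
print(maybe_split("葡萄", ("葡", "萄"), word_prob))          # parts rare alone: keep whole
```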

The language model was built in successive stages. Starting with an electronically readable vocabulary, Chen segmented a collection of texts drawn from newspapers and journals and computed the statistics. The rules were then applied to modify the vocabulary, in the process reducing it significantly, and the texts were resegmented. By the third iteration, which incorporated additional texts for a total of 150 million characters - the equivalent of 100 million words - a language model was developed that could significantly reduce the error rate of the unaided acoustic analysis.
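The segment, count, revise and resegment loop can be sketched in Python. The article does not say which segmentation algorithm was used, so this example substitutes greedy forward maximum matching, a common textbook method; the vocabulary, the two-line corpus and the choice of which word to drop are all made up.

```python
from collections import Counter

def segment(text, vocabulary, max_len=4):
    """Greedy forward maximum matching: take the longest vocabulary word at each position."""
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # Fall back to a single character when nothing longer matches.
            if length == 1 or candidate in vocabulary:
                words.append(candidate)
                i += length
                break
    return words

vocabulary = {"北京", "大学", "北京大学", "学生"}
corpus = ["北京大学学生", "北京学生"]

# First pass: segment the texts and tabulate word counts.
counts = Counter(w for line in corpus for w in segment(line, vocabulary))

# A later pass might drop a compound from the vocabulary and resegment.
smaller_vocab = vocabulary - {"北京大学"}
recount = Counter(w for line in corpus for w in segment(line, smaller_vocab))

print(counts)
print(recount)
```

With the compound removed, the second pass yields fewer distinct words with higher counts each, which gives the language model better statistics per vocabulary entry.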

Although Chen had a prototype of a speaker-dependent continuous Chinese language speech-recognition system by the end of 1994, an acoustic model derived from the voices of multiple speakers was required to make it speaker-independent. By recruiting some 50 Beijing-born Chinese from the New York area, including some trained as TV announcers, and having them read from a set of more than 30,000 sentences, Chen was able to create a speaker-independent acoustic model by the summer of 1995.

From Prototype to Product

The successful demonstration of the prototype at the opening ceremony of CRL in Beijing on September 21, 1995, reinforced the commitment to create a Chinese speech-recognition offering. A concerted effort to transform the prototype into a full-fledged product was begun, and a small team at CRL took on the task of refining the acoustic and language models. In December, as the work got under way, IBM China established a small speech-recognition product group.

"One of the problems in China," says Katherine Shen, a member of the CRL speech group, "is that lots of Chinese speak Mandarin with different accents, depending on what their native dialect is." By having 350 people from different regions of China speak 100 sentences drawn from a corpus of 40,000 sentences, an enlarged acoustic model was created.

The enhanced acoustic model enables the system to handle more variations of the standard Beijing-area accent, but, notes Shen, it is still limited to lightly accented speakers. One of the achievements of the CRL team is an adaptation, or enrollment, mechanism for individuals. By speaking approximately 250 selected sentences, users can train the system to recognize their accents. However, even with adaptation, the system has difficulty recognizing those with heavy accents.

Ultimately, says Shen, additional acoustic models for different regions of China may be incorporated into the speech-recognition system, allowing users to simply choose the one appropriate for their accent. A related problem is that of incorporating English words. "Currently, we use Chinese sounds to describe the English words, but we would like to define additional phonemes that would more precisely model the sounds when spoken by a Chinese speaker," says Shen.

The CRL team also worked on enlarging the language model, since recognition accuracy depends on having statistical information available for all the words a speaker might say. On the other hand, points out Jerry Zhu, a colleague of Shen, "the more specific the language model, the higher the recognition accuracy, because the system has better statistics and can better predict the possible words." As in the case of the acoustic model, a collection of models selectable by the user may prove the best solution, he adds.

An Integrated Effort

Once the CRL team had completed enlarging the acoustic and language models, programmers at Watson began the process of redesigning and modifying the ViaVoice product for the Chinese language. To begin with, both the recognition engine and the user-interface programs had to be modified to work with the double byte character set (DBCS) required for Chinese characters. Since DBCS enablement had already been done for Japanese VoiceType products by the product team at the Yamato Development Laboratory, the Japanese software served as a base for developing the Chinese version.

In addition, the ViaVoice speech recognition engine had to be modified to handle tones, while the ViaVoice user interface had to be redesigned to handle nonsegmented text, as well as phonetic input of Chinese, called Pinyin, used for error correction.

Product testing was also undertaken, and early versions of the product were sent to customers and PC hardware manufacturers in China for evaluation. "Most of the problems we encountered," says Gregg Daggett, who was in charge of the product programming effort, "were compatibility problems with unique hardware, such as soundcards, in the Chinese market that were not 100 percent compatible with the standards we designed the product for."

Daggett and his co-workers also created a toolkit for independent software vendors (ISVs) that could be used to develop special speech-enabled applications. "More than half a dozen ISVs are using the toolkit," says Daggett. Among the applications they have developed are a voice-based command and control navigator for Windows 95® and a Chinese-language dictation system that allows users to make corrections on-screen with a pen-based computing application.

As the number of these applications continues to grow and the acoustic model of the ViaVoice Chinese speech recognition product is extended to include a greater range of Mandarin accents, the difficulties of Chinese-language-based computing will decline as the productivity of users increases.


Speech Recognition Technology in a Nutshell

The magic of computer speech recognition is that it is done without any attempt to model the way in which humans understand speech. Instead, the process involves the decomposition of speech sounds into units called prototypes, which are similar to phonemes - the basic set of distinctive sounds in a language - but more numerous. The prototypes for a given language are used to create what is called an acoustic model of individual words, which are then combined into the "dictionary" of the speech-recognition system.

Each word model is described in probabilistic terms, technically known as a hidden Markov model. For example, based on an average computed from the input of a large number of native speakers, the model, in effect, describes a word as a product of the probabilities that an "ideal" speaker will utter a particular sequence of prototypes when pronouncing a word. During the recognition phase, a speaker's words are similarly represented as a sequence of prototypes. Since each speaker pronounces words slightly differently from everyone else - and even differently from utterance to utterance - the model merely looks for a likely match between the spoken word and its model. Often, this results in several possible candidates.
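As a rough illustration of matching a prototype sequence against word models, the following Python sketch uses the standard forward algorithm for a hidden Markov model. The two toy word models, their two-state topology, the prototype labels "a" and "b", and every probability are invented for the example; the article does not describe IBM's actual models. For simplicity the sketch sums over all final states rather than requiring the model to end in a designated state.

```python
def forward_likelihood(observations, start, trans, emit):
    """Probability that the HMM generates the observed prototype sequence."""
    states = list(start)
    # Initialise with the first observation.
    alpha = {s: start[s] * emit[s].get(observations[0], 0.0) for s in states}
    # Propagate through the remaining observations.
    for obs in observations[1:]:
        alpha = {
            s: emit[s].get(obs, 0.0)
               * sum(alpha[p] * trans[p].get(s, 0.0) for p in states)
            for s in states
        }
    return sum(alpha.values())

# Shared left-to-right topology: stay in s0 or move on to s1.
start = {"s0": 1.0, "s1": 0.0}
trans = {"s0": {"s0": 0.5, "s1": 0.5}, "s1": {"s1": 1.0}}

# Hypothetical word A tends to sound like "a ... b"; word B like "b ... a".
emit_a = {"s0": {"a": 0.9, "b": 0.1}, "s1": {"a": 0.2, "b": 0.8}}
emit_b = {"s0": {"a": 0.1, "b": 0.9}, "s1": {"a": 0.8, "b": 0.2}}

spoken = ["a", "a", "b"]
score_a = forward_likelihood(spoken, start, trans, emit_a)
score_b = forward_likelihood(spoken, start, trans, emit_b)
print("word A" if score_a > score_b else "word B")
```

Here word A is the likelier match, but in practice several words can score almost equally well, which is why a second source of evidence is needed.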

The reduction to a single candidate is accomplished by means of the language model. The language model is created by analyzing a large body of written material - IBM's first language model was based on business correspondence - and tabulating the frequencies of word sequences. This is done by starting at the beginning of a document and looking at, say, the first three words, a choice that results in a so-called "trigram" model. The third word is recorded as having followed the preceding two. One then moves over one word and looks at the next triplet of words, again recording the result.

Finally, after examining a large number of texts, a set of frequencies, or probabilities, is obtained for all the words in the system's dictionary. Given a sequence of two words that have already been guessed, the speech-recognition program can then distinguish among several candidate words based on which is more probable, that is, has been found to follow the other two words most frequently.
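The sliding-window tabulation and the choice among candidates can be sketched in a few lines of Python. The four-sentence corpus and the function names are made up for the example; a real model would be built from a far larger body of text and would use probabilities rather than raw counts.

```python
from collections import Counter, defaultdict

def train_trigrams(sentences):
    """Slide a three-word window over each sentence and count what follows each word pair."""
    follows = defaultdict(Counter)
    for words in sentences:
        for w1, w2, w3 in zip(words, words[1:], words[2:]):
            follows[(w1, w2)][w3] += 1
    return follows

def pick(follows, w1, w2, candidates):
    """Choose the candidate seen most often after the pair (w1, w2)."""
    return max(candidates, key=lambda w: follows[(w1, w2)][w])

# Toy corpus, for illustration only.
sentences = [
    "please write to the office".split(),
    "please write to the manager".split(),
    "they have the right to vote".split(),
    "please write to me".split(),
]
follows = train_trigrams(sentences)

# Three acoustically similar candidates after "please write":
print(pick(follows, "please", "write", ["to", "two", "too"]))  # prints "to"
```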

It may happen, of course, that the three words never occur in the corpus of texts used to create the language model. In that case, explains Watson's Michael Monkowski - who helped implement the Chinese continuous speech-recognition prototype system - the system will look at the previous word alone, a bigram model, for which statistics are also available. And if that pairing is not in the corpus, and hence no statistics exist, the recorded frequency of each word in the corpus is used to choose between candidates.
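The fallback from trigram to bigram to single-word statistics can be sketched as follows. This is a simplified version that works on raw counts and backs off only when no candidate has any evidence at a given level; production systems typically blend the levels using discounted probabilities. The corpus and function names are invented for the example.

```python
from collections import Counter

def count_ngrams(sentences):
    """Tabulate single-word, word-pair and word-triplet counts."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for words in sentences:
        uni.update(words)
        bi.update(zip(words, words[1:]))
        tri.update(zip(words, words[1:], words[2:]))
    return uni, bi, tri

def choose(candidates, w1, w2, uni, bi, tri):
    """Pick a candidate using trigram counts, backing off to bigram, then unigram counts."""
    for score in (lambda w: tri[(w1, w2, w)],
                  lambda w: bi[(w2, w)],
                  lambda w: uni[w]):
        if any(score(w) > 0 for w in candidates):
            return max(candidates, key=score)
    return candidates[0]  # no statistics at all: arbitrary choice

# Toy corpus, for illustration only.
sentences = [
    "send it to the office".split(),
    "walk to the bank".split(),
    "the bank is open".split(),
]
uni, bi, tri = count_ngrams(sentences)

print(choose(["office", "offers"], "to", "the", uni, bi, tri))   # trigram evidence
print(choose(["bank", "banked"], "near", "the", uni, bi, tri))   # falls back to bigram
print(choose(["office", "bank"], "quiet", "old", uni, bi, tri))  # falls back to word frequency
```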


The Unique Challenges of Japanese

While Japanese uses Chinese characters, it does not have to contend with tones, which amounts to one less complication for Japanese speech recognition. Yet, because there are fewer distinct sounds available, the homonym problem is even more daunting than in Chinese. That, and the related problem of multiple pronunciations for a given character, together with the various forms of writing to express any Japanese word, constitute a formidable challenge for speech recognition technology.

The homonym problem is particularly vexing. The sound of a given word can correspond to many different Kanji, the Japanese name for the characters borrowed from Chinese. Japanese names are exceptional in this regard, and a single phonetic name can correspond to literally dozens of Kanji. In dictating, therefore, phonetic symbols, or Kana, are used to represent the sound, and the user may subsequently have to choose the intended Kanji from a menu brought up by clicking on the Kana representation.

The inverse problem also exists: more than half of all Kanji have multiple pronunciations. This multiplicity arises from several factors. The most notable of these is that a character can be used with its original Chinese pronunciation and meaning or it can be used to represent the stem of a Japanese word. The Kanas are used to indicate which sound is meant, and it is therefore essential that the appropriate Kanas be incorporated in the speech recognition engine.

Finally, Japanese offers a variety of ways to write any given word. In addition to Kanji, two phonetic alphabets - Hiragana and Katakana - are used, depending on the author's preference. There are other options as well, for example, writing an English word in English or in Katakana. The result of this abundance is that the vocabulary for a Japanese speech recognition system - that is, each possible written form of a word - is much larger than the vocabulary needed for other languages.

Japanese also requires rules to segment written text in order to build a language model. An automatic technique was developed at TRL for doing this based on the notion of tokens. "Each token consists of usually one word," says Hiroshi Kaneko, project manager of speech and language processing, "but in some cases we group together an adjective and a noun or a verb and an adverb." A set of rules - called a tokenizer - determines how words are to be demarcated into tokens. Once the text is segmented in this way, the statistics for the trigram language model can be calculated.



