Louis Fried
Most of the techniques associated with multimedia applications (video, text, graphics, object-oriented interfaces, handwritten input, touch screens, scanning, and mouse point-and-click techniques) focus on a visually based human interface. Not enough has been said about the most natural of all human interfaces: speech. During the last few years, automatic speech recognition (ASR) technology has made substantial strides that warrant active consideration of speech as an effective interface for many emerging multimedia applications.
Although speech or voice recognition technology is not new, applications have often been confined by the limitations of the technology. In the past, most speech recognition systems could only recognize words spoken in isolation. If users wanted the computer to understand their speech, they had to ... separate ... the ... words ... by ... brief ... silences. In addition, most systems required that the speaker spend hours training the system to understand his or her individual pronunciation. These constraints still apply to many of the speech recognition products currently available for operation on personal computers.
Systems that require training by the user are generally limited to supporting one to four users and tend to have limited or specialized vocabularies. Systems that operate in this mode include products from Kurzweil Applied Intelligence, Inc., Dragon Systems, and IBM Corp. Although they are extremely useful for providing computer access to impaired people or to professionals who must keep their hands free for other work while accessing a computer, these systems do not meet the need for general or public access to computer systems in what could be called a natural manner.
If a speech recognition system is to be of widespread, practical use, it must operate without individual training (i.e., be speaker independent), and it must understand continuous speech, without pauses, which is how people converse. Two approaches have been taken to solve the problem of speaker-independent recognition: the synthetic modeling approach and the sampling approach.
Synthetic Modeling. Synthetic modeling builds words out of syllables that represent speech sounds. Using a database of these subwords, words can be created in the recognition tables by phoneme transcription, much like constructing words using the pronunciation codes that follow a word's entry in a dictionary.
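To make the idea concrete, the following sketch shows a recognition lexicon assembled from a small subword inventory. The phoneme symbols and words are invented for illustration and are not drawn from any particular product.

```python
# A minimal sketch of the synthetic-modeling idea: recognition entries are
# assembled from a small subword inventory rather than from recorded
# samples. The phoneme symbols and words are illustrative only.

# Hypothetical subword (phoneme) inventory.
PHONEMES = {"b", "ae", "l", "ax", "n", "s", "t", "uw"}

# A recognition lexicon built by phoneme transcription, much as a
# dictionary lists pronunciation codes after each headword.
LEXICON = {
    "balance": ["b", "ae", "l", "ax", "n", "s"],
    "two":     ["t", "uw"],
}

def validate_entry(word: str, transcription: list[str]) -> bool:
    """Check that a transcription uses only known subwords."""
    return all(p in PHONEMES for p in transcription)

for word, phones in LEXICON.items():
    assert validate_entry(word, phones), f"unknown phoneme in {word}"
    print(word, "->", "-".join(phones))
```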
Sampling. The sampling approach, which has proved to be more accurate for interpreting regional accents, involves gathering large numbers of spoken samples of utterances (referred to as tokens) and constructing from them word models that match the anticipated user population. The broader the population of users, the more samples are needed. For example, in constructing its speaker-independent Corona™ recognition system, which is intended to be useful across the entire US, SRI International's researchers gathered more than 400,000 utterances from all regions of the country.
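The heavily simplified sketch below illustrates the data flow of the sampling approach. Real systems train statistical acoustic models on the tokens; here a word "template" is simply the mean of fabricated feature vectors, which is enough to show how many speakers' tokens fold into one speaker-independent model.

```python
# A toy sketch of the sampling approach: each word model is derived from
# many spoken tokens rather than from phoneme rules. Real systems train
# statistical acoustic models on the tokens; here a word "template" is
# simply the mean of fabricated feature vectors, to show the data flow.

from collections import defaultdict
import random

random.seed(0)

def fake_features(word: str) -> list[float]:
    # Stand-in for acoustic features extracted from one recorded token.
    return [random.gauss(len(word), 1.0) for _ in range(4)]

# Tokens gathered from many speakers, keyed by the word that was spoken.
tokens: dict[str, list[list[float]]] = defaultdict(list)
for word in ["yes", "no", "balance"]:
    for _speaker in range(100):              # many speakers per word
        tokens[word].append(fake_features(word))

# Build one speaker-independent template per word by averaging the tokens.
templates = {
    word: [sum(col) / len(col) for col in zip(*samples)]
    for word, samples in tokens.items()
}
print({w: [round(x, 2) for x in tmpl] for w, tmpl in templates.items()})
```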
A system intended for widespread use must be capable of recognizing a large number of words. Given the current state of the art, a system that can recognize anyone saying "yes" or "no" is trivial. Early vocabulary-building applications could support up to about 2,000 words and phrases. More advanced systems can support vocabularies of 60,000 or more words. Systems for widespread application must be capable of recognizing tens of thousands of words in what is, as far as the speaker is concerned, real time. Only in this way can a human's conversation with a computer appear natural.
The Complexities of Speech. New systems appearing commercially today do a creditable job of recognizing speaker-independent, continuous speech with an accuracy of better than 95%. However, there is a great deal of complexity that must be overcome. A sampling of the complex problems that need to be solved follows.
Homonyms. Most languages, especially English, have many homonyms, words that sound alike but may have different spellings and different meanings (e.g., "two," "to," and "too"). The ability to distinguish among homonyms depends on either limiting the vocabulary in some manner to eliminate homonyms or being able to understand the context in which the word is spoken.
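As a rough illustration of the contextual approach, the sketch below picks among same-sounding spellings using an invented bigram table; a real system would rely on a trained language model rather than hand-coded scores.

```python
# A hedged sketch of context-based homonym resolution: given a set of
# same-sounding candidates, pick the spelling whose bigram with the
# previous word scores highest. The tiny bigram table is invented for
# illustration; a real system would use a trained language model.

HOMOPHONES = {"tu": ["two", "to", "too"]}  # one sound, three spellings

BIGRAM_SCORES = {
    ("send", "to"): 0.9,
    ("send", "two"): 0.05,
    ("send", "too"): 0.05,
    ("page", "two"): 0.8,
    ("page", "to"): 0.1,
    ("page", "too"): 0.1,
}

def resolve(prev_word: str, pronunciation: str) -> str:
    """Choose the most likely spelling given the preceding word."""
    candidates = HOMOPHONES.get(pronunciation, [pronunciation])
    return max(candidates,
               key=lambda w: BIGRAM_SCORES.get((prev_word, w), 0.0))

print(resolve("send", "tu"))  # -> "to"
print(resolve("page", "tu"))  # -> "two"
```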
Numbers. Numbers are difficult to recognize. Not only do some numbers sound like other words, but people say numbers in different ways. If asked for a telephone number, most people will say each digit separately, but if asked for a dollar amount, most people will combine the numbers in some fashion, such as "five thousand four hundred and thirty-six dollars and forty-seven cents." Long strings of numbers, or numbers and letters combined (as in an automobile vehicle serial number), can be very difficult to understand. For this reason, many applications require the speaker to break up a longer number into smaller groups; a social security number might be broken into groups of three, two, and four digits, and a credit card number into groups of four.
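The sketch below shows why grouping helps: the system assembles and checks each short group separately, so a misrecognized digit forces only that group to be repeated. The digit words and group sizes mirror the social security example above; the code itself is purely illustrative.

```python
# A small sketch of grouped number capture: the recognizer converts each
# short spoken group to digits and validates its length before assembling
# the full number. Prompts and digit words are illustrative only.

DIGITS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
          "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9"}

def words_to_digits(spoken: str) -> str:
    """Convert a spoken digit string ('four one five') to '415'."""
    return "".join(DIGITS[w] for w in spoken.split())

def collect_grouped_number(groups: list[str], sizes: list[int]) -> str:
    """Assemble a number captured as short groups, e.g. 3-2-4 for an SSN."""
    parts = [words_to_digits(g) for g in groups]
    for part, size in zip(parts, sizes):
        assert len(part) == size, f"expected {size} digits, heard {part}"
    return "-".join(parts)

# Social security number spoken in groups of three, two, and four digits.
spoken_groups = ["one two three", "four five", "six seven eight nine"]
print(collect_grouped_number(spoken_groups, [3, 2, 4]))  # 123-45-6789
```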
Background Noise. Understanding speech is further complicated when a telephone is the instrument of voice capture. Over-the-phone speech is subject to the background hiss of phone lines. Good systems will include algorithms for cleaning the speech of background noise. Similar techniques must be applied if the speaker is talking in a noisy environment.
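One common cleanup technique (not named in the text, but representative of this class of algorithm) is spectral subtraction: estimate the noise spectrum from a quiet stretch of the call and subtract it from the spectrum of the noisy signal. The sketch below applies it to a synthetic signal; for simplicity the known noise serves as the estimate, where a real system would measure it from a silent interval.

```python
# A minimal sketch of spectral subtraction on a synthetic signal: subtract
# an estimated noise magnitude spectrum from the noisy spectrum, clamping
# at zero and keeping the original phase. Real systems work on sampled
# telephone audio and estimate the noise from silent stretches.

import numpy as np

rate = 8000                                  # telephone-grade sample rate
t = np.arange(rate) / rate
noise = 0.1 * np.random.randn(rate)          # stand-in for line hiss
speech = np.sin(2 * np.pi * 440 * t)         # stand-in for a voiced sound
noisy = speech + noise

spectrum = np.fft.rfft(noisy)
noise_mag = np.abs(np.fft.rfft(noise))       # noise magnitude estimate

# Subtract the noise magnitude, clamp at zero, and keep the original phase.
clean_mag = np.maximum(np.abs(spectrum) - noise_mag, 0.0)
cleaned = np.fft.irfft(clean_mag * np.exp(1j * np.angle(spectrum)), n=rate)

print("residual error:", round(float(np.mean((cleaned - speech) ** 2)), 4))
```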
Spoken Language Understanding. Automatic speech recognition is not the same thing as spoken language, or natural language, understanding. An ASR system, for example, could pick out certain words in a stream of speech and, by spotting these words in the context of a limited dialog, respond as if it had understood a complete sentence. For example, if a customer calls a bank's speech recognition system and requests a current account balance, a word-spotting system may simply pick out the word "balance" and provide the correct answer. Applications based on word-spotting techniques require that the dialog between the caller and the system be carefully constructed to limit the range of responses the caller may make.
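A toy version of such a word-spotting dialog might look like the following; the keywords, replies, and account structure are invented for illustration, and the input is assumed to be already transcribed text.

```python
# A toy word-spotting dialog in the spirit of the banking example: the
# system scans the caller's utterance for a few keywords and answers as
# if it had parsed the whole sentence. Keywords and replies are invented.

KEYWORD_ACTIONS = {
    "balance":  lambda acct: f"Your current balance is ${acct['balance']:.2f}.",
    "transfer": lambda acct: "Which account would you like to transfer to?",
}

def respond(utterance: str, account: dict) -> str:
    """Answer based on the first spotted keyword, ignoring everything else."""
    for word in utterance.lower().split():
        action = KEYWORD_ACTIONS.get(word.strip(".,?"))
        if action:
            return action(account)
    return "Sorry, I didn't catch that. You can ask about balance or transfer."

acct = {"balance": 1234.56}
print(respond("Could you tell me my account balance, please?", acct))
```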
Spoken language understanding systems, by contrast, would operate without the limited context of word spotting and truly analyze the entire sentence before responding. Furthermore, such systems may have the capability to handle references to earlier sentences in the conversation or topic changes. For example, a conversation may include an inquiry about whether a particular check had cleared. After receiving a negative response, the speaker may say, "Put a stop payment on it." Here, the system could recognize that "it" refers to the check previously discussed.
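The sketch below caricatures just the reference-resolution step of the stop-payment example: the dialog state remembers the most recently discussed entity so that a later pronoun can be resolved to it. Real spoken-language systems use far richer discourse models; the structure and the check number here are purely illustrative.

```python
# A hedged sketch of the "stop payment on it" example: the dialog state
# remembers the most recently discussed entity so that a later pronoun
# can be resolved to it. The check number and replies are hypothetical.

class DialogState:
    def __init__(self):
        self.last_entity = None              # most recently mentioned object

    def handle(self, utterance: str) -> str:
        words = [w.strip(".,?!") for w in utterance.lower().split()]
        if "check" in words:
            self.last_entity = "check #1072"   # hypothetical check number
            return f"No, {self.last_entity} has not cleared."
        if "it" in words and self.last_entity:
            return f"A stop payment has been placed on {self.last_entity}."
        return "Could you tell me which item you mean?"

state = DialogState()
print(state.handle("Has my check cleared?"))        # establishes the referent
print(state.handle("Put a stop payment on it."))    # "it" -> the check
```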