EMERGING APPLICATIONS
Speech recognition applications have direct benefits to the organizations that adopt them in terms of productivity, safety, and reduced staffing. Development efforts, and in some cases new products, have been announced that combine ASR with facsimile machines, imaging, and industrial controls. Assuming that the long-term trends in hardware continue to make intelligent devices cheaper, smaller, and faster, with more data storage capability, automatic speech recognition and especially speaker-independent recognition will play an important role in several areas; for example:
- For devices that are too small for other interfaces. Product development is now taking place on pocket-sized devices that incorporate the functions of a personal reminder and address book with a mobile telephone.
- For applications where keyboards are impractical. In industrial situations such as machine shops and assembly lines, keyboards rapidly become dirty and nonfunctional. ASR technology can also eliminate the need for manual entry in situations in which keyboards are subject to weather or other hazardous conditions. Examples of such use include public or customer access to automated teller machines, vending kiosks, and information kiosks.
- As an easy-to-learn means for untrained users and the general public to concurrently use a single information resource.
- As embedded technology in other devices for which other interfaces would detract from the design or efficiency. The technology is already being used in automobiles, and its use in home appliances is anticipated.
- For applications for the visually impaired as well as for communications under conditions in which the hands are in use or vision should not be distracted. One current application supports data entry by dock workers who wear small wireless microphones to dictate data about shipments and receipts for entry into the information system.
- As an alternative to systems that use touchtone input in conjunction with IVR. Between 20% and 40% of callers to such systems either do not have a touchtone instrument or refuse to use touchtone. In many cases, the aversion is not to the touchtone instrument but to the extensive menus that are recited before the caller can identify the sequence of buttons to push. ASR techniques can substantially condense or eliminate menus.
SELECTING AN ASR PRODUCT
It is easy to overbuy and acquire technology that is too costly and sophisticated for real business needs. By the same token, if IS developers and users define current needs too narrowly and do not anticipate the future, they underbuy. Some buyers have opted for word-spotting systems only to find that their vendor did not support spoken language understanding (i.e., natural language). When full natural language capability is later needed, the complete system replacement that results can be very expensive.
Once an application specification has been written, selection of the system depends first on who will be using it. When one or only a few specific users will use the system, the lowest-priced approach that meets their needs should be chosen. However, users must have the patience to train the system and be willing to take the isolated-word approach to dictation.
When an application involves so-called promiscuous speech (i.e., lots of people talking to the system), selection demands more stringent criteria. Obviously, the system must support speaker-independent, continuous speech recognition and the ability to translate speech into ASCII text. Additional criteria are usually called for. The weight placed on the criteria is, of course, a function of the intended application.
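One way to make the weighting of criteria explicit is a simple weighted score. The following is a minimal sketch only: the criterion names, the weights, the 0-to-5 rating scale, and the two products are all hypothetical and would be replaced by whatever the application specification calls for.

```python
from typing import Dict

# Hypothetical weights for a telephony application. A different application
# (e.g., dictation for specialists) would weight the same criteria differently.
TELEPHONY_WEIGHTS: Dict[str, float] = {
    "speaker_independent": 5.0,
    "continuous_speech": 5.0,
    "barge_in": 4.0,
    "large_vocabulary": 2.0,
    "multiple_languages": 1.0,
}


def weighted_score(ratings: Dict[str, float], weights: Dict[str, float]) -> float:
    """Return a score between 0 and 1 from ratings given on a 0-to-5 scale.

    A criterion that is missing from the ratings counts as 0.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    earned = sum(weight * ratings.get(name, 0.0) for name, weight in weights.items())
    return earned / (5.0 * total_weight)


# Illustrative ratings for two made-up products.
product_a = {
    "speaker_independent": 5,
    "continuous_speech": 5,
    "barge_in": 0,
    "large_vocabulary": 5,
    "multiple_languages": 5,
}
product_b = {
    "speaker_independent": 5,
    "continuous_speech": 5,
    "barge_in": 5,
    "large_vocabulary": 2,
    "multiple_languages": 0,
}

score_a = weighted_score(product_a, TELEPHONY_WEIGHTS)
score_b = weighted_score(product_b, TELEPHONY_WEIGHTS)
print(f"Product A: {score_a:.2f}  Product B: {score_b:.2f}")
```

With these assumed weights, the product that lacks barge-in scores lower even though it offers a larger vocabulary and more languages, which is the intended effect of weighting by application.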
Functions and Features to Look For
When choosing a system for many users, IT buyers should look for certain features and functions:
- The system must be able to translate and respond with minimal perceptible lag time.
- If the system is intended for use in telephony applications, it must work with regular, unmodified telephone instruments, speakerphones, and cellular phones. It must also support barge-in (i.e., allowing speakers to talk while the system is still responding) and must work with the IVR system that is in use or planned (i.e., it must be easily ported to that system). Finally, it must be able to alternate with touchtone input.
- If the system is intended to support broad-ranging inquiry, it must support a vocabulary of more than 60,000 words. The Wall Street Journal test is a good measure of performance. In contrast to most newspapers, which use a vocabulary of fewer than 15,000 words, the Journal uses a 40,000-word vocabulary. If the system performs with reasonable accuracy in translating Journal articles, it should work for this type of application.
- Depending on the application (e.g., electronic engineering, programming, or medicine), the system should support or be able to add specialized vocabularies.
- If the system is intended to support multiple specialists, it should be able to switch among vocabularies.
- The system should support word spotting.
- The system should support spoken language understanding. Even if there is no plan to use this feature initially, it may be a desirable feature to have in a product to allow for future needs.
- If the system is intended for nationwide use, it should support translation of multiple dialects or regional accents for a single language (or the vendor should be able to adapt the system to accommodate regional accents).
- If intended for international use, the system must support multiple languages (or the vendor should be able to adapt the system to accommodate foreign languages).
- The system should detect silences and recognize when the speaker has stopped talking.
- The system should listen at all times (rather than only when the speaker is given a signal).
- The system should operate on a server in a client/server environment.
- The system should support performance analysis (e.g., recognition confidence factors, recording of exact word sequence, percentage of successful transactions, average transaction processing time).
- The system should maintain a time-stamped activity log.
- The system should be able to record individual sessions for later playback.
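The performance-analysis figures named in the list above (recognition confidence, percentage of successful transactions, average transaction processing time) can be derived from a time-stamped activity log. The sketch below assumes a hypothetical log record with four fields; a real ASR product will define its own log format, so the field names here are illustrative only.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Dict, List


@dataclass
class Transaction:
    """One entry of a hypothetical time-stamped activity log."""
    started: datetime
    ended: datetime
    succeeded: bool
    confidence: float  # recognition confidence factor, assumed to be 0.0 to 1.0


def summarize(log: List[Transaction]) -> Dict[str, float]:
    """Compute basic performance-analysis figures from the activity log."""
    if not log:
        raise ValueError("activity log is empty")
    durations = [(t.ended - t.started).total_seconds() for t in log]
    return {
        "transactions": len(log),
        "success_rate_pct": 100.0 * sum(t.succeeded for t in log) / len(log),
        "avg_processing_seconds": mean(durations),
        "avg_confidence": mean(t.confidence for t in log),
    }


# Made-up sample data for illustration.
t0 = datetime(2024, 1, 1, 9, 0, 0)
sample_log = [
    Transaction(t0, t0 + timedelta(seconds=30), True, 0.9),
    Transaction(t0, t0 + timedelta(seconds=50), False, 0.5),
    Transaction(t0, t0 + timedelta(seconds=40), True, 0.8),
    Transaction(t0, t0 + timedelta(seconds=20), True, 0.8),
]

summary = summarize(sample_log)
for name, value in summary.items():
    print(f"{name}: {value}")
```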
Performance Characteristics
An evaluation of performance characteristics should ask the following:
- How many concurrent users can access the system without degradation of response time?
- Can the system work in a noisy environment, filtering out external noise?
- Is the system portable among a variety of standard computers?
- Is the system composed of software only (i.e., it does not require special devices other than those for digitizing voice input)?
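Alongside these questions, accuracy on a reference text such as the Wall Street Journal test described earlier can be put into numbers. A common measure, not named in this article and introduced here as an assumption, is word error rate: the number of word substitutions, deletions, and insertions needed to turn the system's output into the reference transcript, divided by the number of words in the reference. The sketch below computes it with a standard word-level edit distance; the sample sentences are invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # Word-level edit distance, computed one row at a time.
    # previous[j] is the distance between the reference words processed
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1      # reference word missing from output
            insertion = current[j - 1] + 1  # extra word in output
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Invented example: one word of seven is misrecognized.
reference_text = "the market closed higher on heavy trading"
recognized_text = "the market closed hire on heavy trading"
wer = word_error_rate(reference_text, recognized_text)
print(f"Word error rate: {wer:.3f}")
```

What counts as "reasonable accuracy" remains a judgment for the buyer; the measure only makes competing systems comparable on the same text.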
Vendor Capability
When selecting a vendor, buyers should be sure to ask the following questions:
- Is the provider of the ASR system the original developer?
- Can the provider create custom applications for the organization?
- Does the provider offer around-the-clock emergency support for critical applications?
BUILDING VOICE RECOGNITION APPLICATIONS
Most of the packages used in specialized applications such as law or medicine are for single users or a limited number of users. However, for promiscuous-speech applications that allow the general public or a company's customers to access the system, it is usually necessary to design and build the application using the ASR system and to implement the application using its accompanying tools.