From Desktop to Server: Speech Recognition Moves Upstream
By: Kimberlee Kemble
Apr. 24, 2002 12:00 AM
Speech recognition is the process by which computer-based software converts audible voice into digital text. When most people think of computer-based speech recognition, they picture someone sitting at a desk, wearing a headset microphone, dictating large volumes of text into a desktop system. But over the past decade, speech-recognition technology has moved from the desktop to the server, from use by an individual to use by the enterprise.
Take the following examples:
What do the above have in common? They're all snippets of dialog that have been extracted from actual applications that utilize speech recognition. What's more, all of these applications implement speech recognition over the telephone. That means callers can conduct business with voice-enabled automated systems over the phone, simply by using their voice. Before we go any further, let's take a look at some basic speech terms.
A Short History
The first speech systems ran on large UNIX servers. Resource requirements were high. First, they required special digital signal processing (DSP) hardware to assist in the speech-recognition process, as most systems in those days simply weren't powerful enough to support computationally intensive speech recognition. They also required several hours of "training," and you had to talk using a style known as discrete speech, where you inserted short pauses between your words ("you...had...to...talk...like...this"). Even though it was not the most natural way to speak, for people who routinely dealt with large volumes of text (doctors and lawyers, in particular), it was a significant breakthrough.
Then came continuous speech, which meant users didn't have to insert pauses between words. This continuous speech was limited to short commands and phrases, and dictation still required discrete speech. In 1993, a personal voice product, IBM's Personal Dictation System, was released for OS/2. It was one of the first commercially available, high-accuracy voice-recognition products. The following year saw the announcement of IBM VoiceType Dictation for Windows and OS/2 ("You talk, it types"). Four years later, IBM released the industry's first desktop continuous-dictation product. Users no longer had to pause between words, whether they were dictating or using commands, and could speak at a natural pace. Training requirements were decreased from several hours to several minutes, and recognition accuracy continued to improve.
Over the past several years, voice technology has moved from the desktop to the enterprise in the form of voice middleware. Voice middleware encompasses platforms and applications that run on servers, such as IBM's WebSphere Voice Server, serving hundreds or thousands of customers via the telephone or Internet. Generally, these server-based voice applications are written to service a limited vocabulary and a large number of users, such as bank customers.
No "training" of a caller's voice is required. An example of voice middleware is a customer service application that uses Web technology. This new application might give customers a voice interface to the same Web application content that had previously been accessible only through the Internet. For instance, a customer may now call a voice-enabled Web application server at a brokerage firm and complete a trade - without operator assistance. This is done by speaking commands and listening to the same information that might normally be "seen" using a browser on a PC or workstation.
Another example of voice middleware is a voice-enabled flight information system, where a caller can receive flight information directly (such as late arrivals) rather than waiting on hold to speak to an agent. Today a caller can simply call a number, state the flight number and city, and receive the flight information audibly over the phone.
At the same time that speech recognition was finding its way to the enterprise, it also moved to the device. Embedded speech technology now enables mobile devices, which are typically constrained by the amount and type of system resources available (memory, processor speed, and storage space), to deploy voice. Speech recognition can now be used on devices - providing low-resource, small-vocabulary command-and-control speech recognition in a variety of languages. The software also supports a variety of real-time operating systems and microprocessors, making the development of robust mobile speech solutions easy and practical for both device and application developers.
The convergence of computers with telephones and handheld devices continues. The human voice becomes a Web browser. Surf the Web in the car while a text-to-speech application reads back the content, then tell the car to turn on the radio when finished. Check the status of an order without having to punch a telephone keypad. Use a PDA to conduct a banking transaction without touching the keyboard. What could be easier than just talking?
Trends and Directions
There are several forces driving the growth:
This is happening not only in the U.S., but also across the globe. For the most part, companies looking to deploy voice face a lot of similar concerns. They want to know what business applications will bring more value to their customers and set them apart from their competition.
Key to the growth of voice is VoiceXML. VoiceXML has been the catalyst for the deployment of speaker-independent, limited-vocabulary automatic speech-recognition systems in recent years. It has allowed applications such as voice portals and speech-enabled call centers to grow, paving the way for Web access via phone. This allows call centers to automate simple customer requests and reserve their live agents for more complicated tasks and inquiries, thus making more efficient use of them. The significance of VoiceXML is manifold:
User: "I'd like to buy more Fidelity, please."
System: "You currently own 500 shares of Fidelity Magellan. How many shares would you like to buy?"
User: "What's the going price?"
System: "The current price of Fidelity Magellan is... What else can I do for you?"
User: "Hmmmmm... I think I'll buy 300 more."
System: "Purchasing 300 shares of Fidelity Magellan at market price... please say 'yes' to confirm."
User: "No, make that $2,000."
System: "Purchasing 2000 dollars of Fidelity Magellan at market price... please say 'yes' to confirm."
User: "Do it."
System: "2000 dollars, or 125.25 shares, of Fidelity Magellan purchased at market price. What else can I do for you?"
When the user says "I'd like to buy more Fidelity, please," the software must extract the meaning from what was said. Clearly, there are extra words in this sentence - words that don't affect the meaning of the sentence - and the software must extract the pertinent words from the sentence to determine its meaning. In this sentence, the key concepts are "buy" (an action), "more" (implying a fund the user already owns), and "Fidelity" (a fund company). As you can see from its response, the application determined that the user owns Fidelity Magellan, and then asked how many shares the user would like to buy.
In this example, the user responds not with the number of shares to buy, but with a totally different question, "What's the going price?" This means the caller is in control, rather than being driven by long menus. At this point, the application must "switch gears" and obtain the current price of a fund. From the context, or what was previously said by the user, the application knows that the user wants the price of Fidelity Magellan.
You can see from the example how NLU technology can make the interaction between a user and the system more intuitive and effective. And NLU is not just a vision of the future. One of the most sophisticated uses of NLU technology is being deployed by 401(k) management company T. Rowe Price.
Their system is being rolled out to one million users who will be able to manage their retirement accounts using the enhanced Plan Account Line. The system doesn't require a caller to use a particular script. "We believe that most callers will save at least 30 percent of their time," says Heidi Walsh, vice president and senior marketing manager.
Developments in speech-recognition technology haven't been limited to telephony. Dictation has also leapt from PCs to the server. Most recently, IBM announced the WebSphere Voice Server for Transcription. This offering was introduced in early 2002 and provides large-vocabulary continuous dictation to the enterprise. Aimed at solution developers and service providers with document-workflow solutions, the Voice Server for Transcription can automate what has traditionally been a very manual and resource-intensive process - that of dictation transcription. For many years, physicians, lawyers, and other professionals whose work requires the production of high volumes of text have relied on typists and transcriptionists to convert their dictation into documents. With the WebSphere Voice Server for Transcription, the professional's dictation can be transcribed automatically, leaving transcriptionists only to correct and edit (rather than transcribe from scratch), thus improving their overall productivity and turnaround. Since skilled transcriptionists are expensive and hard to find, automated transcription makes the process more efficient.
So what exactly is transcription? Take the example at the beginning of this article. If this audio were sent to the WebSphere Voice Server for Transcription, it would result in something like this:
Objective: Patient presents in mild distress and pain.
Heart: Regular rate and rhythm.
A transcriptionist would edit and correct the transcribed text, and the workflow application would use the edited text to fill in the appropriate fields (e.g., "Assessment" and "Plan").
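To make the field-filling step concrete, here is a minimal sketch in Python of how a workflow application might split edited transcript text into labeled fields. It is an illustration only, not how the WebSphere Voice Server for Transcription or any particular workflow product works: the function name split_into_fields and the list of section labels are assumptions made for this example.

```python
import re

# Hypothetical section labels; a real workflow would define its own set.
SECTION_LABELS = ("Objective", "Heart", "Assessment", "Plan")


def split_into_fields(transcript, labels=SECTION_LABELS):
    """Map each known 'Label:' in the transcript to the text that follows it."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, labels)) + r"):\s*")
    # Every field starts empty so that missing sections are easy to spot.
    fields = {label: "" for label in labels}
    matches = list(pattern.finditer(transcript))
    for i, match in enumerate(matches):
        # A section runs until the next label, or to the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(transcript)
        fields[match.group(1)] = transcript[match.end():end].strip()
    return fields


edited = ("Objective: Patient presents in mild distress and pain. "
          "Heart: Regular rate and rhythm.")
fields = split_into_fields(edited)
print(fields["Objective"])
print(fields["Heart"])
```

Sections that were not dictated, such as "Assessment" and "Plan" here, come back as empty strings, which gives the transcriptionist or the workflow application an obvious place to flag missing content.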
The Case for Speech
Let's take a look at the financial industry specifically. With voice recognition, financial institutions can reduce contact center costs, particularly on nonrevenue-generating calls. The typical fully loaded cost of a CSR is roughly $45,000 per year (assuming a base salary of approximately $36,000), or upward of $1 to $1.50 a call. While IVR systems drive this cost down, voice enablement reduces it further by flattening menus and speeding up navigation. With voice recognition, call costs can be reduced to around 30¢ a call or less, depending upon call volume. Any increase in automated calls further drives down contact center costs. This kind of return on investment (ROI) means quick system payback. For example, some analysts claim that the payback on even "massive-scale, high-availability" voice-enabled contact center systems has been less than 18 months.
One large financial corporation is currently voice-enabling its automated basic banking services, including balance inquiry and funds transfer. Their cost justification for using speech recognition is based on three predictions. First, they feel they can shorten incoming call length by flattening touch-tone menus and allowing a caller to jump to a desired action. Second, they feel they can increase the percentage of automated calls on a yearly basis by 2%, which represents a large number of calls. Last, they feel they can capture roughly 20% of calls where the caller does nothing and simply defaults to a CSR. With a voice-enabled interface, they feel they can get callers to use the system rather than wait for a transfer.
There are many examples of enterprises adopting speech into applications in their organizations - and not just within the financial industry. While a business case can be built to show a quick return on investment, ROI is just one reason for justifying voice-enabled applications. Others include:
In the '80s, with the introduction of the PC, we increased the population of people who could access information. Then came the Internet, which used PCs as terminals to access the Web. Now we're moving to where we have small devices for an even larger population (including people who don't necessarily use a PC). Advances in voice technology have enabled people to speak directly to devices, rather than use traditional input methods such as the mouse or the keyboard.
Soon people will be able to use speech when it's easier to say something than to type it or wade through long menus, use a graphical interface when a visual representation serves their needs best, or use touch when that's the easiest way to make a selection. This is known as a multimodal interface. It combines all of the different ways to use technology, employing the most appropriate user interface for the task at hand.
For example, consider a busy mobile worker on the way to the airport who receives a call from a manager, wanting him in Hong Kong for a customer meeting - instead of Tokyo, as originally planned. Using a cell phone, the worker calls the voice-enabled, automated flight reservation number of an airline and requests a list of available flights to Hong Kong. Since the worker is using speech recognition, he gets immediate attention, rather than holding for the next available operator. Shortly after hanging up, a schedule of all available flights is displayed on his wireless PDA. The worker taps a selection, sending it back to the airline reservation server. The flight is booked. The worker used the interface (whether it was speech, graphics, or touch) that was the most convenient to use at the time, all within the same task.
In terms of the base speech technology, we'll continue to see improvements in recognition accuracy, including more languages and dialects supported and tools to make the creation of domain-specific language models easier.
We'll see unstructured dictation (where you don't need to dictate punctuation and formatting) and we'll also start seeing speech recognition more fully integrated with other voice technologies, such as speech synthesis, language translation, speaker verification, and natural language understanding. Imagine getting the minutes from a conference call automatically transcribed for you, with individual speakers identified as they speak, and then automatically translating these minutes into other languages - all in real time while the conference call is going on.
Conclusion
Speech recognition technology is being used on the desktop by doctors, lawyers, and even students to input large quantities of text... It's being embedded in PDAs and smart phones to make mobile computing easier and more natural... It's being deployed in cars... It's being used by enterprises to enhance their self-service customer-facing applications... It's being used in real time by professionals to fill in forms, such as insurance claims and trouble reports... It's being used at kiosks in airports, shopping malls, movie theatres, and theme parks so customers can get real-time information... It's being used in the home to control appliances... the possibilities are endless. Speech is quickly becoming a key user interface of choice. And, given its history in the past decade, significant progress will continue to be made; by 2010, speech recognition will truly be pervasive.