An old proverb says, "Man plans, God laughs." Nowhere is this more true than in software development. As developers, we try to plan for every contingency. In our minds we walk down every path our users may travel and compensate accordingly. Without fail, customers will traverse paths never intended for a user's feet. Immediately they come back to us and say "Look, it's broken!" No matter how much we plan, we must be able to modify our systems based on how our users actually use them, not on how we expect them to be used. In the world of Voice User Interfaces, this process is called tuning, and a tuning tool is a vital component of VUI development.

A good VUI is difficult to write and test. When users complain that an application is "broken," it may mean that the speech recognition is failing, but just as often it means that the developer of the application simply did not take into account the sort of responses users would give. Unlike DTMF applications, or even traditional GUI applications, an application with a speech-driven VUI allows for an endless number of potential user responses to any given prompt.

As an example, LumenVox produced a VUI demonstration that would illustrate the capabilities of our ASR. The demo would tell people the current weather for a city of their choice. Using U.S. Census data, we built a grammar with the name of every city with a population of more than 5,000 people. Once a user selected a city and state, the system retrieved weather information from the Internet and read it to them using text-to-speech.
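To give a sense of how such a grammar can be assembled, here is a minimal sketch in Python. It assumes the census data is a CSV file with city, state, and population columns (the file and column names are ours, purely for illustration) and simply emits the spoken phrases the grammar needs to cover; a production grammar would be written in a standard format such as SRGS.

```python
import csv

# Minimal sketch (not the actual LumenVox build script): filter census place
# data down to cities above the population threshold and print the phrases a
# city-and-state grammar would need to cover. File and column names are
# assumptions made for this example.
MIN_POPULATION = 5000

def city_phrases(census_csv="census_places.csv"):
    phrases = set()
    with open(census_csv, newline="") as f:
        for row in csv.DictReader(f):
            if int(row["population"]) > MIN_POPULATION:
                # e.g. "San Diego" + "California" -> "san diego california"
                phrases.add(f"{row['city']} {row['state']}".lower())
    return sorted(phrases)

if __name__ == "__main__":
    for phrase in city_phrases():
        print(phrase)
```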

It was a fairly straightforward design, with no obvious snags or hang-ups. Almost immediately after the system was deployed, however, users reported it was failing. We immediately called the system to ensure it was working:

Speech Application: Please tell me the area you would like the weather for.

Caller: San Diego, California.

Speech Application: I heard "San Diego, California." Is this correct?

Caller: Yes.

Speech Application: The weather for San Diego, California is...

The voice interface we had designed seemed to work fine from a technical perspective. The speech recognition was accurate, and all the components were working together as expected. And yet users kept reporting the system was failing. It wasn't until we reviewed actual recordings of calls using our Speech Tuner that the problems in the system's design were exposed. Our Speech Tuner allows us to listen to the audio recordings of callers, see what the Speech Engine recognized, and see how changes to grammars would have affected recognition.

One key feature of the Tuner is its Call Browser, a module that allows us to see details about a call and each utterance in that call. This way we can follow a user through a call, hear what the caller said, and see how the Engine interpreted each response. A common user experience went like this:

Speech Application: Please tell me the area you would like the weather for.

Caller: 92123.

Speech Application: I am sorry, that is not a valid choice. Please try again.

Caller: ZIP code 92123.

Speech Application: I am sorry, that is not a valid choice. Please try again.

Caller: Area code 619.

Speech Application: I am sorry, that is not a valid choice. Please try again.

Caller: The moon, or anywhere nearby.

We listened in horror as users ripped our robust application to shreds. While a developer cannot plan for every possible phrase a user may utter, it was clear the prompt was misleading our callers. The seemingly simple request, "Please tell me the area you would like the weather for," was far too open-ended. We heard responses such as "Near my house" and "The beach."

Needless to say, after reviewing the results of our seemingly simple and fail-proof application, we decided that the initial prompt needed to elicit a specific response rather than pose such an open-ended question. We changed the prompt to "Please tell me the city and state you are interested in," and the application's success rate improved significantly. By reviewing actual calls with the Tuner, we were able to quickly pinpoint the exact cause of user failures and adjust the system accordingly. Just as importantly, we were able to review the results of the change to ensure it was successful.

The next demonstration application we developed was a mock pizza-ordering application. The demo let users choose the toppings, size, and crust of a pizza. As with the weather demo, we built the initial application, tested it internally, and deployed it. Once again, users immediately complained that the application simply did not work. When the system asked users what size pizza they wanted, we expected them to ask for a small, medium, or large pizza. Listening to calls, we heard interactions such as:

Speech Application: "What size pizza would you like?"

Caller: "Twenty–seven inch, please."

Speech Application: "Hey, we only make three sizes of pizza: small, medium, or large."

Caller: "Medium, then."

Speech Application: "Was that a medium?"

Caller: "Yes."

Even though the mismatch was handled by the "no match" prompt, this failure to anticipate how callers would answer an ambiguous question could lead to caller fatigue and frustration. It takes only a few exchanges that fail to move callers toward the system's implied goal (in this case, ordering a pizza) for that frustration to set in.

Using our Speech Tuner, we were able to provide immediate and satisfying proof of the need for a change in the system. We added grammar entries that allowed users to specify a pizza size in inches, as well as to say "small," "medium," or "large."

Unlike with the weather demo, here we decided to expand the grammar to accommodate a larger range of responses rather than simply rephrase the prompt. In the case of the weather demo, there was no reasonable way to accommodate requests like "I need weather near my house." But the responses to the pizza demo fell within a limited domain that a modified grammar could easily handle, so it made sense to make that change.
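As a rough illustration of what that change involves, the Python sketch below maps a recognized size phrase onto one of the three sizes the application actually sells. It assumes the recognizer returns inch values as digits, and the inch cut-offs are our own assumptions for the example; the real grammar and its semantic mapping would be defined in the application's grammar format, not in application code like this.

```python
import re

# Illustrative sketch only, not the LumenVox grammar itself: once the grammar
# also accepts sizes given in inches, the application still has to map whatever
# the caller said onto the three sizes on the menu.
def canonical_size(recognized_text: str) -> str | None:
    """Map a recognized size phrase onto "small", "medium", or "large"."""
    text = recognized_text.lower()
    for size in ("small", "medium", "large"):
        if size in text:
            return size
    match = re.search(r"(\d+)\s*inch", text)   # e.g. "27 inch, please"
    if not match:
        return None                            # out of grammar; re-prompt the caller
    inches = int(match.group(1))
    if inches <= 12:        # assumed cut-offs, purely for illustration
        return "small"
    if inches <= 16:
        return "medium"
    return "large"

print(canonical_size("27 inch, please"))     # -> large
print(canonical_size("Medium, then."))       # -> medium
print(canonical_size("hold the anchovies"))  # -> None
```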

Before we made the changes to our live application, we needed to test the grammar. To do this, we transcribed a large number of utterances. The Tuner's built-in transcriber speeds this up by automatically entering the Speech Engine's result into the transcript, but for out-of-grammar utterances we still needed to spend time transcribing what the users said.

Once we had transcribed call data, we were able to make use of the Speech Tuner's Grammar Tester component. The tester takes transcribed interactions and gives us a list of the grammars that were active during the recognitions. It also gives us a wealth of statistics about recognition accuracy, based on the transcripts and the recognition results.

The key feature of the tester is the ability to modify the grammars and then run the audio back through the Speech Engine, producing new results based on the revised grammars. This allowed us to evaluate how the application would have handled the original caller responses with the new grammar entries (the ones that allowed users to specify a size in inches). Our semantic error rate dropped significantly, because the grammars now accommodated what users were actually saying. The grammar test gave us empirical evidence that the change would be beneficial and could be made swiftly, without negative impact on users.
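To make that statistic concrete, here is a small Python sketch of how a semantic error rate can be computed once each utterance has both a human transcript and a new recognition result. The data layout and the example values are assumptions for illustration, not the Speech Tuner's actual report format.

```python
from dataclasses import dataclass

# Hedged sketch of the kind of statistic a grammar test yields: for each
# recorded utterance, compare the meaning of the human transcript with the
# meaning the engine returned when the audio was re-run against the revised
# grammar.

@dataclass
class TunedUtterance:
    transcript_meaning: str   # size implied by the human transcript, e.g. "large"
    recognized_meaning: str   # size returned by the engine with the new grammar

def semantic_error_rate(utterances: list[TunedUtterance]) -> float:
    """Fraction of utterances whose recognized meaning disagrees with the transcript."""
    if not utterances:
        return 0.0
    errors = sum(u.recognized_meaning != u.transcript_meaning for u in utterances)
    return errors / len(utterances)

# Example: one disagreement out of three utterances -> ~0.33
calls = [
    TunedUtterance("large", "large"),
    TunedUtterance("medium", "medium"),
    TunedUtterance("large", "medium"),
]
print(f"semantic error rate: {semantic_error_rate(calls):.2f}")
```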

Using a good tuning tool allows developers to quickly harness the only experiences that really matter: those of users. Only when we understand how our users actually interact with our speech applications can we plan improvements. And, with effective testing tools, we can accurately assess how those changes will affect our applications before deploying them in production environments.
