In the second part of our two-part series on converting DTMF applications to use speech recognition, we will look at a few types of speech applications, how to work with text-to-speech (TTS), and some considerations in building grammars.
The first type is the natural language application, which uses open-ended questions. For example: "Hi, thanks for calling Company X. What would you like to do?" This kind of question has many possible responses. The user interacts with the application naturally and says whatever they would like. The danger of natural language is that the number of natural responses is much, much higher than developers anticipate, so a developer has to account for so many responses that the application becomes more difficult to build. Natural language is good for the user in that it is easier and more pleasant to use, but if the developer fails to account for a user's response, the application suddenly becomes less usable. Accuracy is also generally a little lower with natural language, because larger grammars mean a greater chance of mistakes. For these reasons, we often recommend a directed dialog application instead.
A directed dialog application asks the user specific questions and provides guidelines as to what they should say. So instead of asking, "What would you like to do?" we would say, "Hi, and thanks for calling Bank X. Would you like information regarding your checking account, your savings account, or to make a transfer?" The user may then respond, "Checking account," or "Access my savings account," or "Make a transfer." We have directed the user to say what we want to hear, and the range of responses has become much narrower. This makes the application easier to build, develop, and test. Accuracy also increases, which makes things easier on everyone involved. This is why we recommend directed dialog, especially if you are new to speech application development.
Prompt design is a very important part of directed dialog. You may want to avoid prompts such as "For checking, say checking." This type of dialog is a holdover from the DTMF days. Instead, simply tell your callers that they can access their checking account and they will figure it out. Consider a prompt such as, "Tell me what you would like to do: you may access your checking or savings account." The user can figure out that saying "access checking" will get them where they would like to be. Prompts, when done well, give the user a mental model of how the application works.
The speech engine works on probabilities; it is never 100 percent certain that it understood what the caller actually said. The engine returns what it thinks the user said along with an associated score, referred to as a confidence score. The confidence score represents how likely it is that the engine recognized the user's speech correctly. For example, the engine may have heard the word "checking," but with only a 60 percent chance that it heard correctly.
The application can then look at that confidence score, determine that it is low, and present a confirmation before transferring the user to the checking account: "Would you like to access your checking account?" The user can then say yes or no. It's important not to confirm every response a caller makes; only confirm when the confidence score is low. If the engine returns a confidence score of 98 percent, there is no reason to frustrate the user with a confirmation.
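The confirm-only-when-uncertain logic above can be sketched in a few lines. This is an illustrative example, not any particular engine's API; the 80 percent threshold and the function name are assumptions you would tune for your own application.

```python
# Hypothetical sketch: confirm a recognition result only when the
# engine's confidence score falls below a threshold. The threshold
# value (0.80) is illustrative, not an engine default.

CONFIRM_THRESHOLD = 0.80

def handle_result(utterance, confidence):
    """Return the next prompt to play for a recognition result."""
    if confidence >= CONFIRM_THRESHOLD:
        # High confidence: act on the result immediately.
        return f"Accessing your {utterance}."
    # Low confidence: confirm before acting.
    return f"Would you like to access your {utterance}?"

print(handle_result("checking account", 0.98))  # acts immediately
print(handle_result("checking account", 0.60))  # confirms first
```

In a real deployment you would pick the threshold empirically, based on how often confirmations turn out to have been unnecessary.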
Text-to-speech (TTS) should obviously be used for dynamic data that cannot be prerecorded, such as account balances read from a database. TTS is great for those occasions, and there are excellent TTS engines available. For static prompts, however, you'll generally want to use actual recorded voices: speech applications are intended to be more personable and more engaging, and recorded human voices definitely fit the bill better than text-to-speech engines.
If you've done a good job designing prompts, your grammars will be easier to design and develop. The key rule with grammars is to keep them as small as you can while still providing good coverage. In case you're not familiar with grammars: a grammar is the list of words and phrases that can be recognized at a given prompt. Each prompt has a grammar, and that grammar defines everything the engine will listen for at that point in the dialog.
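As a concrete illustration, a grammar for the banking prompt above might map each accepted phrase to an application action. Real speech platforms define grammars in their own formats (such as the W3C SRGS standard); this plain-Python sketch, with invented phrase and action names, just shows the idea of a small, closed set of recognizable utterances.

```python
# Illustrative grammar for the banking prompt: the closed set of
# phrases the engine should listen for, mapped to application actions.
# Phrase and action names are assumptions, not a real engine's API.

GRAMMAR = {
    "checking": "CHECKING",
    "checking account": "CHECKING",
    "access my checking account": "CHECKING",
    "savings": "SAVINGS",
    "savings account": "SAVINGS",
    "access my savings account": "SAVINGS",
    "transfer": "TRANSFER",
    "make a transfer": "TRANSFER",
}

def resolve(utterance):
    """Map a recognized utterance to an action; None if out of grammar."""
    return GRAMMAR.get(utterance.lower().strip())

print(resolve("Checking account"))   # in grammar
print(resolve("pay my mortgage"))    # out of grammar
```

Note how each action accepts a few predictable variants, but the total list stays small.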
As grammars get bigger, there is a greater chance that the engine will mistake what the user said for another word in the grammar. Smaller grammars therefore generally have higher accuracy rates.
If you have good directed dialog prompts with clear options, most users will give predictable responses, and you'll want to design the grammar to accommodate them. Don't try to cater to the 2 to 5 percent of callers who make outrageous responses, or responses that simply don't make sense. Adding this kind of noise to your grammar won't help: callers who use the system inappropriately are nearly impossible to predict. Worse, growing the grammar for this small percentage of callers gives the rest of your callers, who are behaving appropriately, a greater chance of being misrecognized as one of the added items. Design your grammar for the majority of your callers, not the anomalies.
You will also want to avoid words that sound similar and phrases that rhyme. Making each option distinct improves accuracy.
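One rough way to screen a grammar for confusable options is to compare phrases for textual similarity, as a coarse stand-in for acoustic similarity. This sketch uses Python's standard-library `difflib.SequenceMatcher`; the 0.75 cutoff is an assumption, and a real review should still be done by ear.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Coarse screen for confusable grammar phrases: flag pairs whose
# spelling is highly similar. Real confusability is acoustic, so
# treat this only as a first pass; the cutoff is an assumption.

def confusable_pairs(phrases, cutoff=0.75):
    """Return phrase pairs whose similarity ratio meets the cutoff."""
    return [(a, b) for a, b in combinations(phrases, 2)
            if SequenceMatcher(None, a, b).ratio() >= cutoff]

print(confusable_pairs(["billing", "filling", "transfers"]))
# flags the billing/filling pair; "transfers" is distinct
```

Pairs that this flags are candidates for rewording so each option sounds clearly different.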
As you deploy the application, you'll want to listen to what users are saying during call sessions and observe how they navigate the application. This is part of what is called tuning. You may find that there are certain words and phrases in your grammar that users rarely or never say; if possible, remove them and prune the grammar to keep it as lean as possible.