(MAR 2009) — If you're looking to make a good speech recognition application a great one, adding weights to your grammar entries is a good place to start.
The ability to add weights is a powerful one, allowing you, in a sense, to rewire the internal workings of the LumenVox Speech Engine. Changing grammar weights helps the Engine better understand how your users speak, and thus helps improve accuracy.
Because of this tremendous power, adding weights can also be dangerous: you can harm recognition accuracy just as easily as improve it. This Tech Bulletin will cover the basics of weighting and outline some good testing strategies to ensure you get the best results from grammar weighting.
The LumenVox Speech Engine is essentially a probability machine. It compares mathematical features of audio to its internal acoustic model, and decides which sounds in a language a piece of audio represents. This decision is a "best guess" or probability, and is based on a number of factors.
Without getting too detailed, grammar weights allow you to alter the probabilities the Engine uses when making these decisions. So when the Engine has to choose whether a given bit of sound represents the words "New York" or the word "Newark," a weight can help push the Engine toward the alternative preferred by the grammar designer.
As an application developer, this gives you more control over the Engine and allows you to fine-tune recognitions. You can think of weighting as allowing you to supply external information about users or the application to the Speech Engine.
Using the example from above, consider an application that asks users which U.S. city they live in. Assuming a random distribution of users from throughout the U.S., New York City should be the most common response, as it has the largest population.
Newark, a city in New Jersey, has a much smaller population than the Big Apple and would therefore be a much less frequent response. But the Speech Engine has no way of knowing that, and because "Newark" and "New York" have very similar pronunciations, the Engine is likely to mix them up.
By default, the Engine considers all grammar entries to be equally probable as responses, and picks answers based solely on their sounds. By adding weights to a city grammar, you could tell the Engine to favor "New York" over "Newark," and in general to favor bigger cities over smaller ones, if you applied weights based on population.
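A city grammar along those lines might look something like the following GrXML sketch. The city names and weight values here are purely illustrative (chosen to be roughly proportional to population), not recommendations:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         version="1.0" xml:lang="en-US" mode="voice" root="city">
  <rule id="city" scope="public">
    <one-of>
      <!-- Hypothetical weights, roughly tracking population -->
      <item weight="10"> new york </item>
      <item weight="5"> los angeles </item>
      <item weight="1"> newark </item>
    </one-of>
  </rule>
</grammar>
```

With these weights, an ambiguous utterance that scores similarly against "new york" and "newark" acoustically is more likely to be returned as "new york."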
You can also use weights to improve existing applications: transcribe user responses, determine how frequently each response occurs, and identify common recognizer errors. If you find that users are consistently having trouble being understood when they say a particular word or phrase, you can weight that item more heavily to help correct the Engine.
The exact syntax for adding weights is simple: in an ABNF grammar, you add a weight to an item by enclosing a number between two forward slashes, e.g.:
$rule = /10/ yes | /1/ no;
For GrXML, apply the weight attribute to the item element, like:
<item weight="10"> yes </item>
<item weight="1"> no </item>
In both examples, we have a grammar that allows for a user to say either "yes" or "no," but the "yes" answer is weighted 10 times more heavily than the "no" answer.
Weights are relative to one another, with each grammar item having a base weight of 1. When you apply a weight to an item, the base weight is multiplied by the weight you supply to come up with a modified weight. It is this modified weight that will be used by the Engine.
This means that to increase the likelihood of a grammar item being returned, you supply a weight greater than 1. To reduce the likelihood of an item being returned, you supply a weight between 0 and 1 (weights should not be negative).
Internally, the Engine performs a search when supplied with audio. It compares the acoustic features of the audio with its built-in acoustic model. Based on the quality of matches between the features in the audio and the features in the acoustic model, it returns one or more acoustic matches.
At the same time, the Engine analyzes the grammar and builds a language model. It parses the grammar and determines what utterances are valid responses per the language model (so an utterance of "Please" is not a valid response for a straight yes/no grammar).
These two models influence one another. The language model helps restrict the acoustic search, and the acoustic search helps restrict parses through the grammar. The Engine does not try to follow paths through the acoustic model that are not allowed by the language model, and vice versa. The Engine ultimately calculates an acoustic score and a language score for any given answer, and combines them to come up with an overall score for the result.
When you supply grammar weights, you are influencing the language score. The language model applies penalties to the acoustic search, making certain paths less likely. It takes weighting into account by first normalizing all weights on a 0–1 scale. If you supply one item with a weight of 2 and another with a weight of 1, the first item is normalized to 0.66 and the second to 0.33 (the first item is twice as likely as the second).
To summarize: weighting is a complicated activity that may not work exactly as you expect. The more drastic the weight, the more effective it is, but it is also possible to make a weight so drastic that the Engine never returns anything but that item. It is important to experiment and test weights carefully.
As outlined above, grammar weighting can be tricky. Because you are influencing complex interactions internal to the Speech Engine, you will want to experiment heavily and perform testing before deploying any new grammar weights.
To do this, you will need some testing tools such as the LumenVox Speech Tuner, and a good set of transcribed data from users of the speech application. This will allow you to add weights to a grammar, run audio through the Engine again, and see how the accuracy changed.
Ideally, you will want to have two sets of data with which to test and weight. The first set is the one you will analyze heavily in order to determine what sorts of weights to apply. You will look at the frequency of responses, find common errors, and go through the normal tuning process with an eye toward weighting.
The other test set you will set aside and not analyze, saving it for when you are happy with your weights. This set is a sort of control set that you will use for a final round of testing. The idea is to avoid the situation where you overtune a grammar for a specific data set.
There is a danger that once you analyze a particular set of utterances too much, you become overly familiar with the quirks inherent in any sample. It is basic probability and statistics that any sample you take will vary from the population mean in some ways.
Thus you can build a grammar that works great for a specific sample but less well for the average caller. This is where having a control comes in: you test your changes against a control set you have not tuned for. If you are happy with the results of your new grammar using the control, you can feel much safer about putting those changes live.
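The two-set strategy above can be sketched in Python: split your transcribed utterances into a tuning set (analyzed heavily) and a held-out control set, then compare accuracy on the control set before and after weighting. The function names and data shapes here are hypothetical, not part of any LumenVox tool:

```python
import random

def split_utterances(utterances, control_fraction=0.2, seed=42):
    """Shuffle the utterances and split them into (tuning, control) sets."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(utterances)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - control_fraction))
    return shuffled[:cut], shuffled[cut:]

def accuracy(results):
    """Fraction of (recognized, transcribed) pairs that match exactly."""
    if not results:
        return 0.0
    return sum(rec == ref for rec, ref in results) / len(results)
```

You would tune weights while looking only at the tuning set, then run the control set through the Engine with the old and new grammars and compare the two `accuracy` values before going live.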