In this article, we will discuss how to use different metrics to evaluate your trained intent model and the overall performance of your Conversational AI Cloud project. We assume that you already know how to set up an intent model in Conversational AI Cloud and that you have one trained. If you're not familiar with terms like precision, recall, and F1-score, please take a look at our metrics article.
Optimization can be done at two points in your project's maturity: before going live or after going live. The tools you'll use will be the same, but the insights you'll gain will be slightly different. In this article, we propose a process for evaluating the performance of your intent model and making informed decisions based on that performance. This process will stay relevant throughout your intent model's lifespan.
To properly optimize your intent model, you need to ensure that your intents are set up, your intent model is trained, and each intent has the recommended minimum of five test phrases. If you lack test phrases, add them to each intent first. As a reminder: a test phrase for an intent is an example of what your customers might say to trigger that intent, similar to an utterance, except that it's used after training for validation purposes.
Confusion Matrix
When evaluating the performance of an intent model, we calculate precision, recall, and F1-score. These metrics are determined by running each test phrase through the newly trained intent model, which classifies each test phrase under one of four categories:
True positive: the test phrase was supposed to match intent X, and it did.
False negative: the test phrase was supposed to match intent X, but didn't.
False positive: the test phrase was not supposed to match intent X, but it did.
True negative: the test phrase was not supposed to match intent X, and it didn't.
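These per-phrase outcomes roll up into the precision, recall, and F1-score reported per intent. As a minimal sketch of the arithmetic (not Conversational AI Cloud's internal implementation):

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute per-intent metrics from true positive, false positive,
    and false negative counts (true negatives don't enter these formulas)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

# Example: 8 test phrases correctly matched, 2 wrongly matched, 2 missed
p, r, f = precision_recall_f1(tp=8, fp=2, fn=2)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")  # precision=0.80 recall=0.80 f1=0.80
```

Note that true negatives don't appear in any of the three formulas; with many intents, almost every phrase is a true negative for almost every intent, which would make the metrics look deceptively good.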
Once we have these four categories mapped for each test phrase, they can be placed in a matrix to visualize the overall performance of our intent model. See the example below:
However, this matrix only works for a single intent. Since we have multiple intents, the matrix will be shown like this in Conversational AI Cloud:
This confusion matrix shows how the model is performing with the test data provided, and how often it's correct when it's supposed to be. If the intent is too similar to another intent, there may be many false positives or false negatives for those two intents, as the model confuses them when classifying input.
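To illustrate how such a matrix is built, here is a minimal sketch; the intent names and classification results are invented for the example, not taken from Conversational AI Cloud:

```python
from collections import Counter

# (expected intent, intent the model actually predicted) for each test phrase
results = [
    ("check_order", "check_order"),
    ("check_order", "cancel_order"),   # confusion between two similar intents
    ("cancel_order", "cancel_order"),
    ("cancel_order", "check_order"),   # confusion in the other direction
    ("opening_hours", "opening_hours"),
]

# Count each (expected, predicted) pair: diagonal cells are correct matches,
# off-diagonal cells reveal which intent pairs the model confuses.
matrix = Counter(results)
intents = sorted({i for pair in results for i in pair})

print("expected \\ predicted:", *intents)
for expected in intents:
    print(expected, *[matrix[(expected, pred)] for pred in intents])
```

In this toy data, "check_order" and "cancel_order" each lose one test phrase to the other, which is exactly the off-diagonal pattern that signals two intents are too similar.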
So far, we've explained what the data in the confusion matrix means, but we haven't discussed what can be done with this data.
Determine What To Optimize
Firstly, it's not necessary for your model to achieve 100% precision and recall. In real-world scenarios, there's often a trade-off between the two: optimizing for one will lead to a decrease in the other, and vice versa.
Once you've decided to optimize your model, it's important to determine which intent to optimize next. We suggest selecting your lowest-performing intents overall and using them as a basis for your optimization efforts. You can easily identify your lowest-performing intents by examining the "Precision, recall, and F1-score" graph shown in the image below:
In the example above, "Example 4" is the intent with the lowest F1-score at 32%, while "Example 3" has an F1-score of 78% and "Example 5" has an F1-score of 94%. It's likely not the best use of your efforts at this time to optimize the best-performing intents when an intent like "Example 4" is significantly underperforming. However, this may vary based on historical insights from your customers. In those scenarios, make sure your test phrases are properly realigned with what your end-users are asking to keep the metrics as relevant as possible.
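The same triage can be done programmatically once you have the per-intent scores; a small sketch using made-up F1 values mirroring the example above:

```python
# Per-intent F1-scores (invented values matching the example in the text)
f1_scores = {"Example 3": 0.78, "Example 4": 0.32, "Example 5": 0.94}

# Sort ascending so the weakest intents come first
worst_first = sorted(f1_scores.items(), key=lambda item: item[1])
print(worst_first[0])  # ('Example 4', 0.32) — the first candidate for optimization
```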
Intent Classification Threshold
To improve the performance of your intent model, you can adjust the intent classification threshold in your project's admin portal environment. Keep in mind that changing the classification threshold will affect your precision and recall. If you set a high threshold, your model will have a "higher standard" for positive classifications, resulting in fewer overall classifications but better precision. A lower threshold, on the other hand, gives your model a "lower standard" for positive classifications, resulting in more positive classifications and better recall, but also a higher risk of misclassifying input that should not have matched any intent.
It is best to use the classification threshold once your model has matured and you have a good understanding of your precision, recall, and what your customers value most based on their feedback. Be cautious when adjusting the classification threshold, as it may have a larger impact than you expect.
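To see the trade-off concretely, here is a toy sketch of how raising the threshold shifts precision and recall. The confidence scores and labels are invented for illustration, not Conversational AI Cloud output:

```python
# (model confidence, whether the phrase truly belongs to the intent)
predictions = [(0.95, True), (0.80, True), (0.65, False), (0.55, True), (0.40, False)]

def metrics_at(threshold: float) -> tuple[float, float]:
    """Precision and recall if every score >= threshold counts as a positive match."""
    tp = sum(1 for conf, truth in predictions if conf >= threshold and truth)
    fp = sum(1 for conf, truth in predictions if conf >= threshold and not truth)
    fn = sum(1 for conf, truth in predictions if conf < threshold and truth)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

print(metrics_at(0.5))  # (0.75, 1.0): more matches, but one false positive slips in
print(metrics_at(0.9))  # precision 1.0, recall ~0.33: only the surest match remains
```

The same data produces very different metrics at different thresholds, which is why it's worth knowing which side of the trade-off your customers value before moving the dial.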
Ways to Optimize Your Intent Recognition Model
This article covers ways to optimize your intent recognition model. To approach this, you need to have clear goals for your Conversational AI Cloud project and your organization.
In this section, we provide different actions you can take to optimize the model in any given direction. It is up to you (and your team) to decide which direction to take.
Improving overall model quality:
If you see one intent with a lot of confusion (false negatives/false positives) with one or more other intents, it is likely that:
This intent is not unique enough to be its own intent and should be merged with one of the intents it is being confused with. Evaluate this for the intent and the ones it is being confused with.
This intent's test phrases have poor quality. Perhaps some of the test phrases should be put under a different intent.
If you see one intent that a number of other intents are being confused with, it is likely that:
This particular intent's utterances are too general, or contain utterances that would be better served under a more specific intent.
If you see one intent having an ever-growing list of utterances with a large number of articles linked to it, it might be wise to split it out into two or more separate, more specialized intents. This may provide a smoother conversational experience for your customers. Be careful with this optimization: only split an intent if the resulting intents have enough variance in their utterances to warrant the split; otherwise you'll just cause intent confusion.
Optimizing for precision or recall:
For precision, make your utterances very specific to what your customers are asking. Avoid generalized utterances, as they capture more user questions, including ones the intent shouldn't handle.
For precision, avoid confusion between intents as much as possible. Dive into the utterances you have defined between confused intents and be very critical of the utterances you add or don't add.
For recall, make your utterances more generic. You can easily expand utterances over time as this will add to your overall recognition.
For recall, let your customers' questions drive your optimizations. Unrecognized questions are your worst enemy, so add them as utterances to your intents.
For precision, increase your intent classification threshold. This will result in a lower recognition rate, but the questions that do get recognized are much more precise.
For recall, decrease your intent classification threshold. You'll loosen some of the restrictions placed on a positive classification, but you'll keep your customers more engaged even if that means triggering the wrong intent on occasion.