It’s often hard to evaluate the performance and quantify the potential impact a tool can have on the business.

 

Summary

  • Beyond accuracy, the False Positive and False Negative rates are sensible, intuitive ways of assessing performance
  • Not all anomaly detectors are equal: performance scores can differ substantially between anomaly detectors, operating on the same real-life time-series data for business metrics
  • In our test data, Avora’s anomaly detector outperforms Facebook Kats, with significantly lower False Positive and False Negative rates and comparable accuracy
  • Even lower False Positive and False Negative rates can be achieved with hyper-parameter tuning, with no reduction in accuracy

Introduction

Businesses everywhere have more and more data they can use to analyse performance and make data-driven decisions. However, many companies find themselves with more data than people can possibly track and analyse. As a result, AI-powered business intelligence tools, and Anomaly Detection in particular, play an increasingly important role in business success.

There is no shortage of business intelligence offerings and solutions, but it’s often hard to evaluate their performance and quantify the potential impact a tool can have on the business. Among the reasons that make the evaluation hard are:

1. A lack of comparative performance datasets that relate to noisy, real-life business performance data
2. Performance is described using complex scientific metrics that are not easily translated into the business world

At Avora we have created an evaluation pipeline that uses real-life time-series business data to benchmark Avora’s performance against the well-known Facebook Kats Anomaly Detector, which is closely linked to the popular Facebook Prophet package.

Intuitively Measuring & Explaining Performance

Beyond accuracy, the most commonly used metrics when evaluating anomaly detection solutions are F1, Precision and Recall. One can think about these metrics in the following way:

  • Recall answers the question: what proportion of true anomalies was identified? It is calculated as True Positives / (True Positives + False Negatives)

  • Precision answers the question: what proportion of identified anomalies are true anomalies? It is calculated as True Positives / (True Positives + False Positives)

  • F1 Score summarises the overall performance of the anomaly detection model by combining Recall and Precision using their harmonic mean: 2 * (Precision * Recall) / (Precision + Recall)

For example:

You are tracking a Sales metric as one of your KPIs. You receive a notification that 10 anomalies have been identified. You check the graph and confirm that only 6 of the 10 dates are indeed anomalies. However, you also notice 9 other dates on which the Sales metric behaved unusually and which you would consider to be anomalies. So now you have 6 correctly identified anomalies (True Positives), 4 incorrectly identified anomalies (False Positives) and 9 missed anomalies (False Negatives).

In this scenario the metric values would be:

  • Recall: 6 / (6 + 9) = 0.4
  • Precision: 6 / (6 + 4) = 0.6
  • F1 Score: 2 * (0.4 * 0.6) / (0.4 + 0.6) = 0.48
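
The arithmetic above can be reproduced in a few lines of Python. This is a minimal sketch: the function name `precision_recall_f1` is our own choice for illustration and is not part of Avora or Kats, and returning 0.0 when a denominator is zero is an assumed convention.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from raw counts.

    A ratio with a zero denominator is reported as 0.0. This is one
    common convention and an assumption of this sketch.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Worked Sales example: 6 true positives, 4 false positives, 9 false negatives.
precision, recall, f1 = precision_recall_f1(tp=6, fp=4, fn=9)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Note that True Negatives do not appear anywhere in these three metrics, which is why they stay informative even when anomalies are rare.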

False Positive and False Negative Rates

When deciding on an anomaly detection system, it is important to pay attention to two metrics: the False Positive rate and the False Negative rate.

The False Positive rate helps you understand how often, on average, your detector will cry wolf and flag data points that are not true anomalies.

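In the example above, the False Positive rate is 4 / 10 = 0.4, or 40%: the system flagged 10 anomalies, of which only 6 were true anomalies, so 40% of the detected anomalies were not anomalous at all. Defined this way, the rate equals 1 - Precision (statisticians call it the false discovery rate); it is not the textbook false positive rate FP / (FP + TN), which is measured against all normal data points.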

Pick the system with the lowest possible False Positive rate. If the False Positive rate is too high, the users will turn off the system since it is more distracting than useful.

False Negative rate shows how many anomalies were, on average, missed by the detector.

In the worked example the False Negative rate is 9/15 = 0.6 or 60%. The system identified 6 true anomalies but missed 9. This means that the system missed 60% of all anomalies in the data.

Choose the system with the lowest possible False Negative rate. If the False Negative rate is too high, you will miss a lot of crucial anomalies and, in time, lose trust in the system.
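
Both rates can also be derived directly from the set of dates a detector flagged and the set the analyst labelled. The sketch below uses the definitions from this article (False Positives relative to everything flagged, False Negatives relative to all true anomalies); the function name `alert_error_rates` and the day indices are hypothetical and chosen only to match the worked example.

```python
def alert_error_rates(flagged: set, truth: set) -> dict:
    """Rates as defined in this article, from sets of anomaly timestamps.

    false_positive_rate: share of flagged points that are not true anomalies.
    false_negative_rate: share of true anomalies that were not flagged.
    """
    tp = len(flagged & truth)   # flagged and truly anomalous
    fp = len(flagged - truth)   # flagged but normal
    fn = len(truth - flagged)   # anomalous but missed
    return {
        "false_positive_rate": fp / (tp + fp) if flagged else 0.0,
        "false_negative_rate": fn / (tp + fn) if truth else 0.0,
    }


# Illustrative day indices only: 10 flagged days, 15 true anomalies, 6 in common.
flagged_days = set(range(0, 10))   # days 0..9 flagged by the detector
true_days = set(range(4, 19))      # days 4..18 labelled by the analyst
rates = alert_error_rates(flagged_days, true_days)
print(rates)
```

With these inputs the function reproduces the 40% and 60% figures from the worked example.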

Methodology

For the performance comparison we created a system that aimed to provide an objective, unbiased evaluation.

1. Time-series data from real-life examples was collected & anonymised.
2. Anomalies were manually labelled prior to performance evaluation. These labels were stored as the ground truth dataset, based on the analyst’s assessment and independent of the results from either algorithm.
3. The ground truth dataset was used by a pipeline that only performed the evaluation after both the Avora and Kats anomaly detection algorithms had completed their labelling.
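
The evaluation step in this list can be sketched as follows. This is not the pipeline we ran, only a minimal illustration under stated assumptions: each detector emits one 0/1 label per data point, the names `confusion_counts`, `evaluate`, `detector_a` and `detector_b` are hypothetical, and the labels are toy values rather than benchmark data.

```python
from typing import Dict, List


def confusion_counts(truth: List[int], predicted: List[int]) -> Dict[str, int]:
    """Count TP, FP, FN, TN for two equal-length 0/1 label sequences."""
    if len(truth) != len(predicted):
        raise ValueError("label sequences must have the same length")
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for t, p in zip(truth, predicted):
        if t and p:
            counts["tp"] += 1
        elif p:
            counts["fp"] += 1
        elif t:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


def evaluate(truth: List[int], outputs: Dict[str, List[int]]) -> Dict[str, Dict[str, float]]:
    """Score every detector against the same ground truth labels."""
    if not truth:
        raise ValueError("ground truth must not be empty")
    report = {}
    for name, predicted in outputs.items():
        c = confusion_counts(truth, predicted)
        report[name] = {**c, "accuracy": (c["tp"] + c["tn"]) / len(truth)}
    return report


# Toy labels for illustration only; not taken from the benchmark datasets.
ground_truth = [0, 0, 1, 1, 0, 0, 0, 1, 0, 0]
detector_outputs = {
    "detector_a": [0, 0, 1, 1, 0, 1, 0, 1, 0, 0],
    "detector_b": [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
}
for name, scores in evaluate(ground_truth, detector_outputs).items():
    print(name, scores)
```

In this toy case `detector_b` misses two of the three labelled anomalies yet still scores 0.8 accuracy against 0.9 for `detector_a`, which illustrates why accuracy alone is a weak basis for comparison.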

Performance evaluation set up and stages

Results & Analysis

Below you can see the results for Avora and Kats on 19 anonymised datasets, spanning multiple business domains. These results are representative of how the algorithms would perform in a production environment.

Key metrics results for Kats and Avora algorithms

Kats and Avora performance metrics for each dataset used

Taking an average across all datasets, Avora’s anomaly detection algorithm achieves higher scores on every measure (Accuracy, Precision, Recall, F1) than Facebook Kats. It is also worth mentioning that with Kats, Precision and Recall have a greater proportion of extreme scores (0 or 1), depending on the dataset.
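
For readers who want to repeat this kind of summary on their own results, the sketch below shows an unweighted average across datasets and the share of extreme scores. The function names and the three placeholder datasets are hypothetical; the numbers are not the benchmark results.

```python
from statistics import mean


def macro_average(per_dataset: dict, metric: str) -> float:
    """Unweighted mean of one metric across datasets."""
    return mean(scores[metric] for scores in per_dataset.values())


def extreme_share(per_dataset: dict, metric: str) -> float:
    """Fraction of datasets where the metric is exactly 0 or 1."""
    values = [scores[metric] for scores in per_dataset.values()]
    return sum(v in (0.0, 1.0) for v in values) / len(values)


# Placeholder scores for three hypothetical datasets; NOT the benchmark results.
scores_by_dataset = {
    "dataset_a": {"precision": 1.0, "recall": 0.25},
    "dataset_b": {"precision": 0.0, "recall": 0.0},
    "dataset_c": {"precision": 0.5, "recall": 0.5},
}
print(macro_average(scores_by_dataset, "precision"))
print(extreme_share(scores_by_dataset, "precision"))
```

An unweighted average gives every dataset the same influence regardless of its length, so a detector with many 0 or 1 scores can look average overall while being unreliable on individual datasets.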

Whilst encouraging, these measurements highlight areas for improvement, particularly around Precision (average 0.51 for Avora’s algorithm).

Examining the False Positive and False Negative rates, we see that c.80% of anomalies (as labelled in the ground truth) in this ensemble of datasets are not detected by Kats, which may lead users to question the effectiveness of deploying the Kats algorithm in its current form on real-life business data.

Kats and Avora False Positive and False Negative Rates

Time-Series & Anomalies Comparison

We will cover the results of 2 datasets in more detail: Dataset ID 85693 and Dataset ID 2258. Unlike Avora, Facebook Kats does not provide Detection Envelopes in its output; it only marks specific data points as anomalous.

Online Sales — Dataset ID 2258

The graphs below show the outputs from Manual Labeling, Avora Anomaly Detection labeling and Facebook Kats Anomaly Detection labeling for Dataset ID 2258. In this case, the analyst providing the ground truth identified high spikes throughout 2019–20 and a significant dip in summer 2020.

Avora’s algorithm successfully identified both the spikes and the summer dip as anomalies. On top of that, it picked up a smaller dip in late autumn 2019 and other smaller spikes and drops. It could be argued that some of those additional points marked by the algorithm (e.g. Nov 2019) should have been labelled as anomalies in the ground truth.

The Facebook Kats algorithm only picked up the spikes and did not identify any of the longer dips as anomalies. This is a concern, as sudden drops in sales or revenue are of great interest to any business.

Manual Labeling, Avora output and Kats output for dataset ID 2258

API Gateway Errors — Dataset ID 85693

In the ground truth, the analyst labelled a spike in Apr 2021 and 2 short-term dips in Nov/Dec 2020 as anomalies. Avora’s anomaly detection picked up all of these anomalous areas.

Kats, on the other hand, only picked up the spike itself and introduced a number of false positives in the lead-up to the Apr 2021 spike. We interpret this as Kats being less sensitive to changes of smaller magnitude, even though such changes may still be useful for users to know about.

Manual Labeling, Avora output and Kats output for dataset ID 85693

Further Research — Lowering False Positive/Negative Rates

This comparison was performed using the default parameters for each algorithm. We were not able to locate parameters that change the anomaly detection sensitivity of Kats.

Avora’s anomaly detection system allows users to modify the input parameters to improve the results, a natural step for users to take. By searching the parameter space, we observed that results can be significantly improved when input parameters are optimised for individual datasets.
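
A parameter search of this kind can be sketched as a simple grid search scored against the ground truth. The detector below is a plain z-score threshold used purely as a stand-in: it is not Avora’s algorithm and not Kats, and the names `zscore_detector`, `grid_search` and the `threshold` parameter are assumptions made for illustration, as is the toy series.

```python
from statistics import mean, pstdev


def zscore_detector(values, threshold):
    """Stand-in detector: flag points more than `threshold` std devs from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0] * len(values)
    return [int(abs(v - mu) / sigma > threshold) for v in values]


def f1_score(truth, predicted):
    """F1 from 0/1 label sequences, written as 2TP / (2TP + FP + FN)."""
    tp = sum(1 for t, p in zip(truth, predicted) if t and p)
    fp = sum(1 for t, p in zip(truth, predicted) if p and not t)
    fn = sum(1 for t, p in zip(truth, predicted) if t and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def grid_search(values, truth, thresholds):
    """Return (best_threshold, best_f1); ties go to the first threshold tried."""
    best = (None, -1.0)
    for th in thresholds:
        score = f1_score(truth, zscore_detector(values, th))
        if score > best[1]:
            best = (th, score)
    return best


# Toy series with two labelled anomalies (a spike and a dip); illustrative only.
series = [10, 11, 9, 10, 30, 10, 11, 9, 10, 2, 10, 11]
labels = [0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
best_threshold, best_f1 = grid_search(series, labels, [0.1, 1.0, 2.0, 3.0])
print(best_threshold, best_f1)
```

On this toy series a very low threshold floods the output with false positives, while a high one catches the spike but misses the dip. Parameters tuned against the same labels used for scoring can overfit, so a held-out period is worth keeping for a final check.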