It’s often hard to evaluate the performance and quantify the potential impact a tool can have on the business.



  • Beyond accuracy, the False Positive and False Negative rates are sensible, intuitive ways of assessing performance
  • Not all anomaly detectors are equal: performance scores can differ substantially between anomaly detectors, operating on the same real-life time-series data for business metrics
  • In our test data, Avora’s anomaly detector outperforms Facebook Kats, with significantly lower False Positive and False Negative rates but comparable accuracy
  • Even lower False Positive and False Negative rates can be achieved with hyper-parameter tuning, with no reduction in accuracy


Businesses across the world have more and more data they can use to analyse performance and make data-driven decisions. However, many companies find themselves with more data than people can possibly track and analyse. As a result, AI-powered business intelligence tools, and Anomaly Detection in particular, play an increasingly important role in business success.

There is no shortage of business intelligence offerings and solutions, but it is often hard to evaluate their performance and quantify the potential impact a tool can have on the business. Among the reasons that make the evaluation hard are:

1. Lack of comparative performance datasets that relate to noisy, real-life business performance data
2. Performance is described using complex scientific metrics that are not easily translated into the business world.

At Avora we have created an evaluation pipeline that uses real-life time-series business data to benchmark Avora’s performance against the well-known Facebook Kats Anomaly Detector, which is closely linked to the popular Facebook Prophet package.

Intuitively Measuring & Explaining Performance

Beyond accuracy, the most commonly used metrics when evaluating anomaly detection solutions are F1, Precision and Recall. One can think about these metrics in the following way:

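  • Recall answers the question: What proportion of true anomalies was identified? It is calculated as:
    Recall = True Positives / (True Positives + False Negatives)

  • Precision answers the question: What proportion of identified anomalies are true anomalies? It is calculated as:
    Precision = True Positives / (True Positives + False Positives)

  • F1 Score summarises the overall performance of the anomaly detection model by combining Recall and Precision using the harmonic mean:
    F1 Score = 2 * (Precision * Recall) / (Precision + Recall)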

For example:

You are tracking the Sales metric as one of your KPIs. You receive a notification that 10 anomalies have been identified. You check the graph and confirm that only 6 of the 10 dates are indeed anomalies. However, you also notice 9 other dates on which the Sales metric behaved unusually and which you would consider to be anomalies. So now you have 6 correctly identified anomalies (True Positives), 4 incorrectly identified anomalies (False Positives) and 9 missed anomalies (False Negatives).

In this scenario the metric values would be:

  • Recall: 6 / (6 + 9) = 0.4
  • Precision: 6 / (6 + 4) = 0.6
  • F1 Score: 2 * (0.4 * 0.6) / (0.4 + 0.6) = 0.48
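
The arithmetic above can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `precision_recall_f1` is made up for illustration, and returning 0.0 when a denominator is zero is an assumption, not something the article specifies.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) from raw counts.

    tp: correctly identified anomalies (True Positives)
    fp: flagged points that were not anomalies (False Positives)
    fn: anomalies the detector missed (False Negatives)
    """
    # Assumption: an undefined ratio (zero denominator) is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    # Harmonic mean of precision and recall.
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Counts from the Sales example: 6 correct, 4 false alarms, 9 missed.
precision, recall, f1 = precision_recall_f1(tp=6, fp=4, fn=9)
print(f"Recall: {recall:.2f}, Precision: {precision:.2f}, F1 Score: {f1:.2f}")
# prints "Recall: 0.40, Precision: 0.60, F1 Score: 0.48"
```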

False Positive and False Negative Rates

When deciding on an anomaly detection system, it is important to pay attention to two metrics: the False Positive and False Negative rates.

The False Positive rate helps you understand how often, on average, your detector will cry wolf and flag data points that are not true anomalies.

In the example above, the False Positive rate is 4 / 10 = 0.4, or 40%: the system flagged 10 anomalies, of which only 6 were true anomalies. This means that 40% of the anomalies detected were, in fact, not anomalous at all. Note that the rate here is taken over the flagged points, so it equals 1 - Precision.

Pick the system with the lowest possible False Positive rate. If the False Positive rate is too high, the users will turn off the system since it is more distracting than useful.

False Negative rate shows how many anomalies were, on average, missed by the detector.

In the worked example the False Negative rate is 9/15 = 0.6 or 60%. The system identified 6 true anomalies but missed 9. This means that the system missed 60% of all anomalies in the data.

Choose the system with the lowest possible False Negative rate. If the False Negative rate is too high, you will miss a lot of crucial anomalies and in time you will lose trust in the system.
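
As a rough sketch, both rates can be computed from the same counts used in the Sales example. The function names are hypothetical, and the False Positive rate follows this article's definition (false alarms as a share of everything flagged), which is an assumption worth checking against whatever definition your own tooling uses.

```python
def false_positive_rate(tp: int, fp: int) -> float:
    """Share of flagged points that were not true anomalies.

    Follows the article's definition: FP / (TP + FP), i.e. 1 - Precision.
    """
    flagged = tp + fp
    # Assumption: with nothing flagged, report 0.0 rather than raise.
    return fp / flagged if flagged else 0.0


def false_negative_rate(tp: int, fn: int) -> float:
    """Share of true anomalies the detector missed: FN / (TP + FN)."""
    actual = tp + fn
    # Assumption: with no true anomalies, report 0.0 rather than raise.
    return fn / actual if actual else 0.0


# Sales example: 6 True Positives, 4 False Positives, 9 False Negatives.
fpr = false_positive_rate(tp=6, fp=4)
fnr = false_negative_rate(tp=6, fn=9)
print(f"False Positive rate: {fpr:.0%}, False Negative rate: {fnr:.0%}")
# prints "False Positive rate: 40%, False Negative rate: 60%"
```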


For the performance comparison we created a system that aimed to provide an objective, unbiased evaluation.

1. Time-series data from real-life examples was collected & anonymised.
2. Anomalies were manually labelled prior to performance evaluation — this was stored as the ground truth dataset, based on the analyst’s assessment, independent of results from either algorithm.
3. The ground truth dataset was used by a pipeline that performed the evaluation only after both the Avora and Kats anomaly detection algorithms had completed their labelling.
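
The comparison step of such a pipeline might look roughly like the sketch below. It is not Avora’s actual pipeline: the function name, the detector names and the dates are all made up for illustration, and anomalies are assumed to be labelled as sets of dates.

```python
from datetime import date
from typing import Dict, Set


def confusion_counts(truth: Set[date], flagged: Set[date]) -> Dict[str, int]:
    """Compare one detector's flagged dates against the ground truth labels."""
    return {
        "tp": len(truth & flagged),   # flagged and truly anomalous
        "fp": len(flagged - truth),   # flagged but not anomalous
        "fn": len(truth - flagged),   # anomalous but missed
    }


# Illustrative toy data only, not results from the article.
ground_truth = {date(2021, 1, 5), date(2021, 1, 12), date(2021, 2, 1)}
detector_outputs = {
    "detector_a": {date(2021, 1, 5), date(2021, 1, 12), date(2021, 1, 20)},
    "detector_b": {date(2021, 1, 5), date(2021, 3, 3), date(2021, 3, 4)},
}

# Every detector is scored against the same, independently produced labels.
results = {
    name: confusion_counts(ground_truth, flagged)
    for name, flagged in detector_outputs.items()
}
for name, counts in results.items():
    print(name, counts)
```

Scoring every detector against one fixed set of labels, produced before any detector has run, is what keeps the comparison independent of the algorithms being compared.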

Performance evaluation set up and stages

Results & Analysis

Below you can see the results for Avora and Kats on 19 anonymised datasets, spanning multiple business domains. These results are representative of how the algorithms would perform in a production environment.

Key metrics results for Kats and Avora algorithms

Kats and Avora performance metrics for each dataset used

Taking an average across all datasets, Avora’s anomaly det