Building An Anomaly Detection System? 5 Requirements It Must Meet
Recommendations for a successful solution — from signal analysis to sharing insights at scale
If you’ve found your way here, chances are your business has identified anomaly detection as a way to significantly reduce costs or find new opportunities to grow — and you’re in the early stages of planning a solution.
It is definitely possible to build your own anomaly detection system — but to make it successful, you need to meet these five requirements.
Meeting these five requirements won’t just save you time and effort; it is the difference between people actually using your system and looking at it once before consigning it to the bin along with other tech solutions that sounded nice but didn’t work in real life.
So if you are just pursuing a data science project for the fun of it and you enjoy the journey as much as reaching the end goal — by all means, feel free to skip this post and carry on exploring what you can do.
But if you are serious about building a state-of-the-art system that uses time-series data, covering everything from signal analysis to presenting outputs at scale, keep reading.
Requirement 1 — Users find it hard to trust the system’s outputs if they don’t pass the “sniff test”
Intuition matters to users. They won’t trust something if they can’t see how it fits together: for example, why is an anomaly marked, and are the trend and seasonality being calculated correctly? Showing this is particularly helpful for data points that are only marginally considered anomalies, because it builds confidence in the system’s abilities.
A good approach is to break down the signal into its key components: Trend, Seasonality, Residual (see Hyndman et al. for a quick reminder). Detection is performed on the residual, the “noise” that is left once trend and seasonality have been removed. The residual doesn’t have to follow a specific distribution; real life isn’t always that simple!
In addition, very few commercial datasets have anomalies pre-labelled, so it’s highly unlikely that historical anomalies will have already been identified!
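As a rough illustration of the decomposition idea, here is a minimal sketch of a classical additive split (centred moving average for the trend, per-position averages for the seasonality). The function name `decompose` and the `period` parameter are assumptions made for this example, not part of any particular product; a production system would more likely use an established library routine.

```python
from statistics import mean


def decompose(values, period):
    """Additive split of a series into (trend, seasonal, residual) lists.

    Positions too close to either end to have a full window get None
    for trend and residual.
    """
    n = len(values)
    half = period // 2

    # Trend: centred moving average spanning one full seasonal cycle.
    trend = [None] * n
    for i in range(half, n - half):
        window = values[i - half : i + half + 1]
        if period % 2 == 0:
            # Even period: half-weight the two end points (a 2 x period MA).
            trend[i] = (sum(window) - 0.5 * (window[0] + window[-1])) / period
        else:
            trend[i] = mean(window)

    # Seasonality: average detrended value at each position in the cycle,
    # re-centred so the seasonal component averages to zero.
    buckets = [[] for _ in range(period)]
    for i, t in enumerate(trend):
        if t is not None:
            buckets[i % period].append(values[i] - t)
    seasonal_means = [mean(b) if b else 0.0 for b in buckets]
    offset = mean(seasonal_means)
    seasonal = [seasonal_means[i % period] - offset for i in range(n)]

    # Residual: whatever trend and seasonality do not explain.
    residual = [
        None if t is None else values[i] - t - seasonal[i]
        for i, t in enumerate(trend)
    ]
    return trend, seasonal, residual
```

Plotting the three returned lists next to the raw series is the quickest way to run the “sniff test”: if the trend or the seasonal shape looks wrong to someone who knows the metric, the anomalies flagged on the residual will look wrong too.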
Always break out your signal components and verify them — do this for peace of mind, but also to benchmark against your intuition and the intuition of your users. Show them something that they can explain, and make sure that when you pull it apart, it still makes sense.
Set some sensible thresholds for the non-parametric method of detection. (Note: you’ll need to build in extra time and resource here for the R&D.) Give users a choice of threshold: what is an anomaly for one person might not be for another.
Don’t assume that your data is pre-labelled; go in assuming it is unlabelled, and you will prevent issues later. No one has time to go back and mark hundreds or even thousands of time-series charts with the points they consider anomalous.
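One way to picture a non-parametric, user-tunable threshold on unlabelled data is Tukey-style fences built from the interquartile range of the residuals. This is only a sketch under assumptions: the `SENSITIVITY` presets and the function name `flag_anomalies` are hypothetical, and the IQR rule is one of several possible non-parametric choices rather than the method any specific system uses.

```python
from statistics import quantiles

# Hypothetical presets: a smaller multiplier flags more points.
SENSITIVITY = {"high": 1.5, "medium": 3.0, "low": 4.5}


def flag_anomalies(residuals, sensitivity="medium"):
    """Return the indices of residuals outside the IQR fences.

    No distribution is assumed and no labels are needed; None entries
    (for example, edge positions without a residual) are skipped.
    """
    k = SENSITIVITY[sensitivity]
    observed = [r for r in residuals if r is not None]
    q1, _, q3 = quantiles(observed, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [
        i
        for i, r in enumerate(residuals)
        if r is not None and (r < lower or r > upper)
    ]
```

With a setup like this, a marginal point can be flagged for a user on the “high” setting and ignored for a user on “medium”, which mirrors the idea that what is an anomaly for one person might not be for another.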
Requirement 2 — Your users won’t stop at one metric. It needs to be accessible at scale
You start at the beginning — by building the core algorithm. But the use case nearly always becomes more complex very quickly, as people see the value of it working on a specific example. For example, an analytics team could be tracking the revenue for an overall sales metric, picking up some interesting anomalies. Word gets out to other stakeholders and pretty soon, rather than just tracking that one top-line metric, people start asking for tracking at more granular levels (e.g. revenue by country, product category or acquisition channel), quickly producing hundreds or thousands of new time-series.
And there is no point in having an anomaly detection system that they can’t interpret. These users aren’t going to wade through that many time-series charts. Even if they knew how to, there’s no way they’d have time.
Theoretically it’s possible to run detection on all of these series (technical / infrastructure resource aside), but presenting the output back to users in a way that doesn’t overload them with information is absolutely essential.
Think through how your system will only share anomalies relevant to a certain user. Otherwise, any potentially valuable signal will get lost in the noise.
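A simple way to think about relevance is a per-user subscription that names a metric and the dimension values the user cares about. The sketch below is illustrative only: the `Anomaly` and `Subscription` classes, their fields, and the `top_n` cut-off are all assumptions introduced for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Anomaly:
    metric: str        # e.g. "revenue"
    dimensions: dict   # e.g. {"country": "UK", "channel": "paid"}
    score: float       # larger means more unusual


@dataclass
class Subscription:
    user: str
    metric: str
    filters: dict = field(default_factory=dict)  # required dimension values


def relevant_anomalies(anomalies, subscription, top_n=5):
    """Keep anomalies matching the subscription, strongest first."""
    matches = [
        a
        for a in anomalies
        if a.metric == subscription.metric
        and all(a.dimensions.get(k) == v for k, v in subscription.filters.items())
    ]
    return sorted(matches, key=lambda a: a.score, reverse=True)[:top_n]
```

Capping the list with something like `top_n` is a deliberate choice here: even after filtering, a user tracking a popular metric could otherwise receive dozens of items at once.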
Requirement 3 — You’ll need to minimise false positives. Sooner than you’d think
No matter how good your algorithm is, no anomaly detector can provide 100% correct yes / no answers. False positives and false negatives will always exist, and there are trade-offs between the two.
The human operator will have to make a decision, and even within the same team, there may well be dissent as to what constitutes an anomaly.
It’s tempting to ignore this in the initial build, but not addressing it up front carries a risk: the anomaly detection system can overwhelm users with “storms” of alerts. To a business user with no context on data flows, that feels like a system that is not curating its notifications enough.
False positives can come in two forms:
1. Data is unexpectedly or inexplicably missing / incomplete for a period.
2. Data is complete, but the detection incorrectly marks an anomaly.
If data is missing, incomplete or late, wait until it is complete. How long you need to wait will depend on how often that metric is refreshed, and you can work this out either manually or programmatically. The manual route relies on “intuition” and historical knowledge, and is unlikely to scale.
If data is complete, but you’re still getting false positives, check the decomposition results and the residual calculation: do they make sense to a human being? If not, users are unlikely to trust the outputs. (We talked in Requirement 1 about the importance of passing the sniff test.) Make sure there’s a way to tune your algorithm based on what you find, or better yet, allow users to single these data points out for you to investigate!
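The programmatic route can be as simple as comparing the last load time with the end of the period being scored, and allowing a grace window expressed in refresh cycles. This is a sketch under assumptions: the function name, the three status strings, and the default of two refresh cycles are hypothetical choices for illustration.

```python
from datetime import datetime, timedelta


def scoring_status(period_end, last_loaded_at, now, refresh_interval, grace_refreshes=2):
    """Decide what to do with a period before running detection on it.

    "score"     - a load has happened since the period closed
    "wait"      - still inside the grace window, so hold off
    "data_late" - grace window expired; report a data issue instead of
                  a business anomaly
    """
    if last_loaded_at is not None and last_loaded_at >= period_end:
        return "score"
    deadline = period_end + grace_refreshes * refresh_interval
    return "wait" if now < deadline else "data_late"
```

Because the grace window scales with `refresh_interval`, an hourly metric is escalated after a couple of hours while a daily metric gets a couple of days, without anyone having to remember each feed’s habits.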
Warning: In extreme cases, users will simply switch off the anomaly detection system to stop being swamped with false alerts, and all the good work that’s gone into building it will have been wasted.
Requirement 4 — Your system needs to account for events and “known” anomalies
Anomaly detection algorithms on business data often pick up “big” events — examples include Black Friday, Christmas, Easter, sales promotions for a business — that are known ahead of time. These aren’t surprising to users, who anticipate them and expect the anomaly detection to do the same.
Taking these events into account can be done in two ways:
Suppress any notifications, or provide them with a contextual narrative, if the anomaly coincides with a known event;
Modify the expected value for a time range with a known event, using estimates based on historical observations. For example, if we know that Black Friday generally sees an 80% jump in sales, then the next time Black Friday comes around, that expected spike should be taken into account when deciding whether the observed jump is noteworthy.
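Both options above can be sketched with a small event calendar. Everything named here is an assumption for illustration: the `KNOWN_EVENTS` table, the 15% `tolerance`, and the function names are hypothetical, and the 0.80 uplift simply reuses the 80% Black Friday figure from the example.

```python
from datetime import date

# Hypothetical calendar: date -> (event name, expected uplift on the baseline).
KNOWN_EVENTS = {
    date(2021, 11, 26): ("Black Friday", 0.80),
}


def adjusted_expectation(day, baseline):
    """Scale the baseline by the historical uplift when a known event applies."""
    event = KNOWN_EVENTS.get(day)
    if event is None:
        return baseline, None
    name, uplift = event
    return baseline * (1 + uplift), name


def assess(day, actual, baseline, tolerance=0.15):
    """Compare actual with the event-adjusted expectation."""
    expected, event_name = adjusted_expectation(day, baseline)
    deviation = (actual - expected) / expected
    return {
        "is_anomaly": abs(deviation) > tolerance,
        "expected": expected,
        "deviation": deviation,
        # Context narrative to attach to a notification, if any.
        "context": f"Coincides with {event_name}" if event_name else None,
    }
```

Note that adjusting the expectation works in both directions: a jump close to the usual uplift is no longer flagged, while a Black Friday that lands well below the expected spike is, which is arguably the more interesting signal.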
Requirement 5 — Make it easy for users to share insights with other stakeholders, and to find out why
After all this hard work, it’s finally time to share these anomalies outside of the IT / BI teams! This is what you’ve been working towards.
This part needs to be carefully planned and executed. There are some important drivers:
Multiple users will want access to the same tracked metric (and they won’t want to create them individually);
Hundreds / thousands of metrics will need to be tracked as requested by the users;
Not every metric will fire an anomaly — you will need to direct users to only unusual activity that has occurred recently, to avoid drowning them in data;
Users will want to get an easy-to-understand, concise message showing what anomaly has been identified (and why, if possible!).
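The last two points in the list above (only recent, unusual activity, described concisely) could look something like the following sketch. The `Detected` record, its fields, the seven-day `lookback`, and the message wording are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Detected:
    metric: str
    segment: str
    when: datetime
    actual: float
    expected: float


def describe(a):
    """One-line, plain-language summary of a single anomaly."""
    change = (a.actual - a.expected) / a.expected
    direction = "above" if change > 0 else "below"
    return (
        f"{a.metric} ({a.segment}) was {abs(change):.0%} {direction} expected "
        f"on {a.when:%Y-%m-%d}: {a.actual:,.0f} vs {a.expected:,.0f}"
    )


def recent_digest(anomalies, now, lookback=timedelta(days=7)):
    """Summaries for anomalies inside the lookback window, newest first."""
    recent = [a for a in anomalies if now - lookback <= a.when <= now]
    recent.sort(key=lambda a: a.when, reverse=True)
    return [describe(a) for a in recent]
```

Keeping the message to one sentence with the actual and expected values side by side gives the reader enough to judge the anomaly without opening a chart.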
All of this needs a front-end and back-end working in unison, not just a database (which, for most users, is at least a couple of steps too far in terms of accessibility). A traditional dashboard tool like PowerBI, Tableau or Looker probably won’t cut it here, as these aren’t designed for this type of use case.
The Elephant in the Room:
After finding an anomaly (e.g. Revenue for the UK is unusually low), users will quickly want to find out why — this can be tricky to answer. This is where Root Cause Analysis can help — watch out for that in a future post!
To Sum Up
Getting a production-ready anomaly detection system working well for business users is more than just setting an algorithm loose on your data sets. If you’re a hands-on data scientist, I encourage you to give it a try using the above tips.
It can help to benchmark, so if you’re interested in seeing a fully developed anomaly detection product in action, you can contact the folks at Avora to see what we’ve built.