Part I: Anomaly Detection in monitoring: what can we really do?
Anomaly Detection in monitoring: problems we can resolve with it
In recent years the term anomaly detection has appeared frequently in monitoring. In fact, some monitoring tools now include configurable anomaly detection algorithms among their features, and some companies offer anomaly detection on data collected by monitoring tools.
If we check the literature on anomaly detection, we can easily get lost in a storm of statistical models, operations research techniques, formulas and analysis methods.
Our purpose is to start a series of articles about anomaly detection in monitoring, defining what it really is and what we can expect from it when it is applied to monitoring.
Firstly, we have to understand what kind of data anomaly detection is applied to.
Anomaly detection is applied to time-series data, that is, data collected, indexed, listed or graphed in time order.
Usually this time order implies that, for a given variable, the data is taken at successive, equally spaced intervals over a specific period. For example, every second for two days, every minute for a week or every five minutes for a month.
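As a minimal sketch of what such a sampling schedule looks like, the following Python snippet generates equally spaced timestamps. The function name `sampling_schedule`, the start date and the 30-day month are illustrative assumptions, not part of any monitoring tool.

```python
from datetime import datetime, timedelta

def sampling_schedule(start, step, duration):
    """Return the timestamps of equally spaced samples in [start, start + duration)."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    count = duration // step  # number of whole intervals that fit in the period
    return [start + i * step for i in range(count)]

start = datetime(2024, 1, 1)  # arbitrary, hypothetical start date
every_minute_for_a_week = sampling_schedule(
    start, timedelta(minutes=1), timedelta(weeks=1)
)
every_five_minutes_for_a_month = sampling_schedule(
    start, timedelta(minutes=5), timedelta(days=30)  # assuming a 30-day month
)

print(len(every_minute_for_a_week))         # 10080 samples
print(len(every_five_minutes_for_a_month))  # 8640 samples
```

Even a modest sampling interval produces thousands of points per metric, which is part of why reviewing them by eye does not scale.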
This kind of data differs from others, such as the following:
- Cross-sectional data, collected at a given moment for one or more variables, for example sales volume on Black Friday 2017.
- Spatial data, related to geographical location, for example all the data shown on a road map.
Time-series data is relevant for monitoring because metrics fluctuate over time, so monitoring tools collect their data in this kind of order.
Thus, we obtain a graph that shows how memory consumption on a server varies over time or how many visitors a web page has during a business cycle (one day, two months, one year).
The following figure shows, as an example, the graph of a metric A that we have measured every hour.
All methods that analyze time-series data in order to extract information are gathered under an umbrella concept called time-series analysis. These methods generally assume that data is formed by two basic elements:
- A systematic data pattern
- Random noise
Most time-series analysis techniques offer a way to suppress the noise, with the final objective of properly interpreting the systematic data pattern.
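One simple way to suppress noise is a moving average. The article does not name a specific technique, so the sketch below is only an illustrative assumption: the function name `moving_average`, the window size and the sample readings are all hypothetical.

```python
from statistics import mean

def moving_average(values, window):
    """Trailing moving average: each point becomes the mean of the last
    `window` readings (fewer at the start of the series)."""
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        smoothed.append(mean(values[start:i + 1]))
    return smoothed

# Hypothetical noisy readings around a slowly rising pattern.
noisy = [1, 3, 2, 4, 3, 5, 4, 6]
print(moving_average(noisy, 2))  # [1, 2, 2.5, 3, 3.5, 4, 4.5, 5]
```

A larger window removes more noise but also reacts more slowly to real changes in the pattern, so the window size is a trade-off rather than a fixed rule.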
Anomaly detection falls under the concept of time-series analysis. However, its main goal is not quite to suppress noise; instead, anomaly detection works on two fronts:
- Studying how the variable behaves over time and identifying whether this behavior is affected by a trend or a seasonal component. Here the objective is to define the expected or “normal” behavior.
- Identifying and studying unusual behavior patterns and determining whether they should change the normal pattern in some way.
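For the first front, one common way to check for a seasonal component is to measure how strongly the series correlates with a shifted copy of itself (autocorrelation). This is a swapped-in illustration, not a method prescribed by the article; the function name and the sample series are hypothetical.

```python
from statistics import mean

def autocorrelation(values, lag):
    """Sample autocorrelation of `values` at the given lag, between -1 and 1."""
    if not 0 < lag < len(values):
        raise ValueError("lag must be between 1 and len(values) - 1")
    center = mean(values)
    denominator = sum((v - center) ** 2 for v in values)
    if denominator == 0:
        return 0.0  # a constant series has no pattern to correlate
    numerator = sum(
        (values[i] - center) * (values[i + lag] - center)
        for i in range(len(values) - lag)
    )
    return numerator / denominator

# Hypothetical metric that repeats every 4 samples.
series = [4, 4, 0, 0] * 6

# A value close to 1 at lag 4 suggests a cycle of 4 samples, while
# lag 2 lines up peaks with troughs and gives a value close to -1.
print(round(autocorrelation(series, 4), 3))  # 0.833
print(round(autocorrelation(series, 2), 3))  # -0.917
```

Lags with high autocorrelation are candidates for the length of the cycle that defines the "normal" behavior.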
To illustrate how the general logic of anomaly detection works, let’s look back at figure 1. At first glance we may be tempted to consider value 4 an anomaly, but if we extend the measurement cycle we can obtain the following graph:
We can observe that value 4 cannot be considered an anomaly. Now we may be tempted to treat any value between 4 and 5 as valid and to define a threshold and an alarm in our monitoring system accordingly.
But when we extend the cycle of capturing data we can obtain these values:
The result is a lot of false alarms, because values 0, 0.1 and 0.2 are equally correct. We could therefore widen our threshold range and consider any value between 0 and 5 valid, but this decision could be wrong if we get this reading:
In this case we can have a missed alarm, because a value of 1 for metric A in that particular moment could represent an unusual behavior we have to investigate.
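The walkthrough above can be reproduced with a few lines of Python. The readings are hypothetical values chosen to mimic the behavior described for metric A (they are not the data behind the figures), and `static_alarms` is an illustrative helper.

```python
def static_alarms(readings, low, high):
    """Return the indexes of readings that fall outside the range [low, high]."""
    return [i for i, value in enumerate(readings) if not (low <= value <= high)]

# Hypothetical readings: a busy period (4 to 5), a quiet period (0 to 0.2),
# then a busy period in which one reading drops to 1.0.
readings = [4.2, 4.8, 4.5, 0.1, 0.0, 0.2, 4.6, 1.0, 4.4]

# Narrow range: the three normal quiet-period values raise false alarms
# (indexes 3, 4 and 5) alongside the suspicious reading (index 7).
print(static_alarms(readings, 4, 5))  # [3, 4, 5, 7]

# Wide range: the false alarms disappear, but so does the alarm
# for the suspicious 1.0 at index 7.
print(static_alarms(readings, 0, 5))  # []
```

No single static range separates the suspicious reading from the normal ones here, because whether 1.0 is unusual depends on when it occurs.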
Applying some anomaly detection techniques, we can define a systematic data pattern and, based on this, identify unusual behavior more accurately.
At this point, we can define anomaly detection as the group of techniques used to identify unusual behavior that does not conform to the expected data pattern.
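A minimal sketch of this idea, assuming the metric repeats with a known cycle length: learn the expected value for each position in the cycle from past data, then flag readings that stray too far from it. The median baseline, the tolerance and all names and values below are illustrative assumptions rather than a specific technique from the literature.

```python
from statistics import median

def cycle_baseline(history, period):
    """Expected value at each position of the cycle: the median of the
    historical readings observed at that position."""
    if len(history) < period:
        raise ValueError("history must cover at least one full cycle")
    return [median(history[phase::period]) for phase in range(period)]

def find_anomalies(readings, baseline, tolerance):
    """Return (index, value, expected) for readings that deviate from the
    baseline by more than `tolerance`. Readings must start at position 0
    of the cycle."""
    period = len(baseline)
    flagged = []
    for i, value in enumerate(readings):
        expected = baseline[i % period]
        if abs(value - expected) > tolerance:
            flagged.append((i, value, expected))
    return flagged

# Hypothetical history: three cycles of six samples (busy, then quiet).
history = [
    4.5, 4.8, 4.2, 0.1, 0.0, 0.2,
    4.4, 4.9, 4.3, 0.2, 0.1, 0.1,
    4.6, 4.7, 4.1, 0.0, 0.0, 0.2,
]
baseline = cycle_baseline(history, period=6)

# A new cycle in which the second reading drops to 1.0 during the busy period.
new_cycle = [4.5, 1.0, 4.3, 0.1, 0.0, 0.2]
print(find_anomalies(new_cycle, baseline, tolerance=1.0))  # [(1, 1.0, 4.8)]
```

The low values in the quiet period are not flagged, because they match what is expected at that point of the cycle, while the 1.0 in the busy period is.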
Outliers and anomalies
There is another concept usually related to anomaly detection that generates confusion: outlier detection.
In data mining, for example, there is no difference between an anomaly and an outlier. However, many authors distinguish between the two on the grounds that outliers should be considered normal even though they can be very distant from the systematic data pattern.
In this article we assume that in terms of identification, outliers and anomalies are treated in the same way; we identify an unusual behavior and then we decide whether it has to be considered an outlier or an anomaly.
What can anomaly detection do in monitoring?
Generally, anomalous behavior can be translated into some sort of problem or may be the consequence of one. In other disciplines we can read about bank fraud, structural defects, medical problems, intrusions, etc., as problems that can be detected using anomaly detection techniques.
However, some authors such as Preetam Jinka and Baron Schwartz warn in their book Anomaly Detection for Monitoring that “It (anomaly detection) cannot prove that there is an anomaly in the system, only that there is something unusual about metric you are observing.”
Anomaly detection cannot be considered a panacea, just an additional technique that can contribute to our analysis. The following list summarizes what anomaly detection can do:
- Improve the definition and administration of thresholds. One of the most time-consuming activities in monitoring is establishing and modifying thresholds. Usually, we have to define different thresholds for different servers, and we have to redefine them to account for changes over time in workloads, improvements in applications, etc. Introducing anomaly detection techniques can help reduce all this effort.
- Provide an alternative to static thresholds to reduce false and missed alerts.
- Offer clues to undetected problems: if we apply anomaly detection to some metric and find unusual behavior, it could be worthwhile to carry out an analysis and identify the problem or change that generates it.
- Assist in root cause analysis (RCA). Given a problem, we can use anomaly detection to focus the analysis on those related metrics that show anomalous behavior. If you are interested in RCA we recommend this post.
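As a rough sketch of the root cause analysis point, related metrics can be ranked by how unusual their current value is compared with their own history, so the analysis starts with the most anomalous ones. The scoring rule (distance from the median in units of the median absolute deviation), the metric names and the values are all hypothetical choices made for illustration.

```python
from statistics import median

def anomaly_score(history, current):
    """Distance of `current` from the historical median, measured in units
    of the median absolute deviation (MAD) of the history."""
    center = median(history)
    mad = median([abs(value - center) for value in history])
    if mad == 0:
        # A perfectly flat history: any change at all is maximally unusual.
        return 0.0 if current == center else float("inf")
    return abs(current - center) / mad

def rank_metrics(metrics):
    """Return metric names ordered from most to least anomalous.
    `metrics` maps a name to a (history, current_value) pair."""
    return sorted(
        metrics,
        key=lambda name: anomaly_score(*metrics[name]),
        reverse=True,
    )

# Hypothetical metrics observed while investigating a problem.
metrics = {
    "cpu_load": ([0.5, 0.6, 0.55, 0.5, 0.6], 0.58),
    "db_connections": ([20, 22, 21, 19, 20], 60),
    "disk_io": ([100, 110, 105, 95, 100], 130),
}
print(rank_metrics(metrics))  # ['db_connections', 'disk_io', 'cpu_load']
```

The ranking only suggests where to look first; as noted above, an unusual metric is a clue, not proof of the cause.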
This is all for our first look at anomaly detection in monitoring. We invite you to share your experience or expectations on this topic and to join us in our next post, where we will present an introductory guide to anomaly detection in your data analysis.