Root Cause Analysis and Monitoring Tools: A Perfect Match
Root Cause Analysis and Monitoring Tools: How can they work on complex platforms?
Root Cause Analysis (RCA) and monitoring tools have been related for a while, but what is Root Cause Analysis and how is it applied in the world of network and application monitoring?
Root Cause Analysis is a troubleshooting method based on the premise that the most effective way to solve a problem and prevent it from happening again is to determine its root cause and take action to eliminate it.
RCA is essentially a reactive method. This means that, given a problem or an event, RCA’s procedures set out to identify the root causes in order to prevent the same problem from recurring. However, once implemented and executed consistently, RCA becomes a method of problem prediction.
RCA implies an iterative inquiry procedure. That is, with a well-identified problem, we make a first analysis of causes knowing two things: this first approach will not identify the real root cause, and a persistent inquiry process is absolutely necessary.
An iterative interrogative technique that fits RCA well, and that serves as an example of the previous point, is the five whys technique, widely used by annoying children.
Faced with a problem, we ask a first question: Why did this happen? We take the answer, “this happened because of cause 1”, and use it as the basis for the second question: Why did “cause 1” happen? And so on until we have completed five whys. In theory, “cause 5” is the root cause.
Beyond this particular technique, the idea is that, to be effective, RCA must be performed systematically, drilling deeper into the problem until the root cause is reached.
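The five whys chain described above can be sketched in a few lines of Python. This is only an illustration: the cause map, its entries and the function name ask_whys are invented for the example, and in practice each answer comes from investigation, not from a lookup table.

```python
from typing import Dict, List

# Hypothetical cause map: each effect points to the answer to "why did this happen?".
# All entries are invented for illustration.
CAUSES: Dict[str, str] = {
    "web service is down": "disk is full on the web server",
    "disk is full on the web server": "log files grew without limit",
    "log files grew without limit": "log rotation was disabled",
    "log rotation was disabled": "a config change was not reviewed",
}


def ask_whys(problem: str, causes: Dict[str, str], max_whys: int = 5) -> List[str]:
    """Follow the chain of answers, stopping after max_whys or when no answer is known."""
    chain = [problem]
    current = problem
    for _ in range(max_whys):
        if current not in causes:
            break  # no deeper answer recorded: treat this as the root cause candidate
        current = causes[current]
        chain.append(current)
    return chain


chain = ask_whys("web service is down", CAUSES)
for depth, item in enumerate(chain):
    print("  " * depth + item)
```

The last element of the chain is the root cause candidate. The limit of five is a convention of the technique, not a guarantee that the real root cause has been reached.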
Let’s continue mentioning the basis for RCA’s implementation, keeping these three highlights in mind:
- Define and describe the problem or the event properly. It is important to define the problem or describe the event with facts, including the magnitude, the location and the timing. The bottom line here is simple: we have to understand the problem.
- Establish a timeline from the normal situation to the final failure or crisis. The idea is to put every behavior, condition, action and inaction related to the problem in a timeline sequence.
- Distinguish between causal factors and root causes. Looking at the timeline, we have to classify causes into two categories: causal factors, which relate and contribute to the problem, and root causes, which actually interrupt the sequence when eliminated.
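As a rough sketch of the last two points, the timeline and the classification could be represented like this. The Event class, its field names and the sample events are assumptions made for the example. The flag that marks a root cause is the analyst's judgment and is not computed by the code.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Tuple


@dataclass
class Event:
    when: datetime
    description: str
    # Analyst's judgment: would removing this event have interrupted the sequence?
    interrupts_if_removed: bool


def build_timeline(events: List[Event]) -> List[Event]:
    """Order events from the normal situation to the final failure."""
    return sorted(events, key=lambda e: e.when)


def classify(timeline: List[Event]) -> Tuple[List[Event], List[Event]]:
    """Split the timeline into (causal factors, root causes)."""
    root_causes = [e for e in timeline if e.interrupts_if_removed]
    causal_factors = [e for e in timeline if not e.interrupts_if_removed]
    return causal_factors, root_causes


# Invented events for illustration only.
events = [
    Event(datetime(2024, 1, 10, 13, 40), "Disk reaches 100% usage", False),
    Event(datetime(2024, 1, 10, 9, 0), "Log rotation disabled during a config change", True),
    Event(datetime(2024, 1, 10, 11, 30), "Batch job writes unusually large logs", False),
]

timeline = build_timeline(events)
causal_factors, root_causes = classify(timeline)
for e in timeline:
    label = "root cause" if e in root_causes else "causal factor"
    print(f"{e.when:%H:%M}  {label:13}  {e.description}")
```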
The specialized literature contains many successful examples of RCA implementations in different areas, from the safety field, with accident analysis, to medicine, with the analysis of medical negligence.
Now, what can a problem-solving method like RCA do for those who are interested in the world of analysis and monitoring of networks and applications?
Every list of objectives for an analysis and monitoring platform necessarily includes the detection and prevention of failures that can affect performance or bring services down.
It is here, in pursuit of this objective, that a well-orchestrated relationship between root cause analysis and monitoring tools can be useful.
Let’s think of a simple example: We have a well-monitored platform and one web application running over several servers. On one of those servers, disk usage is reaching a dangerous level, higher than our threshold. Our monitoring platform produces an alert. We receive this alert and take the correct actions to avoid a performance problem or to prevent the web service from going down.
In this case, a resource-based monitoring system that checks the behavior of each element in the network and each server would work perfectly and, the platform being simple, an RCA implementation is not really necessary.
However, in complex platforms with multiple applications running on in-house or cloud servers, virtualized servers, and a WAN communication scheme that may include several technologies and providers, the story is not so simple.
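The resource-based check in this simple case amounts to comparing a reading against a threshold. A minimal sketch follows, in which the server names, the readings and the 85% threshold are all assumed values and not taken from any particular monitoring tool:

```python
from typing import Dict, List

DISK_THRESHOLD_PERCENT = 85.0  # assumed threshold, chosen for the example


def disk_alerts(usage_by_server: Dict[str, float],
                threshold: float = DISK_THRESHOLD_PERCENT) -> List[str]:
    """Return one alert message per server whose disk usage exceeds the threshold."""
    return [
        f"ALERT {server}: disk usage {percent:.0f}% exceeds {threshold:.0f}%"
        for server, percent in sorted(usage_by_server.items())
        if percent > threshold
    ]


# Invented readings for three hypothetical web servers.
readings = {"web-01": 62.0, "web-02": 91.0, "web-03": 70.0}
alerts = disk_alerts(readings)
for alert in alerts:
    print(alert)
```

Each element is evaluated on its own here, which is exactly why this approach is enough for a simple platform and falls short when many related elements alert at once.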
Let’s think about this scenario: A company has a well-monitored complex platform. One day, the marketing department launches a promotional campaign. They decide to buy advertising space on a TV show that is well targeted at their market.
The TV show is broadcast from 9:00 to 10:00 at night. The launch is a total success, so the web application registers a 100% increase in access requests.
From 9:30 pm on, we receive several alerts from a group of servers, network switches, routers and applications. We decide to check our web application and verify the response time, and yes, it is too high. We know our customers are frustrated and the company is losing money.
The first answer to this performance problem could be “our platform could not support this increase in web demand”. Obviously, this answer is not enough, so at this point we can use RCA to work through the different causal factors.
The key question here is: How much time do we need for root cause analysis, and how much support does our monitoring platform offer? The two are inversely related: the more support we get from our monitoring platform, the less time it takes to identify the root cause.
So, on what basis are root cause analysis and monitoring tools able to work together?
View by service
In order to perform a root cause analysis efficiently, we need to check the whole group of elements related to the service. If our monitoring platform can offer us the status of all those elements in one integrated view, it can greatly reduce the time the analysis takes.
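One way to picture such an integrated view is to group elements by service and report the worst status among them. The following is a sketch under assumptions: the service name, the element names and the three status levels are invented for the example.

```python
from enum import IntEnum
from typing import Dict, List, Tuple


class Status(IntEnum):
    OK = 0
    WARNING = 1
    CRITICAL = 2


# Hypothetical mapping from a service to the elements it relies on.
SERVICE_ELEMENTS: Dict[str, List[str]] = {
    "web application": ["web-01", "web-02", "db-01", "core-switch"],
}

# Hypothetical latest status of every monitored element.
LATEST_STATUS: Dict[str, Status] = {
    "web-01": Status.OK,
    "web-02": Status.WARNING,
    "db-01": Status.OK,
    "core-switch": Status.OK,
    "mail-01": Status.CRITICAL,  # not part of the web application
}


def service_view(service: str,
                 service_elements: Dict[str, List[str]],
                 latest_status: Dict[str, Status]) -> Tuple[Status, Dict[str, Status]]:
    """Return the worst status among a service's elements, plus the per-element detail."""
    details = {name: latest_status[name] for name in service_elements[service]}
    overall = max(details.values())
    return overall, details


overall, details = service_view("web application", SERVICE_ELEMENTS, LATEST_STATUS)
print(f"web application: {overall.name}")
for name, status in details.items():
    print(f"  {name}: {status.name}")
```

Elements that do not belong to the service, such as the mail server in this sample data, stay out of the view, which keeps the analysis focused on what can actually affect the service.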
Consider dependencies between elements
Let’s look at one classical example to illustrate this point: we have a network switch and five servers connected to it. At one point, if our switch fails, we don’t really need the alerts from each server saying they are not reachable; we just need to see one alert, the alert of the switch.
A good solution could start with a map showing the dependencies between elements, but it would be even better if, in addition to the map, it gave us information about the kind of traffic exchanged between each pair of elements, or measured the response time at each step.
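The switch example can be sketched as a small dependency map in which each element points to the element it depends on. The names and the root_alerts function are hypothetical, and the sketch assumes every element has at most one parent.

```python
from typing import Dict, List, Optional, Set

# Hypothetical dependency map: element -> the element it depends on (None for the top).
DEPENDS_ON: Dict[str, Optional[str]] = {
    "switch-1": None,
    "server-1": "switch-1",
    "server-2": "switch-1",
    "server-3": "switch-1",
    "server-4": "switch-1",
    "server-5": "switch-1",
}


def root_alerts(down: Set[str], depends_on: Dict[str, Optional[str]]) -> List[str]:
    """Keep only the failed elements that have no failed ancestor."""
    roots = []
    for element in sorted(down):
        seen = {element}  # guards against accidental cycles in the map
        parent = depends_on.get(element)
        suppressed = False
        while parent is not None and parent not in seen:
            if parent in down:
                suppressed = True  # an upstream failure already explains this one
                break
            seen.add(parent)
            parent = depends_on.get(parent)
        if not suppressed:
            roots.append(element)
    return roots


down_now = {"switch-1", "server-1", "server-2", "server-3", "server-4", "server-5"}
print(root_alerts(down_now, DEPENDS_ON))
```

With the switch and all five servers reported as unreachable, only the switch alert remains. If a single server fails while the switch is up, that server's alert is kept.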
As we mentioned at the beginning, RCA is essentially a reactive method; therefore, maintaining historical data is a must. The effort and costs of maintaining this kind of data can also be justified by other activities, such as capacity planning.
It would also be very useful to maintain a knowledge database that records the relationship between each problem and its root causes.
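As a loose illustration of the idea, such a knowledge database could map a set of observed symptoms to a previously confirmed root cause and rank past cases by how many of their symptoms are present now. The entries, the symptom names and the scoring rule below are all assumptions made for the example.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple

# Hypothetical knowledge base: symptoms seen in a past incident -> confirmed root cause.
KNOWLEDGE_BASE: Dict[FrozenSet[str], str] = {
    frozenset({"high response time", "high cpu on web servers"}):
        "traffic spike beyond provisioned capacity",
    frozenset({"servers unreachable", "switch down"}):
        "network switch failure",
}


def suggest_root_causes(symptoms: Iterable[str],
                        knowledge_base: Dict[FrozenSet[str], str]) -> List[Tuple[float, str]]:
    """Rank known root causes by the fraction of their recorded symptoms observed now."""
    observed = set(symptoms)
    ranked = []
    for known_symptoms, cause in knowledge_base.items():
        score = len(observed & known_symptoms) / len(known_symptoms)
        if score > 0:
            ranked.append((score, cause))
    return sorted(ranked, reverse=True)


suggestions = suggest_root_causes(
    ["servers unreachable", "switch down", "high response time"], KNOWLEDGE_BASE
)
for score, cause in suggestions:
    print(f"{score:.0%}  {cause}")
```

The ranking is only a starting point for the inquiry: a past root cause that matches the current symptoms still has to be confirmed against the timeline.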
As mentioned before, the relationship between root cause analysis and monitoring tools already exists; we may look more closely at those implementations in another post. However, in general, we can see this area as one of the next frontiers to be conquered in the interesting market of monitoring tools.