Observability and Monitoring, same thing?
Observability: a system attribute and its possible influence on Monitoring
We have been hearing the term Observability for a while now, always associated with Monitoring. Even though its meaning changes depending on the article we are reading or the conference we are attending, it seems that Observability is here to stay.
Now, what is Observability and what does it have to do with Monitoring?
Some consider Observability nothing more than a modern name for Monitoring, but if we review the definition of Observability, we can see that this idea is not well supported.
The definition, which comes from control theory, describes how well the internal state of a system can be inferred from its external outputs. In other words, Observability is an attribute of a system, not an activity we carry out. This is the basic difference between Observability and Monitoring.
The definition refers to a “system”, but since we are discussing Observability in relation to Monitoring, we propose treating any element to be monitored as a system, i.e. a server, network, service or application.
Based on the concept, we understand that any system can be more or less Observable but the concept does not indicate anything about:
- In what way should we measure the outputs of the system?
- In what way, given the evaluation of the outputs, can or must we infer the state of the system?
Some authors explain Observability as an umbrella concept that includes monitoring activities plus alerting and alert management, visualization, trace analysis for distributed systems and log analysis.
This view is also difficult to accept, especially for those who, like myself, understand that it is precisely monitoring that has to include those activities in order to fulfill its main objective: translating IT metrics into business meaning and meeting the challenges imposed by emerging technologies such as container-based application development, cloud infrastructure and DevOps.
However, Observability does not seem like an attribute we can dismiss. On the contrary, it looks important enough to belong in the same group as efficiency, usability, testability, auditability, reliability, etc.
Then, let’s assume that Observability is a desirable attribute in our systems and monitoring is the activity of observing and evaluating the behavior and performance of said systems.
In this case, it is important to review which characteristics the system must have to be observed and which monitoring system will be used to observe them.
Regarding the architecture of the elements to be monitored, the concept of Observability leads us to consider, among other things:
Observability and monitoring as key elements in design
It is desirable that Observability and monitoring are not an afterthought when designing; instead, they must be considered from the beginning, thus avoiding problems during the implementation of monitoring systems.
In fact, in its SRE guide (Site Reliability Engineering), Google explains how they use a reformulated Maslow pyramid to implement distributed systems where Monitoring is included as the base of the pyramid.
Originally, Abraham Maslow proposed a pyramid that organizes human needs in a hierarchy, with the most essential needs at the bottom. Google engineers took that model and adapted it to the key elements for developing and running distributed systems.
Without monitoring, you have no way to tell whether the service is even working; absent a thoughtfully designed monitoring infrastructure, you’re flying blind.
-Google SRE guide, chapter 3
A reduced cost for implementing a monitoring scheme
Highly observable systems will mean a lower cost for the implementation of a monitoring scheme.
Let’s suppose for a moment that we want to integrate a specific system into our platform, but monitoring it would require a custom-developed solution, either because none of the commercial monitoring tools covers this kind of monitoring or because the system’s nature makes it incompatible with those tools. In that case, the best solution may be to dismiss this system and choose another that is more compatible with the idea of Observability.
Observability as a cultural value
In 2013 Twitter published the first of two documents about how its engineering group faces the need to evaluate the performance of its services.
In this first document they report that they have a group of people called the Observability team.
In the second document (2016) they established the mission of the Observability team:
“… provide full-stack libraries and multiple services to our internal engineering teams to monitor service health, alert on issues, support root cause investigation by providing distributed systems call traces, and support diagnosis by creating a searchable index of aggregated application/system logs.”
In this way Twitter lets us know how important the Observability culture is within the company.
Nowadays, authors like Theo Schlossnagle (@postwait) and Baron Schwartz (@xaprb) have pointed out the importance of a solid Observability culture.
Well-known failure possibilities
Designing and developing observable systems necessarily implies a solid knowledge of the ways in which the system’s key elements can fail.
That knowledge can be the basis for later choosing the correct metrics, customizing alerts properly and defining an appropriate process for fault recovery and performance maintenance.
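As a minimal sketch of that idea, the known failure possibilities can be written down as a small catalog that links each one to a metric, an alert threshold and a recovery action. Everything in this example is an assumption made for illustration: the names `FailureMode`, `CATALOG` and `triggered`, the metrics and the thresholds are hypothetical and do not come from any particular monitoring tool.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class FailureMode:
    """One known way the system can fail (hypothetical structure)."""
    name: str         # human-readable description of the failure
    metric: str       # metric chosen to detect it
    threshold: float  # alert when the metric rises above this value
    recovery: str     # first step of the recovery process


# Illustrative catalog; real entries depend on the system being designed.
CATALOG: List[FailureMode] = [
    FailureMode("disk full", "disk_used_percent", 90.0,
                "rotate logs or extend the volume"),
    FailureMode("slow responses", "p95_latency_ms", 500.0,
                "scale out or roll back the last release"),
]


def triggered(catalog: List[FailureMode],
              readings: Dict[str, float]) -> List[FailureMode]:
    """Return the failure modes whose metric exceeds its threshold.

    Metrics missing from `readings` are skipped rather than alerted on.
    """
    return [
        mode for mode in catalog
        if mode.metric in readings and readings[mode.metric] > mode.threshold
    ]


if __name__ == "__main__":
    current = {"disk_used_percent": 95.0, "p95_latency_ms": 120.0}
    for mode in triggered(CATALOG, current):
        print(f"ALERT {mode.name}: {mode.recovery}")
```

Keeping the failure mode, the metric and the recovery step in one record is only one possible design; its purpose here is to show how knowledge of failures drives the choice of metrics and alerts.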
Regarding the architecture of monitoring systems, we have to revisit the classification into blackbox and whitebox monitoring and evaluate which one is more compatible with the idea of Observability:
Blackbox monitoring refers to monitoring a system from the outside, based on its externally visible behavior and treating the system as a black box.
Blackbox monitoring is based on a centralized process of data collection through queries to elements to be monitored.
Monitored elements therefore assume a passive role, reporting on their behavior and performance only when they are queried by a central collector or active element. Blackbox monitoring implies vertical scalability.
A very rudimentary example would be a central system that pings the monitored elements to find out whether they are active.
Traditionally, blackbox monitoring has focused on measuring availability, and its priority is reducing downtime.
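The polling scheme described above can be sketched in a few lines. This is an illustration under stated assumptions, not how any specific product works: a TCP connection attempt stands in for the ping, because ICMP usually requires elevated privileges, and the names `tcp_probe`, `poll_once` and `availability` are hypothetical.

```python
import socket
from typing import Callable, Dict, Iterable, Tuple

Target = Tuple[str, int]  # (host, port) of a monitored element


def tcp_probe(target: Target, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to the target can be opened."""
    try:
        with socket.create_connection(target, timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False


def poll_once(targets: Iterable[Target],
              probe: Callable[[Target], bool] = tcp_probe) -> Dict[Target, bool]:
    """Central collector: query every passive element once."""
    return {target: probe(target) for target in targets}


def availability(samples: Iterable[bool]) -> float:
    """Fraction of polls in which the element answered."""
    results = list(samples)
    return sum(results) / len(results) if results else 0.0


if __name__ == "__main__":
    # Hypothetical targets; replace with real hosts to try it out.
    status = poll_once([("localhost", 22), ("localhost", 80)])
    for target, up in status.items():
        print(target, "up" if up else "down")
```

All the work happens in the collector, which is why the monitored elements remain passive: they only have to answer when queried.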
In a whitebox architecture the monitored elements take an active role, sending data about their behavior and performance to a monitoring system that listens for it.
The emitters report data whenever they are able to, generally as soon as the information is generated. The transmission uses a scheme and communication format appropriate for both the monitored element and the collection system. Whitebox monitoring implies vertical scalability.
This kind of architecture focuses on evaluating the behavior and the quality of the services exposed by the internal elements of the system.
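A push-based flow of this kind might look like the following sketch. It is an assumption-laden illustration: the `Emitter` and `Collector` classes are hypothetical, JSON is chosen arbitrarily as the communication format, and the transport is a plain function call so that the example runs without a network. A real deployment would send the payload over HTTP, UDP or a message queue.

```python
import json
import time
from collections import defaultdict
from typing import Callable, DefaultDict, List, Tuple


class Collector:
    """Listening side: stores whatever the emitters push to it."""

    def __init__(self) -> None:
        # (source, metric) -> list of reported values, in arrival order
        self.series: DefaultDict[Tuple[str, str], List[float]] = defaultdict(list)

    def receive(self, payload: str) -> None:
        event = json.loads(payload)
        self.series[(event["source"], event["metric"])].append(event["value"])


class Emitter:
    """Active side: lives inside the monitored element and reports on its own."""

    def __init__(self, source: str, send: Callable[[str], None]) -> None:
        self.source = source
        self.send = send  # transport, injected so it can be swapped

    def emit(self, metric: str, value: float) -> None:
        # Report as soon as the information is generated.
        self.send(json.dumps({
            "source": self.source,
            "metric": metric,
            "value": value,
            "ts": time.time(),
        }))


if __name__ == "__main__":
    collector = Collector()
    checkout = Emitter("checkout-service", collector.receive)
    checkout.emit("request_latency_ms", 12.5)
    checkout.emit("request_latency_ms", 48.0)
    print(dict(collector.series))
```

Because the emitter runs inside the monitored element, it can report internal values such as request latency that an outside probe would not see.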
Observability guides us more towards whitebox monitoring. The main reason is that the internal elements keep reporting on themselves, which is a big advantage when it comes to inferring the internal state of the system. However, we still think that the best approach is a mixed scheme.
For our readers interested in Pandora’s architecture, we recommend visiting this page for more detail.