When it comes to monitoring our infrastructure, one question comes to mind, and the answer to it will determine how well our data center runs.

Which metrics must I monitor to know the status of my infrastructure?

In this article we’ll go over the main IT metrics you should track to know the status of your infrastructure and, if you run into trouble, how to solve it efficiently.

To begin with, here at Pandora FMS we like to emphasize that many companies focus heavily on measuring user experience and sometimes forget to measure everything behind the services they offer their users. Orienting our monitoring exclusively toward user experience (service functions, customer service response times, etc.) while neglecting the IT metrics that back those services up means that issues are detected late and become harder to solve.

We agree that user experience must be measured and maximized, but without the systems that deliver services to customers, little or nothing can be done. That is why we insist on focusing first on measuring the infrastructure, and only then on monitoring user experience and the business itself.

Now is the time to measure the infrastructure. In future articles we’ll discuss customer and business metrics, which can also be included in Pandora FMS.

If you’re considering defining new performance metrics for your infrastructure, we recommend that you first make sure you have a network monitoring tool that fits your needs (see the end of this article).

 

IT metrics for system control

The point of defining a coherent IT metric monitoring structure is to be able to monitor, manage, optimize and report on all of our services on a regular basis.

IT metrics must be designed to verify that the infrastructure, the networks and the applications are all configured and working correctly. Companies whose infrastructure is based on virtual machines, containers or cloud services will have to apply the same metrics to those systems.

Next, we’ll list the most important metrics to take into account.

Key performance indicators

System performance indicators

  • Capacity and status of our storage disks (HDDs).
  • Network interface status. We’ll have to know whether our network interfaces are active and whether there’s any issue related to them.
  • Memory capacity and usage per server.
  • CPU status and usage per processor.
  • Input/output (I/O) operations on our storage disks.
  • Read/write speed of our disks.
  • Number of open threads per processor.
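
As a rough illustration of how one of these system indicators could be sampled, here is a minimal Python sketch that uses only the standard library. The function names (`usage_percent`, `disk_snapshot`) and the sampled path are assumptions made for this example; they are not part of Pandora FMS.

```python
import os
import shutil


def usage_percent(used: float, total: float) -> float:
    """Return used/total as a percentage (0.0 when total is not positive)."""
    if total <= 0:
        return 0.0
    return used / total * 100.0


def disk_snapshot(path: str = ".") -> dict:
    """Sample capacity and usage of the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return {
        "path": os.path.abspath(path),
        "total_bytes": usage.total,
        "free_bytes": usage.free,
        "used_percent": usage_percent(usage.used, usage.total),
    }


if __name__ == "__main__":
    # Hypothetical usage: print one sample for the current directory's filesystem.
    print(disk_snapshot("."))
```

Memory, CPU and disk I/O are deliberately left out of the sketch, since sampling them portably usually requires platform-specific sources or a third-party library.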

Performance indicators for databases

  • Memory usage of each database.
  • Number of SQL statements executed, separated into reads (selects) and writes (deletes, inserts and updates).
  • Disk I/O generated by each database.
  • Response time of SQL statement executions.
  • Number of threads waiting to access the database.
  • Number of locks detected when writing to the database.
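
The read/write split mentioned in this list could be approximated as in the sketch below, which classifies each statement by its first keyword. This is a simplification under stated assumptions: the function name `count_reads_writes` is invented for the example, and real databases usually expose such counters themselves, so treat it as an illustration of the idea and not as a parser.

```python
READ_KEYWORDS = {"SELECT"}
WRITE_KEYWORDS = {"INSERT", "UPDATE", "DELETE"}


def count_reads_writes(statements):
    """Count statements as reads, writes or other, based on their first keyword."""
    counts = {"reads": 0, "writes": 0, "other": 0}
    for statement in statements:
        words = statement.strip().split(None, 1)
        keyword = words[0].upper() if words else ""
        if keyword in READ_KEYWORDS:
            counts["reads"] += 1
        elif keyword in WRITE_KEYWORDS:
            counts["writes"] += 1
        else:
            # Anything else (DDL, transaction control, empty strings) lands here.
            counts["other"] += 1
    return counts
```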

Application performance indicators

  • Application response times.
  • Availability, as a percentage, of our applications. How much of the time is our application available, and for how long does it stop working? It’ll also be necessary to identify the different components that make up an application and monitor the availability of each.
  • Memory and CPU usage per application.
  • Number of times the garbage collector runs to reclaim resources consumed by applications.
  • Number of threads that each application uses.
  • Number of transactions performed by each application, with the main transactions tracked separately.
  • Number of failed transactions per application.
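
Availability and failed transactions both reduce to simple ratios. The following sketch shows the arithmetic; the function names and the choice to measure time in seconds are assumptions made for this example.

```python
def availability_percent(uptime_seconds: float, downtime_seconds: float) -> float:
    """Share of the observation window during which the application was up."""
    total = uptime_seconds + downtime_seconds
    if total <= 0:
        raise ValueError("observation window must be longer than zero seconds")
    return uptime_seconds / total * 100.0


def failed_transaction_percent(failed: int, total: int) -> float:
    """Percentage of transactions that failed (0.0 when none were attempted)."""
    if total <= 0:
        return 0.0
    return failed / total * 100.0
```

For instance, one hour of downtime in a 30-day window (2,592,000 seconds) gives `availability_percent(2588400, 3600)`, which is roughly 99.86%.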

Network performance indicators

  • Bandwidth consumption on each network. Knowing it will allow us to detect possible improvements and impacts on how our systems work.
  • Connection response time between a point of origin and a destination. Here we must identify the main communications to monitor and keep their response times under control.
  • Packet loss. All network interfaces generate statistics on the number of packets lost in communications. Knowing the level of this loss is vital to knowing the health of our network.
  • Network noise or “jitter”. This tells us whether our networks are suffering a substantial amount of noise or variation in packet delay, which can cause information loss, retries and, in general, communication lag.
  • Amount of information transmitted between our applications.
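
Two of these network indicators can be computed from very little data, as the sketch below suggests. It assumes that jitter is taken as the mean absolute difference between consecutive latency samples, which is one common simplification and not the only definition in use. The function names are invented for this example.

```python
def packet_loss_percent(sent: int, received: int) -> float:
    """Percentage of sent packets that never arrived (0.0 when nothing was sent)."""
    if sent <= 0:
        return 0.0
    return (sent - received) / sent * 100.0


def mean_jitter_ms(latencies_ms):
    """Average absolute difference between consecutive latency samples, in ms."""
    if len(latencies_ms) < 2:
        return 0.0
    deltas = [abs(later - earlier)
              for earlier, later in zip(latencies_ms, latencies_ms[1:])]
    return sum(deltas) / len(deltas)
```

With latency samples of 10, 12, 11 and 15 ms, the consecutive differences are 2, 1 and 4 ms, so the mean jitter under this definition is 7/3 ms.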

 

What to do when our systems deteriorate?

If the IT metrics mentioned above are correctly collected and reported, whether through a dashboard or through reports, we should be able to find issues in our infrastructure in time, before larger problems arise.

The trouble comes when we’ve already detected a problem and don’t know how to tackle it. It’s very important, therefore, to know the main causes of performance deterioration in our systems.

  • Rising response times among our applications across the network. Here it’ll be very important to evaluate the bandwidth consumed by application traffic, to detect whether more than 80% of it is in use or whether noise has entered our network and is increasing response times.
  • Another cause of deterioration in our network or infrastructure can be an inefficient network architecture design. Do you have the network map? If you don’t, use a network monitor to generate it and evaluate the different connections and their bandwidth.
  • Overusing resources on our servers can cause system deterioration. The main resources to keep in mind, none of which should exceed 80% of capacity, are disk space, CPU and memory (RAM).
  • Badly structured or inefficient code. If the previous points have been checked and no further improvement is possible, the next step is to evaluate our applications’ source code. On many occasions inefficient code, or code that leaks memory, is the cause of system deterioration.
  • In parallel, we must always stay alert to possible issues caused by security attacks. Malware installed on our devices or denial-of-service attacks can reduce the performance of our applications.
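
The 80% rule of thumb for bandwidth, disk space, CPU and memory can be expressed as a simple threshold check. The sketch below is only an illustration: the function name, the resource names and the strict "greater than" comparison are assumptions made for this example.

```python
def over_threshold(usage_percent_by_resource, limit=80.0):
    """Return the names of the resources whose usage exceeds `limit`, sorted."""
    return sorted(
        name
        for name, percent in usage_percent_by_resource.items()
        if percent > limit
    )


if __name__ == "__main__":
    # Hypothetical readings, in percent of capacity.
    readings = {"cpu": 91.5, "memory": 62.0, "disk": 80.0, "bandwidth": 85.0}
    print(over_threshold(readings))
```

A reading of exactly 80% is not flagged here, because the text asks that resources not surpass that level.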

Final points to keep in mind when measuring our infrastructures

  • Make it possible to consult our IT metrics over time. In order to identify issues, all the IT metrics identified earlier need to be stored and evaluated over time.
  • Choose which IT metrics we need to monitor. The usual recommendation is not to have more than 30 metrics. If you decide you need more because of your infrastructure’s complexity, do not, under any circumstances, try to monitor more than 100.
  • Identify the main goals of your business (product sales, product hiring, movie services, etc.) and identify which IT metrics are aligned with the services your business offers.
  • Another aspect to keep in mind is that our systems are more decentralized every day and are therefore geographically distributed across different areas. Furthermore, company migrations toward cloud services must be taken very seriously, since our monitoring must cover both in-house systems and cloud services.
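
To make the first point more concrete, here is a minimal sketch of keeping recent samples of one metric so that they can be evaluated over time. The class name `MetricHistory`, its methods and the in-memory storage are assumptions made for this example; a real monitoring tool would persist this data.

```python
from collections import deque


class MetricHistory:
    """Keep the most recent samples of one metric as (timestamp, value) pairs."""

    def __init__(self, max_samples=1000):
        # A bounded deque silently drops the oldest sample once it is full.
        self._samples = deque(maxlen=max_samples)

    def record(self, timestamp, value):
        self._samples.append((timestamp, value))

    def average_since(self, start_timestamp):
        """Average of stored values recorded at or after `start_timestamp`.

        Returns None when no stored sample falls in that range.
        """
        values = [value for ts, value in self._samples if ts >= start_timestamp]
        if not values:
            return None
        return sum(values) / len(values)
```

Comparing the average over a recent window with the average over a longer one is a simple way to notice a metric drifting upward.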

 

What do you think about this? Were you missing any IT metrics?
