Monitoring Virtual Machines

How VM monitoring is different

What you measure, and how you measure it, depends on where you are measuring from. Monitoring an application uses different metrics than monitoring at the middleware, OS, or infrastructure layer. Here I am referring to monitoring the VM and its OS from within the machine itself, using these metrics as indicators of service health. Monitoring the VM from outside at the infrastructure level, such as with vSphere, uses a similar but not identical set of measures.

What to measure

It's important to use the metric that correlates best with what you are observing. Whether someone is overweight is measured less accurately by their overall weight and more accurately by their excess body fat percentage. In most cases the metrics traditionally used to monitor servers are no longer the most effective means of confirming health: in moving from physical to virtual server instances, both what we can measure and the speed at which we can adjust component resources have evolved.

CPU

CPU I/O Wait %

A CPU has four associated metrics. CPU Idle is the time spent not doing any task. CPU System covers system-related tasks and driver activity. CPU User covers all activity in userspace. CPU I/O Wait is the time the CPU spends waiting for a response from downstream hardware before it can complete an I/O. Typical CPU Usage % metrics aggregate System, User and I/O Wait into a single ratio and use that as the monitoring metric. The issue with measuring it this way is that we want our User and (to a lesser extent) System CPU usage to be as high as possible, so that we get the most out of our servers per CPU cycle. CPU I/O Wait, however, represents time that the CPU is bottlenecked waiting on downstream tasks, which affects the performance of our application. Total CPU consumption should ideally be as close as possible to 100% without hitting it and without excessive I/O Wait. Increasing CPU User and CPU System will reduce CPU Idle, but that is not an indicator of an issue that needs to be resolved; it is an indicator that the system is being used!
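To make the distinction concrete, here is a minimal sketch of reporting I/O Wait separately from the aggregate usage figure. It assumes a Linux guest, where the first line of /proc/stat exposes cumulative CPU counters in the order user, nice, system, idle, iowait, irq, softirq, steal. The function names (`parse_cpu_line`, `cpu_breakdown`, `read_cpu_line`) are hypothetical, and folding nice into User and irq/softirq into System is my own simplification to match the four metrics described above.

```python
from typing import Dict

# Order of the counters on the aggregate "cpu" line of /proc/stat (Linux).
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line: str) -> Dict[str, int]:
    """Parse the aggregate 'cpu' line from /proc/stat into named counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    # Older kernels expose fewer columns; treat missing ones as zero.
    values += [0] * (len(FIELDS) - len(values))
    return dict(zip(FIELDS, values))


def cpu_breakdown(before: str, after: str) -> Dict[str, float]:
    """Return the percentage of time per category between two samples."""
    b, a = parse_cpu_line(before), parse_cpu_line(after)
    delta = {k: a[k] - b[k] for k in FIELDS}
    total = sum(delta.values())
    if total <= 0:
        raise ValueError("samples must be taken at different times")

    def pct(*keys: str) -> float:
        return 100.0 * sum(delta[k] for k in keys) / total

    return {
        "user": pct("user", "nice"),                 # assumption: nice counted as User
        "system": pct("system", "irq", "softirq"),   # assumption: IRQ time counted as System
        "idle": pct("idle"),
        "iowait": pct("iowait"),
        "steal": pct("steal"),
    }


def read_cpu_line(path: str = "/proc/stat") -> str:
    """Read the aggregate CPU line (Linux only)."""
    with open(path) as f:
        return f.readline()
```

Taking two samples a few seconds apart with `read_cpu_line()` and passing them to `cpu_breakdown()` gives a breakdown where a high `user` value can be read as a well-utilised server, while a rising `iowait` value is the signal worth alerting on.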

Memory

 

Storage

 

Network