
Monitoring Virtual Machines

How VM monitoring is different

What you measure, and how you measure it, varies based on where you are measuring from. Monitoring an application involves different metrics to monitoring at the middleware, OS, or infrastructure layer. In this instance I am referring to monitoring the VM and its associated OS from within the machine itself, using these metrics as indicators of service health. Monitoring the VM from outside, at the infrastructure level with a tool such as vSphere, gives a similar but non-identical set of measures. Best practice is to include those measures as additional evidence when troubleshooting, but in almost all cases they should be viewed as secondary to application-level telemetry.

What to measure

It's important to use the metric that correlates best with what you are trying to observe. Whether someone is overweight is measured less accurately by their overall weight and more accurately by their body fat percentage. In most cases the metrics traditionally used to monitor servers are no longer the most effective means of confirming health: in moving from physical to virtual server instances we have changed both what we can measure and the speed at which we can adjust a component's resources.

CPU

CPU I/O Wait % (not CPU Usage %)

A CPU has four associated time metrics. CPU Idle is the time spent not doing any task. CPU System covers system-related tasks and driver activity. CPU User covers all activity in userspace. And CPU I/O Wait is the time the CPU spends waiting on a response from downstream hardware before it can complete an I/O. Typical CPU Usage % metrics aggregate System, User and I/O Wait into a single ratio and monitor that. The issue with measuring it this way is that we want User and (to a lesser extent) System CPU time to be as high as possible, to ensure we are getting the most out of our servers per CPU cycle, whereas CPU I/O Wait represents time the CPU is bottlenecked waiting on downstream tasks and is therefore hurting the performance of our application. Total CPU consumption should ideally be as close to 100% as possible without hitting it, and without excessive I/O Wait. High CPU User and CPU System will reduce CPU Idle to some extent, but they are not indicators of an issue that needs to be resolved; they are indicators that a system is being used!
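As a minimal sketch of isolating the I/O Wait component rather than the aggregate, the snippet below samples the summary "cpu" line in /proc/stat twice on a Linux VM and reports I/O Wait as a share of total CPU time over the window. Field positions follow the proc(5) man page; the interval is an arbitrary choice for illustration.

```python
import time

def read_cpu_times():
    # First line of /proc/stat: "cpu user nice system idle iowait irq softirq ..."
    with open("/proc/stat") as f:
        fields = f.readline().split()[1:]  # drop the leading "cpu" label
    return [int(x) for x in fields]

def iowait_percent(interval=5):
    first = read_cpu_times()
    time.sleep(interval)
    second = read_cpu_times()
    deltas = [b - a for a, b in zip(first, second)]
    total = sum(deltas)
    iowait = deltas[4]  # fifth field is iowait
    return 100.0 * iowait / total if total else 0.0

if __name__ == "__main__":
    print(f"CPU I/O Wait over the sample window: {iowait_percent():.2f}%")
```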

Memory

Memory Swap-In Rate (not Memory Consumed %)

Memory consumption is not a strong indicator of the actual pattern of memory usage within a server. Services such as databases will aggressively utilise available memory pages, but as those pages become stale they are not actively released; they are usually held until new data arrives and the page can be updated. This tendency, shared by many other types of service, means the memory consumed value can appear inflated and imply that a service is using more memory than it needs. A more effective measure of the pressure a system is under is its memory swap-in rate. Where a server is under memory pressure, aged pages are flushed to the swap file on disk and then, if needed, moved back into active memory when the application calls for them. Swapping memory out to the swap file is not a concern in most cases, as the pages being placed there are stale or otherwise not recently accessed. The delays come when the swap file has to be read and a page returned to memory, as promoting that page introduces latency to the application. By measuring the swap-in rate of a VM's memory we get a much better view of the constraints and pressures on the memory system, and of whether available memory is a bottleneck for the environment.
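A minimal sketch of that measurement on Linux: /proc/vmstat exposes pswpin, a cumulative count of pages swapped in since boot, so sampling it twice gives a pages-per-second swap-in rate. The interval below is an assumed value for illustration.

```python
import time

def read_pswpin():
    # pswpin: cumulative number of pages swapped in from disk since boot
    with open("/proc/vmstat") as f:
        for line in f:
            name, value = line.split()
            if name == "pswpin":
                return int(value)
    return 0

def swap_in_rate(interval=10):
    before = read_pswpin()
    time.sleep(interval)
    after = read_pswpin()
    return (after - before) / interval  # pages swapped in per second

if __name__ == "__main__":
    print(f"Swap-in rate: {swap_in_rate():.1f} pages/sec")
```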

Storage

I/O Latency or Disk Queue Length (Not IOPS)

Depending on what you are looking to monitor, either I/O Latency or Disk Queue Length (ideally both) are the most suitable measures for storage on a server. I/O Latency measures the time taken for a request to be sent to and answered by the underlying storage device, which could be a local disk or a SAN array. Particularly where the storage device is shared among multiple servers, monitoring its latency provides insight into the load the device is under, as any change in the number of operations being processed will affect the latency of those operations. Similarly, Disk Queue Length can highlight hotspots of activity where requests are spread across multiple drives; a growing queue will in turn increase latency, but it identifies the source of that latency as activity happening on the server itself. Measuring IOPS alone does not provide meaningful insight: low IOPS can be driven by external factors such as the number of users, and the count of operations matters less than the variation in how quickly those operations are answered.
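As a rough sketch of deriving both measures from inside a Linux VM, the snippet below samples /proc/diskstats for one block device and computes average latency per completed I/O and average queue depth over the window. "sda" is an assumed device name; substitute whatever device the VM actually exposes. Field positions follow the kernel's iostats documentation.

```python
import time

def read_diskstats(device="sda"):
    with open("/proc/diskstats") as f:
        for line in f:
            parts = line.split()
            if parts[2] == device:
                return {
                    "ios": int(parts[3]) + int(parts[7]),     # reads + writes completed
                    "io_ms": int(parts[6]) + int(parts[10]),  # ms spent reading + writing
                    "weighted_ms": int(parts[13]),            # weighted ms doing I/O
                }
    raise ValueError(f"device {device!r} not found in /proc/diskstats")

def sample(device="sda", interval=5):
    before = read_diskstats(device)
    time.sleep(interval)
    after = read_diskstats(device)
    ios = after["ios"] - before["ios"]
    latency_ms = (after["io_ms"] - before["io_ms"]) / ios if ios else 0.0
    # Average queue depth: weighted I/O time divided by wall-clock time
    queue_depth = (after["weighted_ms"] - before["weighted_ms"]) / (interval * 1000.0)
    return latency_ms, queue_depth

if __name__ == "__main__":
    lat, depth = sample()
    print(f"avg I/O latency: {lat:.2f} ms, avg queue depth: {depth:.2f}")
```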

Network

Packets Dropped (Not bytes sent/received)

In most cases you can be quite confident that the network will not be the bottleneck in an environment. Where a network does become the bottleneck, the standard response is to drop incoming or outgoing packets once buffers fill in order to keep up with the load. By monitoring the packet drop rate you can determine whether the incoming or outgoing buffers are saturated and locate the bottleneck. Two caveats apply to monitoring the network, though. The first is that it is also best practice to include an external monitoring service to verify network connectivity and confirm that packets can actually be received; a network dropping no packets because it is receiving none is difficult to detect from within the server. The second is that, depending on the system, firewall rules may drop invalid incoming packets for security reasons rather than because of network load. Some systems report these as received packets on the network and log them as dropped further upstream, whereas others fold both into a single metric, which can lead to false positives. On systems that do the latter, changing the firewall rules to reject rather than drop packets may improve the reporting for those endpoints.
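A minimal sketch of tracking drops from inside a Linux VM: the kernel exposes per-interface counters under /sys/class/net, so sampling rx_dropped and tx_dropped over an interval gives a drop rate. "eth0" and the interval are assumed values; as noted above, whether firewall drops appear in these counters depends on the system.

```python
import time
from pathlib import Path

def read_drops(iface="eth0"):
    # Cumulative dropped-packet counters maintained by the kernel per interface
    base = Path("/sys/class/net") / iface / "statistics"
    rx = int((base / "rx_dropped").read_text())
    tx = int((base / "tx_dropped").read_text())
    return rx, tx

def drop_rate(iface="eth0", interval=10):
    rx1, tx1 = read_drops(iface)
    time.sleep(interval)
    rx2, tx2 = read_drops(iface)
    return (rx2 - rx1) / interval, (tx2 - tx1) / interval  # drops per second

if __name__ == "__main__":
    rx_rate, tx_rate = drop_rate()
    print(f"rx drops/sec: {rx_rate:.2f}, tx drops/sec: {tx_rate:.2f}")
```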