10 important performance indicators that every VMware administrator needs to monitor

Virtualization technology is widely adopted thanks to the flexibility, speed, stability and ease of administration it brings. At the same time, any IT technology, hardware or software, functions properly only when it is well maintained, and the VMware platform is no exception. On a physical server, errors or poor performance affect only the applications that run on that server. On a virtualization platform, multiple virtual machines (VMs) run on the same physical server, so a slow server affects the applications running on all of its VMs. Performance monitoring is therefore even more important in a virtualized infrastructure than in a physical one.

Monitoring performance is even more important in a virtualized environment than in a physical infrastructure
Performance of applications running on virtual machines depends on many factors:
The physical resources of the underlying server are shared by its virtual machines (VMs). If some VMs consume too many resources (CPU, memory, disk), other VMs may not get those resources when they need them. This affects application performance on the other VMs.
Administrators can limit the resources available to a VM. If the limits are not set correctly, the performance of applications on that VM can suffer.
Administrators often over-commit the resources of a physical server, because the VMs running on it rarely use up their resources at the same time. Although over-commitment makes more efficient use of the hardware, administrators need to monitor actual usage on the server to identify and fix situations in which the physical server runs short of resources and the performance of the VMs running on it suffers as a result.
Over-allocating resources to a VM is not a good solution either. First, excessive allocation leaves the underlying hardware underused, resulting in a poor return on investment. Second, allocating too many vCPUs to a virtual machine can leave it waiting until enough physical CPU resources are available at once, which hurts performance.

So how do you determine the appropriate amount of resources to allocate to a VM? The answer lies in tracking the virtual machine's resource usage over time, identifying its usage patterns, and then sizing the VM sensibly.

But how do you track resource usage metrics for a VM, and which ones are important? VMware vSphere includes many different resource components. Knowing what these components are and how each of them influences resource management decisions is the key to effective VM management. In this article, we discuss the top 10 metrics that every VMware administrator must constantly monitor.

Top 10 performance indicators for VMware administrators

Memory Ballooning
Memory Swapping
VM CPU Wait and VM CPU Ready
Large and Old VM Snapshots
Idle / Orphaned VMs
VM Disk Read / Write IOPS and Throughput
Datastore Capacity Usage and Availability
VM Network Connectivity
Hardware Health
VM Resource Usage (Inside and Outside View)

# 1 Memory Ballooning
Memory Ballooning is a memory reclamation technique that the hypervisor uses to let the physical server reclaim unused memory from VMs, so that the memory can be re-used by VMs that are running short of it.

Typically, the hypervisor allocates a portion of the physical server's memory to each VM. The guest operating system running inside each VM is not aware of the total memory available on the physical server. Memory Ballooning makes the guest operating system aware of the physical server's memory shortage. Whenever the physical server comes under memory pressure, the balloon driver installed in the guest operating system determines whether unused memory can be reclaimed from the VM. The driver works out how much of the VM's allocated memory is in excess of what it is using, and then signals the hypervisor to reclaim that unused memory from the VM. The hypervisor then gives the reclaimed memory to any VM on the server that is short of memory.

The mechanism of action of Memory Ballooning
Ballooning allows for efficient use of physical memory, but at the cost of slightly reduced VM performance. This is because excessive ballooning by the hypervisor may force guest operating systems to page to disk, and high disk I/O can reduce VM performance. To prevent excessive Memory Ballooning, administrators must constantly monitor the amount of memory that the hypervisor is reclaiming from VMs and ensure that it does not grow too close to the configured balloon target. Monitoring the VM and guest operating system alone is of little help here. Ballooning activity must be monitored at the hypervisor level to detect and control it proactively.
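The check described above can be sketched in a few lines of Python. This is only an illustration: the `BalloonSample` record, the `balloon_alerts` function and the 80% warning ratio are assumptions made for the example, not part of any VMware API, and the numbers would come from whatever tool collects hypervisor-level memory metrics.

```python
from dataclasses import dataclass


@dataclass
class BalloonSample:
    """Hypothetical per-VM reading taken at the hypervisor level."""
    vm_name: str
    ballooned_mb: float       # memory currently reclaimed through ballooning
    balloon_target_mb: float  # balloon target set for this VM


def balloon_alerts(samples, warn_ratio=0.8):
    """Return the names of VMs whose ballooned memory is close to the target.

    warn_ratio is an assumed threshold: 0.8 means "warn at 80% of the target".
    """
    flagged = []
    for s in samples:
        if s.balloon_target_mb <= 0:
            continue  # no balloon target set, nothing to compare against
        if s.ballooned_mb / s.balloon_target_mb >= warn_ratio:
            flagged.append(s.vm_name)
    return flagged
```

For example, a VM with 900 MB ballooned against a 1000 MB target would be flagged, while one with 100 MB ballooned against the same target would not.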

# 2 Memory Swapping
Memory Swapping occurs when the memory state of a VMware vSphere server is ‘hard’ or ‘low’. vSphere switches to one of these states when reclamation techniques such as ballooning, page sharing, and compression cannot keep up with the VMs' memory demand. At this point, vSphere resorts to Memory Swapping.

The steps vSphere takes to swap memory out of a guest
Swapping occurs at both the guest operating system level and the hypervisor level.

With hypervisor-level swapping, memory pages of the VM are swapped out to a swap area on the hypervisor. Each VM has its own swap space. When the guest operating system accesses a memory page that is in the swap space, vSphere handles the access by swapping that page back in. vCPU wait time can increase during these swap operations, which has a negative impact on VM performance. Insufficient swap space can also reduce VM performance.
With guest operating system swapping, every time the CPU accesses a virtual memory page in the guest operating system, that page is swapped into physical memory. In this way, frequently accessed virtual memory pages stay in physical memory, where they can be served quickly, and rarely used pages are swapped out to storage. Swapping therefore carries a high risk of heavy disk I/O and slow processing, caused by frequent reads and writes and high swap rates between physical memory and storage.
Monitoring solutions that focus solely on VM performance will be able to detect slow VMs, but will not be able to diagnose the root cause. An ideal VMware monitoring solution is one that can monitor swapping and swap rates at both the hypervisor level and the guest operating system level, automatically correlate these metrics, and pinpoint what is degrading the VM's performance. It is also important to monitor the memory configuration and spare memory of each VM, as that gives administrators a reasonable sense of how much swap space is available to the VM.
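To illustrate the kind of correlation described above, here is a minimal Python sketch that takes one swap rate observed at the hypervisor level and one observed inside the guest, and reports where swapping is taking place. The function name, the units and the zero threshold are assumptions made for the example.

```python
def locate_swapping(hypervisor_swap_kbps, guest_swap_kbps, threshold_kbps=0.0):
    """Classify where swapping is happening for one VM.

    Both rates are assumed to cover the same time window; any rate above
    threshold_kbps counts as active swapping.
    """
    at_hypervisor = hypervisor_swap_kbps > threshold_kbps
    in_guest = guest_swap_kbps > threshold_kbps
    if at_hypervisor and in_guest:
        return "both"
    if at_hypervisor:
        return "hypervisor"
    if in_guest:
        return "guest"
    return "none"
```

A slow VM that returns "hypervisor" points to memory pressure on the host, while "guest" suggests the VM itself is short of memory.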

# 3 VM CPU Wait and VM CPU Ready
The VM's virtual CPU (vCPU) can be in one of four basic states: run, wait, co-stop and ready.

The four states of a VM's vCPU
From a performance monitoring perspective, it is imperative that administrators know when, and for how long, a VM's vCPU has been in the wait and ready states.

vCPU Wait Time
A VM that is waiting for a task to complete may not need its vCPU immediately. The time that the VM keeps the vCPU waiting for this reason is the vCPU wait time. Usually, a VM waits because it has nothing to do until an event occurs, such as the arrival of a network packet or the expiry of a timer. This is called idle wait. Highs and lows in idle wait can be ignored because they do not indicate a problem. On the other hand, if the VM is waiting for a read or write on storage to complete and cannot do anything else until it does, it is in I/O wait. Unlike idle wait, I/O wait has a performance impact: the longer the I/O wait time, the slower the VM operates. I/O wait is also a sign of unavailable, overloaded or slow storage. It is therefore important for administrators to monitor the vCPU I/O wait time on each VM.

vCPU Ready Time
vCPU Ready Time is the percentage of time that the VM was ready to run but could not be scheduled on a physical CPU. One of the common causes of high vCPU ready time is oversubscription. If the VMs are allocated more vCPUs than there are physical CPUs (pCPUs) available on the server, then under heavy load, when ideally all vCPUs would be running full time, many vCPUs may not run because they are waiting for a pCPU. As a result, the VM and the applications running on it are starved of processing power, and the performance of the VM drops. It is therefore important to monitor the vCPU ready time of each VM. If this metric is above 5% for a VM, it indicates that the VM is slow. You can correlate this metric with the host's CPU usage to find out whether there was contention for physical CPU resources at the same time that the vCPU ready time spiked. If so, you can conclude that the VM is oversubscribing the host's CPU resources. For corroboration, you can also track the number of pCPUs available on the server and the number of vCPUs allocated to each VM. This reveals oversized VMs and prompts you to resize them, so that vCPU ready time is kept to a minimum. The recommended vCPU to pCPU ratio is between 1:1 and 3:1.
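The two checks above (ready time against the 5% mark, and the vCPU to pCPU ratio) can be sketched as follows. The sketch assumes that ready time arrives as a summation value in milliseconds per sampling interval, as vSphere performance charts report it, with 20 seconds being the real-time chart interval; dividing by the number of vCPUs to get a per-vCPU average is a further assumption. All function names are made up for the example.

```python
def cpu_ready_percent(ready_ms, interval_s=20, vcpus=1):
    """Convert a CPU ready summation value (ms) into a percentage.

    interval_s is the sampling interval the summation covers; vcpus
    averages the value across the VM's vCPUs (an assumption).
    """
    return ready_ms * 100 / (interval_s * 1000 * vcpus)


def is_cpu_constrained(ready_ms, interval_s=20, vcpus=1, limit_percent=5.0):
    """True when ready time exceeds the 5% mark mentioned in the text."""
    return cpu_ready_percent(ready_ms, interval_s, vcpus) > limit_percent


def vcpu_pcpu_ratio(vcpus_allocated, pcpus):
    """Total vCPUs allocated on a host divided by its physical CPUs."""
    return vcpus_allocated / pcpus
```

With these assumptions, a single-vCPU VM that accumulates 2000 ms of ready time in a 20-second interval sits at 10% and would be flagged, and a host with 48 vCPUs allocated on 16 pCPUs is at the upper end of the 1:1 to 3:1 range.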

# 4 Large and Old VM Snapshots
A snapshot captures the entire state of the virtual machine at the time the snapshot is taken. It includes the contents of the virtual machine's memory, the virtual machine's settings, and the state of all of its virtual disks.

Example of a VM snapshot
After a snapshot is taken, any changes that would be made to the original virtual disk (VMDK) are written to a growing snapshot file instead. Depending on the level of activity on the VM, this snapshot file may over time grow to the size of the original virtual disk file. When there are multiple snapshot files, their combined disk space usage may even exceed the size of the original virtual disk file. If there is not enough disk space for the VM, large snapshots can cause the snapshot storage location to run out of space, which adversely affects VM performance. Worse, one or more very active virtual machines that share a datastore can generate snapshot files that grow until they consume the entire datastore. This can seriously affect the performance of all other virtual machines that use that datastore. Administrators should therefore keep an eye out for unusually large snapshot files, check whether the changes they hold have been committed to disk, and delete snapshot files that hold no uncommitted changes, because those files are no longer useful. This saves storage space and helps keep virtual machine performance high.

VMware administrators need notifications to remind them of VM snapshots that have been around for a few days.
VMware also recommends not keeping a snapshot for more than 72 hours. Besides taking up storage space unnecessarily, old snapshot files can also cause version control problems for applications and VMs. To ensure that such snapshots do not affect VM performance, it is best to constantly monitor the age of snapshot files, identify old or outdated files, and delete them.
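A snapshot age check along these lines might look like the following Python sketch. The tuple layout `(vm_name, snapshot_name, created_at)` and the function name are assumptions made for the example; only the 72-hour limit comes from the text.

```python
from datetime import datetime, timedelta

MAX_SNAPSHOT_AGE = timedelta(hours=72)  # the 72-hour guideline from the text


def stale_snapshots(snapshots, now, max_age=MAX_SNAPSHOT_AGE):
    """Return (vm_name, snapshot_name) pairs for snapshots older than max_age.

    snapshots is an iterable of (vm_name, snapshot_name, created_at) tuples,
    where created_at is a datetime in the same time zone as now.
    """
    return [
        (vm_name, snapshot_name)
        for vm_name, snapshot_name, created_at in snapshots
        if now - created_at > max_age
    ]
```

Passing `now` explicitly, rather than calling `datetime.now()` inside the function, keeps the check easy to test and to run against historical data.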

# 5 Idle / Orphaned VMs
Idle (zombie) virtual machines are virtual machines that are still running and continue to consume valuable CPU, memory and storage resources although they are no longer in use. For example, suppose a virtual machine is assigned to an employee who later resigns. If that VM is not powered off and is not reassigned to another user, it becomes an idle VM.

Orphaned VMs are those that exist as entries in the vCenter Server database but have been deleted or are no longer registered with the host. Sometimes a VMDK disk or individual files may be orphaned as well. Some common causes of this unwanted scenario are:

A host failover or DRS migration failed.
A VM was removed from the inventory while connected directly to the vSphere host instead of through vCenter.
The vCenter Server or its database was restored from a backup or snapshot.
Both idle and orphaned virtual machines drain physical resources unnecessarily, which affects the performance of active virtual machines. Moreover, the proliferation of such VMs leads to virtualization sprawl, or VM sprawl, a condition in which the number of VMs grows to unmanageable proportions. Monitoring the number and status of the virtual machines on each server helps administrators isolate and reclaim unused resources and manage VM performance effectively.
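Following the definition of an orphaned VM given above, a first-pass detection can be expressed as a simple set difference. This is a sketch under the assumption that you can export two lists of VM names, one from the vCenter database and one from the hosts themselves; the function and parameter names are made up for the example.

```python
def find_orphaned_vms(vcenter_inventory, registered_on_hosts):
    """Return VM names that vCenter knows about but no host has registered.

    Both arguments are iterables of VM names; the result is sorted so that
    reports stay stable from one run to the next.
    """
    return sorted(set(vcenter_inventory) - set(registered_on_hosts))
```

Any name in the result is a candidate for investigation rather than proof that the VM is orphaned, since the two lists may have been collected at slightly different times.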

A report highlighting VM sprawl

# 6 VM Disk Read / Write IOPS and Throughput
The most common, and still accurate, indicators of virtual disk health are disk throughput and disk IOPS. The throughput a virtual disk can deliver and the number of read/write operations it can handle per second determine how quickly it can process commands or I/O requests. If a virtual disk is not sized with sufficient throughput or I/O processing capacity, the VM that uses it, and the applications running on that VM, will slow down significantly. Moreover, if the VM or application pushes more throughput than its virtual disk is configured to support, the pressure on that VM's vCPU and virtual memory increases. This in turn can cause the VM to draw more physical CPU and memory, leaving other VMs to compete for limited physical resources and degrading their performance as well. To avoid this, administrators should closely monitor throughput and IOPS on each virtual disk, time-correlate these values with CPU and memory usage, and proactively determine whether any resizing is needed.
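As a rough illustration of how the two indicators relate, the sketch below derives IOPS, throughput and average I/O size from the change in an operation counter and a byte counter over one sampling interval. The counter names and the function are hypothetical; real values would come from your monitoring tool.

```python
def disk_rates(ops_delta, bytes_delta, interval_s):
    """Return (iops, throughput_mib_per_s, avg_io_kib) for one interval.

    ops_delta and bytes_delta are the increases in the I/O operation count
    and the bytes transferred during interval_s seconds.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    iops = ops_delta / interval_s
    throughput_mib_per_s = bytes_delta / interval_s / (1024 * 1024)
    # Average size of one I/O; zero when the disk was idle.
    avg_io_kib = bytes_delta / ops_delta / 1024 if ops_delta else 0.0
    return iops, throughput_mib_per_s, avg_io_kib
```

Looking at the average I/O size alongside the other two values helps distinguish a disk that is busy with many small operations from one that is moving a few large blocks.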

# 7 Datastore Capacity Usage and Availability
VMware vSphere uses a datastore to store all the files associated with its virtual machines. A datastore is a logical storage unit that can use disk space on one physical device or one disk partition, or span several physical devices.

Relation between a virtual server (host) and datastore
Without a datastore, VMware vSphere cannot provision VMs. If a datastore suddenly becomes unavailable, users are denied access to all the virtual machines and applications that use it. To ensure uninterrupted user access to their virtual machines and applications, administrators should keep tabs on the state of each datastore, promptly detect when it becomes unavailable, and quickly isolate the root cause and fix it.
Excessive use of disk space in a datastore can also degrade VM performance significantly. If more than 75% of a datastore's disk space is used, it signals a potential 'battle for space' among the virtual machines sharing that datastore. In such situations, the administrator should quickly identify which VM is hungry for space and understand why it is consuming so much. Otherwise, the performance of the other virtual machines using the same datastore can be severely affected.
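The 75% rule of thumb above is easy to turn into a check. The following Python sketch assumes a simple mapping from datastore name to `(capacity_gb, free_gb)`; that layout and the function names are illustrative only.

```python
def datastore_usage_percent(capacity_gb, free_gb):
    """Percentage of a datastore's capacity that is in use."""
    return (capacity_gb - free_gb) * 100 / capacity_gb


def crowded_datastores(datastores, limit_percent=75.0):
    """Return the names of datastores whose usage exceeds limit_percent.

    datastores maps a datastore name to a (capacity_gb, free_gb) tuple.
    Entries with zero capacity are skipped to avoid dividing by zero.
    """
    return sorted(
        name
        for name, (capacity_gb, free_gb) in datastores.items()
        if capacity_gb > 0
        and datastore_usage_percent(capacity_gb, free_gb) > limit_percent
    )
```

For instance, a 1000 GB datastore with 200 GB free is 80% used and would be reported, while one with 500 GB free would not.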

Issues with datastore availability and space usage become more pronounced when the datastore is configured on external storage such as a SAN or NAS. In this case, a misconfiguration, an internal bottleneck, or a loss of connectivity to the underlying external storage device can also affect datastore health. Administrators should therefore monitor the individual storage arrays along with the VMs and datastores, intelligently correlate problems across the storage and virtualization layers, and accurately isolate the location of the bottleneck.

# 8 VM Network Connectivity
When users complain that a VM is inaccessible or slow, the reason is not always that the VM is powered off or suffering from internal resource contention. Often, such problems can be attributed to a temporary or prolonged break in the network connection to the VM, or to latency on that connection. Monitoring the internal health of the VM is therefore not enough. It is also important that administrators monitor the connection to each VM from an external perspective. This is even more useful when users reach your virtualized environment from different geographic regions. Monitoring external connectivity in such environments points administrators to the specific regions that are facing ongoing connectivity issues. Monitoring the status and performance of virtual switches and virtual ports also helps troubleshoot connectivity issues effectively.

# 9 Hardware Health

Hardware failure can deal a fatal blow to the health of a vSphere server and its VMs. A disabled processor, a stopped fan, a sudden rise in hardware temperature or voltage, a corrupted memory module, and similar faults can instantly bring down both the physical server and the virtual machines on it. Timely detection of, and rapid recovery from, hardware failures is therefore very important.

Like server hardware, the status of VM hardware should also be monitored, because hardware errors that a VM encounters can adversely affect its performance.

# 10 VM Resource Usage (Inside and Outside View)
Monitoring virtual machine resource usage from the hypervisor, that is, from 'outside' the virtual machine, shows you which virtual machines are resource-hungry on the server. However, to know why a VM is consuming resources excessively, administrators need to measure its performance from 'inside' the VM, i.e., monitor how the VM is using the CPU, memory, network and disk resources allocated to it. This points the administrator to the root cause of resource contention at the VM level.

Key questions answered when monitoring the VM from the outside and from the inside
The most common approach to monitoring VM resource usage is to install a monitor or agent on each VM. This approach is not recommended because it is time-consuming and drives up costs. Ideally, a monitoring solution should provide insight into the internal operations and resource usage of each VM without requiring a monitor or agent on every VM.

These metrics are just the tip of the VMware monitoring iceberg! For the best VM performance, administrators may also want to monitor the status and version of the VMware Tools installed on each VM. GPU and vGPU usage also needs to be monitored, so that both physical and virtual servers are sized with the right amount of GPU resources. Monitoring the uptime of virtual machines and hosts can help catch unexpected reboots. TCP connections to the VM also need to be monitored, so that anomalies can be detected and investigated immediately. And the list goes on! It is important that administrators continuously collect and analyze these metrics, because that analysis can shed light on potential performance issues and allow administrators to resolve them before they affect the business.