HA and DRS in VMware Cloud on AWS

Introduction

VMware Cloud on AWS uses, and turbocharges, vSphere Distributed Resource Scheduler (DRS) and vSphere High Availability (HA) to improve workload availability and performance. Because VMware Cloud on AWS deploys SDDCs on top of global AWS infrastructure, customers benefit from enhancements built on top of the HA and DRS capabilities.

vSphere Distributed Resource Scheduler (DRS) and vSphere High Availability (HA) are key features for vSphere environments, including VMware Cloud on AWS. DRS works to ensure workloads are getting the resources that they are entitled to by placement or balancing. vSphere HA ensures workload availability when a host in a cluster fails, is isolated, or when a VM heartbeat is interrupted.

As VMware Cloud on AWS is a managed service, both DRS and HA are enabled by default on all clusters in all SDDCs. Customers cannot change these settings in VMware Cloud on AWS. I wrote this blog post to shed a bit more light on how these features are configured and to cover some questions around virtual machine (VM) and application monitoring settings in VMware Cloud on AWS.

vSphere DRS

vSphere DRS is enabled and set to fully automated. The migration threshold is set to apply priority 1, 2, and 3 recommendations. The migration threshold specifies how aggressively DRS recommends balancing VMs using vMotion. The recommendations are generated automatically based on the resources demanded by the virtual machines, the resource allocation settings (reservations, limits, and shares), the resources provided by each host, and the cost of migrating VMs.
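To make the threshold idea concrete, here is a minimal sketch of how a threshold of 3 filters recommendations. The `Recommendation` type, its fields, and the sample data are hypothetical and only illustrate the filtering rule; this is not how DRS is implemented internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recommendation:
    """Hypothetical stand-in for a DRS migration recommendation."""
    vm: str
    target_host: str
    priority: int  # 1 = most important, 5 = least important


# Threshold described in this post: priority 1, 2, and 3 are applied.
MIGRATION_THRESHOLD = 3


def recommendations_to_apply(recs, threshold=MIGRATION_THRESHOLD):
    """Keep recommendations at or above the threshold, most important first."""
    kept = [r for r in recs if r.priority <= threshold]
    return sorted(kept, key=lambda r: r.priority)


recs = [
    Recommendation("vm-a", "host-2", 1),
    Recommendation("vm-b", "host-3", 4),  # below the threshold, ignored
    Recommendation("vm-c", "host-1", 3),
]
applied = recommendations_to_apply(recs)
print([r.vm for r in applied])  # ['vm-a', 'vm-c']
```

A lower threshold value would make DRS more conservative, and a higher one more aggressive, because more or fewer priority levels pass the filter.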

Elastic DRS

Elastic DRS (EDRS) is unique to VMware Cloud on AWS. EDRS brings a policy-based approach to true cloud elasticity, for both scale-out and scale-in scenarios. It is built on top of our Autoscaler logic. Autoscaler is the backend capability that automatically adds hosts to, and removes hosts from, VMware Cloud on AWS SDDCs. It is driven by resource consumption and triggered by either pre-defined policies or custom policies configured by customers.

On vSAN-enabled clusters, the storage baseline policy always adds host capacity when vSAN storage utilization exceeds 80%, which is another mechanism to ensure application availability for customers. More information about custom EDRS policies is found here.
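The storage baseline rule boils down to a simple threshold check. The sketch below assumes utilization is measured as used capacity divided by total capacity; the function and parameter names are hypothetical and not part of any VMware API.

```python
# Threshold from this post: scale out when vSAN utilization goes above 80%.
STORAGE_SCALE_OUT_THRESHOLD = 0.80


def needs_storage_scale_out(used_tib, capacity_tib,
                            threshold=STORAGE_SCALE_OUT_THRESHOLD):
    """Return True when utilization is strictly above the threshold."""
    if capacity_tib <= 0:
        raise ValueError("capacity_tib must be positive")
    return used_tib / capacity_tib > threshold


print(needs_storage_scale_out(used_tib=81, capacity_tib=100))  # True
print(needs_storage_scale_out(used_tib=60, capacity_tib=100))  # False
```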

vSphere HA

The same Autoscaler logic is put to good use for vSphere HA in VMware Cloud on AWS. Typically, customers need to immediately remedy a failing host in a cluster by fixing or replacing it. Alternatively, customers keep spare hosts on hand, which is not cost-effective. vSphere HA takes care of re-registering and powering on VMs on surviving hosts in the cluster. But, depending on HA settings, a cluster is at risk of resource constraints or VM downtime if another host fails in the same timeframe.

VMware Cloud on AWS solves this by automatically adding a host if a host failure is detected. The goal is to keep the number of hosts that customers configured for a cluster available. When a host fails and vSphere HA triggers, a new host is immediately added to the cluster.

With vSAN-enabled instance types, the data is resynced and vSphere HA powers on the VMs on the new host. The faulty host is removed. No customer interaction is required; this is done automatically and maximizes VM availability.

The default vSphere HA settings in VMware Cloud on AWS account for host failure, host isolation, datastore protection, and VM/Application monitoring. Admission control, which defines the failover capacity reserved in a cluster, is set to percentage-based, with the actual percentage value depending on the number of hosts in a cluster. This example screenshot was taken on a 3-host cluster.
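To illustrate why the percentage depends on the host count, here is a rough sketch that assumes the cluster reserves the capacity of one host and rounds to the nearest whole percent. Both the single-host tolerance and the rounding are my assumptions for illustration; the values VMware Cloud on AWS actually configures may differ.

```python
def failover_capacity_percent(num_hosts, host_failures_to_tolerate=1):
    """Percentage of cluster capacity to reserve for failover.

    Assumes identical hosts, so each host contributes 1/num_hosts
    of the total cluster resources.
    """
    if num_hosts < 2:
        raise ValueError("need at least two hosts for failover")
    if not 0 < host_failures_to_tolerate < num_hosts:
        raise ValueError("host_failures_to_tolerate must be between 1 and num_hosts - 1")
    return round(100 * host_failures_to_tolerate / num_hosts)


for hosts in (3, 4, 6):
    print(hosts, failover_capacity_percent(hosts))
```

Under these assumptions a 3-host cluster reserves 33% and a 4-host cluster reserves 25%, so the reserved share shrinks as the cluster grows.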

VM and Application Monitoring

The vSphere HA failure scenario of 'Guest not heartbeating' is also configured. VM and application monitoring is enabled. It helps to monitor the guest OS and applications running inside a VM, leveraging VMware Tools. If heartbeats from VMware Tools are interrupted and the VM generates no I/O, it is likely the guest OS has crashed. A vSphere HA event can be triggered, resetting the VM, depending on settings. The default settings in VMware Cloud on AWS are not visible to customers, but they are configured as shown in the following screenshot:
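The decision logic can be sketched as follows, assuming a reset is only considered when both signals are absent: no heartbeat within a failure interval and no I/O within a recent window. The names and the 30 and 120 second defaults are illustrative placeholders, not the actual VMware Cloud on AWS settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VmSignals:
    """Hypothetical snapshot of what the monitor knows about one VM."""
    seconds_since_heartbeat: float
    seconds_since_io: float


def should_reset(signals, failure_interval_s=30.0, io_window_s=120.0):
    """Return True only when heartbeats AND I/O have both gone quiet."""
    heartbeat_lost = signals.seconds_since_heartbeat > failure_interval_s
    io_idle = signals.seconds_since_io > io_window_s
    return heartbeat_lost and io_idle


# Heartbeats stopped, but the VM is still doing I/O: leave it alone.
print(should_reset(VmSignals(seconds_since_heartbeat=45, seconds_since_io=5)))    # False
# Heartbeats stopped and no I/O for a while: likely a crashed guest OS.
print(should_reset(VmSignals(seconds_since_heartbeat=45, seconds_since_io=300)))  # True
```

Checking I/O as a second signal avoids resetting a VM whose tools service has stalled while the guest itself is still working.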

Monitoring

While customers have no access to change settings (see the 'edit' button greyed out), they can monitor the behavior. The vSphere Client provides insights into both DRS and HA specifics like DRS history and overall HA information.

To Conclude

VMware Cloud on AWS ensures workload performance and availability so our customers don't have to worry about that.