VMware Cloud on AWS Stretched Cluster AZ Failure Recovery

Introduction

Stretched Clusters are a keystone of many VMware Cloud on AWS deployments. To understand why, we first need to revisit availability within AWS. Amazon has designed AWS to be highly resilient within each region. Each region is composed of multiple Availability Zones (AZs). Each AZ is a dedicated island of infrastructure and shares no common components with any other AZ. Customers can use this fact to build highly resilient deployments on AWS by spanning multiple AZs.

In the event of an AZ failure, the application remains available because it is deployed redundantly across AZs. This is the inverse of traditional on-premises deployments, where the infrastructure is made redundant in order to hide failures from the applications themselves. This difference complicates the migration of many existing workloads into AWS, because the application must be adapted to a multi-AZ deployment.

Solution

Enter the Stretched SDDC within VMware Cloud on AWS. Powered by vSAN Stretched Clusters, a Stretched SDDC physically spans AZs to protect against AZ failure at the infrastructure layer. This is accomplished by splitting the cluster hosts evenly between two AZs. In addition, a managed witness host is deployed in a third AZ to protect against split-brain isolation failures. With hosts in three AZs, vSphere can survive the loss of an entire AZ.
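
The role of the witness can be illustrated with a simple majority-vote model. This is only a sketch and not vSAN's actual voting implementation: it assumes one vote per AZ and that data stays accessible only to a partition holding a strict majority of votes. The AZ names and the function name are hypothetical.

```python
# Simplified majority-vote model of a stretched cluster with a witness.
# Assumption: one vote per fault domain (two data AZs plus the witness AZ).
FAULT_DOMAINS = {"az-a", "az-b", "witness-az"}  # hypothetical names


def has_quorum(reachable: set) -> bool:
    """Return True if the reachable fault domains hold a strict majority."""
    votes = len(reachable & FAULT_DOMAINS)
    return votes > len(FAULT_DOMAINS) / 2


# Losing one data AZ leaves two of three votes, so the survivors keep quorum.
print(has_quorum({"az-b", "witness-az"}))  # True
# A data AZ cut off from both its peer and the witness cannot claim the data.
print(has_quorum({"az-a"}))  # False
```

Under this model, a network partition between the two data AZs cannot leave both sides believing they own the data, because only one side can also reach the witness.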

How does it work?

vSphere

From the vSphere perspective, the Cluster is standard. The loss of an AZ is a host failure event. vSphere High Availability (HA) restarts any failed workloads. Recovery assumes the data is available, which is handled by vSAN.

vSAN

The vSAN stretched cluster is the heart of a Stretched SDDC. vSAN understands that each group of hosts is contained in a separate fault domain and manages risk by distributing data accordingly. This is accomplished by configuring a VM storage policy. There are two elements to this configuration:

Site Disaster Tolerance: Should the data be resilient to the loss of an entire site/AZ?

Failure to Tolerate: How resilient should the data be within an AZ?

While VMware supports any valid policy configuration, a mandated minimum must be configured to ensure SLA credit eligibility in the event of a workload VM failure. The infrastructure guarantee is not contingent on any customer configuration.

Hosts	Site Disaster Tolerance	Failure to Tolerate
2 (1-1)	Dual Site Mirroring	None
4 (2-2)	Dual Site Mirroring	None
6 (3-3) – 16 (8-8)	Dual Site Mirroring	FTT1 RAID-1 or RAID-5

If the required minimum is met or exceeded, vSphere can automatically attempt recovery. Actual recovery depends on resource availability, which is covered later in this document.
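
The minimums in the table above can be expressed as a small lookup. This is a minimal sketch that simply encodes the table; the function name and dictionary keys are hypothetical, and it assumes only even host counts from 2 to 16, since hosts are split evenly between the two AZs.

```python
def minimum_policy(total_hosts: int) -> dict:
    """Return the minimum storage policy from the table for a host count."""
    if total_hosts % 2 != 0 or not 2 <= total_hosts <= 16:
        raise ValueError("expected an even host count between 2 and 16")
    per_az = total_hosts // 2
    # Per the table, only clusters of 6 or more hosts need local protection.
    local = "None" if total_hosts < 6 else "FTT1 (RAID-1 or RAID-5)"
    return {
        "layout": f"{per_az}-{per_az}",
        "site_disaster_tolerance": "Dual Site Mirroring",
        "failure_to_tolerate": local,
    }


print(minimum_policy(4))
print(minimum_policy(8))
```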

SDDC Management

The management infrastructure can be distributed between AZs. The impact of losing an AZ therefore depends on which management elements were running in the failed AZ.

vCenter

vCenter is given the highest restart priority and is restarted as soon as HA confirms the failure. It can take up to 15 minutes for vCenter to become accessible again. Most of this time is spent waiting for its numerous services to start.

NSX

The NSX unified management is powered on immediately after vCenter. If the active gateway has failed, the routes within the Connected VPC are updated to point to the new gateway location. It can take up to 15 minutes for route propagation to complete and reconnect the SDDC to external networks. If the active gateway is not impacted, network connectivity is not disrupted by the AZ failure.

Additional Integrated Services

Additional integrated services such as VMware Site Recovery, HCX, vRNI, and vROPS are restarted after NSX, alongside customer workloads. They may not recover immediately if the cluster has insufficient resources to power on these VMs.

How does the vSphere cluster respond to AZ Failure?

With the fundamentals outlined, we can now dive into how the cluster responds to an AZ failure. For the most part, an AZ failure manifests as a massive host failure event. vSphere HA detects the failure and takes note of any failed VMs. These VMs are restarted on hosts in the surviving AZ in the order described previously. If the cluster runs out of available resources, some workloads may not be able to recover automatically. To reduce this risk, VMware suggests that a Stretched Cluster not be loaded to more than 75% CPU/memory utilization. To ensure that a particular VM will recover, a reservation can be configured on that VM, but doing so reduces the number of workloads the cluster can support.
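
The effect of restart order and limited capacity can be sketched with a toy model. This is not how vSphere HA actually places VMs: it assumes a single pool of free memory in the surviving AZ, considers memory only, and uses made-up VM names, sizes, and priority numbers purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class VM:
    name: str
    priority: int   # lower number restarts first (hypothetical scale)
    memory_gb: int  # memory the VM needs in order to power on


def plan_restarts(failed_vms, free_memory_gb):
    """Greedily restart VMs by priority until free memory runs out."""
    restarted, pending = [], []
    for vm in sorted(failed_vms, key=lambda v: v.priority):
        if vm.memory_gb <= free_memory_gb:
            free_memory_gb -= vm.memory_gb
            restarted.append(vm.name)
        else:
            pending.append(vm.name)  # waits for capacity, e.g. backfilled hosts
    return restarted, pending


failed = [
    VM("app-db", priority=3, memory_gb=64),
    VM("vcenter", priority=1, memory_gb=32),
    VM("nsx-manager", priority=2, memory_gb=48),
    VM("app-web", priority=3, memory_gb=16),
]
restarted, pending = plan_restarts(failed, free_memory_gb=100)
print(restarted)  # ['vcenter', 'nsx-manager', 'app-web']
print(pending)    # ['app-db']
```

In this toy run the management VMs restart first, and the largest workload VM is left waiting because the surviving AZ has no room for it, which is the situation the utilization guideline is meant to avoid.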

If the NSX active gateway was impacted by the failure, the SDDC will be temporarily isolated from external networks. VMs within the SDDC can still communicate with each other during this time. If the active gateway is not impacted, there is no network disruption.

AZ failure recovery has no dependency on vCenter and is executed in parallel as soon as HA confirms the failure. Confirmation can take up to 10 minutes.

Elastic DRS (eDRS) monitors for host failures and, in the event of an AZ failure, attempts to backfill the lost hosts to mitigate the loss of compute and memory capacity. For example, if an 8-host (4-4) cluster were to lose an AZ, eDRS would attempt to add 4 hosts to the surviving AZ (0-8). Once the failed AZ has recovered, eDRS would remove the burst capacity (4-8 -> 4-4). This feature depends on available capacity within the surviving AZ and is therefore not guaranteed.
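
The host counts in the example above can be traced step by step. This is a small illustrative sketch, not eDRS logic: the function name and step labels are hypothetical, and it assumes the surviving AZ has enough capacity for a full backfill, which the service does not guarantee.

```python
def trace_backfill(hosts_per_az: int):
    """Trace (failed AZ, surviving AZ) host counts through an AZ failure."""
    failed, surviving = hosts_per_az, hosts_per_az
    steps = [("healthy", (failed, surviving))]

    failed = 0  # every host in the failed AZ is lost
    steps.append(("AZ lost", (failed, surviving)))

    burst = hosts_per_az  # assumes full backfill capacity is available
    surviving += burst
    steps.append(("eDRS backfill", (failed, surviving)))

    failed = hosts_per_az  # the failed AZ returns with its original hosts
    steps.append(("AZ recovered", (failed, surviving)))

    surviving -= burst  # burst capacity is removed
    steps.append(("burst removed", (failed, surviving)))
    return steps


for label, (az_failed, az_surviving) in trace_backfill(4):
    print(f"{label}: {az_failed}-{az_surviving}")
```

For the 8-host case this prints the sequence 4-4, 0-4, 0-8, 4-8, 4-4.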

Can I test AZ Failure?

With multiple moving parts and conditions, many customers require the ability to test AZ failure. To meet this need, VMware has developed an AZ failure simulation test that can be requested. While the simulation is fully automated to ensure test consistency, it is monitored by SRE operations, who can support a limited number of tests per day. You may request an AZ failure simulation at any time, designating which AZ to fail and how long the AZ should be down (default 1 hour). However, VMware may not be able to accommodate a particular date, depending on tests already scheduled by other customers.

AZ Failure Simulation

A notification is sent once the AZ failure simulation is started. Once testing begins, VMware simulates a hard AZ failure by terminating every host in the targeted AZ. If the targeted AZ contained the management infrastructure, there may be a loss of vCenter accessibility and/or network connectivity to workload VMs. The automation harness monitors vSphere and records recovery times for any impacted workload. Elastic DRS will attempt to backfill the lost hosts with non-billable burst capacity in the surviving AZ. The AZ remains unavailable for the configured interval (default 1 hour). During this time you may verify application recovery and service availability. After the configured failure interval has elapsed, the service restarts the hosts in the failed AZ.
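
One way to verify service availability from the client side during the simulation is to poll an application endpoint and time the outage. The following is a minimal sketch using only the Python standard library; the function names, polling interval, and the endpoint shown in the usage comment are hypothetical and are not part of the VMware test harness.

```python
import socket
import time
from typing import Callable, Optional


def tcp_probe(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def measure_outage(probe: Callable[[], bool], interval: float = 5.0,
                   max_checks: int = 720,
                   clock: Callable[[], float] = time.monotonic,
                   sleep: Callable[[float], None] = time.sleep) -> Optional[float]:
    """Poll until the service fails and then recovers; return outage seconds.

    Returns None if no complete outage is observed within max_checks polls.
    """
    down_since = None
    for _ in range(max_checks):
        up = probe()
        now = clock()
        if not up and down_since is None:
            down_since = now          # first failed check marks the start
        elif up and down_since is not None:
            return now - down_since   # first successful check after a failure
        sleep(interval)
    return None


# Example usage (hypothetical endpoint), started before the simulation begins:
# outage = measure_outage(lambda: tcp_probe("app.example.internal", 443))
# print(f"Observed outage: {outage} seconds")
```

The measured value is accurate only to within the polling interval, and it reflects reachability from wherever the script runs, so a probe from outside the SDDC also captures any gateway failover time.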

Once the hosts are available, the service simulates the return of the AZ. Any data that was written while the AZ was down is synchronized. Elastic DRS removes any burst hosts, while DRS rebalances the cluster based on any compute policies. Note that, by default, DRS will not move a VM between AZs without a compute policy directing it to do so. If you have not configured any compute policies, you will need to rebalance VM workloads manually. The management infrastructure remains in its current AZ and is not moved back on cluster recovery.

Once testing has completed and the cluster is back to healthy operation, a report is sent that includes the time to recovery.

Additional Resources

For more information about VMware Cloud on AWS stretched clusters, explore the following resources:

Feature Brief: Stretched Cluster

VMware Cloud on AWS: Stretched Clusters