Using VMware Cloud Disaster Recovery and HCX Stretched Networks

Executive Overview

Network design is among the most important considerations when architecting a comprehensive disaster recovery (DR) solution. This document walks through some of the most common DR testing and outage scenarios and describes how VMware technologies, such as VMware Cloud Disaster Recovery™ and VMware HCX®, can provide a comprehensive framework for architecting and delivering robust disaster recovery to the cloud.

Many organizations find it necessary to keep networking changes to a minimum when building a DR solution. This frequently means maintaining IP addresses for systems that are being recovered during failover or testing exercises. This white paper will review the scenarios that could warrant a disaster recovery failover or test, and explore options and considerations for maintaining current IP addresses between production and failover environments. While this white paper is primarily focused on leveraging VMware HCX for network extensions, the conceptual consideration of each situation is the same with or without VMware HCX.

VMware Cloud Disaster Recovery and VMware HCX make the complex strategy around disaster recovery far simpler, providing a modern DR solution using public cloud resources in an operationally consistent fashion. Our goal is to make disaster recovery objectives achievable, ensure recovery time and recovery point objectives (RTO/RPO) are met each time in a measurable fashion, and reduce costs and complexities of DR.

This document is intended to highlight different disaster recovery and test scenarios using VMware Cloud Disaster Recovery and VMware HCX, providing high-level guidance for the implementation considerations for each scenario. This document is not designed to provide a step-by-step usage guide for the solutions mentioned (VMware HCX, VMware Cloud Disaster Recovery, and VMware Cloud™ on AWS). Detailed documentation and usage guides for these solutions are available at:

docs.vmware.com/en/VMware-Cloud-on-AWS
docs.vmware.com/en/VMware-Cloud-Disaster-Recovery
docs.vmware.com/en/VMware-HCX

For the purposes of this white paper, the VMware Cloud on AWS M14 release with the VMware HCX add-on, as well as VMware Cloud Disaster Recovery version 7.21.4, have been used.

Solution Overview

VMware Cloud Disaster Recovery

VMware Cloud Disaster Recovery protects your virtual machines (VMs), on premises or on VMware Cloud on AWS, by replicating them to the cloud and recovering them to a VMware Cloud on AWS software-defined data center (SDDC).

VMware Cloud Disaster Recovery provides the following main features and benefits:

Comprehensive SaaS simplicity across DR automation, on-demand failover recovery and site preparedness, the Scale-out Cloud File System (SCFS) for cloud backup, and failback
Live mount of snapshots from the cloud gives hosts in VMware Cloud on AWS the ability to boot VMs directly from copies stored securely in cloud backup (SCFS); the cloud backup site acts as an NFS datastore for the recovery SDDC
Cost optimized: Pay-per-use DR site in public cloud, restarting VMs from low-cost cloud backup storage
Disaster and cybercrime recovery based on backups
Continuous DR compliance checks
Automated compliance reporting
End-to-end security

VMware Cloud Disaster Recovery manages all aspects of AWS and VMware Cloud, based on your parameters, with complete runbook automation. For maximum simplicity, the recovery site is VMware Cloud on AWS. As a result, there is a consistent operating environment that seamlessly spans primary and recovery sites. VMware vCenter® remains the operating console, VMs keep their VMware characteristics without conversion, hosts are still hosts, and Distributed Resource Scheduler (DRS) and High Availability (HA) continue as usual.

VMware HCX

VMware HCX, an application mobility platform, simplifies application migration, workload rebalancing, and business continuity across data centers and clouds. VMware HCX enables high-performance, large-scale app mobility across VMware vSphere® and non-vSphere cloud and on-premises environments to accelerate data center modernization and cloud transformation. VMware HCX automates the creation of an optimized network interconnect and extension, and facilitates interoperability across KVM, Hyper-V, and older vSphere versions under extended support (within Technical Guidance) up through current vSphere versions. This delivers live and bulk migration capabilities without redesigning the application or re-architecting networks.

VMware HCX abstracts on-premises and cloud resources based on vSphere and presents them to applications as one continuous resource. An encrypted, high-throughput, WAN-optimized, load-balanced, traffic-engineered hybrid interconnect automates the creation of a network extension. This allows support for hybrid services, such as application migration, workload rebalancing, and optimized disaster recovery. With a VMware HCX hybrid interconnect in place, applications can reside anywhere, independent of the hardware and software underneath.

Mobility Optimized Networking

When extending networks to a remote VMware NSX-T™ Data Center, you can enable the VMware HCX Mobility Optimized Networking service to route the network traffic based on the locality of the source and destination virtual machines.

This service ensures traffic from the local and remote data centers uses an optimal path to reach its destination, while all flows remain symmetric.

In the absence of Mobility Optimized Networking, all traffic from workloads on an extended network at the destination site is routed through the source environment router.

Network Extension with VMware HCX Mobility Optimized Networking provides the following functionality:

Enable or disable Mobility Optimized Networking at the time of stretching a network
Enable or disable Mobility Optimized Networking for already extended networks
Enable or disable Mobility Optimized Networking on an individual VM basis for VMs residing on extended networks in the SDDC
Display which VMs are using Mobility Optimized Networking

When using VMware HCX to migrate a VM, preserve existing network connections by enabling Mobility Optimized Networking on that VM after migration.

Note: HCX Mobility Optimized Networking is not currently supported for disaster recovery use cases in general, nor with VMware Cloud Disaster Recovery specifically. Networks configured for use with HCX Mobility Optimized Networking are supported as virtual machine networks in VMware Cloud on AWS, but not as recovery networks when used with VMware Cloud Disaster Recovery.

Solution Testing Architecture

Figure 1: Testing environment overview.

Overview

For the purposes of this white paper, a VMware Cloud on AWS SDDC is deployed in the Europe (Frankfurt) region and is configured as the source data center where primary workloads are running. The Frankfurt SDDC is leveraging VMware Cloud Disaster Recovery for protection, and a second SDDC located in the U.S. West (Oregon) region is used as the recovery site. Both SDDCs are up and running with the Oregon SDDC running as a minimal, pilot light cluster with only two running VMware ESXi™ hosts. This SDDC will scale up with more on-demand hosts as needed to provide adequate capacity for disaster recovery. Distinct, routed, network segments for management and compute (A and B) have been deployed at each site, and a route-based VPN has been established between the two sites, providing routing and connectivity across the SDDCs.

VMware HCX (included as an add-on for VMware Cloud on AWS) is deployed at both sites, and a service mesh between the two SDDCs provides extended layer-2 (L2) segments (Extended Segment 1 and 2) across the two sites. These VMware HCX L2 extensions allow virtual machines to sit on the same layer-2 broadcast domain in either site. This approach also eliminates the need to manually create segments and enter IP addressing information during an outage event, which minimizes the possibility of configuration errors. During steady-state operations, the same IP segment is therefore available in both the source and destination sites, with the default gateway of the subnet remaining in Frankfurt:

VM1a and VM1b are deployed on extended segment 1 with IP addresses of 192.168.10.10 and 192.168.10.20 respectively, with the default gateway of 192.168.10.1 being in Frankfurt.
VM2a and VM2b are deployed on extended segment 2 with IP addresses of 192.168.20.10 and 192.168.20.20 respectively, with the default gateway of 192.168.20.1 being in Frankfurt.
VMAa is deployed in the local routed segment 192.168.30.0/24 in Frankfurt with IP address of 192.168.30.10 with a local .1 default gateway.
VMBa is deployed in the local routed segment 192.168.40.0/24 in Oregon with IP address of 192.168.40.10 with a local .1 default gateway.
These segments can communicate to other segments either across the route-based layer-3 VPN tunnel or using the local gateways at Frankfurt, depending on destination.
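The addressing plan above can be captured as data and sanity-checked before a DR exercise. The following is a minimal sketch using Python's standard ipaddress module; the segment labels are hypothetical names chosen for illustration, and the /24 prefix on the two extended segments is an assumption (the text only gives their gateway and VM addresses).

```python
import ipaddress

# Segment prefixes from the test environment. The /24 mask on the two
# extended segments is an assumption; the labels are illustrative only.
SEGMENTS = {
    "extended-1": "192.168.10.0/24",
    "extended-2": "192.168.20.0/24",
    "frankfurt-routed": "192.168.30.0/24",
    "oregon-routed": "192.168.40.0/24",
}

# VMs on the extended segments and on the Frankfurt routed segment.
VMS = {
    "VM1a": ("extended-1", "192.168.10.10"),
    "VM1b": ("extended-1", "192.168.10.20"),
    "VM2a": ("extended-2", "192.168.20.10"),
    "VM2b": ("extended-2", "192.168.20.20"),
    "VMAa": ("frankfurt-routed", "192.168.30.10"),
}


def in_segment(address, segment):
    """Return True if the address falls inside the named segment."""
    network = ipaddress.ip_network(SEGMENTS[segment])
    return ipaddress.ip_address(address) in network


def gateway(segment):
    """Return the .1 address of the segment, used here as its default gateway."""
    return ipaddress.ip_network(SEGMENTS[segment]).network_address + 1


def misplaced(vms):
    """List the VMs whose address does not belong to their assigned segment."""
    return [name for name, (seg, addr) in vms.items() if not in_segment(addr, seg)]


if __name__ == "__main__":
    print("Misplaced VMs:", misplaced(VMS))
    for name in SEGMENTS:
        print(name, "gateway:", gateway(name))
```

A check of this kind would flag, for example, an address from the Frankfurt routed range that was mistakenly assigned to a VM on the Oregon routed segment.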

A VMware Cloud Disaster Recovery connector appliance is deployed in the Frankfurt SDDC, replicating VM1a, VM2a and VMAa into the VMware Cloud Disaster Recovery instance in Oregon. Each VM is configured with a dedicated protection group and DR plan in the VMware Cloud Disaster Recovery orchestrator that will recover these virtual machines into Oregon with custom configurations.

Both SDDCs also are connected to an AWS Virtual Private Cloud (VPC) with the standard Elastic Network Interface (ENI) that is deployed as part of each VMware Cloud on AWS SDDC in the respective regions.

Figure 2: Virtual machine network flow with VMware HCX Mobility Optimized Networking disabled.

As shown in Figure 2:

VM1b (192.168.10.20) sends an ICMP Echo request (PING test) to VM2b (192.168.20.20).
Packet traverses the L2 extension via VMware HCX for Stretched Segment 1 back to the on-premises default gateway (192.168.10.1).
Packet is routed via the on-premises default gateway to Stretched Segment 2 (192.168.20.0/24).
Packet traverses the L2 extension via VMware HCX for Stretched Segment 2 to the cloud SDDC.
Packet arrives at VM2b (192.168.20.20).
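The flow listed above can be expressed as a small model, which may help when reasoning about where traffic will hairpin. This is an illustrative sketch only: the VM class, the site labels, and the path function are hypothetical and simply encode the rule that, with Mobility Optimized Networking disabled, the default gateway for every extended segment stays at the source site.

```python
from dataclasses import dataclass

# With Mobility Optimized Networking disabled, routing between extended
# segments always happens at the source-site gateway.
GATEWAY_SITE = "source"


@dataclass(frozen=True)
class VM:
    name: str
    ip: str
    segment: str  # name of the extended segment the VM is attached to
    site: str     # "source" or "destination"


def path(src, dst):
    """Return the ordered hops for a packet from src to dst (MON disabled)."""
    hops = [src.name]
    if src.segment == dst.segment:
        # Same broadcast domain: no router involved, but the packet still
        # crosses the L2 extension when the VMs sit in different sites.
        if src.site != dst.site:
            hops.append(f"L2 extension ({src.segment})")
    else:
        if src.site != GATEWAY_SITE:
            hops.append(f"L2 extension ({src.segment})")
        hops.append("source-site default gateway")
        if dst.site != GATEWAY_SITE:
            hops.append(f"L2 extension ({dst.segment})")
    hops.append(dst.name)
    return hops


vm1b = VM("VM1b", "192.168.10.20", "segment-1", "destination")
vm2b = VM("VM2b", "192.168.20.20", "segment-2", "destination")

if __name__ == "__main__":
    print(" -> ".join(path(vm1b, vm2b)))
```

For VM1b and VM2b, both placed in the cloud SDDC as in Figure 2, the model yields two crossings of the L2 extension even though the two VMs sit in the same site.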

Note: Networks configured for use with HCX Mobility Optimized Networking are supported as virtual machine networks in VMware Cloud on AWS, but not as recovery networks when used with VMware Cloud Disaster Recovery.

Outage Scenarios

There are several outage scenarios that might result in sufficient interruption to business to require an organization to fail critical services or applications over to an alternative environment. Whether natural or human-caused, it is important to identify what types of outage scenarios your organization might be at risk for, as well as those your organization desires to address. In this paper, we are examining situations that might cause a connectivity outage of some kind for either an entire site or one impacting specific workloads. While we will be specifically addressing infrastructure outages related to networking, some of the outage scenarios may be applicable to an outage of any kind, such as an outage of storage, hosts, individual server components, and the like.

Outage scenario 1: Outage of source site

Figure 3: Outage of source site.

Scenario overview

One of the most common outage scenarios our customers wish to address is the complete outage of an entire site. This could be due to the loss of power to a building or region, a weather-related event that damages or destroys property or infrastructure, or something more nefarious, such as an infection of malware or ransomware. Whatever the cause, this scenario assumes the complete loss of services from the primary service site.

Before you begin

The following steps were used to simulate the actual outage and recovery of services as previously described. In an actual outage of your production site, your steps may vary.

Firewall rules at the source and destination need to be consistent to ensure all failover workloads can operate in the DR environment the same way they do in production. Be sure to configure appropriate firewall rules, network segments, NAT rules, and similar settings in advance as far as possible to minimize the effort required during an actual outage. In the following steps, we update the firewall rules at the destination SDDC. This may or may not be needed in customer environments, depending on whether the rules already exist.
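One way to check rule consistency ahead of time is to export the rules from both sites and compare them. The sketch below assumes the rules have already been exported into a simple four-field form; the Rule type and the sample rules are hypothetical and do not reflect any product's export format.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    """Simplified firewall rule; the field names are illustrative only."""
    source: str
    destination: str
    service: str
    action: str


def missing_rules(source_rules, destination_rules):
    """Return rules present at the source site but absent at the destination."""
    return sorted(set(source_rules) - set(destination_rules))


# Hypothetical rule sets for the two sites.
frankfurt = [
    Rule("192.168.10.0/24", "ANY", "HTTPS", "ALLOW"),
    Rule("192.168.20.0/24", "192.168.10.0/24", "ICMP", "ALLOW"),
]
oregon = [
    Rule("192.168.10.0/24", "ANY", "HTTPS", "ALLOW"),
]

if __name__ == "__main__":
    for rule in missing_rules(frankfurt, oregon):
        print("Missing at destination:", rule)
```

Any rule reported as missing is a candidate for configuration at the destination SDDC before an outage occurs.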

Execution summary

Failover

Networking and communications are disabled and unavailable in the production site.
Change name resolution for one of the VMware HCX managers to disable communications between VMware HCX managers.
Disconnect the VMnic for the VMware HCX Interconnect (IX) and Network Extension (L2e) appliances to simulate outage of the service mesh.
Network extensions are forcibly unextended in the SDDC (and on premises, if possible).
Networks in the destination SDDC are set as routable segments.
Disaster recovery plans are executed to fail into the destination SDDC.

Failback

VPN/network connectivity is reestablished.
Networks are re-extended.
Failback plans are executed.

Test results

Workloads are able to recover into the destination SDDC and come online with original IP addresses.

Steps to replicate test

Disconnect or disable the VPN.

Figure 4: VPN is disabled.

Force unextend of the network on the local/source side. Forcing the network to unextend is necessary as the VMware HCX managers will not be able to communicate. In a true outage situation, it may not be possible to execute this step.

Figure 5: Force unextend networks on premises.

Force unextend the network in the cloud/recovery SDDC. Forcing the network to unextend is necessary as the VMware HCX managers will not be able to communicate. Check the box to allow the VMware HCX manager in the destination SDDC to connect the cloud network to the cloud edge gateway for routing capabilities.

Figure 6: Force unextend networks in the SDDC.

Figure 7: Connect the segment to the edge gateway router in the SDDC during the unextend operation.

Figure 8: Unextend in progress.

Update the destination network in the cloud to routed (if not already set to routed by VMware HCX). Note: This is a manual task, performed in the SDDC Networking and Security page. It is possible to automate this by using the NSX-T API for VMware Cloud on AWS.

Figure 9: Changing the network type from Disconnected to Routed.

Figure 10: Changing the network type from Disconnected to Routed.

Figure 11: Cloud segments updated to Routed.
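The note above mentions that the change from Disconnected to Routed can be automated through the NSX-T API. The following sketch only builds the HTTP request and does not send it. The endpoint path, the "type": "ROUTED" field, and the csp-auth-token header are assumptions about the NSX-T Policy API as exposed by VMware Cloud on AWS and should be verified against the API reference for your SDDC version. The base URL and segment ID are placeholders.

```python
import json
import urllib.request

# Assumed Policy API path for a compute gateway segment; verify before use.
SEGMENT_PATH = "/policy/api/v1/infra/tier-1s/cgw/segments/{segment_id}"


def build_routed_request(base_url, segment_id, token):
    """Build (but do not send) a PATCH request that marks a segment as routed."""
    url = base_url.rstrip("/") + SEGMENT_PATH.format(segment_id=segment_id)
    body = json.dumps({"type": "ROUTED"}).encode("utf-8")  # assumed field and value
    return urllib.request.Request(
        url,
        data=body,
        method="PATCH",
        headers={
            "Content-Type": "application/json",
            "csp-auth-token": token,  # assumed authentication header name
        },
    )


# Placeholder values; a real SDDC exposes its own NSX reverse-proxy URL.
request = build_routed_request(
    "https://nsx.example.invalid/sddc-proxy",
    "extended-segment-1",
    "<access-token>",
)

if __name__ == "__main__":
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
    # Sending would be: urllib.request.urlopen(request)
```

Keeping request construction separate from sending makes it possible to review the exact call in a change window before it is run against a live SDDC.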

Update firewall rules to allow inbound/outbound access to and from the newly routed network segments.

Figure 12: Update SDDC compute gateway firewall to allow communications to new routed segments.

Figure 13: Updated rules allow inbound and outbound traffic to new, routed segments.

Ensure the failover plan is configured to fail over to the correct network.

Figure 14: Selecting the failover network.

Fail over the DR plans.

Figure 15: Failover DR plan.

Confirm IP address of the failover workloads and test connectivity.

Figure 16: Recovered workload confirming IP address.

Figure 17: Recovered workload confirming connectivity to other networks in the SDDC.

Figure 18: Recovered workload confirming internet connectivity.
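The confirmation step above can also be scripted. The sketch below compares the addresses reported by recovered VMs against the expected addresses of the two protected VMs on the extended segments, and offers a basic reachability probe. The screenshots in this paper use ping (ICMP); a TCP connection attempt is substituted here because it needs no elevated privileges. How the observed addresses are collected is outside the scope of the sketch, and the function names are hypothetical.

```python
import socket

# Expected addresses of the protected VMs on the extended segments.
EXPECTED_IPS = {
    "VM1a": "192.168.10.10",
    "VM2a": "192.168.20.10",
}


def ip_mismatches(expected, observed):
    """Return {vm: (expected_ip, observed_ip)} for every VM that differs."""
    return {
        name: (ip, observed.get(name))
        for name, ip in expected.items()
        if observed.get(name) != ip
    }


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Hypothetical observation gathered after failover.
    observed = {"VM1a": "192.168.10.10", "VM2a": "192.168.20.10"}
    print("Mismatches:", ip_mismatches(EXPECTED_IPS, observed))
```

An empty result from ip_mismatches indicates that each recovered VM came online with its original address.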

Outage scenario 2: Outage of internet/WAN

Figure 19: Partial outage; outage of internet/WAN.

Scenario overview

In this case, WAN/internet connectivity has failed, cutting off communications between the on-premises production data center and the VMware Cloud on AWS SDDC, which houses its own production workloads. Although the underlying cause is fundamentally different, this scenario is functionally the same as outage scenario 1. If a disaster is declared, all impacted or flagged workloads deemed critical will have to be recovered into the VMware Cloud on AWS SDDC.