Protecting VMware Cloud on AWS Outposts SDDC using VCDR

Introduction

In this article we are going to look at how we can protect a VMware Cloud on AWS Outposts (VMC-O) SDDC using VMware Cloud Disaster Recovery (VCDR).

If you are not familiar with VMware Cloud on AWS Outposts, I encourage you to read the following two articles:

VMware Cloud on AWS Outposts - Solution Overview, which covers the basics
VMware Cloud on AWS Outposts - Deep Dive, which covers in detail how it works and how it's connected from on-prem to the AWS backbone, as well as what the network requirements are.

So, we have our nicely deployed Software Defined Data Center (SDDC) running on-prem on AWS Outposts, and that is awesome! But what if we now need to protect it from a disaster? Can we leverage our existing VCDR suite? The answer is yes! Let's have a look at how we can accomplish that.

Overview

This is what the logical design and the building blocks look like:

Protected Site

The VMC-O SDDC running on-prem in the customer data centre. Thanks to our VCDR solution, multiple DR as a Service (DRaaS) Connectors have been deployed locally, which will enable the storage replication from on-prem to the Scale-out Cloud File System (SCFS).

Please note: the SDDC is logically presented in a specific AWS Region and Availability Zone (AZ), which are chosen when ordering Outposts and can't be changed afterwards.

VMware Cloud Disaster Recovery (VCDR)

In this block we have the SaaS Orchestrator and the Scale-out Cloud File System (SCFS), the core components for VCDR and data replication.

SaaS Orchestrator: the orchestrator engine that runs in the same AWS region as the recovery SDDC. It's responsible for constantly checking the consistency of the protection groups, monitoring the health of the SDDCs (both protected and recovery) and executing the DR plans.
SCFS: an NFS-based cloud storage, highly optimised, encrypted and catalogued, where snapshots are stored in native vSphere format. This file system can be live-mounted for instant recovery, and it is used to retrieve backup snapshots to and from the protected and recovery sites.

VMC on AWS SDDC

This represents the target (or recovery) SDDC instance, which can either be deployed On-Demand or as Pilot Light (differences explained in this article).

Now let's have a look at how easy it is to configure and protect your VMC on AWS Outposts SDDC using VCDR. I'm going to assume the VCDR has never been configured before, thus there will be steps in the following list which are part of the standard VCDR for VMC on AWS configuration, for which I will refer to the existing public documentation.

VCDR Configuration - Steps

At a high level, the configuration steps are exactly the same as if you were protecting an on-prem SDDC or a VMC on AWS SDDC.

Configure the API Token
Deploy the Cloud file system
Configure the Protected site
Create a Protection Group
Configure the Recovery SDDC
Create the Recovery Plan
DR Failover (action the recovery plan)

The VCDR dashboard gives you a great overview of the 6 main steps required, as shown in the following screenshot:

The only extra step we are going to add is the Ransomware Recovery add-on, which is part of the recovery plan configuration. To support VMC-O as a Protected site on VCDR, the minimum required version of VCDR is 7.26.1. My testing environment was deployed as follows:

	Protected Site VMC on Outposts	VCDR	Recovery Site VMC AWS	Ransomware Recovery
Availability zone	az1	az2	az2	az2
Region	US West (Oregon)	US West (Oregon)	US West (Oregon)	US West (Oregon)

As you can see from this table, the protected site and the recovery site can be in the same region (this is for customers who are restricted to one AWS region due to limited availability or data locality laws). However, they must be in different availability zones.

API Token Configuration

An API token is required in order to authorise access to this service within the organisation on the Cloud Console.

Configuring the API Token is very straightforward and the process is already covered in the public documentation for VCDR > Add the API Token.

Please note: in upcoming releases, configuring the API token will no longer be necessary.

Cloud File System Deployment

A cloud file system allows you to provision storage capacity to be allocated for your protected site. This storage is where the replicated backup snapshots will reside, and it is where the protected workloads' data is going to be pulled from if and when a recovery operation is initiated. Under the hood, this datastore is mounted as an NFSv3 share and attached to the vSphere cluster.

The cloud file system and all recovery SDDCs must be in the same AZ inside one AWS Region. However, the protected SDDC can't be in the same AZ as the recovery one. It must be in a different AZ from the recovery AZ, which is selected during the initial file system deployment and is where the NFS-based storage will run.

Configuring a cloud file system is very well documented because it's part of the standard VCDR practice for VMC on AWS, but to summarise, this is what you're going to need in the process:

Protected site: select which SDDC you want to protect. It must be in a different AZ from the Recovery site.
Recovery site: deploy a new one or select an existing one, which will be selectable as long as it's different from the Protected site. In any case, this site can only be in the recovery AZ of choice.
Recovery AZ: used exclusively for recovery operations.
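
The placement rules above are easy to get wrong when planning, so here is a minimal Python sketch that checks them before you start the deployment. The Placement class and check_placement function are hypothetical names made up for this illustration; they are not part of any VCDR API, and the AZ labels are simply the ones from my test environment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    """Region and Availability Zone of one component (hypothetical model)."""
    region: str
    az: str


def check_placement(protected: Placement, cloud_fs: Placement,
                    recovery: Placement) -> list[str]:
    """Return a list of rule violations; an empty list means the layout is valid."""
    problems = []
    # The cloud file system and the recovery SDDC must share region and AZ.
    if cloud_fs != recovery:
        problems.append(
            "cloud file system and recovery SDDC must be in the same region and AZ")
    # The protected SDDC must not sit in the recovery AZ.
    if protected == recovery:
        problems.append(
            "protected SDDC must be in a different AZ from the recovery SDDC")
    return problems


# Layout from the test environment table earlier in this article.
OREGON = "US West (Oregon)"
layout_problems = check_placement(
    protected=Placement(OREGON, "az1"),
    cloud_fs=Placement(OREGON, "az2"),
    recovery=Placement(OREGON, "az2"),
)
print(layout_problems or "placement looks valid")
```

An empty result only means the planned layout is consistent with the constraints described in this section; the VCDR UI remains the authority on what can actually be selected.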

Set up a protected site

Now it's time to tell VCDR which site we want to protect, which involves deploying the DRaaS Connectors (virtual machines responsible for the data replication) into the vSphere cluster(s).

Please note: When protecting an SDDC using VMware Cloud DR, the recovery SDDC and VMware Cloud DR deployment must be in the same CSP organization as the protected SDDC.

A VCDR protected site includes vCenter Servers, protection groups and recovery plans, looking like this:

For more details, refer to the official VCDR documentation Set Up Protected Sites, but here are the highlights:

you should deploy at least two connectors per protected site, for redundancy purposes.
VCDR supports protecting up to 6,000 VMs on a site with a single vCenter Server, for which you will need four separate protected sites, each with its own cloud file system (thus four cloud file systems).
you should deploy one DRaaS Connector for every 250 VMs in the protected site, whether or not all 250 VMs will be protected.
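
As a quick worked example of the sizing guidance above, the following Python sketch combines the two rules (one connector per 250 VMs in the site, and never fewer than two). The function name is hypothetical and the numbers come straight from the bullets above; always check the current VCDR documentation for the limits that apply to your release.

```python
VMS_PER_CONNECTOR = 250  # one DRaaS Connector for every 250 VMs in the site
MIN_CONNECTORS = 2       # at least two per protected site, for redundancy


def connectors_needed(total_vms_in_site: int) -> int:
    """Connectors to deploy for a site, counting ALL VMs, protected or not."""
    if total_vms_in_site < 0:
        raise ValueError("VM count cannot be negative")
    # Integer ceiling division: 501 VMs need 3 connectors, not 2.
    by_ratio = (total_vms_in_site + VMS_PER_CONNECTOR - 1) // VMS_PER_CONNECTOR
    return max(MIN_CONNECTORS, by_ratio)


for vm_count in (100, 500, 501, 1400):
    print(f"{vm_count} VMs -> {connectors_needed(vm_count)} connectors")
```

Note that the input is the total number of VMs in the site rather than the number you intend to protect, which matches the last bullet above.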

The following firewall rules must be created on the Compute Gateway firewall on the Protected SDDC:

Name	Source	Destination	Services	Action
CloudDR-ConnectorTovCenter	CloudDR-Connector-Segment (subnet)	CloudDR-vCenter (single IP)	HTTPS (443)	Allow
CloudDR-ConnectorToBackupSite	CloudDR-Connector-Segment (subnet)	CloudDR-BackupSite (subnet)	HTTPS (443)	Allow
CloudDR-ConnectorToOrchestrator	CloudDR-Connector-Segment (subnet)	CloudDR-Orchestrator (single IP or subnet)	HTTPS (443)	Allow
CloudDR-ConnectorToAutoSupportServer	CloudDR-Connector-Segment (subnet)	CloudDR-AutoSupportServer (single IP)	HTTPS (443)	Allow

The following firewall rules must be created on the Management Gateway firewall on the Protected SDDC:

Name	Source	Destination	Services	Action
CloudDR-VCenterInboundFromConnector	CloudDR-Connector-Segment (subnet)	ProtectedSDDC-vCenter (single IP)	HTTPS (443)	Allow
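
If you keep your firewall configuration in code or export it for review, a small checklist like the following Python sketch can flag any of the rules above that are missing. The Rule tuple, the "CGW"/"MGW" labels and the missing_rules helper are assumptions made for this illustration; the sketch does not talk to NSX and only compares data you provide.

```python
from typing import Iterable, NamedTuple


class Rule(NamedTuple):
    gateway: str      # "CGW" (Compute Gateway) or "MGW" (Management Gateway)
    name: str
    source: str
    destination: str
    port: int


# The five rules listed in the two tables above, all HTTPS (443) / Allow.
REQUIRED_RULES = [
    Rule("CGW", "CloudDR-ConnectorTovCenter",
         "CloudDR-Connector-Segment", "CloudDR-vCenter", 443),
    Rule("CGW", "CloudDR-ConnectorToBackupSite",
         "CloudDR-Connector-Segment", "CloudDR-BackupSite", 443),
    Rule("CGW", "CloudDR-ConnectorToOrchestrator",
         "CloudDR-Connector-Segment", "CloudDR-Orchestrator", 443),
    Rule("CGW", "CloudDR-ConnectorToAutoSupportServer",
         "CloudDR-Connector-Segment", "CloudDR-AutoSupportServer", 443),
    Rule("MGW", "CloudDR-VCenterInboundFromConnector",
         "CloudDR-Connector-Segment", "ProtectedSDDC-vCenter", 443),
]


def missing_rules(configured: Iterable[Rule]) -> list[Rule]:
    """Return the required rules that are absent from the configured set."""
    present = set(configured)
    return [rule for rule in REQUIRED_RULES if rule not in present]


# Example: pretend the Management Gateway rule was forgotten.
for rule in missing_rules(REQUIRED_RULES[:4]):
    print(f"missing on {rule.gateway}: {rule.name}")
```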

After creating the protected site called "Incubation Outposts" we need to deploy at least a pair of connectors on each cluster where we have virtual machines to protect from a disaster.

To do that, from Protected Sites > Clusters select DEPLOY and you will be presented with the following step-by-step instructions:

The OVA deployment is a standard one, and you will need to provide the following information:

OVA URL
VM name and folder
compute resource where it will run (remember you need at least one connector on each cluster where you have VMs to protect)
accept EULA
select the datastore
select the network port group

After the first boot we must configure the network settings and register the connector against the DRaaS backend.
From the DRaaS connector VM console, log in with the credentials provided (see previous screenshot) and perform the following:

configure the OVA network (DHCP or static)
enter the Orchestrator FQDN (see previous screenshot)
enter the temporary passcode
wait for the successful registration message.
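
If the registration does not complete, a first thing to rule out is basic reachability of the Orchestrator over HTTPS (443), which is what the firewall rules earlier in this section are meant to allow. The following Python sketch is a generic TCP reachability test, assuming you run it from a machine on the same segment as the connector; the can_reach name and the example FQDN in the comment are placeholders, not real VCDR endpoints.

```python
import socket


def can_reach(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts alike.
        return False


# Usage (replace the placeholder with the Orchestrator FQDN from the VCDR UI):
#   can_reach("orchestrator.example.invalid")
```

A successful TCP connection only shows the path is open; it does not validate the passcode or the TLS session, so treat it as a first diagnostic step rather than proof that registration will succeed.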

Back in the VCDR GUI we can now see the connector was added successfully. Although it is optional, for production environments we recommend deploying a second DRaaS Connector, both for redundancy and for better replication performance.

Create a protection group

A protection group is how you group virtual machines together so that snapshots are taken in a consistent way. Dynamic grouping is possible thanks to vCenter Server tags, VM name patterns or VM folders. In my example here I'm using a tag called dr-enabled.

From Protection Group select Create Protection Group. You will need to:

name the protection group
select the protected site and whether or not you want to leverage High Frequency Snapshot (HFS)
select the criteria for VMs dynamic grouping within the protection group. This can be tags, VM name pattern or VM folder.
select the schedule for snapshot replication, which can be as frequent as every 30 minutes, with snapshots retained for hours, days, weeks or years.
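
To make the idea of dynamic grouping concrete, here is a minimal Python sketch that selects VMs from an inventory by tag, name pattern or folder. It is only an illustration: the VM class and select_vms function are hypothetical, shell-style wildcards are an assumption about how name patterns behave, and combining several criteria with AND is a choice made for this sketch, not a statement about how VCDR evaluates them.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase
from typing import Optional


@dataclass
class VM:
    name: str
    folder: str
    tags: set[str] = field(default_factory=set)


def select_vms(vms: list[VM], *, tag: Optional[str] = None,
               name_pattern: Optional[str] = None,
               folder: Optional[str] = None) -> list[VM]:
    """Return the VMs matching every criterion that was supplied."""
    if tag is None and name_pattern is None and folder is None:
        raise ValueError("at least one grouping criterion is required")
    selected = []
    for vm in vms:
        if tag is not None and tag not in vm.tags:
            continue
        if name_pattern is not None and not fnmatchcase(vm.name, name_pattern):
            continue
        if folder is not None and vm.folder != folder:
            continue
        selected.append(vm)
    return selected


inventory = [
    VM("web-01", "/prod", {"dr-enabled"}),
    VM("web-02", "/prod"),
    VM("db-01", "/prod/db", {"dr-enabled"}),
]
# Same criterion as in my example: every VM carrying the dr-enabled tag.
print([vm.name for vm in select_vms(inventory, tag="dr-enabled")])
```

Because membership is evaluated from the criteria rather than from a static list, a VM that gains the tag later would be picked up by the group without editing the group itself.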

Add a recovery SDDC

I'm going to assume the organisation decided to save money by not deploying a DR SDDC ahead of time (deploying one ahead of time is known as Pilot Light). Back in the VCDR GUI, from Recovery SDDCs select Add Recovery SDDC. From here you will need to:

name the Recovery SDDC
select the host type (i3, i3en or i4i)
select the number of hosts
select the private management subnet, which will be used for vCenter Server, NSX Manager and ESXi hosts
select the private proxy subnet, which will be used for VMware Cloud DR proxy VMs and must be a /26
review the AWS region, account ID and VPC setting
confirm the SDDC deployment kick-off

In my case I'm going to deploy an SDDC on-demand (Just-in-Time) as follows:

Name: CI Recovery SDDC
Hosts: 2, Type: i3
Management subnet: 10.192.112.0/20
Compute subnet: 192.168.1.0/24 for segment sddc-cgw-network-1
Cloud Proxy subnet: 10.68.97.0/26
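
Before kicking off a deployment that takes a couple of hours, it can be worth sanity-checking the subnets. The Python sketch below uses the standard ipaddress module to confirm the proxy subnet is a /26, as required above, and that the three ranges do not overlap. The overlap check is my own assumption about good practice rather than a rule quoted from the VCDR documentation, and the function name is hypothetical.

```python
import ipaddress
from itertools import combinations


def check_recovery_subnets(management: str, compute: str, proxy: str) -> list[str]:
    """Return a list of problems found; an empty list means no issues."""
    # ip_network raises ValueError if a CIDR has host bits set (e.g. 10.0.0.1/24).
    nets = {
        "management": ipaddress.ip_network(management),
        "compute": ipaddress.ip_network(compute),
        "proxy": ipaddress.ip_network(proxy),
    }
    problems = []
    # Requirement from the steps above: the proxy subnet must be a /26.
    if nets["proxy"].prefixlen != 26:
        problems.append(f"proxy subnet {nets['proxy']} must be a /26")
    # Assumption of this sketch: none of the ranges should overlap.
    for (name_a, net_a), (name_b, net_b) in combinations(nets.items(), 2):
        if net_a.overlaps(net_b):
            problems.append(f"{name_a} {net_a} overlaps {name_b} {net_b}")
    return problems


# Values used for the CI Recovery SDDC above.
subnet_problems = check_recovery_subnets(
    management="10.192.112.0/20",
    compute="192.168.1.0/24",
    proxy="10.68.97.0/26",
)
print(subnet_problems or "subnets look consistent")
```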

After a couple of hours the SDDC should be up and running.

Create a recovery plan

A recovery plan contains all the details of the virtual machines you are protecting, as well as the resource mappings between the source (or protected) SDDC and the target (or recovery) SDDC. Let's dive into what a recovery plan contains.

Sites: you specify which site you're protecting and which SDDC you want to fail over to
Groups: the protection groups which will be orchestrated during a recovery
vCenter Servers: mapping the protected and the recovery vCenter Server
vCenter Server folders: mapping the vCenter Server VMs and Template folder structure for both the protected and the recovery site.
Compute resources: mapping the protected and recovery vSphere clusters
Virtual networks: mapping the protected network port group into the recovery port groups.
IP addresses: here you specify if and how you want to re-IP your workloads
Scripts: you have the option to run scripts (VMware Tools required) post-failover
Ransomware recovery (optional): if the service is enabled in the region, you have the option to enable ransomware recovery, which will initiate forensic analysis on the guest OS leveraging Carbon Black sensors
Alert: for additional email notifications

Recovery plan compliance is checked every 30 minutes, to make sure all configurations are validated and still applicable.
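
One way to picture what such a compliance check has to catch is a mapping gap: a folder, cluster or port group used on the protected site with no counterpart on the recovery site. The Python sketch below is a simplified illustration of that idea only. The category names, the sample resource names and the find_unmapped helper are all hypothetical, and the real VCDR compliance check covers more than this.

```python
def find_unmapped(in_use: dict[str, set[str]],
                  mappings: dict[str, dict[str, str]]) -> dict[str, set[str]]:
    """Return, per category, the protected-site resources with no mapping."""
    gaps = {}
    for category, sources in in_use.items():
        mapped_sources = set(mappings.get(category, {}))
        missing = sources - mapped_sources
        if missing:
            gaps[category] = missing
    return gaps


# Resources referenced by the protected VMs (made-up sample data).
in_use = {
    "folders": {"/prod", "/prod/db"},
    "compute": {"Cluster-1"},
    "networks": {"app-segment", "db-segment"},
}
# Protected -> recovery mappings as they might be defined in a plan.
mappings = {
    "folders": {"/prod": "/recovered", "/prod/db": "/recovered/db"},
    "compute": {"Cluster-1": "Cluster-1"},
    "networks": {"app-segment": "sddc-cgw-network-1"},
}
print(find_unmapped(in_use, mappings))
```

In this sample the db-segment network has no recovery counterpart, which is exactly the kind of drift a periodic check is there to surface before a failover is needed.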

DR Failover Test and DR Failover

Testing a DR Failover is essential to make sure the plan is solid and will work when the time comes to perform an actual disaster recovery failover. For this reason, you can perform a failover test, where you will be given the option to:

select which snapshot (consistent across all protected VMs) to get from the Cloud FS
stop or continue upon any error
run the VMs live on the Cloud FS or perform a full storage migration from the Cloud FS to recovery vSAN SDDC

Once the DR plan has been tested and validated for the first time, testing it regularly should become a standard exercise.

Now, let's assume our on-prem VMC on AWS Outposts has experienced a fire event and the rack is unrecoverable. Let's invoke a full DR Failover from the VCDR GUI. After selecting the DR plan, click on DR Failover and you will be asked:

to review the compliance check and that everything is healthy
to select the snapshot to restore the protection group to
whether or not to stop on any runtime error
to select the storage where you want the failed-over VMs to run into (live-mounted SCFS or recovery vSAN datastore)
to review the failover plan
to confirm the failover execution start

When a plan has finished executing and all of the steps in the running plan workflow have completed, you must commit the failover to confirm you are happy with the outcome of the recovery. A Failback Plan can be automatically created on your behalf in order to move the workloads back to the protected site, if and when it becomes available again. See Running a Recovery Plan for Failover.

Upon completion and commit of the DR failover, you can download a PDF report containing all the actions performed by the orchestrator. A one-page sample is shown here:

Failback Recovery Plan

The Failback process is essentially the opposite of the DR Failover, with one major difference: only the data that has changed (delta) will be "appended" back to the same snapshot which was used when the failover was invoked.

Quoting from the official documentation Run a Failback Recovery Plan, the steps are as follows:

virtual machines are powered off on the recovery SDDC.
the last VM snapshot is taken with powered-off VMs. The differences between the VM state at the time of recovery and failback are then applied to the snapshot used for recovery to construct a VM backup on the cloud file system for subsequent retrieval.
the VM backups are then retrieved to a protected site system using a forever incremental protocol.
VMs are recovered (storage migration) to a protected site.
upon successful recovery, the VMs are automatically deleted from the recovery SDDC.
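
To illustrate why sending only the delta is cheaper than sending a full copy, here is a deliberately simplified Python sketch. It models a disk as a dictionary of block offsets to contents, computes which blocks changed since the snapshot used for recovery, and applies them on top of that snapshot. This is a conceptual toy, not VCDR's snapshot format or replication protocol; the function names are made up and deleted blocks are ignored for brevity.

```python
def changed_blocks(base: dict[int, bytes],
                   current: dict[int, bytes]) -> dict[int, bytes]:
    """Blocks that are new or different in `current` compared with `base`."""
    return {offset: data
            for offset, data in current.items()
            if base.get(offset) != data}


def apply_delta(base: dict[int, bytes],
                delta: dict[int, bytes]) -> dict[int, bytes]:
    """Rebuild the current state from the base snapshot plus the delta."""
    merged = dict(base)
    merged.update(delta)
    return merged


# Snapshot used at failover time, and the VM state at failback time.
recovery_snapshot = {0: b"boot", 1: b"app-v1", 2: b"data-a"}
state_at_failback = {0: b"boot", 1: b"app-v2", 2: b"data-a", 3: b"data-b"}

delta = changed_blocks(recovery_snapshot, state_at_failback)
print(f"{len(delta)} of {len(state_at_failback)} blocks need to be transferred")
```

In this toy example only two of the four blocks travel back to the protected site, and the protected site can reconstruct the full state because it already holds the base snapshot.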

I should point out that any new virtual machine created on the Recovery SDDC will be excluded from the Failback plan because it was not part of the protected site.

Failing back is invoked by selecting the automatically created Failback Recovery Plan and starting the FAILOVER FROM VMC wizard.

Once the failback has been performed, you will be able to see all the steps performed, along with the duration of each task.

As with the DR Failover, you will be required to commit the changes to confirm you're happy with the outcome of the failback process, meaning you have tested and validated that all VMs and services are working as expected.

Ransomware Recovery (Optional)

Ransomware recovery is an optional service that you can enable in your VMC Organisation, which allows you to restore workloads from the cloud file system into an air-gapped staging environment, also referred to as an Isolated Recovery Environment (IRE). After a ransomware attack, you can launch a recovery plan specifying that you want to perform ransomware recovery and selecting virtual machines from a deep snapshot history.

These VMs will be powered on in an IRE where forensic analysis and validation can be performed post-recovery, leveraging Carbon Black Cloud sensors installed on the guest operating systems.

Ransomware recovery is fully supported for VMC on AWS Outposts and works the same way as for your native on-prem vSphere or VMC on AWS SDDC. Covering Ransomware Recovery in full detail is out of scope for this article. For all the details, principles and documentation, please refer to the existing official documentation for VMware Cloud DR - Ransomware Recovery.

Summary

In this article we have covered how we can protect an SDDC running on-prem as VMC on AWS Outposts (VMC-O) from a disaster.

More specifically, we have established that VCDR can protect and successfully fail over and fail back workloads running on VMC-O while adhering to the processes documented for protecting VMC on AWS with VCDR. The same configuration constraints regarding availability zones for VMC on AWS and VCDR apply to VMC-O.
