Protecting VMware Cloud on AWS Outposts SDDC using VCDR
Introduction
In this article we are going to look at how we can protect a VMware Cloud on AWS Outposts (VMC-O) SDDC using VMware Cloud Disaster Recovery (VCDR).
If you are not familiar with VMware Cloud on AWS Outposts, I encourage you to read the following two articles:
- VMware Cloud on AWS Outposts - Solution Overview, which covers the basics
- VMware Cloud on AWS Outposts - Deep Dive, which covers the details of how it works and how it is connected from on-prem to the AWS backbone, as well as what the network requirements are.
So, we have our nicely deployed Software Defined Data Center (SDDC) running on-prem on AWS Outposts, and that is awesome! But what if we now need to protect it from a disaster? Can we leverage our existing VCDR suite? The answer is yes! Let's have a look at how we can accomplish that.
Overview
This is what the logical design and the building blocks look like:
Protected Site
The VMC-O SDDC running on-prem in the customer data centre. Thanks to our VCDR solution, multiple DR as a Service (DRaaS) Connectors have been deployed locally, which will enable the storage replication from on-prem to the Scale-out Cloud File System (SCFS).
Please note: the SDDC is logically presented in a specific AWS Region and Availability Zone (AZ), which are chosen when ordering Outposts and can't be changed afterwards.
VMware Cloud Disaster Recovery (VCDR)
In this block we have the SaaS Orchestrator and the Scale-out Cloud File System (SCFS), the core components for VCDR and data replication.
- SaaS Orchestrator: the orchestrator engine, which runs in the same AWS region as the recovery SDDC. It is responsible for constantly checking the consistency of the protection groups, monitoring the health of the SDDCs (both protected and recovery), and executing the DR plans.
- SCFS: an NFS based cloud storage, highly optimised, encrypted and catalogued, where snapshots are stored in native vSphere format. This file system can be live-mounted for instant recovery and it is used to retrieve backup snapshots to and from the protected and recovery site.
VMC on AWS SDDC
This represents the target (or recovery) SDDC instance, which can either be deployed On-Demand or as Pilot Light (differences explained in this article).
Now let's have a look at how easy it is to configure and protect your VMC on AWS Outposts SDDC using VCDR. I'm going to assume VCDR has never been configured before, so some steps in the following list are part of the standard VCDR for VMC on AWS configuration; for those I will refer to the existing public documentation.
VCDR Configuration - Steps
At a high level, the configuration steps are exactly the same as if you were protecting an on-prem SDDC or a VMC on AWS SDDC.
- Configure the API Token
- Deploy the Cloud file system
- Configure the Protected site
- Create a Protection Group
- Configure the Recovery SDDC
- Create the Recovery Plan
- DR Failover (action the recovery plan)
The VCDR dashboard gives you a great overview of the 6 main steps required, as shown in the following screenshot:
The only extra step we are going to add is the Ransomware Recovery add-on, which is part of the recovery plan configuration. To support VMC-O as a protected site, the minimum required VCDR version is 7.26.1. My testing environment was deployed as follows:
Setting | Protected Site VMC on Outposts | VCDR | Recovery Site VMC AWS | Ransomware Recovery |
---|---|---|---|---|
Availability zone | az1 | az2 | az2 | az2 |
Region | US West (Oregon) | US West (Oregon) | US West (Oregon) | US West (Oregon) |
As you can see from this table, the protected site and the recovery site can be in the same region (this is for customers who are restricted to one AWS region due to limited availability or data locality laws); however, they must be in different availability zones.
API Token Configuration
An API token is required in order to authorise access to this service within the organisation on the Cloud Console.
Configuring the API token is very straightforward and the process is already covered in the public documentation for VCDR > Add the API Token
Please note: in future releases, configuring the API token will no longer be necessary.
Cloud File System Deployment
A cloud file system allows you to provision storage capacity to be allocated for your protected site. This storage is where the replicated backup snapshots reside, and it is where the protected workloads' data is pulled from if and when a recovery operation is initiated. Under the hood, this datastore is mounted as an NFSv3 share and attached to the vSphere cluster.
The cloud file system and all recovery SDDCs must be in the same AZ inside one AWS Region. However, the protected SDDC can't be in the same AZ as the recovery one; it must be in a different AZ. The recovery AZ is selected during the initial file system deployment, and that is where the NFS-based storage will run.
Configuring a cloud file system is very well documented because it's part of the standard VCDR practice for VMC on AWS but to summarise this is what you're going to need in the process:
- Protected site: select which SDDC you want to protect, it must be on a different AZ compared to the Recovery site
- Recovery site: deploy a new one or select an existing one, which will be selectable as long as it's different from the Protected site. In any case this site can only be in the same AZ as the recovery AZ of choice.
- Recovery AZ: used exclusively for recovery operations.
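The placement rules above can be written down as a small pre-flight check. This is only a sketch under stated assumptions: the `Placement` class and `check_placement` function are hypothetical names made up for illustration and are not part of any VCDR tooling; the region and AZ values come from the test environment described earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    """Region and Availability Zone of one building block (hypothetical model)."""
    region: str
    az: str


def check_placement(protected: Placement, cloud_fs: Placement, recovery: Placement) -> list[str]:
    """Return the violated placement rules; an empty list means the layout is valid."""
    problems = []
    # The cloud file system and the recovery SDDC must share region and AZ.
    if cloud_fs != recovery:
        problems.append("cloud file system and recovery SDDC must be in the same region and AZ")
    # The protected SDDC may share the region, but never the AZ, with the recovery SDDC.
    if protected == recovery:
        problems.append("protected SDDC must not be in the same AZ as the recovery SDDC")
    return problems


# Layout of the test environment: one region, protected site in az1, the rest in az2.
protected = Placement("US West (Oregon)", "az1")
cloud_fs = Placement("US West (Oregon)", "az2")
recovery = Placement("US West (Oregon)", "az2")
print(check_placement(protected, cloud_fs, recovery))  # prints []
```

Running the same check with the protected site moved to az2 would report the second rule as violated.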
Set up a protected site
Now it's time to tell VCDR which site we want to protect, which involves deploying the DRaaS Connectors (virtual machines responsible for the data replication) into the vSphere cluster(s).
Please note: When protecting an SDDC using VMware Cloud DR, the recovery SDDC and VMware Cloud DR deployment must be in the same CSP organization as the protected SDDC.
A VCDR protected site includes vCenter Servers, protection groups and recovery plans, and looks like this:
For more details refer to the official VCDR documentation, Set Up Protected Sites, but here are the highlights:
- you should deploy at least two connectors per-protected site, for redundancy purposes.
- VCDR supports protecting up to 6,000 VMs on a site with a single vCenter Server, for which you will need four separate protected sites, each with its own cloud file system (thus four cloud file systems).
- you should deploy one DRaaS Connector for every 250 VMs in the protected site, whether or not all 250 VMs will be protected.
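As a rough illustration of the sizing guidance above, the connector count for a site can be estimated from the total number of VMs in that site. This is a minimal sketch: the function name `connectors_needed` is my own, and it simply combines the two rules from the list (one connector per 250 VMs, never fewer than two).

```python
import math

VMS_PER_CONNECTOR = 250  # one DRaaS Connector for every 250 VMs in the site
MIN_CONNECTORS = 2       # at least two connectors per protected site for redundancy


def connectors_needed(total_vms: int) -> int:
    """Estimate the connector count from ALL VMs in the site, protected or not."""
    if total_vms < 0:
        raise ValueError("total_vms must not be negative")
    return max(MIN_CONNECTORS, math.ceil(total_vms / VMS_PER_CONNECTOR))


for vms in (100, 250, 600, 1000):
    print(vms, "VMs ->", connectors_needed(vms), "connectors")
```

Note that the input is the total VM count of the site, not only the VMs you intend to protect, which matches the guidance above.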
The following firewall rules must be created on the Compute Gateway firewall on the Protected SDDC
Name | Source | Destination | Services | Action |
---|---|---|---|---|
CloudDR-ConnectorTovCenter | CloudDR-Connector-Segment (subnet) | CloudDR-vCenter (single IP) | HTTPS (443) | Allow |
CloudDR-ConnectorToBackupSite | CloudDR-Connector-Segment (subnet) | CloudDR-BackupSite (subnet) | HTTPS (443) | Allow |
CloudDR-ConnectorToOrchestrator | CloudDR-Connector-Segment (subnet) | CloudDR-Orchestrator (single IP or subnet) | HTTPS (443) | Allow |
CloudDR-ConnectorToAutoSupportServer | CloudDR-Connector-Segment (subnet) | CloudDR-AutoSupportServer (single IP) | HTTPS (443) | Allow |
The following firewall rules must be created on the Management Gateway firewall on the Protected SDDC
Name | Source | Destination | Services | Action |
---|---|---|---|---|
CloudDR-VCenterInboundFromConnector | CloudDR-Connector-Segment (subnet) | ProtectedSDDC-vCenter (single IP) | HTTPS (443) | Allow |
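If you prefer to keep these rules in a reviewable form before entering them in the gateway firewall, they can be captured as plain data. The sketch below only models the rows from the two tables; the `Rule` type and helper functions are hypothetical, and no NSX or VMC API is called.

```python
from collections import defaultdict
from typing import NamedTuple


class Rule(NamedTuple):
    gateway: str       # "compute" or "management"
    name: str
    source: str
    destination: str
    service: str
    port: int
    action: str


CONNECTOR_SEGMENT = "CloudDR-Connector-Segment"

# Rows copied from the Compute Gateway and Management Gateway tables.
RULES = [
    Rule("compute", "CloudDR-ConnectorTovCenter", CONNECTOR_SEGMENT, "CloudDR-vCenter", "HTTPS", 443, "Allow"),
    Rule("compute", "CloudDR-ConnectorToBackupSite", CONNECTOR_SEGMENT, "CloudDR-BackupSite", "HTTPS", 443, "Allow"),
    Rule("compute", "CloudDR-ConnectorToOrchestrator", CONNECTOR_SEGMENT, "CloudDR-Orchestrator", "HTTPS", 443, "Allow"),
    Rule("compute", "CloudDR-ConnectorToAutoSupportServer", CONNECTOR_SEGMENT, "CloudDR-AutoSupportServer", "HTTPS", 443, "Allow"),
    Rule("management", "CloudDR-VCenterInboundFromConnector", CONNECTOR_SEGMENT, "ProtectedSDDC-vCenter", "HTTPS", 443, "Allow"),
]


def rules_by_gateway(rules):
    """Group rule names by the gateway firewall they belong to."""
    grouped = defaultdict(list)
    for rule in rules:
        grouped[rule.gateway].append(rule.name)
    return dict(grouped)


def unexpected_rules(rules):
    """Return names of rules that are not 'allow HTTPS/443 from the connector segment'."""
    return [
        r.name for r in rules
        if (r.source, r.service, r.port, r.action) != (CONNECTOR_SEGMENT, "HTTPS", 443, "Allow")
    ]


for gateway, names in rules_by_gateway(RULES).items():
    print(gateway, names)
print("unexpected:", unexpected_rules(RULES))
```

Every rule has the connector segment as its source and allows HTTPS on port 443, which is what the second helper verifies.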
After creating the protected site called "Incubation Outposts" we need to deploy at least a pair of connectors on each cluster where we have virtual machines to protect from a disaster.
To do that, from Protected Sites > Clusters select DEPLOY and you will be presented with the following step by step instructions:
The OVA deployment is a classic one, and you will need to provide the following information:
- OVA URL
- VM name and folder
- compute resource where it will run (remember you need 1 connector on each cluster where you have VMs to protect)
- accept EULA
- select the datastore
- select the network port group
After the first boot we must configure the network settings and register the connector against the DRaaS backend.
From the DRaaS connector VM console, log in with the credentials provided (see previous screenshot) and perform the following:
- configure the OVA network (DHCP or static)
- enter the Orchestrator FQDN (see previous screenshot)
- enter the temporary passcode
- wait for the successful registration message.
Back in the VCDR GUI we can now see the connector was added successfully. Although it is optional, for production environments we recommend deploying a second DRaaS Connector for redundancy as well as to achieve better replication performance.
Create a protection group
A protection group is how you group virtual machines together so that snapshots are taken in a consistent way. Dynamic grouping is possible thanks to vCenter Server tags, VM name patterns or VM folders. In my example here I'm using a tag called dr-enabled.
From Protection Group select Create Protection Group. You will need to:
- name the protection group
- select the protected site and whether or not you want to leverage High Frequency Snapshot (HFS)
- select the criteria for VMs dynamic grouping within the protection group. This can be tags, VM name pattern or VM folder.
- select the schedule for snapshot replication, which can be as frequent as every 30 minutes, with snapshots kept for hours, days, weeks or years.
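To make the idea of dynamic grouping more concrete, here is a small sketch of the three membership criteria applied to an in-memory inventory. It is an illustration only: the `VM` class, the sample inventory and the glob-style pattern matching are assumptions of mine, and VCDR's actual pattern semantics may differ. Only the dr-enabled tag comes from the example above.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase


@dataclass
class VM:
    """Minimal, hypothetical view of a VM as seen by a protection group."""
    name: str
    folder: str
    tags: set[str] = field(default_factory=set)


def select_by_tag(vms, tag):
    return [vm.name for vm in vms if tag in vm.tags]


def select_by_name(vms, pattern):
    # Glob-style, case-sensitive matching is an assumption for this sketch.
    return [vm.name for vm in vms if fnmatchcase(vm.name, pattern)]


def select_by_folder(vms, folder):
    return [vm.name for vm in vms if vm.folder == folder]


# Hypothetical inventory.
inventory = [
    VM("web-01", "Prod", {"dr-enabled"}),
    VM("web-02", "Prod"),
    VM("db-01", "Databases", {"dr-enabled"}),
]

print(select_by_tag(inventory, "dr-enabled"))   # prints ['web-01', 'db-01']
print(select_by_name(inventory, "web-*"))       # prints ['web-01', 'web-02']
print(select_by_folder(inventory, "Databases")) # prints ['db-01']
```

Because membership is evaluated from the criteria rather than from a fixed list, a VM that later receives the tag joins the group without editing the group itself.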
Add a recovery SDDC
I'm going to assume the organisation decided to save money by not deploying a DR SDDC ahead of time (deploying one ahead of time is known as Pilot Light). Back in the VCDR GUI, from Recovery SDDCs select Add Recovery SDDC. From here you will need to:
- name the Recovery SDDC
- select the host type (i3, i3en or i4i)
- select the number of hosts
- select the private management subnet, which will be used for vCenter Server, NSX Manager and ESXi hosts
- select the private proxy subnet, which will be used for VMware Cloud DR proxy VMs and must be a /26
- review the AWS region, account ID and VPC setting
- confirm the SDDC deployment kick-off
In my case I'm going to deploy an SDDC on-demand (Just-in-Time) as follows:
- Name: CI Recovery SDDC
- Hosts: 2, Type: i3
- Management subnet: 10.192.112.0/20
- Compute subnet: 192.168.1.0/24 for segment sddc-cgw-network-1
- Cloud Proxy subnet: 10.68.97.0/26
After a couple of hours the SDDC should be up and running.
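Before kicking off the deployment, it can be worth sanity-checking the subnet plan. The sketch below uses Python's standard `ipaddress` module with the values from my deployment. The /26 requirement for the proxy subnet comes from the steps above; the check that the subnets do not overlap each other is an extra assumption of mine, not a rule quoted from the documentation.

```python
import ipaddress
from itertools import combinations

# Subnet plan for the recovery SDDC used in this article.
subnets = {
    "management": ipaddress.ip_network("10.192.112.0/20"),
    "compute": ipaddress.ip_network("192.168.1.0/24"),
    "cloud_proxy": ipaddress.ip_network("10.68.97.0/26"),
}


def validate(plan):
    """Return a list of problems found in the subnet plan (empty means OK)."""
    problems = []
    # The VMware Cloud DR proxy subnet must be a /26.
    if plan["cloud_proxy"].prefixlen != 26:
        problems.append("cloud_proxy must be a /26")
    # Assumption: none of the subnets should overlap with another one.
    for (name_a, net_a), (name_b, net_b) in combinations(plan.items(), 2):
        if net_a.overlaps(net_b):
            problems.append(f"{name_a} overlaps {name_b}")
    return problems


print(validate(subnets))  # prints []
```

`ip_network` also rejects a CIDR whose host bits are set, so a typo such as 10.68.97.1/26 fails early with a `ValueError`.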
Create a recovery plan
A recovery plan contains all the details of the virtual machines you are protecting, as well as the resource mappings between the source (or protected) SDDC and the target (or recovery) SDDC. Let's dive into what a recovery plan contains.
- Sites: you specify which one you're protecting and which SDDC you want to failover into
- Groups: the protection groups which will be orchestrated during a recovery
- vCenter Servers: mapping the protected and the recovery vCenter Server
- vCenter Server folders: mapping the vCenter Server VMs and Template folder structure for both the protected and the recovery site.
- Compute resources: mapping the protected and recovery vSphere clusters
- Virtual networks: mapping the protected network port group into the recovery port groups.
- IP addresses: here you specify if and how you want to re-IP your workloads
- Scripts: you have the option to initiate scripts (VMware Tools required) post-failover
- Ransomware recovery (optional): if the service is enabled in the region, you have the option to enable ransomware recovery, which will initiate forensic analysis on the guest OS leveraging Carbon Black sensors
- Alert: for additional email notifications
Recovery plan compliance is checked every 30 minutes, to make sure all configurations are validated and still applicable.
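To give a feel for what a mapping check involves, here is a toy version for one mapping type (virtual networks). It is purely illustrative and not how the orchestrator is implemented: the function `check_mappings` and the protected-side network names are hypothetical, while sddc-cgw-network-1 is the compute segment of my recovery SDDC.

```python
def check_mappings(protected_items, recovery_items, mapping):
    """Return (missing, dangling) for a protected -> recovery resource mapping.

    missing:  protected items that have no mapping at all
    dangling: protected items mapped to a target that does not exist on the recovery side
    """
    missing = sorted(set(protected_items) - set(mapping))
    dangling = sorted(src for src, dst in mapping.items() if dst not in recovery_items)
    return missing, dangling


# Hypothetical protected-side port groups; the recovery segment is from this article.
protected_networks = ["outposts-app-segment", "outposts-db-segment"]
recovery_networks = ["sddc-cgw-network-1"]
network_mapping = {"outposts-app-segment": "sddc-cgw-network-1"}

missing, dangling = check_mappings(protected_networks, recovery_networks, network_mapping)
print("missing:", missing)    # prints missing: ['outposts-db-segment']
print("dangling:", dangling)  # prints dangling: []
```

The same shape of check applies to folders and compute resources: every protected item needs a target, and every target must still exist on the recovery side.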
DR Failover Test and DR Failover
Testing a DR Failover is essential to make sure the plan is solid and will work when the time comes to perform an actual disaster recovery failover. For this reason, you can perform a failover test, where you will be given the option to:
- select which snapshot (consistent across all protected VMs) to get from the Cloud FS
- stop or continue upon any error
- run the VMs live on the Cloud FS or perform a full storage migration from the Cloud FS to recovery vSAN SDDC
Once the DR plan has been tested and validated, testing it regularly should become a standard exercise.
Now, let's assume our on-prem VMC on AWS Outposts has experienced a fire and the rack is unrecoverable. Let's invoke a full DR Failover from the VCDR GUI. After selecting the DR plan, click on DR Failover and you will be asked:
- to review the compliance check and that everything is healthy
- to select the snapshot to restore the protection group to
- whether or not to stop on any runtime error
- to select the storage where you want the failed-over VMs to run into (live-mounted SCFS or recovery vSAN datastore)
- to review the failover plan
- to confirm the failover execution start
When a plan has finished executing and all of the steps in the running plan workflow have completed, it is mandatory to commit the failover, essentially confirming you are happy with the outcome of the recovery. A Failback Plan can be automatically created on your behalf in order to move the workloads back to the protected site, if and when it becomes available again. See Running a Recovery Plan for Failover.
Upon completion and commit of the DR failover, you can download a PDF report containing all the actions performed by the orchestrator; a one-page sample is shown here:
Failback Recovery Plan
The Failback process is essentially the opposite of the DR Failover. With one major difference: only the data that has changed (delta) will be "appended" back to the same snapshot which was used when the failover was invoked.
Quoting from the official documentation, Run a Failback Recovery Plan, the steps are as follows:
- virtual machines are powered off on the recovery SDDC.
- the last VM snapshot is taken with powered-off VMs. The differences between the VM state at the time of recovery and failback are then applied to the snapshot used for recovery to construct a VM backup on the cloud file system for subsequent retrieval.
- the VM backups are then retrieved to a protected site system using a forever incremental protocol.
- VMs are recovered (storage migration) to a protected site.
- upon successful recovery, the VMs are automatically deleted from the recovery SDDC.
I should point out that any new virtual machine created on the Recovery SDDC will be excluded from the Failback plan because it was not part of the protected site.
Failing back is invoked by selecting the automatically created Failback Recovery Plan and starting the FAILOVER FROM VMC wizard.
Once the failback has been performed, you will be able to see all the steps performed with task duration as well.
As with the DR Failover, you will be required to commit the changes to confirm you're happy with the outcome of the failback process, meaning you have tested and validated all VMs and services are working as expected.
Ransomware Recovery (Optional)
Ransomware recovery is an optional service that you can enable in your VMC Organisation. It allows you to restore workloads from the cloud file system into an air-gapped staging environment, also referred to as an Isolated Recovery Environment (IRE). After a ransomware attack, you can launch a recovery plan, specifying that you want to perform ransomware recovery and selecting virtual machines from a deep snapshot history.
These VMs are powered on in an IRE, where forensic analysis and validation can be performed post recovery, leveraging Carbon Black Cloud sensors installed on the guest operating systems.
Ransomware recovery is fully supported for VMC on AWS Outposts and works the same way as for your native on-prem vSphere or VMC on AWS SDDC. Covering Ransomware Recovery in full detail is out of scope for this article. For all the details, principles and documentation, please refer to the existing official documentation for VMware Cloud DR - Ransomware Recovery.
Summary
In this article we have covered how we can protect an SDDC running on-prem as VMC on AWS Outposts (VMC-O) from disaster.
More specifically, we have established that VCDR can protect and successfully fail over and fail back workloads running on VMC-O while adhering to the processes documented for protecting VMC on AWS with VCDR. The same configuration constraints regarding availability zones for VMC on AWS and VCDR apply to VMC-O.