VMware Cloud Disaster Recovery - Implementation Practices Guide

Overview

Introduction

This guide covers implementation practices that help improve the setup and operation of VMware Cloud Disaster Recovery.

VMware Cloud Disaster Recovery provides a framework for protecting VMware workloads, either in on-premises vCenter environments or in VMC on AWS SDDCs. Workloads are protected to a cloud-based repository of recovery points, the Scale-out Cloud File System (SCFS), which can be used to fail over VM operations to a Recovery SDDC running in VMware Cloud on AWS.

This document will evolve over time as new practices and areas of consideration are added. Whenever possible, use the dynamic HTML link to the guide to reference the latest content.

Purpose of This Guide

This guide takes you through several areas of using VMware Cloud Disaster Recovery in a variety of situations. The goal is to provide insights and criteria that can be applied to your specific deployment scenarios.

For more information, see the Product Documentation.

Audience

This guide is intended for IT administrators and product evaluators who are familiar with VMware Cloud Disaster Recovery. Familiarity with VMware vSphere and VMware vCenter Server, as well as familiarity with networking and storage in a virtual environment, Active Directory, identity management, and directory services, is assumed.

Prerequisites

This guide assumes a basic working knowledge of VMware Cloud on AWS, VMware Cloud Disaster Recovery, vSphere and vCenter environments.

Protecting Workloads

This section focuses on the elements of VMware Cloud Disaster Recovery that pertain to protecting VM workloads in preparation for disaster recovery failover.

Protecting workloads with VMware Cloud DR involves setting up the Protected Site(s) and defining Protection Group policies.

These are documented in more detail in the product documentation.

Implementation considerations for these two operational setup tasks are discussed below.

Protected Sites

Understanding how the data flows from the Protected Site DRaaS Connector(s) to the Scale-out Cloud File System (SCFS) will help in designing and implementing a good site topology.

Each VMware Cloud DR instance can support multiple Protected Sites, and each Protected Site can support multiple DRaaS Connectors and multiple vCenter Server registrations. The current configuration limits on these are found in the online ConfigMax tool. Each DRaaS Connector has a recommended limit on the number of VMs from the vCenter inventories that it will need to handle.

These limits need to be considered to provide optimal resiliency and performance.

Some other logistical considerations are outlined below.

Practice – network connectivity for replication traffic

Note that VMware Cloud DR does not carry or configure user traffic to failed over VMs. Please see the VMC on AWS documentation for user traffic options.

The replication connection can be configured over a public interface (internet or an AWS public VIF) or over a private Direct Connect (DX) interface:

  • public internet, including AWS Public VIF Direct Connect (DX) – with this option VMware Cloud DR securely replicates VMs via SSL over the public internet to the SCFS service. When an AWS public VIF is used, traffic enters on your private side of the connection and exits on the public side of the AWS network.
  • private interface via Direct Connect (DX) – alternatively, customers can use an AWS Private VIF connection if they wish to keep all replication off public infrastructure, whether over the internet or within the AWS network.

Recommendation: use the network connection method that provides your organization the appropriate bandwidth and security. The available bandwidth may be greater than the bandwidth actually used, based on a number of other architectural choices (e.g., concurrency) implemented in the Protected Site setups.

Practice – connector instances

While a single connector appliance per Protected Site is supported, it is highly recommended that you deploy at least two connectors. Multiple connectors within a site improve load balancing and increase the availability of the protection tasks should something happen to one of the connectors.

There is a limit to the amount of additional bandwidth achieved with additional connectors from a given site. Current practices show that ~4-5 connectors from a single Protected Site will likely achieve the maximum network flow bandwidth for that site.

A single VMware Cloud DR instance has an upper limit on the total number of Protected sites or DRaaS Connectors supported.

Recommendation: check the current ConfigMax levels when designing the overall topology of sites and number of connectors planned so that you stay within the bounds of each deployed VMware Cloud DR instance.
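As an illustration of this sizing exercise, the short Python sketch below estimates how many DRaaS Connectors a Protected Site might need. The per-connector VM limit and the per-site connector ceiling are placeholder values, not published limits; always confirm the current numbers in the ConfigMax tool.

  # Rough connector-count estimate for a Protected Site.
  # The limits below are ILLUSTRATIVE placeholders; check the ConfigMax
  # tool for the real values before planning a deployment.
  import math

  VMS_PER_CONNECTOR = 500        # hypothetical recommended VMs per connector
  MAX_CONNECTORS_PER_SITE = 5    # ~4-5 connectors saturate site bandwidth (see above)

  def connectors_needed(protected_vm_count: int) -> int:
      """Return a suggested connector count: enough for the VM load,
      never fewer than two for availability, capped at the point where
      extra connectors stop adding network throughput."""
      by_load = math.ceil(protected_vm_count / VMS_PER_CONNECTOR)
      return min(max(by_load, 2), MAX_CONNECTORS_PER_SITE)

  if __name__ == "__main__":
      for vms in (80, 600, 1800, 4000):
          print(f"{vms:5d} protected VMs -> {connectors_needed(vms)} connectors")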

Practice – Protected Site topology considerations

A Protected Site is a combination of one or more DRaaS Connectors and one or more registered vCenter servers. The VMs within the inventory of the vCenter server(s) get protected through the associated DRaaS Connector operations. Each DRaaS Connector gets deployed to an ESXi host/cluster managed by the vCenter server. In some situations, a single vCenter may be managing multiple physical locations.

The DRaaS Connectors within a Protected Site are stateless with respect to the VMs that they are tasked to protect. For sites where a single vCenter may have multiple physical sites under its control, a connector could end up accessing a VM in a different cluster than where the connector is actively running.

Recommendation: for more complicated Protected site configurations, work with the VMware Cloud DR support team to pin your connectors to the clusters from which they are replicating the VMs.

Practice – PCI Compliant SDDC

  • Protected Site SDDC – due to a requirement to access the NSX-T APIs during the configuration of the Protected Site, you will need to work with the VMware Cloud GSS Support Team to have PCI compliance temporarily turned off. Once the SDDC is protected you can turn PCI compliance back on.
  • Recovery Site SDDC – deploying a new SDDC – due to a requirement to access the NSX-T APIs during the deployment of VMware Cloud DR, it is recommended that VMware Cloud DR be deployed and attached to the Recovery SDDC before turning on PCI compliance.
  • Recovery Site SDDC – attaching an existing SDDC – due to a requirement to access the NSX-T APIs during the attachment, you will need to work with the VMware Cloud GSS Support Team to have PCI compliance temporarily turned off. Once the SDDC is attached you can turn PCI compliance back on.

Recommendation: for more complicated PCI compliance scenarios, work with the VMware Cloud GSS support team to adjust the NSX-T configuration.

Protection Groups

First, let’s cover some Protection Group (PG) basics (a short illustrative sketch follows the list below). Each PG policy defines:

  • an inventory of VMs based on selection criteria (name patterns, folder location, vCenter tags)
  • a choice of snapshot method
    • either Standard Frequency for 4-hour RPO or High-Frequency for 30-minute RPO – note that Standard Frequency snapshots also support guest level quiesce capabilities as outlined in the documentation
  • a set of schedules for taking snapshot recovery points and replicating data to the SCFS
  • a retention period for each schedule for how long to keep the recovery points in the SCFS inventory
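To make these building blocks concrete, the sketch below models a PG policy as a plain Python data structure. The field names and values are illustrative only; they do not correspond to any VMware Cloud DR API object.

  # Illustrative model of a Protection Group policy (not an actual API object).
  from dataclasses import dataclass, field
  from typing import List

  @dataclass
  class Schedule:
      frequency_hours: float    # how often a recovery point is taken
      retention_days: int       # how long each recovery point is kept in the SCFS

  @dataclass
  class ProtectionGroup:
      name: str
      name_patterns: List[str] = field(default_factory=list)   # e.g. ["prod-web-*"]
      folders: List[str] = field(default_factory=list)          # vCenter folder paths
      tags: List[str] = field(default_factory=list)             # vCenter tag names
      high_frequency: bool = False                               # False = Standard Frequency
      schedules: List[Schedule] = field(default_factory=list)

  # Example policy: tag-selected inventory, high-frequency snapshots,
  # 30-minute recovery points kept for a week plus dailies kept for a month.
  pg = ProtectionGroup(
      name="prod-critical",
      tags=["vcdr-protect"],
      high_frequency=True,
      schedules=[Schedule(0.5, 7), Schedule(24, 30)],
  )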

Practice – inventory considerations and DR plans

VMs can belong to more than one PG, and those VMs will be processed based on each active PG policy. When VMs are included in multiple PGs, there may be unnecessary snapshot schedules and overlap that are not readily apparent. A PG definition can support up to 2,000 total snapshots, which allows for a robust schedule/retention plan and avoids creating multiple PGs just for scheduling.

Also, each VM in the PG inventory must be processed by a DR plan Recovery step. These steps include:

  • Recover all VMs in the PG as a single Recovery step
  • Recover individual VMs from a specific PG
  • Recover all remaining VMs, files and groups as a “catch-all” for the DR plan

The granularity applied when building PGs can have a direct effect on the flexibility and processing of DR plans.

Recommendation: do not include VMs in multiple PGs unless there is a specific reason to do so.
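One way to spot unintended overlap is to compare the resolved VM inventories of each PG before building DR plans. The sketch below, which assumes you can export each PG's VM list (for example, by hand from the Orchestrator inventory views), flags VMs that appear in more than one group.

  # Flag VMs that belong to more than one Protection Group.
  # Assumed input format: a mapping of PG name -> list of VM names,
  # e.g. exported by hand from the Orchestrator inventory views.
  from collections import defaultdict
  from typing import Dict, List

  def find_overlaps(pg_inventories: Dict[str, List[str]]) -> Dict[str, List[str]]:
      membership = defaultdict(list)
      for pg_name, vms in pg_inventories.items():
          for vm in vms:
              membership[vm].append(pg_name)
      # Keep only VMs that appear in two or more PGs.
      return {vm: pgs for vm, pgs in membership.items() if len(pgs) > 1}

  if __name__ == "__main__":
      inventories = {
          "pg-app-web": ["web-01", "web-02", "db-01"],
          "pg-databases": ["db-01", "db-02"],
      }
      for vm, pgs in find_overlaps(inventories).items():
          print(f"{vm} is in multiple PGs: {', '.join(pgs)}")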

If a DR plan is run – either for a test or a failover – and a VM already exists in the Recovery Site that matches a VM in the running plan, there will be a DR plan error. The error can be ignored with a manual administrator response.

Recommendation: if protecting more than one site to the same Recovery site, make sure that VM naming conventions are unique across and between sites to avoid inventory conflict should more than one site need to use the Recovery site at the same time.
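A simple pre-check along the lines of the following sketch can catch naming collisions before they surface as DR plan errors. The per-site prefix convention and the site names are illustrative assumptions; the check only verifies that each site's VM names follow a site-unique pattern.

  # Sanity-check that each site's VM names carry a site-unique prefix,
  # so inventories cannot collide in a shared Recovery SDDC.
  from typing import Dict, List

  SITE_PREFIXES = {"site-nyc": "nyc-", "site-lon": "lon-"}   # illustrative convention

  def nonconforming(site_vms: Dict[str, List[str]]) -> Dict[str, List[str]]:
      bad = {}
      for site, vms in site_vms.items():
          prefix = SITE_PREFIXES[site]
          offenders = [vm for vm in vms if not vm.startswith(prefix)]
          if offenders:
              bad[site] = offenders
      return bad

  if __name__ == "__main__":
      inventory = {
          "site-nyc": ["nyc-dc01", "nyc-sql01"],
          "site-lon": ["lon-app07", "dc01"],   # "dc01" breaks the convention
      }
      for site, vms in nonconforming(inventory).items():
          print(f"{site}: rename {', '.join(vms)} to match the '{SITE_PREFIXES[site]}' prefix")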

Practice – inventory mappings to the Recovery SDDC

PGs are the basic building blocks for the inventory specified for failover (and failback) for DR plans. All VMs in a PG are processed by the Orchestrator when that PG is included in a DR plan. In the construction of the DR plan, you will need to establish mappings (folders, resources, networks) for each VM in the PG inventory. The source side of the mapping will be discovered by the Orchestrator, the destination side targets will need to be created in the Recovery SDDC before the mapping can be established in the DR plan.

Note that currently, DR plan mappings are a 1-1 relationship; you cannot map multiple source (Protected Site) items onto the same destination (Recovery Site) item.

Recommendation: create the Recovery SDDC mapping targets (networks, folders, and resources) before constructing the DR plan; otherwise, you may need to exit the plan construction simply to add another folder, resource, or network target for the desired mappings.
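Because the mappings are strictly one-to-one, it can be worth validating a planned mapping table before entering it in the Orchestrator. The sketch below is a generic check on a hand-maintained mapping dictionary, not an interaction with the product; the network names are illustrative.

  # Validate a planned source -> destination mapping table for a DR plan.
  # Mappings must be one-to-one: no two source items may share a destination.
  from collections import Counter
  from typing import Dict

  def duplicate_destinations(mapping: Dict[str, str]) -> Dict[str, int]:
      counts = Counter(mapping.values())
      return {dest: n for dest, n in counts.items() if n > 1}

  planned_network_mapping = {
      "onprem-pg-web":  "sddc-seg-web",
      "onprem-pg-app":  "sddc-seg-app",
      "onprem-pg-mgmt": "sddc-seg-app",   # conflict: two sources -> one destination
  }

  for dest, n in duplicate_destinations(planned_network_mapping).items():
      print(f"'{dest}' is targeted by {n} source networks; DR plan mappings must be 1-1")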

Practice – what to protect – what not to protect

VMware Cloud Disaster Recovery protects VMs that are in the vCenter inventory registered to a Protected Site and whose VMDK files reside on a supported datastore with a supported VM configuration. For the most part, this covers a broad range of application VMs, but there are some considerations for what to exclude from a VMware Cloud DR snapshot protection method.

It is always good to check the latest documentation and release notes, but the following list identifies many situations that are not covered by VMware Cloud DR snapshots:

  • RDMs, Guest Attached, independent disks, and non-Datastore disks
  • Service VMs, including Tanzu containers
  • Windows 11 VMs
  • Any VM configured with vTPM
  • Shared disks and clustered VMs

Recommendation: Review the inventory that is required for disaster recovery and identify alternate methods (e.g., application level, HCX, etc.) that can be used to enable DR for VMs that are required but not supported with VMware Cloud DR methods.
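When auditing the inventory, a simple exclusion filter like the one sketched below can separate VMs that VMware Cloud DR snapshots can cover from those that need an alternate DR method. The VM attributes are assumed fields in a hand-built inventory export, not product fields; the exclusion list mirrors the items above and should be checked against the current release notes.

  # Split an inventory export into VMs coverable by VMware Cloud DR snapshots
  # and VMs that need an alternate DR method. The attribute names below are
  # assumptions about a hand-built inventory file, not product fields.
  from dataclasses import dataclass
  from typing import List, Tuple

  @dataclass
  class VmRecord:
      name: str
      has_rdm_or_independent_disk: bool = False
      is_service_vm: bool = False          # e.g. Tanzu container service VMs
      guest_os: str = ""
      has_vtpm: bool = False
      uses_shared_disks: bool = False      # clustered VMs

  def needs_alternate_dr(vm: VmRecord) -> List[str]:
      reasons = []
      if vm.has_rdm_or_independent_disk:
          reasons.append("RDM/independent/non-datastore disk")
      if vm.is_service_vm:
          reasons.append("service VM")
      if vm.guest_os.lower().startswith("windows 11"):
          reasons.append("Windows 11 guest")
      if vm.has_vtpm:
          reasons.append("vTPM configured")
      if vm.uses_shared_disks:
          reasons.append("shared disks / clustered")
      return reasons

  def split_inventory(vms: List[VmRecord]) -> Tuple[List[str], List[Tuple[str, List[str]]]]:
      coverable, excluded = [], []
      for vm in vms:
          reasons = needs_alternate_dr(vm)
          if reasons:
              excluded.append((vm.name, reasons))
          else:
              coverable.append(vm.name)
      return coverable, excluded

  vms = [VmRecord("app-01"), VmRecord("sql-clu-01", uses_shared_disks=True)]
  ok, needs_other_method = split_inventory(vms)
  print("coverable:", ok)
  print("needs alternate DR:", needs_other_method)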

Practice – Protection Group scope, size, and granularity

Outside of scalability concerns for the total number of VMs to protect, the first choice in designing PG policies is the number and purpose of Protection Groups. In a small environment, it might make sense to put everything into a single or very small number of protection groups.

The design of the Protection Groups for the environment to protect is driven by a few factors:

  • Scope – what is the role of the VMs in the inventory selection
  • Size – how many VMs (and VMDKs) are being protected
  • Granularity – what is the recovery set size – this can affect certain aspects of protection and failover concurrency

Recommendation: Keep the number of protection groups to the minimum needed to satisfy business requirements. Unnecessary creation of more protection groups can add complexity. In other words, keep things as simple as possible as this leads to more reliable disaster recovery.

In summary we have:

Single Protection Group

Pro

  • Simple to manage – all of the VM inventory is in one list
  • Good for small environments – 10s of VMs
  • Simple schedules – one policy for all VMs (uniform RPO and retention)

Con

  • Everything in the PG must be part of the DR plan failover
  • Minimal concurrency for snapshot and replication tasks
  • Less control over the granularity of operations and policies

Multiple Protection Groups

Pro

  • More flexible recovery policy definition and selection (e.g., per application)
  • Easier to order recovery tasks – per PG instead of per VM
  • Scalability of DR plans beyond single-PG limits

Con

  • Ensuring inventory coverage is complete – need to combine VM lists from multiple plans
  • More management points – potential for problems
  • Potential for VM inventory inclusion overlap

Concerns:

  • Replication failures impact only the VMs in the affected PG; unaffected PGs will continue to replicate.

There are situations where a single VM or just a few VMs per PG makes a lot of sense, or is a requirement. For example, typical VMs that fit in this category are:

  • Microsoft SQL Server, Exchange, or Active Directory VMs that require VSS when the rest of the fleet does not need VSS or is using High-Frequency Snapshots.
  • All VMs for a single application, or possibly multiple applications serviced by a single shared VM (e.g., a database server).

Practice – inventory selection method

When a PG is created, it will have one or more snapshot schedules associated with it. Every time a schedule runs, it re-evaluates the inventory specification, providing a dynamic inventory capability. Consider how to maximize this auto-protect coverage of PGs and minimize the chance that a newly created VM does not get included in the desired PGs.

As stated earlier, there are 3 methods for including a VM into a PG policy. They are: Naming Pattern, Folder Location, and vCenter Tags. You can apply one or more of these in the Protection Group policy setup to provide the proper inventory selection for each scenario.
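The sketch below illustrates how a dynamic, criteria-based selection behaves: each time it runs it re-evaluates name patterns (with * wildcards), folder paths, and tags against the current inventory, so newly created VMs that match are picked up automatically. It is a generic illustration, not the product's matching implementation, and the VM names, folders, and tag values are invented.

  # Generic illustration of criteria-based (auto-protect) inventory selection.
  # This is NOT the product's matching code; it only shows the behaviour of
  # re-evaluating name patterns, folders, and tags on every schedule run.
  import fnmatch
  from dataclasses import dataclass, field
  from typing import List, Set

  @dataclass
  class Vm:
      name: str
      folder: str
      tags: Set[str] = field(default_factory=set)

  def select_inventory(vms: List[Vm], name_patterns=(), folders=(), tags=frozenset()):
      """Return the names of VMs matching ANY of the supplied criteria."""
      selected = []
      for vm in vms:
          by_name = any(fnmatch.fnmatch(vm.name, p) for p in name_patterns)
          by_folder = vm.folder in folders
          by_tag = bool(tags & vm.tags)
          if by_name or by_folder or by_tag:
              selected.append(vm.name)
      return selected

  inventory = [
      Vm("prod-web-01", "/Datacenter/vm/prod", {"vcdr-protect"}),
      Vm("prod-web-02", "/Datacenter/vm/prod", set()),
      Vm("test-web-01", "/Datacenter/vm/test", set()),
  ]
  print(select_inventory(inventory, name_patterns=["prod-*"]))   # name pattern
  print(select_inventory(inventory, tags={"vcdr-protect"}))      # vCenter tag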

Naming Pattern

Pro

  • Easy to adopt in a small deployment
  • Can use the * wildcard to make sure every VM is protected
  • Can use naming patterns to exclude VMs equally well

Con

  • VM name changes can disrupt the protection
  • Requires a very well-defined naming convention
  • Difficult to manage a large VM inventory – easy to pick up unwanted items

Folder Location

Pro

  • Easy to use, intuitive and very “user friendly”

Con

  • A VM can only be associated with a single folder, and this can be incompatible with other solutions or company rules

vCenter Tags

Pro

  • Easy to scale
  • Multiple tags can be associated with a VM
  • Easy to add or remove a VM from a PG

Con

  • The same vCenter Tags are required in the Recovery SDDC or there will be compliance errors in related DR plans

NOTE: there is a concern that VMs included in multiple PGs could lead to excessive snapshot processing and inventory mapping issues when the overlapping PGs are used in DR plans.

Recommendation: review Protection Group inventories and schedules and adjust as needed. Snapshot recovery point inventories can also be reviewed from the Virtual Machine view in the Orchestrator.

Practice – snapshot method

VMware Cloud DR provides two snapshot methods that can be configured for a Protection Group.

Standard Frequency Snapshots – this method allows for a minimum supported frequency of 4 hours, uses VMware vSphere Storage APIs – Data Protection, and provides the capability to quiesce the guest VM through VMware Tools.

Standard Frequency Snapshots

Pro

  • Industry standard – long time in the market
  • VSS support
  • Recovers previous unfinished replications
  • Can protect with VMware Cloud DR and other replication products (*restrictions apply)
  • Supported for versions of vSphere prior to 7.0.3
  • VMware Tools will flush the disk memory buffer

Con

  • VM stun sometimes disrupts guest operations
  • RPO is supported only down to a 4-hour schedule frequency
  • Must offset scheduling times from 3rd-party backup products that use VADP
  • Security: requires access to port 902 on each ESXi host

High Frequency Snapshots

Pro

  • Does not stun the VM
  • Can get down to every 30 minutes
  • Security: the connection is established by the host rather than the connector
  • 3rd-party backups are not impacted by VMware Cloud DR replication
  • Uses a more recently developed technology

Con

  • No VSS currently available
  • First enablement on a host can stun the VMs
  • Best done after hours – 3rd-party backups could cause replication failures; there are some timing-overlap considerations
  • Cannot use VAIO-based technologies (SRM, Zerto, RecoverPoint, etc.) on those same VMs
  • No checkpointing

Quiesce note: https://docs.vmware.com/en/VMware-Cloud-Disaster-Recovery/services/vmware-cloud-disaster-recovery/GUID-DBF1DFD5-F956-4ED9-AF06-95664D3AA89D.html

Recommendation: apply the snapshot method that best suits the VM protection and recovery needs for snapshot frequency and guest OS quiesce integration.
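As a way of capturing the decision logic above, the small sketch below chooses a snapshot method from two requirements stated for a workload. The thresholds come straight from the characteristics listed in the tables; the function itself is only an illustration, not a product feature.

  # Illustrative decision helper for choosing a Protection Group snapshot method.
  def choose_snapshot_method(needs_guest_quiesce: bool, target_rpo_hours: float) -> str:
      if needs_guest_quiesce and target_rpo_hours < 4:
          # Conflicting requirements: VSS quiesce needs Standard Frequency,
          # but Standard Frequency cannot go below a 4-hour RPO.
          return "review requirements: quiesce and sub-4-hour RPO cannot be combined"
      if needs_guest_quiesce:
          return "Standard Frequency (supports VMware Tools/VSS quiesce, 4-hour minimum RPO)"
      if target_rpo_hours < 4:
          return "High Frequency (down to 30-minute RPO, no quiesce, no VM stun)"
      return "either method meets the stated RPO; compare the other pros and cons above"

  print(choose_snapshot_method(needs_guest_quiesce=True,  target_rpo_hours=24))
  print(choose_snapshot_method(needs_guest_quiesce=False, target_rpo_hours=0.5))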

Practice – snapshot schedules

When defining the Protection Group policies, the schedule structure is the next most important consideration. The schedules define the RPO (frequency) and the retention (depth of recovery points) you will have for each PG.

What is the ideal number of schedules per Protection Group? The answer varies based on specific use cases and SLA needs, but the goal should be to use the least number of schedules to accomplish the business goals and to try to avoid multiple PGs just to address schedule needs for the same VMs.

It is also worth noting that schedules within a PG that occur at the same time will be combined into a common recovery point. For example, if the hourly snapshot occurs at the same time as the daily snapshot (e.g., at midnight), only one recovery point is created; it addresses both point-in-time references and is retained for the longer of the coinciding retention periods.

Backup schedules default to starting at midnight and running at :00 and :30 minutes past the hour, but you can stagger PG schedules to avoid triggering large numbers of simultaneous backup requests.
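The sketch below illustrates both behaviours: recovery points whose timestamps coincide collapse into a single retained point, and staggering PG start offsets spreads out simultaneous snapshot requests. It is a scheduling illustration only, assuming hypothetical PG names and offsets, and does not reflect how the service implements either behaviour.

  # Illustration of two scheduling behaviours described above:
  #  1. coinciding schedules produce one recovery point, kept for the longer retention;
  #  2. staggering PG start times spreads out simultaneous snapshot requests.
  from datetime import datetime, timedelta

  def recovery_points(day: datetime, schedules):
      """schedules: list of (interval, retention) tuples; returns {timestamp: retention}."""
      points = {}
      for interval, retention in schedules:
          t = day
          while t < day + timedelta(days=1):
              # A timestamp shared by two schedules yields ONE point with the longer retention.
              points[t] = max(points.get(t, timedelta(0)), retention)
              t += interval
      return points

  day = datetime(2024, 1, 1)
  combined = recovery_points(day, [
      (timedelta(hours=1), timedelta(days=7)),    # hourly, kept 1 week
      (timedelta(days=1),  timedelta(days=30)),   # daily at midnight, kept 1 month
  ])
  print(len(combined), "recovery points in the day (not 25: midnight is shared)")
  print("midnight point kept for", combined[day].days, "days")

  # Staggering: give each PG a different minute offset so schedules do not all fire together.
  pgs = ["pg-web", "pg-app", "pg-db", "pg-infra"]
  for i, pg in enumerate(pgs):
      print(f"{pg}: start snapshots at :{(i * 15) % 60:02d} past the hour")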

What we see in practice are typically 3 levels of schedules as follows:

  • Operational recovery – shorter frequency and shorter retentions – for example 30 minutes to dailies
  • Ransomware recovery – slightly longer frequencies and retentions – these will typically span weeks or months based on malware dwell time analysis objectives
  • Compliance recovery – longer term – both frequency and retention – these tend to be monthly and kept for years

A good practice is to use multiple schedules to give retention depth, simplicity, and efficiency. Here is an example of what those schedule/retention policies might look like for each use case:

  • Operational DR
    • 30min-12hour @ 1 week – Critical Systems
    • Daily @ 1-2 week
    • Weekly @ 1-2 month
  • Ransomware Recovery
    • 30min-12hour @ 1 Week – Critical Systems
    • Daily @ 1-2 Month
    • Weekly @ 6-9 Months
  • Long Term Archive / Compliance
    • Daily @ 1 Month
    • Weekly @ 6 Months
    • Monthly @ 1+ Year

With the ability to retain up to 2,000 recovery points per Protection Group, you can see that just a few schedules provide a robust solution and leave plenty of room for expansion if needed. For example, the following four schedules combine all three needs and consume fewer than 500 recovery points (even fewer when coinciding schedules overlap); a short calculation sketch follows the list:

  • 30 minutes for 1 week = 336
  • Daily for 2 months = 60
  • Weekly for 6 months = 26
  • Monthly for 3 years = 36
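The arithmetic behind these counts is straightforward and easy to reproduce for your own schedule designs, as the short sketch below shows (it ignores the small additional savings from coinciding schedules).

  # Reproduce the recovery-point arithmetic for the example schedules above.
  # Counts are upper bounds; coinciding schedules would reduce them slightly.
  schedules = [
      ("every 30 minutes, kept 1 week", 48 * 7),   # 48 snapshots/day * 7 days = 336
      ("daily, kept 2 months",          30 * 2),   # ~30 days/month * 2        = 60
      ("weekly, kept 6 months",         26),       # ~26 weeks in 6 months     = 26
      ("monthly, kept 3 years",         12 * 3),   # 12 months/year * 3 years  = 36
  ]

  total = sum(count for _, count in schedules)
  for label, count in schedules:
      print(f"{label:32s} -> {count:4d} recovery points")
  print(f"{'total':32s} -> {total:4d} of 2,000 allowed per Protection Group")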

Recommendation: avoid constructing Protection Groups simply to enable another schedule, while at the same time defining the minimum number of schedules needed to meet the desired protection SLAs.

Recommendation: evaluate business SLAs against snapshot schedule frequency and set schedules to levels that provide the desired RPO – avoid excessive snapshots and replication traffic if not required – in other words, don’t set a schedule for every 30 minutes if every 2 hours is sufficient.

Recommendation: you can set recovery point retention for longer periods if unsure – it is easier to delete unwanted recovery points to reclaim space than to go back in time to enable deeper retention.

Recommendation: for environments with several or more Protection Groups, consider staggering the schedule times to allow better interleaving of snapshot and replication traffic over time.
