Well-Architected Design: VMware Cloud Disaster Recovery - Component Design

Introduction

VMware Cloud Disaster Recovery safeguards your virtual machines (VMs), whether they are hosted on-premises or in a VMware Cloud on AWS Software-Defined Data Center (SDDC). Virtual machines replicate to the Scale-out Cloud File System (SCFS) and can be restored to a VMware Cloud on AWS SDDC. VMware Cloud Disaster Recovery offers configuration and recovery options that can be used depending on organizational needs.

Scope of the Document

There are many aspects to consider while creating a disaster recovery solution. This guide details the VMware Cloud Disaster Recovery components and outlines configuration options and considerations.

Summary and Considerations

Use Case	Ransomware recovery; disaster recovery; disaster avoidance
Pre-requisites	VMware Cloud on AWS (VMC) subscription; VMware Cloud Disaster Recovery subscription; internet or Direct Connect connection from the protected site to the Scale-out Cloud File System; VMware Tools installed on all VMs enabled for ransomware recovery.
General Considerations/Recommendations	Size a single SCFS or multiple SCFS instances, depending on whether recovery targets one recovery SDDC or different recovery SDDCs; choose the Availability Zone and Region for the recovery SDDC; group VMs that are part of the same service or application into protection groups; create recovery plans based on the standard operating procedure (SOP) as defined by application dependencies and interoperability requirements.
Performance Considerations	Replication schedules that meet the required recovery point objective (RPO) and recovery time objective (RTO); sizing of DRaaS Connectors; creation of multiple cloud file systems to map to different recovery SDDCs.
Network Considerations/Recommendations	Bandwidth available for transferring replication data to the cloud file system; connection type from protected sites to the backup site (internet or Direct Connect).
Cost Implications	If the protected site is a VMware Cloud on AWS SDDC, egress replication data is subject to network costs according to AWS pricing. Multiple cloud file systems and multiple VMware Cloud on AWS recovery SDDCs may increase cost. See the VMware Cloud Disaster Recovery pricing page and FAQ page for further information on the costs involved.
Document Reference	VCDR Product Documentation
Last Updated	May 2023

VMware Cloud Disaster Recovery Solution Components

Cloud File System

VMware Cloud Disaster Recovery uses a cloud backup storage technology called the Scale-out Cloud File System (SCFS). SCFS provides storage for replicated virtual machines that can be restored in case of a disaster. This is also known as the cloud file system.

The cloud backup location provides:

Flexible recovery options - SCFS gives recovery plans flexible recovery options for both full-site failover and partial failover.
Scalability - The SCFS is highly scalable, enabling organizations to easily consume additional capacity as needed to meet their growing needs.
Cost-effectiveness - Using the SCFS can be a cost-effective disaster recovery solution. It eliminates the need for organizations to invest in additional hardware or infrastructure. SCFS with the on-demand SDDC deployment method can further increase cost efficiencies.
Offsite backup - Virtual machines and applications can be backed up to SCFS, providing an offsite backup location without the need to invest in a separate customer managed physical solution.

Below are the design considerations of the cloud file system to ensure proper functionality and performance:

Design Consideration	Description
SCFS capacity	The capacity of the SCFS at the backup site should be sufficient to hold all the virtual machine data being replicated. The capacity calculation must account for the initial and incremental data, the replication frequency, and the retention configuration. Ensure that there is enough space on the SCFS to avoid replication failures. Refer to the DR planning tool.
SCFS performance	The performance of the SCFS should be sufficient to handle the replication workload. If necessary, contact support for configuration guidance for any performance issue.
Data retention	The retention policy for data on the SCFS must be configured properly. This includes setting the retention period for replicated data and managing the storage utilization on the SCFS.
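The capacity consideration above can be illustrated with a rough calculation. The following is a minimal sketch, not the DR planning tool: the `ReplicationProfile` fields, the `headroom` margin, and the assumption that every retained snapshot stores an average-sized increment are all hypothetical, and the estimate ignores any deduplication or compression.

```python
from dataclasses import dataclass


@dataclass
class ReplicationProfile:
    """Hypothetical sizing inputs for one set of replicated VMs."""
    initial_data_gib: float         # size of the first full replication
    change_per_snapshot_gib: float  # assumed average incremental data per snapshot
    snapshots_per_day: int          # replication frequency
    retention_days: int             # how long snapshots are retained


def estimate_capacity_gib(profile: ReplicationProfile, headroom: float = 0.2) -> float:
    """Estimate required capacity: initial copy plus all retained incrementals.

    `headroom` is an assumed safety margin (20% by default), not a product value.
    """
    retained_snapshots = profile.snapshots_per_day * profile.retention_days
    incremental_gib = retained_snapshots * profile.change_per_snapshot_gib
    return (profile.initial_data_gib + incremental_gib) * (1 + headroom)
```

Raising the replication frequency or the retention period increases the number of retained snapshots, which is why both must be part of the capacity calculation.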

SaaS Orchestrator

The SaaS Orchestrator is a cloud-based orchestration tool that provides centralized management of protected sites, protection groups, disaster recovery plans and runbooks. It simplifies the disaster recovery process by automating the execution of disaster recovery plans, enabling organizations to recover quickly and easily in a disaster.

The SaaS Orchestrator allows administrators to create and manage disaster recovery plans, which are a series of steps and procedures that must be followed in a disaster. These plans can be customized to meet the specific needs of an organization and can include a variety of actions, such as powering on virtual machines, setting recovery priority, and restoring backups.

With the SaaS Orchestrator, administrators can easily monitor and manage disaster recovery operations from a single console, providing a centralized view of their entire disaster recovery environment. This enables them to quickly identify and resolve any issues that may arise, ensuring their critical applications and data are available as soon as possible in a disaster.

The SaaS Orchestrator has several design considerations to ensure proper functionality and performance:

Design Consideration	Description
Network connectivity	The SaaS Orchestrator requires network connectivity to the DRaaS Connectors at the protected site, the SCFS, and the vCenter Server. The network connection should have low latency and sufficient bandwidth to support the communication between the SaaS Orchestrator and these components.
Security	Security considerations are critical in a cloud-based service like the SaaS Orchestrator. VMware recommends that users configure firewalls to allow only necessary traffic and implement secure authentication mechanisms to prevent unauthorized access.
Integration	The SaaS Orchestrator must be integrated with other VMware Cloud Disaster Recovery components, including the SCFS and the vCenter Server. This integration ensures that recovery plans and operations are consistent and reliable.
Disaster recovery testing	The SaaS Orchestrator enables users to perform disaster recovery testing without disrupting production environments. Users should design their testing plans and processes to ensure that testing does not impact production environments and that it accurately reflects real-world scenarios.

Protected Site

The protected site in VMware Cloud Disaster Recovery can be either an on-premises vCenter or a VMware Cloud on AWS SDDC that is configured as the source environment where the VMs to be replicated are located. Once the protected site is established, the next step is to deploy a DRaaS Connector on the source vCenter, which connects to the SCFS and the SaaS Orchestrator.

Once a vCenter server is linked, VMs can be included in protection groups, which specify the VMs that will be replicated to the SCFS.

Refer to the “Before Setting Up a Protected Site” section for protected site design considerations.

DRaaS Connector

The DRaaS Connector manages the replication and synchronization of data between the on-premises environment or VMware Cloud on AWS SDDC (configured as a protected site) and the SCFS. This allows for fast and efficient replication of data, ensuring that the backup site is always up to date and ready to fail over when required.

The DRaaS Connector is a lightweight appliance that allows for fast and secure data replication and synchronization from the protected site to the cloud file system.

The DRaaS Connector is deployed in the protected site, which can be an on-premises environment or a VMware Cloud on AWS SDDC. The DRaaS Connector is also responsible for initiating and maintaining secure communication with the SaaS Orchestrator and SCFS. It communicates over a secure HTTPS connection, which is established using the Transport Layer Security (TLS) protocol.

Below are a few considerations while sizing the DRaaS connector configuration for VMware Cloud Disaster Recovery:

Design Consideration	Implication
Sizing and scaling	The number of DRaaS connectors depends on the number of virtual machines that need to be protected and their sizes. Larger VMs may require more resources to be protected.
Replication frequency	The frequency of replication can affect the ideal number of DRaaS Connectors, as more frequent replication can require more resources.
Network bandwidth	The performance of the DRaaS Connector in VMware Cloud Disaster Recovery (VCDR) can be influenced by the available network bandwidth between the protected site and the cloud backup site. It's important to take into account the network bandwidth between these sites, as well as the maximum achievable bandwidth of each DRaaS Connector, when designing the network infrastructure.
Recovery Time Objective (RTO) and Recovery Point Objective (RPO)	Although the organization's RPO requirements do not directly determine the number of DRaaS Connectors, the increased replication load that comes with stricter requirements can call for additional connectors.
Hardware specifications	The ESXi host specifications on the protected site where the DRaaS Connector is deployed can also impact replication performance and sizing requirements. Refer to the DRaaS Connector system and network requirements documentation.
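The interplay between replication load, bandwidth, and connector count can be sketched with simple arithmetic. This is an illustrative estimate only: the function names are hypothetical, and `per_connector_mbps` is an assumed input that should be taken from the product documentation or from measurement, not from this sketch.

```python
import math


def required_throughput_mbps(changed_data_gib: float, window_hours: float) -> float:
    """Average throughput (megabits per second) needed to transfer the
    changed data within the replication window."""
    if window_hours <= 0:
        raise ValueError("window_hours must be positive")
    bits = changed_data_gib * 1024**3 * 8
    return bits / (window_hours * 3600) / 1_000_000


def connectors_needed(required_mbps: float, per_connector_mbps: float) -> int:
    """Number of connectors needed if each one sustains `per_connector_mbps`.

    `per_connector_mbps` is an assumption supplied by the caller.
    """
    if per_connector_mbps <= 0:
        raise ValueError("per_connector_mbps must be positive")
    return max(1, math.ceil(required_mbps / per_connector_mbps))
```

For example, moving 100 GiB of changed data within a 4-hour window needs roughly 60 Mbps on average; halving the window doubles the requirement, which is how a tighter RPO translates into more bandwidth and possibly more connectors.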

Protection Group

A protection group is a grouping of one or more virtual machines and their associated policies, such as replication frequency and retention settings, within a logical container. All VMs included in a protection group must reside on the same protected site. Creating a protection group requires an existing subscription with a cloud file system and a configured vSphere protected site.

By using protection groups, the organization can manage replication and failover for their diverse types of applications separately. They can prioritize the recovery of their mission-critical applications, ensuring that they are brought online first in the event of a disaster. This approach can help organizations to minimize downtime and data loss during a disaster and maintain business continuity.

The following are the components of a protection group:

Protected Site selection (on-premises or VMware Cloud on AWS SDDC vCenter Server)
Group Membership (VMs)
Snapshot policies (schedule, retention)
Standard or high frequency snapshot selection
Guest OS quiescing

Even though a protected site can have up to 4 registered vCenters defining the available VM inventory, you cannot create a protection group that contains VMs from two different vCenter Servers. An organization can use protection groups in VMware Cloud Disaster Recovery to replicate and fail over their virtual machines to a disaster recovery site.
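The single-vCenter rule for protection group membership can be expressed as a simple pre-check when planning groups. The sketch below is a planning aid under stated assumptions: `VirtualMachine` and `validate_group_members` are hypothetical names and do not correspond to any VMware Cloud Disaster Recovery API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VirtualMachine:
    """Hypothetical inventory record used only for this illustration."""
    name: str
    vcenter: str  # the vCenter Server that manages this VM


def validate_group_members(vms: list[VirtualMachine]) -> str:
    """Return the single vCenter shared by all members, or raise ValueError."""
    if not vms:
        raise ValueError("a protection group needs at least one VM")
    vcenters = {vm.vcenter for vm in vms}
    if len(vcenters) > 1:
        raise ValueError(
            "VMs from different vCenter Servers cannot share a protection group: "
            + ", ".join(sorted(vcenters))
        )
    return vcenters.pop()
```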

As an illustration of how protection groups can be used, consider an organization with two datacenters, one in New York and the other in London, with mission-critical and non-critical virtual machines running in both. To ensure that critical applications are prioritized in the event of a disaster, the organization can create two protection groups: one for mission-critical applications and another for non-critical applications.

The mission-critical protection group should include the most important virtual machines such as those running their ERP system, database servers, and other critical applications. The replication frequency should be high (more frequent) along with a longer retention policy to ensure that critical data is replicated frequently and retained for an adequate period. When constructing Recovery plans, this protection group should be set to be failed over first in the event of a disaster.

The non-critical protection group should include lower-priority virtual machines such as those running file servers, web servers, and other less critical applications. The organization can set a lower replication frequency and a shorter retention policy to conserve resources since these applications are less critical. In related Recovery plans, this protection group can be failed over after the mission-critical protection group has been brought online.

The previous example considers only some of the criteria that need to be evaluated when creating protection groups. The following table outlines other criteria that should be considered to determine the appropriate settings for protection groups based on an organization’s specific requirements.

Design Considerations	Options	Implication
Site selection	On-premises or VMware Cloud on AWS SDDC vCenter Server	Based on the location of the virtual machines and their associated policies.
Members	Virtual machines	Based on their importance and criticality.
Policies for snapshots	Schedule, retention	Based on RPO objectives.
Cloud backup site	SCFS	Based on availability, scalability, and cost-effectiveness.
Replication frequency	High, standard	Based on the RPO SLAs of virtual machines.
Failover priority	Sequencing	Based on the importance and criticality of virtual machines.
Application type	Mission-critical or non-critical	Based on the importance of the application and its associated virtual machines.
RPO	30 minutes, 1 hour, 4 hours, 24 hours	Based on the amount of data that can be lost in case of a disaster.
RTO	1 hour, 4 hours, 12 hours, 24 hours	Based on the amount of time it takes to recover the virtual machines.
Retention policy	Days, weeks, or months	Based on the amount of data that needs to be retained for compliance and regulatory purposes.
Dependency	Yes or No	Based on whether another critical workload depends on this virtual machine.
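The failover priority and dependency criteria in the table above amount to an ordering problem: a workload should come online only after the workloads it depends on. One way to reason about this during planning is a topological sort. This is a minimal sketch using the Python standard library; the function name and the workload names in the usage note are hypothetical, and the product's own sequencing is configured in recovery plans rather than computed this way.

```python
from graphlib import CycleError, TopologicalSorter


def failover_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Return an order in which every workload appears after the
    workloads it depends on.

    `depends_on` maps each workload to the set of workloads that must
    be recovered before it.
    """
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError as exc:
        raise ValueError("circular dependency between workloads") from exc
```

For instance, `failover_order({"web": {"app"}, "app": {"db"}, "db": set()})` places the database before the application tier and the application tier before the web tier. A circular dependency is reported as an error, which is a useful signal that the grouping needs to be revisited.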

Recovery Site

The recovery site in VMware Cloud Disaster Recovery refers to the VMware Cloud on AWS SDDC where replicated VM data on the SCFS is mounted and recovered in the event of a disaster or protected site failure. The recovery site is designed to provide failover capabilities and ensure the continuity of critical applications and services. In VMware Cloud Disaster Recovery, there are two deployment models for the recovery site: Pilot Light and On-Demand. These models determine the level of resources and services provisioned at the recovery site. Refer to Deployment models for more detailed planning considerations.

Recovery Plan

In VMware Cloud Disaster Recovery, a recovery plan comprises a series of instructions that define the necessary steps to recover from a disaster and restore essential applications and systems. These plans automate the failover and failback processes, reducing the chances of human error and minimizing downtime. Multiple recovery plans at various stages can exist concurrently.

The recovery plan section in VMware Cloud Disaster Recovery offers various operations that allow administrators to manage and automate the recovery process.

Operations	Description
Create a recovery plan	Specify the virtual machines to be included in the plan, the order in which they should be powered on, and any custom scripts or commands that need to be run during the recovery process.
Configure failover settings	Configure failover or failback settings for each protection group, including inventory mapping and custom network settings.
Create Failback Plan	Specify the failback settings for each protection group, including the order in which protection groups are failed back and any custom network settings that need to be configured.
Test the recovery plan	Administrators can test the recovery plan using separate, test-specific mappings in the recovery plan to avoid impact on production operations and to ensure that the virtual machines are powered on correctly and the custom scripts or commands are executed properly.
Executing recovery plan tests	DR administrators can run recovery plan tests manually and non-disruptively, which allows regular testing so that any issues can be resolved before a disaster takes place.
Activate DR Plans	Administrators can activate or deactivate a recovery plan at any time, allowing for flexibility in managing the recovery process.

The following outlines the configuration options available during Recovery Plan creation:

Recovery Plan Step	Description
Resource Mapping	This enables the mapping of inventory between the protected site and the recovery site. There is an option to create an individual mapping for the test and actual recovery or create a single mapping applicable in both cases.
Scripts	Custom scripts can be used in the recovery process to automate certain tasks such as restarting services, updating configurations, or executing other necessary actions.
Recovery steps	These are the steps that need to be taken during the recovery process. Examples of recovery steps include recovery of all or only a few VMs, inclusion of multiple protection groups, power actions on recovered VMs, and pre-recovery and post-recovery actions on each VM.
IP address re-mapping (re-IP)	Re-IP, or IP address re-mapping, is the process of assigning new IP addresses to virtual machines after they have been recovered at the recovery site. This is necessary to ensure that the virtual machines can communicate with each other and with the external network.
Recovery priority	Recovery priority refers to the order in which virtual machines should be recovered during a disaster. Critical applications should be prioritized (ordered appropriately) to ensure that they are recovered first and are available as soon as possible.
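To make the re-IP step above concrete, the following sketch shows one common mapping rule: keep the host portion of the address and swap the network portion for the recovery site's subnet. It is an illustration under assumptions, not the product's mapping engine; the function name and the subnets in the usage note are hypothetical.

```python
import ipaddress


def remap_ip(address: str, source_cidr: str, target_cidr: str) -> str:
    """Map an address from the protected-site subnet to the same host
    offset in the recovery-site subnet."""
    source = ipaddress.ip_network(source_cidr)
    target = ipaddress.ip_network(target_cidr)
    ip = ipaddress.ip_address(address)
    if source.version != target.version or source.prefixlen != target.prefixlen:
        raise ValueError("source and target subnets must have the same version and size")
    if ip not in source:
        raise ValueError(f"{address} is not in {source_cidr}")
    # Preserve the host offset within the subnet.
    offset = int(ip) - int(source.network_address)
    return str(target.network_address + offset)
```

With this rule, `remap_ip("10.1.20.15", "10.1.20.0/24", "172.16.20.0/24")` yields `172.16.20.15`, so a VM keeps a recognizable host number after recovery.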

Monitoring and Reports

VMware Cloud Disaster Recovery's monitoring and reporting features allow administrators to detect potential issues proactively, keep the overall environment running smoothly, and ensure compliance with RPO requirements. VMware Cloud Disaster Recovery enables administrators to generate PDF reports for test failover and planned failover operations, as well as for recovery plan configuration changes and compliance checks.

The following are the key reporting features available in VMware Cloud Disaster Recovery:

Failover and Failback reports
Configuration reports
Compliance reports

By leveraging these reporting features, administrators can obtain useful insights into their virtual environment, identify potential issues, and optimize their disaster recovery and business continuity processes. For example, a test failover run can help validate the configuration of the recovery plan. For more information on VMware Cloud Disaster Recovery's monitoring and reporting capabilities, please refer to the monitoring page.

Additionally, here are some key monitoring considerations for a VMware Cloud Disaster Recovery solution:

Monitoring consideration	Description
Replication Status	Monitor the replication status of protection groups and each virtual machine to ensure that data is being replicated to the disaster recovery site. If there are any issues with replication, they should be addressed immediately.
Failover Status	Monitor the failover status to ensure that virtual machines are being powered on at the disaster recovery site as expected. If there are any issues with failover, they should be addressed immediately.
RPO and RTO Compliance	Monitor the RPO and RTO to ensure that they are being met. Any deviations from these objectives should be investigated to identify the root cause.
Network Performance	Monitor the network performance between the primary and disaster recovery sites to ensure that it is sufficient to handle the replication and failover traffic.
Storage Capacity	Monitor the storage capacity at the disaster recovery site to ensure that it has sufficient space to store replicated data.
Application Availability	Monitor the availability of critical applications and services at both the primary and disaster recovery sites to ensure that they are accessible to end-users.
System Health	Monitor the health of the VMware Cloud Disaster Recovery system, including the SaaS Orchestrator, DRaaS Connector, and recovery plans, to identify any potential issues that could impact the replication or failover processes.
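The RPO compliance consideration above reduces to comparing the age of the most recent successful snapshot with the RPO target. The sketch below shows that comparison only; the function name and the way the last snapshot time is obtained are assumptions, and it does not query VMware Cloud Disaster Recovery.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def rpo_violated(
    last_snapshot: datetime,
    rpo: timedelta,
    now: Optional[datetime] = None,
) -> bool:
    """Return True if the newest snapshot is older than the RPO target.

    `last_snapshot` and `now` are expected to be timezone-aware datetimes.
    """
    current = now or datetime.now(timezone.utc)
    return current - last_snapshot > rpo
```

A deviation flagged this way is a starting point for the root-cause investigation described in the table, for example checking replication status or network performance.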

Compliance Checks

Continuous compliance checks are performed to ensure that the recovery plan remains valid and effective even when the protected or failover environment undergoes changes. These checks confirm that the protection groups specified in the recovery plan are active on the protected site and are being replicated correctly to the target SCFS.

Compliance checks run automatically every 30 minutes for activated plans. A plan can fall out of compliance if any of its conditions is violated because of changes in the environment (such as the SCFS connected as NFS on the recovery SDDC) or in the plan configuration. You can generate and download compliance reports as a PDF or have them emailed on an automated schedule.

VMware Cloud Disaster Recovery provides the following compliance checks:

Protected site checks	Ensures that the Protected site resources associated with the VMware Cloud Disaster Recovery environment meet the required connection, replication schedule, inventory, and configuration compliance.
Recovery site checks	Ensures that the Recovery site resources associated with the VMware Cloud Disaster Recovery environment meet the network connectivity, SCFS availability, inventory, and Recovery SDDC readiness.
Orchestration compliance checks	Ensures that the VMware Cloud Disaster Recovery plan is aligned with the IP mapping between sites and with the recovery steps, including pre-execution or post-execution of scripts on VMs configured for recovery.
Other compliance checks	Other checks include compliance validation of the VMC proxy VM (connection and health) state, VMC folder structure for file recovery and VMC refresh token validity.

These compliance checks help organizations build awareness of the configuration state of the VMware Cloud Disaster Recovery solution so that they can maintain the integrity and availability of their IT infrastructure during a disaster or outage.

Settings

The settings page in VMware Cloud Disaster Recovery provides users with a centralized location to configure various settings related to their disaster recovery environment. Some of the settings that can be configured on the settings page include:

Settings	Configuration
API	API tokens can be obtained from VMware Cloud Services under the "My Account" section. This token is used to establish a secure connection between the VMware Cloud Disaster Recovery solution and the VMware Cloud on AWS SDDC to manage and automate failover and failback operations. By providing the VMware Cloud account token during the VMware Cloud Disaster Recovery setup process, users can automate the deployment of recovery SDDCs via the console and establish secure communication channels between the VMware Cloud Disaster Recovery solution and the vSphere environment for effective disaster recovery and business continuity management.
Email Alerts	Email alerts can be configured to notify administrators of various events related to disaster recovery. These alerts can be configured for events such as successful and unsuccessful failovers, replication failures, and data store issues (SCFS mounted as NFS on Recovery SDDC).
Security and Compliance	For improved security, the admin can restrict access to VMware Cloud Disaster Recovery using an IP address access list.
Direct Connect (DX)	Direct Connect for replication traffic from the protected site into the cloud file system is configured here. A Direct Connect connection to the AWS account needs to be configured before enabling traffic flow via DX here.
Ransomware Recovery Services	VMware Cloud Disaster Recovery provides an integrated ransomware recovery service for replication-enabled VMs. Integrated security and vulnerability analysis can be enabled or disabled here.
VMware Aria Operations for Logs	VMware Cloud Disaster Recovery enables integration with VMware Aria Operations for Logs to forward all events for audit and troubleshooting purposes. Configure the URL and key here to enable event forwarding.
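The IP address access list mentioned under Security and Compliance works on the general principle of matching a client address against a set of permitted addresses or ranges. The following sketch illustrates that principle with the Python standard library; the function name is hypothetical and the sketch does not reflect how the service evaluates its access list internally.

```python
import ipaddress


def is_allowed(client_ip: str, allow_list: list[str]) -> bool:
    """Return True if `client_ip` matches any entry in `allow_list`.

    Entries may be single addresses ("192.0.2.10") or CIDR ranges
    ("203.0.113.0/24").
    """
    ip = ipaddress.ip_address(client_ip)
    return any(
        ip in ipaddress.ip_network(entry, strict=False) for entry in allow_list
    )
```

Before enabling such a restriction, it can help to run the addresses of all administrators and automation hosts through a check like this to avoid locking out legitimate access.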

Summary

VMware Cloud Disaster Recovery includes several components that need to be configured to ensure the proper functioning of the disaster recovery solution.

The table below presents a summary of the various user interface (UI) options, permissible actions, and configuration parameters:

UI	Action allowed	Configuration
Dashboard	Provides an overview of the deployment, including the recovery region summary, sites, and topology. Quick setup actions are provided here; each of these configurations can also be made from its associated tile.	The dashboard shows the overall status of the solution for a particular region at a glance. If the account has multiple regions enabled, switch between regions for region-specific configuration. The summary provides health information about the selected region, the connectivity status of sites, and a pictorial topology view.
Cloud file system	Creation of a file system for the enabled region.	Use additional cloud file systems to have separate failure domains, additional recovery SDDCs, and increased total backup capacity. The SCFS provides the immutable, off-site recovery points used for the desired site failover and is managed by a separate SaaS Orchestrator UI running as a service in VMware Cloud.
Protected Sites	Configure on-premises or VMware Cloud on AWS as a protected site; specify the cloud file system to be used; configure the network connection from the protected site (internet or Direct Connect); set the time zone.	Protected sites and their policies provide coverage for production workloads, whether they run in on-premises datacenter locations or in other VMware Cloud on AWS SDDCs.
Recovery SDDC	Create a recovery SDDC or attach an existing supported SDDC; choose the paired cloud file system.	A VMware Cloud on AWS recovery SDDC is the only supported recovery site for running workloads when the protected site has experienced a disaster. Note that only a 1:1 mapping is available between a cloud file system and a recovery SDDC.
Protection Groups	Choose the protected site; set snapshot frequency and retention; enable high-frequency snapshots; enable guest OS quiescing; add VMs based on the available queries.	Protection groups are used to organize virtual machines and their associated policies for replication and failover. Snapshot frequency and query-based automated addition of VMs to protection groups can also be configured. Quiescing of protected VMs is only allowed during a standard snapshot.
Recovery Plan	Creation of recovery plan; DR failover; DR failover test; activate/deactivate; compliance checks	By configuring a recovery plan in VMware Cloud Disaster Recovery, administrators can automate failover, reducing the risk of human error and minimizing downtime in the event of a disaster. Administrators can also create a failback plan, test these recovery plans, and choose to activate or deactivate them based on the recovery requirements. Compliance checks are run regularly on active recovery plans to ensure the plan adheres to protected site, recovery site, orchestrator, and other (proxy VM state, refresh token validity) requirements.
Monitor	SLA (Service Level Agreement) status; events; alarms; tasks	Monitoring VMware Cloud Disaster Recovery is essential to ensure the replication and failover processes are running smoothly and to quickly identify and address any issues that may arise. Alarm alerts and automated reporting can be configured in the monitor section.
Settings	API token; email alerts; security and compliance; Direct Connect; integration services (ransomware recovery and other)	When a VMware Cloud on AWS account token is configured, VMware Cloud Disaster Recovery allows for the automated deployment of a recovery SDDC via the console. The email recipients and network configuration paths for replication are also set up in the UI's settings panel.