Well-Architected Design: VMware Cloud Disaster Recovery Operations

Introduction

VMware Cloud Disaster Recovery operations are the processes and procedures that ensure the continuity and recovery of critical applications, data, and services in vSphere-based on-premises or VMware Cloud on AWS SDDC environments during a disaster or disruptive event. These operations are designed to minimize downtime, protect data integrity, and restore normal business operations as quickly as possible.

Scope of the Document

This design provides an overview of the key considerations, processes, and guidelines related to the management and execution of VMware Cloud Disaster Recovery operations.

Summary and Consideration

Use Case	Preemptive actions for disaster avoidance. Readiness planning for disaster recovery. Disaster recovery solution validation.
Pre-requisites	VMware Cloud on AWS subscription. VMware Cloud Disaster Recovery subscription. Internet or Direct Connect connection from the protected site location to the cloud file system. VMs enabled for ransomware recovery must have VMware Tools installed. Refer to the VMware Cloud Disaster Recovery Planning and preparation guide and components design guide.
General Considerations/Recommendations	Failover and Failback Procedures: Develop detailed failover and failback procedures. These should outline the steps, responsibilities, and dependencies involved in transitioning from the protected site to the recovery site during a disaster event and returning to normal operations afterward. Testing and Validation: Regularly test and validate your VMware Cloud Disaster Recovery solution to ensure its readiness and effectiveness. Conduct both planned and unplanned tests to simulate various disaster scenarios and assess the ability to meet recovery point objective (RPO) and recovery time objective (RTO) targets. Monitoring and Reporting: Establish robust monitoring capabilities to track the health and performance of your VMware Cloud Disaster Recovery infrastructure and operations. Implement comprehensive reporting mechanisms to provide visibility into the status and compliance of your VMware Cloud Disaster Recovery solution. Regular Maintenance and Updates: Regularly review and refine your recovery plans to incorporate lessons learned from tests, technology advancements, and changes in business requirements. Scalability: Ensure that your VMware Cloud Disaster Recovery solution is scalable to accommodate future growth and evolving workload demands. Scalability considerations include the ability to scale the network, storage, and compute resources at both the protected and recovery sites. Plan for additional scale-out cloud file system (SCFS) capacity and DRaaS Connectors as replication volume increases.
Performance Considerations	Storage Performance: Evaluate the performance capabilities of your storage infrastructure at both the protected and recovery sites. Ensure that the storage systems can handle the workload demands during replication and failover processes. Consider factors such as Input/Output Operations Per Second (IOPS), throughput, and latency of the storage arrays to meet the performance requirements of critical workloads. Compute Resources: Ensure that the recovery site has sufficient compute resources, such as CPU and memory, to support the failover of critical workloads. Adequate compute capacity helps maintain the required performance levels during the recovery phase. Monitor and allocate resources effectively to prevent resource contention and performance degradation.
Network Considerations/Recommendations	Bandwidth: Assess the available bandwidth between the protected and recovery sites. Sufficient bandwidth is crucial for timely data replication and synchronization between the sites. Consider the volume of data to be replicated, the frequency of replication, and the desired RPO when evaluating bandwidth requirements. Network Latency: Minimize network latency between the protected and recovery sites to ensure optimal data transfer performance. Higher latency can impact the speed of data replication and recovery operations.
Cost Implications	Please see the VMware Cloud Disaster Recovery pricing page for further information on the costs involved.
Document Reference
Last Updated	April 2023
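
The bandwidth consideration in the table above (data volume, replication frequency, and desired RPO) can be turned into a rough first estimate. The following is a minimal sketch, not a sizing tool: the function name and the data_reduction_ratio parameter are hypothetical, and it assumes that one interval's changed data must finish transferring before the next interval starts. It ignores protocol overhead and competing traffic.

```python
def required_bandwidth_mbps(changed_gb_per_interval: float,
                            interval_hours: float,
                            data_reduction_ratio: float = 1.0) -> float:
    """Estimate the sustained bandwidth (megabits per second) needed to
    transfer one replication interval's changed data within that interval.

    data_reduction_ratio is a hypothetical factor for compression and
    deduplication (2.0 means the data shrinks to half before transfer).
    """
    if interval_hours <= 0 or data_reduction_ratio <= 0:
        raise ValueError("interval and reduction ratio must be positive")
    transferred_gb = changed_gb_per_interval / data_reduction_ratio
    megabits = transferred_gb * 8 * 1000  # decimal units: 1 GB = 8000 megabits
    return megabits / (interval_hours * 3600)
```

For example, 180 GB of changed data per 4-hour interval works out to about 100 Mbit/s sustained before any data reduction is applied.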

Operations

Protection Group lifecycle

Protection group management and lifecycle refer to the processes and activities involved in creating, configuring, and maintaining protection groups in a disaster recovery solution.

A protection group is a logical grouping of virtual machines (VMs) that share similar recovery requirements. It allows you to apply consistent policies and settings to multiple VMs collectively. Here are the key aspects of protection group management and lifecycle:

Creation	The first step is to create protection groups based on your recovery needs. New protection group creation requests can arise depending on the workload changes on the protected site.
Configuration	Once created, protection groups can be configured to fine-tune their behavior. This includes adjusting replication frequencies and modifying retention policies.
Monitoring	Ongoing monitoring of protection groups is crucial to ensure their effectiveness. This involves regularly checking the replication status, reviewing RPO compliance, and monitoring any alerts or notifications related to the protected VMs.
Maintenance	Regular maintenance of protection groups includes periodically reviewing and updating the group membership as VMs are added to or removed from the environment. It also involves reviewing and updating the replication and recovery settings to align with changing business requirements.
Testing and validation	Regular testing and validation of protection groups are essential to ensure the recoverability of VMs in a disaster scenario. This involves performing planned failover tests to verify the replication process, validate the recovery procedures, and assess the overall readiness of the protection groups.
Decommissioning	If VMs or protection groups are no longer required, a proper decommissioning process should be followed. This involves removing VMs from the protection groups, stopping replication, and cleaning up any associated resources.
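
The monitoring row above mentions reviewing RPO compliance. As an illustration only, the check can be sketched as a comparison between the age of the last successful snapshot and the RPO target. The ProtectionGroupStatus record and its fields are hypothetical and do not correspond to any product API; in practice the data would come from whatever reporting source you use.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass
class ProtectionGroupStatus:
    # Hypothetical record; field names are illustrative, not a product schema.
    name: str
    rpo_target: timedelta
    last_successful_snapshot: datetime


def groups_exceeding_rpo(groups: Iterable[ProtectionGroupStatus],
                         now: datetime) -> List[str]:
    """Return the names of groups whose newest snapshot is older than the RPO target."""
    return [
        g.name
        for g in groups
        if now - g.last_successful_snapshot > g.rpo_target
    ]
```

A group is flagged only when the snapshot age strictly exceeds the target, so a snapshot taken exactly one RPO interval ago still counts as compliant in this sketch.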

Recovery Plan lifecycle

The recovery plan lifecycle in VMware Cloud Disaster Recovery involves several operations that enable organizations to manage their disaster recovery processes effectively. These lifecycle operations provide organizations with the flexibility to create, test, configure, and activate recovery plans as part of their overall disaster recovery strategy. Here are the key aspects of recovery plan management and lifecycle:

Creation	Organizations can create recovery plans that define the steps and procedures to be followed during a disaster recovery event. This involves specifying the protection groups to be included in the plan, the order in which they should be powered on, and any custom scripts or commands that need to be executed.
Configuration	Administrators have the ability to configure failover and failback settings within the recovery plans. This includes specifying the order in which protection groups are failed over or failed back, setting limits on the maximum number of virtual machines that can be failed over at once, and configuring custom network settings as required.
Compliance validation	The compliance status lets administrators monitor the compliance state of a recovery plan. Any recovery plan that is out of compliance must be checked and the reported issues fixed.
Testing	Recovery plans can be tested to ensure their effectiveness and validate the recovery procedures. Administrators can simulate a disaster event and execute the recovery plan in a controlled environment. This allows them to verify that virtual machines are powered on in the correct order and that all custom scripts and commands are executed as intended.
Scheduling	Recovery plan tests can be scheduled at regular intervals. This ensures that the plans are regularly tested, helping to identify any issues or gaps in the recovery process. Scheduled testing helps organizations maintain confidence in their disaster recovery capabilities and provides an opportunity to address any potential issues proactively.
Activation	Recovery plans can be activated or deactivated based on the organization's requirements. When a recovery plan is activated, it is ready to be executed during an actual disaster event. Conversely, deactivating a recovery plan temporarily disables its execution, allowing administrators to make updates or changes as needed.

Failover

VMware Cloud Disaster Recovery failover is the process of shifting operations from a protected site to a recovery site in the event of a disaster or outage. During a failover event, VMs that are protected by VMware Cloud Disaster Recovery are powered on at the recovery site, to ensure continued access to critical applications and data.

Here is an overview of the VMware Cloud Disaster Recovery failover process:

Failover process	Description
Detection of a disaster or outage	The administrator detects a disaster or outage at the protected site, triggering the failover process. This step is skipped during a planned failover.
Verify protection group status	During a planned failover, administrators should verify the status of the protection group to ensure that all VMs are properly protected and up to date.
Initiate failover	The failover process is initiated manually depending on the configuration of the VMware Cloud Disaster Recovery protection groups.
Power on virtual machines	Once failover is initiated, VMs are powered on at the recovery site according to the configuration of the recovery plan. This may involve powering on VMs in a specific order, running custom scripts or commands, or configuring network settings.
Network configuration	Manual network configuration changes may be needed to ensure that the virtual machines are accessible at the secondary site.
Validation	After the virtual machines are powered on at the secondary site, administrators should verify that they are functioning properly and that users can access critical applications and data.
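
The ordering described in the table above (power on in a specific order, run custom scripts, then validate) can be pictured as a list of steps executed strictly in sequence. This is a conceptual sketch only, not how the service implements recovery plans: the Step type and run_plan function are hypothetical, and each action is assumed to return True on success.

```python
from typing import Callable, List, Tuple

# A step is a (name, action) pair; the action returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_plan(steps: List[Step]) -> List[str]:
    """Run steps in order and stop at the first one that reports failure.

    Returns the names of the steps that completed successfully.
    """
    completed: List[str] = []
    for name, action in steps:
        if not action():
            break  # later steps may depend on this one, so do not continue
        completed.append(name)
    return completed
```

Stopping at the first failure reflects the dependency concern behind ordered power-on: an application tier should not start before the tier it depends on is confirmed to be running.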

Test failover

A "Test failover" is a simulated failover event that allows administrators to verify the effectiveness of their VMware Cloud Disaster Recovery protection and failover processes without causing disruption to production environments.

Note the following:

Testing a recovery plan is only allowed for a recovery plan which is active.
The configuration of inventory mappings for the test and DR failover can be adjusted to be the same or different, depending on the checkbox option during the Recovery plan creation.
A test failover will not replicate changes on the test recovered VM back to the production site.

Below are considerations for Test Failover:

Design Consideration	Descriptions
Objectives	Clearly define the objectives of the test failover, such as validating disaster recovery plans, testing application functionality, or assessing system performance.
Scope	Determine the scope of the test failover, including which systems, applications, or services will be involved in the test.
Scheduling	Select an appropriate time for the test failover, considering factors like business operations, user impact, and system availability.
Inventory Mapping	Ensure that the inventory mappings used for the test do not coincide with production workloads.
Documentation	Document the test failover plan, including the steps to be followed, expected outcomes, and any specific configurations or settings to be applied during the test.
Rollback Plan	Develop a rollback plan in case the test failover encounters issues or unexpected results. Ensure a clean-up before executing the recovery plan.
Monitoring and Validation	Set up monitoring tools and processes to track the progress and performance of the test failover. Validate the performance is in line with the RTO requirements.
Lessons Learned	After the test failover, conduct a thorough evaluation and capture lessons learned. Identify areas of improvement and make necessary adjustments to enhance the effectiveness of future test failover exercises.

Planned failover or DR failover

Planned failover and DR failover are two different scenarios in a disaster recovery solution. Here is how they can be defined:

Planned Failover: A planned failover is a controlled and scheduled process where the failover from the protected site to the recovery site is initiated intentionally. It is typically performed when there is a need to perform maintenance activities or tests on the protected site, or when there is an anticipated disruption or event that may impact the protected site's availability.

During a planned failover, the VMs are manually powered off or suspended on the protected site, and then powered on or resumed on the recovery site. This process ensures that there is minimal to no data loss, and the failover is executed in an orderly manner according to predefined procedures.

Planned failover allows organizations to proactively switch their operations to the recovery site to maintain business continuity during planned maintenance, upgrades, or other events that may temporarily affect the protected site.

DR Failover: DR failover, on the other hand, is an unplanned event triggered by a disaster or an unexpected disruption that renders the protected site unavailable. It occurs when the protected site experiences a failure, such as a natural disaster, power outage, hardware failure, or any other event that makes the protected site inaccessible or inoperable.

During a DR failover, the recovery site takes over the operations, and the VMs are powered on to ensure business continuity. The failover process is initiated manually in response to the protected site's failure. Data loss may occur, depending on the RPO (Recovery Point Objective) defined for the solution and the time elapsed since the last replication or backup.

DR failover is designed to swiftly transition critical workloads to the recovery site to minimize downtime and restore services in the face of an unforeseen event or disaster.
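
The statement that data loss in a DR failover depends on the RPO and the time elapsed since the last replication can be illustrated with simple date arithmetic. This is a minimal sketch with a hypothetical function name: it assumes you have a list of snapshot completion times and picks the newest one taken at or before the disaster.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional, Tuple


def recovery_point(snapshots: Iterable[datetime],
                   disaster_time: datetime) -> Optional[Tuple[datetime, timedelta]]:
    """Return (newest usable snapshot, data-loss window), or None if no
    snapshot was taken at or before the disaster time."""
    usable = [s for s in snapshots if s <= disaster_time]
    if not usable:
        return None
    latest = max(usable)
    return latest, disaster_time - latest
```

With snapshots every four hours, a disaster that strikes two and a half hours after the last snapshot loses two and a half hours of changes; the worst case approaches the full snapshot interval.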

Failback

After the protected site has been restored, administrators can begin the failback process to transition back to the protected site. This process involves powering off virtual machines at the secondary site and synchronizing any changes that occurred during the failover period.

In VMware Cloud Disaster Recovery, you can execute a DR Plan to perform failback from a VMware Cloud on AWS SDDC to a protected vSphere site. The failback process from an SDDC only transfers changed data, without requiring rehydration, and the data remains in its original compressed and deduplicated form.

The failback process from VMware Cloud on AWS involves a series of steps, which are as follows:

Shut down the VMs on the VMware Cloud on AWS SDDC.
Take a final snapshot of the VMs after they are shut down. The differences between the state of the VMs at the time of recovery and failback are then applied to the snapshot that was used for recovery to create a VM backup on the SCFS for future retrieval.
Retrieve these VM backups to an on-premises system using a general forever incremental protocol.
Recover the VMs to a protected vSphere site.
Once the recovery is successful, the VMs are automatically deleted from the SDDC.

A failback DR plan is created by duplicating the recovery plan and reversing its steps. The new failback plan operates the same way as any other recovery plan. You can edit the plan to change the destination site to point to a new VMware Cloud Disaster Recovery protected site. Or you can change the vCenter mapping if the failback target site has more than one protected site.

Alternatively, you can configure a new protected site or vCenter for failback if the appropriate mappings are in place. However, incremental recovery may not be possible in this case.

When VMware Cloud Disaster Recovery can locate a VM with the same instance UUID, an incremental recovery is carried out. If the instance UUID cannot be found, a full recovery is initiated instead.
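
The instance UUID rule above amounts to a membership test per VM. The sketch below only illustrates that decision for planning purposes; the function name and the input shapes (a name-to-UUID mapping and a set of UUIDs present at the failback target) are assumptions, not a product interface.

```python
from typing import Dict, Set


def plan_failback(vm_uuids: Dict[str, str],
                  target_site_uuids: Set[str]) -> Dict[str, str]:
    """Classify each VM as 'incremental' when its instance UUID already
    exists at the failback target, and 'full' otherwise."""
    return {
        vm_name: "incremental" if uuid in target_site_uuids else "full"
        for vm_name, uuid in vm_uuids.items()
    }
```

Running a check like this before failback can help estimate how much data will need to be transferred, since VMs classified as full recoveries cannot reuse existing data at the target.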

Considerations for Failback in VMware Cloud Disaster Recovery:

Design Considerations	Description
Data Consistency	Ensure that the data at the recovery site is consistent before initiating the failback process.
Application Dependencies	Identify and address any application dependencies that may have changed during the failover. This includes re-establishing connectivity and dependencies with other systems or services.
Network Configuration	Verify that the network configuration at the production site is restored to support the failback process. This includes updating IP addresses, DNS settings, routing, and security system rules as required.
Testing and Validation	Conduct thorough testing and validation of the failback process before executing it in a production environment. This helps identify and resolve any issues or discrepancies that may arise during the failback.
Monitoring and Troubleshooting	Continuously monitor the failback progress and performance and be prepared to address any issues that may arise. This includes closely monitoring application functionality, network connectivity, and system performance during the failback.
Documentation and Lessons Learned	Document the failback process, including any challenges faced, solutions implemented, and lessons learned. This information can be valuable for future disaster recovery planning and continuous improvement efforts.
Post-Failback Testing	Once the failback is complete, perform comprehensive testing to ensure that all systems and applications are functioning as expected in the production environment.

Restore a VM

VMware Cloud Disaster Recovery provides a process for restoring VMs and their associated data from backups stored on the recovery site. The restored VM will be returned to the same state it was in when the snapshot was taken, including its vCenter Server location, configuration, and data, among other things.

Here is an overview of the process for restoring a VM to the protected site from backup:

Identify the failed VM(s)
Initiate the restoration process
Verify the restored VM(s)

By restoring VMs from backups stored on the cloud file system, administrators can ensure that their critical applications and data are available in the event of a disaster or outage.

Important Note: The restoration process provided in VMware Cloud Disaster Recovery should not be considered a substitute for backups of your workloads. It is strongly recommended to perform regular backups of your VMs independently.

Inventory and Resource Mapping 

Inventory and resource mapping are important aspects of VMware Cloud Disaster Recovery.

In terms of inventory, VMware Cloud Disaster Recovery allows you to select and protect VMs in your on-premises vSphere environment or any other supported protected site. You can choose to protect entire data centers or specific resource pools or VMs. Once the initial replication is complete, you can manage the protected inventory from the VMware Cloud Disaster Recovery console.

Resource mapping is the process of mapping protected site resources to their equivalent resources in the VMware Cloud on AWS recovery SDDC. This includes mapping vSphere clusters, hosts, datastores, and networks to their equivalent constructs in VMware Cloud on AWS SDDC.

vCenter Server Mapping

When mapping vCenter Server in a disaster recovery plan, choosing a target vCenter Server or a recovery SDDC is a straightforward task as each SDDC consists of a single vCenter Server instance. However, for VMware Cloud Disaster Recovery, it is essential to note that while a protected site may have several registered vCenter Servers, only one vCenter Server on VMware Cloud on AWS can be mapped per disaster recovery plan.

The wizard displays a subset of the vCenter Server object inventory for both the source and target vCenter Servers. Source vCenter Server object nodes that are detected to contain protected VMs are required to be mapped and are displayed in the UI with blue text. All other mappings are optional.

Note: If your VMs on the protected vSphere Server site have tags associated with them, make sure that the same sets of tags and tag categories also exist on the recovery site of the disaster recovery plan.
Tip: Avoid having other VMs in target folders because name conflicts can arise when registering VMs with vCenter Server.

Datastore Mapping

In VMware Cloud Disaster Recovery, the process of establishing datastore mapping involves associating the source datastores of your protected VMs with the target datastores at the recovery site.

By correctly configuring datastore mappings, VMware Cloud Disaster Recovery can efficiently replicate and synchronize your VMs' data, allowing for a smooth failover process in the event of a disaster or disruption.

IP Mapping

VMware Cloud Disaster Recovery utilizes IP mappings to determine the assigned IP addresses of a VM during failover from a protected site to the recovery site. The IP addresses that will be used for the recovered VMs need to be specified to ensure proper recovery.

IP address mappings can be set up for VMs running on Linux and Windows operating systems. Target IP, subnet mask, gateways, and DNS servers will be displayed for VMs configured with IP address mappings.

Important: To map IP addresses for Windows VMs, the system drive of the VMs must be mapped to c:\. Additionally, the mapped c:\ drive cannot be a dynamic volume; it must be a basic disk.
Note: VMware Tools must be installed on the guest OS to ensure successful IP address mapping. Only IPv4 is supported for protection plan IP address mapping. This means that any VMs referenced in a disaster recovery plan must be using IPv4, or the IP address mapping will not work.

Inventory

Resources Mapping allowed

Individual IP Address Mapping

The following options are available on IP address mapping page:

Optional rule description
Source and target IP addresses
Source and target subnet masks
Source and target gateways
Source and target DNS servers

Entries for gateways and DNS servers must be separated by white spaces. If multiple IP addresses are specified, they will be matched in the specified order from source to target.
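
The whitespace-separated, order-preserving matching described above can be sketched as follows. This is an illustration, not the product's parser: the function name is hypothetical, and rejecting lists of unequal length is an assumption made here for safety. Parsing with IPv4Address reflects the IPv4-only restriction noted earlier.

```python
import ipaddress
from typing import List, Tuple


def pair_in_order(source: str, target: str) -> List[Tuple[str, str]]:
    """Split two whitespace-separated IPv4 lists and pair entries by position."""
    # IPv4Address raises a ValueError subclass for malformed or IPv6 input.
    src = [ipaddress.IPv4Address(a) for a in source.split()]
    dst = [ipaddress.IPv4Address(a) for a in target.split()]
    if len(src) != len(dst):
        raise ValueError("source and target lists must have the same length")
    return [(str(s), str(d)) for s, d in zip(src, dst)]
```

For instance, the source list "10.0.0.10 10.0.0.11" paired with the target list "172.16.0.10 172.16.0.11" maps the first source entry to the first target entry and the second to the second.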

IP Address Range Mapping

Alternatively, you can configure IP address ranges rather than individual IP addresses. To switch to IP ranges, select Range from Range/IP addresses.

Limitations when mapping IP address ranges:

You can provide a bits value (CIDR prefix) that yields an IP range smaller than the subnet. For instance, if the subnet is a /20, you can define a CIDR prefix (bits) that provides a smaller IP range (e.g., /21, /22) for the range mapping.

You cannot, however, do the reverse. If the subnet is a /20, you cannot enter a CIDR prefix (bits) that provides a greater IP range (e.g., /19, /18) for the range mapping.
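
The prefix rule above can be checked with Python's standard ipaddress module. This is a minimal sketch with a hypothetical function name; treating a prefix equal to the subnet's own prefix as acceptable is an assumption, since the text only describes the smaller and greater cases.

```python
import ipaddress


def range_prefix_allowed(subnet_cidr: str, range_bits: int) -> bool:
    """Return True when range_bits describes a range no larger than the subnet.

    A larger prefix number means a smaller address range, so the range
    prefix must be greater than or equal to the subnet prefix.
    """
    if not 0 <= range_bits <= 32:
        raise ValueError("IPv4 prefix length must be between 0 and 32")
    subnet = ipaddress.ip_network(subnet_cidr, strict=True)
    return range_bits >= subnet.prefixlen
```

With a /20 subnet such as 10.10.0.0/20, a range prefix of /21 or /22 is accepted, while /19 or /18 is rejected because it would cover more addresses than the subnet contains.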

Performance checks

Performance checks play a vital role in managing VMware Cloud Disaster Recovery effectively. Regular system health checks are conducted to identify any issues, and performance optimization measures are implemented to enhance efficiency. With comprehensive performance checks, organizations can proactively address bottlenecks, adhere to service level agreements, and maintain reliable disaster recovery capabilities. Performance checks cover the aspects listed below; refer to DRaaS Connector Performance checks.

CPU and memory utilization of the DRaaS Connector are monitored to identify any resource limitations. Additionally, VM and application performance is examined to ensure optimal operation during normal and failover scenarios.
Replication performance is monitored, focusing on factors like bandwidth, latency, and throughput between the protected and recovery sites.
Network performance is also evaluated by assessing connectivity, latency, packet loss, and bandwidth utilization.
Storage performance is another crucial area, involving monitoring disk I/O, read/write latencies, and throughput to meet recovery objectives.

Scalability

It is essential to consider scalability to ensure optimal performance and accommodate future growth of replicated workloads. The scalability of the DRaaS Connector allows you to handle increased workloads and maintain efficient replication processes.

Here are some post-deployment considerations for scaling your DRaaS Connector:

Evaluate Workload Growth	Regularly assess your workload requirements and anticipate potential growth. Determine if the existing DRaaS Connector can handle the increased workload or if additional resources are needed.
Resource Monitoring	Monitor the resource utilization of the DRaaS Connector, including CPU, memory, and disk usage. This will help identify any resource constraints.
Scaling	Deploy additional instances of the DRaaS Connector to distribute the workload and improve performance. This can involve deploying multiple DRaaS Connectors in parallel to handle the replication of different sets of virtual machines or protection groups. Use vSphere Anti-Affinity rules to keep DRaaS Connectors separate across physical hosts.
Network Bandwidth	Ensure that the network bandwidth between the protected site and the recovery site is sufficient to handle the increased workload. Evaluate the available bandwidth and consider upgrading your network infrastructure if needed.
Continuous Monitoring	Implement a robust monitoring solution to track the performance and health of the DRaaS Connector. Monitor key metrics such as replication status, latency, and throughput to ensure smooth operations and address any issues promptly.

Upgrades

VMware Cloud Disaster Recovery is a SaaS (Software-as-a-Service) solution provided by VMware. As a SaaS solution, lifecycle management is handled by VMware, removing the burden on organizations to manage and maintain the underlying infrastructure. As part of the upgrade process, the DRaaS Connector is also automatically upgraded, ensuring that organizations benefit from the latest features, performance improvements, and bug fixes. Details can be found in the Upgrade process documentation.

Hardening/ Security

It is important to continuously review and update the security measures in place for the VMware Cloud Disaster Recovery service to address evolving threats and vulnerabilities. Regularly assess the security posture, perform risk assessments, and collaborate with security experts or VMware support to enhance the overall security of the VMware Cloud Disaster Recovery service.

Here are some considerations for hardening the VMware Cloud Disaster Recovery service:

Design Considerations	Description
Access Control	Implement strong access controls to limit administrative privileges and restrict access to authorized personnel only. Enforce strict password policies, implement Connector and Management access lists, and regularly review and revoke unnecessary privileges. Refer to Configure Access to the Service.
Updating API token	Configure the API token to define its scope of permissions by assigning specific organization roles and service roles. Refresh the token at a regular interval, following the guidelines defined by your security team.
Configure vCenter registration	If a restricted user account is configured on the VMware Cloud Disaster Recovery service to connect to vCenter Server on the protected site, the vCenter Server password may need to be updated at a set interval, depending on security policy requirements. Update the credentials on the protected site registration page once the user credentials have changed. Refer to Updating vCenter credentials.
Network Segmentation	Utilize network segmentation techniques to isolate the VMware Cloud Disaster Recovery service from other networks and ensure that only necessary network traffic is allowed. Configure access lists with the allowed IP addresses or ranges of the deployed connectors. Refer to Configure Access.
Encrypted connection	The DRaaS connectors communicate with the VMware Cloud Disaster Recovery SaaS orchestrator over an encrypted tunnel across the internet or Direct Connect for data transfer and metadata operations utilizing port 443. Refer to DRaaS Connector Connectivity Check and validate connectivity between the DRaaS Connector, Orchestrator, cloud file system, auto-support server, and the protected site vCenter Server and ESXi hosts.
Monitoring and Logging	Implement robust monitoring and logging solutions to track and detect any suspicious activities or anomalies. Monitor network traffic, system logs, and security events to identify potential security breaches or unauthorized access attempts.
Compliance and Regulatory Requirements	Ensure that the VMware Cloud Disaster Recovery service adheres to relevant compliance and regulatory requirements specific to your organization's industry. VMware Cloud Disaster Recovery is now compatible with compliance hardening requirements for Payment Card Industry Data Security Standard (PCI DSS).
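
As a complement to the encrypted connection row above, a basic TCP reachability probe on port 443 can be scripted from a machine on the connector's network. This sketch is not the DRaaS Connector Connectivity Check and does not replace it: the function name is hypothetical, it only confirms that a TCP connection can be opened, and it does not validate TLS or any service behavior. Host names must be taken from your own environment.

```python
import socket


def can_reach(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened in time."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

A False result narrows the problem to name resolution, routing, or firewall rules before you move on to the service's own connectivity checks.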

Monitoring reporting and alerting

When designing the monitoring for VMware Cloud Disaster Recovery, you can establish an effective monitoring solution by enabling proactive management, quick issue detection, and efficient recovery operations.

Here are some key design considerations:

Design Considerations	Description
Monitoring Tasks and events	Use the VMware Cloud Disaster Recovery UI to look at current and historical tasks, alarms, and events. This provides visibility into various components and metrics. Configure event forwarding to vRealize Log Insight Cloud.
Integration with Existing Monitoring Infrastructure	Ensure seamless integration of VMware Cloud Disaster Recovery monitoring with your existing monitoring infrastructure. This includes integrating with your existing monitoring systems, email alerting, and ticketing systems for centralized monitoring and management.
Key Metrics and Alerts	Identify the critical metrics that need to be monitored to ensure the health and performance of VMware Cloud Disaster Recovery. Refer to replication progress statistics data.
Performance and Capacity Monitoring	Monitor the performance and capacity of the VMware Cloud Disaster Recovery infrastructure components such as DRaaS connectors and storage systems. Monitor key performance indicators (KPIs) such as CPU usage, memory utilization, network latency, storage throughput, and latency to identify bottlenecks or performance degradation.
Email alerting	Configure email alerts to receive status change information regarding SLA and Recovery plan compliance and execution status.
Compliance and Reporting	Generate regular reports on the status, performance, and compliance of the VMware Cloud Disaster Recovery environment for auditing purposes. Use SLA status to get a high-level operational status of the protection and recoverability of your critical workloads.
