Designing a VMware Cloud on AWS Disaster Recovery Solution
Datacenter disaster recovery is a primary factor when planning an SDDC in a datacenter or on a cloud that hosts business-critical applications. The widespread use of virtualization has moved datacenter availability up the priority list, making it a major consideration when planning a new datacenter or migrating an existing one to the cloud.
Disaster recovery is the quickest and the safest way, with minimal data loss, to get the business up and running when a disaster strikes. With a properly planned solution, we can not only perform recovery during a disaster but can also execute a planned failover and avoid the disaster altogether.
Need for Disaster Recovery
We are often asked why DR is needed when backups are already performed regularly, so let us look at the difference between DR and a regular backup solution. A backup solution only provides a way to recover from data loss, with multiple point-in-time restore options. In the case of a data center failure, however, it is not dependable, because the data must be recovered to a running SDDC. This is where the recovery data center plays an important role.
The data from the primary (protected) data center is replicated to the disaster site continuously. The replicated data can be recovered quickly with automated recovery, so the time required to recover any application is much shorter than with the traditional approach of recovering each VM manually.
A disaster recovery solution also reduces administrative time and complexity, which is critical in a rolling disaster such as a ransomware attack.
Difference Between Business Continuity and Data Recovery
In addition to understanding the difference between a DR solution and traditional backup, it is worthwhile to understand the concepts of business continuity and data recovery.
Business continuity is the process of planning a disaster recovery solution which includes the following:
- Analyzing the risk to current critical applications
- Business impact analysis of these applications in case of a disaster
- Disaster site readiness
- Key objectives such as RTO and RPO
- Role of virtualization in the recovery solution
When designing a recovery site, you must account for the compute, storage, and network requirements that are necessary to keep the critical applications running while the production site is being recovered. With virtualization it is now possible to use the recovery site in a distributed mode, which means the resources in the recovery site do not need to sit idle all the time. Depending on the available resources, you can choose to run critical or non-critical workloads on the recovery site.
Note - If you are running the recovery site in distributed mode, you must also account for additional storage resources that will be required for saving the replicated data.
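The storage note above can be turned into a rough sizing estimate. The following is a minimal sketch only: the class and function names, the per-instance growth fraction, and the headroom value are assumptions made for illustration, not figures from VMware sizing guidance.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ProtectedVM:
    """A VM replicated to the recovery site (all fields are illustrative)."""
    name: str
    provisioned_gb: float  # size of the replica base disks
    pit_instances: int     # number of retained point-in-time instances
    pit_growth: float      # assumed fraction of the disk changed per instance


def replica_storage_gb(vms: Iterable[ProtectedVM], headroom: float = 0.2) -> float:
    """Estimate the recovery-site capacity needed for replicated data.

    Each replica needs its base disks plus space for the retained
    point-in-time instances. `headroom` is an assumed safety margin
    (for example for swap and snapshot space), not a vendor figure.
    """
    total = 0.0
    for vm in vms:
        base = vm.provisioned_gb
        pit = vm.provisioned_gb * vm.pit_growth * vm.pit_instances
        total += base + pit
    return total * (1 + headroom)


if __name__ == "__main__":
    sample = [
        ProtectedVM("app01", 200, pit_instances=4, pit_growth=0.05),
        ProtectedVM("db01", 500, pit_instances=8, pit_growth=0.10),
    ]
    print(f"Estimated replica storage: {replica_storage_gb(sample):.0f} GB")
```

The estimate is added on top of whatever the recovery site already needs for the workloads it runs locally in distributed mode.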
Let's look at different use cases of DR with virtualization.
- Disaster recovery - This is the primary use case of the solution. When a disaster strikes, recovery is automated using a predefined recovery plan, which consists of protection groups (groups of dependent applications) that are recovered in a particular order (the order of workload recovery).
- Disaster avoidance - Some situations that can affect your datacenters are known in advance, such as an approaching storm, a flood, or datacenter power maintenance. It is always better to avoid a disaster than to react to one. Using the predefined recovery plan, we can perform the failover beforehand. This avoids a last-minute recovery and allows us to plan the failover to a recovery site during non-productive hours.
- Running non-critical workload on DR site - The recovery site is another datacenter residing in a different geographical location, which provides the ability to recover from natural calamities. Although it serves as the recovery site, you will want to make use of the resources you are already paying for. With proper planning, you can use these resources to run non-critical workloads in the recovery site. This helps reduce the load on the production datacenter.
- Upgrade and patch testing - Imagine being able to test patches on a replica of a production workload, patches that might otherwise cause issues when applied directly to production. With a virtualized DR solution, a test failover can bring up the most recent state of a replicated production VM in a test network on the DR site, enabling us to apply the patch or upgrade and validate its effects before it is applied to the production workload.
- Ransomware attacks - One of the recent threats to any workload is ransomware. There are two ways to get out of this situation:
- Pay the attacker, with no guarantee of recovering the data
- Restore from the last known good backup state, which takes a long time and results in loss of data.
With replication enabled on the VM, we can choose to bring up the VM with the latest data replicated before the attack. This avoids the time spent restoring from backup and limits data loss to what the RPO set on the virtual machine allows.
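Choosing the replicated state to recover after a ransomware attack comes down to picking the newest recovery point taken before the compromise. The sketch below illustrates that selection; the function name and the sample timestamps are hypothetical, and in practice the list of recovery points would come from the replication product rather than being typed in.

```python
from datetime import datetime
from typing import Optional, Sequence


def last_clean_recovery_point(points: Sequence[datetime],
                              compromised_at: datetime) -> Optional[datetime]:
    """Return the newest recovery point taken strictly before the attack.

    Returns None when every retained point is at or after the compromise
    time, in which case restoring from backup is the remaining option.
    """
    candidates = [p for p in points if p < compromised_at]
    return max(candidates) if candidates else None


if __name__ == "__main__":
    retained = [
        datetime(2024, 5, 1, 8, 0),
        datetime(2024, 5, 1, 12, 0),
        datetime(2024, 5, 1, 16, 0),
    ]
    attack = datetime(2024, 5, 1, 13, 30)
    print(last_clean_recovery_point(retained, attack))
```

The gap between the selected recovery point and the time of the attack is the data that is lost, which is why it is bounded by the RPO configured on the virtual machine.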
Selecting the Solution
To replicate the data from one site to another, there are two solutions that are currently being used.
- Storage-based replication
- Hypervisor-based replication
Traditional recovery methods involved setting up two sites and enabling storage-based replication between them. Replication was handled entirely by the storage array: a LUN created on the array was configured to be replicated to the secondary site. During disaster recovery, a snapshot of the replicated LUN was created on the recovery site and mounted manually to use the data. This solution lacked automated recovery.
VMware Site Recovery Manager filled this gap with automated recovery through the storage array plugin. Not only was the process of creating array snapshots and mounting them automated, but it also provided additional granularity in the recovery of virtual machines. You could now create protection groups to group dependent workloads, create a recovery plan, and set a priority order for the recovery of workloads. What was still missing, however, was the ability to replicate an individual VM rather than a complete LUN. The placement of VMs on a particular LUN had to be designed around the need for replication.
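The relationship between protection groups, a recovery plan, and the priority order can be pictured with a small data model. This is a simplified sketch for illustration and not the Site Recovery Manager object model or API: the class names and the group-level `priority` field are assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProtectionGroup:
    """A group of dependent workloads that are recovered together."""
    name: str
    vms: List[str]
    priority: int  # assumed convention: 1 is recovered first


@dataclass
class RecoveryPlan:
    """An ordered set of protection groups."""
    name: str
    groups: List[ProtectionGroup] = field(default_factory=list)

    def recovery_order(self) -> List[str]:
        """Return VM names in the order they would be recovered.

        sorted() is stable, so groups that share a priority keep the
        order in which they were added to the plan.
        """
        ordered = sorted(self.groups, key=lambda g: g.priority)
        return [vm for group in ordered for vm in group.vms]


if __name__ == "__main__":
    plan = RecoveryPlan("site-failover", [
        ProtectionGroup("web-tier", ["web01", "web02"], priority=3),
        ProtectionGroup("infra", ["dns01"], priority=1),
        ProtectionGroup("database", ["db01"], priority=2),
    ])
    print(plan.recovery_order())
```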
Storage-based replication also requires storage hardware at both the production and recovery sites. These requirements forced users to use one particular storage vendor at both sites, along with an additional replication license, which increased the overall cost of the DR solution.
VMware introduced hypervisor/host-based replication (HBR) to overcome the challenges of storage-based replication. With HBR, you can replicate the data of an individual virtual machine from one site to another. HBR provides the granularity and flexibility of enabling replication on individual virtual machines residing on any storage supported by ESXi, which addresses the major challenge of making the solution storage agnostic. For example, you may choose to have iSCSI or Fibre Channel attached storage with VMFS on the production site and an NFS or vSAN based solution on the recovery site.
The hypervisor at the production site where the virtual machine is running performs the initial replication (the first full copy), then tracks the changes and replicates them to the recovery site. This is done with the help of a paired replication appliance deployed on both sites. An HBR-enabled machine can also use virtualization features such as HA or vMotion.
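The pattern of one full copy followed by change tracking can be illustrated with a toy model. This is a conceptual sketch only and does not describe how vSphere Replication is implemented; the class, its methods, and the block-level granularity are all assumptions.

```python
from typing import Dict, List, Set


class TrackedDisk:
    """Toy source disk that remembers which blocks changed since the last sync."""

    def __init__(self, block_count: int) -> None:
        self.blocks: List[bytes] = [b""] * block_count
        # Before the initial replication every block counts as changed,
        # so the first sync is a full copy.
        self.dirty: Set[int] = set(range(block_count))

    def write(self, index: int, data: bytes) -> None:
        """Write a block and mark it as changed."""
        self.blocks[index] = data
        self.dirty.add(index)

    def sync(self, replica: Dict[int, bytes]) -> int:
        """Copy only the changed blocks to the replica.

        Returns the number of blocks transferred.
        """
        sent = len(self.dirty)
        for index in self.dirty:
            replica[index] = self.blocks[index]
        self.dirty.clear()
        return sent


if __name__ == "__main__":
    disk = TrackedDisk(block_count=4)
    replica: Dict[int, bytes] = {}
    print("initial sync:", disk.sync(replica), "blocks")
    disk.write(2, b"first")
    disk.write(2, b"second")  # overwrites within one cycle are sent once
    print("delta sync:", disk.sync(replica), "blocks")
```

In this model a block that is overwritten several times between two syncs is transferred only once, with its latest contents.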
Designing the Solution
When designing a recovery solution, there are multiple factors to be considered covering all the components of the datacenter.
VMware provides two offerings for a disaster recovery solution with VMC on AWS.
- Using VMware Site Recovery Manager: This is an on-demand DRaaS solution that is delivered with vSphere Replication and VMware Site Recovery Manager. With this service, along with enabling the add-on from the VMC on AWS UI, you deploy a vSphere appliance in your on-premises vSphere environment. You can then pair the sites and replicate your critical VMs running in the on-premises environment to an SDDC created on VMC on AWS.
- Using VMware Cloud Disaster Recovery: This is VMware's on-demand disaster recovery service, which is delivered as an easy-to-use SaaS solution and offers cloud economics to help keep your disaster recovery costs under control. The target SDDC can be created immediately prior to performing a recovery rather than upfront, while replication is still supported in the steady state. The DRaaS connector is deployed as a virtual appliance that replicates the data to a Scale-Out Cloud File System (SCFS). When we choose to perform a recovery, this volume is live-mounted on the SDDC, and because the VMs are already stored in an ESXi-supported format, recovery is straightforward.
When you are selecting an SDDC as a recovery site, you must cover these major categories of an SDDC.
- When sizing the DR site you must analyze the current infrastructure that is deployed and list all resources that need to be protected.
- When you have multiple on-prem datacenters, design the solution to either use one dedicated SDDC as a recovery site or use the sites in dual mode (each site acting as both a production and a recovery site).
- Deployment Type
- Single Host SDDC
- On-demand (also known as "just-in-time")
- Pilot Light with cloud bursting
See the table below for detailed information:
Table columns: Annual SDDC commitment, Available with Solution, SDDC startup time
Single Host SDDC
A single-host SDDC is deployed for building and testing DR plans, after which the recovery SDDC can be deleted to save recurring costs.
A single-host SDDC does not provide any data protection, does not offer production-level SLAs, and is automatically torn down after 30 days. A single-host deployment should only be used for testing purposes and is not intended for production usage.
Only when needed
A minimum of 3 ESXi hosts on VMware Cloud on AWS running at all times, or a 2-host SDDC with an EC2 instance as witness. (For more details, see this blog.)
vCloud DR with vSphere Replication and Site Recovery Manager
- Recovery Time Objective (RTO): The RTO is the targeted duration of time, and the service level, within which a business process must be restored after an IT service or data loss issue, such as a natural disaster. RTO is an important criterion, and the earlier table can be used to determine which offering to choose as a DR solution.
- Recovery Point Objective (RPO): The RPO defines the maximum acceptable age of the data in the replicated copy (replica) that is recovered after an IT service or data loss issue, such as a natural disaster. The lower the RPO, the closer the replica's data is to the original. However, a lower RPO requires more bandwidth between the source and target locations, and more storage capacity in the target location depending on the point-in-time instances configured on the VM.
- Point-in-Time Instance: You define multiple recovery points (point-in-time instances or PIT instances) for each virtual machine so that when a virtual machine suffers data corruption, data integrity issues, or host OS infections, administrators can recover and revert to a recovery point taken before the compromising issue occurred.
- Accounting for overhead using each datacenter in distributed mode:
- Include the snapshot and swap space
- Add vSphere Replication appliance overhead depending on the number of VMs configured for replication
- Network Connectivity Considerations
- Replication objectives such as the RTO and RPO configuration depend on the network
- Network Compression
- ISP selection, network bandwidth, and redundancy
- IP migration if any (public IP)
- Name record update
- Perform Inventory mapping
- VM resource pool and folder Inventory mapping
- Datastore mapping
- Network mapping
- Swap datastore configuration
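The inventory mapping items above lend themselves to a pre-flight check: before a failover, confirm that every resource a protected VM uses has a counterpart on the recovery site. The sketch below assumes a hypothetical inventory format (plain dictionaries) and hypothetical names; it does not call any VMware API.

```python
from typing import Dict, List

# Kinds of inventory mapping taken from the list above.
MAPPING_KINDS = ("network", "datastore", "folder", "resource_pool")


def find_unmapped(vms: List[Dict[str, str]],
                  mappings: Dict[str, Dict[str, str]]) -> Dict[str, List[str]]:
    """Return, per VM, the mapping kinds that have no recovery-site target.

    `vms` holds one dict per protected VM with a "name" key and one key
    per mapping kind. `mappings` maps each kind to a dict of
    {protected-site object: recovery-site object}.
    """
    gaps: Dict[str, List[str]] = {}
    for vm in vms:
        missing = [kind for kind in MAPPING_KINDS
                   if vm[kind] not in mappings.get(kind, {})]
        if missing:
            gaps[vm["name"]] = missing
    return gaps


if __name__ == "__main__":
    protected = [
        {"name": "app01", "network": "prod-vlan10", "datastore": "ds-prod-01",
         "folder": "Apps", "resource_pool": "Prod"},
        {"name": "db01", "network": "prod-vlan20", "datastore": "ds-prod-02",
         "folder": "Databases", "resource_pool": "Prod"},
    ]
    site_mappings = {
        "network": {"prod-vlan10": "dr-segment-10"},
        "datastore": {"ds-prod-01": "WorkloadDatastore",
                      "ds-prod-02": "WorkloadDatastore"},
        "folder": {"Apps": "DR-Apps", "Databases": "DR-Databases"},
        "resource_pool": {"Prod": "Compute-ResourcePool"},
    }
    print(find_unmapped(protected, site_mappings))
```

Running a check like this as part of regular DR testing surfaces missing mappings before they block an actual recovery.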
The primary considerations for a DR plan are defined here. For planning information, refer to the DR Planning document.
- Limits on recovery plans, protection groups, etc. - Discuss how this affects the overall design and how to best optimize.
- Limits on concurrent recoveries to avoid burst mode - Discuss how this impacts DR events and offer strategies for prioritization of recoveries during a DR event.
- Restart Priority and Recovery Order
- Split-brain breaker (witness)
- Consider roles and permissions when a dedicated user is used to execute steps on a shared service during recovery, for example to update a DNS record.
- Manage how shared services such as DNS, DHCP, and domain authentication are handled
- Configure permissions and roles for recovery management.
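One way to reason about concurrent recovery limits together with restart priority is to split the workloads into batches. The sketch below is a hypothetical planning aid: the function name, the tuple format, and the rule that a batch never mixes priorities are assumptions, and the real limits should be taken from the product documentation.

```python
from typing import List, Optional, Tuple


def recovery_batches(vms: List[Tuple[str, int]],
                     max_concurrent: int) -> List[List[str]]:
    """Split (vm_name, priority) pairs into ordered recovery batches.

    Lower priority numbers are recovered first. A batch never mixes
    priorities and never exceeds `max_concurrent` VMs.
    """
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")

    batches: List[List[str]] = []
    current: List[str] = []
    current_priority: Optional[int] = None
    # sorted() is stable, so VMs with equal priority keep their input order.
    for name, priority in sorted(vms, key=lambda vm: vm[1]):
        batch_full = len(current) == max_concurrent
        if current and (priority != current_priority or batch_full):
            batches.append(current)
            current = []
        current.append(name)
        current_priority = priority
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    workloads = [("web1", 2), ("db1", 1), ("web2", 2), ("web3", 2), ("dns", 1)]
    for number, batch in enumerate(recovery_batches(workloads, 2), start=1):
        print(number, batch)
```

Keeping priorities in separate batches means a lower-priority tier starts only after the tier it depends on has been submitted for recovery, at the cost of some batches running below the concurrency limit.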