VMware Site Recovery Technical Overview
VMware Site Recovery brings VMware enterprise-class Software-Defined Data Center (SDDC) Disaster Recovery as a Service (DRaaS) to the AWS Cloud. It enables customers to protect and recover applications without the requirement for a dedicated secondary site. It is delivered, sold, supported, maintained and managed by VMware as an on-demand service. IT teams manage their cloud-based resources with familiar VMware tools—without the difficulties of learning new skills or utilizing new tools.
VMware Site Recovery is an add-on feature to VMware Cloud on AWS, powered by VMware Cloud Foundation. VMware Cloud on AWS integrates VMware’s flagship compute, storage, and network virtualization products — VMware vSphere, VMware vSAN, and VMware NSX®—along with VMware vCenter Server® management. It optimizes them to run on elastic, bare-metal AWS infrastructure. With the same architecture and operational experience on-premises and in the cloud, IT teams can now get instant business value via the AWS and VMware hybrid cloud experience.
The VMware Cloud on AWS solution enables customers to have the flexibility to treat their private cloud and public cloud as equal partners and to easily transfer workloads between them—for example, to move applications from DevTest to production or burst capacity. Users can leverage the global AWS footprint while getting the benefits of elastically scalable SDDC clusters, a single bill from VMware for its tightly integrated software plus AWS infrastructure, and on-demand or subscription services like VMware Site Recovery Service.
VMware Site Recovery extends VMware Cloud on AWS to provide managed disaster recovery, disaster avoidance, and non-disruptive testing capabilities to VMware customers without the need for a secondary site or complex configuration.
VMware Site Recovery works in conjunction with VMware Site Recovery Manager 8.0 and VMware vSphere Replication 8.0 to automate the process of recovering, testing, re-protecting, and failing back virtual machine workloads.
VMware Site Recovery utilizes VMware Site Recovery Manager servers to coordinate the operations of the VMware SDDC, so that as virtual machines at the protected site are shut down, copies of these virtual machines at the recovery site start up. Using the data replicated from the protected site, these virtual machines assume responsibility for providing the same services.
VMware Site Recovery can be used between a customer’s datacenter and an SDDC deployed on VMware Cloud on AWS or it can be used between two SDDCs deployed to different AWS availability zones or regions. The second option allows VMware Site Recovery to provide a fully VMware managed and maintained Disaster Recovery solution.
Migration of protected inventory and services from one site to the other is controlled by a recovery plan that specifies the order in which virtual machines are shut down and started up, the resource pools to which they are allocated, and the networks they can access. VMware Site Recovery enables the testing of recovery plans using a temporary copy of the replicated data and isolated networks, in a way that does not disrupt ongoing operations at either site. Multiple recovery plans can be configured to migrate individual applications or entire sites, providing finer control over which virtual machines are failed over and failed back. This also enables flexible testing schedules.
VMware Site Recovery extends the feature set of the virtual infrastructure platform to provide for rapid business continuity through partial or complete site failures.
Features and Benefits of VMware Site Recovery
- Provides familiar features and functionality with enhanced workflows to reduce time to protection and risk
- An easy to use disaster recovery/secondary site that is supported and maintained by VMware. This lowers capital costs and makes it easier to protect more virtual machines faster.
- Application-agnostic protection eliminates the need for app-specific point solutions
- Automated orchestration of site failover and failback with a single click reduces recovery times
- Frequent, non-disruptive testing of recovery plans ensures highly predictable recovery objectives
- Enhanced, easy to use, consolidated protection workflow simplifies replicating and protecting virtual machines
- Centralized management of recovery plans from the vSphere Web Client replaces manual runbooks
- vSphere Replication integration delivers VM-centric replication that eliminates dependence on a particular type of storage
- Flexible versioning allows for easier upgrades and ongoing management
VMware Site Recovery is deployed in a paired configuration, for example, protected/customer site and recovery/VMware Cloud on AWS site. This document will use the two terms interchangeably because either the customer site or VMware Cloud on AWS site can be either the protected or recovery sites.
VMware Site Recovery utilizes VMware Site Recovery Manager and vSphere Replication. For the VMware Cloud on AWS instance, this software is automatically installed and configured by VMware when the add-on is enabled.
For the customer site, VMware Site Recovery Manager is installed on a Microsoft Windows server and vSphere Replication is deployed as an appliance, both by the customer. VMware Site Recovery requires a VMware vCenter Server at the customer site as well. The customer site vCenter can be running either VMware vCenter Server version 6.0 U3 or 6.5. There must be one or more vSphere hosts running version 5.0 or higher at the customer site. VMware Site Recovery utilizes vSphere Replication for transferring data between sites.
VMware Site Recovery and VMware vCenter Server as well as the workloads they are protecting require infrastructure services like DNS, DHCP and Active Directory. These must be in place at both the protected and recovery sites.
VMware Site Recovery is managed using the new HTML5-based web interface. During the installation of Site Recovery Manager, a plugin labeled “Site Recovery” is installed in the vSphere Web Client and an icon labeled “Site Recovery” is displayed.
VMware Site Recovery supports protection for up to 1000 virtual machines and up to 250 recovery plans, and is able to run up to 10 recovery plans simultaneously. Up to 250 virtual machines can be included in a single protection group, and VMware Site Recovery supports up to 250 protection groups.
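The limits above can be expressed as a simple pre-flight check. The following is a minimal sketch, not part of the product: the function and constant names are hypothetical, and it assumes the number of protected virtual machines equals the sum of the protection group sizes (each virtual machine belongs to one protection group).

```python
# Limits quoted in the text above; all names here are illustrative only.
MAX_PROTECTED_VMS = 1000
MAX_RECOVERY_PLANS = 250
MAX_CONCURRENT_PLANS = 10
MAX_VMS_PER_GROUP = 250
MAX_PROTECTION_GROUPS = 250


def check_limits(group_sizes, recovery_plans, concurrent_plans):
    """Return a list of limit violations (empty when the design fits).

    group_sizes: list holding the number of VMs in each protection group.
    recovery_plans: total number of recovery plans configured.
    concurrent_plans: number of recovery plans expected to run at once.
    """
    problems = []
    if sum(group_sizes) > MAX_PROTECTED_VMS:
        problems.append("too many protected virtual machines")
    if len(group_sizes) > MAX_PROTECTION_GROUPS:
        problems.append("too many protection groups")
    if any(size > MAX_VMS_PER_GROUP for size in group_sizes):
        problems.append("a protection group exceeds the per-group limit")
    if recovery_plans > MAX_RECOVERY_PLANS:
        problems.append("too many recovery plans")
    if concurrent_plans > MAX_CONCURRENT_PLANS:
        problems.append("too many concurrent recovery plans")
    return problems
```

For example, `check_limits([9, 40], recovery_plans=3, concurrent_plans=1)` reports no problems, while a single group of 300 virtual machines is flagged against the per-group limit.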
Though the most obvious use case for VMware Site Recovery is disaster recovery from one site to another, it can handle a number of different use cases and provide significant capability and flexibility to customers. For all use cases and situations, VMware Site Recovery supports non-disruptive testing of recovery plans in network and storage isolated environments. This provides for the ability to test disaster recovery, disaster avoidance, or planned migrations as frequently as desired to ensure confidence in the configuration and operation of recovery plans.
Disaster recovery, or an unplanned failover, is what VMware Site Recovery was specifically designed to accomplish. This is the most critical but least frequently used use case for VMware Site Recovery. Unexpected site failures don’t happen often, but when they do, a fast recovery is critical to the business. VMware Site Recovery can help in this situation by automating and orchestrating the recovery of critical business systems for partial or full site failures, ensuring the fastest RTO.
Preventive failover is another common use case for VMware Site Recovery. This can be anything from an oncoming storm to the threat of power issues. VMware Site Recovery allows for the graceful shutdown of virtual machines at the protected site, full replication of data, and ordered startup of virtual machines and applications at the recovery site, ensuring app-consistency and zero data loss.
Upgrade and Patch Testing
The VMware Site Recovery test environment provides a perfect location for conducting operating system and application upgrade and patch testing. Test environments are complete copies of production environments configured in an isolated network segment which ensures that testing is as realistic as possible while at the same time not impacting production workloads or replication.
VMware Site Recovery can be used in a number of different failover scenarios depending on customer requirements, constraints, and objectives. All of these arrangements are supported and easily configured.
In the traditional active-passive scenario there is a production site running applications and services and a secondary or recovery site that is idle until needed for recovery. This topology is common and though it provides dedicated recovery resources it means paying for a site, servers, and storage that aren’t utilized much of the time.
VMware Site Recovery can be used where low-priority workloads such as test and development run at the recovery site and are powered off as part of the recovery plan. This allows for the utilization of recovery site resources as well as sufficient capacity for critical systems in case of a disaster.
In situations where production applications are operating at both sites, VMware Site Recovery supports protecting virtual machines in both directions (e.g., virtual machines at site A protected at site B and virtual machines at site B protected at site A).
Deployment and Configuration
The process of deploying and configuring VMware Site Recovery is simple and logical. This document will cover these steps at a high level. For detailed installation and configuration instructions please see the VMware Site Recovery Installation and Administration Guides.
Enabling the VMware Site Recovery add-on is a single-click operation that takes 10-15 minutes to complete. It involves VMware automatically deploying and configuring VMware Site Recovery Manager and vSphere Replication for the customer's VMware Cloud on AWS SDDC. If the customer is configuring disaster recovery between two VMware Cloud on AWS SDDCs, the entire deployment process is automated for both.
Install on-premises components
If using an on-premises environment, the on-premises components can be downloaded and installed while the VMware Site Recovery components are being deployed to the VMware Cloud on AWS SDDC. This entails deploying the vSphere Replication appliance from an OVF and installing VMware Site Recovery Manager on a Windows server.
Create Firewall Rules and Pair sites
Creating firewall rules in VMware Cloud on AWS involves creating four rules to allow the customer's on-premises components to communicate with the VMware Cloud on AWS components. Site pairing connects VMware Site Recovery and vSphere Replication at the two sites together. This enables the two sites to operate with each other.
There are multiple types of inventory mappings in VMware Site Recovery: resource mappings, folder mappings, and network mappings. These mappings provide default settings for recovered virtual machines. For example, a mapping can be configured between a network port group named “VM Network” at the protected site and a network port group named “vmware-corp-network-2” at the recovery site. As a result of this mapping, virtual machines connected to “VM Network” at the protected site will, by default, automatically be connected to “vmware-corp-network-2” at the recovery site.
Networks to be used during testing can also be configured in the same area.
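The way a network mapping acts as a default can be illustrated with a small lookup. This is a conceptual sketch only: the dictionary, the function, and the per-VM override argument are hypothetical and do not represent the product's API. The override is an assumption drawn from the statement that mappings supply defaults.

```python
# Hypothetical site-wide mapping: protected-site port group -> recovery-site port group.
NETWORK_MAPPINGS = {"VM Network": "vmware-corp-network-2"}


def recovery_network(vm_name, protected_network, overrides=None):
    """Pick the recovery-site network for a VM.

    A per-VM override (assumed here for illustration) wins; otherwise the
    site-wide mapping supplies the default. Returns None when no mapping exists.
    """
    overrides = overrides or {}
    if vm_name in overrides:
        return overrides[vm_name]
    return NETWORK_MAPPINGS.get(protected_network)
```

A virtual machine on “VM Network” with no override resolves to “vmware-corp-network-2”, matching the example in the text.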
Placeholder Virtual Machines and Datastores
For each protected virtual machine VMware Site Recovery creates a placeholder virtual machine at the recovery site. Placeholder virtual machines are contained in a datastore, and registered with the VMware vCenter Server at the site. This datastore is called the “placeholder datastore”. Since placeholder virtual machines do not have virtual disks they consume a minimal amount of storage.
The protected and recovery sites will each require that a datastore that is accessible by all hosts at that site be created or allocated for use as the placeholder datastore. For VMware Site Recovery the vSAN Workload Datastore would be used. Each site requires at least one placeholder datastore to allow for failover as well as failback.
VMware Site Recovery utilizes vSphere Replication to move virtual machine data between sites. vSphere Replication is able to utilize any storage supported by vSphere so there is no requirement for storage arrays, similar or otherwise, at either site.
vSphere Replication supports RPOs from 5 minutes to 24 hours and also supports network compression and file-system quiescing for both Windows and Linux.
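The RPO range can be read as a pair of bounds on a configured value, and an RPO violation as a replica that is older than that value. A minimal sketch with hypothetical helper names (neither function is part of vSphere Replication):

```python
from datetime import datetime, timedelta

# Bounds quoted in the text above.
MIN_RPO = timedelta(minutes=5)
MAX_RPO = timedelta(hours=24)


def is_supported_rpo(rpo):
    """True when the requested RPO falls inside the supported range."""
    return MIN_RPO <= rpo <= MAX_RPO


def rpo_violated(last_replica_time, now, rpo):
    """True when the newest replica is older than the configured RPO."""
    return now - last_replica_time > rpo
```

With a 15-minute RPO, a replica taken 20 minutes ago counts as a violation, while one taken 10 minutes ago does not.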
Protection groups are a way of grouping virtual machines that will be recovered together. In many cases, a protection group will consist of the virtual machines that support a service or application such as email or an accounting system. For example, an application might consist of a two-server database cluster, three application servers, and four web servers. In most cases, it would not be beneficial to fail over part of this application, only two or three of the virtual machines in the example, so all nine virtual machines would be included in a single protection group.
Creating a protection group for each application or service has the benefit of selective testing. Having a protection group for each application enables non-disruptive, low-risk testing of individual applications, allowing application owners to test disaster recovery plans as needed. Note that a virtual machine can only belong to a single protection group. However, a protection group can belong to one or more recovery plans.
For virtual machines protected by VMware Site Recovery, deciding which virtual machines belong to which protection group is simple. Since virtual machines are replicated on an individual basis, they can be grouped in whatever way makes sense from a recovery standpoint. vSphere Replication protection groups are not tied to storage type or configuration.
Recovery Plans in VMware Site Recovery are like an automated runbook, controlling all the steps in the recovery process. The recovery plan is the level at which actions like failover, planned migration, testing and re-protect are conducted. A recovery plan contains one or more protection groups and a protection group can be included in more than one recovery plan. This provides for the flexibility to test or recover an application by itself and also test or recover a group of applications or the entire site.
In this example there are two protection groups: Accounting and Email. There are three recovery plans: the Accounting recovery plan containing the Accounting protection group, the Email recovery plan containing the Email protection group, and the Entire Site recovery plan containing both protection groups.
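The relationship between protection groups and recovery plans in this example can be sketched as plain data. The virtual machine names below are invented for illustration, and the helper functions are hypothetical; only the group and plan names come from the text.

```python
# Protection groups from the example; the VM names are made up.
protection_groups = {
    "Accounting": ["acct-db01", "acct-app01"],
    "Email": ["mail01", "mail02"],
}

# A protection group may appear in more than one recovery plan.
recovery_plans = {
    "Accounting": ["Accounting"],
    "Email": ["Email"],
    "Entire Site": ["Accounting", "Email"],
}


def validate_single_membership(groups):
    """Raise ValueError if any VM appears in more than one protection group."""
    seen = {}
    for group, vms in groups.items():
        for vm in vms:
            if vm in seen:
                raise ValueError(f"{vm} is in both {seen[vm]} and {group}")
            seen[vm] = group


def vms_in_plan(plan_name):
    """List every VM a recovery plan would act on, in group order."""
    return [vm
            for group in recovery_plans[plan_name]
            for vm in protection_groups[group]]
```

Running `vms_in_plan("Entire Site")` returns the virtual machines of both groups, while `vms_in_plan("Email")` returns only the Email virtual machines, which is what makes per-application testing possible.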
There are five priority groups in VMware Site Recovery. The virtual machines in priority group one are recovered first, then the virtual machines in priority group two are recovered, and so on. All virtual machines in a priority group are started at the same time and the next priority group is started only after all virtual machines are booted up and responding.
This provides administrators one option for prioritizing the recovery of virtual machines. For example, the most important virtual machines with the lowest RTO are typically placed in the first priority group and less important virtual machines in subsequent priority groups. Another example is ordering by application tier: database servers could be placed in priority group two, application and middleware servers in priority group three, and client and web servers in priority group four.
When more granularity is needed for startup order, dependencies can be used. A dependency requires that before a virtual machine can start, a specific other virtual machine must already be running. For example, a virtual machine named “acct02” can be configured to have a dependency on a virtual machine named “acct01”; VMware Site Recovery will wait until “acct01” starts before powering on “acct02”. VMware Tools heartbeats are used to validate when a virtual machine has started successfully.
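The interaction of priority groups and dependencies can be modeled as a sequence of startup "waves". The sketch below is a conceptual model under stated assumptions, not the product's scheduler: the function name is hypothetical, a dependency is treated as satisfied once the other virtual machine has appeared in an earlier wave, and a dependency that can never be satisfied is reported as an error.

```python
def startup_waves(priorities, depends_on):
    """Group VMs into ordered startup waves.

    priorities: dict mapping VM name -> priority group (1 to 5).
    depends_on: dict mapping VM name -> list of VMs that must start first.
    Returns a list of waves; every VM in a wave starts at the same time.
    """
    waves = []
    started = set()
    for level in range(1, 6):  # five priority groups, lowest number first
        pending = {vm for vm, p in priorities.items() if p == level}
        while pending:
            ready = sorted(
                vm for vm in pending
                if all(dep in started for dep in depends_on.get(vm, []))
            )
            if not ready:
                raise ValueError(
                    f"unsatisfiable dependency in priority group {level}")
            waves.append(ready)
            started.update(ready)
            pending.difference_update(ready)
    return waves
```

With "acct01" and "acct02" both in priority group two and "acct02" depending on "acct01", the model starts "acct01" in one wave and "acct02" in the next, before moving on to priority group three.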
Shutdown and Startup Actions
Shutdown actions apply to the protected virtual machines at the protected site during the run of a recovery plan. Shutdown actions are not used during the test of a recovery plan. By default, VMware Site Recovery will issue a guest OS shutdown, which requires VMware Tools and has a time limit of five minutes. The time limit can be modified. If the guest OS shutdown fails and the time limit is reached, the virtual machine is powered off. Shutting down and powering off the protected virtual machines at the protected site when running a recovery plan is important for a few reasons. First, shutting a virtual machine down quiesces the guest OS and applications before the final storage synchronization occurs. Second, it avoids the potential conflict of having virtual machines with duplicate network configurations on the same network.
Optionally, the shutdown action can be changed to simply power off virtual machines. Powering off virtual machines does not shut them down gracefully, but this option can reduce recovery times in situations where the protected site and recovery site maintain network connectivity during the run (not test) of a recovery plan. An example of this is a disaster avoidance scenario.
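The shutdown behavior described above amounts to a small decision rule. The following sketch models it with hypothetical names and string values; it assumes that a guest shutdown cannot succeed without VMware Tools and that a shutdown which never completes is represented by `None`.

```python
DEFAULT_TIME_LIMIT = 300  # five minutes, in seconds


def shutdown_outcome(action="guest_shutdown", tools_running=True,
                     seconds_to_shut_down=None, time_limit=DEFAULT_TIME_LIMIT):
    """Return how a protected VM ends up being stopped.

    action: "guest_shutdown" (the default) or "power_off".
    seconds_to_shut_down: how long the guest takes to shut down,
        or None if it never completes.
    """
    if action == "power_off":
        return "powered off"
    completed_in_time = (
        tools_running
        and seconds_to_shut_down is not None
        and seconds_to_shut_down <= time_limit
    )
    # Fall back to a power-off when the graceful shutdown does not finish.
    return "shut down gracefully" if completed_in_time else "powered off"
```

A guest that shuts down in 45 seconds is stopped gracefully; one that hangs, or that lacks VMware Tools, is powered off once the time limit is reached.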
A startup action applies to a virtual machine that is recovered by VMware Site Recovery. Powering on a virtual machine after it is recovered is the default setting. In some cases, it might be desirable to recover a virtual machine, but leave it powered off. Startup actions are applied when a recovery plan is tested or run.
Pre and Post Power On Steps
As part of a recovery plan, VMware Site Recovery can run a command on a recovered virtual machine after powering it on. A common use case is calling a script to perform actions such as making changes to DNS and modifying application settings on a physical server. VMware Site Recovery can also display a visual prompt before or after any step in the recovery plan. This prompt might be used to remind an operator to place a call to an application owner, modify the configuration of a router, or verify the status of a physical machine.
The most commonly modified virtual machine recovery property is IP customization. The majority of organizations have different IP address ranges at the protected and recovery sites. When a virtual machine is failed over, VMware Site Recovery can automatically change the network configuration (IP address, default gateway, etc.) of the virtual network interface card(s) in the virtual machine. This functionality is available in both failover and failback operations.
There are multiple IP customization modes in VMware Site Recovery. For example, it is possible to create an IP customization rule that maps one range of IP addresses to another. In this example, an administrator has mapped 10.10.10.0/24 to 192.168.100.0/24.
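A range-to-range mapping like this one can be illustrated with Python's standard `ipaddress` module. This is a sketch of the idea only: the function name is hypothetical, and it assumes the rule keeps the host portion of the address and swaps the network portion, with both ranges the same size.

```python
import ipaddress


def map_ip(address, source_cidr, target_cidr):
    """Translate an address from the source range to the target range.

    The offset of the address inside the source network is preserved.
    """
    source = ipaddress.ip_network(source_cidr)
    target = ipaddress.ip_network(target_cidr)
    ip = ipaddress.ip_address(address)
    if source.prefixlen != target.prefixlen:
        raise ValueError("source and target ranges must be the same size")
    if ip not in source:
        raise ValueError(f"{address} is not in {source_cidr}")
    offset = int(ip) - int(source.network_address)
    return str(target.network_address + offset)
```

Under these assumptions, `map_ip("10.10.10.25", "10.10.10.0/24", "192.168.100.0/24")` yields `192.168.100.25`.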
Testing and Cleanup
After creating a recovery plan, it is beneficial to test the recovery plan to verify it works as expected. VMware Site Recovery features a non-disruptive testing mechanism to facilitate testing at any time. It is common for an organization to test a recovery plan multiple times after creation to resolve any issues encountered the first time the recovery plan was tested.
When testing a recovery plan, there is an option to replicate recent changes, which is enabled by default. Replicating recent changes provides the latest data for the testing process and is useful for testing a disaster avoidance scenario. However, it also lengthens the time required to recover the virtual machines in the recovery plan, because replication has to finish before the virtual machines are recovered. Unchecking the option to replicate recent changes provides a more realistic disaster recovery test.
A question often asked is whether replication continues during the test of a recovery plan. The answer is yes. vSphere Replication utilizes virtual machine snapshots at the recovery site as part of the recovery plan test process. This approach allows powering on and modifying virtual machines recovered as part of the test while replication continues to avoid RPO violations.
To keep test networks isolated, VMware Site Recovery supports two different options. The first is to use networks that are automatically created on each host. This has the advantage of not requiring any additional networking configuration. However, because this option limits VM connectivity, it is best used for testing the function of the recovery plan, not for testing application functionality.
The second network testing option is manually created test network(s) that are configured to duplicate production networks at the recovery site without a connection to the production portion of the network. This is easily possible at the VMware Cloud on AWS site through the tight integration with VMware NSX. See the administration and configuration guide for details about configuring this. This option requires additional configuration upfront and provides the ability to fully test the functionality of both the recovery plan and the application.
At this point, guest operating system administrators and application owners can log into their recovered virtual machines to verify functionality, perform additional testing, and so on. VMware Site Recovery easily supports recovery plan testing periods of varying lengths - from a few minutes to several days. However, longer tests tend to consume more storage capacity at the recovery site. This is due to the nature of snapshot growth as data is written to the snapshot.
When testing is complete, a recovery plan must be “cleaned up”. This operation powers off virtual machines and removes snapshots associated with the test. Once the cleanup workflow is finished, the recovery plan is ready for testing or running.
Planned Migration and Disaster Recovery
Running a recovery plan differs from testing a recovery plan. Testing a recovery plan does not disrupt virtual machines at the protected site. When running a recovery plan, VMware Site Recovery will attempt to shut down virtual machines at the protected site before the recovery process begins at the recovery site. Recovery plans are run when a disaster has occurred and failover is required or when a planned migration is desired.
Clicking the Run Recovery Plan button opens a confirmation window requiring the selection of a recovery type: either a planned migration or a disaster recovery. In both cases, VMware Site Recovery will attempt to replicate recent changes from the protected site to the recovery site. For a planned migration, it is assumed that avoiding any loss of data is the priority.
A planned migration will be canceled if errors in the workflow are encountered. For disaster recovery, the priority is recovering workloads as quickly as possible after disaster strikes. A disaster recovery workflow will continue even if errors occur. The default selection is a planned migration.
After a recovery type is selected, the operator must also select a confirmation checkbox as an additional safety measure. The idea behind this checkbox is to make sure the operator knows that he or she is running (not testing) a recovery plan.
The first step in running a recovery plan is the attempt to synchronize the virtual machine storage. Then, protected virtual machines at the protected site are shut down. This effectively quiesces the virtual machines and commits any final changes to disk as the virtual machines complete the shutdown process. Storage is synchronized again to replicate any changes made during the shutdown of the virtual machines. Replication is performed twice to minimize downtime and data loss.
If the protected/customer site is offline due to a disaster, for example, the disaster recovery type should be selected. VMware Site Recovery will still attempt to synchronize storage as described in the previous paragraph. Because the protected site is offline, that attempt cannot succeed, so VMware Site Recovery proceeds to recover virtual machines at the recovery site using the most recently replicated data.
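The difference between the two recovery types can be summarized in a short simulation of the steps described above. This is a simplified model with hypothetical names: it assumes every step that touches the protected site fails when that site is offline, that a planned migration is canceled at the first error, and that a disaster recovery continues past errors.

```python
def run_recovery_plan(recovery_type, protected_site_online):
    """Simulate a recovery plan run and return (step, outcome) pairs.

    recovery_type: "planned_migration" or "disaster_recovery".
    """
    if recovery_type not in ("planned_migration", "disaster_recovery"):
        raise ValueError(f"unknown recovery type: {recovery_type}")

    steps = []
    protected_site_steps = (
        "synchronize storage",
        "shut down protected VMs",
        "synchronize storage again",
    )
    for name in protected_site_steps:
        ok = protected_site_online  # these steps need the protected site
        steps.append((name, "ok" if ok else "failed"))
        if not ok and recovery_type == "planned_migration":
            # A planned migration stops rather than risk data loss.
            steps.append(("recover VMs at recovery site", "canceled"))
            return steps
    # Disaster recovery reaches this point even after failed steps.
    steps.append(("recover VMs at recovery site", "ok"))
    return steps
```

In this model, a disaster recovery run against an offline protected site records three failed steps and still recovers the virtual machines from the most recently replicated data, whereas a planned migration in the same situation is canceled after the first failed synchronization.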
Re-Protect and Failback
VMware Site Recovery features the ability not only to fail over virtual machine workloads, but also to fail them back to their original site. However, this assumes that the original protected site is still intact and operational. An example of this is a disaster avoidance situation: the threat could be rising floodwaters from a major storm, and VMware Site Recovery is used to migrate virtual machines from the protected site to the recovery site. Fortunately, the floodwater subsides before any damage is done, leaving the protected site unharmed.
A recovery plan cannot be immediately failed back from the recovery site to the original protected site. The recovery plan must first undergo a re-protect workflow. This operation involves reversing replication and setting up the recovery plan to run in the opposite direction.
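The ordering constraints on test, cleanup, run, and re-protect can be pictured as a small state machine. The class below is a hypothetical illustration: the state names are simplified labels chosen for this sketch and are not the product's actual plan states.

```python
from dataclasses import dataclass


@dataclass
class RecoveryPlan:
    protected_site: str
    recovery_site: str
    state: str = "ready"

    def _require(self, expected, action):
        if self.state != expected:
            raise RuntimeError(f"cannot {action} while plan is {self.state}")

    def test(self):
        self._require("ready", "test")
        self.state = "test complete"

    def cleanup(self):
        self._require("test complete", "clean up")
        self.state = "ready"

    def run(self):
        self._require("ready", "run")
        self.state = "recovered"

    def reprotect(self):
        # Reverse the replication direction so the plan can run the other way.
        self._require("recovered", "re-protect")
        self.protected_site, self.recovery_site = (
            self.recovery_site, self.protected_site)
        self.state = "ready"
```

In this model a failback is simply `run()`, `reprotect()`, `run()`: the second run is rejected until the re-protect step has swapped the two sites. A further `reprotect()` restores protection in the original direction.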
When workflows such as a recovery plan test and cleanup are performed in VMware Site Recovery, history reports are automatically generated. These reports document items such as the workflow name, execution times, successful operations, failures, and error messages. History reports are useful for a number of reasons including internal auditing, proof of disaster recovery protection for regulatory requirements, and troubleshooting. Reports can be exported to HTML, XML, CSV, or a Microsoft Excel or Word document.
For more information about VMware Site Recovery for VMware Cloud on AWS, please visit the product pages. Below are links to documentation and other resources:
Product Documentation (includes Install Guide, Administration Guide, and more)
Recovery time objective (RTO): The targeted amount of time a business process should be restored after a disaster or disruption in order to avoid unacceptable consequences associated with a break in business continuity.
Recovery point objective (RPO): The maximum age of files recovered from backup storage for normal operations to resume if a system goes offline as a result of a hardware, program, or communications failure.
Protected site: Site that contains protected virtual machines. This can be either the customer's datacenter or VMware Cloud on AWS.
Recovery site: Site where protected virtual machines are recovered in the event of a failover. This can be either the customer's datacenter or VMware Cloud on AWS.
Note: It is possible for the same site to serve as a protected site and recovery site when replication is occurring in both directions and VMware Site Recovery is protecting virtual machines at both sites.