Failover, Failback, and Cleanup of VMware Site Recovery for Google Cloud VMware Engine
Planned Migration and Disaster Recovery
When you run a planned migration, VMware Site Recovery Manager gives you the option to sync the latest data or to restore to a previously replicated state. Recovery plans attempt to shut down virtual machines at the protected site before the recovery process begins at the recovery site. A recovery plan is run either when a disaster has occurred and failover is required, or when a planned migration is desired.
When you click Run Recovery Plan on the VMC/SDDC console, you must choose between planned migration and disaster recovery.
In both cases, VMware Site Recovery Manager attempts to replicate recent changes from the protected site to the recovery site. For a planned migration, the priority is assumed to be zero data loss.
A planned migration will be canceled if errors in the workflow are encountered. For disaster recovery, the priority is recovering workloads as quickly as possible after disaster strikes. A disaster recovery workflow will continue even if errors occur. The default selection is a planned migration, which includes the following steps:
- Attempt to synchronize the virtual machine storage.
- Shut down the protected virtual machines. This effectively quiesces the virtual machines and commits any final changes to disk as they complete the shutdown process.
- Synchronize storage again to replicate any changes made during the shutdown of the virtual machines.
Replication is performed twice to minimize downtime and data loss.
Note: During a planned migration, after the online replication sync completes and the source VMs are shut down, SRM explicitly disables the power-on operation on them. There is no rollback for a planned migration. The only workaround to power on the original VMs at the source site is to remove them from the vCenter inventory and re-register them, which effectively loses the VM identity (tasks, events, replication configuration, and SRM configuration).
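The difference in error handling between the two modes can be illustrated with a small simulation. This is a minimal sketch in Python, not SRM code: the function names, the step list, and the final "power on VMs at recovery site" step are hypothetical and exist only to show that a planned migration stops at the first error while disaster recovery keeps going.

```python
from typing import Callable, List, Tuple

# Hypothetical mode names; they are not SRM identifiers.
PLANNED_MIGRATION = "planned_migration"
DISASTER_RECOVERY = "disaster_recovery"

Step = Tuple[str, Callable[[], None]]


class StepError(Exception):
    """Raised by a simulated workflow step that cannot complete."""


def run_recovery_plan(mode: str, steps: List[Step]) -> List[str]:
    """Run the steps in order and return a log of what happened.

    A planned migration is canceled at the first failing step.
    Disaster recovery records the failure and continues.
    """
    if mode not in (PLANNED_MIGRATION, DISASTER_RECOVERY):
        raise ValueError(f"unknown mode: {mode}")
    log: List[str] = []
    for name, action in steps:
        try:
            action()
        except StepError as exc:
            log.append(f"{name}: FAILED ({exc})")
            if mode == PLANNED_MIGRATION:
                log.append("plan canceled")
                return log
            continue
        log.append(f"{name}: ok")
    log.append("plan completed")
    return log


def build_steps(protected_site_up: bool) -> List[Step]:
    """Build the illustrative step list for one run."""

    def on_protected_site() -> None:
        # Any step that touches the protected site fails when it is down.
        if not protected_site_up:
            raise StepError("protected site unreachable")

    def on_recovery_site() -> None:
        pass  # Assumed to succeed in this sketch.

    return [
        ("synchronize storage", on_protected_site),
        ("shut down protected VMs", on_protected_site),
        ("synchronize storage again", on_protected_site),
        ("power on VMs at recovery site", on_recovery_site),
    ]


if __name__ == "__main__":
    for mode in (PLANNED_MIGRATION, DISASTER_RECOVERY):
        print(mode)
        for entry in run_recovery_plan(mode, build_steps(protected_site_up=False)):
            print("  " + entry)
```

With the protected site down, the planned migration in this sketch is canceled after the first failed synchronization, whereas the disaster recovery run logs the three failures and still brings the VMs up at the recovery site.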
Reprotect and Failback
After the failover of the VMs to the DR site is complete and workloads are running as usual, you must ensure that the primary site is up and running, and then replicate the latest copy of these workloads back to the production (primary) site.
SRM provides a feature called reprotect, which is used once the primary site is ready to receive the latest changes to the workload VMs from the DR site.
Use reprotect to sync the latest data from the recovery site before getting workloads running again on the primary site.
Consider a use case where rising floodwaters from a major storm threaten the primary site. Using VMware Site Recovery Manager, you can migrate the virtual machines from the protected site to the recovery site. When the primary site is back, you can sync the latest changes by reversing the replication direction, from the recovery site to the primary site. After all data is replicated, you can fail back the workloads to the original site.
A recovery plan cannot be immediately failed back from the recovery site to the original protected site. The recovery plan must first undergo a reprotect workflow. This operation involves reversing replication and setting up the recovery plan to run in the opposite direction.
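The ordering constraint described above (recover, then reprotect, then fail back) can be pictured as a small state machine. The following Python sketch is illustrative only: the `RecoveryPlan` class, its state names, and the site labels are hypothetical and do not correspond to any SRM API.

```python
from dataclasses import dataclass


@dataclass
class RecoveryPlan:
    """Toy model of a recovery plan's direction and state (hypothetical)."""

    protected_site: str
    recovery_site: str
    state: str = "ready"  # "ready" or "recovered"

    @property
    def workloads_at(self) -> str:
        """Site where the workloads currently run in this model."""
        return self.recovery_site if self.state == "recovered" else self.protected_site

    def run(self) -> None:
        """Fail over (or fail back) in the plan's current direction."""
        if self.state != "ready":
            raise RuntimeError("plan must be reprotected before it can run again")
        self.state = "recovered"

    def reprotect(self) -> None:
        """Reverse replication so the plan can run in the opposite direction."""
        if self.state != "recovered":
            raise RuntimeError("reprotect applies only after a recovery")
        self.protected_site, self.recovery_site = (
            self.recovery_site,
            self.protected_site,
        )
        self.state = "ready"


if __name__ == "__main__":
    plan = RecoveryPlan(protected_site="primary", recovery_site="dr")
    plan.run()        # failover: workloads now run at "dr"
    plan.reprotect()  # replication now flows from "dr" to "primary"
    plan.run()        # failback: workloads run at "primary" again
    print(plan.workloads_at)
```

In this model, calling `run()` a second time without `reprotect()` raises an error, which mirrors the rule that a plan cannot be failed back immediately after a recovery.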
Testing and Cleanup
After creating a recovery plan, it is beneficial to test the recovery plan to verify it works as expected. VMware Site Recovery Manager features a non-disruptive testing mechanism to facilitate testing at any time. It is common for an organization to test a recovery plan multiple times after creation to resolve any issues encountered the first time the recovery plan was tested.
When testing is complete, a recovery plan must be “cleaned up”. This operation powers off virtual machines and removes snapshots associated with the test. Once the cleanup workflow is finished, the recovery plan is ready for testing or running.
Note: When a test cleanup is performed, all accumulated replication data is consolidated with the replica base disks. This consolidation can take time, and it can increase RTO if a disaster occurs while it is still running.
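The test and cleanup cycle can be sketched in the same way. The Python model below is a simplified illustration under stated assumptions: the `PlanUnderTest` class is hypothetical, and it assumes one test snapshot per virtual machine, which the text above does not specify.

```python
from dataclasses import dataclass


@dataclass
class PlanUnderTest:
    """Toy model of the test/cleanup cycle (hypothetical, not an SRM API)."""

    vm_count: int
    state: str = "ready"  # "ready" or "test_complete"
    test_vms_powered_on: bool = False
    test_snapshots: int = 0

    def run_test(self) -> None:
        """Start a non-disruptive test of the plan."""
        if self.state != "ready":
            raise RuntimeError("clean up the previous test first")
        self.test_vms_powered_on = True
        # Assumption for illustration: one test snapshot per VM.
        self.test_snapshots = self.vm_count
        self.state = "test_complete"

    def cleanup(self) -> None:
        """Power off the test VMs and remove the test snapshots."""
        if self.state != "test_complete":
            raise RuntimeError("there is no test to clean up")
        self.test_vms_powered_on = False
        self.test_snapshots = 0
        self.state = "ready"


if __name__ == "__main__":
    plan = PlanUnderTest(vm_count=3)
    plan.run_test()
    plan.cleanup()
    print(plan.state)
```

The model makes cleanup a precondition for the next test, matching the statement that a plan is ready for testing or running only after the cleanup workflow finishes. It does not model the consolidation time mentioned in the note.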