VMware Cloud DR Ransomware Recovery Guide
Introduction
It’s estimated a ransomware attack targets a business every 11 seconds[1]. On a global scale, experts predict ransomware will cause $20 billion (USD) in damages in 2021[2]. As a result, it’s no surprise that ransomware protection has become an urgent business imperative for organizations, large and small, around the world.
A robust ransomware protection plan must include both preventative and recovery measures. Preventative measures attempt to keep ransomware from getting into the IT environment in the first place. However, for those attacks that do get in, preventative measures are also designed to contain and eradicate the infection before it can cause widespread damage. Because no preventative measures can be 100% successful in perpetuity, organizations must have recovery measures in place. Recovery measures need to provide organizations a reliable, easy-to-use, cost-effective way to fully recover their business-critical applications and data, so they can return their operations to a normal state as soon as possible after an attack.
This guide focuses on the recovery measures that organizations need to consider. Starting with a brief overview of VMware’s holistic approach to ransomware protection, the guide provides practical set-up and recovery steps for VMware Cloud DR that IT professionals responsible for disaster recovery (DR), business continuity, or cybersecurity can use to prepare their organization to better recover from an attack.
[1] https://cybersecurityventures.com/cybercrime-damage-costs-10-trillion-by-2025/
[2] https://cybersecurityventures.com/cybercrime-damage-costs-10-trillion-by-2025/
Fighting Back Against Ransomware with VMware
The saying goes, “the best defense is a good offense”. To defend against ransomware, organizations need to go on the offensive and proactively implement measures that help them maintain their operations in the event of an attack. The VMware whitepaper, “Ransomware: Defense in Depth with VMware”, provides a comprehensive overview of how organizations can implement a robust ransomware protection plan with VMware solutions, following guidelines set out by the National Institute of Standards and Technology (NIST). It describes the five key stages of a ransomware protection cycle and the essential activities of each stage, detailed below:
STAGE | ESSENTIAL ACTIVITIES |
1 | Review industry and vendor resiliency best practices; Conduct vulnerability assessments; Establish incident response; Align security processes with DR processes |
2 | Map applications; Define service level agreements (SLAs) and the level of recovery granularity required (replication intervals); Set-up DR capabilities; Set-up next-generation anti-virus (NGAV) |
3 | Detect the threat; Investigate; Triage the threat; Visualize the attack sequence to understand the threat |
4 | Isolate the threat; Conduct proactive threat hunting to uncover other risks; Identify a recovery point; Failover to a clean isolated recovery environment |
5 | Audit and remediate systems; Prevent reinfection; Failback to a clean production site; Return to a normal state of business |
While VMware can help organizations address every stage, this guide drills down into stage five. It explores how organizations can use VMware Cloud DR, VMware’s comprehensive Disaster Recovery as a Service (DRaaS) solution, for the critical recovery phase of the ransomware protection cycle.
Overview of VMware Cloud DR
VMware Cloud Disaster Recovery (DR) is an on-demand disaster recovery service which provides an easy-to-use Software-as-a-Service (SaaS) solution with cloud economics that keeps disaster recovery costs under control. VMware Cloud DR can be used to protect vSphere virtual machines (VMs) by replicating them to the cloud and recovering them, as needed, to a target VMware Cloud Software Defined Data Center (SDDC) on VMware Cloud on AWS. An organization can create the target "recovery" SDDC immediately prior to performing a recovery; it does not need to be provisioned to support replications in the steady state.
Some of the key capabilities of VMware Cloud DR include:
- Secure, immutable backup copies: VMware’s backup copies are operationally air-gapped: the scale-out cloud filesystem is kept separate from the production environment, so ransomware leakage to the backup storage system is impossible. No existing data is ever overwritten, making the underlying log structured filesystem inherently immutable. For extra security, VMware provides separate authentication and role-based access control (RBAC) for the production environments.
- Deep history of backup copies: VMware provides recovery point histories that are minutes, months, even years old to enable organizations to fully recover, even if ransomware has been in the environment for a long time.
- On-demand creation of isolated recovery environments (IRE): VMware’s elastic cloud capacity can be used to easily provision greenfield, clean operating environments for validation, and later, recovery. These environments prevent reinfection during the validation process when administrators need to work with potentially compromised backups. They also buy organizations time, providing a clean recovery site they can rely on in the event of an attack, so they can remediate existing sites and perform appropriate forensics without having to rush to restore service. See Section 4.1 for more details.
- Convenient, efficient, and iterative validations (rapid experiments): Instant VM power-on enables organizations to complete the critical, iterative process associated with identifying which recovery points, VMs, or data to restore into production. With VMware, organizations aren’t required to always copy data from the backup storage system to the primary storage system. See Section 4.2 for more details.
- File and folder-level recovery capabilities: Organizations can use VMware Cloud DR to extract specific files or folders from more recent recovery points as part of the recovery process. This extraction can be performed without bringing the associated VM into inventory and powering it on. (coming soon)
- At-scale recovery with highly automated DR workflows and orchestration: VMware’s powerful orchestration engine has tight integration to source environments, failover environments, and the scale-out cloud filesystem to enable the recovery of 100s of VMs at a time following a very prescriptive recovery plan.
Getting Started - Key Concepts for Ransomware Recovery
Before the rapid ransomware recovery stage can be accomplished, organizations need to do some pre-work to make sure everything will run as it should when the worst happens. This pre-work includes scoping and building out a disaster recovery (DR) plan. When looking to create a ransomware recovery plan, organizations need to consider the following:
- An Isolated Recovery Environment (IRE)
- Recovery Point Validation
- Backup Data and Application Cleaning
Isolated Recovery Environment
To be prepared, organizations must work off the premise that their backup data has been infected. Most likely, they will not know the precise moment of an infection, so they often assume they should restore their backup data from a point in time before it was encrypted in the production environment. Unfortunately, this might end up re-introducing ransomware into that environment, causing more harm than good.
An alternative is to restore backup data to an isolated recovery environment (IRE), so the ransomware can be fully remediated before migrating any virtual machines back into a production environment. An IRE provides a staging area for restored virtual machines that is isolated from other networks. This means the remediation process can occur without external ransomware triggers and without the risk of infecting other workloads.
VMware Cloud services, such as VMware Cloud on AWS, make it easy to quickly build an IRE. A 2-node VMware Cloud software-defined data center (SDDC) can be the IRE, giving organizations the ability to easily add more hosts for larger recovery and remediation efforts, as needed. The creation and deletion of an SDDC for recovery efforts can be performed in the VMware Cloud DR UI. This simplifies the entire disaster recovery workflow and minimizes any risk or downtime.
Once the remediation process is complete, workloads can be migrated back into the original production site without fear of reintroducing the ransomware. When finished with the IRE, it can be deleted. The automated processes of creating and deleting an SDDC help minimize ransomware recovery times and the cost of maintaining an IRE—organizations simply create and pay for an IRE only when they need it.
Another option is to use a greenfield VMware Cloud on AWS environment to recover workloads and run applications. This option removes the pressure on the IT organization to quickly return an impacted data center to service. It buys the organization some time, so they can conduct the appropriate forensics on the existing infrastructure without prematurely eliminating points of inspection in a rush to recover. For example, typically the only way to recover VMs from an on-site backup would be to delete the infected machines from the datastore to make SAN space available for a restore. Once deleted, all the forensic data on those VMs is lost. A greenfield, clean operating environment eliminates this necessity, allowing organizations to make workloads and applications available in the cloud to keep business running, while they investigate and remediate the attack.
Recovery Point Validation
Part of the ransomware recovery process involves locating a valid restore point that an organization can roll back to for the best recovery option. This point must be either before the infection happened or before the data has been encrypted (before the ransomware key has been pulled or popped up), so it can be extracted.
As a result, it’s not as simple as restoring the most recent backup data, because it is possible the ransomware has been in the environment or encrypted the data without being detected for several hours, days, or possibly even weeks. This means organizations may need to use an older restore point to be certain the data isn’t compromised. To accomplish this, the recovery solution must have robust data retention capabilities that can support hundreds of restore points, ranging from just a few hours to several months in age.
Finding a valid recovery point might require multiple recovery operations before the organization locates data that is not encrypted. Even if the organization locates a copy of the data that is not encrypted, it is possible that the recovery point still contains the ransomware. As a result, organizations may need to quickly restore and validate multiple recovery points, from a handful to a dozen or more, to be certain they have a clean one they can use to recover.
VMware Cloud DR provides an instant virtual machine power-on capability that can simplify and accelerate this restore and validation process. Virtual machines can be started directly from the scale-out cloud filesystem, which stores backup copies at low cost in a steady state, so organizations can instantly run workloads when needed. There is no need to assemble or migrate data into a usable image before powering on a virtual machine. This reduces the time spent locating valid recovery points, which results in faster recovery times. For more information about the scale-out cloud filesystem, see the blog post “Rapid Ransomware Recovery Needs a New Type of Filesystem”.
Naturally, the restoration and use of data from an older recovery point means the organization will lose any changes that occurred to that data after the backup. The main objective is to find a valid recovery point that minimizes this data loss. This means finding the most recent copy of a virtual machine that can be remediated—either by removing the ransomware or by extracting business-critical data for use in another machine. As with any restore process for data protection and disaster recovery, recovery point validation procedures should be well documented and practiced routinely to minimize recovery times when an actual ransomware attack occurs.
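As a rough illustration of this trade-off, the following Python sketch picks the most recent recovery point that passed validation and reports the resulting data-loss window. The `RecoveryPoint` type, its fields, and the sample timestamps are hypothetical and are not part of VMware Cloud DR; the clean/infected verdicts are assumed to come from the security team's own validation in the IRE.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass(frozen=True)
class RecoveryPoint:
    """Hypothetical record for one snapshot of a VM (not a VMware Cloud DR type)."""
    taken_at: datetime
    is_clean: bool  # verdict from validation in the IRE (assumed input)


def most_recent_clean(points: List[RecoveryPoint]) -> Optional[RecoveryPoint]:
    """Return the newest recovery point that passed validation, or None."""
    clean = [p for p in points if p.is_clean]
    return max(clean, key=lambda p: p.taken_at, default=None)


def data_loss_window(chosen: RecoveryPoint, incident_time: datetime) -> timedelta:
    """Changes made between the chosen snapshot and the incident are lost."""
    return incident_time - chosen.taken_at


# Made-up sample history: the two newest snapshots failed validation.
incident = datetime(2021, 6, 10, 12, 0)
history = [
    RecoveryPoint(datetime(2021, 6, 10, 8, 0), is_clean=False),
    RecoveryPoint(datetime(2021, 6, 9, 20, 0), is_clean=False),
    RecoveryPoint(datetime(2021, 6, 8, 20, 0), is_clean=True),
    RecoveryPoint(datetime(2021, 6, 1, 20, 0), is_clean=True),
]

chosen = most_recent_clean(history)
if chosen is not None:
    lost = data_loss_window(chosen, incident)
    print(f"Recover from {chosen.taken_at}; changes from the last {lost} are lost")
```

Choosing the newest clean point rather than the oldest keeps the data-loss window as small as the validation results allow.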
Backup Data and Application Cleaning
As previously mentioned, organizations work on the assumption that backup data is also infected with ransomware. As a result, they cannot simply recover an older copy of an infected virtual machine back into production, for fear that it will reinfect the environment. Instead, they will need to ensure the data and applications are clean before moving back into production, which requires the following:
- One or more IREs to stage recovered workloads for remediation.
- A recovery solution that has robust retention capabilities to give the organization multiple recovery points that can be quickly deployed in the IRE(s).
- Tools to detect and remove a wide variety of ransomware variants from workloads.
- Documented procedures to quickly:
  - Restore virtual machines into the IRE.
  - Locate valid recovery points for remediation.
  - Extract data from a recovered workload.
In addition to the recovery capabilities of VMware Cloud DR, VMware can help organizations detect and remove ransomware with VMware Carbon Black, which provides endpoint protection against many ransomware variants. To ensure organizations benefit from the latest list of known ransomware, Carbon Black can use an updated signature list at the time of recovery to detect any new variants that might not have been known when the backup was originally performed. Carbon Black also includes malware behavioral detection to further improve its ability to uncover and address ransomware.
Detailed Recovery Procedures
While recovery from ransomware has many similarities with traditional disaster recovery, there are also important differences that necessitate developing specialized recovery workflows. This section covers the four main stages relevant to both recovery scenarios:
- Preparation – making sure all the steps have been taken to enable a broad scope of recovery options in the event of a ransomware attack
- Response – being ready to declare a ransomware disaster and initiate recovery tasks, if initial lines of defense end up being insufficient
- Recovery (phase 1) – failing over production workloads from the ransomed site into the IRE for further processing
- Recovery (phase 2) – failing back the remediated production workloads from the IRE back into a clean production site
Preparation
Proper preparation of the environment, staff, and procedures will make invoking an organization’s disaster recovery solution easier and more reliable for both traditional disasters and ransomware.
User and Roles
In addition to the typical disaster recovery roles and users involved (e.g., application owners, VI admins, DR admins, operations coordinators, etc.), organizations will need to include security team members and networking specialists to help with ransomware recovery events. These team members should be included early in the planning and testing activities, so they understand the processes and can set any additional requirements for the recovery activities.
The security team’s role is to make sure the malware can be identified and removed, if possible, as well as help find and validate viable recovery points. When VMs are recovered into the IRE, they may need to be further “cleaned” and scanned before being put back into service. The security team has the expertise, tools, and processes needed to address the compromises in both the protected site and the IRE.
The networking team’s role is to help ensure the necessary isolation steps are taken to prevent further spread, while providing access to the networked resources required to perform relevant recovery tasks, both within the production environment and IRE. Inclusion of the networking team ensures organizations follow the appropriate formal processes and change control activities when setting up and configuring the networks for the protected site and recovery site. Organizations want to follow best practices, but they can’t afford unplanned waiting for approvals during a disaster recovery event, so bringing the networking team into the DR process can help avoid unnecessary, and potentially costly, delays.
Protected Sites Workload Considerations
Organizations should consider which sites and applications they want to protect and recover. When reviewing the specifics of each, they should try to identify any baseline VM templates or golden images that might be useful to add to the inventory. VMware Cloud DR does not protect templates at this time, so they will need to be converted to VM instances to be included in the inventory that is replicated to the SCFS.
These baseline images often provide an ideal reference point for recovery comparison, as well as a good working baseline from which the organization can start remediation tasks. Another benefit is that templates and golden images are typically not running 24x7 or part of the operating production environment, so they are less likely to be affected by a ransomware attack.
In addition, organizations should review the current Backup Considerations for their protected sites to make sure the workloads they want to protect fit within the VMware Cloud DR capabilities.
Protection Groups Structure Considerations
Organizations need to establish Protection Groups, which set the context for the VM inventory that should be recovered in a DR plan. The Protection Groups define the inventory included in that group, along with the schedules for automated snapshots. A snapshot is VMware Cloud DR’s construct for a point-in-time backup, which will become the system’s "recovery points" for use in both test and actual recoveries.
When defining a DR plan, the organization will need to assign actions for the inventory and each VM within the Protection Group. The level of granularity of the actions should match the type of recovery that is needed. Since a VM can belong to more than one Protection Group, there may be some cases where an organization will need a DR-oriented Protection Group that offers broader coverage to be used for simpler recovery. They may also need a few dedicated ransomware-oriented Protection Groups that can assist in the more complicated data merging recovery efforts associated with recovering from an attack.
The schedules define the frequency of the backups to meet the recovery point objective (RPO) and the duration of the snapshot retention in the scale-out cloud filesystem (SCFS), which will be available to support the failover actions. VMware Cloud DR can maintain up to 2,000 snapshots per Protection Group, so it is possible to have near-term Recovery Points (e.g., as low as every 30 minutes or 4 hours, depending on the configuration), as well as long-term retention for weeks or months. For example, a ransomware Protection Group schedule policy might look like this:
- Keep at least 6 snapshots per day for 2 days
- Keep daily snapshots for 7 days
- Keep weekly snapshots for 4 weeks
- Keep monthly snapshots for 6 months
This policy contains about 30 Recovery Points (snapshots). Organizations should determine if they need more coverage–either longer retention or more frequency points–and adjust the Protection Group schedules accordingly. Organizations can consult the VMware Config Max online tool for VMware Cloud DR to understand the limits on Protection Group counts and sizes.
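The "about 30" figure can be checked with simple arithmetic. The sketch below is illustrative Python, not a VMware Cloud DR API; the tier labels are made up for the example, and it assumes the retention tiers do not share snapshots. If one snapshot satisfies more than one tier, the real count would be somewhat lower.

```python
# Each tier: (label, snapshots kept per period, number of periods retained).
policy = [
    ("6 per day for 2 days", 6, 2),
    ("daily for 7 days", 1, 7),
    ("weekly for 4 weeks", 1, 4),
    ("monthly for 6 months", 1, 6),
]

MAX_SNAPSHOTS_PER_GROUP = 2000  # per-Protection-Group limit quoted in this guide

# Upper bound: sum the snapshots retained by each tier independently.
total = sum(per_period * periods for _, per_period, periods in policy)
print(f"Upper bound on retained recovery points: {total}")

if total > MAX_SNAPSHOTS_PER_GROUP:
    print("Policy exceeds the per-group snapshot limit; reduce frequency or retention")
```

Re-running the same sum with a candidate schedule is a quick way to see how much headroom remains before adjusting frequency or retention.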
DR Plan Considerations
When dealing with ransomware recovery, the organization will likely need to incorporate more control points in their DR plan to define recovery steps designed to prevent attack propagation and the threat of reinfection. The VMs being restored will need to be checked more closely than a more typical disaster recovery failover. This means, within the DR plans, recovery steps may need to be added to process individual VMs, the inventory of a Protection Group, or another item.
Inserting user input steps is a way for organizations to add checkpoints in the recovery orchestration, giving the recovery administrator prompts to examine the environment manually. For more automated actions, VMware’s script VM capability can be leveraged to run context-specific actions within the IRE. As organizations review the applications that will be candidates for a ransomware recovery use case, they will need to determine the granularity needed for processing these different workloads; this may warrant separate DR plans or more instrumentation of existing plans.
Testing Considerations
Testing DR plans is critical to improving DR readiness. It gives organizations the ability to identify and address potential deficiencies, so they are prepared when the worst happens, regardless of whether it’s an unplanned outage or ransomware attack. During testing tasks, the Recovery SDDC, or IRE, needs to be in place and operational. Organizations may require multiple classes of networks in the IRE to bring infected VMs back into inventory safely; all of which need to be validated, cleaned, and integrated into DR mode operations. They may also need to construct isolated networks to use for VM placement during initial recovery, which need to be verified to ensure they operate as expected.
Organizations can consider using different mappings for test/failover, especially during planning development stages. Folder mapping can be leveraged to provide easier identification (e.g., quarantine folder) of inventory items during testing or partial failovers. The goal is to determine what additional service or utility VMs, specific to what ransomware may impact, are needed in the IRE when it is provisioned. These can be failed over as initial actions during early response activities, or they can be included in the specific DR plans where they may be used, either manually or automatically with script VMs.
When conducting DR plan tests, organizations should keep track of the key (or special case) VMs that will be processed in the event of a ransomware attack. The table below is an example of the information that is useful to track from the preparation phase, into the response phase, and through to the recovery phases.
VM name | Protection Group | DR Plan | Base VM Recovery Point | Data VM Recovery Point | Status | Priority / Order |
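For teams that keep this list in a script or runbook tool rather than a spreadsheet, the columns above map naturally onto a small record type. The following Python sketch is only one possible shape: the class name, the default values, the sample VM names, and the "lower number recovers first" convention are all assumptions, not something VMware Cloud DR defines.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RecoveryRecord:
    """One row of the tracking table; field names mirror its columns."""
    vm_name: str
    protection_group: str
    dr_plan: str
    base_vm_recovery_point: Optional[str] = None  # filled in during the response phase
    data_vm_recovery_point: Optional[str] = None  # only needed for merged recoveries
    status: str = "pending"
    priority: int = 0  # assumed convention: lower number is recovered earlier


def recovery_order(records: List[RecoveryRecord]) -> List[str]:
    """VM names sorted by priority, with ties broken by name for a stable order."""
    ordered = sorted(records, key=lambda r: (r.priority, r.vm_name))
    return [r.vm_name for r in ordered]


# Made-up sample rows.
records = [
    RecoveryRecord("app-01", "pg-ransomware", "plan-app", priority=2),
    RecoveryRecord("db-01", "pg-ransomware", "plan-db", priority=1),
    RecoveryRecord("ad-01", "pg-core", "plan-core", priority=1),
]
print(recovery_order(records))
```

Leaving the recovery point fields optional reflects that they are unknown during preparation and only get recorded once candidates are identified.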
Response
Once ransomware has been detected in the production (protected) environment, there are many steps that need to be taken to safeguard the remaining infrastructure and respond to the attack. In addition to the tasks dictated by the security team and the ransomware defense systems, there are actions organizations can take before they start the VM recovery tasks, which will ultimately improve the overall success of the recovery.
Construct the IRE
If an IRE is not already prepared and running, the organization should deploy and configure the Recovery SDDC that will be used for the IRE. Note, if there is no need to fail over to VMware Cloud on AWS, the SDDC can be scaled down or removed, as necessary.
The isolated networks defined during the preparation stage should now be constructed, and their connectivity isolation should be verified. For a ransomware recovery, it is best to deploy a new, “clean” SDDC for use as the IRE to prevent reinfection. The VMware Cloud DR management interface provides the mechanisms to perform this deployment. Provisioning and configuring the SDDC will take approximately two hours before it can be used as the IRE.
If the SDDC is already in operation (e.g., in Pilot Light mode) and connected to the protected site that has been infected, it may need further analysis and processing by the security and networking teams (e.g., separation, validation, cleaning) before it can be used as the IRE. To clarify, the VMware Cloud DR replication connection cannot be used to infect the SDDC.
Once deployed, the IRE may be used to run VMs that have been fully recovered until the entire recovery process (phase 1) is complete. This allows organizations to operate directly from the recovery site, which gives them some time to address and remediate the affected site in an orderly manner at a more circumspect pace. To support this dual use mode, the organization must have separate networks for IRE recovery and DR operations, as defined above in the testing considerations.
With VMware Cloud DR, the recovery SDDC can also be scaled up after it is created to accommodate fluctuations in incoming DR workloads. Organizations need enough compute capacity to service the VMs, and enough storage capacity to hold the VMs. It is now possible to run some workloads directly from the SCFS datastore, which avoids the storage vMotion to the SDDC vSAN datastore.
It is useful to run testing from the SCFS to expedite plan run times, by eliminating the storage vMotion background step. When running an actual recovery failover, the workloads can be left on the SCFS, if this meets the organization’s operational needs, to avoid the storage vMotion tasks that can take additional time. This approach also reduces some of the SDDC scalability requirements that are driven by storage consumption needs.
For a particular ransomware event, an organization can determine which workload VMs can be run from the SCFS, and which are better on the SDDC vSAN datastore. Note that currently, the ability to run recovery workloads from the SCFS needs to be enabled by VMware support before it can be part of an organization’s recovery strategy—in the future, the organization will be able to enable the capability themselves.
Review and Organize the DR plans
As part of the initial response to ransomware, it is good practice to check the status of the DR components and overall DR plan compliance. In cases where the recovery SDDC was just deployed, DR plans will need to be activated and checked. Organizations will also need to resolve any issues with the DR plans that may prevent a successful failover and eventual failback. For instance, if the protected site has been segmented or disconnected from VMware Cloud DR and the DR site, organizations can expect to receive a DR plan compliance alert notifying them the protected site is unreachable. This will not affect the actual failover process, since the DR plans work primarily from the VMware Cloud DR (UI and SCFS) and VMware Cloud on AWS (SDDC) elements.
The DR plans control the VM power on state at recovery time. Under normal disaster recovery operations, it may be desirable to have the VMs automatically powered on as part of the recovery. For ransomware recovery, it may be better to leave the VMs in a powered off state until they can be scanned or analyzed by other tools to make sure they have not been infected. Organizations can adjust the VM power on controls in the recovery steps of their DR plans to accommodate for ransomware conditions and the subsequent tasks that need to be performed.
Review snapshot status
As described earlier, the Protection Groups run automatically on schedules which are pre-defined by the organization’s policies. At this point in the ransomware response, organizations should review the current snapshot status and make any appropriate adjustments. For example, it might be useful to perform a manual snapshot for Protection Groups whose next cycle is hours away. The snapshots are independent and immutable in the SCFS, so there is no harm in capturing a more current recovery point that may prove useful for data extraction or forensics at a later stage.
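One way to decide where a manual snapshot is worthwhile is to compare each Protection Group's next scheduled run against a waiting threshold. The Python sketch below assumes the next-run times have already been collected by hand or from whatever tooling the organization uses; the function name, the one-hour threshold, and the group names are hypothetical, and nothing here calls VMware Cloud DR.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def groups_needing_manual_snapshot(
    next_scheduled: Dict[str, datetime],
    now: datetime,
    max_wait: timedelta = timedelta(hours=1),
) -> List[str]:
    """Return Protection Group names whose next scheduled snapshot is more than max_wait away."""
    return sorted(name for name, due in next_scheduled.items() if due - now > max_wait)


# Made-up schedule, as it might look at the moment an attack is detected.
now = datetime(2021, 6, 10, 12, 0)
schedule = {
    "pg-databases": datetime(2021, 6, 10, 12, 30),
    "pg-file-servers": datetime(2021, 6, 10, 18, 0),
    "pg-web": datetime(2021, 6, 11, 0, 0),
}
print(groups_needing_manual_snapshot(schedule, now))
```

The threshold is a judgment call: a shorter value captures more current recovery points at the cost of more manual work during the response.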
If the protected site where the ransomware has been detected is disconnected from the VMware Cloud DR cloud components, organizations can stop the snapshot and replication schedules. This may also help with controlling alerts from the environment while other ransomware tasks are being performed. The objective is to confirm where things stand with respect to the available recovery points and any automated actions that are being performed by VMware Cloud DR.
Determine Viable Recovery Points
One of the final activities of the response phase is to identify the recovery points that will be used for the actual recovery (failover and failback) operations. The security team can use their tools to help identify recovery points available in VMware Cloud that are potential candidates, investigating if malware has infected the systems or if there are other concerns that could render the points unusable. The details of their investigations can be recorded in the recovery details list as part of their record. It may be necessary to move the point in time further back than suggested, based on the findings and the availability of recovery point snapshots in the SCFS.
As discussed earlier, organizations can quickly iterate over some of the recovery point candidates, since they are mounted directly from the SCFS and can be quickly examined in the IRE. There may be candidate VMs that can be used to evaluate each selected snapshot.
Organizations should also identify any VMs that may require more complex merged recovery actions. Many of the VMs will likely be able to be recovered from a single recovery point and cleared to be put back into service.
The process an organization can follow looks something like this (also shown in Figure 2 - Recovery Point Iteration Workflow):
for each DR plan needed for ransomware recovery
    for each snapshot identified as a potential candidate
        perform the failover/test
        if (VM health check is good) then commit the failover and exit snapshot iteration
        else rollback this test and try next snapshot
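The loop above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not VMware Cloud DR's orchestration engine: the four callables (`run_test_failover`, `vm_health_ok`, `commit`, `rollback`) are hypothetical hooks that an organization would wire to its own procedures, and the plan and snapshot names are invented.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def pick_recovery_points(
    candidates_by_plan: Dict[str, Iterable[str]],
    run_test_failover: Callable[[str, str], None],
    vm_health_ok: Callable[[str, str], bool],
    commit: Callable[[str, str], None],
    rollback: Callable[[str, str], None],
) -> Dict[str, Optional[str]]:
    """For each DR plan, try candidate snapshots in order and commit the first healthy one."""
    chosen: Dict[str, Optional[str]] = {}
    for plan, snapshots in candidates_by_plan.items():
        chosen[plan] = None  # stays None if no candidate passes the health check
        for snapshot in snapshots:
            run_test_failover(plan, snapshot)
            if vm_health_ok(plan, snapshot):
                commit(plan, snapshot)
                chosen[plan] = snapshot
                break  # exit snapshot iteration for this plan
            rollback(plan, snapshot)  # undo the test and try the next snapshot
    return chosen


# Stand-in callables so the sketch runs on its own; they only record what happened.
log: List[Tuple[str, str, str]] = []
healthy = {("plan-db", "snap-2"), ("plan-web", "snap-1")}
result = pick_recovery_points(
    {"plan-db": ["snap-3", "snap-2", "snap-1"], "plan-web": ["snap-1"], "plan-app": ["snap-9"]},
    run_test_failover=lambda p, s: log.append(("test", p, s)),
    vm_health_ok=lambda p, s: (p, s) in healthy,
    commit=lambda p, s: log.append(("commit", p, s)),
    rollback=lambda p, s: log.append(("rollback", p, s)),
)
print(result)
```

A plan that maps to `None` has exhausted its candidates, which is the signal to look further back in the snapshot history or to consider a merged recovery.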
During this step, as organizations identify candidate snapshots (points in time, or PIT), they should log the desired recovery point in the recovery details list. This information will be used to drive the DR plans and will be captured in subsequent runbooks and documentation for future recovery activities.
Recover (phase 1 - failover to VMC Recovery Site)
As organizations begin the initial VM recovery process, they will have some VMs that follow a simple recovery process and others that will need a more complex treatment. The difference will be driven by which recovery point(s) were chosen and the organization’s tolerance for data loss.
If data loss is tolerable, the user can experiment with different recovery points. With VMware Cloud DR, organizations can use instant restore to locate the most recent backup that is either not infected or can probably be cleansed (e.g., with a tool such as Carbon Black). Once the most recent backup is located, cleansed VMs can be promoted to a Cloud DR target.
A simple recovery would entail selecting a previous recovery point and using that image to return the Guest VM into service after it is processed (cleansed). If no good recent backups are found and data losses cannot be tolerated, a more advanced, but also more time-consuming, recovery flow can be used. The goal is to combine a good, but perhaps older VM backup baseline image with the partial content recovered from infected backups to make up any deltas. This general method is shown in the figure below.
Simple VM Recovery
Through the recovery point iteration and evaluation steps performed during the initial response activities, there may be cases where there is only one viable recovery point to restore the VMs in a Protection Group to a workable point in time. In this situation, a straightforward failover of the VMs in that Protection Group will be performed into the IRE. There, they can be analyzed, processed, and prepared for return to service in the IRE, ready for eventual failback into the production site.
In this situation, the organization can leverage the VM integration script in the DR recovery plan to use the orchestrator to automate some of the tasks on each recovered VM. They can also use the scripts to perform tasks as a separate step across the environment. This helps with larger scale recovery efforts that can leverage the automation capabilities of the orchestrator. For many VMs affected by ransomware, this may be the most expedient method to get them cleared and back into service.
Merged VM Recovery
There are situations where a more complex multi-point recovery is needed for a particular VM due to a lower tolerance for data loss, as discussed above. In these cases, organizations will begin by identifying a baseline candidate VM to use for the failover and subsequent failback. Organizations should choose the best baseline image characteristics for the guest OS as a starting point. Then, from other recovery points in the scale-out cloud filesystem, they can pick the data sets for this VM that are still viable, along with subsets of their data that can be merged into the baseline recovery VM.
For some of these additional recovery points, the exact files may not be fully known, so having access to the entire VM is advantageous. In other cases, there may be specific, well-identified files that can be extracted from a backup set and then applied to the baseline VM. With the current capabilities of VMware Cloud DR, the methods for handling these two scenarios will differ. The figure below shows the basic recovery process for a VM data merge use case.
Identify the Data Merge Recovery Points
Organizations need to determine which Protection Group recovery points are desired for the data merge process. In step 1 in Figure 5, there are three candidates (A, B, and C), with A being the oldest point in time and the one the organization will use as the baseline for the recovered VM. The baseline VM does not have to be the oldest recovery point, but it does have to be the last one to get failed over into the IRE. This is the instance to which the organization will apply the other data; it’s also what will be used for eventual failback to the production site. During the next steps, it’s important that this instance is not disrupted in any way that would preclude a successful failback through VMware Cloud DR methods. More information can be found in the documentation for Running a Failback Plan.
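The ordering rule above (any recovery point may serve as the baseline, but the baseline must be failed over last) can be expressed as a short planning helper. This is a sketch under stated assumptions: `failover_order` is a hypothetical function, and the recovery point labels are simply the A, B, and C candidates from the example.

```python
def failover_order(candidates: list[str], baseline: str) -> list[str]:
    """Order recovery points so the baseline is brought into the IRE last.

    Non-baseline ("data") recovery points keep their original relative order.
    """
    if baseline not in candidates:
        raise ValueError(f"baseline {baseline!r} is not among the candidates")
    data_points = [c for c in candidates if c != baseline]
    return data_points + [baseline]


# Candidates from the example: A is the oldest and serves as the baseline.
order = failover_order(["A", "B", "C"], baseline="A")
```

Here `order` places B and C ahead of A, so the data-bearing recovery points are handled first and the baseline is the final instance brought into the IRE.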
Recover Non-baseline "data" VMs
Sometimes data needs to be extracted from a later instance of the same VM—e.g., when the specific files are not already known or the VM needs to go through an additional cleaning process before it can accept the desired data. In these cases, it is possible to recover the target VM from an alternate recovery point and move it into a temporary configuration.
To get a temporary VM copy, the DR plan must be run using the recovery point containing the desired data set—not the baseline VM—into the SCFS datastore. From there, it can be cloned into an alternate VM instance. An organization can use the DR plan testing method to run the plan and clone the temporary VM before performing the DR plan test clean up action.
This provides the most expedient method to recover a copy of the VM’s data sets from another point in time that can be merged. As an option, the clone operation can direct the copied data into the SCFS. This will avoid any unnecessary consumption of storage in the SDDC vSAN datastore. When all the data merging is complete, the temporary VM can be deleted from the SCFS datastore.
This process can be applied to multiple recovery points if needed; for example, an organization may want to recover Temporary_VM1, Temporary_VM2, etc. The process is shown as the blue lines from step #1 in Figure 5.
Recover Baseline VM
Once any additional VM recovery point instances have been restored from the SCFS and cloned to temporary VM instances, the desired baseline VM recovery point can be brought into the Recovery SDDC’s (IRE) inventory. This is done by running the DR plan failover activity with the desired snapshot and then committing the plan to inventory. This will create a successful runbook report the organization can use for inventory tracking later, if needed.
The process is very similar to a simple VM recovery operation and provides the foundation for the subsequent data merge. The baseline VM will need to go through the analysis, checks, and changes required by the security team before it can be brought into service. It may also need some basic updates applied to become current again. Even so, this approach spares the organization from building a new VM from scratch, which speeds up the overall recovery process and provides a recovery point that can serve as a reference when the VM is returned to service.
Restore Specific Files
There are two methods organizations can use to get more recent data files from other, potentially more recent, recovery points into the baseline VM. These are:
- Transfer files from temporary VM instances: There are several methods organizations can use to transfer data from one or more temporary VMs to the baseline VM. These include using a drive share, FTP/SCP copies, and third-party transfer tools. Before transferring any data from any version of a temporary VM to the baseline VM, it is recommended to check both the VM and the data being transferred for potential malware or corruption. The temporary VMs can be isolated while they are analyzed; once cleared, they can be used to provide more recent data sets for recovery.
- Extract files from SCFS recovery points with guest file restore utilities (coming soon): The ability to extract data sets from the Scale-out Cloud File System (SCFS) exists through VMware Cloud DR guest file restore utilities. These utilities allow a DR administrator to view the VM contents of specific recovery point snapshots and extract specific data sets from those backup images into a cloud-based (S3) location, without having to bring the VM into the SDDC inventory. The extracted files are presented as a zip archive that can be downloaded and merged into the baseline copy; this approach also allows the files to be processed for any malware before they are merged.
The exact mechanics of guest file restore and subsequent data validation (cleaning) tasks vary from case to case. Sometimes it can be useful to have an intermediary IRE VM (e.g., one for Windows and one for Linux guests) equipped with all the required ransomware utilities, stationed in the IRE for initial data extraction and preparation. From here, the data can be safely merged into the baseline VM with significantly less risk.
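The "check first, then merge" flow described above can be sketched as a file-level routine that runs on an intermediary VM. This is an illustration under stated assumptions: `merge_scanned_files` is a hypothetical helper, the `is_clean` callback stands in for whatever malware tooling the security team actually uses, and the suffix check in the demo is a placeholder rather than a real scan.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def merge_scanned_files(
    staging: Path, baseline: Path, is_clean: Callable[[Path], bool]
) -> tuple[list[Path], list[Path]]:
    """Copy files from a staging tree into the baseline tree if they pass a check.

    Returns (merged, rejected) as lists of paths relative to the staging root.
    """
    merged: list[Path] = []
    rejected: list[Path] = []
    for src in sorted(p for p in staging.rglob("*") if p.is_file()):
        rel = src.relative_to(staging)
        if not is_clean(src):
            rejected.append(rel)  # leave suspicious files out of the baseline
            continue
        dest = baseline / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)   # copy2 also preserves file timestamps
        merged.append(rel)
    return merged, rejected


# Demo with throwaway directories; the ".locked" suffix test is NOT a real scan.
with tempfile.TemporaryDirectory() as tmp:
    staging, baseline = Path(tmp, "staging"), Path(tmp, "baseline")
    staging.mkdir()
    baseline.mkdir()
    (staging / "report.txt").write_text("quarterly numbers")
    (staging / "notes.txt.locked").write_text("encrypted payload")
    merged, rejected = merge_scanned_files(
        staging, baseline, is_clean=lambda p: p.suffix != ".locked"
    )
```

Keeping the list of rejected files is a deliberate choice: it gives the security team a record of what was withheld from the baseline VM and may need further analysis.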
Transition DR Site VMs into DR Operations Service
As each VM is recovered into the IRE and prepared to return to service, it can be moved from the isolated areas of the IRE into a DR operations mode of service. This is useful for validating the operations of the VMs that need to work together, making sure they can play their specific roles as part of a larger application. It is important organizations involve the networking team to ensure the networks used in the protected site and in the DR site for IRE operation and DR operations are properly isolated and controlled. This approach enables organizations to incrementally return to service critical application VMs that have cleared the ransomware recovery process. The priority of the recovery operations can be captured in the organization’s recovery details list.
Recover (phase 2 - failback to Protected Site)
The last stage in the recovery process is to fail back the VMs from the VMware Cloud on AWS SDDC (IRE) to the original protected site vCenter. To accomplish this, organizations need to create a failback plan for each failover plan that duplicates and reverses the direction of the tasks. Each failback plan can then be edited to make any adjustments that are needed (e.g., to recovery steps, mappings, etc.). Note that this task is considered a Planned Failover and will result in a period of downtime, as the VMs transition their data sets from the DR site back to the protected site. The service window for this transition should be planned for each DR plan.
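Conceptually, a failback plan is a copy of the failover plan with its direction reversed. The sketch below models that idea with plain data structures. It is illustrative only: `DRPlan`, its fields, and `make_failback_plan` are hypothetical and do not reflect how VMware Cloud DR stores or generates plans, and the site and network names are invented.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class DRPlan:
    """Hypothetical, simplified model of a DR plan."""
    name: str
    source_site: str
    target_site: str
    network_mappings: dict[str, str] = field(default_factory=dict)
    steps: tuple[str, ...] = ()


def make_failback_plan(failover: DRPlan) -> DRPlan:
    """Duplicate a failover plan and reverse its direction."""
    return replace(
        failover,
        name=f"{failover.name}-failback",
        source_site=failover.target_site,
        target_site=failover.source_site,
        # Each mapping is inverted so it points back toward the protected site.
        network_mappings={dst: src for src, dst in failover.network_mappings.items()},
    )


failover = DRPlan(
    name="finance-apps",
    source_site="protected-site",
    target_site="ire-sddc",
    network_mappings={"prod-net": "ire-isolated-net"},
    steps=("power on VMs", "user input: confirm validation"),
)
failback = make_failback_plan(failover)
```

The recovery steps are carried over unchanged, so any adjustments to steps or mappings remain a separate, deliberate edit to the failback plan.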
Prepare the Protected Site
One objective of any VMware Cloud DR failback operation is to minimize the cloud egress impact—both in cost and transfer time. To support this, VMware Cloud DR will attempt to restore the original protected site back to the same recovery point that was used for the failover. The details of this are covered in the product documentation for Running a Failback Plan. To make this functionality possible, organizations are advised to not delete the VMs from the original protected site. Instead, they are asked to simply power them off once the failover recovery (phase 1) begins, or sooner, if warranted for site security.
Run the Failback process
Organizations can run the failback plan, transitioning the VM workloads that were recovered to the IRE back to their original site. If there are any ‘user input’ recovery steps in the original failover plan, they will also be in the failback plan. These can be removed if the organization determines they are no longer needed. The orchestration of the failback activities can be monitored from the VMware Cloud DR UI.
Once all the DR plans used to fail over the VMs to the VMware Cloud on AWS IRE have been processed and the failback plans constructed and run, everything should be back in its original protected site (production) location. Organizations can then use the IRE for forensics, as needed, or simply begin cleaning up the remaining items, such as the temporary VMs and other services, and decommissioning or scaling down the VMware Cloud on AWS SDDC environment.
After recovering the VM workload back to the protected site, it is considered a best practice to activate the original DR plans that were used to fail over to the recovery site. Organizations should also check the Protection Group policies to ensure they are active and protecting the production site workloads again.
As a final step, organizations can collect the associated runbooks from the VMware Cloud Disaster Recovery orchestrator and file them along with any “lessons learned”. Ultimately, this concludes the recovery, allowing a return to normal operations, while preserving the ability to recover from a ransomware attack in the future.
Resources and References
- VMware white papers
- VMware blogs
- VMware product documentation
Terminology
- Validating – determining the usable state of the guest VM – no actions taken – either clean or compromised
- Cleaning – actions taken on guest VM to alter its state by removing, replacing, or blocking of malware so it is acceptable to put back into service
- Clean VM – state of the guest VM – ready to put back into service
- Compromised VM – state of the guest VM – malware condition still exists
- Recovery Granularity – the unit of recovery needed to bring a service back into operations – this could be at the file level within a VM, the whole VM, or a collection of VMs
- Operationally Air-gapped – the separation of sites that do not have active networks connecting them – https://blogs.vmware.com/virtualblocks/2021/09/28/operational-air-gaps/
- Immutable – not capable of or susceptible to change – https://www.merriam-webster.com/dictionary/immutable