Ransomware Recovery at Scale
A key challenge in handling ransomware recovery at scale stems from the nature of the attack itself. Ransomware attacks target the applications (the VMs) within the data center, not the physical data center. As a result, not all VMs in the affected data center inventory are affected at the same time or even in the same way.
Modern ransomware attack methods make it difficult to apply generalized selection criteria to a large set of VMs to find the optimal starting point for recovery operations. The analysis, remediation, and validation of the application workloads will likely not be a uniform set of tasks across the recovery inventory either. In other words, the “fix” might differ slightly from one VM to the next within the same recovery set.
Let’s look closer at the possible workflows and scalability. On one end of workflow scaling, especially as you begin the recovery process, you most likely will recover, analyze, and restore the first critical application VMs individually as you identify known good recovery points, establish testing criteria, and formulate specific recovery processes. This would be a “one at a time” approach.
Once a cadence for recovery has been established, the other end of workflow scaling is to apply the same validated selection criteria, analysis tasks, and recovery actions to many or all remaining VMs in a recovery group. This would be an “all at the same time” approach.
Both are valid use cases, but neither is suitable for working through the total overall recovery effort. The most manageable and most scalable solution lies somewhere between these two extremes, and that is just what this new feature of VMware Ransomware Recovery provides to the DR administrator.
While working through a ransomware recovery plan, individual (per-VM) operations have always been available. With this release, multi-VM or bulk actions are now also possible for most workflow actions. This enhancement helps streamline recovery activity at scale. Let’s take a closer look at this new capability.
The process starts by determining the collection of VMs to be processed for ransomware recovery. The VM recovery inventory is driven by the Recovery Plan. Each Recovery Plan specifies one or more Protection Groups (PGs) to be processed, and each PG contributes the VMs defined in its protection policy.
When the Recovery Plan is run for ransomware recovery, the total set of VMs in the inventory is identified and organized into the VM list for processing through the Guided Workflow. The VM list could contain dozens to hundreds of VMs to be processed for complete recovery. A small example of this VM list when first starting recovery is shown below:
At the beginning of the recovery process, each VM starts out in the Guided Workflow in the “In backup” stage. The goal is to move the VMs through validation and staging as quickly as possible and finally recover them back into the production site as shown in the Guided Workflow diagram below:
At the beginning of a recovery effort, it may make sense to begin by recovering one VM at a time. This approach would apply if you were still trying to determine the best snapshots to recover or otherwise determine when “good” and recoverable VMs exist.
Managing Bulk Operations
Once a key recovery point in time is identified, groups of VMs can now be recovered in bulk from that designated recovery point. Note that for VMs in different PGs, a different recovery point can be specified per PG for the bulk operation.
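To make the per-PG choice concrete, here is a minimal sketch of how a recovery team might record one recovery point per Protection Group and resolve it for every VM in a bulk selection. This is only a planning model, not the product's API: the function `plan_bulk_recovery`, the PG and VM names, and the timestamps are all made up for illustration.

```python
from datetime import datetime

# Hypothetical recovery point chosen for each Protection Group (sample values).
recovery_point_by_pg = {
    "pg-frontend": datetime(2024, 1, 10, 2, 0),
    "pg-data": datetime(2024, 1, 8, 2, 0),
}

# Hypothetical bulk selection: VM name -> Protection Group it belongs to.
selected_vms = {
    "web-01": "pg-frontend",
    "web-02": "pg-frontend",
    "db-01": "pg-data",
}


def plan_bulk_recovery(vms, points):
    """Map each selected VM to the recovery point chosen for its PG."""
    missing = sorted({pg for pg in vms.values() if pg not in points})
    if missing:
        # Refuse to plan if any PG in the selection has no recovery point yet.
        raise KeyError("no recovery point chosen for: " + ", ".join(missing))
    return {name: points[pg] for name, pg in vms.items()}


plan = plan_bulk_recovery(selected_vms, recovery_point_by_pg)
```

VMs from the same PG end up with the same recovery point, while VMs from different PGs can each follow their own.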
From the VM list, you can search for VMs in the current running Recovery Plan inventory based on different criteria and apply appropriate next actions on that collective set of VMs at the same time as shown below.
The search filters are based on VM name, Protection Group membership, and the current Guided Workflow recovery state. Each of these filters can be used to help identify the right subset of VMs for the next bulk operation.
The resulting VM list can also be sorted based on other recovery characteristics of each VM. This sorting makes it easier to identify and select VMs to apply the next desired recovery action.
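The filter-then-sort idea can be sketched in a few lines of Python. This models the VM list as plain data and is not the product's interface: the `VM` record, its field names, the `filter_vms` helper, and the sample inventory are assumptions for illustration, and the state labels are the short forms used in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VM:
    # Hypothetical record; field names are illustrative, not a product schema.
    name: str
    protection_group: str
    state: str  # e.g. "Backup", "Validation", "Staged", "Recovered"


def filter_vms(vms, name_contains=None, protection_group=None, state=None):
    """Return the VMs that match every criterion that was supplied."""
    result = []
    for vm in vms:
        if name_contains is not None and name_contains.lower() not in vm.name.lower():
            continue
        if protection_group is not None and vm.protection_group != protection_group:
            continue
        if state is not None and vm.state != state:
            continue
        result.append(vm)
    return result


# Made-up inventory for the example.
inventory = [
    VM("web-02", "pg-frontend", "Backup"),
    VM("db-01", "pg-data", "Staged"),
    VM("web-01", "pg-frontend", "Backup"),
    VM("web-03", "pg-frontend", "Staged"),
]

# Filter by name and workflow state, then sort the result by VM name.
selection = sorted(
    filter_vms(inventory, name_contains="web", state="Backup"),
    key=lambda vm: vm.name,
)
selected_names = [vm.name for vm in selection]
```

Combining criteria narrows the list, so a team member can pick out, for example, only the front-end VMs that are still waiting in the first stage.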
Note that the next action is also context sensitive to the currently selected VMs and is only presented if it is an allowed operation. For example, a selection that mixes VMs in different states (e.g., Backup, Validation, or Staged) cannot be transitioned to the Recovered state, because that transition only makes sense for Staged VMs. However, for VMs in any state, you can easily apply Recovery Point badges or change the Network Isolation levels of those VMs.
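One way to picture this context sensitivity is as a small rule table. The sketch below is an assumption-laden model, not the product's logic: the action names are invented, and the only rules taken from the text are that recovery requires Staged VMs and that badges and network isolation changes apply in any state. The other two transitions are assumed from the stage order of the Guided Workflow.

```python
# Actions the article says can be applied to VMs in any state (names invented).
ANY_STATE_ACTIONS = frozenset({"set_recovery_point_badge", "change_network_isolation"})

# State each transition requires. Only "recover" -> "Staged" is stated in the
# article; the other two entries are assumptions based on the stage order.
REQUIRED_STATE = {
    "start_validation": "Backup",
    "stage": "Validation",
    "recover": "Staged",
}


def allowed_actions(selected_states):
    """Return the set of actions valid for a selection of VM states."""
    states = set(selected_states)
    if not states:
        return set()  # nothing selected, nothing to offer
    allowed = set(ANY_STATE_ACTIONS)
    for action, required in REQUIRED_STATE.items():
        # A transition is offered only when every selected VM is in the
        # single state that the transition starts from.
        if states == {required}:
            allowed.add(action)
    return allowed


staged_only = allowed_actions(["Staged", "Staged"])
mixed = allowed_actions(["Backup", "Staged"])
```

With a mixed selection, only the state-independent actions remain, which matches the behavior described above.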
The guided workflow management allows for many concurrent operations across the VMs in the recovery set. As the recovery task scales, so does the workflow management. Concurrent bulk operations are also available. Note that in the current release, bulk operations are limited to 50 VMs per selection per action – additional action operations will be queued and serviced as bulk processing proceeds.
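If a selection is larger than the limit, one option is to plan the work as a queue of batches. The sketch below is a client-side planning aid under stated assumptions: only the figure of 50 comes from the text above, while the helper `queue_bulk_batches` and the VM names are hypothetical, and the product's own queueing is not modeled.

```python
from collections import deque

MAX_VMS_PER_BULK_ACTION = 50  # limit quoted for the current release


def queue_bulk_batches(vm_names, limit=MAX_VMS_PER_BULK_ACTION):
    """Split a selection into a FIFO queue of batches of at most `limit` VMs."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    names = list(vm_names)
    return deque(names[i:i + limit] for i in range(0, len(names), limit))


# Made-up selection of 120 VMs, split into batches for one action.
batches = queue_bulk_batches([f"vm-{i:03d}" for i in range(120)])
batch_sizes = [len(batch) for batch in batches]
```

Each batch can then be handed to a different team member or submitted in turn as earlier batches complete.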
With the Bulk VM Operations feature, it is now even easier to take a divide-and-conquer approach to ransomware recovery at scale. You can work in parallel with others on the recovery team and move the VMs under recovery more quickly through the Guided Workflow, with the right level of granularity and grouping, somewhere between one VM and all VMs, until the overall recovery process is completed.