VMware Ransomware Recovery Field Guide

Introduction

It’s estimated that a ransomware attack targets a business every 11 seconds [1]. On a global scale, experts predict ransomware will cause $20 billion (USD) in damages in 2022 [2]. It’s no surprise, then, that ransomware protection has become an urgent business imperative for organizations, large and small, around the world.

A robust ransomware protection plan must include both preventative and recovery measures. Preventative measures attempt to keep ransomware from getting into the IT environment in the first place. Because no preventative measures can be 100% successful forever, organizations must have recovery measures in place. Recovery measures need to provide organizations a reliable, easy-to-use, cost-effective way to fully recover their business-critical applications and data, so they can return their operations to a normal state as soon as possible after an attack.

This field guide focuses on the ransomware recovery measures that organizations need to consider. Starting with a brief overview of VMware’s holistic approach to ransomware protection, this guide provides practical considerations and set-up and recovery steps for VMware Cloud DR and the add-on VMware Ransomware Recovery product for VMware Cloud DR. IT professionals responsible for disaster recovery (DR), business continuity, or cybersecurity can use these solutions to prepare their organization to better recover from an attack.

[1] https://cybersecurityventures.com/cybercrime-damage-costs-10-trillion-by-2025/

[2] https://cybersecurityventures.com/cybercrime-damage-costs-10-trillion-by-2025/

Purpose of This Field Guide

This field guide will take you through the two key VMware products for recovering from modern ransomware attacks, VMware Cloud Disaster Recovery and VMware Ransomware Recovery, both provided “as a Service”. It will also cover some of the adjacent VMware products and technologies as they apply to the broader ransomware recovery solution.

It is not the intent of this guide to replace existing product documentation or other online informational or educational resources. Many of these will be highlighted throughout this guide as links to the relevant resources.

Because each customer’s environment is slightly different and the underlying technologies are rapidly evolving, this guide is also not intended to be a step-by-step solution setup document.

The goal of this field guide is to provide a broad overview of the solution at hand, a generalized start-to-finish roadmap. Throughout this guide, notes and links point to many of the other resources available, so you can quickly assimilate the solution and be better informed and prepared to provide a ransomware recovery solution for your organization.

Audience

This field guide is intended for infrastructure administrators, and their teams and managers, who are responsible for protecting virtualized (VMware) environments, either on-premises or in VMware Cloud on AWS data centers, and for the resulting recovery scenarios when responding to ransomware disasters. The content will also be useful to the networking and security teams that support the initial prevention, detection, and response areas of cybersecurity, because these functions are important during the recovery efforts.

Disaster Recovery Background

The saying goes, “the best defense is a good offense”. To defend against ransomware, organizations need to go on the offensive and proactively implement measures that help them maintain their operations in the event of an attack. The VMware whitepaper, “Ransomware: Defense in Depth with VMware”, provides a comprehensive overview of how organizations can implement a robust ransomware protection plan with VMware solutions. For our coverage in this guide, we will be following guidelines set out by the National Institute of Standards and Technology (NIST). NIST describes the five key stages of a ransomware protection cycle and the essential activities of each stage, detailed below:

STAGE: ESSENTIAL ACTIVITIES

Identify:
  • Review industry and vendor resiliency best practices
  • Conduct vulnerability assessments
  • Establish incident response
  • Align security processes with DR processes

Protect:
  • Map applications
  • Define service level agreements (SLAs) and the level of recovery granularity required (replication intervals)
  • Set up DR capabilities
  • Set up next-generation antivirus (NGAV)

Detect:
  • Detect the threat
  • Investigate
  • Triage the threat
  • Visualize the attack sequence to understand the threat

Respond:
  • Isolate the threat
  • Conduct proactive threat hunting to uncover other risks
  • Identify a recovery point
  • Validate in a clean Isolated Recovery Environment

Recover:
  • Audit and remediate systems
  • Prevent reinfection
  • Restore to a clean production site
  • Return to a normal production site state of business

While VMware can help organizations address every stage of this framework, this guide drills down into the final two stages, Respond and Recover. It explores how organizations can use VMware Cloud DR along with VMware Ransomware Recovery, VMware’s comprehensive DRaaS solutions for VMware Cloud on AWS, for the critical recovery phase of the ransomware protection cycle.

Before getting into the details of the DR and Ransomware Recovery solutions, let’s review a few basic considerations across the larger disaster recovery framework.

Protected Sites

Organizations should consider which sites and applications they want to protect and recover. VMware Cloud DR is the foundation for protecting these virtualized workloads, whether on-premises or in VMware Cloud on AWS, by replicating them to a cloud-based repository that is integrated with VMware Cloud on AWS and used for the actual recovery tasks.

Organizations should review the current Backup Considerations for their protected sites to make sure the workloads they want to protect fit within the VMware Cloud DR capabilities.

When reviewing the specifics of each site for target application workloads, it is also good practice to identify any baseline VM templates or golden images that might be useful to add to the inventory.

These baseline images often provide an ideal reference point for recovery comparison, as well as a good working baseline from which the organization can start remediation tasks. Another benefit is that templates and golden images are typically not running 24x7 and are not part of the operating production environment, so they are less likely to be affected by a ransomware attack.

NOTE: VMware Cloud DR does not protect templates at this time, so they will need to be converted to VM instances to be included in the inventory that is replicated to the Scale-out Cloud File System (SCFS).

Recovery Site

For both VMware Ransomware Recovery and VMware Cloud DR, the recovery site will be a VMware Cloud on AWS Software Defined Data Center (SDDC). This resource is used in slightly different ways depending on the recovery needs, which are covered in more detail later in this guide. The recovery site can be deployed in a ‘Just-in-time’ fashion and leverage the elasticity of the cloud, so organizations pay for critical resources only when they are needed.

For ransomware recoveries, this guide refers to the recovery site as an Isolated Recovery Environment, or IRE. This topic is covered in more detail later.

NOTE: The smallest VMware Cloud on AWS SDDC for production use can be just 2 hosts. Depending on compute and/or storage requirements, this SDDC can be scaled up or scaled out to accommodate the disaster recovery workloads being processed.

Users and Roles

In addition to the typical disaster recovery roles and users involved (e.g., application owners, VI admins, DR admins, operations coordinators, etc.), organizations will need to include security team members and networking specialists to help with ransomware recovery events. These team members should be included early in the planning, design and testing activities, so they understand the processes and can set any additional requirements for the recovery activities.

The security team’s role is to make sure the malware can be identified and removed, if possible, as well as help find and validate viable recovery points. When VMs are recovered into the recovery site SDDC/IRE, they may need to be further analyzed, validated, scanned and “cleaned” before being put back into service. The security team has the expertise, tools, and processes needed to address the compromises in both the protected site and the recovery site.

The networking team’s role is to help ensure the necessary isolation steps are taken to prevent further spread, while providing access to the networked resources required to perform relevant recovery tasks, both within the production environment and the IRE. Inclusion of the networking team ensures organizations follow the appropriate formal processes and change control activities when setting up and configuring the networks for the protected site and recovery site. Organizations want to follow best practices, but they can’t afford unplanned waiting for approvals during a disaster recovery event, so bringing the networking team into the DR process can help avoid unnecessary, and potentially costly, delays.

NOTE: For the VMware Ransomware Recovery solution, the default security tool is VMware Carbon Black Cloud, and the network security tools leverage NSX.

VMware Cloud DR

VMware Cloud Disaster Recovery (DR) is an on-demand disaster recovery service which provides an easy-to-use Software-as-a-Service (SaaS) solution with cloud economics that keeps disaster recovery costs under control. VMware Cloud DR can be used to protect vSphere virtual machines (VMs) by replicating them to the cloud and recovering them, as needed, to a target Software Defined Data Center (SDDC) on VMware Cloud on AWS.

NOTE: An organization can create the target "recovery" SDDC/IRE immediately prior to performing a recovery; it does not need to be provisioned to support VM data replications in the steady state.

Some of the key capabilities of VMware Cloud DR that are applicable to ransomware recovery situations include:

  • Secure, immutable backup copies: VMware Cloud DR’s backup copies are operationally air-gapped – the Scale-out Cloud File System (SCFS) is kept in an offsite VMware Cloud on AWS environment, separate from the production environment, so ransomware leakage in the backup storage system is prevented. No existing recovery point data is ever overwritten. The underlying log structured filesystem used for storing the recovery points is inherently immutable. There is a good description of the Cloud File System and ransomware use case in this blog “Rapid Ransomware Recovery Needs a New Type of Filesystem”.
  • For extra access security, VMware provides separate authentication and role-based access control (RBAC) for the cloud based DR environments.
  • Deep history of backup copies: VMware provides recovery point histories that are minutes, months, even years old to enable organizations to better recover. The deeper retention helps mitigate situations where ransomware has been in the environment for a long time, and it provides quick access to older recovery points when forensics or baseline restore selection is desired.
  • SDDC augmentation for use as Isolated Recovery Environments (IRE): VMware Cloud on AWS’s elastic cloud capacity can be used to easily provision greenfield, clean operating environments for use in validation, and later, as recovery. These Isolated Recovery Environments can help prevent reinfection during the validation process when administrators need to work with potentially compromised backups. They also buy organizations time, providing a clean recovery site they can rely on in the event of an attack. This cloud based recovery site helps while they remediate existing production sites and perform appropriate forensics without having to rush to restore service in an already compromised production site.
  • Convenient, efficient, and iterative validations (rapid experiments): Instant VM power-on, from the Cloud File System to the IRE SDDC, enables organizations to more quickly complete the critical, iterative process associated with identifying which recovery points, VMs, or data to restore into production. With VMware Cloud DR, organizations aren’t required to always copy data from the backup storage system to the primary storage system.
  • File and folder-level recovery capabilities: Organizations can use VMware Cloud DR to extract specific files or folders from more recent recovery points as part of the recovery process. This extraction can be performed without bringing the associated VM into inventory and powering it on.
  • At-scale recovery with highly automated DR workflows and orchestration: VMware Cloud DR’s powerful orchestration engine has tight integration to the protected (source) environment, recovery (failover) environment, and the Scale-out Cloud File System to enable the recovery of VMs at scale following a very prescriptive recovery plan.

Protection Groups

Organizations can use flexible, policy-defined Protection Groups, which set the context for the VM inventory that will be recovered in a Recovery plan. Each Protection Group defines the inventory it includes, along with the schedules for automated snapshots and replication to the Cloud File System. A snapshot is VMware Cloud DR’s construct for a point-in-time backup; snapshots become the system’s "recovery points" for use in both test and actual recoveries.

When defining a more traditional site Recovery plan, the organization will assign actions for the inventory and each VM within the Protection Group. The level of granularity of the actions should match the type of recovery that is needed. Since a VM can belong to more than one Protection Group, there may be some cases where an organization will need a DR-oriented Protection Group that offers broader coverage to be used for simpler recovery. They may also need a few dedicated ransomware-oriented Protection Groups that can assist in the more complicated iterative and data merging recovery efforts associated with recovering from this type of an attack.

The Protection Group schedules define the frequency of the backups to meet the recovery point objective (RPO) and the duration of the snapshot retention in the Scale-out Cloud File System (SCFS), which will be available to support the failover actions. VMware Cloud DR can maintain up to 2,000 snapshots per Protection Group, so it is possible to have many near-term Recovery Points (e.g., as frequently as every 30 minutes or 4 hours, depending on the configuration), as well as sufficient long-term retention for weeks or months. For example, a ransomware-oriented Protection Group schedule policy might look like this:

  • Keep at least 1 snapshot per hour for 5 days
  • Keep daily snapshots for 20 days
  • Keep weekly snapshots for 4 weeks
  • Keep monthly snapshots for 3-6 months

This example policy contains about 150 Recovery Points (snapshots) – well below the operational limits of the product. Organizations should determine if they need more coverage – either longer retention or more frequent recovery points – and adjust the Protection Group schedules accordingly.
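
The arithmetic behind that estimate can be checked with a short sketch. The tier values below simply restate the example policy, and the 2,000 figure is the per-Protection-Group limit cited above. The function name and the assumption that the tiers do not share snapshots are illustrative only and are not part of the product.

```python
# Illustrative only: counts the recovery points implied by the example
# retention policy, assuming the tiers do not share snapshots.

SNAPSHOT_LIMIT = 2000  # per-Protection-Group limit cited in this guide


def count_recovery_points(monthly_months: int) -> int:
    """Return the total snapshots retained under the example policy."""
    hourly = 24 * 5           # 1 snapshot per hour for 5 days
    daily = 20                # daily snapshots for 20 days
    weekly = 4                # weekly snapshots for 4 weeks
    monthly = monthly_months  # monthly snapshots for 3-6 months
    return hourly + daily + weekly + monthly


low = count_recovery_points(3)
high = count_recovery_points(6)
print(f"Example policy retains {low} to {high} recovery points")
print(f"Headroom under the limit: {SNAPSHOT_LIMIT - high}")
```

Most of the total comes from the hourly tier, so shortening or lengthening the hourly window has the largest effect on the snapshot count.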

NOTE: It is always good practice to consult the VMware Config Max online tool for VMware Cloud DR to understand the limits on Protection Group counts and sizes as well as other system configuration limits.

Recovery Plans

When dealing with ransomware recovery, the organization will likely need to incorporate more control points in their Recovery plan and guide the recovery steps to prevent attack propagation and the threat of reinfection. The individual VMs being restored will need to be checked more closely than in a typical site disaster recovery failover. One of the new features, and a key difference between site DR and ransomware-enabled Recovery plans, is the granular handling of individual VM recovery. Later sections of this guide go into more detail on the guided recovery workflow applied to each VM identified in a Recovery plan that will be used for ransomware recovery.

Outside of running plans for ransomware recovery, inserting user input steps is a way for organizations to add checkpoints in the recovery orchestration, giving the recovery administrator prompts to examine the environment manually. For more automated actions, VMware Cloud DR’s script VM capability can be leveraged to run context-specific actions within the IRE. As organizations review the applications that will be candidates for a ransomware recovery use case, they will need to determine the granularity needed for processing these different workloads: separate Recovery plans may be warranted, or existing plans may need more instrumentation.

Cloud File System (also called the Scale-out Cloud File System / SCFS)

The Cloud File System provides the data starting point of the recovery process, because each recovered VM comes from a selected point-in-time recovery point in the Cloud File System inventory. The Cloud File System is connected to the recovery SDDC (IRE) as an NFS datastore, and the selected version of the VM is simply brought into inventory based on the Recovery plan specifications. A single Cloud File System repository can hold a significant number of VMs and their corresponding versions. During some recovery operations, and for ransomware recovery in particular, the VM can run directly off the Cloud File System without requiring SDDC vSAN capacity or the additional time and processing needed to Storage vMotion it fully into the SDDC. This capability greatly speeds up the iteration cycle time as well as the overall recovery duration.

You can find more information about running VMs on the Cloud File System during site disaster recovery in this article.

VMware Cloud on AWS

Both of the disaster recovery solutions discussed in this guide are built upon the VMware Cloud (VMC) on AWS platform and are procured as an additional service in the Cloud Services Portal for VMC. This extends the overall managed as a service paradigm for the VMware Cloud solutions.

SDDC / IRE

The VMC Software Defined Data Center (SDDC) environment is the core of the recovery site solution. This is the case whether it is being used as a full site disaster recovery site for VMware Cloud DR, or as the special purpose Isolated Recovery Environment (IRE) for VMware Ransomware recovery scenarios. The SDDC can be deployed ‘Just-in-time’ (on-demand) or can be deployed as a minimal sized always-on configuration (also known as Pilot Light). In either case, the configuration can take advantage of cloud elasticity and scale to the desired size (compute resources or storage capacity) to serve the appropriate recovery needs.

NOTE: For ransomware recovery situations where the VMs are being iterated through in the guided workflow, the SDDC used as the IRE may end up smaller than the SDDC that would be required to support a full site failover.

NSX Advanced Firewall

VMware Cloud on AWS includes the NSX – Networking and Security – features needed for basic network management of the SDDC within the VMC and broader internet environment. These networking capabilities can easily be leveraged for a more secure and connected configuration of the SDDC for production use, disaster recovery scenarios, as well as ransomware recovery needs. The integration of particular features such as NSX Advanced Firewall support will provide the network isolation functionality discussed later in this guide.

NOTE: For the Advanced Firewall capabilities used to provide some of the integration, there may be additional per-use charges incurred during ransomware plan operations – either for testing or recovery.

VMware Carbon Black Cloud

Another service available through the VMware Cloud portal is the Carbon Black Cloud Endpoint and Workload Protection platform. This advanced cybersecurity solution provides improved capabilities against modern threats with the addition of behavioral analysis of running workloads. This is critical in the detection and remediation of ransomware attacks and will be used in the overall Ransomware Recovery solution described in this guide.

Next Gen Anti-Virus (NGAV)

Malware attacks are getting more sophisticated and use techniques that are not easily detected by simple system scans that only look for known vulnerabilities or existing virus signatures. Newer attack vectors include fileless attacks that are only observable when a system is running and exhibits undesirable behavior. NGAV solutions therefore need to include behavioral analysis as part of the validation methods applied. We’ll look more closely at the integration of the Carbon Black Cloud capabilities later in this guide.

Ransomware Recovery


When the disaster at hand is a ransomware attack and the systems in the production site cannot be easily remediated, a solid recovery plan and process is your best alternative.

VMware Ransomware Recovery is an add-on product that addresses this need and works directly with VMware Cloud DR. Both are provided as easy-to-use Software-as-a-Service (SaaS) solutions.

Some of the key capabilities of the VMware Ransomware Recovery solution include:

  • Dedicated Ransomware Recovery Workflow: The iterative process of identifying and evaluating each VM for viable recovery is managed through the recovery UI to help simplify the process for successful recovery tasks.
  • Restore Point Selection Criteria: The UI provides greater visibility into the data characteristics of each recovery point in the inventory, such as change rate and entropy, combined with individual recovery point selection for each VM in the workflow. This helps simplify selecting the best recovery point(s) and minimizes data loss.
  • Behavioral Analysis of Powered-on VMs: Going beyond just simple static file systems scanning, the solution enables Next Gen AV (NGAV) capabilities to operate on running VMs and monitor the behavior to deal with more sophisticated fileless attack scenarios.
  • Pre-configured VM Isolation Levels: Managing the network access to the VMs in the Isolated Recovery Environment (IRE) is further formalized and simplified with a “push button” approach to controlling the isolation level for each VM in the recovery process.
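
To illustrate why entropy is a useful signal when comparing recovery points, the sketch below computes Shannon entropy in bits per byte. Encrypted data tends to look nearly random and so scores close to 8, while typical text scores much lower. This is a generic illustration, not the calculation VMware Ransomware Recovery performs; the function name and the sample inputs are assumptions made for the example.

```python
import math
import os
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of data in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Repetitive plain text has low entropy.
text_sample = b"quarterly report draft " * 500
# Random bytes stand in for encrypted content in this illustration.
random_sample = os.urandom(len(text_sample))

print(f"text:   {shannon_entropy(text_sample):.2f} bits/byte")
print(f"random: {shannon_entropy(random_sample):.2f} bits/byte")
```

A sudden jump in entropy between two consecutive recovery points of the same VM, especially together with a spike in change rate, is the kind of pattern that suggests data was encrypted in between.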

The combination of the core capabilities of VMware Cloud DR and the additional features of the VMware Ransomware Recovery solution provides an ideal set of tools and environment for the last line of defense against ransomware: an effective recovery.

Introduction


Before covering the details of the Ransomware Recovery solution, let’s review some of the key differences between a ransomware recovery and a more traditional site disaster recovery. Many of the differences are outlined in the documentation and other sources, but a simple synopsis is worth including here:

  1. The production site is still available, but the content (applications and VMs) has been compromised. The goal is to get the production site operational again.
  2. Finding and validating the best recovery point(s) typically involves an iterative and piecemeal workflow, as opposed to a more linear runbook style of recovery, so bringing order to a more complicated process is essential.
  3. Performing recovery tasks quickly and in a controlled environment is critical to minimize the overall recovery time and reduce the risk of reinfection of the applications being handled

We will cover these topics, as well as a few other related considerations, in the remaining parts of this guide.

The figure below outlines one of the basic challenges of a ransomware recovery. The point of detection and impact to the systems – shown in the vertical red lines on the right side of the timeline – may be separated by days, weeks or longer, from the initial attack point when the ransomware was introduced – shown as the skull on the left side of the timeline.

It may be necessary to iterate over the recovery points created by the protection schedules to find the best usable VM and data set(s) with which to proceed in the recovery and validation tasks.
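
That iteration can be pictured as a walk backward through the recovery points, newest first, stopping at the first one that passes validation. The sketch below is a conceptual model only: the RecoveryPoint class, the is_clean callback, the VM name, and the sample dates are all hypothetical. In practice, validation is the security team's scanning and behavioral analysis in the IRE rather than a single function call.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class RecoveryPoint:
    """Hypothetical stand-in for one snapshot of a VM."""
    vm_name: str
    taken_at: datetime


def newest_clean_point(
    points: Iterable[RecoveryPoint],
    is_clean: Callable[[RecoveryPoint], bool],
) -> Optional[RecoveryPoint]:
    """Walk recovery points newest-first; return the first that validates."""
    for point in sorted(points, key=lambda p: p.taken_at, reverse=True):
        if is_clean(point):
            return point
    return None  # no usable recovery point in the retained history


# Sample data: daily snapshots, with the (initially unknown) infection on 10 March.
history = [RecoveryPoint("app-db-01", datetime(2022, 3, day)) for day in range(1, 15)]
infected_from = datetime(2022, 3, 10)

# Stand-in for validation in the IRE: anything taken before the infection passes.
best = newest_clean_point(history, lambda p: p.taken_at < infected_from)
print(best)
```

Starting from the newest point minimizes data loss, at the cost of more validation attempts when the ransomware has been dormant for a long time.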

Preparations

Before the ransomware recovery processes can be carried out, organizations need to do some pre-work to make sure everything will run as it should when the recovery time comes. This pre-work includes scoping and building out the necessary Protection Groups and Recovery plans and making sure they are tested and ready for use.

Testing Considerations

Testing Recovery plans is critical to improving readiness. It gives organizations the ability to identify and address potential deficiencies, so they are prepared when the worst happens, regardless of whether it’s an unplanned outage or a ransomware attack. During testing tasks, the Recovery SDDC, or IRE, needs to be in place and operational. Organizations may require multiple classes of networks in the IRE to bring infected VMs back into inventory safely, all of which need to be validated, cleaned, and integrated into recovery mode operations. The required network isolation can be constructed with NSX Advanced Firewall methods that are integrated into the IRE framework with the Ransomware Recovery setup.

Organizations can consider using different mappings for test and failover, especially during plan development stages. Folder mapping can be leveraged to make inventory items easier to identify (e.g., a quarantine folder) during testing or partial failovers.

Another goal during testing is to determine what additional service or utility VMs, specific to what ransomware may impact, are needed in the IRE when it is provisioned. These can be failed over as initial actions during early response activities, or they can be included in the specific Recovery plans where they may be used, either manually or automatically with script VMs.

NOTE: The status of Recovery Points used during testing phases can be captured in the Badges construct provided with the Ransomware Recovery capabilities. Badges are covered in a later section.

Recovery Procedures

While recovery from ransomware has many similarities with traditional site disaster recovery, there are also important differences that necessitate developing specialized recovery workflows. This section covers the main stages relevant to ransomware recovery scenarios:

  1. Response – declaring a ransomware disaster and initiating the recovery workflow tasks
  2. Recovery (phase 1) – recovering and validating individual VM recovery points into the IRE and preparing them for eventual recovery back into the Production site
  3. Recovery (phase 2) – failing back the remediated production workloads from the IRE back into a clean production site

Response

Once ransomware has been detected in the production (protected) environment, there are many steps that need to be taken to safeguard the remaining infrastructure and respond to the attack. In addition to the tasks dictated by the security team and the ransomware defense systems, there are actions organizations can take before they start the VM recovery tasks, which will ultimately improve the overall success of the recovery.

Isolated Recovery Environment

To be prepared, organizations must work off the premise that their backup data has been infected. Most likely, they will not know the precise moment of an infection, so they often assume they should restore their backup data from a point in time before it was encrypted in the production environment. Unfortunately, that backup copy may be impacted and could end up re-introducing ransomware into that environment, causing more harm than good.

An alternative to simply trying to restore backups into the production environment is to restore backup data to an Isolated Recovery Environment (IRE). This provides a safer environment, so the ransomware can be fully remediated before any virtual machines are migrated back into a production environment. An IRE provides a staging area, isolated from other networks, in which restored virtual machines can run. This means the remediation process can occur without external ransomware triggers and without the risk of (re)infecting other workloads.

If an IRE is not already prepared and running, the organization should deploy and configure the Recovery SDDC that will be used for the IRE.

The application networks defined during the preparation stage should now be constructed, and their connectivity isolation should be verified. For a ransomware recovery, it is best to deploy a new, “clean” SDDC for use as the IRE to prevent reinfection. The VMware Cloud DR management interface provides the mechanisms to perform this deployment. Provisioning and configuring the SDDC will take approximately two hours before it can be used as the IRE.

If the SDDC is already in operation (e.g., in Pilot Light mode) and connected to the protected site that has been infected then it may need further analysis and processing by the security and networking teams (e.g., separation, validation, cleaning) before it can be used as the IRE.

Once deployed, the IRE may be used to run VMs that have been fully recovered until the entire recovery process (phase 1) is complete.

With VMware Cloud DR, the recovery SDDC can also be scaled up after it is created to accommodate fluctuations in incoming recovery workloads. Organizations need enough compute capacity to service the VMs and enough storage capacity to hold the VMs. During ransomware recovery use, it is possible to run the workloads directly from the Cloud File System datastore, so the storage vMotion to the SDDC vSAN datastore can be avoided.

In general, it is useful to run testing from the Cloud File System to expedite plan run times, by eliminating the storage vMotion background step. When running an actual ransomware recovery failover, the workloads will be left on the Cloud File System. This approach also reduces some of the SDDC scalability requirements that are driven by storage consumption needs.

The creation and deletion of an SDDC for recovery efforts can be performed in the VMware Cloud DR UI. Minimizing the number of tools and interfaces needed helps simplify the entire disaster recovery workflow and reduces risk or downtime associated with the cloud provisioning processes.

Once the remediation process is complete, workloads can be migrated back into the original production site without fear of reintroducing the ransomware. When the IRE is no longer needed, it can be deleted. The automated processes of creating and deleting an SDDC help minimize ransomware recovery times and the cost of maintaining an IRE: organizations simply create and pay for an IRE only when they need it.

Another option is to augment the IRE with a greenfield VMware Cloud on AWS environment as a DR site to recover workloads and run applications once remediated. This option removes the pressure on the IT organization to quickly return an impacted data center to service. It buys the organization some time, so they can conduct the appropriate forensics on the existing infrastructure without prematurely eliminating points of inspection in a rush to recover. Sometimes the only way to recover VMs from an on-site backup would be to delete the infected machines from the datastore to make storage space available for a restore. Once deleted, all the forensic data on those VMs is lost. A greenfield, clean operating environment eliminates this necessity, allowing organizations to make workloads and applications available in the cloud to keep business running, while they investigate and remediate the attack.

Review and Organize the Recovery plans

As part of the initial response to ransomware, it is good practice to check the status of the recovery components and overall recovery plan compliance. In cases where the recovery SDDC was just deployed, recovery plans will need to be activated and checked. Organizations will also need to resolve any issues with the recovery plans that may prevent a successful ransomware recovery procedure. For instance, if the protected site has been segmented or disconnected from VMware Cloud DR and the DR site, organizations can expect to receive a Recovery plan compliance alert notifying them the protected site is unreachable. This will not affect the actual failover process, since the Recovery plans work primarily from the VMware Cloud DR (UI and SCFS) and VMware Cloud on AWS (SDDC) elements.

The Recovery plans control the VM power on state at recovery time. Under normal disaster recovery operations, it may be desirable to have the VMs automatically powered on as part of the recovery. For ransomware recovery, the VMs will be powered on when they are recovered to the IRE. 

Review Snapshot Status

As described earlier, the Protection Groups run automatically on schedules which are pre-defined by the organization’s policies. At this point in the ransomware response, organizations should review the current snapshot status and make any appropriate adjustments. For example, it might be useful to perform a manual snapshot for Protection Groups whose next cycle is hours away. The snapshots are independent and immutable in the Cloud File System, so there is no harm in capturing a more current recovery point that may prove useful for data extraction or forensics at a later stage.
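The manual snapshot decision described above can be sketched as a simple filter. This is only an illustration: the `ProtectionGroup` record, its field names, and the two-hour threshold are assumptions made for the example and are not part of any VMware Cloud DR API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ProtectionGroup:
    # Hypothetical record for illustration; not a VMware Cloud DR object.
    name: str
    next_scheduled_snapshot: datetime


def groups_needing_manual_snapshot(groups, now, threshold=timedelta(hours=2)):
    """Return names of groups whose next scheduled snapshot is further away than threshold."""
    return [g.name for g in groups if g.next_scheduled_snapshot - now > threshold]


# Example inventory with made-up names and schedule times.
now = datetime(2023, 8, 4, 9, 0)
groups = [
    ProtectionGroup("pg-finance", now + timedelta(hours=6)),
    ProtectionGroup("pg-web", now + timedelta(minutes=30)),
]
candidates = groups_needing_manual_snapshot(groups, now)
print(candidates)  # ['pg-finance']
```

The threshold is a policy choice; an organization would pick a value that reflects how much recent change it is willing to leave uncaptured.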

If the protected site where the ransomware has been detected is disconnected from the VMware Cloud DR cloud components, organizations can stop the snapshot and replication schedules. This may also help with controlling alerts from the environment while other ransomware tasks are being performed. The objective is to confirm where things stand with respect to the available recovery points and any automated actions that are being performed by VMware Cloud DR.

NOTE: During a ransomware recovery operation, the Protection Group (PG) expiration task is suspended for any PG involved in the recovery plan. This will generate warnings at the Protection Group level, which can safely be ignored while working on the recovery at hand. It also prevents the inadvertent expiration of snapshots that may prove useful in other VM or data recovery tasks.

Recovery Point Selection

Part of the ransomware recovery process involves locating a valid restore point that an organization can use as the best recovery option. This recovery point must be from either before the infection happened or before the data was encrypted or compromised, so that it can still be used.

As a result, it’s not as simple as restoring the most recent backup data, because it is possible the ransomware has been in the environment or encrypted the data without being detected for several hours, days, or possibly even weeks. This means organizations may need to use an older restore point to be certain the data isn’t compromised. To accomplish this, the recovery solution must have robust data retention capabilities that can support hundreds of restore points, ranging from just a few hours to possibly several months in age.

Finding a valid recovery point might require multiple recovery operations before the organization locates data that is not compromised (encrypted) and still viable. Even if the organization locates a copy of the data that is not encrypted, it is possible that the recovery point still contains the ransomware. As a result, organizations may need to quickly restore and validate alternate recovery points, from a handful to a dozen or more, to be certain they have a clean one they can use to recover.
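The search described above amounts to walking the recovery points from newest to oldest and stopping at the first one that passes validation. A minimal sketch follows, assuming a hypothetical `RecoveryPoint` record and a caller-supplied `is_clean` check. In practice that check stands in for the full validation effort (security scans, behavioral analysis, and manual review), not a single function call.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class RecoveryPoint:
    # Hypothetical record for illustration only.
    taken_at: datetime
    label: str


def newest_clean_point(
    points: Iterable[RecoveryPoint],
    is_clean: Callable[[RecoveryPoint], bool],
) -> Optional[RecoveryPoint]:
    """Check points from newest to oldest and return the first that validates."""
    for point in sorted(points, key=lambda p: p.taken_at, reverse=True):
        if is_clean(point):
            return point
    return None  # no usable recovery point was found


# Example: the two most recent points are assumed to be compromised.
points = [
    RecoveryPoint(datetime(2023, 8, 1, 0, 0), "aug-01"),
    RecoveryPoint(datetime(2023, 8, 2, 0, 0), "aug-02"),
    RecoveryPoint(datetime(2023, 8, 3, 0, 0), "aug-03"),
    RecoveryPoint(datetime(2023, 8, 4, 0, 0), "aug-04"),
]
known_bad = {"aug-03", "aug-04"}
best = newest_clean_point(points, lambda p: p.label not in known_bad)
print(best.label)  # aug-02
```

Starting from the newest point keeps data loss as small as possible, at the cost of potentially validating several points before a clean one is found.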

The ransomware recovery workflow UI includes some additional capabilities for recovery point selection. The available recovery points are shown graphically on a timeline, along with any previously applied badges that mark the status of particular points, as shown in the figure below.

image-20230804091529-1

In addition to this badge status tracking, the timeline also shows data change rate and snapshot entropy data for each snapshot taken along the protection timeline. By reviewing this information, recovery admins can make better choices about where to begin the validation process.
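To see why entropy is a useful signal, recall that encrypted data looks statistically random, while typical documents and databases do not. The sketch below computes Shannon entropy in bits per byte for two samples. It illustrates the general concept only. This guide does not describe how the product calculates its snapshot entropy values, and the sample data is made up.

```python
import math
import os
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Repetitive plain text versus random bytes (a stand-in for encrypted content).
text_sample = b"quarterly report " * 1000
random_sample = os.urandom(len(text_sample))

print(f"text:   {shannon_entropy(text_sample):.2f} bits/byte")
print(f"random: {shannon_entropy(random_sample):.2f} bits/byte")
```

A sudden jump toward the 8 bits/byte ceiling between consecutive snapshots, especially alongside a spike in change rate, is the kind of pattern that suggests starting validation with an earlier recovery point.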

VMware Cloud DR provides an instant virtual machine power-on capability that can simplify and accelerate this restore and validation process. Virtual machines can be started directly from the Scale-out Cloud File System (SCFS) with no need to assemble or migrate data into a usable image before powering on a virtual machine. If the selected recovery point is not ideal, an alternate recovery point from the Cloud File System inventory can be easily selected and brought into service for validation. This capability reduces the amount of time spent when iterating over different recovery points, which results in overall faster recovery times.

Naturally, the restoration and use of data from an older recovery point means the organization will lose any changes that occurred to that data after that particular backup. The main objective is to find a valid recovery point that minimizes this data loss. This means finding the most recent copy of a virtual machine that can be remediated and used as the baseline for the recovery. The file and folder restore capabilities can be used to extract subsets of the virtual machine's data set into this recovery point as needed to reduce overall data loss.

As with any restore process for data protection and disaster recovery, recovery point validation procedures should be well documented and practiced routinely to minimize recovery times when an actual ransomware attack occurs.

Recovery Point Iteration

One of the final activities of the response phase is to identify the recovery points that will be used for the actual recovery operations. The security team can use their tools to help identify recovery points available in VMware Cloud that are potential candidates, investigating if malware has infected the systems or if there are other concerns that could render the points unusable. The details of their investigations can be recorded in the recovery detail annotations as part of their record. Organizations can quickly iterate over alternate recovery point candidates, since they are mounted directly from the Cloud File System and can be quickly examined in the IRE.

Guided Workflow

VMware Ransomware Recovery provides a simple, well-defined workflow for each VM to be processed. This workflow encompasses four basic steps as shown in the figure below:

image-20230804090845-1

When a Recovery Plan is activated for Ransomware Recovery, the VM inventory, as defined by the members of the Protection Groups used in the plan, is determined and placed into the VM list starting in the backups stage, ready for processing through the workflow.

The first step of the guided workflow has already been discussed above – selecting the initial recovery point for validation. During the remainder of the guided workflow, there are several other activities that are involved before a VM can be properly staged for recovery.

Multi-VM Selection and Batch Level Processing

The Guided Workflow has always allowed individual (per-VM) operations while working through a ransomware recovery plan. The solution now supports multi-VM (bulk) actions for most workflow actions. This enhancement helps streamline recovery activity at scale.

When the Recovery Plan is run for ransomware recovery, the total set of VMs in the inventory is identified and organized into the VM list for processing through the Guided Workflow. The VM list could contain dozens to hundreds of VMs to be processed for complete recovery. A small example of this VM list when first starting recovery is shown below:

image-20230804090933-2

At the beginning of the recovery process, each VM starts out in the Guided Workflow in the “In backup” stage. The goal is to move the VMs through validation and staging as quickly as possible and finally recover them back into the production site as shown in the Guided Workflow diagram below:

image-20230804091002-3

At the beginning of a recovery effort, it may make sense to begin by recovering one VM at a time. This approach would apply if you were still trying to determine the best snapshots to recover or otherwise determine when “good” and recoverable VMs exist.

Managing Bulk Operations

Once a key recovery point in time is identified, groups of VMs can be recovered in bulk into the Isolated Recovery Environment (IRE) from that designated recovery point. Note that for VMs in different PGs, a different recovery point can be specified per PG for the bulk operation.

From the VM list, you can search for VMs in the current running Recovery Plan inventory based on different criteria and apply appropriate next actions on that collective set of VMs at the same time as shown below.

image-20230804091035-4

The selection criteria are based on VM name, Protection group membership, and the current guided workflow Recovery state. Each of these filtering constraints can be used to help identify the right subset of VMs to perform the next bulk operation on.

The resulting VM list can also be sorted based on other recovery characteristics of each VM. This sorting makes it easier to identify and select VMs to apply the next desired recovery action.

Note that the next action is also context sensitive to the currently selected VMs and is only presented if it is an allowed operation. For example, VMs that are in multiple different current states (e.g., Backup, Validation, or Staged) cannot be transitioned to the Recovered state – that only makes sense for Staged VMs. However, for VMs in any state, you can easily apply Recovery Point badges or change the Network Isolation levels of those VMs.
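The context-sensitive behavior described above can be pictured as a small state check. The sketch below is an assumption-laden illustration, not the product's implementation: the stage names are adapted from this guide, the forward-only transition table and the action strings are hypothetical, and the real workflow may permit other transitions.

```python
from enum import Enum


class Stage(Enum):
    IN_BACKUP = "In backup"
    IN_VALIDATION = "In validation"
    STAGED = "Staged"
    RECOVERED = "Recovered"


# Assumed forward-only transitions; the real workflow may allow others.
NEXT_STAGE = {
    Stage.IN_BACKUP: Stage.IN_VALIDATION,
    Stage.IN_VALIDATION: Stage.STAGED,
    Stage.STAGED: Stage.RECOVERED,
}

# Actions the guide says can be applied to VMs in any state.
ALWAYS_ALLOWED = {"apply_badge", "change_network_isolation"}


def allowed_bulk_actions(selected_stages):
    """Return the actions that make sense for every VM in the selection."""
    actions = set(ALWAYS_ALLOWED)
    distinct = set(selected_stages)
    # A stage transition is only offered when all selected VMs share one stage.
    if len(distinct) == 1:
        (stage,) = distinct
        if stage in NEXT_STAGE:
            actions.add(f"move_to:{NEXT_STAGE[stage].value}")
    return actions


staged_only = allowed_bulk_actions([Stage.STAGED, Stage.STAGED])
mixed = allowed_bulk_actions([Stage.IN_BACKUP, Stage.STAGED])
print(sorted(staged_only))
print(sorted(mixed))
```

With a uniform selection of Staged VMs the recover transition is offered. With a mixed selection only the state-independent actions remain.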

The guided workflow management allows for many concurrent operations across the VMs in the recovery set. As the recovery task scales, so does the workflow management. Concurrent bulk operations are also available. Note that in the current release, bulk operations are limited to 50 VMs per selection per action – additional action operations will be queued and serviced as bulk processing proceeds.
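As a rough planning aid, the filter criteria and the 50-VM limit described above can be modeled as follows. The `VmEntry` record, its field names, and the sample inventory are hypothetical. The sketch only shows how a large selection breaks down into groups of at most 50 and makes no claim about how the product queues work internally.

```python
from dataclasses import dataclass

MAX_VMS_PER_BULK_ACTION = 50  # limit stated for the current release


@dataclass(frozen=True)
class VmEntry:
    # Hypothetical row of the VM list; field names are illustrative.
    name: str
    protection_group: str
    state: str


def select_vms(vms, name_contains=None, protection_group=None, state=None):
    """Apply the three filter criteria; a criterion left as None is ignored."""
    return [
        vm for vm in vms
        if (name_contains is None or name_contains in vm.name)
        and (protection_group is None or vm.protection_group == protection_group)
        and (state is None or vm.state == state)
    ]


def batches(vms, size=MAX_VMS_PER_BULK_ACTION):
    """Yield consecutive groups of at most `size` VMs."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(vms), size):
        yield list(vms[start:start + size])


# Made-up inventory: 120 web VMs still in backup plus one staged database VM.
inventory = [
    VmEntry(f"web-{i:03d}", "pg-web", "In backup") for i in range(120)
] + [VmEntry("db-001", "pg-db", "Staged")]

selected = select_vms(inventory, protection_group="pg-web", state="In backup")
batch_sizes = [len(batch) for batch in batches(selected)]
print(batch_sizes)  # [50, 50, 20]
```

Knowing the number of batches in advance helps the team estimate how many bulk actions a given recovery stage will take.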

Security and Vulnerability Analysis

As previously mentioned, organizations need to proceed with recovery on the assumption that backup data is also infected with ransomware. As a result, they cannot simply recover an older copy of an infected virtual machine back into production, for fear that it will reinfect the environment. Instead, they will need to ensure the data and applications are clean before moving the virtual machines back into production.

The ideal solution and environment for conducting this analysis and remediation activity will include the following items:

  • An IRE(s) to stage recovered workloads into a controlled and safe environment for remediation
  • A virtual machine recovery solution that has robust retention capabilities to give the organization multiple recovery points that can be quickly deployed in the IRE(s)
  • Tools to detect and remove a wide variety of ransomware variants from workloads both statically and during operational (running) conditions
  • Documented procedures to help teams quickly:
    • Locate valid recovery points for remediation
    • Extract data from a recovered workload
    • Iterate on the restore process

NOTE: To ensure organizations benefit from the latest list of known ransomware, CBC will use updated signature lists at the time of recovery to detect any new variants that might not have been known when the backup was originally performed. CBC also includes malware behavioral detection to further improve its ability to uncover and address new strains of file-less ransomware.

Behavioral Analysis

Once the selected recovery point is brought into the IRE and instrumented with the security sensors, it will begin running, constrained by the default “Quarantined + Analysis” network isolation level of the integrated NSX Advanced Firewall. In addition to the vulnerability and malware signature scans, the VM will be monitored with behavioral analysis. The vulnerability and malware scans are fixed tasks based on the current content and configuration of the VM. The behavioral analysis is more open-ended and should run for a period of time and under varying conditions. These variations could include network isolation changes and interoperability with other VMs within the IRE.

Guest File Restore

When a limited amount of data (files) from an alternate recovery point is desired, the files can be extracted from Cloud File System recovery points with guest file restore utilities. These utilities allow a recovery administrator to view the VM contents of specific recovery point snapshots and extract specific data sets from those backup images into a local download location or a cloud based (S3) location, without having to bring the VM into the SDDC/IRE inventory. The extracted files are presented as a zip archive that can be downloaded, unpacked, validated, and merged into the working recovery copy of the VM.

The exact mechanics of guest file restore and the subsequent data validation (cleaning) tasks vary from case to case. Sometimes it can be useful to have an intermediary IRE VM (e.g., one for Windows and one for Linux guests) equipped with all the required ransomware utilities, stationed in the IRE for initial data extraction and preparation. From here, the data can be safely merged into the baseline VM with significantly less risk.
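One possible preparation step on such an intermediary VM is to unpack the downloaded zip archive into a scratch directory and record a checksum for each file before any scanning or merging. The sketch below uses only the Python standard library (Python 3.9 or later). The archive name, file paths, and the choice of SHA-256 are assumptions for illustration, and checksums complement the security tooling rather than replace it.

```python
import hashlib
import tempfile
import zipfile
from pathlib import Path


def extract_and_hash(archive_path, dest_dir):
    """Unpack a zip archive into dest_dir and return {member name: SHA-256 hex digest}."""
    dest = Path(dest_dir).resolve()
    digests = {}
    with zipfile.ZipFile(archive_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            target = (dest / info.filename).resolve()
            # Refuse entries that would escape the destination directory.
            if not target.is_relative_to(dest):
                raise ValueError(f"unsafe path in archive: {info.filename}")
            target.parent.mkdir(parents=True, exist_ok=True)
            data = archive.read(info)  # whole file in memory; fine for a sketch
            target.write_bytes(data)
            digests[info.filename] = hashlib.sha256(data).hexdigest()
    return digests


# Demonstration with a tiny archive built on the fly (stand-in for a real download).
with tempfile.TemporaryDirectory() as tmp:
    archive_path = Path(tmp) / "restore.zip"
    with zipfile.ZipFile(archive_path, "w") as zf:
        zf.writestr("docs/report.txt", "recovered contents")
    digests = extract_and_hash(archive_path, Path(tmp) / "unpacked")
    restored_text = (Path(tmp) / "unpacked" / "docs" / "report.txt").read_text()

print(digests)
```

The recorded digests can be compared against known-good values where they exist, or simply kept as an audit trail of exactly which file versions were merged into the working recovery copy.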

Badges

During the validation workflow steps, either testing or actual recovery, it is possible to assign a predefined badge to the specific recovery point under consideration. The badge is a user applied attribute for the specific recovery point. The predefined values for the snapshot badge are shown in the figure below.

image-20230804091114-5

 

As was noted in the recovery plan testing section, badges applied to snapshots evaluated during testing or practice runs will appear in the snapshot timeline (covered in the Recovery Point Selection section) and provide more information and guidance to the rest of the team when dealing with actual recovery tasks.

User Annotations

During the iterations of the workflow for an active recovery plan, the recovery administrator can add notes to the workflow for each iteration conducted. These notes improve communication of status and results across the team and further facilitate a comprehensive divide-and-conquer approach to scaling the recovery tasks.

Network Isolation / Firewalls

When a VM is initially brought into the IRE and started up, and the Carbon Black Cloud (CBC) sensors are enabled, it begins in a “Quarantined + Analysis” state. In this state, the VM can only access or be accessed by the NGAV tools – all other north/south or east/west network traffic is limited through NSX Advanced Firewall rules. With the integration provided in VMware Ransomware Recovery, the network isolation control is easily managed with “push-button” selection options.

A short video on Network Isolation goes into more detail about the available network isolation capabilities and how they leverage the NSX Advanced Firewall.

Final Recovery

Stage the Recovery Point

Once the team – both infrastructure and security – has completed the validation of the VM and performed the necessary patching, remediation, data updates, and malware removal, then that instance of the VM can be staged for final recovery back to the protected site. A new staging snapshot can be created in the Cloud File System that contains any updates to the VM made during validation and also has the security sensors removed to avoid any conflict when recovered.

Prepare the Protected Site

One objective of any recovery from cloud-based restores is to minimize the cloud egress impact — both in cost and transfer time. To support this, VMware Cloud DR and VMware Ransomware Recovery will attempt to restore the original protected site VM back to the same recovery point that was used for the recovery task. To make this functionality possible, organizations are advised to not delete the VMs from the original protected site. Instead, they are asked to simply power them off once the recovery begins, or sooner, if warranted for site security.

If, for some reason, the original site is unavailable for the final recovery step, an alternate recovery site can be supported. In this case, the configuration of the alternate site will need to be mapped the same as the original site.

NOTE: If an alternate site is used for the final recovery, there will be additional time (and bandwidth) involved, as the baseline VM data will not be present and will need to be transferred from the Cloud File System to the alternate site.

Run the Recovery workflow process

The final step of the guided recovery workflow for each VM is to recover that staged VM from the Cloud File System back to the original site. The orchestration of the failback activities can be monitored from the UI.

Once all the recovery plans used to process the VMs to the VMware Cloud on AWS IRE have been completed, everything should be back to its original protected site (production) location. Organizations can then use the IRE for forensics, as needed, or simply begin cleaning up the remaining items, such as the temporary VMs and other services, and decommissioning or scaling down the VMware Cloud on AWS SDDC environment.

Summary

Ransomware attacks cannot always be prevented, and they can bring business operations to a halt. As covered in this guide, a new way of handling the recovery from these types of disaster is needed. We have gone over the basics of how VMware Cloud DR and VMware Ransomware Recovery can be used to provide a last line of defense in the fight against ransomware. These two solutions, combined with VMware Cloud on AWS and VMware Carbon Black, can provide organizations with a more robust and easier to manage framework to help recover from these types of disaster.

Not only is ransomware here to stay, but it will also become more sophisticated, frequent, and complex over time. VMware Ransomware Recovery helps customers:

  • Confidently recover in the face of existential threats
  • Quickly recover with guided automation
  • Simplify recovery operations with integrated availability, security and networking

Terminology

  • Cleaning – actions taken on guest VM to alter its state by removing, replacing, or blocking of malware so it is acceptable to put back into service
  • Clean VM – state of the guest VM – ready to put back into service
  • Compromised VM – state of the guest VM – malware condition still exists
  • Immutable – not capable of or susceptible to change – https://www.merriam-webster.com/dictionary/immutable
  • IRE – Isolated Recovery Environment – a controlled, “safe” environment to conduct recovery operations without the risk of re-infection from the malware
  • NGAV – Next Gen Anti-Virus software – including malware, vulnerability and behavioral analysis capabilities
  • Operationally Air-gapped – the separation of sites that do not have active networks connecting them – https://blogs.vmware.com/virtualblocks/2021/09/28/operational-air-gaps/
  • Recovery Granularity – the unit of recovery needed to bring a service back into operations – this could be at the file level within a VM, the whole VM, or a collection of VMs
  • Validating – determining the usable state of the guest VM – no actions taken – either clean or compromised
