Solution

  • General

Type

  • Document

Level

  • Overview

Category

  • Feature Brief

Feature Brief: Elastic DRS

Introduction

One of the great things about the VMware Cloud on AWS service is that it’s operated and managed by VMware, taking the infrastructure operations burden away from the customer. VMware does this not only by managing the hardware lifecycle and remediation, but also by managing certain aspects of the vSphere environment, such as Availability and DRS. One feature that further extends the availability and resiliency of the SDDC cluster, and that is exclusive to VMware Cloud on AWS, is Elastic DRS (eDRS).

How it Works

Elastic DRS allows you to scale your cluster in response to demand, or lack of demand, by adding or removing hosts automatically based on specific policies that are configured. The eDRS algorithm runs every 5 minutes and looks at predefined resource thresholds for CPU, memory, and storage. The thresholds cannot be changed by the user and differ based on the policy configured. While the algorithm runs every 5 minutes, the scaling decisions also take into account trends that are tracked over time. If ANY of the resources consistently remain above the defined threshold, a scale-up recommendation alert is generated, and a host is added to the cluster. Conversely, a scale-down recommendation alert is only generated when ALL resources are consistently below the threshold, triggering the removal of a host. 
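To make the ANY/ALL rule concrete, here is a minimal Python sketch. It is not the actual eDRS implementation: the function and class names are hypothetical, and "consistently" is assumed to mean "in every sample of the evaluation window", which the service does not document in this brief. The example thresholds are the Performance policy values listed later in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    high: float  # percent utilization above which scale-up is considered
    low: float   # percent utilization below which scale-down is considered


# Example values: the Performance policy thresholds from this brief.
PERFORMANCE = {
    "cpu": Threshold(high=90, low=50),
    "memory": Threshold(high=80, low=50),
    "storage": Threshold(high=70, low=20),
}


def recommend(samples, thresholds):
    """Return "scale-up", "scale-down", or "none" for a window of samples.

    samples: list of dicts mapping resource name -> utilization percent,
             one per 5-minute evaluation, oldest first.
    thresholds: dict mapping resource name -> Threshold.
    Assumption: "consistently" means every sample in the window.
    """
    if not samples:
        return "none"
    # ANY resource consistently above its high threshold -> scale up.
    for resource, limit in thresholds.items():
        if all(s[resource] > limit.high for s in samples):
            return "scale-up"
    # ALL resources consistently below their low thresholds -> scale down.
    if all(s[r] < thresholds[r].low for r in thresholds for s in samples):
        return "scale-down"
    return "none"
```

Calling `recommend(window, PERFORMANCE)` with a window where only CPU stays above 90% returns "scale-up", while a scale-down needs CPU, memory, and storage to all stay below their low thresholds.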

Policies

By default, the Scale Up for Storage Only policy is configured for every cluster deployed within your SDDC. The maximum usable capacity of your vSAN datastore is 75%; when you reach that threshold, eDRS will automatically start the process of adding a host to your cluster and expanding your vSAN datastore. Please note that even if you free up enough storage to fall below the threshold, the cluster will not scale down automatically. You will need to manually remove host(s) from the cluster.

Other policies available include Optimize for Best Performance and Optimize for Lowest Cost. In these scenarios, the eDRS algorithm will look at the minimum and maximum hosts you’ve specified for your cluster size and take that into consideration with resource consumption. Optimizing for performance adds hosts quickly and removes them slowly to ensure the best possible performance, while optimizing for lowest cost removes hosts quickly and adds hosts more slowly to keep costs to a minimum.

Rapid Scale-Out is configured to react faster and to add hosts in parallel, up to four at a time, to allow a cluster to scale out more quickly during an event. Some primary use cases that can benefit from this new policy are disaster recovery events, significant VDI power-on events, or even bulk workload migration/power-on events. The Rapid Scale-Out maximum resource thresholds are the same as the eDRS Performance policy thresholds, but the minimum thresholds are set to 0%. This allows for a scale-out task to kick off more quickly but also means we will not automatically scale in; scale-in will be a manual process driven by the customer.
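As a rough illustration of the "up to four at a time" behavior, the Python sketch below computes how many hosts a single scale-out step could add. The function name and the `requested` parameter are hypothetical, and two things are assumptions rather than documented behavior: that the other policies add one host per step, and that Rapid Scale-Out is also capped by the maximum cluster size you have specified.

```python
def hosts_to_add(policy, current_hosts, max_hosts, requested=1):
    """Return how many hosts one scale-out step may add.

    policy: policy name, e.g. "Rapid Scale-Out" or "Performance".
    requested: hypothetical number of hosts the step would like to add.
    Assumption: only Rapid Scale-Out adds hosts in parallel (up to four);
    every other policy adds one host per step.
    """
    parallel_limit = 4 if policy == "Rapid Scale-Out" else 1
    # Never grow past the maximum cluster size configured for the cluster.
    headroom = max(0, max_hosts - current_hosts)
    return max(0, min(requested, parallel_limit, headroom))
```

For example, a 14-host cluster with a 16-host maximum could add only two hosts in one Rapid Scale-Out step, even if six were requested.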

The resource thresholds differ based on the policy you configure. 

EDRS Policy       CPU Thresholds        Memory Thresholds     Storage Thresholds
Rapid Scale-Out   High: 80%, Low: 0%    High: 80%, Low: 0%    High: 70%, Low: 0%
Performance       High: 90%, Low: 50%   High: 80%, Low: 50%   High: 70%, Low: 20%
Cost              High: 90%, Low: 60%   High: 80%, Low: 60%   High: 70%, Low: 20%
Storage Only      N/A                   N/A                   High: 70%, Low: 0%
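If you want to reason about these thresholds in a script, they can be captured as plain data. The Python sketch below copies the values listed above into a dictionary; the dictionary and helper names are hypothetical and not part of any VMware API. The second helper reflects the point made earlier: a low threshold of 0% can never be undercut, so policies that use it rely on manual scale-in.

```python
# Thresholds listed above as (high %, low %); None means the resource
# is not evaluated by that policy (N/A).
EDRS_THRESHOLDS = {
    "Rapid Scale-Out": {"cpu": (80, 0), "memory": (80, 0), "storage": (70, 0)},
    "Performance": {"cpu": (90, 50), "memory": (80, 50), "storage": (70, 20)},
    "Cost": {"cpu": (90, 60), "memory": (80, 60), "storage": (70, 20)},
    "Storage Only": {"cpu": None, "memory": None, "storage": (70, 0)},
}


def monitored_resources(policy):
    """Return the resources a policy evaluates, in table order."""
    return [r for r, t in EDRS_THRESHOLDS[policy].items() if t is not None]


def scales_in_automatically(policy):
    """Return True if every monitored resource has a low threshold above 0%.

    Utilization cannot drop below 0%, so a 0% low threshold means the
    "ALL resources below the threshold" condition is never met.
    """
    lows = [t[1] for t in EDRS_THRESHOLDS[policy].values() if t is not None]
    return all(low > 0 for low in lows)
```

With this data, `scales_in_automatically` is true for the Performance and Cost policies and false for Rapid Scale-Out and Storage Only, which matches the manual scale-in described for those two policies.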

 

Enabling the Policy

To set the policy of your choice, edit the EDRS settings of your cluster, and choose the new policy.

 

[Screenshot: EDRS settings for the cluster]

Safety Checks and Notifications

There is a safety check built in, so we aren’t continuously adding or removing hosts; we want the cluster to “cool off” and the resources to level out. There is a 30-minute delay between scale-up events, and a 3-hour delay to trigger a scale-down event after a scale-up event.
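The two delays can be pictured as a simple cooldown check. The Python sketch below is only an illustration of the timing rules described here, not how the service implements them; the function name and the idea of tracking a single `last_scale_up` timestamp are assumptions.

```python
from datetime import datetime, timedelta

SCALE_UP_COOLDOWN = timedelta(minutes=30)        # between scale-up events
SCALE_DOWN_AFTER_UP_COOLDOWN = timedelta(hours=3)  # scale-down after a scale-up


def is_allowed(action, now, last_scale_up):
    """Return True if `action` ("scale-up" or "scale-down") may run at `now`.

    last_scale_up: datetime of the most recent scale-up, or None if the
    cluster has not scaled up yet (assumed to mean no cooldown applies).
    """
    if last_scale_up is None:
        return True
    elapsed = now - last_scale_up
    if action == "scale-up":
        return elapsed >= SCALE_UP_COOLDOWN
    if action == "scale-down":
        return elapsed >= SCALE_DOWN_AFTER_UP_COOLDOWN
    raise ValueError(f"unknown action: {action}")
```

So a second scale-up ten minutes after the first would be held back, and a scale-down would have to wait until three hours after the last host was added.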

When scaling recommendations are generated, the multi-channel notification service will send out automated notifications via email to organization members. 

[Screenshot: notification email]
 

And via the console:


 
[Screenshot: notification in the console]
 

Information is also tracked in the Activity Log:

[Screenshot: Activity Log entries]
 

Lastly, more detailed tasks are tracked within the web client:

[Screenshot: tasks in the web client]
 

As you can see, there’s certainly no shortage of notifications when it comes to scaling your clusters. Customers can also subscribe to the notification webhook for the events.

In the end, you have the scalability and flexibility you expect from a Cloud service to maintain availability, capacity, and performance.

 
