Feature Brief: Elastic DRS
Introduction
One of the great things about the VMware Cloud on AWS service is that it’s operated and managed by VMware, which takes the infrastructure operations burden off the customer. VMware manages not only the hardware lifecycle and remediation but also certain aspects of the vSphere environment, such as availability and DRS. One feature that further extends the availability and resiliency of the SDDC cluster, and that is exclusive to VMware Cloud on AWS, is Elastic DRS (eDRS).
How it Works
Elastic DRS allows you to scale your cluster in response to demand, or lack of demand, by adding or removing hosts automatically based on specific policies that are configured. The eDRS algorithm runs every 5 minutes and looks at predefined resource thresholds for CPU, memory, and storage. The thresholds cannot be changed by the user and differ based on the policy configured. While the algorithm runs every 5 minutes, the scaling decisions also take into account trends that are tracked over time. If ANY of the resources consistently remain above the defined threshold, a scale-up recommendation alert is generated, and a host is added to the cluster. Conversely, a scale-down recommendation alert is only generated when ALL resources are consistently below the threshold, triggering the removal of a host.
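The ANY/ALL rule described above can be sketched in a few lines of Python. This is only an illustration, not VMware's implementation: the names `Threshold` and `scaling_recommendation` are hypothetical, and "consistently" is modeled here as "every sample in the recent window", which is an assumption, since the actual trend logic is not documented in this brief.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Threshold:
    high: float  # percent utilization above which scale-up is considered
    low: float   # percent utilization below which scale-down is considered


def scaling_recommendation(
    history: Dict[str, List[float]],
    thresholds: Dict[str, Threshold],
) -> Optional[str]:
    """Return "scale-up", "scale-down", or None for a window of samples.

    Assumption: a resource is "consistently" above (or below) a threshold
    when every sample in its window is above (or below) it.
    """
    # Scale up if ANY resource stayed above its high threshold.
    for resource, limit in thresholds.items():
        samples = history[resource]
        if samples and all(s > limit.high for s in samples):
            return "scale-up"

    # Scale down only if ALL resources stayed below their low thresholds.
    all_low = all(
        bool(history[resource]) and all(s < limit.low for s in history[resource])
        for resource, limit in thresholds.items()
    )
    return "scale-down" if all_low else None


# Example thresholds: the Performance policy values from the table in this brief.
PERFORMANCE = {
    "cpu": Threshold(high=90, low=50),
    "memory": Threshold(high=80, low=50),
    "storage": Threshold(high=70, low=20),
}
```

With these thresholds, a window in which only CPU stays above 90% yields a scale-up recommendation, while a scale-down requires CPU, memory, and storage to all stay below their low marks.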
[Demo: eDRS in action]
Policies
By default, the Scale Up for Storage Only policy is configured for every cluster deployed within your SDDC. The maximum usable capacity of your vSAN datastore is 75%; when you reach that threshold, eDRS automatically starts adding a host to your cluster and expanding your vSAN datastore. Note that even if you free up enough storage to fall below the threshold, the cluster will not scale down automatically; you will need to remove hosts from the cluster manually.
Other policies available include Optimize for Best Performance and Optimize for Lowest Cost. In these scenarios, the eDRS algorithm weighs resource consumption against the minimum and maximum number of hosts you’ve specified for your cluster. Optimizing for performance adds hosts quickly and removes them slowly to ensure the best possible performance, while optimizing for lowest cost removes hosts quickly and adds them more slowly to keep costs to a minimum.
Rapid Scale-Out is configured to react faster and to add hosts in parallel, up to four at a time, so that a cluster can scale out more quickly during an event. Primary use cases that can benefit from this policy include disaster recovery events, significant VDI power-on events, and bulk workload migration or power-on events. The Rapid Scale-Out maximum resource thresholds are the same as the eDRS Performance policy thresholds, but the minimum thresholds are set to 0%. This allows a scale-out task to kick off more quickly, but it also means the cluster will not scale in automatically; scale-in is a manual process driven by the customer.
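As a rough illustration of how the "up to four at a time" batch size might interact with the maximum cluster size you configure, consider the sketch below. The function name `hosts_to_add` is hypothetical, and both the one-host-per-event behavior for the other policies and the clamping to the configured maximum are assumptions made for the example, not documented eDRS internals.

```python
RAPID_SCALE_OUT_BATCH = 4  # "up to four at a time", per the description above


def hosts_to_add(current_hosts: int, max_hosts: int, rapid_scale_out: bool) -> int:
    """Number of hosts a single scale-out event would add (illustrative only).

    Assumptions: non-rapid policies add one host per event, and no event
    may push the cluster past the configured maximum size.
    """
    headroom = max(0, max_hosts - current_hosts)
    batch = RAPID_SCALE_OUT_BATCH if rapid_scale_out else 1
    return min(batch, headroom)
```

For example, under these assumptions a 14-host cluster capped at 16 hosts would add only two hosts in a Rapid Scale-Out event, even though the batch size allows four.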
The resource thresholds differ based on the policy you configure.
| eDRS Policy | CPU Thresholds | Memory Thresholds | Storage Thresholds |
| --- | --- | --- | --- |
| Rapid Scale-Out | High: 80%, Low: 0% | High: 80%, Low: 0% | High: 70%, Low: 0% |
| Performance | High: 90%, Low: 50% | High: 80%, Low: 50% | High: 70%, Low: 20% |
| Cost | High: 90%, Low: 60% | High: 80%, Low: 60% | High: 70%, Low: 20% |
| Storage Only | N/A | N/A | High: 70%, Low: 0% |
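For readers who want to reason about the thresholds programmatically, the table can be captured as a small lookup structure. The sketch below is illustrative only: `EDRS_THRESHOLDS`, `resources_over_high`, and `scales_in_automatically` are hypothetical names, not part of any VMware API, and the values are simply copied from the table above.

```python
from typing import Dict, List, Optional, Tuple

# (high, low) percent thresholds per resource, copied from the table above.
# None marks a resource the policy does not evaluate (N/A in the table).
Limits = Optional[Tuple[int, int]]

EDRS_THRESHOLDS: Dict[str, Dict[str, Limits]] = {
    "Rapid Scale-Out": {"cpu": (80, 0), "memory": (80, 0), "storage": (70, 0)},
    "Performance": {"cpu": (90, 50), "memory": (80, 50), "storage": (70, 20)},
    "Cost": {"cpu": (90, 60), "memory": (80, 60), "storage": (70, 20)},
    "Storage Only": {"cpu": None, "memory": None, "storage": (70, 0)},
}


def resources_over_high(policy: str, usage: Dict[str, float]) -> List[str]:
    """List the resources whose utilization exceeds the policy's high threshold."""
    over = []
    for resource, limits in EDRS_THRESHOLDS[policy].items():
        if limits is None:
            continue  # resource is not evaluated by this policy
        high, _low = limits
        if usage[resource] > high:
            over.append(resource)
    return over


def scales_in_automatically(policy: str) -> bool:
    """A low threshold of 0% can never be undercut, so such a policy never scales in."""
    return any(
        limits is not None and limits[1] > 0
        for limits in EDRS_THRESHOLDS[policy].values()
    )
```

Reading the table this way makes the 0% low thresholds easy to interpret: utilization can never drop below 0%, which is why Rapid Scale-Out and Storage Only never scale in on their own.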
Enabling the Policy
To set the policy of your choice, edit the eDRS settings of your cluster and choose the new policy.
Safety Checks and Notifications
A built-in safety check keeps eDRS from continuously adding or removing hosts; the cluster needs time to “cool off” and the resources need time to level out. There is a 30-minute delay between scale-up events, and a scale-down event can be triggered no sooner than 3 hours after a scale-up event.
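The cool-off windows can be expressed as a simple time check. The sketch below is a minimal illustration under stated assumptions: `is_scaling_allowed` is a hypothetical helper, and treating the exact boundary (for example, precisely 30 minutes) as allowed is a guess rather than documented behavior.

```python
from datetime import datetime, timedelta
from typing import Optional

SCALE_UP_COOLDOWN = timedelta(minutes=30)      # delay between scale-up events
SCALE_DOWN_AFTER_SCALE_UP = timedelta(hours=3)  # delay before scale-down after a scale-up


def is_scaling_allowed(
    action: str, now: datetime, last_scale_up: Optional[datetime]
) -> bool:
    """Check whether a "scale-up" or "scale-down" action is outside its cool-off window."""
    if action not in ("scale-up", "scale-down"):
        raise ValueError(f"unknown action: {action!r}")
    if last_scale_up is None:
        return True  # no earlier scale-up, so nothing to wait for
    elapsed = now - last_scale_up
    if action == "scale-up":
        return elapsed >= SCALE_UP_COOLDOWN
    return elapsed >= SCALE_DOWN_AFTER_SCALE_UP
```

For instance, ten minutes after a host was added, a second scale-up would be held back, and a scale-down would not be considered until three hours had passed.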
When scaling recommendations are generated, the multi-channel notification service sends automated notifications to organization members via email and via the console. The information is also tracked in the Activity Log, and more detailed tasks are tracked within the web client.
There’s certainly no shortage of notifications when it comes to scaling your clusters. Customers can also subscribe to the notification webhook for these events.
In the end, you have the scalability and flexibility you expect from a cloud service to maintain availability, capacity, and performance.