VMware Cloud Well-Architected Framework for Azure VMware Solution: Infrastructure Operations and Service Control
Day-to-day Infrastructure Operations
With the adoption of VMware Cloud, it’s expected that several of the organization’s existing operational processes will remain consistent and intact in the public cloud. This is part of the value proposition of the service. It is critical for an operations team to understand the VMware Cloud management domains in detail to cohesively evaluate the capabilities and identify any nuances or new areas that could introduce operational gaps.
The day-to-day management of VMware Cloud will typically focus on the following areas:
Updates to the SDDC software are necessary to maintain the health and availability of the VMC service. The VMware Cloud provider will share notifications of upcoming SDDC lifecycle management activities. Customers monitor, review, and manage the patching and upgrade schedule for the SDDC, which includes reviewing release notes and providing forward-looking change notifications to internal teams. This covers vSphere ESXi, vSAN, NSX, and VMware management components.
The organization must review existing policies and processes for lifecycle management and how they can be adapted for VMware Cloud. This includes the scheduling of VMware Cloud patching and updates, version control between on-prem and VMware Cloud for compatibility, identification of components not included in the scheduled VMware Cloud patching processes and defining validation processes applied before and after patching has occurred.
VMware Clouds are provisioned with VMware vCenter Server. After deployment, customers are provided with credentials to manage the core vCenter logical constructs (folder structures, alarms, tags, etc.), roles and responsibilities, and the vCenter Content Library.
Review existing processes for managing vCenter and how they can be adapted for VMware Cloud. This includes the management of Content Libraries, core vCenter logical constructs (datastores, alarms, etc.), and vCenter integration between on-prem and VMware Cloud (if supported).
Control Plane Management
The control plane plays a primary role in acting as the interface between applications and service delivery. This area will include event data collection, resiliency & backup strategy, configuration and access management, as well as Day 2 operations.
Review existing processes for control plane management and how they can be adapted for VMware Cloud. This should include core VMware-based control plane technologies (vROps, vRLI, vRNI, HCX, vRA) that will be integrated with the VMware Cloud SDDC. Consider the following associated processes: lifecycle management, control plane backups, resiliency, and event management.
Compute & Workload Management
This area focuses on the processes and policies for managing compute resources and deployed workloads. This includes VMs, Containers, Resource Pools, vApps and Compute Management policies.
- Define VMware Cloud processes for managing vSphere clusters, including host provisioning, child resource policies (naming, shares, reservations, limits), and integration with core IT Service Management systems (if required). This should also include the decisions and process that trigger compute scaling.
- Review existing processes for workload management and how they could be adapted for VMware Cloud. This includes VM templates, Container images, snapshots, clones, OS patching, licensing compliance and considerations, antivirus, VM tool management and any migration considerations.
This area focuses on the management of datastores, storage encryption, storage monitoring & optimization, storage performance & capacity management, and operationalizing vSAN within VMware Cloud.
- Review existing processes for managing storage, including vSAN storage policies, encryption, storage performance approach, vSAN threshold requirements, and proactive capacity management.
- How will you monitor vSAN storage usage and capacity to comply with the service SLA?
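As one way to approach the capacity question above, the following minimal sketch classifies a datastore's usage against an SLA limit. The 0.75 limit, the warning margin, and all names (`DatastoreUsage`, `classify`) are assumptions for illustration; take the actual threshold from your provider's SLA and feed in figures from whatever monitoring tool you use.

```python
from dataclasses import dataclass

# Assumed limit for illustration only: replace with the value in your provider's SLA.
SLA_MAX_USED_FRACTION = 0.75


@dataclass
class DatastoreUsage:
    """Capacity figures for one datastore, as reported by your monitoring tool."""
    name: str
    capacity_gb: float
    used_gb: float

    @property
    def used_fraction(self) -> float:
        return self.used_gb / self.capacity_gb


def classify(usage: DatastoreUsage,
             sla_limit: float = SLA_MAX_USED_FRACTION,
             warn_margin: float = 0.05) -> str:
    """Return 'breach' at or above the limit, 'warning' within the margin, else 'ok'."""
    if usage.used_fraction >= sla_limit:
        return "breach"
    if usage.used_fraction >= sla_limit - warn_margin:
        return "warning"
    return "ok"


if __name__ == "__main__":
    for ds in (DatastoreUsage("vsan-a", 1000, 600),
               DatastoreUsage("vsan-b", 1000, 720),
               DatastoreUsage("vsan-c", 1000, 800)):
        print(ds.name, f"{ds.used_fraction:.0%}", classify(ds))
```

The warning band gives operations teams time to add capacity or reclaim space before the limit itself is reached.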
This area covers the management of NSX, configuration of network segments, IP address management, management of network security policy, and monitoring and managing network traffic levels.
Review existing network and security processes and how they can be adapted for VMware Cloud. This includes network and security architecture/design, security models (current and cloud optimised), network stretching, security requirements, micro-segmentation, distributed firewall, IP address management, roles/responsibilities and lines of demarcation (between your teams and vendor ecosystem), end to end traffic flow monitoring, proactive bandwidth optimisation and load balancing requirements.
Performance & Capacity Management
Performance and Capacity are tightly linked as they both ensure your workloads get the necessary resources to perform optimally. Consider the following topics for adopting a successful strategy:
- Identify key stakeholders and confirm the scope and metrics for capacity and performance (Compute, Storage, Network). This will allow you to then effectively monitor the utilization of your workloads and identify thresholds for scale-out expansion.
- Regularly collect performance data and periodically review the metrics. By identifying cyclical patterns, you can recommend cloud infrastructure scaling ahead of demand. This would not just lead to effective operations of your workloads but also to potential cost optimization as you build synergy between your workload elasticity and VMware Cloud purchasing strategy.
- Identify key tools/technologies that will be used to monitor and manage performance and capacity. Make sure that these tools are capable of operating with the permission set provided within the VMware Cloud deployment.
- Whenever possible, establish baseline performance before, during and after migration to VMware Cloud. This helps to isolate any performance issues that might arise post-migration.
- Establish roles and responsibilities for VMware Cloud Performance management ahead of any planned migrations.
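The idea of recommending scaling ahead of demand can be illustrated with a simple forecast. The sketch below fits a straight line (ordinary least squares) to daily utilization samples and estimates how many days remain until a scale-out threshold is reached. It is a deliberate simplification: a linear trend does not capture the cyclical patterns mentioned above, and the function name, sample data, and threshold are hypothetical.

```python
from typing import Optional, Sequence


def days_until_threshold(samples: Sequence[float], threshold: float) -> Optional[float]:
    """Estimate days from the last sample until utilization reaches `threshold`.

    `samples` are daily utilization fractions (0.0 to 1.0), oldest first.
    Returns None when the fitted trend is flat or decreasing.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("at least two samples are required")

    # Least-squares fit of utilization against day index 0..n-1.
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    var = sum((x - mean_x) ** 2 for x in range(n))
    slope = cov / var
    if slope <= 0:
        return None

    intercept = mean_y - slope * mean_x
    day_at_threshold = (threshold - intercept) / slope
    # Clamp at zero: the threshold may already have been crossed.
    return max(0.0, day_at_threshold - (n - 1))


if __name__ == "__main__":
    cpu = [0.50, 0.52, 0.54, 0.56, 0.58]  # hypothetical daily averages
    print(days_until_threshold(cpu, threshold=0.70))
```

Comparing the estimate with the lead time needed to add hosts indicates when a scaling request should be raised.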
Availability management plays a lead role in ensuring your services can perform their agreed functions to meet the needs of the business.
The availability of your services is the percentage of time that your workload is available. This percentage (such as 99.99%) is measured over a defined period and is often a design goal for applications.
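To make such a target concrete, an availability percentage can be converted into a downtime allowance for a given period. The sketch below assumes a 30-day period by default; the function names are illustrative only.

```python
def allowed_downtime_minutes(availability_pct: float, period_days: int = 30) -> float:
    """Minutes of downtime permitted by an availability target over the period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


def measured_availability_pct(downtime_minutes: float, period_days: int = 30) -> float:
    """Availability actually achieved, given observed downtime over the period."""
    total_minutes = period_days * 24 * 60
    return 100 * (total_minutes - downtime_minutes) / total_minutes


if __name__ == "__main__":
    for target in (99.9, 99.99):
        print(f"{target}% over 30 days allows "
              f"{allowed_downtime_minutes(target):.2f} minutes of downtime")
```

Over 30 days (43,200 minutes), a 99.9% target allows about 43.2 minutes of downtime, while 99.99% allows about 4.32 minutes.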
As you operate your VMware Cloud workloads, consider the following topics to ensure that you have established availability processes extended to the SDDC:
- Develop a plan for continuously balancing/distributing applications/workloads across available cloud resources. Utilize anti-affinity policies whenever necessary.
- Work with application teams to ensure that applications are architected with the availability model of the cloud in mind.
- Review VMware Cloud Availability Commitments to ensure that these are well understood by your operations teams.
The Hosting Reliable Applications on VMware Cloud whitepaper provides high level strategic guidance on the design of highly available and reliable VMware Cloud infrastructure.
Infrastructure Observability & System Health
The goal of observability is to understand a complex system’s internal state by observing its external outputs. Proper instrumentation enables you to aggregate metrics, traces, logs and events from a distributed system and correlate them across various application components and services, identifying complex interactions between elements and allowing you to troubleshoot performance issues, improve management, and optimize cloud native infrastructure and applications.
Underlying Hardware Infrastructure & VMware Control Plane
VMware Cloud was designed from the ground up to be simple to consume, allowing your operations teams to focus higher up the stack and away from the undifferentiated heavy lifting associated with hardware infrastructure.
This means that the management, health, and lifecycle of the underlying hardware infrastructure (compute, network, and storage) are the responsibility of the VMware Cloud provider. This includes lifecycle management of the VMware component stack and the adding and removing of physical hosts for scaling and maintenance purposes.
From your perspective as a consumer of the platform, your operational responsibilities and overall observability capabilities now begin at the VMware software layer, which includes ESXi, NSX, and vSAN. Capturing the appropriate metrics from these components in addition to your enterprise workloads will form the building blocks for achieving full-stack observability.
When deciding on the metrics that need to be observed, it’s important to adopt a user-centric approach that works backwards from application owners. The goal should be to collect the minimum number of data points necessary to implement observability as efficiently as possible. If you choose more metrics than necessary, you could experience alert fatigue and pay less attention to the statistics that matter. In contrast, selecting too few metrics would be counterproductive, as it leads to a lack of visibility and an inability to examine significant behaviours.
This section will outline key considerations when building an observability plan for your VMware Cloud infrastructure. It is advised to think about your observability plan when you are in the pilot or pre-production stage of your cloud journey. Consider the following high-level guidelines:
- Shift towards an SLO-centric culture to observe your services based on critical end-user experience rather than system metrics. Ensure VMware Cloud monitoring and event metrics/thresholds are aligned to Service Level Requirements and SLOs that are documented in Service Level Agreements with the service consumers (for example, lines of business).
- Define and select the appropriate Infrastructure and Application metrics to create SLIs that help you achieve better system observability.
- Ensure that all key thresholds and metrics are formally established and reviewed regularly. The review process should be documented, fully aligned to service level requirements, and supportive of business commitments. The bidirectional impact of VMware Cloud should be well understood and documented, alongside future planning of new KPIs and metrics to drive further efficiencies and improve user experience.
- Review your existing processes and tools used for monitoring and event management and how they could adapt to VMware Cloud, for example predictive analytics, guided troubleshooting, root cause analysis, and policy-based, automated remediation capabilities. These help protect you proactively against degradation of performance and capacity.
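One common way to relate an SLI to an SLO, as the guidelines above suggest, is an error budget: the share of requests (or time) the SLO allows to fail. The sketch below is a minimal illustration under assumed inputs; the function name, the 99.9% target, and the request counts are hypothetical.

```python
def error_budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent.

    1.0 means no budget used, 0.0 means fully used, negative means the SLO is breached.
    `slo_target` is a fraction strictly between 0 and 1 (for example 0.999).
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events == 0:
        return 1.0  # no traffic, so no budget consumed

    allowed_bad = (1 - slo_target) * total_events
    actual_bad = total_events - good_events
    return 1 - actual_bad / allowed_bad


if __name__ == "__main__":
    # SLI: share of successful requests. SLO: 99.9% of requests succeed.
    remaining = error_budget_remaining(0.999, good_events=99_950, total_events=100_000)
    print(f"error budget remaining: {remaining:.0%}")
```

Alerting on the rate at which the budget is consumed, rather than on individual system metrics, keeps attention on end-user experience.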
Applications running in VMware Cloud need to be instrumented consistently to emit metrics, logs, and traces so that the signals can be correlated to identify the root cause of any issue. These issues could relate to inaccessibility, operating system (OS) instability, application misconfiguration, or any number of other possibilities.
A well-designed system aims to have the right amount of observability that starts in its development phase. Don't wait until an application is in production before you start to observe it. This includes the setup of monitoring, alerting, and logging so that you can act based on the behaviour of your system.
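As a small illustration of instrumenting an application so that its signals can be correlated, the sketch below emits structured JSON log lines that share a correlation ID, using only Python's standard `logging` module. The field names, the `handle_request` function, and the logger name are assumptions for illustration; a real deployment would follow the schema expected by its chosen observability tooling.

```python
import io
import json
import logging
import sys
import uuid
from typing import Optional, TextIO


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Set via the `extra` argument on each logging call; None if absent.
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(payload)


def build_logger(name: str, stream: TextIO) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger


def handle_request(logger: logging.Logger, correlation_id: Optional[str] = None) -> str:
    """Hypothetical request handler that tags every log line with one ID."""
    cid = correlation_id or str(uuid.uuid4())
    extra = {"correlation_id": cid}
    logger.info("request received", extra=extra)
    logger.info("request completed", extra=extra)
    return cid


if __name__ == "__main__":
    handle_request(build_logger("demo-app", sys.stdout))
```

Because every line carries the same `correlation_id`, a log platform can group the events of a single request and line them up with related metrics and traces.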
Questions to consider when choosing instrumentation for VMware Cloud observability:
- What tooling will be used to monitor VMware Cloud and manage related events? Is this a new tool or will something existing be adapted?
- Do you require a system that supports multi-clouds, including on-premises?
- Is there an egress cost for sending data to the observability system?
- Should the system provide support for multiple regions?
- Should the system scale out on-demand for capacity?
- Should the system support multi-tenancy with separation of teams?
- Should the system include AI-powered intelligence to facilitate AIOps as you evolve your observability practice?
- Does the system need to support third-party integrations such as PagerDuty, ServiceDesk, Datadog, Slack, and VictorOps?
- Does the system need to provide immutability, so that data, logs, and metrics cannot be modified, deleted, or manipulated? This also requires access control.
- Does the system support scraping metrics from modern apps, or does it require an agent to be installed?