VMware Cloud on AWS: SDDC Network Architecture
The networking architecture of an SDDC is arguably the most complex component of VMware Cloud on AWS. While it isn't necessary to understand every internal detail of this architecture, a solid grasp of the fundamentals is important in order to properly design and manage the solution.
The following sections discuss the network architecture of an SDDC from two perspectives: the AWS underlay network, and the NSX overlay network provided by each SDDC.
The AWS Base Layer
The VMware AWS Account
As part of the partnership with AWS, VMware maintains a master AWS account which is used as part of the VMware Cloud on AWS service. Whenever a new cloud services Organization (Org) is created, VMware creates a sub-account within this master account which acts as the parent for all AWS resources used by that Org.
Whenever an SDDC is provisioned, resources for that SDDC are created within the AWS sub-account for the Org. This model allows VMware to manage billing for SDDC consumption (hardware, bandwidth, EBS, etc.).
The SDDC Underlay VPC
As part of the process of provisioning an SDDC, you must provide the number of hosts for the SDDC base cluster and an IP address range to use for SDDC management. Using this information, VMware creates a new VPC within the VMware-owned AWS account for that SDDC's Org. This VPC is created using the SDDC management IP address range provided during provisioning, and several subnets are created within it. VMware then allocates hardware hosts for the SDDC and connects them to the subnets of that VPC.
An internet gateway (IGW) and virtual private gateway (VGW) will also be created for this VPC. These gateways enable internet and Direct Connect connectivity to the VPC.
The SDDC Underlay
Once the underlay VPC has been created for the SDDC and hardware hosts have been provisioned, the SDDC is bootstrapped with ESXi and the remaining VMware software stack is installed. Once the SDDC is up and running, it is possible to get a glimpse of the underlay networking design by viewing the networking configuration of a single ESXi host. This setup is fairly complex and need not be understood in depth; however, some insight into the inner workings of the underlay network will help when it comes to understanding the networking behaviors of the SDDC.
The hosts of an SDDC are provisioned with an Elastic Network Adapter (ENA) which provides their connectivity to the AWS underlay VPC. Within the host, this ENA is attached to an internal N-VDS host-switch where it is made available to the various virtual switches which provide network connectivity to the SDDC. The hosts themselves have a number of VMkernel interfaces which they use for the following purposes:
- management (vmk0)
- vSAN (vmk1)
- vMotion (vmk2)
- AWS API (vmk4)
- NSX Tunnel End Point (not shown)
These VMkernel adapters are visible from the networking section of the configuration tab of the ESXi host. You can also get a sense of the setup by viewing the TCP/IP configuration for the host.
The last portion of the underlay network is the set of virtual switches. Again, these are visible from the networking section of the configuration tab of the ESXi host. Here, you will see a mix of switches; some that are part of the underlay network and others which represent NSX network segments. The switches which are part of the underlay network are there to provide connectivity to the various management appliances of the SDDC. Of particular note is the management switch, which exists in order to provide the appliances with access to the management network. You can get a sense of some of the other important virtual switches in a host by examining their names as well as the management appliances which are attached to them.
The AWS infrastructure is completely unlike a traditional switched network in that it is not based on MAC-learning. Instead, all IP/MAC pairs must be explicitly programmed into the infrastructure by AWS. This presents a problem for the SDDC; specifically, when it comes to vMotion. Although the exact nature of the problem is beyond the scope of this document, it is sufficient to understand that each ESXi host utilizes a series of kernel-level (non NSX-managed) virtual routers designed to enable vMotion on top of AWS. These virtual routers are visible in the network path of a VM whenever you perform a traceroute: you will notice that the interconnects between the NSX edges and the host-level routers (vDR) use a mix of IPv4 addresses from the reserved ranges for link-local and carrier-grade NAT.
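To make those reserved ranges concrete, the short Python sketch below classifies hop addresses against the link-local range (169.254.0.0/16, RFC 3927) and the carrier-grade NAT range (100.64.0.0/10, RFC 6598). The sample hop addresses are hypothetical, chosen only to illustrate the kind of addresses you might see in such a traceroute.

```python
import ipaddress

# Reserved ranges commonly seen on the SDDC's internal router interconnects
LINK_LOCAL = ipaddress.ip_network("169.254.0.0/16")  # RFC 3927
CGN = ipaddress.ip_network("100.64.0.0/10")          # RFC 6598 (carrier-grade NAT)

def classify_hop(addr: str) -> str:
    """Label a traceroute hop by the reserved range it falls into."""
    ip = ipaddress.ip_address(addr)
    if ip in LINK_LOCAL:
        return "link-local"
    if ip in CGN:
        return "carrier-grade NAT"
    return "other"

# Hypothetical hop addresses, for illustration only
for hop in ["169.254.31.2", "100.72.5.1", "10.2.0.1"]:
    print(hop, "->", classify_hop(hop))
```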
The VPC Cross-Link
Every production SDDC must be cross-linked to a VPC within the customer-owned AWS account. This cross-linking is accomplished using the Cross-Account ENI feature of AWS and creates a connection between every host within the base Cluster of the SDDC to a subnet within the cross-linked VPC. This cross-link provides the SDDC with a network forwarding path to services maintained within the customer-owned AWS account. The Availability Zone (AZ) of the cross-link subnet will be used to determine the AZ placement of the hosts.
The Cross-Account ENIs are visible from the customer-owned AWS account (by viewing network interfaces within EC2). You will notice that there are several ENIs created when the SDDC is deployed but not all are active. In addition to ENIs created for active hosts of the SDDC, there will also be ENIs created for future expansion and for upgrades/maintenance. Even though it is possible, you should avoid modifying or deleting these ENIs since doing so may impact the cross-link to the SDDC.
Note - The cross-link architecture is slightly different for stretched cluster SDDCs (covered in later sections).
The SDDC Overlay
An Overview of NSX Networking
As part of the standard SDDC software stack, VMware utilizes NSX-T to create an overlay network atop the base layer provided by AWS. The end result is a logical network architecture which is completely abstracted from the underlying infrastructure.
Network overlays operate on the notion of encapsulation; they hide network traffic between VMs within the overlay from the underlying infrastructure. Networking is full of examples of overlay networks. Older protocols, such as GRE and IPSec ESP, have been around for years and were designed to create network overlays (typically over a WAN). With the introduction of software defined networking, specialty protocols such as VXLAN, STT, and NVGRE were created to help alleviate some of the limitations of VLAN-based data center networks. In recent years a newer overlay protocol known as GENEVE was introduced to address limitations with the first round of data center overlay protocols. NSX-T uses GENEVE as its overlay networking protocol within the SDDC.
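The encapsulation itself is straightforward: every overlay packet carries a small GENEVE header identifying the virtual network it belongs to. As a rough illustration, the sketch below packs the fixed 8-byte GENEVE base header defined in RFC 8926 (it omits variable-length options, and the VNI value is arbitrary).

```python
import struct

def geneve_header(vni: int, opt_len_words: int = 0) -> bytes:
    """Build a minimal 8-byte GENEVE base header (RFC 8926).

    vni: 24-bit Virtual Network Identifier of the overlay segment.
    opt_len_words: length of variable options in 4-byte words (0 = none).
    """
    assert 0 <= vni < 2**24
    ver_optlen = (0 << 6) | opt_len_words  # version 0, option length
    flags = 0                              # O (OAM) and C (critical) bits clear
    proto = 0x6558                         # Transparent Ethernet Bridging
    vni_rsvd = vni << 8                    # 24-bit VNI plus 8 reserved bits
    return struct.pack("!BBHI", ver_optlen, flags, proto, vni_rsvd)

hdr = geneve_header(vni=5001)
print(hdr.hex())  # -> 0000655800138900
```

The inner Ethernet frame of the VM-to-VM traffic follows this header, and the whole thing rides inside an ordinary UDP/IP packet on the underlay.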
Higher-level constructs aside, software defined networking defines two types of objects: logical switches and logical routers. These objects are designed to mimic the behavior of their counterparts in traditional hardware-based networks. Logical switches operate at layer-2 and will forward traffic between nodes within the same network segment. Logical routers operate at layer-3 and will route traffic between network segments. With NSX, logical switches and logical routers are distributed. This means that each host of the SDDC maintains enough information to understand which VMs belong to which logical switch(es) and how to forward traffic through the underlay network. When two VMs communicate with one another, the exact path through the underlay becomes a function of where the VMs reside (i.e. on which host they reside).
In the figure above, we see several examples of different types of traffic flows. As illustrated, intra-segment traffic will either be switched locally for VMs on the same host or, for VMs on different hosts, encapsulated and sent through the underlay. Similarly, for inter-segment traffic the routing will take place locally before being switched or encapsulated and sent through the underlay.
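That placement-dependent behavior can be summarized as a simple decision procedure. The sketch below is a toy model of the distributed forwarding decision, not NSX code; host and segment names are hypothetical.

```python
def forward(src_host: str, dst_host: str,
            src_segment: str, dst_segment: str) -> list:
    """Toy model of the NSX distributed forwarding decision for a packet
    between two VMs, given their host placement and segments."""
    steps = []
    if src_segment != dst_segment:
        # Inter-segment: the distributed router routes on the source host
        steps.append("route locally on " + src_host)
    if src_host == dst_host:
        # Same host: traffic never touches the underlay
        steps.append("switch locally on " + src_host)
    else:
        # Different hosts: GENEVE-encapsulate and send via the underlay VPC
        steps.append("encapsulate and send through underlay to " + dst_host)
    return steps

# Same segment, same host: purely local switching
print(forward("esx-01", "esx-01", "seg-a", "seg-a"))
# Different segments, different hosts: local routing, then encapsulation
print(forward("esx-01", "esx-02", "seg-a", "seg-b"))
```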
The NSX Overlay Network
The SDDC utilizes NSX to create an overlay network with 2 tiers of routing. At the first tier of the network is an NSX tier-0 router which acts as the north-south border device for the entire SDDC. At the second tier are the NSX tier-1 routers which are known as the Management Gateway (MGW) and Compute Gateway (CGW). These tier-1 routers act as the gateways for their respective networks.
There are two distinct types of networks within the SDDC: the management network and the compute network. The management network is considered to be part of the SDDC infrastructure and provides network connectivity to the various infrastructure components of the SDDC. Due to the permissions model of the service, the IP address space is fixed for this network and its layout may not be altered. The compute network is used by the compute workloads of the SDDC. Customers have the ability to add and remove network segments within the compute network as needed.
NSX Logical Routers
All routers within the SDDC are distributed. This means that routing between segments is performed locally on the ESXi host by the appropriate distributed router (DR). Certain functions, however, are not distributed and must be handled centrally by a Service Router (SR) component on the NSX edge appliances. Specifically, gateway firewall and NAT operations are handled in a centralized manner. Also, any traffic which passes between the overlay and underlay networks must be handled by the tier-0 edge SR on the edge appliances. The edge appliances are deployed in a redundant pair, with all SRs running on the active appliance and with SRs on the standby appliance sitting idle. In the event of a failure of one or more SRs on the active appliance, all SRs will fail over to the standby appliance. This is an optimization strategy designed to prevent unnecessary traffic flows between the edge appliances.
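The all-or-nothing failover behavior can be captured in a few lines. The sketch below is a toy model of the redundant edge pair (appliance and SR names are hypothetical), showing that every SR is always pinned to the single active appliance and that any failover moves them as a group.

```python
class EdgePair:
    """Toy model of the redundant NSX edge appliances: all service
    routers (SRs) run on the active appliance and fail over together."""

    def __init__(self):
        self.active, self.standby = "edge-0", "edge-1"
        self.srs = ["tier0-SR", "MGW-SR", "CGW-SR"]

    def placement(self) -> dict:
        # Every SR is pinned to whichever appliance is currently active
        return {sr: self.active for sr in self.srs}

    def failover(self):
        # A failure of any SR moves ALL SRs to the standby appliance,
        # avoiding traffic hairpinning between the two edges
        self.active, self.standby = self.standby, self.active

pair = EdgePair()
print(pair.placement())  # all SRs on edge-0
pair.failover()
print(pair.placement())  # all SRs on edge-1
```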
In the figure above, we see an example traffic flow where a VM is communicating to the internet. In this example, the traffic is routed between the CGW and Edge DRs locally before being sent through the underlay to the Edge SR in the active edge appliance. From there, the traffic is routed through to the vDR on the local ESXi host before finally being sent out to the internet. This routing pattern would be visible in a traceroute from the VM.
The VPC Cross-Link to the SDDC Edge
As discussed previously, every host of the SDDC is connected to a VPC within the customer-owned AWS account via Cross-Account ENI connections. These connections are there in order to provide a forwarding path to the tier-0 edge of the SDDC.
Since we must always pass through the Edge SR whenever traffic leaves the SDDC, all traffic to and from the cross-linked VPC must pass through the active edge appliance of the SDDC. Since this edge appliance is a VM, it resides on a specific ESXi host. As such, it will always use the Cross-Account ENI for that host (as well as the local vDR for that host). It should be noted that the hosts of the SDDC will be deployed within the same Availability Zone (AZ) as the cross-link subnet (the subnet which contains the Cross-Account ENIs). This practice not only provides a control mechanism for AZ placement of the SDDC (customer chooses the subnet and thus controls the AZ placement), but also serves to eliminate cross-AZ bandwidth charges between the edge and any native AWS resources within that same AZ.
Note - The cross-link architecture is slightly different for stretched cluster SDDCs (covered in later sections).
Routing between the SDDC and the VPC is enabled through static routes which are created on-demand as networks are added to the SDDC. These static routes are added to the main routing table of the customer VPC and use one of the Cross-Account ENIs as the next-hop for the route. It is important to keep in mind that the next-hop ENI used for the static routes will always be that of the ESXi host which houses the active edge appliance of the SDDC. This means that if the edge were to migrate to a different host (as happens during a fail-over event or whenever the SDDC is upgraded) then the next-hop of the static routes will be updated to reflect this change. For this reason, it is not recommended to manually copy these static routes to other routing tables of the VPC.
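The mechanism can be modeled as a simple mapping from SDDC prefixes to the ENI of the host currently running the active edge. The sketch below is purely illustrative: the service maintains these routes automatically, and the prefix and ENI names here are hypothetical, not real identifiers or an API.

```python
def update_vpc_routes(route_table: dict, sddc_prefixes: list,
                      active_edge_eni: str) -> dict:
    """Model of how the service maintains static routes in the customer
    VPC's main route table: every SDDC prefix points at the Cross-Account
    ENI of the host running the active edge appliance."""
    for prefix in sddc_prefixes:
        route_table[prefix] = active_edge_eni
    return route_table

rt = {}
# SDDC network segment is created; a static route appears
rt = update_vpc_routes(rt, ["192.168.1.0/24"], "eni-host3")
# Edge migrates to a different host (failover or SDDC upgrade);
# the next-hop ENI of the route is updated automatically
rt = update_vpc_routes(rt, ["192.168.1.0/24"], "eni-host7")
print(rt)
```

Because the next-hop ENI changes whenever the edge moves, any manually copied routes in other routing tables would silently go stale, which is exactly why copying them is discouraged.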
Edge Uplinks and Traffic Flows
It is important to understand the various uplinks in the SDDC and what traffic flows through them. This information is useful not only for understanding inter-connectivity within the SDDC, but also in understanding how traffic exits the SDDC (and potentially incurs bandwidth charges). There are currently three uplinks from the tier-0 edge of the SDDC. These are described below.
Note - There are additional traffic flow considerations for stretched cluster SDDCs (covered in later sections).
Internet Uplink
The internet uplink provides the SDDC with internet connectivity via the IGW within the underlay VPC. The SDDC edge has a default route which points to the IGW as a next hop, so it will use this uplink for all unknown destination networks. Traffic over this uplink is billable and the charges will be passed through as part of the billing for the SDDC.
VPC Uplink
The VPC uplink connects the SDDC edge to the cross-linked VPC in the customer-owned AWS account. There is a static route on the SDDC edge for the private address space of the VPC which points to the VPC router as a next hop. The SDDC administrator may also enable static routing of certain public AWS services (e.g. S3) over this uplink. Traffic over this uplink is non-billable only for AWS resources which are within the same Availability Zone as the SDDC. Traffic to resources in other Availability Zones is billable and charges will be accrued on the customer-owned AWS account.
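The billing rule for this uplink reduces to a single comparison, sketched below (AZ names are illustrative):

```python
def vpc_uplink_billable(sddc_az: str, resource_az: str) -> bool:
    """Traffic over the VPC cross-link uplink is free only when the
    destination AWS resource shares the SDDC edge's Availability Zone;
    any cross-AZ traffic accrues charges on the customer AWS account."""
    return sddc_az != resource_az

print(vpc_uplink_billable("us-east-1a", "us-east-1a"))  # same AZ: not billable
print(vpc_uplink_billable("us-east-1a", "us-east-1b"))  # cross-AZ: billable
```

This is why the choice of cross-link subnet (and therefore AZ) matters when placing latency- or cost-sensitive AWS resources alongside the SDDC.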
Direct Connect Uplink
The Direct Connect uplink is only used when a Direct Connect private VIF is plumbed into the SDDC. The SDDC edge will use this uplink for whatever network prefixes are received via BGP over this uplink. Since Direct Connect is a resource which is managed by the customer-owned AWS account, bandwidth charges over this uplink will be accrued on the customer-owned AWS account.
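Taken together, the three uplinks behave like an ordinary longest-prefix-match routing table on the tier-0 edge: the default route points at the IGW, the cross-linked VPC's CIDR points at the VPC uplink, and BGP-learned prefixes point at Direct Connect. The sketch below illustrates the selection logic; the route entries are examples, not the real table.

```python
import ipaddress

def select_uplink(dst: str, routes: dict) -> str:
    """Longest-prefix-match sketch of how the tier-0 edge picks an uplink
    for a destination address."""
    ip = ipaddress.ip_address(dst)
    matches = [(net, uplink) for net, uplink in routes.items()
               if ip in ipaddress.ip_network(net)]
    # The most specific (longest) matching prefix wins
    return max(matches, key=lambda m: ipaddress.ip_network(m[0]).prefixlen)[1]

routes = {
    "0.0.0.0/0": "internet (IGW)",      # default route
    "172.31.0.0/16": "VPC cross-link",  # cross-linked VPC CIDR (example)
    "10.0.0.0/8": "Direct Connect",     # prefixes learned via BGP (example)
}
print(select_uplink("8.8.8.8", routes))     # -> internet (IGW)
print(select_uplink("172.31.4.9", routes))  # -> VPC cross-link
```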
Stretched Cluster Networking
A stretched cluster SDDC is one in which the hosts of each Cluster are split evenly between two Availability Zones (AZ) of an AWS Region. In order to ensure availability, the SDDC must be able to freely migrate workloads between AZs whenever a significant outage is detected in one AZ or the other.
When migrating workloads in a traditional network, one of the biggest challenges is with maintaining network connectivity post-migration. For compute workloads within an SDDC, NSX reduces this challenge through its use of network overlays that serve to abstract the underlying infrastructure. However, SDDC management appliances are a special case. Since they have a direct presence on the underlying Subnets of the VPC, special steps must be taken to ensure that management appliances retain their network connectivity should they migrate between AZs.
Since Subnets are bound to a particular AZ, it is not possible for a management appliance to be migrated between AZs and maintain connectivity to its original Subnet(s). For this reason, stretched cluster SDDCs will be deployed with dual sets of Subnets which are designed to provide network connectivity to management appliances in the event that they migrate between AZs. The management appliances are issued “floating” IP addresses which, when combined with the routing functionality of the host-level vDR, enables them to transparently migrate between AZs while maintaining their network connectivity. Note that these “floating” IP addresses are taken from the management address range of the SDDC and are not to be confused with public floating IP addresses used within native AWS.
Stretched Cluster VPC Cross-Link
VPC cross-linking is another example of the networking challenges the SDDC must deal with in a stretched cluster design. As a standard part of the SDDC deployment, the hosts of the base Cluster are cross-linked to a Subnet within the customer-owned VPC. This cross-linking is performed in order to provide the SDDC edge with a forwarding path to that VPC. This forwarding path enables the SDDC to access AWS services within the customer AWS account.
As with all Subnets, the cross-link Subnet is bound to a single AZ. This means that if the edge appliance migrates between AZs then it will lose access to this Subnet. The solution to this problem is to simply require a pair of cross-link Subnets to be provided whenever a stretched cluster SDDC is deployed. These Subnets must reside within separate AZs and, like a standard SDDC, they will control AZ placement of the hosts within the SDDC.
In the event that the edge appliance migrates between AZs, the routing table of the customer-owned VPC will be updated to route traffic through the appropriate cross-link Subnet.
Stretched Cluster Traffic Flows
In addition to the network flows discussed previously, stretched cluster SDDCs come with an additional set of behaviors which must be kept in mind when designing with this feature. Specifically, you must consider both the cross-AZ bandwidth charges that will result from a stretched cluster design and the reduced network throughput for traffic which passes between AZs. In particular, the following traffic flows should be kept in mind.
- East-west communications between the workloads of an SDDC whenever these workloads reside in different AZs.
- North-south traffic for workloads which reside in a different AZ than the NSX edge appliance. This includes all traffic which must pass through the edge, including internet, Direct Connect, and traffic to the cross-linked VPC. Cross-linked VPC traffic is particularly complex since the traffic could easily cross Availability Zones twice depending on the AZ of the destination AWS resource. Additionally, this traffic profile could easily reverse itself if the edge were to migrate between AZs.
vSAN Replication Traffic
Traffic for vSAN synchronous writes between the hosts of a stretched cluster SDDC will result in cross-AZ traffic flows. The amount of traffic heavily depends on the storage I/O profile of your workloads.
Authors and Contributors
Author: Dustin Spinhirne