Design Considerations

Infrastructure Dependencies & Design Considerations

Security always depends on context, and in most cases the context is influenced by how an organization intends to use their cloud presence. Access control, roles, permissions, network connectivity, and other security controls will differ between a VMware Cloud SDDC intended only as a disaster recovery site and a VMware Cloud SDDC running active production workloads. Determining the use case of your VMware Cloud SDDC and then documenting it helps teams and organizations make decisions about security, risk, and availability, and helps decide what needs to change if the scope of the deployment changes.

DNS Availability and Records

The Domain Name System (DNS) is crucially important to both workloads and the users of those workloads. DNS provides an abstraction layer so that IP addresses don’t need to be tracked and used directly, and more human-friendly names can be used. TLS certificates are also correlated with DNS, helping establish trust between systems within virtual infrastructures. As such, most organizations depend heavily on DNS, and require that their DNS servers be made highly available. Your organization’s intentions for an SDDC will determine the methods you use to ensure DNS availability.

VMware Cloud SDDCs provision resolvers and DNS records during deployment for use by the infrastructure. This ensures that the infrastructure is always reachable internally through the DNS names.

Ideas to consider:

  • DNS records help determine how network traffic flows. If a domain name resolves to an IP address internal to your organization then traffic will flow on internal links and VPN connections between sites. If a domain name resolves to a public IP address then traffic will flow across the public Internet.
  • Latency of DNS resolution has a profound impact on the performance of workloads and services, from response time to overall system load. Keeping DNS resolution as close to a workload as possible ensures the best performance.
  • Do the DNS servers you use provide authoritative name services for customer-facing and external services? If those systems are unavailable will your customers be unable to reach you?
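The first two considerations above can be checked programmatically: classifying each address a name resolves to as internal or public shows which path traffic will take. A minimal sketch using only the Python standard library (the hostname `localhost` is used purely for illustration; substitute your own service names):

```python
import ipaddress
import socket

def resolution_scope(hostname: str) -> dict:
    """Resolve a hostname and classify each answer as internal
    (RFC 1918, loopback, link-local) or public. Internal answers mean
    traffic stays on internal links or VPNs; public answers mean it
    crosses the Internet."""
    results = {}
    for info in socket.getaddrinfo(hostname, None):
        addr = ipaddress.ip_address(info[4][0])
        results[str(addr)] = "internal" if addr.is_private else "public"
    return results

# A loopback name resolves to internal addresses only:
print(resolution_scope("localhost"))
```

Running a check like this against critical service names from each site can reveal surprises, such as a name that resolves publicly from a cloud SDDC but internally from on-premises networks.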

“Split-brain” DNS methods, where one view of an environment is available to some clients and another view is available to others, can be a useful tool for organizations. They can also be confusing and lead to errors if there are multiple sources of authoritative information. The phrase “security through obscurity” describes the act of hiding things in order to secure them; it is not a legitimate approach to security. Omitting DNS records in the hope that attackers will not find a system only makes cloud and virtual infrastructure administrators’ jobs harder, for no actual security gain. It also leads to outages and unnecessary complexity, and prevents the use of service DNS entries to make services flexible and portable.

The way your organization implements DNS will determine how it can be made available in hybrid or disaster recovery scenarios. Some DNS server software tolerates being cloned & replicated; other software does not. Microsoft Active Directory is one example: deployment best practices state that systems supporting Active Directory should not be cloned, but instead installed as fresh deployments.

  • How will your workloads reach the DNS server? Does your organization employ “service IPs” for DNS that are highly available, moving between DNS servers, or an anycast routing scheme to direct traffic to the nearest DNS resolver?
  • How will workloads that move between an on-premises deployment and a VMware Cloud SDDC find their DNS resolvers? Will moving a workload require reconfiguring the network settings? How will reconfiguring many workloads during an incident affect your RTO?
  • Do you have your authoritative DNS servers and DNS recursive resolvers separated, according to best practices for DNS operations? If so, you may need to employ different availability and security methods for each type of server. In general, authoritative DNS servers should not answer recursive DNS requests from clients, and recursive DNS resolvers should not be accessible publicly.
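As one illustration of the separation described above, a BIND-style authoritative server can refuse recursive queries outright. A configuration sketch, assuming BIND 9 (adapt the idea to whatever DNS software your organization runs):

```
options {
    recursion no;               // authoritative-only: refuse recursive queries
    allow-transfer { none; };   // deny zone transfers unless explicitly needed
};
```

A matching recursive resolver would carry `recursion yes;` but restrict queries to internal client networks, keeping it off the public Internet.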

Many cloud providers offer DNS services that can be used across local clouds and global public cloud regions. This can simplify management for hybrid deployments and during workload migrations between environments.

Do your systems permit access based on domain names? For example, many Linux systems are configured with hosts.allow and hosts.deny files that can contain either IP address ranges or domain names that are permitted. Similar configurations can be achieved with guest OS firewall rules, too. During an outage where DNS is potentially affected, will authorized administrators be able to connect to systems to repair them?
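A TCP Wrappers policy mixing names and addresses might look like the fragment below (the domain and network are placeholders, not recommendations). Note the failure mode: domain-based entries depend on working reverse DNS, so a DNS outage can silently lock out the very administrators trying to fix it.

```
# /etc/hosts.allow -- checked first; domain entries require working reverse DNS
sshd : .corp.example.com 10.0.0.0/255.255.0.0

# /etc/hosts.deny -- everything not explicitly allowed above
sshd : ALL
```

Keeping at least one IP-based entry alongside any name-based rules preserves emergency access when name resolution is unavailable.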

DNS is a powerful system, and the hierarchy in it can be used to help track other information, such as where a workload is running, which makes IT operations easier. However, as workloads move between clouds other systems (like a Configuration Management Database) may need to be kept up to date, too. Location information in DNS can also leak information to attackers about where your organization’s facilities are. Security is always a tradeoff; ensure your organization is comfortable with the risks.

IP Addressing and Management

Organizations that migrate to the cloud find themselves with more complex IP address management. Decisions about addressing help determine the complexity of many other activities, especially failover planning, migrations, and firewalling. Working to simplify IP address allocations pays dividends in operational efficiency later.

Ideas to consider:

A GCVE Private Cloud requires specific network allocations for the management components. Does your IP allocation strategy for the cloud assume that your organization will only ever have a specific number of SDDCs, a specific number of sites, or that all sites are in the same region?

Separating and isolating infrastructure management interfaces is an important step towards making it harder for attackers to move laterally inside your environment. Does your IP addressing system allow for infrastructure management interfaces to be isolated from clients, workloads, and other infrastructure systems?

Network Address Translation

Network Address Translation, or NAT, is a technique where multiple network addresses can be mapped to a single IP address. In most cases it is used as a way to circumvent IPv4 address exhaustion on the public Internet. Many network providers only supply a single IP address to their customers, thereby requiring the use of NAT. These IP addresses may also be dynamic, changing periodically as the provider reprovisions equipment and networks.

There are two types of NAT: source NAT and destination NAT. Source NAT is the type most users are familiar with, as it translates many internal IP addresses to a single public IP. By default, source NAT is applied to outbound network traffic on workload network segments. Destination NAT is the inverse: it translates traffic destined for a public IP address to an internal private address, and is often used in conjunction with port forwarding to enable application access.
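The mechanics can be illustrated with a toy source-NAT translation table: outbound flows are assigned unique ports on the single public IP, and only traffic matching an existing mapping can come back in. All addresses below are documentation addresses used for illustration only.

```python
import itertools

class SourceNat:
    """Toy source-NAT table mapping (internal IP, port) flows to unique
    ports on one public IP, as a gateway would. Not a real NAT device;
    a sketch of the bookkeeping involved."""
    def __init__(self, public_ip: str):
        self.public_ip = public_ip
        self._ports = itertools.count(40000)   # ephemeral port pool
        self.table = {}                        # (src_ip, src_port) -> public port

    def translate_outbound(self, src_ip: str, src_port: int):
        key = (src_ip, src_port)
        if key not in self.table:
            self.table[key] = next(self._ports)  # new flow gets a new port
        return (self.public_ip, self.table[key])

    def translate_inbound(self, public_port: int):
        """Reverse lookup: only replies to an existing flow are admitted."""
        for key, port in self.table.items():
            if port == public_port:
                return key
        return None   # unsolicited traffic: no mapping, dropped

nat = SourceNat("203.0.113.10")
print(nat.translate_outbound("192.168.1.5", 51000))  # ('203.0.113.10', 40000)
```

Destination NAT would be the `translate_inbound` direction configured statically: a fixed public port mapped to one internal address regardless of prior outbound traffic.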

NAT is not a security control by itself. There are several ways to deanonymize network traffic that is obfuscated by NAT, and there are situations where unsolicited network traffic can be transmitted back through the NAT device, to probe networks and initiate attacks. Use of NAT should also be accompanied by use of firewalling technologies and rules. In Google Cloud that is the case, with NSX protecting traffic in all directions from the management and compute gateways.

Ideas to consider:

Organizations with a distributed workforce and/or distributed branch offices may need to consider the impact NAT has on access control. Does your organization authorize people or systems based on IP addresses that are subject to NAT?

Storage Availability and Security

Google Cloud VMware Engine Private Clouds are built by default using VMware vSAN, using storage supplied by the cloud provider. When vSAN storage is configured it has vSAN Data-at-Rest Encryption and compression enabled. vSAN datastores in Google Cloud VMware Engine offer choices for storage availability, including replication across multiple hosts, stretched clusters, and the number of host failures to tolerate.

Inside the cluster, the VM Storage Policies can be customized. VMware vSAN allows customers to choose the affinity and disaster tolerance in stretched clusters, as well as host failures to tolerate (from none to three). This allows an organization to balance capacity, performance, and space efficiency against their tolerance for risk, and their budget.
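The capacity-versus-risk tradeoff can be sized with a back-of-the-envelope calculation. For vSAN RAID-1 mirroring, tolerating n host failures requires n+1 copies of the data plus witness components, for a minimum of 2n+1 hosts; erasure-coded (RAID-5/6) policies have different minimums, so treat this as a sketch for the mirroring case only:

```python
def min_hosts_for_ftt(failures_to_tolerate: int) -> int:
    """Minimum cluster size for vSAN RAID-1 mirroring with a given
    'failures to tolerate' (FTT): FTT+1 data copies plus witness
    components need 2*FTT + 1 hosts."""
    if not 0 <= failures_to_tolerate <= 3:
        raise ValueError("vSAN supports FTT values from 0 to 3")
    return 2 * failures_to_tolerate + 1

for ftt in range(4):
    print(f"FTT={ftt} needs at least {min_hosts_for_ftt(ftt)} hosts")
```

Raising FTT also multiplies capacity consumption (each additional failure tolerated adds a full copy of the data under mirroring), which is where the budget conversation enters.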

Ideas to consider:

Changes in vSAN storage policies require a resynchronization process, which is not instantaneous. If your organization customizes storage availability policies to improve capacity will there be a time, perhaps as part of a failover process, where storage availability will need to change? Is that process documented? Is the risk during the resynchronization process acceptable? Are the relevant policies preconfigured to ensure they are correct, saving time and avoiding errors during an already stressful incident?

Templates and Container Registry

GCVE Private Clouds support template management through the vSphere Content Library. The Content Library makes storing, replicating, using, and updating templates, ISO images, and other system artifacts easy. Customers migrating to GCVE, or running in a hybrid design model, can configure their on-premises Content Library as a replication source, and configure their SDDC’s Content Library as a consumer of that content. Not only does this enable day-to-day operations in the cloud, but also provides resilience for guest OS boot media and other recovery tools during an incident.

Virtual machine templates provide a straightforward way to deploy new VM-based workloads. There are many ways to manage templates, from completely manual processes to heavy reliance on automation tools like SaltStack. Automated methods of configuring a new virtual machine save time by guaranteeing consistency for system configuration, including software updates and security controls. In turn, this speeds audits and makes improving security easier. It also makes template management easier, because templates can be generic, customized at deployment time based on the current patch levels and system configurations in use.

Container based workloads rely on a container image to run. These container images are immutable, meaning that they cannot be changed and can be considered static. The nature of these immutable images means that there is no need to patch the images in place, but new container images should be built as security vulnerabilities are identified in the individual components of the image. Because container images are rebuilt frequently, it is important to have a secure supply chain to ensure that no new vulnerabilities are unexpectedly introduced into an immutable image used by your container workloads.
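One simple supply-chain control is to pin images by content digest rather than by mutable tag, then verify the digest of what was actually pulled. A minimal sketch of the verification step (the image bytes and digest here are stand-ins, not a real registry interaction):

```python
import hashlib

def verify_image_digest(image_bytes: bytes, expected_digest: str) -> bool:
    """Compare the SHA-256 digest of pulled image content against the
    digest pinned in a deployment manifest (the image@sha256:<digest>
    form). Pinning by digest means a rebuilt or tampered image cannot
    be substituted silently under the same tag."""
    actual = hashlib.sha256(image_bytes).hexdigest()
    return actual == expected_digest

content = b"example image content"   # stand-in for real image bytes
pinned = hashlib.sha256(content).hexdigest()
print(verify_image_digest(content, pinned))      # True
print(verify_image_digest(b"tampered", pinned))  # False
```

Registries and admission controllers can enforce this automatically; the point is that each rebuild produces a new digest, so unexpected changes are detectable.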

Ideas to consider:

  • Does your organization store installation and recovery media for all guest OSes in the VMware vSphere Content Library? Is that library replicated to all the sites that might depend on it for incident response?
  • Does your organization have a process to regularly update content library content, templates, and other content stored as part of disaster recovery and business continuity processes? Are old template images removed in order to prevent redeployment of outdated and potentially insecure configurations?
  • Does your organization use a configuration management and auditing tool such as SaltStack to ensure consistency of deployed virtual machines, and make building new workloads easier and faster? Configuration management tools reduce the complexity of templates and container images by managing the configurations once a workload is deployed.

Backups and Restores

Backup systems are the last line of defense against incidents, especially security breaches involving ransomware. Incidents can also encompass less dramatic situations, such as a failed application upgrade or human error. Being able to roll back a workload to a known-good state is a powerful protection.

Breaches involving ransomware can be quite long, measuring hundreds of days from the initial breach to the containment of the breach. Attackers are patient and will work to ensure that an organization must pay the ransom. This often entails disabling or corrupting backups. Organizations must make it difficult or impossible for attackers to access backup systems.

Recovery Point Objective (RPO) and Recovery Time Objective (RTO) are important considerations for determining backup frequency and scope, as well as whether workloads should also be protected with other means, such as with replication.
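The relationship between RPO and backup frequency is worth making explicit: in the worst case a failure strikes just before the next backup runs, losing everything written since the previous one, so the backup interval must not exceed the RPO. A trivial sketch of that check:

```python
from datetime import timedelta

def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """Worst case, a failure occurs just before the next backup,
    losing all data written since the last one. The interval between
    backups must therefore be no larger than the RPO."""
    return backup_interval <= rpo

print(meets_rpo(timedelta(hours=24), timedelta(hours=4)))  # False
print(meets_rpo(timedelta(hours=1), timedelta(hours=4)))   # True
```

Daily backups, for example, cannot satisfy a 4-hour RPO; workloads with tight RPOs need more frequent backups or continuous replication as mentioned above.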

Ideas to consider:

  • An attack occurring over a long period of time will cause your backup systems to capture the results of the attack on affected systems. This is an important consideration, because restoring a backup may also restore infected systems, and/or restore systems to a vulnerable state. How would you recover workloads in a questionable state, and how would you assess their reliability?
  • Are your backup systems isolated from corporate authentication and authorization systems, such as a centralized Active Directory? If an attacker gains administrative access to the central directory what security controls will stop them from accessing, deleting, and corrupting backups and replicated copies of workloads?
  • Are workloads configured to separate operating systems, applications, and application data, so that if malware is found it might be possible to independently restore the data, remounting or reattaching it to a fresh installation of the application?
  • Have you documented the restore procedure for workloads? Do you rehearse it regularly to ensure that it works, and that staff understand it? Are all components and tools for the restore available if the original SDDC is not available?
  • Is it possible that you will need to restore your backups to a different availability zone or SDDC, following the loss of an SDDC or loss of access to an availability zone? Are your workloads able to have their public IP addresses renumbered? Do DNS entries have Time-To-Live values suitable for your desired RTO?
  • Would you be able to recreate NSX network segments to restore internal connectivity to applications? Do you have backups of firewall and NSX network segment configurations?
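The TTL question in the list above has a direct arithmetic consequence: after repointing a DNS record during failover, resolvers may keep serving the old address for up to the record's TTL, so the worst-case time until clients reach the restored workload is the restore time plus the TTL. A sketch (the sample values are illustrative):

```python
def worst_case_cutover_minutes(ttl_seconds: int, restore_minutes: float) -> float:
    """Worst-case minutes until all clients reach a failed-over service:
    the time to restore it plus the old DNS record's TTL, during which
    cached answers still point at the dead address."""
    return restore_minutes + ttl_seconds / 60

# A 24-hour TTL blows a 2-hour RTO even with an instantaneous restore:
print(worst_case_cutover_minutes(86400, 0))   # 1440.0
# A 5-minute TTL leaves almost the entire RTO budget for the restore itself:
print(worst_case_cutover_minutes(300, 90))    # 95.0
```

Lowering TTLs on records you expect to repoint during disaster recovery, well before an incident, is the inexpensive mitigation.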

Management Interface Availability

Access to management interfaces and cloud console interfaces can be incredibly paradoxical. Organizations need to limit access to them, but at the same time allow access to authorized staff, possibly from unexpected but legitimate IP addresses and locations if an organization’s primary site is unavailable. This is where modern zero trust methods of authentication and access control are very helpful.

Many organizations employ bastion hosts or “jump boxes” to help control access to management interfaces. Additionally, some organizations, including VMware internally, use dedicated VMware Horizon VDI deployments to provide secure & trusted access to systems management tools and interfaces. Staff connect to these systems, then can interact with infrastructure from a known & trusted management workstation image.

Ideas to consider:

  • How will IT staff access cloud consoles and management interfaces to conduct recovery operations during an incident if the primary site is offline and potentially unrecoverable?
  • Are bastion hosts, jump boxes, and/or dedicated VDI instances patched quickly and proactively, to ensure that attackers cannot exploit new vulnerabilities to gain access?
  • Do management interfaces rely on authentication and authorization provided by a central directory, such as Microsoft Active Directory? Is that directory considered “in scope” for compliance audits? How does your organization protect against unauthorized changes by administrators of those systems, potentially allowing privileged administrator access to infrastructure systems?

Incident Response and Business Continuity

Organizations that make the shift to assuming that a breach will happen are the organizations that tend to be the most prepared if it happens. Making this assumption is an important mindset change for combating ransomware and other types of attacks. Ensure that attacks are covered in your organization’s disaster recovery & business continuity planning. Plan for an “everything down” scenario. Most importantly, organizations should be proactive in engaging a security consultancy that specializes in incident response. Incident response is a separate function from business continuity planning, but crucial for understanding how an attack happened and how to recover in a way that preserves evidence and prevents the recurrence of an attack. Work with your incident response team to develop a plan.

As part of your organization’s business continuity plan your systems should be classified according to their business criticality. This will help you assign correct security controls to systems, assess dependencies, and prioritize work during an incident.

Ensure that contact information and roles & responsibilities documents are stored in a place that will be accessible if IT systems are offline. Many organizations with otherwise terrific business continuity plans have found themselves hampered because those plans were stored on systems rendered inaccessible by the outage.
