Monitor TKG With Prometheus and Grafana for VMC on AWS

Introduction to Kubernetes Monitoring

Monitoring a Kubernetes cluster eases the management of containerized infrastructure by tracking the utilization of cluster resources, including memory, CPU, and storage. Cluster operators can monitor and receive alerts if the desired number of pods is not running, if resource utilization is approaching critical limits, or if failures or misconfiguration cause pods or nodes to stop participating in the cluster.

Scope

This article is focused on setting up Prometheus and Grafana Extensions on a Tanzu Kubernetes Cluster.

Prerequisites

Before you deploy the TKG extensions, you should meet the following prerequisites:

Deploy Tanzu Kubernetes Cluster (workload cluster).
Install Carvel Tools on the machine you are using to manage your Tanzu Kubernetes Clusters.
Install the Contour Extension on the cluster where you are installing Grafana & Prometheus. Instructions for installing Contour are documented here.
Upload the Tanzu Kubernetes Grid Extension bundle to the machine from which the installation will be run. The extension bundle can be downloaded from here.

Challenges of Kubernetes Monitoring

Kubernetes abstracts away a lot of complexity to speed up application deployment, but in the process it leaves you blind as to what is actually happening behind the scenes, what resources are being utilized, and even the cost implications of the actions being taken. In a Kubernetes environment, the number of components is typically greater than in traditional infrastructure, which makes root cause analysis more difficult when things go wrong.

The main challenges that are associated with monitoring a Kubernetes environment are:

Millions of Metrics: Kubernetes is a multi-layered solution. You have a management plane and a control plane, and each plane runs a number of components, for example kube-apiserver, kube-controller-manager, DNS, and kube-scheduler. You then have your workloads, which typically include pods, deployments, replica sets, and so on. These components churn out millions of events per day, and not every metric is important.
Ephemerality: The dynamic nature of Kubernetes allows you to scale deployments up or down on the fly. When an application is scaled down, the underlying pods disappear forever. As new deployments are scheduled, the kube-scheduler may decide that a pod must be evicted in order to free up resources on a given node. The pod is then recreated elsewhere: effectively the same workload, just with a different name and in a different place. The monitoring solution should pick up these changes and should not send alerts for these types of events.

Which Kubernetes Metrics Should You Monitor?

It is very important to know which Kubernetes metrics to monitor. Some key metrics to consider are:

Cluster state, including the health and availability of pods.
Node status, including readiness, memory, disk or processor overload, and network availability.
Pod availability.
Memory utilization at the pod and node level.
Disk utilization, including lack of file system space and exhaustion of index nodes (inodes).
CPU utilization in relation to the amount of CPU resource allocated to the pod.
API request latency measured in milliseconds.
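
Once the Prometheus extension described later in this article is running, several of these metrics can be queried through the Prometheus HTTP API. The following is an illustrative sketch only. The helper names metric_query and prom_query are hypothetical, the metric names assume that kube-state-metrics, node_exporter, cAdvisor, and the API server are scraped under their default metric names, and prometheus.tanzu.lab is the sample FQDN used in this article.

```shell
# Hypothetical helper: map a short name to an example PromQL expression.
# Metric names are assumptions based on the exporters' default naming.
metric_query() {
  case "$1" in
    pods_not_running) echo 'sum(kube_pod_status_phase{phase!="Running",phase!="Succeeded"})' ;;
    nodes_not_ready)  echo 'sum(kube_node_status_condition{condition="Ready",status!="true"})' ;;
    node_mem_used)    echo '1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes' ;;
    node_disk_free)   echo 'node_filesystem_avail_bytes / node_filesystem_size_bytes' ;;
    node_inodes_free) echo 'node_filesystem_files_free / node_filesystem_files' ;;
    pod_cpu_usage)    echo 'sum by (namespace, pod) (rate(container_cpu_usage_seconds_total[5m]))' ;;
    # The API server reports latency in seconds, not milliseconds.
    api_latency_p99)  echo 'histogram_quantile(0.99, sum by (le, verb) (rate(apiserver_request_duration_seconds_bucket[5m])))' ;;
    *) echo "unknown metric: $1" >&2; return 1 ;;
  esac
}

# Run an instant query against the Prometheus HTTP API (/api/v1/query).
# -k skips TLS verification, which is only appropriate for a lab setup.
prom_query() {
  local host query
  host="${PROM_HOST:-prometheus.tanzu.lab}"
  query="$(metric_query "$1")" || return 1
  curl -sk --get --data-urlencode "query=${query}" "https://${host}/api/v1/query"
}

# Example usage (requires a reachable Prometheus instance):
#   prom_query nodes_not_ready
```

Keeping the expressions in one place makes it easier to review which metrics are actually being watched, which matters given how many metrics a cluster emits.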

Monitoring Tanzu Kubernetes Grid Instances

Monitoring for Tanzu Kubernetes Grid provisioned clusters is implemented using the open-source projects Prometheus and Grafana. Both Prometheus and Grafana are bundled with Tanzu Kubernetes Grid Extensions and installed on top of the Tanzu Kubernetes Cluster. The TKG Extension binaries are built and signed by VMware.

Installing TKG Extensions

Install Cert Manager on Workload Clusters

Before you can deploy the TKG extensions, you must install cert-manager, which provides automated certificate management, on workload clusters. The cert-manager service runs by default in management clusters when the cluster is provisioned.

Step 1: Extract the TKG Extensions bundle using tar or a similar utility, and run the commands below to install cert-manager.

# tar -xzf tkg-extensions-manifests-v1.3.1-vmware.1.tar.gz

# cd tkg-extensions-v1.3.1

# kubectl apply -f cert-manager/

Step 2: Validate that the cert-manager pods are deployed correctly and are in the Running state.

# kubectl get pods -n cert-manager
NAME                                       READY   STATUS    RESTARTS   AGE
cert-manager-7c58cb795-jw7mk               1/1     Running   0          2m38s
cert-manager-cainjector-765684c9d6-qgcw9   1/1     Running   0          2m38s
cert-manager-webhook-ccc946479-gnbvh       1/1     Running   0          2m37s

Install Prometheus Extension

Prometheus is an open-source event monitoring tool for containers or microservices. It can collect metrics from target clusters at specified intervals, evaluate rule expressions, display the results, and trigger alerts if certain conditions arise.

Execute the below commands to install the Prometheus extension.

Step 1: Create Prometheus namespace.

# cd ~/tkg-extensions-v1.3.1/extensions/monitoring/prometheus

# kubectl apply -f namespace-role.yaml

The kubectl apply command creates the tanzu-system-monitoring namespace along with a service account for prometheus-extension and necessary role bindings.

namespace/tanzu-system-monitoring created
serviceaccount/prometheus-extension-sa created
role.rbac.authorization.k8s.io/prometheus-extension-role created
rolebinding.rbac.authorization.k8s.io/prometheus-extension-rolebinding created
clusterrole.rbac.authorization.k8s.io/prometheus-extension-cluster-role created
clusterrolebinding.rbac.authorization.k8s.io/prometheus-extension-cluster-rolebinding created

Step 2: Prepare the Prometheus yaml for deployment.

# cp prometheus-data-values.yaml.example prometheus-data-values.yaml

The supported configuration parameters are documented here.

A sample prometheus-data-values.yaml file is shown below.

---
monitoring:
  ingress:
    enabled: true
    virtual_host_fqdn: "prometheus.tanzu.lab"
    prometheus_prefix: "/"
    alertmanager_prefix: "/alertmanager/"
  prometheus_server:
    image:
      repository: projects.registry.vmware.com/tkg/prometheus
  alertmanager:
    image:
      repository: projects.registry.vmware.com/tkg/prometheus
  kube_state_metrics:
    image:
      repository: projects.registry.vmware.com/tkg/prometheus
  node_exporter:
    image:
      repository: projects.registry.vmware.com/tkg/prometheus
  pushgateway:
    image:
      repository: projects.registry.vmware.com/tkg/prometheus
  cadvisor:
    image:
      repository: projects.registry.vmware.com/tkg/prometheus
  prometheus_server_configmap_reload:
    image:
      repository: projects.registry.vmware.com/tkg/prometheus
  prometheus_server_init_container:
    image:
      repository: projects.registry.vmware.com/tkg/prometheus

Step 3: Create Prometheus secret.

# kubectl create secret generic prometheus-data-values --from-file=values.yaml=prometheus-data-values.yaml -n tanzu-system-monitoring

Step 4: Deploy Prometheus extension.

# kubectl apply -f prometheus-extension.yaml

Step 5: Retrieve the status of Prometheus extension.

# kubectl get app prometheus -n tanzu-system-monitoring

The Prometheus app status should change to 'Reconcile succeeded' after Prometheus is deployed successfully.

NAME         DESCRIPTION           SINCE-DEPLOY   AGE
prometheus   Reconcile succeeded   48s            2d6h
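
Reconciliation can take a few minutes, so it may be convenient to poll the app status rather than rerun the command by hand. The following is a minimal sketch and not part of the TKG tooling. The function names, the default of 30 attempts, and the 10-second interval are assumptions.

```shell
# Succeeds when the `kubectl get app` output read from stdin reports success.
is_reconciled() {
  grep -q 'Reconcile succeeded'
}

# Poll an App resource until it reconciles or the attempts run out.
# Arguments: app name, namespace, optional number of attempts (default 30).
wait_for_app() {
  local app ns attempts i
  app="$1"; ns="$2"; attempts="${3:-30}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if kubectl get app "$app" -n "$ns" --no-headers 2>/dev/null | is_reconciled; then
      echo "$app: Reconcile succeeded"
      return 0
    fi
    i=$((i + 1))
    sleep 10
  done
  echo "$app: not reconciled after $attempts attempts" >&2
  return 1
}

# Example usage:
#   wait_for_app prometheus tanzu-system-monitoring
```

The same function can be reused for any other extension deployed as an App resource by passing a different app name.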

Install Grafana Extension

Grafana is open-source visualization and analytics software. It allows you to query, visualize, alert on, and explore your metrics no matter where they are stored. In other words, Grafana provides you with tools to turn your time-series database (TSDB) data into high-quality graphs and visualizations.

Step 1: Create Grafana namespace.

# cd ~/tkg-extensions-v1.3.1/extensions/monitoring/grafana/

# kubectl apply -f namespace-role.yaml

The kubectl apply command creates the service account for grafana-extension and necessary role bindings.

namespace/tanzu-system-monitoring unchanged
serviceaccount/grafana-extension-sa created
role.rbac.authorization.k8s.io/grafana-extension-role created
rolebinding.rbac.authorization.k8s.io/grafana-extension-rolebinding created
clusterrole.rbac.authorization.k8s.io/grafana-extension-cluster-role created
clusterrolebinding.rbac.authorization.k8s.io/grafana-extension-cluster-rolebinding created

Step 2: Prepare the Grafana yaml for deployment.

# cp grafana-data-values.yaml.example grafana-data-values.yaml

The supported configuration parameters are documented here.

A sample grafana-data-values.yaml file is shown below.

---
monitoring:
  grafana:
    ingress:
      enabled: true
      virtual_host_fqdn: "grafana.tanzu.lab"
    image:
      repository: "projects.registry.vmware.com/tkg/grafana"
    secret:
      admin_password: "Vk13YXJlMSE="
  grafana_init_container:
    image:
      repository: "projects.registry.vmware.com/tkg/grafana"
  grafana_sc_dashboard:
    image:
      repository: "projects.registry.vmware.com/tkg/grafana"
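
Note that the admin_password value in the sample is the base64 encoding of the plain-text password, not the password itself. As a sketch, assuming the standard base64 utility is available on the machine you use to manage the cluster, the value can be produced and checked as follows. The encode_password helper is hypothetical, and the password shown is simply what the sample value decodes to; choose your own for a real deployment.

```shell
# Encode a password for the admin_password field.
# printf '%s' is used instead of echo so that no trailing newline is encoded.
encode_password() {
  printf '%s' "$1" | base64
}

encode_password 'VMware1!'
# prints: Vk13YXJlMSE=

# Decoding the sample value returns the original password.
printf '%s' 'Vk13YXJlMSE=' | base64 --decode
# prints: VMware1!
```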

Step 3: Create Grafana secret.

# kubectl create secret generic grafana-data-values --from-file=values.yaml=grafana-data-values.yaml -n tanzu-system-monitoring

Step 4: Deploy Grafana extension.

# kubectl apply -f grafana-extension.yaml

Step 5: Retrieve the status of Grafana extension.

# kubectl get app grafana -n tanzu-system-monitoring

The Grafana app status should change to 'Reconcile succeeded' after Grafana is deployed successfully.

NAME      DESCRIPTION           SINCE-DEPLOY   AGE
grafana   Reconcile succeeded   6m59s          10m

Accessing the Prometheus & Grafana Dashboards

To access the dashboards, you must know the external IP address of the envoy service that gets created when you deploy the Contour extension on the Tanzu Kubernetes Cluster. Enter the following command to get the IP address.

# kubectl get services -A | grep envoy
tanzu-system-ingress      envoy                              LoadBalancer   100.69.174.170   172.19.80.55   80:31384/TCP,443:31510/TCP   2d8h

Confirm the FQDNs of the Grafana & Prometheus extensions.

# kubectl get proxy -A
NAMESPACE                 NAME                   FQDN                   TLS SECRET       STATUS   STATUS DESCRIPTION
tanzu-system-monitoring   grafana-httpproxy      grafana.tanzu.lab      grafana-tls      valid    Valid HTTPProxy
tanzu-system-monitoring   prometheus-httpproxy   prometheus.tanzu.lab   prometheus-tls   valid    Valid HTTPProxy

In your DNS server, create an A record for the Prometheus extension FQDN that points to the external IP address of the envoy service. In addition, create an alias (CNAME) record for the Grafana FQDN that points to the Prometheus extension hostname.
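
Where a DNS server is not available, for example in a lab, an /etc/hosts entry on the client machine can stand in for the DNS records. This is a sketch only. The envoy_ip and hosts_line helpers are hypothetical, and the jsonpath expression assumes that the envoy service is of type LoadBalancer with an IP address assigned, as in the output shown earlier.

```shell
# Print the external IP of the envoy service created by the Contour extension.
envoy_ip() {
  kubectl get service envoy -n tanzu-system-ingress \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
}

# Build an /etc/hosts line: the IP address followed by the hostnames.
hosts_line() {
  local ip
  ip="$1"; shift
  printf '%s %s\n' "$ip" "$*"
}

hosts_line 172.19.80.55 prometheus.tanzu.lab grafana.tanzu.lab
# prints: 172.19.80.55 prometheus.tanzu.lab grafana.tanzu.lab

# Example usage against a live cluster:
#   hosts_line "$(envoy_ip)" prometheus.tanzu.lab grafana.tanzu.lab | sudo tee -a /etc/hosts
```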

To access the Prometheus dashboard, enter the URL https://<prometheus-app-fqdn>/.

To access the Grafana dashboard, enter the URL https://<grafana-app-fqdn>/.
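
Before opening a browser, reachability can be checked from the command line. The sketch below assumes that prometheus_prefix is "/" as in the sample values file, so that the Prometheus health endpoint is served at /-/healthy, and it uses the Grafana /api/health endpoint. The health_url and check_health helpers are hypothetical, and -k is passed to curl on the assumption that the extension certificates are self-signed.

```shell
# Build the health-check URL for a given application and FQDN.
health_url() {
  case "$1" in
    prometheus) echo "https://$2/-/healthy" ;;
    grafana)    echo "https://$2/api/health" ;;
    *) echo "unknown application: $1" >&2; return 1 ;;
  esac
}

# Return success if the health endpoint answers with HTTP 200.
check_health() {
  local url code
  url="$(health_url "$1" "$2")" || return 1
  code="$(curl -sk -o /dev/null -w '%{http_code}' "$url")"
  [ "$code" = "200" ]
}

# Example usage:
#   check_health prometheus prometheus.tanzu.lab && echo "Prometheus is reachable"
#   check_health grafana grafana.tanzu.lab && echo "Grafana is reachable"
```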

Conclusion

Kubernetes is the way forward for modern cloud-native applications because of the significant benefits that it provides. However, monitoring a Kubernetes environment is complicated, and traditional monitoring solutions may not be able to monitor cluster health. Identifying issues and figuring out how to remediate them are common obstacles that make it difficult for organizations to fully realize the benefits and value of their Kubernetes deployment. Organizations need a new approach to monitoring to take full advantage of Kubernetes.

By understanding the complexities behind Kubernetes monitoring, you can better identify a solution that will allow you to derive more value from your Kubernetes deployment. The monitoring solution that you choose should provide turnkey capabilities for identifying and remediating specific failures seen in Kubernetes deployments, such as crash loops, job failures, and high CPU utilization. The solution should also be capable of filtering the relevant metrics and displaying them to the user.

Author and Contributors

Manish Jha has authored this article.