Monitor TKG With Prometheus and Grafana for VMC on AWS
Introduction to Kubernetes Monitoring
Monitoring a Kubernetes cluster eases the management of containerized infrastructure by tracking the utilization of cluster resources, including memory, CPU, and storage. Cluster operators can monitor the cluster and receive alerts when the desired number of pods is not running, when resource utilization approaches critical limits, or when failures or misconfigurations cause pods or nodes to drop out of the cluster.
Scope
This article is focused on setting up Prometheus and Grafana Extensions on a Tanzu Kubernetes Cluster.
Prerequisites
Before you deploy the TKG extensions, you should meet the following prerequisites:
- Deploy a Tanzu Kubernetes Cluster (workload cluster).
- Install Carvel Tools on the machine you are using to manage your Tanzu Kubernetes Clusters.
- Install the Contour Extension on the cluster where you are installing Grafana & Prometheus. Instructions for installing Contour are documented here.
- Upload the Tanzu Kubernetes Grid Extension bundle to the machine from which the installation will be triggered. The extension bundle can be downloaded from here.
Challenges of Kubernetes Monitoring
Kubernetes abstracts away a lot of complexity to speed up application deployment, but in the process it leaves you blind to what is actually happening behind the scenes, which resources are being utilized, and even the cost implications of the actions being taken. A Kubernetes environment typically has more components than traditional infrastructure, which makes root cause analysis more difficult when things go wrong.
The main challenges that are associated with monitoring a Kubernetes environment are:
- Millions of Metrics: Kubernetes is a multi-layered solution. You have a management plane and a control plane. Each plane runs a number of components, for example kube-apiserver, kube-controller-manager, DNS, and kube-scheduler. You then have your workloads, which typically include pods, deployments, replicas, and so on. Together, these components churn out millions of events per day, and not every metric is important.
- Ephemerality: The dynamic nature of Kubernetes allows you to scale deployments up or down on the fly. When an application is scaled down, the underlying pods disappear forever. As new deployments are scheduled, the kube-scheduler may decide that it needs to move a pod in order to free up resources on a given node. This results in pods being moved and recreated: the same pod, just with a different name and in a different place. The monitoring solution should pick up these changes and should not send alerts for these types of events.
Which Kubernetes Metrics Should You Monitor?
It is very important to know which Kubernetes metrics to monitor. Some key metrics to consider are:
- Cluster state, including the health and availability of pods.
- Node status, including readiness, memory, disk or processor overload, and network availability.
- Pod availability.
- Memory utilization at the pod and node level.
- Disk utilization including lack of space for file system and index nodes.
- CPU utilization in relation to the amount of CPU resource allocated to the pod.
- API request latency measured in milliseconds.
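As a rough illustration of how a few of these metrics could be queried once Prometheus is reachable, the sketch below wraps the Prometheus HTTP API in a small shell function. The function name prom_query and the variable names are hypothetical, and the metric names assume that kube-state-metrics, node_exporter, and cAdvisor are being scraped with their default metric names; adjust them to whatever your deployment actually exposes.

```shell
# Run an instant PromQL query against the Prometheus HTTP API.
# $1 = Prometheus FQDN, $2 = PromQL expression.
# -k skips TLS verification, which may suit a lab certificate but not production.
prom_query() {
  curl -sk --get "https://$1/api/v1/query" --data-urlencode "query=$2"
}

# Illustrative expressions (assumed metric names, not taken from this article).
# Pods per namespace that are not in the Running phase:
PODS_NOT_RUNNING='sum by (namespace) (kube_pod_status_phase{phase!="Running"})'
# Memory still available on each node, in bytes:
NODE_MEMORY_AVAILABLE='node_memory_MemAvailable_bytes'
# CPU cores consumed per pod, averaged over five minutes:
POD_CPU_CORES='sum by (pod) (rate(container_cpu_usage_seconds_total[5m]))'

# Example call (prometheus.tanzu.lab is the FQDN configured later in this article):
#   prom_query prometheus.tanzu.lab "$POD_CPU_CORES"
```

The API returns JSON, so the output can be piped to a JSON processor if one is available on your machine.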
Monitoring Tanzu Kubernetes Grid Instances
Monitoring for Tanzu Kubernetes Grid provisioned clusters is implemented using the open-source projects Prometheus and Grafana. Both Prometheus and Grafana are bundled with Tanzu Kubernetes Grid Extensions and installed on top of the Tanzu Kubernetes Cluster. The TKG Extension binaries are built and signed by VMware.
Installing TKG Extensions
Install Cert Manager on Workload Clusters
Before you can deploy the TKG extensions, you must install cert-manager, which provides automated certificate management, on workload clusters. The cert-manager service runs by default in management clusters when the cluster is provisioned.
Step 1 - Extract the TKG Extensions bundle using tar or a similar utility, then run the following commands to install cert-manager.
# tar -xzf tkg-extensions-manifests-v1.3.1-vmware.1.tar.gz
# cd tkg-extensions-v1.3.1
# kubectl apply -f cert-manager/
Step 2 - Validate that the cert-manager pods are deployed correctly and are in the Running state.
# kubectl get pods -n cert-manager
NAME READY STATUS RESTARTS AGE
cert-manager-7c58cb795-jw7mk 1/1 Running 0 2m38s
cert-manager-cainjector-765684c9d6-qgcw9 1/1 Running 0 2m38s
cert-manager-webhook-ccc946479-gnbvh 1/1 Running 0 2m37s
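If you prefer to script this check rather than read the table by eye, the following is a minimal sketch. The helper name all_pods_running is hypothetical; it only parses the text that kubectl get pods prints, and it assumes the default column order shown above (NAME, READY, STATUS).

```shell
# Reads `kubectl get pods` output on stdin and succeeds only if every pod
# is in the Running state with all of its containers ready.
all_pods_running() {
  awk '
    NR == 1 && $1 == "NAME" { next }    # skip the header row if present
    {
      split($2, ready, "/")             # READY column, e.g. "1/1"
      if ($3 != "Running" || ready[1] != ready[2]) bad = 1
      seen = 1
    }
    END {
      if (seen && !bad) exit 0          # at least one pod, none unhealthy
      exit 1
    }
  '
}

# Usage against a live cluster (assumes kubectl already targets the workload cluster):
#   kubectl get pods -n cert-manager | all_pods_running && echo "cert-manager is ready"
```

The function returns a non-zero status when no pods are listed at all, so an empty namespace is not mistaken for a healthy one.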
Install Prometheus Extension
Prometheus is an open-source event monitoring tool for containers or microservices. It can collect metrics from target clusters at specified intervals, evaluate rule expressions, display the results, and trigger alerts if certain conditions arise.
Execute the below commands to install the Prometheus extension.
Step 1: Create Prometheus namespace.
# cd ~/tkg-extensions-v1.3.1/extensions/monitoring/prometheus
# kubectl apply -f namespace-role.yaml
The kubectl apply command creates the tanzu-system-monitoring namespace along with a service account for prometheus-extension and the necessary role bindings.
namespace/tanzu-system-monitoring created
serviceaccount/prometheus-extension-sa created
role.rbac.authorization.k8s.io/prometheus-extension-role created
rolebinding.rbac.authorization.k8s.io/prometheus-extension-rolebinding created
clusterrole.rbac.authorization.k8s.io/prometheus-extension-cluster-role created
clusterrolebinding.rbac.authorization.k8s.io/prometheus-extension-cluster-rolebinding created
Step 2: Prepare the Prometheus yaml for deployment.
# cp prometheus-data-values.yaml.example prometheus-data-values.yaml
The supported configuration parameters are documented here.
A sample prometheus-data-values.yaml is shown below
---
monitoring:
  ingress:
    enabled: true
    virtual_host_fqdn: "prometheus.tanzu.lab"
    prometheus_prefix: "/"
    alertmanager_prefix: "/alertmanager/"
  prometheus_server:
    image:
      repository: projects.registry.vmware.com/tkg/prometheus
  alertmanager:
    image:
      repository: projects.registry.vmware.com/tkg/prometheus
  kube_state_metrics:
    image:
      repository: projects.registry.vmware.com/tkg/prometheus
  node_exporter:
    image:
      repository: projects.registry.vmware.com/tkg/prometheus
  pushgateway:
    image:
      repository: projects.registry.vmware.com/tkg/prometheus
  cadvisor:
    image:
      repository: projects.registry.vmware.com/tkg/prometheus
  prometheus_server_configmap_reload:
    image:
      repository: projects.registry.vmware.com/tkg/prometheus
  prometheus_server_init_container:
    image:
      repository: projects.registry.vmware.com/tkg/prometheus
Step 3: Create Prometheus secret.
# kubectl create secret generic prometheus-data-values --from-file=values.yaml=prometheus-data-values.yaml -n tanzu-system-monitoring
Step 4: Deploy Prometheus extension.
# kubectl apply -f prometheus-extension.yaml
Step 5: Retrieve the status of Prometheus extension.
# kubectl get app prometheus -n tanzu-system-monitoring
The Prometheus app status should change to 'Reconcile succeeded' after Prometheus is deployed successfully.
NAME DESCRIPTION SINCE-DEPLOY AGE
prometheus Reconcile succeeded 48s 2d6h
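Reconciliation can take a few minutes, so it can be convenient to poll instead of rerunning the command by hand. The sketch below is one way to do that. The function name wait_for_app and its retry parameters are hypothetical, and it assumes that the DESCRIPTION column is backed by the App resource's .status.friendlyDescription field; verify that against your kapp-controller version before relying on it.

```shell
# Polls a kapp-controller App until its description reads "Reconcile succeeded".
# Arguments: app name, namespace, max attempts (default 30), delay in seconds (default 10).
wait_for_app() {
  app=$1; ns=$2; attempts=${3:-30}; delay=${4:-10}
  i=1
  while [ "$i" -le "$attempts" ]; do
    # Assumption: DESCRIPTION is read from .status.friendlyDescription.
    # A failing kubectl call (for example, App not created yet) counts as "not ready".
    desc=$(kubectl get app "$app" -n "$ns" \
      -o jsonpath='{.status.friendlyDescription}' 2>/dev/null) || desc=""
    case $desc in
      "Reconcile succeeded") echo "$app: $desc"; return 0 ;;
      "Reconcile failed"*)   echo "$app: $desc" >&2; return 1 ;;
    esac
    i=$((i + 1))
    sleep "$delay"
  done
  echo "$app: timed out waiting for reconciliation" >&2
  return 1
}

# Usage:
#   wait_for_app prometheus tanzu-system-monitoring
```

The same function can be pointed at the grafana app in the tanzu-system-monitoring namespace after the Grafana extension is deployed.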
Install Grafana Extension
Grafana is an open-source visualization and analytics tool. It allows you to query, visualize, alert on, and explore your metrics no matter where they are stored. In other words, Grafana provides you with tools to turn your time-series database (TSDB) data into high-quality graphs and visualizations.
Step 1: Create Grafana namespace.
# cd ~/tkg-extensions-v1.3.1/extensions/monitoring/grafana/
# kubectl apply -f namespace-role.yaml
The kubectl apply command creates the service account for grafana-extension and the necessary role bindings.
namespace/tanzu-system-monitoring unchanged
serviceaccount/grafana-extension-sa created
role.rbac.authorization.k8s.io/grafana-extension-role created
rolebinding.rbac.authorization.k8s.io/grafana-extension-rolebinding created
clusterrole.rbac.authorization.k8s.io/grafana-extension-cluster-role created
clusterrolebinding.rbac.authorization.k8s.io/grafana-extension-cluster-rolebinding created
Step 2: Prepare the Grafana yaml for deployment.
# cp grafana-data-values.yaml.example grafana-data-values.yaml
The supported configuration parameters are documented here.
A sample grafana-data-values.yaml file is shown below.
---
monitoring:
  grafana:
    ingress:
      enabled: true
      virtual_host_fqdn: "grafana.tanzu.lab"
    image:
      repository: "projects.registry.vmware.com/tkg/grafana"
    secret:
      admin_password: "Vk13YXJlMSE="
  grafana_init_container:
    image:
      repository: "projects.registry.vmware.com/tkg/grafana"
  grafana_sc_dashboard:
    image:
      repository: "projects.registry.vmware.com/tkg/grafana"
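The admin_password value in the sample is base64-encoded rather than plain text. A minimal sketch for producing or inspecting such a value is shown below; the helper names encode_password and decode_password are hypothetical, and the -d flag assumes a GNU or BusyBox style base64 utility (some older platforms use a different decode flag).

```shell
# Encode a plain-text password for monitoring.grafana.secret.admin_password.
# printf is used instead of echo so that no trailing newline gets encoded.
encode_password() {
  printf '%s' "$1" | base64
}

# Decode an existing value to check what it contains.
decode_password() {
  printf '%s' "$1" | base64 -d
}

encode_password 'VMware1!'    # prints Vk13YXJlMSE= (the value used in the sample file)
```

Base64 is an encoding, not encryption, so anyone who can read the data values file or the resulting secret can recover the password. Replace the sample value with your own before deploying.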
Step 3: Create Grafana secret.
# kubectl create secret generic grafana-data-values --from-file=values.yaml=grafana-data-values.yaml -n tanzu-system-monitoring
Step 4: Deploy Grafana extension.
# kubectl apply -f grafana-extension.yaml
Step 5: Retrieve the status of Grafana extension.
# kubectl get app grafana -n tanzu-system-monitoring
The Grafana app status should change to 'Reconcile succeeded' after Grafana is deployed successfully.
NAME DESCRIPTION SINCE-DEPLOY AGE
grafana Reconcile succeeded 6m59s 10m
Accessing the Prometheus & Grafana Dashboards
To access the dashboards, you must know the external IP address of the envoy service that gets created when you deploy the Contour extension on the Tanzu Kubernetes Cluster. Enter the following command to get the IP address.
# kubectl get services -A | grep envoy
tanzu-system-ingress envoy LoadBalancer 100.69.174.170 172.19.80.55 80:31384/TCP,443:31510/TCP 2d8h
Confirm the FQDNs of the Grafana & Prometheus extensions.
# kubectl get proxy -A
NAMESPACE NAME FQDN TLS SECRET STATUS STATUS DESCRIPTION
tanzu-system-monitoring grafana-httpproxy grafana.tanzu.lab grafana-tls valid Valid HTTPProxy
tanzu-system-monitoring prometheus-httpproxy prometheus.tanzu.lab prometheus-tls valid Valid HTTPProxy
In your DNS server, create an A record for the Prometheus extension that points to the external IP address of the envoy service. Then create an alias (CNAME) record for Grafana that points to the Prometheus extension hostname.
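In a lab without a DNS server you control, a hosts-file entry on the client machine can stand in for these records. The sketch below only prints the lines to add; the function names envoy_ip and hosts_lines are hypothetical, and it assumes the envoy service publishes an IP address (rather than a hostname) in its load balancer status, as in the output above.

```shell
# Look up the external IP that the load balancer assigned to the envoy service.
envoy_ip() {
  kubectl get service envoy -n tanzu-system-ingress \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
}

# Print one hosts-file line per FQDN, all pointing at the same IP.
# $1 = IP address, remaining arguments = FQDNs.
hosts_lines() {
  ip=$1
  shift
  for fqdn in "$@"; do
    printf '%s %s\n' "$ip" "$fqdn"
  done
}

# Usage (review the output, then append it to the hosts file on your client):
#   hosts_lines "$(envoy_ip)" prometheus.tanzu.lab grafana.tanzu.lab
```

This is a convenience for testing only; proper DNS records remain the better choice for shared environments.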
To access the Prometheus dashboard, enter the URL https://<prometheus-app-fqdn>/.
To access the Grafana dashboard, enter the URL https://<grafana-app-fqdn>/.
Conclusion
Kubernetes is the way forward for modern cloud-native applications because of the significant benefits it provides. Monitoring a Kubernetes environment is complicated, however, and traditional monitoring solutions may not be able to monitor cluster health. Identifying issues and figuring out how to remediate them are common obstacles that make it difficult for organizations to fully realize the benefits and value of their Kubernetes deployment. Organizations need a new approach to monitoring to take full advantage of Kubernetes.
By understanding the complexities behind Kubernetes monitoring, you can better identify a solution that will allow you to derive more value from your Kubernetes deployment. The monitoring solution that you choose should provide turnkey capabilities for identifying and remediating specific failures seen in Kubernetes deployments, such as crash loops, job failures, and high CPU utilization. The solution should also be capable of filtering the relevant metrics and displaying them to the user.