Dynatrace Inc.

11/11/2024 | Press release | Distributed by Public on 11/11/2024 14:13

SLOs for Kubernetes clusters: Optimize resource utilization of Kubernetes clusters with service-level objectives

Effectively managing the provisioning of Kubernetes resources can be daunting due to the complex architecture and the involvement of different stakeholders. In the end, a Kubernetes cluster owner needs to ensure that all their application teams have enough resources while working with the business owner to keep costs as low as possible. In this middleman position, it's crucial to find a commonly understood way to allocate and provision cluster resources efficiently while meeting performance and cost targets.

Establishing SLOs for Kubernetes clusters can help organizations optimize resource utilization. A good Kubernetes SLO strategy helps teams manage and make containerized workloads more efficient.

Kubernetes is a widely used open source system for container orchestration. It allows for seamless running of containerized workloads across different environments.

Effective resource provisioning and management is a critical aspect of a Kubernetes cluster. It involves a coordinated effort among stakeholders, including cluster owners, application teams, and business owners. The primary goal is to allocate sufficient resources to keep the applications running smoothly without overprovisioning and incurring unnecessary costs. This delicate balance requires a collaborative approach and transparent communication among all parties. Service-level objectives (SLOs) can play a vital role in ensuring that all stakeholders have visibility into the resources being used and the performance of their applications.

While the Kubernetes application from Dynatrace already provides detailed insights into key parameters such as memory or CPU slack, allowing you to immediately drill down to the workloads blocking and wasting resources, SLOs add a further layer on top of these insights by evaluating resource spending levels.

Service-level objectives are typically used to monitor business-critical services and applications. However, because they boil down selected indicators to single values and track error budget levels, they also offer a suitable way to monitor optimization processes while aligning on single values to meet overall goals. Essentially, an SLO tracks a selected service-level indicator (SLI) and continuously evaluates its behavior over a given timeframe against a fixed (commonly agreed upon) threshold. This is valuable for platform owners who want to monitor and optimize their Kubernetes environment. Users can continuously evaluate a system's performance against predefined quality criteria, making SLOs for Kubernetes clusters a good option for monitoring and improving a system's overall performance.
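
As a rough illustration of the evaluation described above, the following Python sketch averages a series of SLI samples over a timeframe, compares the result against a target, and derives the remaining error budget. The function name `evaluate_slo`, the 80% target, and the sample values are assumptions made for this example, and averaging is only one possible way to aggregate an SLI; this is not a description of how the Dynatrace SLO capability works internally.

```python
from statistics import mean


def evaluate_slo(sli_samples, target):
    """Return (status, remaining_error_budget, is_met) for SLI samples in percent."""
    status = mean(sli_samples)           # SLO status: average SLI over the timeframe
    error_budget = 100.0 - target        # total budget the target leaves open
    consumed = 100.0 - status            # budget already used up
    remaining = error_budget - consumed  # algebraically equal to status - target
    return status, remaining, status >= target


# Hypothetical memory-efficiency SLI samples (percent) for one namespace
samples = [82.0, 79.0, 85.0, 78.0]
status, remaining, is_met = evaluate_slo(samples, target=80.0)
print(f"status={status:.1f}% remaining_budget={remaining:.1f} met={is_met}")
```

A single dip below the target (79% and 78% in the sample data) does not break the objective here, because the status is evaluated over the whole timeframe rather than per sample.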

Efficient coordination of resource usage, requests, and allocation is critical. All users must work together seamlessly to optimize resource utilization in a Kubernetes cluster. A Kubernetes SLO can be a transparent and trackable collaboration tool for different teams and collaborators.

Measure your Kubernetes resource efficiency and agree on an aligned approach between all stakeholders

Monitoring your Kubernetes cluster allows you to proactively identify and resolve resource constraints, failures, and crashes before they impact end-user experience and your business.

Proper Kubernetes monitoring includes utilizing observability information to optimize your environment. By gaining insights into how your Kubernetes workloads utilize computing and memory resources, you can make informed decisions about how to size and plan your infrastructure, leading to reduced costs. A Kubernetes SLO that continuously evaluates CPU and memory usage and capacity, and compares the available resources to what Kubernetes workloads request and actually use, makes potential resource waste visible, revealing opportunities for countermeasures.

Regarding resource utilization in a Kubernetes environment, there are two main perspectives to consider based on the primary stakeholders involved. One perspective focuses on the potential for optimization at the interface between the team responsible for managing the Kubernetes cluster and the application teams that are responsible for developing and deploying applications.

If your team is responsible for setting up Kubernetes clusters, you might want to monitor and optimize workload performance when setting up SLOs. However, if you're part of the application team, the actual usage of resources might differ significantly from the resources reserved (blocked) for your workloads.

Requests represent the amount of resources reserved or blocked for a container. Tracking the ratio between request and usage can provide valuable insights into optimization potential. As every container has defined requests for CPU and memory, these indicators are well-suited for efficiency monitoring.

Optimize memory and CPU utilization of your Kubernetes namespaces

One option is continuously tracking the memory and CPU utilization efficiency of existing Kubernetes objects, such as namespaces or workloads. Since teams typically run multiple workloads on namespaces, using this level of abstraction is a suitable option for a Kubernetes SLO. However, if you require more granular information, you can adjust the levels for resource utilization monitoring accordingly.

To calculate the service-level indicator for the Kubernetes namespace memory efficiency SLO, query the memory working set and memory request metrics, both of which are provided out of the box.

timeseries {
      memWorkSet=sum(dt.kubernetes.container.memory_working_set, default:0, rollup:avg)
      , memRequest=sum(dt.kubernetes.container.requests_memory, rollup:avg)}
      , nonempty:true, by:{dt.entity.cloud_application_namespace} 
      , filter: IN(dt.entity.cloud_application_namespace, "YOUR-NAMESPACE-ENTITY-ID")
| fieldsAdd sli = memWorkSet[]/memRequest[]*100
| fieldsRemove memWorkSet, memRequest

This DQL query can be inserted into the new SLO capability of the Dynatrace web UI, which supports native Dynatrace Grail™ access via DQL.

The same procedure is applicable for measuring the computing resources of a single namespace.

timeseries {
      cpuUsage = sum(dt.kubernetes.container.cpu_usage, default:0, rollup:avg)
      , cpuRequest = sum(dt.kubernetes.container.requests_cpu, rollup:avg)}
      , nonempty:true, by:{dt.entity.cloud_application_namespace}
      , filter: IN(dt.entity.cloud_application_namespace, "YOUR-NAMESPACE-ENTITY-ID")
| fieldsAdd sli = cpuUsage[]/cpuRequest[]*100
| fieldsRemove cpuUsage, cpuRequest

These Kubernetes SLOs measure the ratio of used memory and computing resources relative to the requested memory and resources for an entire namespace. They provide insights into how efficiently the blocked resources are utilized. Because requested resources can't be used elsewhere, the objective is to keep the difference between requested and used resources as small as possible. Utilizing such SLOs for your Kubernetes environment makes it possible to track efficiency transparently over time. The teams responsible for the cluster and the application teams that run their containers on the cluster can agree on the intended ratio between used and requested memory and CPU.
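
To make the arithmetic behind the `sli` field concrete, the following Python sketch mirrors the element-wise division performed by `memWorkSet[]/memRequest[]*100` on two aligned time series. The helper name `efficiency_percent` and the sample values are assumptions for illustration only, and returning `None` for slots without a request is this sketch's own choice, not a statement about how DQL handles missing values.

```python
def efficiency_percent(used, requested):
    """Element-wise used / requested * 100; None where no request is defined."""
    return [
        (u / r * 100.0) if r else None
        for u, r in zip(used, requested)
    ]


# Hypothetical values in MiB, one entry per time slot of a namespace
mem_working_set = [512.0, 640.0, 0.0]
mem_request = [1024.0, 1024.0, 1024.0]

sli = efficiency_percent(mem_working_set, mem_request)
print(sli)  # [50.0, 62.5, 0.0]
```

In this sample data, at most 62.5% of the requested memory is actually used, so more than a third of the reserved memory stays blocked without being consumed.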

Optimize your Kubernetes cluster's resource allocation

Measuring and optimizing the efficient usage of resources relative to resource requests is just one aspect of the full picture. As cluster owners must allocate cloud resources to meet their application teams' resource requests, it's recommended that they also measure how efficiently allocated resources are provisioned across the entire Kubernetes cluster.

For cluster owners, monitoring resource usage at the node level provides better insights and information for taking sound actions. By tracking resource usage at each node, teams can gain insights into how many resources the entire cluster uses and whether the nodes work correctly and efficiently.

Teams should implement suitable SLOs to continuously monitor resource usage for the entire Kubernetes cluster. These SLOs should cover indicators such as node memory utilization, the ratio of requested versus allocated resources, or the ratio of desired versus running pods per node.

For instance, if a node's memory utilization is high, it can lead to the undesired eviction of pods. This, in turn, can disrupt an application or service and incur additional costs for potential cluster upscaling.

A possible Kubernetes SLO for monitoring and evaluating the ratio of requested versus allocated memory can be expressed as follows:

timeseries {
      requests_memory = sum(dt.kubernetes.container.requests_memory, rollup:avg),
      memory_allocatable = sum(dt.kubernetes.node.memory_allocatable,rollup:avg)}
      , by:{dt.entity.kubernetes_cluster}
      , filter: IN (dt.entity.kubernetes_cluster, "YOUR-CLUSTER-ENTITY-ID")
| fieldsAdd sli = (requests_memory[] / memory_allocatable[]) * 100
| fieldsRemove requests_memory, memory_allocatable

This SLO tracks the CPU utilization of a Kubernetes cluster:

timeseries {
      cpuUsage = sum(dt.kubernetes.container.cpu_usage, default:0, rollup:avg)
      , cpuRequest = sum(dt.kubernetes.container.requests_cpu, rollup:avg)}
      , nonempty:true, by:{dt.entity.kubernetes_cluster}
      , filter: IN(dt.entity.kubernetes_cluster, "YOUR-CLUSTER-ENTITY-ID")
| fieldsAdd sli = cpuUsage[]/cpuRequest[]*100
| fieldsRemove cpuUsage, cpuRequest

In an ideal environment, the requested resources match the allocated resources, and hence, cloud resources will be used optimally, reducing the costs of invoiced but unused resources.

Ensure overall Kubernetes resource utilization efficiency

The four SLOs mentioned in this article provide valuable insights into the overall utilization efficiency of resources in a Kubernetes cluster and are a great starting point for optimizing them. Although both views (usage vs. request and request vs. allocation) must be analyzed together to obtain a holistic view, the split is extremely helpful in ensuring accountability. While the application teams have complete control over the usage/request SLO, the cluster owner can influence the request vs. allocation SLO. The SLOs serve as a simple way to recognize if the utilization over a certain period is on track or if resources are utilized inefficiently.
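
The relationship between the two views can be sketched with a few lines of Python: multiplying the usage vs. request ratio by the request vs. allocation ratio yields the share of allocatable capacity that is actually used. The figures and the helper name `percent` are hypothetical and only serve to illustrate the arithmetic.

```python
def percent(numerator, denominator):
    """Ratio of two quantities expressed in percent."""
    return numerator / denominator * 100.0


# Hypothetical cluster-wide memory figures in GiB
used_gib = 30.0
requested_gib = 60.0
allocatable_gib = 80.0

usage_vs_request = percent(used_gib, requested_gib)              # application teams' view
request_vs_allocation = percent(requested_gib, allocatable_gib)  # cluster owner's view

# Combining both views gives the overall utilization of allocatable capacity
overall = usage_vs_request * request_vs_allocation / 100.0

print(usage_vs_request, request_vs_allocation, overall)  # 50.0 75.0 37.5
```

Splitting the overall figure this way shows where the inefficiency originates: in this example, raising the usage vs. request ratio is the application teams' lever, while the remaining 25% of unrequested capacity is in the cluster owner's hands.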

Though setting up proper SLOs to make these ratios visible and transparent does not reduce the need for collaboration and alignment, it provides a solid foundation for optimizing resource utilization efficiency.

The outlined SLOs for Kubernetes clusters guide you in implementing SRE best practices for monitoring your Kubernetes environment. By acting on the insights they provide, you can optimize processes and improve overall efficiency. This makes these SLOs an excellent basis for collaboration between contributors while holding the respective parties accountable.

What's next with SLOs for Kubernetes clusters?

As getting started with efficiency and optimization efforts within a large environment is always tricky, you might ask, "Where should I start?" Dynatrace can help you here with all the insights the Dynatrace Kubernetes app provides out of the box. For more information about identifying and optimizing resources, go to the Optimize workload resource usage with Kubernetes app and Notebooks use case in Dynatrace Documentation. Also, visit the Kubernetes Dynatrace playground, where you can play around with the Kubernetes app on the Dynatrace platform. Finally, to try out the SLO examples described in this article, visit the Dynatrace Service-Level Objectives playground.

After identifying SLOs, stakeholders can begin measuring the agreed-upon resource utilization between clusters and namespaces. The ability to create and monitor SLOs will soon be natively integrated into the Kubernetes app, with dedicated templates, to quickly establish common objectives for resource utilization in a few clicks without needing to write DQL queries. These out-of-the-box templates and their visualization within the Kubernetes app will be available early next year.

Further best practices include adding dedicated ownership information to all Kubernetes objects so that responsible teams can be notified when anomalies or degradation are detected.

If you want to read and learn more about how cloud-native observability and automation can be achieved with Dynatrace, take a look at Michael Winkler's recently published blog post, Cloud-native observability made seamless with OpenPipeline.
