Kubernetes Incident Response Best Practices

By Stephen Watts April 27, 2022

Inevitably, organizations that use technology (regardless of the extent) will have something, somewhere, go wrong. The key to a successful organization is to have the tools and processes in place to handle these incidents and get systems restored in a repeatable and reliable way in as little time as possible.

Incident Response Overview

Every organization knows how to do firefighting. It's human nature to see a problem and automatically start trying everything you can think of to get it fixed as quickly as possible and then hope that life returns to normal. This approach may work well for a while, but it is the technological equivalent of duct tape - and it never ends well. Over time, you will be unable to maintain the more problematic or complex parts of the infrastructure. This leads to significant technical debt as well as frustrated staff who feel that they never have time to do things properly in order to advance the organization.

The value of proper incident response lies not only in finding and fixing problems, but also in the whole process of documenting errors and resolutions, which in turn will provide operations teams with a knowledge base for future reference. This will allow for much more time to build automation for mitigating and remediating common errors until development teams can apply permanent fixes.

Dynamic and Ephemeral Infrastructure

Computing took up entire buildings in the early days, and incident response took the form of actual bug hunting. Since then, we have shrunk the size of servers and even removed much of the physical part altogether, since most servers are now virtual. Thus, successfully resolving incidents now requires more knowledge across more domains of information technology.

We have now moved beyond virtual servers and entered the cloud era. In the cloud computing world, consumers do not even know the physical locations of virtual servers, making metadata invaluable for resolving incidents. Containers exponentially increase this complexity due to their ephemeral design, which does not include storage that automatically persists to keep log files, kernel panics, or other trace files that are often generated during an incident.

A Kubernetes cluster running in a single region of a public cloud can be not only spread across several physical servers, but also across different data centers which are miles away from one another. Due to the storage replication and network routing requirements for keeping all of the containers interacting, an enormous amount of data must be available for diagnostics.

What is Kubernetes?

According to the official website, "Kubernetes is a portable, extensible, open-source platform for managing containerized workloads and services, that facilitates both declarative configuration and automation. It has a large, rapidly growing ecosystem. Kubernetes services, support, and tools are widely available.

The name Kubernetes originates from Greek, meaning helmsman or pilot. Google open-sourced the Kubernetes project in 2014. Kubernetes builds upon a decade and a half of experience that Google has with running production workloads at scale, combined with best-of-breed ideas and practices from the community."

So what does "portable, extensible, … for managing containerized workloads" actually mean?

Core Kubernetes capabilities

At its core, Kubernetes handles the needs of modern containerized applications. This means that Kubernetes has the following capabilities:

Service discovery and load balancing

Services are discoverable via DNS or directly by the IP of their pods within the cluster. If the service is high traffic, then Kubernetes can route traffic through a load balancer. Depending on the load-balancer, there are multiple ways to balance the traffic.
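
As an illustration (using a hypothetical deployment named "web" and the public nginx image as placeholders), a workload can be exposed through a Service of type LoadBalancer with standard kubectl commands:

$ kubectl create deployment web --image=nginx --replicas=3
$ kubectl expose deployment web --port=80 --type=LoadBalancer
$ kubectl get service web

Inside the cluster, other pods can reach the service by its DNS name (web.default.svc.cluster.local with the default cluster domain); externally, traffic arrives through whatever load balancer the provider provisions.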

Storage orchestration

When an application requests storage, Kubernetes will create a persistent volume claim to bind a persistent volume in the Kubernetes cluster to the individual pods that make up the running application. The options are NFS, cloud storage, local disk, or various third-party add-ons through the CSI interface.
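
A minimal sketch of such a request is a PersistentVolumeClaim; the claim name, size, and access mode below are illustrative only:

$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi
EOF
$ kubectl get pvc app-data

Kubernetes (or a CSI provisioner) binds the claim to a matching persistent volume, which is then mounted into the requesting pods.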

Automated rollouts and rollbacks

When deploying your application to a Kubernetes cluster, you will define the desired state of the application including the container image version, number of instances, and any required storage. Kubernetes will ensure that your running application matches the described state across cluster nodes.
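
For example (reusing the hypothetical "web" deployment from above and assuming its container is named nginx), a new image version can be rolled out and, if needed, rolled back:

$ kubectl set image deployment/web nginx=nginx:1.23
$ kubectl rollout status deployment/web
$ kubectl rollout undo deployment/web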

Self-healing

Maintaining the desired state also extends to Kubernetes' ability to monitor pods with a defined health check. If a pod starts to fail, Kubernetes will stop that pod and start a new one on a healthy node to replace it.
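
A minimal sketch of such a health check is a liveness probe in the pod spec (the pod name, image, path, and timings here are placeholders):

$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: healthcheck-demo
spec:
  containers:
  - name: app
    image: nginx
    livenessProbe:
      httpGet:
        path: /
        port: 80
      initialDelaySeconds: 5
      periodSeconds: 10
EOF

If the probe fails repeatedly, the kubelet restarts the container; when the node itself becomes unhealthy, the workload's controller reschedules replacement pods elsewhere.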

Automatic bin packing

As part of the application definition within Kubernetes, a developer can define the CPU and memory required for each pod. This helps Kubernetes balance the needs of that application against its available compute capacity and other applications that are running.
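
A sketch of that definition, with illustrative request and limit values, looks like this in a container spec:

$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: binpack-demo
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: "250m"
        memory: "256Mi"
      limits:
        cpu: "500m"
        memory: "512Mi"
EOF

The scheduler uses the requests to pick a node with enough free capacity; the limits cap what the container may consume once it is running.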

Secret and configuration management

You can deploy and update secrets and configuration items within the Kubernetes cluster without having to restart the affected containers. Secrets are only available to the containers that have been given permission. Secrets include SSH keys, usernames, passwords, and tokens.
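
For example, a secret can be created and inspected with kubectl (the name and values below are placeholders):

$ kubectl create secret generic db-credentials \
    --from-literal=username=dbadmin \
    --from-literal=password='S3cr3t!'
$ kubectl get secret db-credentials -o yaml

Pods reference the secret as environment variables or as a mounted volume, and only workloads granted access to it can read the values.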

Functionality not available in Kubernetes

What Kubernetes does not do is almost as important as what it does. Kubernetes does not:

Build source code

While some Kubernetes distributions include this capability, it falls under the domain of DevOps Continuous Integration (CI) rather than Kubernetes container runtimes.

Care what is in the containers that it runs

Kubernetes is agnostic to what runs in the container as long as it runs on the host OS and has been deployed to Kubernetes. This could be anything from Bitcoin mining to WordPress blogs or Minecraft servers.

Mandate what language is used for source code or its configuration

Kubernetes does not mandate either the language in which the application is written or the language in which the configuration is stored. As long as the container and its deployment configuration meet the API specifications, all is well.

Handle workflows or cross-application dependencies

While traditional orchestration follows a predefined plan for basic error handling, Kubernetes only focuses on maintaining the desired end state. With traditional orchestration, you normally execute step one, and if that is successful, then you execute step two; however, if step one is not successful, then you pause and employ a contingency plan, such as alerting the operations team.

Kubernetes, on the other hand, focuses on maintaining the desired state of an application in a parallel rather than a linear plan. It watches all aspects of the defined application simultaneously so that incidents can be corrected as they occur. This means that if a container instance in a pod fails and the whole pod needs to be updated and restarted, for example, Kubernetes will need to shut down the existing instances while spinning up the proper number of new ones.

Why does Kubernetes matter?

Kubernetes has become the starting point for essentially every product and managed service on the market that runs containers. It has been described as "Linux for the application service tier."

When Linux arrived on the computing market, it standardized the operating system, allowing for greater diversity of products and vendors available at the hardware tier. Now, Kubernetes is abstracting away the infrastructure layer as a whole (including the now default Linux operating system). This enables developers to handle the entire lifecycle without needing to know the underpinning intricacies which can be left to specialists. It is now possible to move an application service between Kubernetes clusters based on factors such as cost without rewriting the entire application stack.

Every generation of IT commoditizes a new layer of the technology stack, and the last two have been driven by open-source based technologies. These open-source offerings have so much universal benefit that large technology companies are willing to work together to make a solid base product and then compete with one another through pricing and extensions.

Who are the People?

The solutions that an organization is able to implement will depend on three things: "people, process, and technology." This phrase was coined by Harold Leavitt and encapsulates the truism that we often get so focused on the technology that we overlook the people who are involved with it (Leavitt, "Applied Organizational Change in Industry," Carnegie Institute of Technology, Graduate School of Industrial Administration, 1962).

For proper incident response to occur, people from all segments of an organization must have a stake in the operational life for it to be successful. This extends from the business units that use the application or provide client support to the various technical teams that handle the day-to-day maintenance of the application and the infrastructure that it runs on.

DevOps

If an error occurs during the build or deployment process, the DevOps organization needs to be part of the incident response. Incidents can happen at any stage of these processes, from a check-in of code failing to trigger an event to a production deployment failing due to a corrupt or otherwise unusable container image.

Many types of incidents are escalated to DevOps teams, including problems with the toolset, the failure of a step in a continuous pipeline to execute, the failure of a unit or integration test, or the failure of an application to deploy after the latest pull request was merged. The first two types of incidents can be resolved by DevOps without engaging outside teams. Resolutions for such incidents include the expansion of a full Nexus repository disk or the renewal of an expired "Let's Encrypt" certificate. DevOps teams can also help resolve the latter two types of incidents by quickly identifying errors which development teams can fix and then push through the source management tool in order to kick off the continuous pipeline for that application again.

The DevOps team will also spend time tuning the incidents that are escalated to them. There is a fine line between an integration test's failure and its failure to run due to an environmental problem controlled by the DevOps team. For example, this might happen when the scripts that build the staging environment haven't been updated to reflect new application dependencies.

Software Development (Programmers, Engineers and Architects)

The software development teams write code, define container dependencies, and kick off builds and deploys through the DevOps processes and tools, and they should be aware that they are responsible for what they produce. In modern software development, the continuous integration and deployment lifecycle means that a code change can be committed at 2 a.m. and deployed globally before the team lead even gets to the office in the morning.

The escalation of incidents to development organizations should be as routine and hands-off as possible through integration with defect tracking systems. An after-hours call for help should be reserved for emergencies such as the discovery of a critical security flaw, ongoing data corruption, or when the application is in a state of total failure.

Core Infrastructure Support

The teams that run the compute, network, storage, and data center infrastructure have always been part of the incident response escalation process, and Kubernetes and containers do not change that at all. What has changed is how these teams are engaged. Modern applications can now be engineered to be fault-resilient and handle the loss of a single piece of hardware. A large part of that ability comes from the introduction of the SRE position, infrastructure as a service, and tools that handle infrastructure as code.

Information will freely flow in and out of the core infrastructure support teams through traditional change management processes like ITIL. When you have a fully cloud-based and containerized infrastructure, however, incidents will only be escalated to these teams when there is a catastrophic event, such as when the global DNS is down.

"The Business"

The term "The Business" is intentionally vague. It covers all of the non-technical areas of the company, from the people who approve the funding for those massive monthly Google Kubernetes Engine (GKE) bills, to the marketing teams who gather requirements, to customer service teams who actually speak to the customers experience problems, and everyone in between. It is essential to include these teams in any incident response improvements, since they are ultimately the people for whom the technology teams are building applications. The more they know, the less they will bother the technology teams, thereby enabling the latter to focus properly on responding to and resolving the incident at hand.

If you feel like you've said it 1,000 times, then say it 1,001, because there's always someone who didn't know. Awareness of both the status of all core systems and their performance (from perfect to degraded or unavailable) is key and will make it much easier to justify new expenses like replacing your twenty-year-old logging system with one that can handle cluster-wide logging from Kubernetes.

Kubernetes: What to Watch

Infrastructure

Like the holy trinity of Cajun cooking used in every dish (celery, onions, and bell peppers), the technology world has compute, network, and storage. Kubernetes is no different; it has to run the workloads which it's orchestrating somewhere and connect to them somehow.

Monitoring infrastructure is routine, and most organizations will have processes in place to handle standard errors in the base Kubernetes infrastructure. There are, however, a few unique aspects inside a Kubernetes cluster to consider when documenting how to respond to incidents and who to engage.

Compute

Compute at the virtual machine or bare metal level is the most straightforward component in the Kubernetes stack to monitor. It should be no surprise that it's all about processor and memory usage.

Kubernetes has the ability to autoscale both the number of pods that an application has available and the actual cluster itself by adding nodes based upon the activity level of the processor cores in each worker node. Unfortunately, CPU alone is not a great metric for overall application performance; for example, some applications have performance problems once the CPU passes 50% usage. The one thing that all support teams can agree on is that when all of the processing capacity is used in the cluster, you need to add more processor cores.
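
As a rough illustration (assuming the metrics-server addon is installed and reusing the hypothetical "web" deployment), pod-level autoscaling on CPU can be configured with:

$ kubectl autoscale deployment web --cpu-percent=70 --min=2 --max=10
$ kubectl get hpa web

Cluster-level autoscaling (adding or removing worker nodes) is handled separately, typically by the cluster autoscaler of the cloud or distribution in use.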

Network

Kubernetes does not handle the network on its own; rather, it delegates it through the CNI (Container Network Interface) plugin framework. This is one of the hardest areas to troubleshoot, since it can be extremely complex depending on the type of network and the number of policies applied to it.

Kubernetes has a default CNI plugin called kubenet which meets the absolute minimum requirements for the network capabilities on a Kubernetes node by creating a local subnet with a bridge (cbr0) and assigning a veth pair and IP for each pod. More complex plugins can create entire overlay networks to abstract traffic within the Kubernetes cluster. This allows it to have its own routing and ingress/egress points to talk to the local hosts and other existing services on the network outside the cluster. These plugins can also be multi-tenant, so each namespace has its own cross-node subnet and network policies applied during an application deployment, creating allow and deny rules to define who can access the application.
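
When a network incident is suspected, a few read-only kubectl commands help establish the basics before diving into the CNI plugin itself (the node name is a placeholder, and podCIDR is only populated in setups that assign per-node pod ranges):

$ kubectl get pods -o wide
$ kubectl get node worker-1 -o jsonpath='{.spec.podCIDR}'
$ kubectl get networkpolicy --all-namespaces

The -o wide output shows each pod's IP and node, which quickly reveals whether addresses are being assigned at all, while the list of network policies shows which allow and deny rules could be blocking traffic.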

One of the best ways to handle network troubleshooting in these more complex cases is by leveraging APM to show service interactions. Those tools can identify failed or failing network calls and alert accordingly.

Storage

In Kubernetes, there are three types of storage.

Physical storage on each node is used to cache container images, host the exploded image while it is being run, and maintain general operating system usage. Depending on the level of performance required, solid state drives make the most sense. IOPS and disk usage need to be trended and alerted on through normal infrastructure monitoring tools.

Ephemeral storage inside each running container leverages the physical storage on the node. There is a way to specify a minimum amount of disk space required to launch the container, and depending on the environment, there may be nodes for handling higher and lower I/O which can be labeled to aid Kubernetes in orchestration. However, there is really no way to control the top end of disk usage inside an individual running container without getting into storage quotas at the OS level, which adds a layer of complexity that is not worth it in most environments. As this usage ebbs and flows based on what each container is doing, it is important to monitor trends and alert when baselines are broken. The required remediation could include moving the offending containers to hosts with more free resources, or simply knowing the requirements of the application so that when usage peaks, extra space can be added or the offending container can be moved.

Persistent volumes are the third type of storage, and they are key to the usefulness of containers in most organizations. A persistent volume (PV) allows individual applications to make a persistent volume claim (PVC), giving them the ability to maintain state across runs. For applications like databases, this is key to the ability to move to a containerized environment.

The binding of PVCs to PVs is moving towards being handled exclusively through the CSI (Container Storage Interface) plugin framework. The metrics to watch around persistent volumes are: the amount of space allocated on the volume (to make sure that there is room for more claims), the overall disk usage (to make sure that the allocated space is available), and the overall performance of the storage. Persistent volumes can be anything from NFS, to LUNs mounted via a SAN on each node, to local disks managed by Ceph or PX-Enterprise in order to replicate data and make it highly available across the Kubernetes cluster.
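
During an incident, the state of claims and volumes can be checked directly (the claim name here is a placeholder):

$ kubectl get pv
$ kubectl get pvc --all-namespaces
$ kubectl describe pvc app-data

The describe output surfaces binding and provisioning events, which usually point to whether the problem is a missing storage class, an exhausted backend, or a misconfigured claim.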

High-level application monitoring

Application monitoring follows the same concepts across any style of application deployment, and containerized infrastructure is built on the same core principles as non-container workloads. The best model for application monitoring is application performance management (APM), since it has defined categories for types of monitoring, and most tools in the space map nicely to one or all of them.

The four types of data that should be gathered are:

End user experience monitoring, which means watching real users interact with the system, typically via the user interface or as they pass through an API Gateway.

Deep-dive diagnostics (also known as application transaction profiling), which watches the execution of transactions within the application runtime and records which components are being executed.

Synthetic transactions, which are fake user sessions run through automation in order to create a baseline for one or more functions in the application. These will detect logic failures (when a component or deployment is in a partial failure) and performance trends over time and across releases.

Logs generated by the application's code, along with standard errors produced by the platform and by the language upon which the application is based.

Ideally, these data points will be gathered and correlated in order to create real maps of component interactions within each application's pod as well as calls to services both within and outside of the Kubernetes environment in which the application is running. The logs are combined with infrastructure logs and any other logs gathered from within the Kubernetes cluster so that they can be analyzed and visualized to provide valuable data on trends and errors as they are reported.

Using these four types of data, every failure within an application pod can be caught and escalated as an incident. The correlation and visualization allows for the response to be targeted and thus drastically more effective than guessing where the incident is rooted in the application call tree.

Kubernetes master (control plane) components

etcd

This is the single most important component that underpins a Kubernetes cluster. You should have backups upon backups of this because it stores all of the configuration, and if you have an unrecoverable error, then restoring will be your only option.

Errors that you might encounter within an etcd cluster include a leader failing, in which case a new leader will be elected, and a minor network failure (partition), in which some of the nodes in the cluster become separated. The cluster will continue to work in this scenario, but the nodes should be restored for full resilience. If the cluster loses quorum (more than half of the nodes go offline), it becomes read-only until a majority of the nodes are back online. Only then can a new leader be elected to coordinate writes.

The status of all of these errors can be displayed through the etcd_server_ metrics which can be extracted and viewed by tools like Prometheus. In addition, metrics under etcd_disk_ will highlight problems with storage latency, and etcd_network_ will identify multiple metrics related to network throughput as well as any failed transmissions.
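
Cluster health can also be checked directly with etcdctl; the certificate paths below are kubeadm defaults and will differ on other distributions:

# ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    endpoint health

The same flags with "endpoint status --write-out=table" show the current leader, database size, and raft term for each member.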

The documentation on etcd.io contains an operations guide and includes a thorough list of configuration options, failover scenarios, monitoring, and recovery recommendations.

kube-apiserver

If etcd is performing properly, then only a limited number of incidents can be generated by the kube-apiserver because it handles requests/responses between external clients and the control plane components. The default location for logs is /var/log/kube-apiserver.log, and since the application is written in Go, the standard performance objects are memory, threads, and processor usage. There will be at least three instances of kube-apiserver running in a highly available cluster, so a failure of one instance will not cause a performance problem while it is being brought back online.

While kube-apiserver is unavailable, all administration tools will be unavailable, including pulling status from the kubectl command line. To access logs from the kube-apiserver to start diagnosing the incident - to find out if it's in a hung state or if it's a deeper problem, for example - and assuming Docker is the container runtime engine, run the following command on each master node where it is running, to find the instance in trouble.


# docker ps -a | grep kube-apiserver

At this point you can retrieve the contents of stderr and stdout directly from the running process with the command:


# docker logs CONTAINER_ID --tail 500

With the most recent log files visible, you can see what errors are being generated, and take appropriate actions. These actions could be as simple as restarting the container, or as complex as extending volume groups to add more storage capacity to the server.

kube-scheduler

Only one instance of kube-scheduler is active at any given time (in highly available clusters, additional replicas may run, but leader election ensures a single active scheduler). In the event of a failure, you will need to deploy a new scheduler pod in the cluster and then remove the old one. Aside from the etcd database that it uses for storage, this is the most crucial part of a cluster. The Kubernetes cluster is essentially unmanaged while kube-scheduler is unavailable, since there will be nothing to ensure that all of the applications in the cluster are running with the appropriate state. The default log location is /var/log/kube-scheduler.log, and normal application performance metrics are available for CPU, memory, and storage usage.

Almost all interactions with kube-scheduler occur when resolving issues with application deployments, whether creating a new deployment or determining why a deployment was not successful. If kube-scheduler is unavailable, the steps are similar to those for kube-apiserver, with the difference that only one instance is active, so the task is to find where it is running (if you don't already know) rather than which instance is having issues.
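
On kubeadm-style clusters, where the scheduler runs as a static pod labeled component=kube-scheduler, locating it and pulling its logs looks like this (the pod name is an example):

$ kubectl get pods -n kube-system -l component=kube-scheduler -o wide
$ kubectl logs -n kube-system kube-scheduler-master-1 --tail=200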

kube-controller-manager

The kube-controller-manager is actually made of multiple separate processes which have been combined into a single runtime for easier management. The cluster will keep operating if this is down, but it will not react to any changes in the state of pods or nodes, and you will not be able to create new namespaces.

You will find any relevant errors and fault conditions in /var/log/kube-controller-manager.log, which is the default log location. The kube-controller-manager is the best place to start investigating in the following scenarios:

· If the status of nodes is incorrect, such as when running nodes are showing as down.

· If services aren't being updated with the correct pod information.

· If persistent volumes aren't being allocated.

· If default accounts and tokens aren't set for new namespaces.

As the kube-controller-manager is a stateless application, the single best way to resolve most issues directly related to it is to restart the affected pod. The cluster state is maintained in coordination between the kubelet on each node and kube-scheduler, using etcd. This means that if the incident looks like a configuration error, the kube-controller-manager is just another component propagating the problem and not actually the source of the incident.
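
How the restart is performed depends on how the component is deployed. If it runs as a regular pod, deleting the pod is enough; on kubeadm clusters it runs as a static pod, so the kubelet restarts it when its manifest is moved out of the manifests directory and back (the paths below are kubeadm defaults):

$ kubectl get pods -n kube-system -l component=kube-controller-manager -o wide

# mv /etc/kubernetes/manifests/kube-controller-manager.yaml /tmp/
# sleep 10
# mv /tmp/kube-controller-manager.yaml /etc/kubernetes/manifests/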

cloud-controller-manager

The cloud-controller-manager is a subset of the services in the main kube-controller-manager that handles the delegation of command and control to cloud-specific managers.

Common incidents involving this component include the failure of a service request made to a cloud provider, and the failure of a configuration change to sync between Kubernetes and the cloud provider. The cloud controller manager does not specifically handle storage, since the CSI plugin framework handles those delegations.

Troubleshooting the cloud-controller-manager follows the exact same principles as the kube-controller-manager, as the cloud-controller-manager has no state of its own to maintain. It executes, observes, and reports.

Kubernetes node components

kubelet

The kubelet is the daemon that runs on every node and acts as the local boss for anything the cluster managers want done on that node. This primarily involves starting up or shutting down pods and adjusting configuration files. If a node is running but not showing as active in the cluster, or if the running containers on a host do not match what the cluster thinks should be running, then the kubelet is most likely at fault. The default log file location is /var/log/kubelet.log.

When the kubelet on a specific node is having problems, the node will show up as unhealthy in the cluster status. The replication controllers will not treat the node as failed, since services can still reach its pods (via kube-proxy), but all requests to start new pods will be redirected to other nodes with capacity. Over time, if errors in the kubelet are not resolved, you could end up with a completely idle node, which is no help to anyone.

Healthy nodes report "Ready" as their state; any other status needs to be investigated.


$ kubectl get nodes
NAME           STATUS    AGE
192.168.1.10   Ready     2d
192.168.1.11   Ready     3d
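
When a node reports anything other than Ready, its conditions and the kubelet's own logs are the next stop (the node name is taken from the example above, and journalctl assumes a systemd-managed kubelet):

$ kubectl describe node 192.168.1.10

# systemctl status kubelet
# journalctl -u kubelet --since "1 hour ago" | tail -n 100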

kube-proxy

This component may be more or less irrelevant depending on the network configuration, but it is the standard Kubernetes gateway for all traffic between pods regardless of which host they are running on. The kube-proxy finds the target pods by using the services defined within the cluster. If traffic does not seem to be flowing to or from pods on a node but the node is showing as active in the cluster status, the kube-proxy is a great place to start the incident response. The standard log is located at /var/log/kube-proxy.log.
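
A reasonable first pass (the label and pod name follow kubeadm conventions, and the service name is a placeholder) is to confirm that kube-proxy is healthy and that the service actually has endpoints behind it:

$ kubectl get pods -n kube-system -l k8s-app=kube-proxy -o wide
$ kubectl logs -n kube-system kube-proxy-x7k2p --tail=100
$ kubectl get endpoints my-service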

Container Runtimes

Different issues can occur depending on your container platform of choice and the default on the distribution or managed offering that you are using.

Docker and its containerd-based products are by far the most common runtimes for containers used in Kubernetes clusters. Common issues involving docker-storage are the launch of the wrong container version and a lack of sufficient disk space for downloading container images. Both of these issues can be resolved by clearing the Docker cache on the individual node. Another common problem is the failure of a container to launch on certain nodes although no errors are reported to the Kubernetes cluster. If this happens, you should review the logs for containerd, since the kubelet uses containerd to handle the full lifecycle of containers on each node.
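
On an affected node, a rough sketch of that cleanup and log review looks like the following (docker system prune removes unused images and should be used with care; journalctl assumes a systemd-managed containerd):

# df -h /var/lib/docker
# docker system prune -a
# journalctl -u containerd --since "30 min ago"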

Podman is the most recent implementation of a container runtime for Linux, and it has the fewest moving parts, but it is really only used by Red Hat. The biggest architectural difference is that it has no daemon, but the CLI more or less mimics the equivalent Docker commands.

Windows containers were originally based on work done jointly between Microsoft and Docker, but they have since evolved. While traces of Docker's architecture and best practices can still be seen, it is very much a Microsoft-specific implementation.

Kubernetes addon components

DNS

Service discovery in Kubernetes happens in one of two ways: the first is through environment variables injected at pod creation, which cannot be dynamically updated, and the second is via DNS records. Obviously, essentially everyone uses DNS because it is dynamic and more flexible.

Multiple components are involved in using DNS to support service discovery successfully. Troubleshooting can be very different depending on the products used. You can have DNS configured within each running pod, node, or cluster, or simply have one big external DNS. The best option for you depends on your scenario and sensitivity to risk.

One common solution is to have containers use the local host for DNS resolution. The local host is configured with dnsmasq or an equivalent to handle DNS routing and caching. By handling DNS on each host, the cluster is more tolerant to minor disruptions on the control plane. The goal of the DNS routing is for external DNS requests to be sent to the proper external networks and for requests for internal services (typically using the dot svc domain) to be resolved using the DNS records managed by kube-controller-manager.

Resolving issues at this level can be as easy as testing DNS lookups from the command line of each node to see which one is giving the wrong answer, and then investigating the configuration on that specific host.
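
A quick in-cluster test and a node-level check might look like this (the busybox image is a placeholder, and the node-level query assumes a local dnsmasq-style resolver as described above):

$ kubectl run -it --rm dns-test --image=busybox:1.36 --restart=Never -- \
    nslookup kubernetes.default.svc.cluster.local

# dig @127.0.0.1 kubernetes.default.svc.cluster.local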

Web UI

A web UI (regardless of whether it is the default Kubernetes Dashboard or something more complicated like Cockpit or Rancher) typically runs as a service that manages a few pods within the Kubernetes cluster. Most incidents can be handled with standard application tactics. The biggest exception to this is when there is a problem with authentication, since it relies on a token from the cluster's RBAC sub-systems. Dashboard uses either a straight kubeconfig or a bearer token generated for an individual service account.

If the web UI is more complex (like Cockpit), then it can use more advanced authentication such as certificate-based authentication or an external identity provider (which could be anything from an htpasswd file to a full enterprise class SSO platform).

Load Balancers

Kubernetes clusters on public clouds increasingly use external load balancers by leveraging what the cloud offers (like ALB on AWS), since they are fully managed offerings. Depending on your requirements and where your cluster is deployed, clusters may instead run load balancers within the Kubernetes cluster itself (for example, NGINX or HAProxy), and on-premises clusters often use existing application delivery controllers outside of the cluster (for example, F5 or NetScaler).

Load-balancers are used to handle traffic that is ingress to the Kubernetes cluster and routed using defined services which map to pods on the individual nodes. In the future, there may be even more uses for the combination of intracluster load-balancing with service meshes.

Operators

While operators aren't technically add-ons to Kubernetes, they are also not a core function in most distributions. An operator is essentially a custom application-specific controller that knows how to create, manage, upgrade, and destroy instances of that application. Operators can be written using Helm, Ansible, or Go. If an incident occurs around an operator, it is often best to contact the providing vendor and upgrade to a later release. While every operator is built using the same SDK, they have enough individual nuances that it is crucial to engage internal development or the vendor.

Service Mesh

This functionality is an addon for Kubernetes, even though it is commonly deployed and becoming standard.

Incidents that involve the service mesh will often be the result of misconfiguration or a problem with the side-car proxy. As every pod will automatically have a sidecar proxy loaded and all network traffic will pass through it, this is the best place to start diagnosing issues.

Other common issues relate to the way in which traffic routing is configured when new versions are being deployed. Examples include when new instances are not rolling out fast enough to handle the increasing traffic volume or when old instances are being shut down before the traffic has been fully quiesced.

Since most current service mesh products can seamlessly integrate with cluster and application diagnostic tooling, managing the incident will follow a fairly standard flow. Issues that don't involve the redeployment of entire applications can typically be resolved by the SRE team on their own.

Supplementary infrastructure services

Container Registry

While some may consider a container registry to be a core service of Kubernetes, it is only needed for pulling new images in order to maintain the desired state within a running Kubernetes cluster, and Kubernetes imposes no requirements on which registry is used.

Internal corporate guidelines will differ widely, and there are quite a few options on the market, including Quay, Docker Hub, GitLab, ECR, ACR, and GCR. If a container registry is causing problems, this will be easily identified during the deployment process. If that is the case, the two most common error messages are: version not found and invalid credentials. Less common errors include network timeouts and typos in the name of the requested object. In more advanced organizations with more security and quality controls, the deployment can fail if the proper labels are not applied to the requested container.
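
These failures show up quickly on the pods themselves as ImagePullBackOff or ErrImagePull, and the pod's events spell out which of the errors above occurred (the pod name is an example):

$ kubectl get pods
$ kubectl describe pod web-5d8c7f9b6-x2k4j
$ kubectl get events --field-selector reason=Failed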

Deployment Tools

The tools that are used as part of the CI/CD pipeline or SRE team to handle building and deploying infrastructures (such as Ansible, Terraform, or CloudFormation) or deploying code into a Kubernetes cluster (such as Azure DevOps, AWS CodeDeploy, or Travis-CI) will vary by environment. The DevOps and development teams are responsible for figuring out why things fail at these steps. It can be as simple as a missing dependency or as complicated as the deployment of the incorrect version of runtime to the platform. Since containers can run on ARM, x86_64, Power, and even IBM Z systems, using the proper runtime can make a huge difference.

Teams involved

While every team should be able to see the monitoring and alerting data that other teams are receiving, different teams will be better equipped to handle different types of components during an incident response. There will always be an operational center of excellence (CoE) that will be first in line to apply known fixes to known problems in any organization with any kind of scale. If they can not resolve the problem, they will escalate to higher tiers based on the type of incident. These operational centers can be anything from a command center, network operations center (NOC), security operations center (SOC), service desk, or even a dedicated application support team for high value apps.

The actual time that it takes to triage and escalate will be determined by a combination of factors including severity, client-specific service level agreements (SLA), the time of day, or even the day of the year (for example, 11:00PM on the night before Black Friday is more important than 2:00AM on any given Sunday to the average retailer.)

Kubernetes Platform Monitoring Samples

This section offers a view into how various popular Kubernetes distributions and managed offerings handle metric collection and log management. It also covers integrating each platform's native tooling with an external incident response solution so that the metrics and events can be used to build alerts (which are a cornerstone of any incident response solution).

This is not an exhaustive list of Kubernetes platforms, but rather, a sampling of some of the most popular offerings. There are currently 40 Certified Hosted Kubernetes offerings and 58 Certified Kubernetes Distributions.

Google Kubernetes Engine (GKE)

Since it is the originator of Kubernetes, Google deserves to be first in this section. As with everything on Google Cloud, Stackdriver is the answer to all of your monitoring and alerting needs, since it both correlates logs and tracks the performance of all core offerings within the Google Cloud Platform.

Stackdriver can connect to and monitor components on AWS, but if you are going multi-cloud, then it is best to use proven third-party multi-cloud solutions.

AWS Managed Kubernetes Service (EKS)

AWS has an interesting relationship with Kubernetes. Kubernetes has become the industry standard orchestration engine for containers, but while AWS has made it available to customers through their EKS offering, it still prioritizes development and marketing of its in-house container management offering, ECS.

The easiest way to monitor the control plane of EKS for errors is to leverage CloudWatch for logs and CloudTrail to capture all of the API calls within EKS. To expose metrics from the worker nodes for use in monitoring, a tool called metrics-server needs to be enabled and deployed into the Kubernetes cluster. As is standard with many Kubernetes offerings, AWS recommends Prometheus for centralizing the tracking and trend analysis of the metrics gathered across the worker and control plane nodes.
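
A minimal sketch of that setup uses the upstream metrics-server manifest (verify the version and any pinning your organization requires before applying a manifest from the internet):

$ kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
$ kubectl top nodes
$ kubectl top pods --all-namespaces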

Connecting from AWS native tools to external solutions for notifications is almost always done by configuring SNS topics. This allows for pushing data to webhooks, SMS, and email.

Azure Kubernetes Service (AKS)

Like most products running on Microsoft Azure, AKS has leveraged the Azure Monitor to correlate and generate alerts. While not as mature as other Kubernetes monitoring solutions, Azure Monitor has the capability of monitoring all components of the cluster including application health. Since Azure Monitor is fully integrated with the Azure cloud, it can also pull the metrics for the underpinning infrastructure, including the virtual machines which the cluster is running on.

Microsoft Azure Monitor is the centralized point in the Azure public and government clouds which handles monitoring metrics and generating any required alerts. (It is still partially in preview mode as of this writing.) The on-premises version is called Azure Operations Management Suite and is part of the Microsoft Systems Center family.

To integrate with Azure Monitor, you create an "Action Group" and define the webhook (or email address, etc.) which is the desired endpoint for any alert notifications. The action group is then attached to an alert rule based on any of the metrics that are enabled across the entire Azure cloud platform.

Red Hat OpenShift Container Platform (OCP)

Red Hat OpenShift is a product suite with multiple offerings based on Kubernetes and several other open-source products. It is a complete solution for enterprises looking to get up and running with containers, since it has all of the tooling required to go from source to production and can also update itself.

The difference between the various products in the portfolio (such as ARO and OCP) is whether the deployment model is hosted or on-premises.

With all of its components as well as its reliance on operators to handle updating and self-healing, there are a lot of moving parts that will need to be watched and managed. OpenShift includes two tools that will be used as part of an incident response solution to generate alerts. It uses Prometheus for event monitoring and trending, and ELK to do log consolidation and visualization.

The primary way to integrate OCP with Incident Response solutions is using the Alertmanager plugin for Prometheus. It is done by creating a route and receiver in the Alertmanager configuration.
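
A minimal sketch of such a route and receiver, with a placeholder webhook URL, looks like this (on OpenShift this configuration is typically applied by updating the alertmanager-main secret in the openshift-monitoring namespace rather than editing a file directly):

$ cat <<'EOF' > alertmanager-route.yaml
route:
  receiver: incident-webhook
receivers:
  - name: incident-webhook
    webhook_configs:
      - url: https://example.com/hooks/kubernetes-alerts
EOF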

Docker Kubernetes Service

There are two prongs to the Docker Kubernetes offerings. One prong focuses on the desktop through Kubernetes in Docker (called kind) or Docker Desktop. This is great for testing but will not be used beyond the desktop.

The second and more important prong is Docker Enterprise. This product is transitioning from Docker's proprietary orchestration engine (called Swarm) to a fully certified Kubernetes distribution while maintaining all of the tools that people are familiar with, such as docker-compose.

Docker Enterprise includes a customized set of tooling that provides dashboards and the ability to export other metrics. These can be integrated with more common monitoring and management tools like Prometheus and Splunk, and they can be leveraged to create alerts which would form the basis for new incidents.

Cloud Management Platforms

Cloud management platforms offer a way to abstract Kubernetes management across multiple clusters and clouds. Rancher offers multiple products for Kubernetes, from its namesake main offering to its popular k3s project. It's a cloud-independent way to create and manage Kubernetes clusters using its own distribution on premises. If you want to host applications across multiple public clouds, these platforms will allow even small organizations to leverage and integrate offerings from providers like GKE and EKS.

Platform9 also prides itself on flexibility as a multi-cloud manager, and it extends to virtual machines in addition to Kubernetes. Combining this functionality with the ability to aggregate logs and integrate with leading application and log management tools, it has the ability to support organizations of any size moving towards a cloud native future.

All of these products typically build their tooling around open-source products like Prometheus and Grafana in order to expose real-time metrics and provide alerting. They will often enhance the base open-source projects and tailor them to their offerings (such as Cortex from Weaveworks). These custom solutions will still integrate with any modern incident response product offerings.

This posting does not necessarily represent Splunk's position, strategies, or opinion.