Sumo Logic Inc.

07/15/2021 | Press release | Distributed by Public on 07/15/2021 16:31

Monitoring Apache Kafka Clusters with Sumo Logic

Apache Kafka® is one of the most popular streaming and messaging platforms, commonly used in a pub-sub (publish-subscribe) model, where producer software applications send data via messages that consumer software applications can consume. Teams use Kafka for a variety of use cases, including monitoring user activity, sending notifications, and concurrently processing streams of incoming data such as financial transactions.

Apache Kafka is deployed as a cluster that consists of multiple servers called brokers, which receive and store incoming messages from producers and send them to consumers when requested. Zookeeper nodes store important information about the Kafka clusters as well as consumer client details.
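To make the message flow concrete, here is a minimal in-memory sketch of the pub-sub idea: producers append messages to a per-topic log, and consumers read from an offset they track themselves. The `MiniBroker` class and its methods are hypothetical names used only for illustration; this is not the Kafka client API, and it ignores partitions, replication and persistence.

```python
from collections import defaultdict


class MiniBroker:
    """Toy in-memory stand-in for a broker: one append-only log per topic."""

    def __init__(self):
        self._logs = defaultdict(list)

    def publish(self, topic, message):
        """Producer side: append a message and return its offset in the log."""
        log = self._logs[topic]
        log.append(message)
        return len(log) - 1

    def fetch(self, topic, offset, max_messages=10):
        """Consumer side: read up to max_messages starting at the given offset."""
        return self._logs[topic][offset:offset + max_messages]


broker = MiniBroker()
broker.publish("payments", "txn-1")
broker.publish("payments", "txn-2")

# A consumer keeps its own position (offset) and asks the broker for more.
consumer_offset = 0
batch = broker.fetch("payments", consumer_offset)
consumer_offset += len(batch)
print(batch)  # ['txn-1', 'txn-2']
```

Because the broker only stores messages and hands them out on request, producers and consumers never need to know about each other, which is the property that makes the broker's health so important to monitor.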

Given the size, complexity and critical nature of Apache Kafka in production, it is essential to understand the internal state of the Kafka cluster and its impact on the applications and services that depend on it.

We are therefore happy to announce the availability of the Sumo Logic Kafka app to monitor the availability, performance and resource utilization of Apache Kafka clusters across brokers, Zookeeper nodes and topics, with 17 out-of-the-box dashboards for visualization and 11 out-of-the-box alerts for proactive notifications.

With this app, you can now:

  • Centralize the collection of all telemetry data from Apache Kafka clusters running on premises, in the cloud or in Kubernetes environments

  • Get automatically notified when critical conditions occur

  • Comprehensively monitor Apache Kafka clusters and nodes

  • Get detailed insight into the performance and operations of Apache Kafka topics and replication

How does it work?


The Sumo Logic Kafka app supports the collection of log and metrics telemetry data from Apache Kafka clusters running in multiple hybrid environments.

Sumo Logic uses Telegraf for the collection of metrics from Apache Kafka clusters. Telegraf is an open-source data collection agent for metrics and uses built-in input plugins to fetch metrics from software applications such as Apache Kafka, and output plugins to send collected metrics to another system, such as Sumo Logic. We use the Kafka Jolokia input plugin to collect metrics from Apache Kafka clusters.
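As a rough illustration of what a Jolokia-based collector does under the hood, the sketch below builds the JSON body of a Jolokia "read" request for one Kafka broker MBean and parses a reply. It is not Telegraf's code: the helper names are hypothetical, the sample response is made up to match the general shape of a Jolokia reply, and no network call is made.

```python
import json

# A standard Kafka broker JMX MBean name for under-replicated partitions.
UNDER_REPLICATED_MBEAN = (
    "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions"
)


def build_read_request(mbean, attribute="Value"):
    """Return the JSON body for a Jolokia 'read' request (hypothetical helper)."""
    return json.dumps({"type": "read", "mbean": mbean, "attribute": attribute})


def parse_read_response(body):
    """Extract the value from a Jolokia reply, raising on a non-200 status."""
    doc = json.loads(body)
    if doc.get("status") != 200:
        raise RuntimeError(f"Jolokia error: {doc.get('error', 'unknown')}")
    return doc["value"]


# Made-up reply for illustration only; values are not from a real broker.
sample_response = json.dumps({
    "request": {
        "type": "read",
        "mbean": UNDER_REPLICATED_MBEAN,
        "attribute": "Value",
    },
    "value": 0,
    "timestamp": 1626366660,
    "status": 200,
})

print(build_read_request(UNDER_REPLICATED_MBEAN))
print(parse_read_response(sample_response))
```

In a real deployment Telegraf handles this polling for you on a schedule and forwards the resulting metrics through its output plugin, so you would not normally write this code yourself.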

While metrics are collected via Telegraf, log data is collected either via agent-based installed collectors or via our standard Kubernetes collection methods in Kubernetes environments.

For more information on how collection in both Kubernetes and non-Kubernetes environments works, see:

  • Collect Kafka Logs and Metrics for Kubernetes environments

  • Collect Kafka Logs and Metrics for Non-Kubernetes environments

Using the Sumo Logic app

Once data collection has been set up, the next step is to analyze it via dashboards and get notified when critical conditions occur via alerts.

Alerts for Apache Kafka

Pre-packaged alerts enable you to get proactively notified when critical conditions occur in your Apache Kafka cluster. These alerts are based on Sumo Logic Monitors, which allow you to set robust and configurable alerting policies that enable you to get notified about critical changes or issues affecting your production application.

Monitors for Kafka leverage metrics and logs, and include preset thresholds for high resource utilization, disk usage, errors, failed connections, under-replicated and offline partitions, unavailable replicas, consumer replica lag and other critical conditions.
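Conceptually, a threshold-based monitor compares the latest value of each metric against a preset limit. The sketch below shows that idea in a few lines. The metric names and threshold values are assumptions chosen for illustration; they are not the presets that ship with the app, and Sumo Logic Monitors are configured in the product rather than in code like this.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    metric: str            # hypothetical metric name
    critical_above: float  # alert when the value is strictly greater than this


# Illustrative thresholds only; the app's real presets may differ.
THRESHOLDS = [
    Threshold("offline_partitions", 0),
    Threshold("under_replicated_partitions", 0),
    Threshold("disk_used_percent", 85.0),
]


def evaluate(sample, thresholds=THRESHOLDS):
    """Return the names of metrics whose value exceeds its critical threshold.

    Metrics missing from the sample are skipped rather than treated as alerts.
    """
    return [
        t.metric
        for t in thresholds
        if t.metric in sample and sample[t.metric] > t.critical_above
    ]


print(evaluate({"offline_partitions": 2, "disk_used_percent": 50.0}))
```

A threshold of zero for offline and under-replicated partitions reflects the common expectation that a healthy cluster has none of either.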

Monitoring Apache Kafka Cluster Operations

While running Kafka in production, it is critical that you monitor the overall state of your Apache Kafka clusters. The Kafka - Cluster Overview dashboard gives you exactly this with an at-a-glance view of your Kafka deployment across brokers, controllers, topics, partitions and zookeepers.

You can use this dashboard to identify brokers that don't have active controllers, to analyze trends for request handlers to determine whether additional resources are required, and to monitor the number of leaders, partitions and zookeepers across each cluster to ensure they match expectations.
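One check behind the controller view can be sketched as follows. In Kafka, exactly one broker in a cluster acts as the active controller at any time, so the per-broker active controller counts should sum to 1 for each cluster. The function name and the tuple layout of the input are assumptions made for this example.

```python
from collections import defaultdict


def controller_problems(samples):
    """Find clusters whose active controller total is not exactly 1.

    samples: iterable of (cluster, broker, active_controller_count) tuples,
    where the count is 1 on the broker acting as controller and 0 elsewhere.
    Returns {cluster: total} for every cluster with a total other than 1.
    """
    totals = defaultdict(int)
    for cluster, _broker, count in samples:
        totals[cluster] += count
    return {cluster: total for cluster, total in totals.items() if total != 1}


samples = [
    ("prod", "broker-1", 1),
    ("prod", "broker-2", 0),
    ("staging", "broker-1", 0),
    ("staging", "broker-2", 0),
]
print(controller_problems(samples))  # {'staging': 0}
```

A total of 0 means the cluster currently has no controller, while a total above 1 suggests a split-brain condition; both are worth an alert.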

The Kafka - Outlier Analysis dashboard analyzes trends to quickly identify outliers for key Apache Kafka performance and availability metrics such as offline partitions, partition count, incoming messages and outgoing bytes across your Kafka clusters.
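For readers curious what outlier detection on a metric series can look like, here is one common, generic technique: the modified z-score based on the median absolute deviation (MAD). This is an illustrative stand-in and not necessarily how the dashboard computes outliers; the function name and the 3.5 cutoff are assumptions.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return the indices of values whose modified z-score exceeds threshold.

    Uses the median absolute deviation, which is less sensitive to the
    outliers themselves than a mean and standard deviation would be.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread around the median, so nothing to score against
    return [
        i
        for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Hypothetical incoming-message rates per minute for one broker.
rates = [100, 102, 98, 101, 99, 500]
print(mad_outliers(rates))  # [5]
```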

Monitoring Apache Kafka Broker Nodes

As part of monitoring your production Kafka clusters, it's helpful to understand how individual broker nodes are performing critical tasks. Our Kafka - Broker dashboard provides an at-a-glance view of the state of your partitions, active controllers, leaders, throughput, and network across Kafka brokers and clusters. You can use this dashboard to quickly identify if a Kafka broker is down or overutilized by monitoring under-replicated and offline partitions and out-of-sync replicas. You can also use this dashboard to determine if a broker is performing key Apache Kafka related tasks as expected by monitoring network throughput, incoming message rate, request rates and log flush rates.

While metrics can help you quickly identify a problem with an Apache Kafka node, you can't use metrics alone to completely understand why the problem occurred in the first place.

The Kafka - Logs dashboard helps you quickly analyze error log messages across all brokers in a cluster. You can use this to identify critical events in your broker and controller logs, examine trends to quickly detect spikes in errors or in fatal events, monitor critical broker events (additions, starts, stops) and quickly analyze log message patterns via the Sumo Logic LogReduce algorithm.

If error logs don't give you the answer, you can further determine if a problem is associated with a lack of resources via a set of pre-built resource utilization dashboards.

For example, to understand how broker nodes are consuming critical resources such as disk, we have a Kafka - Disk Usage dashboard, which you can use to monitor disk usage, inode bytes used and disk throughput across all Kafka brokers. These are critical metrics because Kafka brokers store messages on disk, and a heavily occupied or under-performing disk could either bring your Apache Kafka deployment down or cause major performance bottlenecks. You can also use this dashboard to analyze trends in usage and throughput to determine if your cluster is over- or under-provisioned.
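The arithmetic behind a disk usage trend is simple enough to sketch. The two helpers below compute the percentage of disk used and a straight-line estimate of how long until the disk fills at the current growth rate. Both function names are hypothetical, and a linear projection is a deliberate simplification: real usage depends on retention settings and traffic patterns.

```python
def disk_used_percent(used_bytes, total_bytes):
    """Return disk usage as a percentage of total capacity."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


def hours_until_full(used_bytes, total_bytes, growth_bytes_per_hour):
    """Linear projection of hours left before the disk is full.

    Returns None when usage is flat or shrinking, since no fill time applies.
    """
    if growth_bytes_per_hour <= 0:
        return None
    return (total_bytes - used_bytes) / growth_bytes_per_hour


print(disk_used_percent(800, 1000))     # 80.0
print(hours_until_full(800, 1000, 50))  # 4.0
```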

Similarly, we also have detailed dashboards that monitor CPU and memory utilization, garbage collection and threads, as well as class loading and compilation on broker nodes.

Monitoring Apache Kafka Zookeeper Nodes

Apache Kafka Zookeeper nodes store important information about Kafka clusters as well as consumer client details. The Kafka - Zookeeper dashboard helps you get insight into the health and performance of your Kafka Zookeeper nodes by monitoring key Zookeeper metrics such as disconnect rate, authentication failures, session expiration and connection rate.

Monitoring Kafka Topics and Replication

Kafka Topics are used to categorize messages and are broken down into a number of partitions. To understand the state of the system, it is necessary to understand how data is being sent across topics and how partitions are being replicated, in addition to monitoring broker nodes.

The Kafka - Topic Overview dashboard helps you visualize incoming bytes by Kafka topic, server and cluster as well as quickly identify under-replicated partitions in topics. Under-replicated partitions either do not have enough replicas to meet the desired replication factor or are partitions where one or more of the replicas have fallen significantly behind the partition leader. Common causes of under-replicated partitions are unresponsive brokers, or performance issues in the cluster that have caused one or more brokers to fall behind on replication.
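The definition above translates directly into a small check: a partition is under-replicated when its in-sync replica (ISR) set is smaller than its assigned replica set. The `PartitionState` type and the sample data are hypothetical; in practice this information comes from broker metrics or cluster metadata.

```python
from dataclasses import dataclass


@dataclass
class PartitionState:
    topic: str
    partition: int
    replicas: list  # broker IDs assigned to hold a copy of the partition
    isr: list       # subset of replicas currently in sync with the leader


def under_replicated(partitions):
    """Return (topic, partition) pairs whose ISR is smaller than the replica set."""
    return [
        (p.topic, p.partition)
        for p in partitions
        if len(p.isr) < len(p.replicas)
    ]


states = [
    PartitionState("orders", 0, replicas=[1, 2, 3], isr=[1, 2, 3]),
    PartitionState("orders", 1, replicas=[1, 2, 3], isr=[2]),
    PartitionState("clicks", 0, replicas=[2, 3], isr=[3]),
]
print(under_replicated(states))  # [('orders', 1), ('clicks', 0)]
```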

We also have the Kafka - Topic Details dashboard that gives you insight into throughput, partition sizes and offsets across Kafka topics. You can use this dashboard to monitor trends of critical metrics like partition log metrics, offline and under-replicated partitions, and in-sync replica (ISR) shrink and expand rates. Increased shrink or expand rates can indicate either a degradation in performance or a lack of, or limited, availability of the Kafka cluster.
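Rates such as ISR shrinks and expands per second are typically derived from cumulative counters sampled at two points in time. The sketch below shows that calculation, including one way of handling a counter that resets when a broker restarts. The function name and the reset-handling choice are assumptions for illustration, not a description of how the app computes its panels.

```python
def per_second_rate(prev_count, prev_ts, cur_count, cur_ts):
    """Compute an events-per-second rate from two cumulative counter samples.

    Timestamps are in seconds. If the counter went backwards (for example
    after a broker restart), the current value is treated as the increase.
    """
    elapsed = cur_ts - prev_ts
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = cur_count - prev_count
    if delta < 0:
        delta = cur_count  # counter reset: count only what accrued since
    return delta / elapsed


# Hypothetical ISR shrink counter sampled one minute apart.
print(per_second_rate(10, 0, 40, 60))  # 0.5
```

A sustained non-zero shrink rate without a matching expand rate suggests replicas are dropping out of sync and not recovering.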

In addition to monitoring replication by Kafka topics, you can also monitor replication performance by cluster and server with the Kafka - Replication dashboard to identify broker nodes that are either causing or affected by replication failures.

Get Started Now!

The Sumo Logic app for Apache Kafka is a unified logs and metrics app that provides visibility into the availability, performance and resource utilization of Apache Kafka clusters. Preconfigured dashboards provide key insights into the operations of clusters, brokers, Zookeeper nodes and topics, while pre-packaged alerts proactively notify you when critical conditions occur.

To get started, check out the Sumo Logic Kafka App help doc. If you don't yet have a Sumo Logic account, you can sign up for a free trial today.

Additional Resources

Download the Sumo Logic Continuous Intelligence Report that quantitatively defines the state of the modern application stack and the shift in technology used by enterprises adopting Cloud and DevSecOps during the COVID-19 global pandemic.
