VMware Inc.

11/24/2021 | News release | Distributed by Public on 11/24/2021 09:33

Using Distributed Tracing and RED Method to Map API Dependency and Monitor Reliability

VMware Cloud Services runs secure, highly available, and reliable services. Internally, this is a collection of numerous teams and microservices. While we have full control over our microservices, there are always dependencies on both internal and external SaaS solutions. For such dependencies, we can monitor various metrics with the goal of making our microservices resilient and meeting our service level objectives (SLOs). It is important to note that not all APIs are equal. For various business and technical reasons, we choose to create tiers of service. For example, login requires low latency, whereas downloading files can be much slower.

In this article, we will describe how we use Distributed Tracing and the RED Method to monitor API dependencies per API tier. We will describe how we correlate incoming (inbound) requests with outgoing (outbound) requests and how we monitor the outbound requests.

Before diving in, let's make sure we know what Distributed Tracing and the RED Method are. Both are essential for distributed system observability.

Distributed Tracing

Logging, Metrics, and Distributed Tracing are the three pillars of observability. I would argue that this list is ordered by the maturity level of the application. Mature and reliable applications should have all three pillars, each of which provides a different angle on a problem that can arise.

Distributed Tracing is the ability to get a clear and holistic view over a transaction in a distributed system as the request travels between the different parts of the system. The request identifier is conveyed between the microservices and creates a parent-child correlation.

The full transaction is called a trace and the sub-transactions are called spans. Looking at a trace and its spans helps to quickly pinpoint where a problem is. All the information is contextual, which helps figure out what may be causing an issue.
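The parent-child correlation can be sketched in a few lines of plain Java. This is a simplified model rather than the data structure of any particular tracer; the class name `TraceSpan` and its fields are assumptions made for illustration.

```java
import java.util.UUID;

// Hypothetical, simplified span model. Real tracers also record
// timestamps, tags, and logs on each span.
final class TraceSpan {
    final String traceId;   // shared by every span of the same transaction
    final String spanId;    // unique per span
    final String parentId;  // null for the root span
    final String operation;

    private TraceSpan(String traceId, String parentId, String operation) {
        this.traceId = traceId;
        this.spanId = UUID.randomUUID().toString();
        this.parentId = parentId;
        this.operation = operation;
    }

    // Starts a new trace: the root span has no parent.
    static TraceSpan root(String operation) {
        return new TraceSpan(UUID.randomUUID().toString(), null, operation);
    }

    // Creates a child span that inherits the trace ID and points back to this span.
    TraceSpan child(String operation) {
        return new TraceSpan(this.traceId, this.spanId, operation);
    }
}
```

For example, `TraceSpan.root("GET /orders").child("POST /emails")` yields a span with the same `traceId` as the root and a `parentId` equal to the root's `spanId`, which is what lets a backend reassemble the full trace.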

Many vendors provide distributed tracing instrumentation libraries; my advice is to choose an open standard. There are two main standards, OpenTracing and OpenCensus, which have merged into a third standard, OpenTelemetry, a CNCF incubating project. The merger is a good thing, but unfortunately the OpenTelemetry instrumentation libraries are not yet available for all programming languages and are not rich enough as of now. We currently use OpenTracing. The instrumentation libraries send the traces to a backend that also provides a UI. Some backends are open-source servers (Zipkin and Jaeger, to name a few). Others are commercial SaaS services such as VMware Tanzu Observability by Wavefront, the one we use.

RED Method

The RED Method is an observability pattern used to monitor distributed systems such as those built on a microservices architecture. The method was published by Tom Wilkie. The name is borrowed from Brendan Gregg's USE Method, which is a nice way to monitor resources such as CPU, memory, etc.

RED stands for:

  • (R) Request Rate - The number of requests per second
  • (E) Request Errors - The number of failed requests per second
  • (D) Request Duration - A histogram of request duration

When looking at the RED metrics, one can quickly identify whether the system is slow or has a lot of errors. You can also look at the RED metrics of each point in your distributed system and, without knowing what that point does, quickly identify where the bottleneck is.
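As an illustration, the three numbers can be computed from the requests observed in a time window. The sketch below is plain Java with hypothetical names (`RedMetrics`, `compute`) and a simple fixed-bucket histogram; it is not how any specific monitoring backend implements the aggregation.

```java
import java.util.Map;
import java.util.TreeMap;

final class RedMetrics {
    final double requestsPerSecond;            // (R) rate
    final double errorsPerSecond;              // (E) errors
    final Map<Long, Integer> durationBuckets;  // (D) bucket upper bound in ms -> request count

    private RedMetrics(double requestsPerSecond, double errorsPerSecond,
                       Map<Long, Integer> durationBuckets) {
        this.requestsPerSecond = requestsPerSecond;
        this.errorsPerSecond = errorsPerSecond;
        this.durationBuckets = durationBuckets;
    }

    // durationsMs[i] and failed[i] describe the i-th request seen in the window.
    // bucketBoundsMs must be sorted in ascending order.
    static RedMetrics compute(long[] durationsMs, boolean[] failed,
                              double windowSeconds, long[] bucketBoundsMs) {
        if (durationsMs.length != failed.length) {
            throw new IllegalArgumentException("durationsMs and failed must have equal length");
        }
        if (windowSeconds <= 0) {
            throw new IllegalArgumentException("windowSeconds must be positive");
        }
        Map<Long, Integer> histogram = new TreeMap<>();
        for (long bound : bucketBoundsMs) {
            histogram.put(bound, 0);
        }
        histogram.put(Long.MAX_VALUE, 0); // overflow bucket for very slow requests

        int errors = 0;
        for (int i = 0; i < durationsMs.length; i++) {
            if (failed[i]) {
                errors++;
            }
            long bucket = Long.MAX_VALUE;
            for (long bound : bucketBoundsMs) {
                if (durationsMs[i] <= bound) { // first bucket large enough for this duration
                    bucket = bound;
                    break;
                }
            }
            histogram.merge(bucket, 1, Integer::sum);
        }
        return new RedMetrics(durationsMs.length / windowSeconds,
                              errors / windowSeconds, histogram);
    }
}
```

Keeping the duration as a histogram rather than an average is what makes percentile views (such as P95) possible later.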

The great power comes when you can slice and dice the numbers according to the operation (span) type or any custom tag you can imagine.

For example:

  • RED metrics for HTTP server calls in "Orders Microservice"
  • RED metrics for HTTP POST calls in "Email Microservice"
  • RED metrics for consumption of Apache Kafka "orders" topic by "Shipment Microservice"
  • RED metrics for JDBC calls by "Customers Microservice"
  • RED metrics for RPC calls from "Orders Microservice" to "Email Microservice"
  • RED metrics for the entire application

And many more…

We use VMware Tanzu Observability by Wavefront, which derives RED metrics from traces and spans. It automatically aggregates RED metrics for tags such as HTTP status code, span type, etc. You can instrument the application to send your own custom tags and configure the SDK to aggregate RED metrics for those custom tags as well.

Dashboard

To explain better, let's look at the results shown in the table charts of our monitoring system. We have three tables. Each table has three columns:

  1. Inbound API - The HTTP endpoint, that is, the inbound call we expose to our customers. It contains the tier, controller method name, HTTP verb, and URI pattern.
  2. Outbound API - The third-party dependency that our endpoint calls, including its name, HTTP verb, and URI pattern.
  3. Value - Depends on the table:
    1. Client P95 execution time - Shows which outbound third-party API has a long execution time in the context of each inbound API.
    2. Count - A simple counter. For the time range specified in the dashboard filter, it shows how many times we called the outbound third-party API in the context of each inbound API.
    3. Error Count - The number of errors in the outbound third-party API in the context of each inbound API.
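To make the three value columns concrete, here is a minimal sketch of the aggregation behind such tables: outbound calls are grouped by the (inbound API, outbound API) pair, and each group yields a count, an error count, and a P95 duration. All class names are hypothetical, and the nearest-rank percentile used here is an assumption; a tracing backend may estimate percentiles differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// One observed outbound call, tagged with the inbound API it was serving.
final class OutboundCall {
    final String inboundApi;
    final String outboundApi;
    final long durationMs;
    final boolean error;

    OutboundCall(String inboundApi, String outboundApi, long durationMs, boolean error) {
        this.inboundApi = inboundApi;
        this.outboundApi = outboundApi;
        this.durationMs = durationMs;
        this.error = error;
    }
}

// One table row: the three "Value" variants for a single inbound/outbound pair.
final class DependencyRow {
    final long count;
    final long errorCount;
    final long p95Ms;

    DependencyRow(long count, long errorCount, long p95Ms) {
        this.count = count;
        this.errorCount = errorCount;
        this.p95Ms = p95Ms;
    }
}

final class DependencyTable {
    static Map<String, DependencyRow> aggregate(List<OutboundCall> calls) {
        // Group calls by "inbound -> outbound" so every pair becomes one row.
        Map<String, List<OutboundCall>> groups = new TreeMap<>();
        for (OutboundCall call : calls) {
            groups.computeIfAbsent(call.inboundApi + " -> " + call.outboundApi,
                                   key -> new ArrayList<>()).add(call);
        }
        Map<String, DependencyRow> rows = new TreeMap<>();
        for (Map.Entry<String, List<OutboundCall>> entry : groups.entrySet()) {
            List<OutboundCall> group = entry.getValue();
            long[] sorted = group.stream().mapToLong(c -> c.durationMs).sorted().toArray();
            long errors = group.stream().filter(c -> c.error).count();
            // Nearest-rank P95: ceil(0.95 * n), computed with integer arithmetic.
            int rank = (95 * sorted.length + 99) / 100;
            rows.put(entry.getKey(), new DependencyRow(sorted.length, errors, sorted[rank - 1]));
        }
        return rows;
    }
}
```

Each row key combines the inbound and outbound API, so the same third-party endpoint appears once per inbound API that depends on it, which is what makes the numbers contextual.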

Instrumentation

Let's see how you can add such instrumentation to your project in order to collect the numbers.

The code snippets below use Java with Spring Boot and OpenTracing libraries. The concept is applicable to other programming languages and other monitoring systems.

First, you need to annotate the REST controllers with the API tier level. The annotation will be used to correlate the outbound call to the API tier level it is serving. You don't have to add the annotation to all your APIs; if it's missing, you can define a default, for example, tier2.
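A tier annotation with a default fallback could look roughly like the following. The annotation name `ApiTier`, the controller, and the resolver are hypothetical, and Spring annotations are left out so the sketch stays dependency-free; in a Spring Boot service, the lookup would typically run in an interceptor or filter on the inbound request.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;

// Hypothetical annotation name; it must be retained at runtime to be readable by reflection.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface ApiTier {
    String value();
}

// Stand-in for a REST controller (Spring annotations omitted in this sketch).
class AuthController {
    @ApiTier("tier1")
    public String login() {
        return "ok";
    }

    // No annotation here: the resolver falls back to the default tier.
    public String downloadFile() {
        return "file";
    }
}

final class ApiTierResolver {
    static final String DEFAULT_TIER = "tier2";

    // Returns the tier declared on the handler method, or the default if none is declared.
    static String tierOf(Class<?> controller, String methodName) {
        for (Method method : controller.getDeclaredMethods()) {
            if (method.getName().equals(methodName)) {
                ApiTier tier = method.getAnnotation(ApiTier.class);
                return tier != null ? tier.value() : DEFAULT_TIER;
            }
        }
        throw new IllegalArgumentException("No such handler method: " + methodName);
    }
}
```

Here `ApiTierResolver.tierOf(AuthController.class, "login")` resolves to `tier1`, while `downloadFile` resolves to the default `tier2`.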

When a request reaches the HTTP client, you need to use the hook point that is available before the API call, where you can tag the outbound span with inbound tags and outbound tags. In our case, we use a Spring RestTemplate, for which we created a custom RestTemplateSpanDecorator and overrode onRequest(), where we extract baggage items from the context and add them to the client call span as tags.
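The baggage-to-tag step can be sketched without any tracing dependency. In the code below, `SketchSpan` is a hypothetical stand-in for a tracer's span (OpenTracing's `Span` offers comparable `getBaggageItem` and `setTag` methods), and the tag keys are assumptions chosen for illustration. In a real tracer, the baggage would be set on the inbound span and propagated to the client span automatically.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal stand-in for a tracing span; not a real tracer class.
final class SketchSpan {
    private final Map<String, String> baggage = new HashMap<>();
    private final Map<String, String> tags = new HashMap<>();

    void setBaggageItem(String key, String value) { baggage.put(key, value); }
    String getBaggageItem(String key) { return baggage.get(key); }
    void setTag(String key, String value) { tags.put(key, value); }
    String getTag(String key) { return tags.get(key); }
}

final class OutboundSpanTagger {
    // Hypothetical baggage/tag keys describing the inbound API.
    static final List<String> INBOUND_KEYS =
        List.of("inbound.tier", "inbound.method", "inbound.uri");

    // Intended to run just before the HTTP client sends the request,
    // mirroring the onRequest() hook described in the article.
    static void onRequest(SketchSpan clientSpan, String httpMethod, String uriPattern) {
        for (String key : INBOUND_KEYS) {
            String value = clientSpan.getBaggageItem(key);
            if (value != null) {
                // Copy the inbound context onto the outbound span as a tag.
                clientSpan.setTag(key, value);
            }
        }
        // Describe the outbound call itself, using the URI pattern rather than the raw URL.
        clientSpan.setTag("outbound.method", httpMethod);
        clientSpan.setTag("outbound.uri", uriPattern);
    }
}
```

Because both inbound and outbound tags end up on the same client span, the backend can group RED metrics by any combination of them.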

Having the tags on spans allows our tracing system to aggregate the numbers.

Next, create a chart to represent the result. In our case, we created a tabular chart using WQL syntax and the tags from our instrumentation mentioned above.

Caveats

Tag or label cardinality is limited in monitoring systems. Make sure you keep the tag set as small as possible. High cardinality may slow down the performance and aggregations of your monitoring system.

For example, if your inbound URL is "/customers/123", make sure to use the path pattern "/customers/{customer-id}" when adding the tag.
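When the framework's matched route pattern is not at hand, a fallback is to normalize the raw path before tagging. The sketch below is an assumption-laden heuristic, not the approach described in this article: it replaces purely numeric or UUID-shaped path segments with a generic `{id}` placeholder, and the class name `UriTagNormalizer` is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UriTagNormalizer {
    // A whole path segment that is either all digits or a canonical UUID.
    private static final Pattern ID_SEGMENT = Pattern.compile(
        "(?<=/)(\\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
        + "-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})(?=/|$)");

    // Replaces identifier-like segments so the tag value has bounded cardinality.
    static String normalize(String path) {
        return ID_SEGMENT.matcher(path).replaceAll(Matcher.quoteReplacement("{id}"));
    }
}
```

With this heuristic, "/customers/123" becomes "/customers/{id}", while a segment such as "v2" is left untouched. Using the route template from the web framework is preferable when available, since it preserves meaningful names like {customer-id}.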

Summary

In this article, we described how we added instrumentation to our code to enrich our traces with additional information. This information helps us see the relationship between inbound and outbound APIs and spot potential issues in outbound APIs in the context of our use cases.