
Project Trinidad: Merging Security with Modern Application Observability – Part 3: ML-Based Workload Security

Most production Kubernetes clusters are a melting pot of diverse, complex, highly interconnected services. Despite the availability of several systems for monitoring these clusters, securing them remains challenging.

A famous saying in the business world teaches us that "you can't manage what you can't measure." In the Kubernetes world, measuring service functionality (uptime, latency, resource usage, etc.) is somewhat straightforward. However, to secure a service, we must adapt and extend the saying: you can't protect what you don't understand. Only if we understand what is happening in our clusters can we meaningfully safeguard them. Unfortunately, many operators struggle to gain that understanding of their clusters.

In the first two posts of this blog series, we showed how Project Trinidad helps DevOps and Security teams monitor modern applications without changing a single line of application code. In this third post, we describe how we automatically infer cluster behavior from the collected logs and how we can leverage this understanding to secure our applications and workloads.

Understanding workloads using network traffic

As we described earlier, the goal of Project Trinidad is not (only) to capture and visualize network traffic but, more importantly, to use this traffic to get a detailed understanding of the workload and alert when we see traffic that indicates a security problem.

One of the critical points for achieving this is understanding the APIs used within the deployed applications. Without a good understanding of how external clients interact with our internet-facing APIs and how the application's services interact with each other over internal APIs, it is impossible to provide understandable and actionable information to users.

Quick Detour: What are API schemas?

To give the reader context for the later sections, let's take a short detour on what API schemas are. At a high level, API schemas describe how clients should interact with an API: what its operations are, what arguments each operation takes, and what a client can expect in response to invoking a particular method. Essentially, they are the core of any technical API documentation and provide an overview of the capabilities and behavior of an API.

There are many ways an API schema can be represented, but one of the most popular is the OpenAPI Specification (originally proposed by Swagger and later standardized by the OpenAPI Initiative). A simple example specifying a single method of an API is shown in Figure 1:

Figure 1: Partial example API schema as OpenAPI Specification
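
To make the structure concrete, here is a minimal, illustrative fragment of such a specification, written as a Python dictionary; the endpoint path, descriptions, and response texts are hypothetical, and the optional count parameter mirrors the example discussed in the Point Anomalies section below:

```python
# Minimal, illustrative OpenAPI-style fragment describing a single API method.
# All names and descriptions are hypothetical examples, not taken from
# Project Trinidad or any real workload.
api_schema_fragment = {
    "paths": {
        "/cart/items": {
            "get": {
                "description": "List the items currently in the shopping cart.",
                "parameters": [
                    {
                        "name": "count",  # optional result limit
                        "in": "query",
                        "required": False,
                        "schema": {"type": "integer", "minimum": 1},
                    }
                ],
                "responses": {
                    "200": {"description": "Items returned successfully."},
                    "400": {"description": "Invalid or malformed parameters."},
                },
            }
        }
    }
}
```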

As we can see, each individual API method is described using the URL at which it is accessible, followed by a description of the method, details on the parameters accepted by the method, and different types of responses emitted when invoking the method.

API schemas are immensely powerful for many use cases. Not only can they be used by humans to learn how to use an API, but there are also many tools available that use them to auto-generate client code for interacting with an API programmatically. And as we will describe in more detail later, API schemas are also an essential building block for Project Trinidad's understanding of interactions between microservices.

API Schema Discovery

While API schemas for external APIs are (usually) available, they can be incomplete or outdated, and internal APIs - more often than not - remain undocumented. Given how vital API schemas are to understanding the workload behavior, Project Trinidad can discover the schema for all workload APIs using the observed network traffic - and it does so automatically.

Interestingly, getting an understanding of the APIs deployed within a cluster is not only useful for the internal workings of Project Trinidad. It is also incredibly valuable for Operations Teams, who very often lack a clear understanding of application behavior, as a Gartner analyst recently highlighted.

Thus, as a first step towards understanding API traffic, the Project Trinidad backend analyzes the captured API traffic to build a model that approximates the API schema of all external and internal APIs observed in the traffic. This schema is continuously refined as more API traffic is captured over time, and it exposes information such as:

  • All API endpoint URLs invoked by clients,
  • Any parameters embedded within the API URLs (e.g., if the URL path is not static but instead contains API parameters within the path; see the sketch after this list),
  • Query arguments (appended to the endpoint URL),
  • Arguments encoded in the request payload (using various encoding mechanisms),
  • Request headers that are used to transport information to the API endpoints (such as authentication credentials, session tokens, etc.), and
  • Responses emitted by the API depending on how it is invoked.
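
As a toy illustration of one ingredient of this discovery process - and only an assumption about how it might work, not Project Trinidad's actual algorithm - the following sketch infers which URL path segments are parameters by comparing many observed request paths:

```python
# Sketch: infer which URL path segments are parameters by observing that
# they vary across requests while the surrounding segments stay constant.
def infer_path_template(paths):
    """Collapse observed URL paths of the same shape into one template."""
    segments = [p.strip("/").split("/") for p in paths]
    template = []
    for position in zip(*segments):
        distinct = set(position)
        # A segment that varies across requests is likely a path parameter.
        template.append("{param}" if len(distinct) > 1 else distinct.pop())
    return "/" + "/".join(template)

observed = ["/users/42/orders", "/users/7/orders", "/users/1337/orders"]
print(infer_path_template(observed))  # -> /users/{param}/orders
```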

We are actively working on exposing the generated API schemas to users in two ways:

  1. As a standalone artifact (e.g., formatted as an OpenAPI Specification; a sketch of this follows the list), helping Operations Teams understand and visualize the APIs and providing a powerful tool during development and troubleshooting.
  2. Through annotations on API traces with the corresponding API schema, allowing users to seamlessly switch between live-traffic analysis and inferred API specification.
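
For the first option, rendering a discovered endpoint as a standalone artifact could look roughly like the following sketch (the field layout of the discovered data is an assumption for illustration):

```python
# Sketch: render a discovered endpoint as an OpenAPI-style JSON artifact.
import json

discovered = {
    "path": "/users/{param}/orders",
    "method": "get",
    "query_params": {"count": {"type": "integer", "minimum": 1}},
}

artifact = {
    "openapi": "3.0.0",
    "paths": {
        discovered["path"]: {
            discovered["method"]: {
                "parameters": [
                    {"name": name, "in": "query", "schema": schema}
                    for name, schema in discovered["query_params"].items()
                ],
            }
        }
    },
}
print(json.dumps(artifact, indent=2))
```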

Machine Learning on API traffic

The inferred API schemas are not only helpful for users' understanding of their workloads; they are also an integral entry point for automated data analysis using machine learning (ML) techniques.

Our data analysis pipelines leverage the discovered API schemas to understand which API endpoints are invoked and to detect different types of anomalies. Point Anomalies cover abnormal behavior in how an individual API is invoked, whereas Sequence Anomalies cover anomalies in which APIs are invoked and in what order; we describe both types in more detail below.

Point Anomalies

Point Anomalies identify suspicious behavior by looking at a single API invocation and correlating the observed data (e.g., API parameters) against what the ML previously identified as normal.

To give a basic example using the API schema snippet from above, our ML would have learned that the API endpoint may take an optional parameter count and that, when it is set, it is always a number greater than 0. If an attacker used any value that does not meet this specification (e.g., a negative value to trick the API into issuing a price refund), our ML would immediately flag this as a potential attack.
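
A learned constraint of this kind could be checked roughly as follows; this sketch hard-codes the hypothetical count rule, whereas the real models learn such constraints from traffic automatically:

```python
# Sketch of a point-anomaly check for the hypothetical "count" parameter:
# the model has only ever observed positive integers, so anything else is
# flagged as a potential attack.
def check_count(value):
    if not isinstance(value, int):
        return f"count has unexpected type {type(value).__name__}"
    if value < 1:
        return f"count={value} violates the learned constraint count >= 1"
    return None  # within learned behavior

print(check_count(3))    # -> None (benign)
print(check_count(-10))  # -> flagged: violates the learned constraint
```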

But Project Trinidad's ML allows us to dive much deeper than simple schema validation. By observing large volumes of API traffic, we can learn which combinations of API parameters are typically used to trigger a particular type of response from the API, providing a deep understanding of the API semantics. For example, we may learn that an authentication API always requires credentials in the form of a username and password for a successful login, and that usernames and passwords each follow a very particular pattern (e.g., usernames are always email addresses, and passwords have a certain set of allowed characters and a minimum length):

Example API requests showing expected/benign invocations of the authentication API - one successful and one rejected API invocation, respectively.
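
Hypothetically, such benign request/response pairs might look like the following sketch (the endpoint, field names, and response shapes are all assumptions):

```python
# Hypothetical benign invocations of an authentication API: one successful
# login and one rejected login with a wrong password.
successful_login = {
    "request": {
        "method": "POST",
        "path": "/auth/login",
        "body": {"username": "alice@example.com", "password": "s3cret-Pa55"},
    },
    "response": {"status": 200, "body": {"session_token": "eyJhbGciOi..."}},
}

rejected_login = {
    "request": {
        "method": "POST",
        "path": "/auth/login",
        "body": {"username": "alice@example.com", "password": "wrong-password"},
    },
    "response": {"status": 401, "body": {"error": "invalid credentials"}},
}
```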

If we observe an API call that does not match this expected behavior, we can immediately raise an alert that the API is not behaving according to its observed specification:

Example API request showing anomalous invocation of the authentication API. The call is missing the password parameter, but the response nevertheless returns HTTP-200 and a session token, indicating a successful call despite being provided invalid credentials.
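
Continuing the hypothetical example, such a semantic rule could be checked along these lines (again an illustrative sketch, not the actual implementation):

```python
# Sketch: flag a successful login response when the request is missing the
# password parameter that the learned model says is always required.
anomalous_call = {
    "request": {
        "method": "POST",
        "path": "/auth/login",
        "body": {"username": "alice@example.com"},  # password is missing!
    },
    "response": {"status": 200, "body": {"session_token": "eyJhbGciOi..."}},
}

def check_login_semantics(call):
    """Learned rule (assumed): HTTP-200 requires username AND password."""
    body = call["request"]["body"]
    if call["response"]["status"] == 200 and "password" not in body:
        return "successful login response despite missing password parameter"
    return None

print(check_login_semantics(anomalous_call))  # -> flags the anomaly
```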

A deep understanding of API semantics allows Project Trinidad to detect a wide range of attacks against our workloads. And we enable this without requiring the user to provide any information on the intended use of the APIs.

Sequence Anomalies

A more intricate type of anomaly that our ML covers is unexpected sequences of API calls. These can be anomalous sequences in which external clients use the exposed APIs, as well as anomalous sequences of internal API calls within the workflow. To make this possible, the Project Trinidad ML components learn the expected interactions between clients and internal microservices and represent this knowledge as a graph of API endpoints and the sequences in which they can be accessed.

Let's again consider a simplified workload example of a shopping cart application. This workload requires users to first authenticate to the API; it then allows them to add articles to a shopping cart, pay for them, and eventually trigger shipping. Additionally, let's assume that each API invocation is internally tracked for auditing purposes. We can visualize this workflow as follows:
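
The sketch below encodes it as a simple transition graph - an illustrative assumption about the representation, not Project Trinidad's actual data structure:

```python
# Hypothetical encoding of the learned shopping cart workflow: which API
# step may follow which, plus the expectation that every step is audited.
allowed_next = {
    "login":       {"add_to_cart"},
    "add_to_cart": {"add_to_cart", "pay"},  # users may add several articles
    "pay":         {"ship"},
    "ship":        set(),
}
audited_steps = {"login", "add_to_cart", "pay", "ship"}  # each must be logged
```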

This understanding of the expected workflows allows Project Trinidad to detect anomalous interactions. For example, if an attacker is able to trick one of the external APIs into skipping the log entry for a method invocation in the internal auditing service, this is considered an anomaly that triggers an alert.

Likewise, if an attacker is able to add articles to the shopping cart (step 2 in the workflow) and later trigger shipping of the articles (step 4) without going through payment (step 3), this is also considered an anomaly, as the workflow advanced without following the learned, expected sequence of events.
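
Continuing the sketch from above, detecting the skipped-payment anomaly can be as simple as walking the observed sequence of calls and checking every transition against the learned graph (again purely illustrative):

```python
# Sketch: check an observed call sequence against the learned transitions.
allowed_next = {
    "login":       {"add_to_cart"},
    "add_to_cart": {"add_to_cart", "pay"},
    "pay":         {"ship"},
    "ship":        set(),
}

def check_sequence(calls):
    for current, following in zip(calls, calls[1:]):
        if following not in allowed_next.get(current, set()):
            return f"unexpected transition: {current} -> {following}"
    return None

# Shipping is triggered without payment (step 3 was skipped):
print(check_sequence(["login", "add_to_cart", "ship"]))
# -> unexpected transition: add_to_cart -> ship
```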

Detecting Attacks

On initial deployment, Project Trinidad monitors incoming API traffic to gather an understanding of the API schemas and workflows of the workload. As soon as the ML models are trained to an acceptable accuracy, we begin live traffic inference. In this context, acceptable accuracy means that we have observed enough traffic to infer robust models that generate actionable and dependable predictions - after all, a system that cries wolf too often is as bad as (if not worse than) a system that does not alert at all. Furthermore, we periodically evaluate whether the generated models still match the live traffic to detect when the workload may have changed (e.g., after new services or versions were rolled out) and models require retraining.
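
One plausible way to implement such a drift check - sketched here as a simple divergence-rate heuristic, which is an assumption rather than Project Trinidad's documented mechanism - is to retrain once the share of diverging traffic climbs past a threshold:

```python
# Sketch: if the fraction of live calls diverging from the trained models
# rises past a threshold, the workload likely changed and retraining is due.
def needs_retraining(divergent_calls, total_calls, threshold=0.05):
    if total_calls == 0:
        return False
    return divergent_calls / total_calls > threshold

print(needs_retraining(12, 10_000))   # -> False: isolated anomalies
print(needs_retraining(900, 10_000))  # -> True: systematic divergence
```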

In live traffic inference, every API call reported to the backend is matched against the ML models to detect whether the call matches the observed, expected behavior or diverges from it. Whenever a diverging call is found, it is annotated with information on why the system believes the call to be an anomaly, and it is linked with API schema and workflow details to help the user understand the divergence. For example, by seeing what "normal" API parameters or sequences look like and how a specific call diverges from this model, users can intuitively understand how to respond to such an anomaly. Live traffic inference is typically very fast (on the order of a few seconds) and thus allows alerting on anomalies in near real-time.
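
For illustration, such an annotated anomaly might carry a structure along these lines (all field names are hypothetical):

```python
# Hypothetical shape of an annotated anomaly, linking the flagged call to
# the inferred schema and workflow context that explain the divergence.
anomaly_annotation = {
    "call": {"method": "POST", "path": "/cart/ship"},
    "reason": "unexpected transition: add_to_cart -> ship (payment skipped)",
    "linked_schema": "inferred OpenAPI fragment for /cart/ship",
    "linked_workflow": "learned shopping cart step graph",
}
```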

In addition to annotating API traffic, we are actively researching what other ways of handling an anomaly provide value to users of Project Trinidad.

Wrapping Up

As we can see, Project Trinidad helps teams running modern applications on Kubernetes regain control over their clusters and answer crucial security questions. We use a zero-instrumentation approach to collecting workload traffic and automatically analyze the collected data to learn normal application behavior and highlight outliers.

By focusing on what is normal versus what is not, we can identify attacks against the application, regardless of whether an attack is previously known or uses a 0-day exploit. By annotating and visualizing the captured data and cross-linking it with the automatically discovered API schemas and workflows, we can put cluster operators back in control of the overly complex and highly heterogeneous workloads that many modern deployments have evolved into.

What's Next?

VMware Project Trinidad is currently in tech preview, which means it is still under heavy research, design, and development. We are actively investigating how we can best expose results to our users, how we can improve the accuracy of our analysis and ML components, and what other systems we can integrate with for capturing data and reporting anomalies.

This gives early adopters an excellent opportunity to influence where Project Trinidad is headed: we are actively seeking design partners and would welcome the opportunity to start a conversation by walking you through the solution in more depth. If you are interested in collaborating, please reach out to [email protected].