Expedia Group Inc.

09/07/2021 | Press release | Distributed by Public on 09/07/2021 08:18

Improving First Input Delay by Leveraging gRPC

For Vrbo landing pages, performance and response times are always a priority. Whether it's a real user finding a destination through a web search or a bot crawling web content, we need good response times to improve engagement and SEO.

Of the many calls needed to get the whole content of a landing page, the first one links a path to its destination id in the system. That response is a blocker for the rest of the calls, which depend on the destination identifier, so how this first call responds is crucial for metrics like First Input Delay (part of Core Web Vitals), which measures the time from a user's first interaction with a page to the moment the browser is able to process event handlers in response to that interaction. In the context of landing pages, improving the performance of this first call could directly improve First Input Delay.

With that in mind, we started thinking about adopting gRPC in our platform, beginning with that service. We built a gRPC server to replicate our http service, and we wanted to compare their performance to understand whether pushing for gRPC in our platform and sending more traffic to the gRPC version could really improve our performance.
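As an illustration, a gRPC counterpart of the two http operations could be described with a service definition along these lines. The service and message names below are hypothetical sketches, not the actual Vrbo contract:

```protobuf
syntax = "proto3";

// Hypothetical contract mirroring the two http operations:
// path -> destination lookup, and destination -> path reverse lookup.
service DestinationService {
  rpc Lookup (LookupRequest) returns (LookupResponse);
  rpc ReverseLookup (ReverseLookupRequest) returns (ReverseLookupResponse);
}

message LookupRequest {
  string path = 1;
}

message LookupResponse {
  string destination_id = 1;
  // Destination attributes returned alongside the identifier.
  map<string, string> attributes = 2;
}

message ReverseLookupRequest {
  string destination_id = 1;
}

message ReverseLookupResponse {
  string path = 1;
}
```

Because the wire format is generated from this contract, both the http and gRPC versions of the service can serve the same data while differing only in transport and encoding.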

Our metrics and Datadog dashboards indicated that gRPC performance was quite promising, so we wanted to run a load test to stress both of our options, http and gRPC, with the same configuration and datasets to check their limits and differences.

Hypothesis

gRPC is designed for low-latency, high-throughput communication, which makes our service a perfect candidate to benefit from it.

We've implemented a gRPC server in our service, added metrics, and started to serve production traffic controlled by an A/B test; now we want to compare http and gRPC performance under the same configuration and input data.

We're expecting that the test reports will show an improvement in both latencies and throughput.

Application under test

The application under test has two operations: a lookup of destination data (identifiers and attributes) for a given path, and a reverse lookup (getting a path for a given identifier).

It gets its data from two sources:

  • The primary one is RocksDB, populated from our paths Kafka topic.
  • As a fallback, if the request is not found in RocksDB, our Cassandra database.
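The two-tier read path above can be sketched as follows. This is a minimal illustration, not the real service code: the client objects and method names are assumptions.

```python
class DestinationLookup:
    """Hypothetical sketch of the lookup: RocksDB first, Cassandra as fallback."""

    def __init__(self, rocksdb_client, cassandra_client):
        self._rocks = rocksdb_client        # primary store, fed by the paths Kafka topic
        self._cassandra = cassandra_client  # fallback store

    def lookup(self, path):
        """Resolve a URL path to its destination record."""
        record = self._rocks.get(path)
        if record is not None:
            return record
        # Key missing from RocksDB: fall back to Cassandra.
        return self._cassandra.get(path)
```

The same shape applies to the reverse lookup, with the destination identifier as the key and the path as the value.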

Our http service responds with a json payload, while gRPC returns its response as a binary object, which is usually smaller than a json payload containing the same data. Given the size of our json payload, we don't expect much improvement from the gRPC version on that front, but operations with bigger payloads than ours could benefit more from using it.

Application performance

For measuring latencies we use the p99 and p95 metrics. These are percentile values that indicate the upper threshold for the given percentage of calls; e.g., a p99 of 35 ms indicates that 99% of the calls take 35 ms or less. Both lookups were originally built as http endpoints with quite good performance:
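For readers unfamiliar with these metrics, here is one common way a percentile is computed from raw latency samples (the nearest-rank method; this helper is our own illustration, not part of the service):

```python
import math

def percentile(samples, pct):
    """Return the value at or below which pct% of the samples fall (nearest rank)."""
    ordered = sorted(samples)
    # Nearest-rank index for the requested percentage, clamped to a valid index.
    rank = max(0, math.ceil(pct / 100 * len(ordered)) - 1)
    return ordered[rank]

latencies_ms = [12, 14, 15, 16, 18, 20, 22, 25, 30, 35]
p99 = percentile(latencies_ms, 99)  # -> 35: 99% of calls took 35 ms or less
```

In practice these values come aggregated from our Datadog metrics rather than computed by hand, but the interpretation is the same.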