
Operating Performance Analytics Machine Learning Services on Any Cloud with VMware Tanzu

By Chien-Chia Chen, Tzu Yi Chung, Chengzhi Huang, Paul Huang, and Rushita Thakkar

VMware's Performance Engineering team develops and operates many critical machine learning (ML) services across the VMware product portfolio. As these ML services scale, more operational challenges arise, from multi-cloud operations and offline ML pipeline monitoring and alerting, to online inference, model, and data monitoring and alerting. This blog shares how we leverage VMware Tanzu and the open-source ML stack to streamline the operations of multi-cloud ML services.

Our Story

ML plays an important role at VMware, both in VMware's SaaS operations and in its development lifecycle. In an earlier post, we presented how we leverage VMware products to scale the performance analytics data infrastructure beyond a million time series (numerical readings of performance metrics measured periodically). Beyond the data infrastructure, more operational challenges arise as we scale the downstream ML services. Figure 1 shows our typical ML service lifecycle, which consists of offline components hosted in on-premises clusters and online inference services that may run on any cloud, close to the target application. Each of these components has different operational challenges at scale.

Figure 1. Multi-cloud ML operations (MLOps) lifecycle

The two primary offline components are feature preprocessing and model training. The preprocessing jobs query features from the feature store, which is a part of the data infrastructure, and stream the transformed features to the training jobs. Kubeflow orchestrates these preprocessing and training jobs.
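To give a concrete flavor of how such jobs are wired together, the sketch below shows a minimal Kubeflow Pipelines definition with a preprocessing step feeding a training step. It is an illustration only: it assumes the Kubeflow Pipelines v2 SDK (`kfp`), and the component bodies, pipeline name, and feature query parameter are hypothetical placeholders rather than our production code.

```python
# Minimal sketch of a preprocessing -> training pipeline on Kubeflow Pipelines.
# Assumes the kfp v2 SDK; component logic and names are illustrative placeholders.
from kfp import dsl, compiler


@dsl.component(base_image="python:3.10")
def preprocess(feature_query: str, features_out: dsl.Output[dsl.Dataset]):
    # Placeholder: query the feature store and write transformed features.
    with open(features_out.path, "w") as f:
        f.write(f"features for: {feature_query}\n")


@dsl.component(base_image="python:3.10")
def train(features_in: dsl.Input[dsl.Dataset], model_out: dsl.Output[dsl.Model]):
    # Placeholder: train a model from the preprocessed features.
    with open(model_out.path, "w") as f:
        f.write("trained-model\n")


@dsl.pipeline(name="perf-analytics-training")
def training_pipeline(feature_query: str = "cpu_usage_last_30d"):
    prep = preprocess(feature_query=feature_query)
    train(features_in=prep.outputs["features_out"])


if __name__ == "__main__":
    # Compile to an IR YAML that can be submitted to a Kubeflow Pipelines cluster.
    compiler.Compiler().compile(training_pipeline, "training_pipeline.yaml")
```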

The biggest operational challenge when running preprocessing and training jobs at scale is to properly monitor and handle failures to ensure jobs do not get stuck indefinitely. Never-ending jobs were once a common problem in our production cluster, whether it was due to bugs in the code (such as a wrong API token) or cluster misconfigurations (such as missing required persistent volumes).

We channel relevant Kubernetes pod metrics, such as resource utilization (CPU, memory, network) and time spent in different states (pending, failed, running, etc.), to Prometheus, a monitoring system and time-series database. A set of Grafana dashboards monitors these metrics; Figure 2 shows an example of the training job dashboard. We also configured a set of Prometheus rules to warn or alert when failures that require engineer intervention occur, such as when a cluster is overloaded or a job remains in one state for an unreasonably long time (pending for hours or running for days).
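As a rough illustration of the kind of condition these rules encode, the snippet below queries the Prometheus HTTP API for pods that have been Pending for too long. The Prometheus endpoint, the kube-state-metrics metric name (`kube_pod_status_phase`), and the threshold are assumptions chosen for the example; in production, this logic lives in Prometheus alerting rules rather than a standalone script.

```python
# Sketch: flag pods that have been Pending for too long by querying Prometheus.
# Assumes kube-state-metrics is scraped (kube_pod_status_phase) and that
# Prometheus is reachable at PROM_URL; both are illustrative assumptions.
import requests

PROM_URL = "http://prometheus.example.local:9090"  # hypothetical endpoint
MAX_PENDING_HOURS = 2

# PromQL: pods whose Pending phase has been continuously 1 for the whole window.
query = (
    f'min_over_time(kube_pod_status_phase{{phase="Pending"}}'
    f'[{MAX_PENDING_HOURS}h]) == 1'
)

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}, timeout=10)
resp.raise_for_status()

for result in resp.json()["data"]["result"]:
    labels = result["metric"]
    print(f"ALERT: pod {labels.get('namespace')}/{labels.get('pod')} "
          f"pending for over {MAX_PENDING_HOURS}h")
```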

Figure 2. Dashboard for monitoring training jobs

Once the models are trained, the continuous integration and continuous deployment (CI/CD) pipeline pushes the model image to the model registry on Harbor and reconfigures the downstream inference services to promote models to production if specified. The primary operational challenge in this phase is that we need to run the inference services close to the target product, which can be either on-premises in VMware's data centers or in public clouds such as VMware Cloud on AWS. To streamline this multi-cloud deployment and operation, we package all the required Kubernetes resources using Carvel tools. The same Carvel package repository can then be deployed on any cloud using VMware Tanzu. Once deployed, all the downstream inference services are then managed by the CI/CD jobs to load their required model images to serve the inference requests.
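To make this packaging step concrete, the sketch below wraps the Carvel `imgpkg` CLI to push a package bundle to a Harbor project. The registry host, project, bundle directory, and version are hypothetical, and our actual CI/CD pipeline also reconfigures the downstream inference services, which is not shown here.

```python
# Sketch: push a Carvel package bundle to a Harbor registry with imgpkg.
# The registry host, project, and local bundle path are hypothetical; the
# imgpkg CLI must be installed and already logged in to the registry.
import subprocess

HARBOR_REPO = "harbor.example.local/ml-platform/inference-bundle"  # hypothetical
BUNDLE_DIR = "./carvel-bundle"   # directory containing config/ and .imgpkg/
VERSION = "1.4.0"

subprocess.run(
    ["imgpkg", "push",
     "-b", f"{HARBOR_REPO}:{VERSION}",   # bundle image reference in Harbor
     "-f", BUNDLE_DIR],                  # local bundle contents to package
    check=True,
)
print(f"Pushed bundle {HARBOR_REPO}:{VERSION}")
```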

The two most critical operational metrics of the online inference services are their uptime and their inference response time. Our inference services are based on the open-source Seldon Core, so we take advantage of the seldon-core-analytics library to ingest these metrics into Prometheus. Several Prometheus rules are also configured to alert service owners when, for example, a service outage occurs or sustained high inference latency is detected.
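The sketch below illustrates the second of these metrics: it sends a request to a Seldon Core REST prediction endpoint and measures the response time on the client side. The ingress host, namespace, deployment name, and payload shape are illustrative assumptions; in production, these latencies are ingested into Prometheus by seldon-core-analytics rather than measured ad hoc.

```python
# Sketch: probe a Seldon Core inference endpoint and report its response time.
# The ingress host, namespace, deployment name, and payload shape are
# illustrative assumptions, not our production configuration.
import time
import requests

SELDON_URL = ("http://ingress.example.local/seldon/"
              "ml-prod/perf-anomaly-detector/api/v1.0/predictions")  # hypothetical
payload = {"data": {"ndarray": [[0.42, 0.17, 0.93]]}}  # dummy feature vector

start = time.monotonic()
resp = requests.post(SELDON_URL, json=payload, timeout=5)
latency_ms = (time.monotonic() - start) * 1000

resp.raise_for_status()
print(f"prediction={resp.json().get('data')}, latency={latency_ms:.1f} ms")
```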

Figure 3. Dashboard for monitoring inference services

In addition to the above challenges in operating individual components of the MLOps pipeline, it is equally challenging to monitor and react to data distribution drifts and model performance drifts over time. Our paper presented at the Data-Centric AI Workshop '21 describes these challenges and the approaches we take to tackle them in detail [1]. In a nutshell, we monitor the KL-divergence of the monthly data distributions, as shown in Figure 4. When the data distribution drifts beyond a certain threshold, model retraining is triggered automatically, followed by a series of offline evaluations before a new model is promoted to production.
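The snippet below is a minimal sketch of this kind of drift check: it bins two months of feature samples into histograms, computes the KL-divergence between them with SciPy, and flags when the drift exceeds a threshold. The sample data, bin count, threshold, and retraining hook are placeholders; the actual pipeline is described in [1].

```python
# Sketch: compare this month's feature distribution against last month's
# using KL-divergence and trigger retraining when drift exceeds a threshold.
# The data, bin count, threshold, and retraining hook are all placeholders.
import numpy as np
from scipy.stats import entropy

KL_THRESHOLD = 0.1  # illustrative drift threshold


def kl_divergence(reference: np.ndarray, current: np.ndarray, bins: int = 50) -> float:
    # Bin both samples on a common range, then compute KL(current || reference).
    lo = min(reference.min(), current.min())
    hi = max(reference.max(), current.max())
    p, _ = np.histogram(current, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(reference, bins=bins, range=(lo, hi), density=True)
    eps = 1e-9  # avoid division by zero in empty bins
    return float(entropy(p + eps, q + eps))


last_month = np.random.normal(0.0, 1.0, 10_000)   # placeholder feature samples
this_month = np.random.normal(0.3, 1.2, 10_000)

drift = kl_divergence(last_month, this_month)
if drift > KL_THRESHOLD:
    print(f"KL-divergence {drift:.3f} exceeds threshold; triggering retraining")
    # e.g., kick off the training pipeline sketched earlier
```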

Figure 4. KL-divergence of monthly data distributions

Conclusion

As we see increasing demand for running applications on any cloud, we share in this blog how VMware Tanzu helps us streamline the multi-cloud operation of our MLOps pipeline. Figure 5 below summarizes the end-to-end tech stack of our production performance analytics MLOps pipeline, which is packaged as a Carvel package repository that can be deployed and operated through VMware Tanzu. In addition to multi-cloud operation, it is important to monitor and alert on each component throughout the end-to-end MLOps lifecycle, from Kubernetes pods and inference services to data and model performance. Our multi-year experience shows that, in order to have a successful long-term ML service or product, all of these operational challenges must be addressed seriously.

Figure 5. MLOps platform on VMware Tanzu

References

[1] X. Huang, A. Banerjee, C.-C. Chen, C. Huang, T. Y. Chuang, A. Srivastava, R. Cheveresan, "Challenges and Solutions to build a Data Pipeline to Identify Anomalies in Enterprise System Performance," in Data-Centric AI Workshop, 35th Conference on Neural Information Processing Systems (NeurIPS), December 2021, Virtual.