Splunk Inc.

Don’t Live in the Past - APM 3.0 and Why You Need It

"You're living in the past, man! You're hung up on some clown from the 60s, man!"
- Eric The Clown

"Observability-washing" APM

Application Performance Monitoring (APM) as a discipline and as a collection of supporting technologies has evolved rapidly since a distinct recognisable market for APM products first emerged in the 2007 - 2008 time frame. While there are many who would argue that APM has mutated into or been replaced by Observability, it makes more sense to see APM as one of many possible use cases now able to exploit the functionalities that Observability brings to the table - particularly when combined with AI. In fact, the failure to distinguish APM from Observability has, unfortunately, allowed vendors of more archaic APM products and services to 'Observability-wash' their offerings. After all, if APM has evolved into Observability then, by a marketing sleight of hand, everything that was APM last year is now Observability, almost by definition!

Of course, the main problem here is not the fact that vendors are making dubious claims; it is, instead, that there is genuine confusion in the user community, which leads to the inappropriate deployment of technology and a consequent undermining of an enterprise's ability to deliver business value to its customer base. To dispel some of this confusion, let us try to get clear on the actual relationship between APM and Observability and articulate a set of questions one can ask in order to determine whether a prospective solution is providing APM genuinely enhanced by Observability or an older form of APM dressed up in the language of Observability.

Proto-APM

It is possible to analyse the history of APM into four distinct stages. You might say that APM has endured four paradigm shifts if you like that terminology. The first stage or paradigm could be called 'proto-APM'. As digital business and customer-facing applications began to assume importance in the noughties (and became a source of anxiety in the wake of the 2007 contraction), businesses demanded that IT focus on application as well as infrastructure performance and availability. At the time, specialised APM tooling was not available, so the vendors of the day, in trying to meet customer demand, repurposed existing event management platforms, dating back to the 1960s in terms of basic design and capabilities. Although the object of concern was now the application portfolio rather than the server or network infrastructure, the 'Big Four' (IBM, HP, CA, and BMC), as they were collectively called at the time, simply extended their infrastructure-oriented 'Frameworks'.

Now, it is important to remember that the infrastructures of the day were topologically static, architecturally monolithic, and behaviourally predictable. Unanticipated events were rare and invariably an indication that something had gone terribly wrong, solely by virtue of the fact that they were unanticipated. Hence, the availability and performance systems built to deal with such infrastructures relied heavily on relatively simple, predefined (and hence unchanging) models, sampling, and spotting the rare exception. Applications were, it was believed, only mildly more volatile and modular than the infrastructures they ran on, so targeting them with the Framework approach seemed like a plausible way forward. It should also be noted, however, that despite similarities in structure and behaviour, there were sharp boundaries between the application realm and the infrastructure realm, and this will become an important driver of our history going forward.

APM 1.0

Users and vendors shifted to the second stage or paradigm during the 2012/2013 time frame, when the first technologies designed from the bottom up specifically to monitor and manage end-to-end application behaviour came to market. Hence, we can justly call this stage 'APM 1.0'. The great migration to the cloud had begun across industries in North America, and at a much slower although equally sure pace in EMEA and the Asia Pacific region, and, while infrastructure management was initially seen as one of the main tasks of would-be cloud service providers, the management of cloud-based applications was widely considered to be the proper preserve of the businesses themselves. This emerging division of labour was, of course, made possible by the sharp lines separating infrastructure and application mentioned above. In fact, it was that division of labour that made the cloud move palatable to the many enterprises that had concerns about loss of control over the increasingly central digital channel through which they interacted with their customers.

Once in the cloud, however (or in anticipation of moving to the cloud), application architectures started to change shape. At first, they more or less retained the monolithic, topologically rigid features of on-premise applications, but gradually they became more modular, distributed, and topologically dynamic. Not only did application behaviour become less predictable, but the ability to infer end user or customer experience from knowledge of state changes within the application degraded considerably. As a result, the returns from proto-APM technologies diminished rapidly, at first for cloud-resident applications but then, as cloud-inspired architectures came to dominate the developer mindset, across the corporate application portfolio.
In response to the changing requirements, a spontaneous collaboration among users, industry analysts, and entrepreneurs resulted in the definition (or, better, proclamation) of a 'five-dimensional model' for APM. In order to deal with the realities of cloud-driven application architectures, APM products needed to support and loosely coordinate five distinct types of functionality:

  1. It must be possible to monitor end user or customer experience of application performance directly.
  2. It must be possible to dynamically discover the logical topology of an application in a manner that does not depend on some kind of hardware asset-based model.
  3. It must be possible to capture detailed code execution steps on application server virtual machines.
  4. It must be possible to trace the progress of user-defined transactions through the nodes of the discovered topology.
  5. A rich set of statistical and visual analytics tools must be available to deal with complex data sets generated by the other four functional types.

(It is worth mentioning that, despite the novelty of the five functional types, the underlying technology closely resembled that of the Big Four's repurposed event management systems - data was sampled and packaged into short event records that were compared against a predefined model, and alerts were generated following a mismatch between model and event.)
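
To make that pattern concrete, here is a minimal, purely hypothetical sketch (in Python) of the 'predefined model plus exception' approach: sampled events are compared against static thresholds, and an alert fires only on a mismatch. The metric names, thresholds, and event values are all invented for illustration.

    # Illustrative sketch only: sampled events are checked against a static,
    # hand-maintained model, and anything the model anticipates is discarded.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Event:
        source: str   # e.g. an application server node
        metric: str   # e.g. "response_time_ms"
        value: float

    # The "predefined model": static thresholds per metric (hypothetical values).
    STATIC_MODEL = {
        "response_time_ms": 2000.0,
        "error_rate_pct": 1.0,
    }

    def check_event(event: Event) -> Optional[str]:
        """Return an alert if the sampled event violates the predefined model."""
        threshold = STATIC_MODEL.get(event.metric)
        if threshold is not None and event.value > threshold:
            return f"ALERT: {event.source} {event.metric}={event.value} exceeds {threshold}"
        return None  # events that match the model are silently dropped

    # A handful of sampled events (in practice, a small fraction of what the application emits).
    for e in [Event("app-server-1", "response_time_ms", 350.0),
              Event("app-server-2", "response_time_ms", 4200.0)]:
        alert = check_event(e)
        if alert:
            print(alert)

The point of the sketch is simply that nothing outside the predefined model can ever be surfaced; the approach works only as long as the environment changes slowly enough for the model to keep up.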

Following the dictates of the five-dimensional model, a new crop of vendors (most notably, AppDynamics, New Relic, Dynatrace, and eventually, DataDog) brought APM 1.0 products to market and, in very short order, displaced the Big Four in enterprises across the global economy.

APM 1.0 proved to be extraordinarily successful: by 2015, approximately a quarter of all enterprise applications in North America, cloud-resident or not, were being monitored by APM 1.0 tooling, and global spending on APM cracked the $10 billion mark. But applications did not stop changing shape, and the very success of the five-dimensional model turned it into a straitjacket for many users, particularly those charged with application development. Their frustrations with the products provided by the market, and their creativity in seeking end runs around APM 1.0's shortcomings, led to the emergence of a new paradigm or stage: APM 2.0.

APM 2.0

First, let us take a look at what was happening with regard to application evolution; then it will be easy to see why DevOps practitioners first, but then also many IT Operations practitioners, staged a revolt against the five-dimensional model. The rate of application modularisation increased as object orientation gave way to a more eclectic approach that, while not abandoning objects, introduced more and more functional constructions, wreaking havoc on the typing conventions of languages like Java. The focus shifted to the concept of micro-services, relatively small packages of functions and methods that interacted with one another in a loosely coupled manner against an ever-changing topological backdrop. The components themselves had ever shorter lifetimes and applications were expected to be continually modified by development teams, with the number of changes increasing by an order of magnitude in many large enterprises.

Finally, once a micro-service frame of mind had been adopted, the rigid borders between application and infrastructure began to break down. Yes, there were micro-services closer to the 'surface', where the application interacted with users or other applications, which would unquestionably be recognised as 'application-like' components, but those services called upon other services below the 'surface' and those services in turn called yet more deeply placed services. At what point did these components become infrastructure components? And remember, they are all changing all the time, flitting in and out of existence with life spans, in many cases, measured in microseconds.

Returning to the five-dimensional model, it is now easy to see why its attraction faded rapidly. Its call for logical topologies delivered little value when topologies would shift structure in seconds. Deep dives based on bytecode instrumentation provided a very limited perspective on application behaviour. They might provide insight into what was happening within a micro-service (if that service was 'big enough' to support the invasive instrumentation required), but when most of the application 'action' was taking place in the spaces between micro-services, where messages were being passed, they were more likely to generate oversights than insights when it came to understanding end-to-end application behaviour. Transaction tracing functionality (dimension 4 of APM 1.0) might have come to the rescue here, but popular implementations of this functionality involved the injection of rather large tokens into the code base, piling one invasive procedure on top of the invasiveness already required for deep-dive application server monitoring, so that, by 2015, this type of functionality was rarely deployed in practice. Finally, the analytics that came packaged with most APM suites were woefully inadequate to the vast explosion in the size and complexity of the self-descriptive data sets generated by applications and harvested by the APM tooling.

Indeed, for both DevOps and IT Ops practitioners it became more and more apparent that APM should be treated as a big data problem and, as a consequence, these communities began to turn to technologies that gave them direct access to the underlying telemetry, without layers of intervening functionality intended to provide contexts for that telemetry. If a developer could easily capture and display all of the metrics or logs generated by an application, he could, perhaps with the support of effective visualisation, make sense of what was happening and spot troubling anomalies. Furthermore, since we were just talking about streams of telemetry, there were no concerns about topologies getting outdated or even about the application/infrastructure divide. Faced with an amorphous, ever-shifting population of interacting micro-services and functions, practitioners could just follow wherever the data led them.
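
As a rough sketch of this APM 2.0 working style, the Python fragment below (with invented data and thresholds) takes a raw latency stream exactly as emitted and lets a simple trailing-window z-score surface anomalies, with no predefined topology or application model in sight.

    # Sketch: flag anomalies directly in raw metric telemetry using simple statistics.
    from statistics import mean, stdev

    def flag_anomalies(samples, window=30, z_threshold=3.0):
        """Yield (index, value) pairs whose z-score against the trailing window is extreme."""
        for i in range(window, len(samples)):
            trailing = samples[i - window:i]
            mu, sigma = mean(trailing), stdev(trailing)
            if sigma > 0 and abs(samples[i] - mu) / sigma > z_threshold:
                yield i, samples[i]

    # Hypothetical latency stream (ms) captured straight from the application's telemetry.
    latency_ms = [50 + (i % 7) for i in range(120)] + [480] + [50 + (i % 7) for i in range(30)]

    for index, value in flag_anomalies(latency_ms):
        print(f"sample {index}: latency {value} ms looks anomalous")

In practice, of course, the visualisation and analysis were done in dedicated metrics and log platforms rather than in ad hoc scripts, but the division of labour is the same: the practitioner follows the data, not a model.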

And so, without the sanction of analysts or vendors, by 2016 a new stage in the history of APM had been reached: an APM 2.0 dominated by big data capture and analysis technologies, specialised according to different types of telemetry (most usually metrics or logs), and by the growing perception on the part of users that vendors focused on metrics and visualisation (e.g., Grafana) or logs (e.g., Splunk) should play a central role in the management of applications. (It is important to stress that the rise of APM 2.0 was almost entirely a user-driven paradigm shift. Many of the providers of APM 2.0 technology only recognised the role that their products were playing in retrospect, if at all.)

If the APM 2.0 vendors were blissfully unaware of the revolution they were enabling, many of the APM 1.0 vendors were painfully aware of what was taking place on the ground. Fearing a reenactment of the almost overnight market shift that had allowed them to displace the Big Four, they began to patch telemetry ingestion and visualisation capabilities into their APM product portfolios. As a result, many of the APM 1.0 players morphed into what might be called an 'APM 1.5' status, offering their users what remained a fundamentally five-dimensional technology with some metrics and log management capability attached at the edges.

APM 3.0

Terminology began to shift at this point, and users, opinion makers, and some vendors revived an old Optimal Control Theory term - Observability - to describe what they were doing with their telemetry ingestion and visualisation software. And there was definitely justice in using language that did not suggest that their activities were restricted to applications. As indicated above, the line between application and infrastructure had become blurred. Furthermore, cloud-based infrastructures themselves were becoming increasingly modular, dynamic, distributed, and ephemeral, while users increasingly took back control of infrastructure configuration and management. In other words, infrastructure had also come to require (and receive) a big data-style treatment.

The pandemic arrived and, while many aspects of the economy slowed down, the evolution of APM did not. DevOps and IT Ops practitioners were now wallowing in huge volumes of telemetry unfiltered by finicky APM (or infrastructure management) technologies, and while this was definitely seen as an improvement, the truth of the matter is that the need to understand and work with data sets so large and fine-grained brought a whole new array of challenges. First of all, the data sets proved resistant to sampling, due both to the uncoupled and rapidly changing nature of the underlying systems generating the data and to the high dimensionality of the data, which rendered many traditional statistical techniques ineffective. That meant that to work meaningfully with telemetry, one needed to work with ALL the telemetry available. Toil levels went through the roof on account of volume alone. Secondly, and perhaps more importantly, the volume and granularity of the data sets made it almost impossible for human beings to see the patterns governing the data and hence threatened to render the data sets useless for the more difficult problems pressing the practitioner.

The solution to the intelligibility and toil issues came in two steps:

  • First, new lightweight approaches to tracing became practical. Eschewing reliance on predefined topologies and the injection of tokens into code, these approaches dynamically mapped out an ever-shifting topology as messages were passed from one micro-service to another, tracked by short, simple tags associated with the micro-services themselves (not the messages). The resulting maps structured the vast heaps of metric and log data and, while the maps themselves often became complex, near-unintelligible spiderwebs of connections among nodes, they laid the groundwork for further analysis and allowed the practitioner to, at least roughly, place his log and metric data into a spatiotemporal framework. (A toy sketch of this idea follows this list.)
  • Second, AI came to play a critical role in taking the analysis further. Now, AI itself is a broad collection of rapidly evolving technologies, including logical inference machines, statistical learning functionality, and, most recently, large language models, and it has a range of applications far beyond Observability or APM. Nonetheless, AI has already been shown to provide the missing ingredient that converts raw telemetry, or even telemetry structured via some kind of tracing mechanism, into actionable information (a small illustration follows below). As a result, users are now looking to the results of yet another paradigm shift - this time to an APM 3.0 that starts from a foundation in telemetry-driven Observability and complements access to data with algorithm-driven insights into what the data means and how problems and incidents may be resolved.
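
As a toy sketch of the tag-based tracing idea (not any vendor's actual implementation), the Python fragment below lets each hypothetical micro-service carry a short tag and records a caller-to-callee edge every time a message is passed, so a topology map emerges from the traffic itself rather than from a predefined model.

    from collections import defaultdict

    # caller tag -> set of callee tags, assembled on the fly from observed calls
    service_map = defaultdict(set)

    def traced_call(caller_tag, callee_tag, handler, payload):
        """Record the edge implied by this message, then invoke the callee."""
        service_map[caller_tag].add(callee_tag)
        return handler(payload)

    # Hypothetical micro-services: each is just a function plus a short tag.
    def payments(payload):
        return f"charged for {payload}"

    def checkout(payload):
        # checkout fans out to payments as part of handling the request
        return traced_call("checkout", "payments", payments, payload)

    def frontend(payload):
        return traced_call("frontend", "checkout", checkout, payload)

    frontend("order #42")

    for caller, callees in sorted(service_map.items()):
        print(caller, "->", ", ".join(sorted(callees)))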

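And as one very small illustration of the kind of algorithmic assistance meant here (a sketch of the idea, not any product's actual algorithm), the fragment below takes a dynamically discovered service map plus per-service error rates, both invented, and suggests a probable root cause by preferring the most downstream unhealthy dependency.

    # Sketch: rank probable root causes over a discovered service map.
    service_map = {                      # caller -> callees, as discovered from traffic
        "frontend": ["checkout"],
        "checkout": ["payments", "inventory"],
        "payments": [],
        "inventory": [],
    }
    error_rate = {"frontend": 0.12, "checkout": 0.11, "payments": 0.30, "inventory": 0.01}
    UNHEALTHY = 0.05   # hypothetical error-rate threshold

    def probable_root_causes(service, seen=None):
        """Return unhealthy services reachable from `service` that have no unhealthy callees."""
        seen = seen if seen is not None else set()
        if service in seen:
            return []
        seen.add(service)
        unhealthy_callees = [c for c in service_map.get(service, [])
                             if error_rate.get(c, 0) > UNHEALTHY]
        if error_rate.get(service, 0) > UNHEALTHY and not unhealthy_callees:
            return [service]
        causes = []
        for callee in unhealthy_callees:
            causes.extend(probable_root_causes(callee, seen))
        return causes

    print(probable_root_causes("frontend"))   # -> ['payments']

Real AIOps functionality works over far richer signals and models than this, but the shape of the contribution is the same: the algorithm, not the human, does the first pass at turning structured telemetry into a candidate explanation.
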
Six Questions to Ensure You Are Not Living in the Past

In summary, then, when deciding on an APM technology adequate for modern environments, one should ask the following six questions:

  1. Does the technology support full-fidelity access to the telemetry generated by the applications it purports to monitor?
  2. Does the technology allow one to easily integrate visualisation and analysis of all telemetry types?
  3. Does the technology support a seamless end-to-end view of the environment, disregarding the barriers between application and infrastructure when such barriers are irrelevant to an analysis or diagnosis?
  4. Does the technology generate patterns of intelligibility, i.e., context and meaning, directly from the data being ingested, minimising reliance on predefined data models?
  5. Does the technology support a broad variety of AI algorithmic styles, including rule-based inference engines, neural networks, and large language models?
  6. Does the technology scale as needed to the required data volumes without practical constraints?

An answer of yes to all six will just about assure the practitioner that he is working with an APM 3.0 solution adequate to current and (at least near-future) requirements. An answer of no to any of them suggests that he is working with an APM 2.0 or, worse, an APM 1.5 solution, adequate to the legacy portion of an application portfolio but unlikely to meet the demands of a modern digital business.