Splunk Inc.

04/19/2024 | News release | Distributed by Public on 04/19/2024 13:24

Incident Review: How To Conduct Incident Reviews & Postmortems

In IT and business, disruptions and outages often accompany change, such as new system rollouts or deployments. An incident review, sometimes called an incident postmortem, is a structured process for analyzing and learning from such incidents within an organization's systems.

The incident review process documents:

  • What went wrong in a given incident.
  • Why an incident happened.
  • Strategies to ensure similar issues don't repeat in the future.

The best part of an incident review is that, when done well, it yields a set of specific actions that improve service quality, such as automating recovery processes.

So, let's take a look at the incident review process. In this article, you will learn what an incident review/postmortem is, the steps involved, and the best practices to maximize valuable takeaways.

What is incident review?

Organizations routinely encounter system, site, and machine failures. These disruptions to the normal service operations of a system are called "incidents", and they range from minor to severe depending on their impact and nature.

Importantly, there's something for teams to learn from almost every incident. And that's what the review is meant to capture: the lessons learned from a critical examination of an event or failure within a system. In general, incident review processes involve:

  1. Documenting the incident.
  2. Diagnosing its root cause.
  3. Evaluating its impact.
  4. Creating an action plan to prevent these incidents.

So we can say that the incident review process is one part of your incident response and incident management strategy.

Interestingly, postmortems have long been part of the aviation and manufacturing industries. Only more recently have these concepts gained popularity in the business and technology space, too.

Why incident postmortems are necessary

Yes, it's true that these reviews are optional, unless of course your team or organization mandates them. Still, we think every smart organization should conduct incident reviews. Here's why. An incident review:

  • Allows you to analyze incidents in detail and truly understand where the breakdowns occurred: in people, processes, or technologies.
  • Supports ongoing high system availability.
  • Clarifies why a system behaves differently after a change, so that you can avoid repeating the same mistakes.

It is a great tool for learning about incident patterns in your systems.

Who performs incident review/postmortem?

Different teams, such as DevOps and SRE teams, collaborate to review and analyze incidents using real-time collaboration tools. Ideally, one person should own the postmortem report. The owner can be anyone, from a DevOps engineer or SRE to an incident manager or commander.

(This function may even live within a CSIRT: computer security incident response team.)

Importantly, every organization or team must define its own criteria for when an incident warrants a review or postmortem. You can automate this trigger so that a review starts automatically when conditions such as the following are met:

  • A certain number of users are affected.
  • Internal or external users report an outage.
  • The organization experiences a certain amount of revenue loss during an outage.
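As a minimal sketch of how the criteria above might be automated, the following Python snippet checks an incident against each condition. The field names, the threshold values, and the choice to trigger a review when any single condition is met are all assumptions made for illustration; each organization would define its own.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IncidentStats:
    """Hypothetical snapshot of an incident; field names are illustrative."""
    users_affected: int
    outage_reported: bool   # an internal or external user reported an outage
    revenue_loss: float     # estimated revenue lost during the outage


# Assumed thresholds, not recommendations.
USER_THRESHOLD = 1000
REVENUE_LOSS_THRESHOLD = 5000.0


def review_reasons(stats: IncidentStats) -> List[str]:
    """Return the criteria this incident meets; an empty list means none."""
    reasons = []
    if stats.users_affected >= USER_THRESHOLD:
        reasons.append("users_affected")
    if stats.outage_reported:
        reasons.append("outage_reported")
    if stats.revenue_loss >= REVENUE_LOSS_THRESHOLD:
        reasons.append("revenue_loss")
    return reasons


def needs_review(stats: IncidentStats) -> bool:
    """Assumption: meeting any one criterion is enough to trigger a review."""
    return bool(review_reasons(stats))
```

Returning the list of matched criteria, rather than only a yes/no answer, lets the review record why it was triggered.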

Steps of incident review/postmortems

Every organization structures its postmortem steps in the way that works best for it. In general, teams create a postmortem report and then hold a meeting to communicate the findings to the wider team.

Let's look at both.

Creating a postmortem report

These are sections to understand and include in any incident review documentation.

Incident summary

The first step of a postmortem is writing a summary of the incident to provide an overview of the initial problem. The summary states the type of incident that happened: for example, a service problem, a bug in the code, or a site failure.
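One way to keep summaries consistent across reports is to capture them in a small structured record. The sketch below is only an illustration: the class names, fields, and plain-text layout are hypothetical, and the three incident types simply mirror the examples mentioned above.

```python
from dataclasses import dataclass
from enum import Enum


class IncidentType(Enum):
    """Incident types taken from the examples in the text; extend as needed."""
    SERVICE_PROBLEM = "service problem"
    CODE_BUG = "bug in the code"
    SITE_FAILURE = "site failure"


@dataclass
class IncidentSummary:
    """Hypothetical summary section of a postmortem report."""
    title: str
    incident_type: IncidentType
    overview: str

    def render(self) -> str:
        """Format the summary as plain text for the report."""
        return (
            f"Incident: {self.title}\n"
            f"Type: {self.incident_type.value}\n"
            f"Overview: {self.overview}"
        )


summary = IncidentSummary(
    title="Checkout errors",
    incident_type=IncidentType.CODE_BUG,
    overview="Payments failed after a deployment.",
)
print(summary.render())
```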

Identifying the root cause

This step involves identifying the incident's root cause and what triggered it. The system automatically sends alerts to the team via email or call. Different types of incident triggers include:

Often, IT or SRE team members must respond to alerts immediately to resolve problems. A backup person must always be available in case the primary on-call responder is unavailable.
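The backup rule can be expressed as a simple escalation order. The following is a rough sketch under stated assumptions: the function name, the ordered on-call list, and the set of currently available people are hypothetical, and real alerting tools implement escalation in their own ways.

```python
from typing import List, Optional, Set


def pick_responder(on_call_order: List[str], available: Set[str]) -> Optional[str]:
    """Return the first available person in escalation order.

    on_call_order lists the primary responder first, then backups.
    Returns None if nobody on the list is available.
    """
    for person in on_call_order:
        if person in available:
            return person
    return None


# The primary is unavailable, so the alert goes to the backup.
print(pick_responder(["primary", "backup"], available={"backup"}))
```

A return value of None signals a gap in coverage, which is itself worth recording in the review.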

Impact on users

Not all incidents are the same. Severity varies: an incident can affect a single user or the entire site. A severe incident occurs, for example, when a service is down for all users or when data is compromised.

A minor incident results in only a minor inconvenience. Either way, with an incident response plan ready, you can analyze how the incident impacted users.
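To make the distinction concrete, here is a deliberately coarse sketch that labels an incident from its user impact. The two-level scale, the function name, and the parameters are assumptions for illustration only; real severity schemes usually have more levels and more inputs.

```python
def classify_severity(users_affected: int, total_users: int, data_compromised: bool) -> str:
    """Label an incident using an assumed, simplified two-level scale."""
    all_users_down = total_users > 0 and users_affected >= total_users
    if data_compromised or all_users_down:
        # Matches the text: down for all users, or data compromised.
        return "severe"
    if users_affected > 0:
        return "minor"
    return "none"


print(classify_severity(users_affected=1, total_users=500, data_compromised=False))
```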

(Related reading: Understand how incident severity levels work.)