01/23/2020 | News release | Distributed by Public on 01/23/2020 16:23
If you've ever been an on-call SRE, you're familiar with alert fatigue: the burned-out feeling that creeps in after responding to alert after alert from tons of services and tools across your stack. Not only is this phenomenon exhausting, but constant pages also limit your ability to focus on other work, even if you're simply clicking 'acknowledge' ('acking'). Research has shown that people lose up to 40% of productive time to brief context switches. Many of the alerts behind these never-ending streams of pages are neither urgent nor important, and don't require any human action. So, where are they coming from?
Here are five sources of noise that can create alert fatigue and distract your on-call DevOps or SRE team from the real issues that need attention in your production system:
Unused services, decommissioned projects, and issues that other teams are already handling are common sources of noise: prevalent enough to be annoying, but not always worth the legwork of turning the alerts off at their source. These notifications come from all kinds of tools in your production system and tend to get quickly acked but largely ignored, since there usually isn't an underlying actionable issue.
Some noisemakers indicate problems that may eventually need to be addressed but are low on the current priority list. Keeping these alerts configured can be a useful reminder to investigate or address the root cause eventually, but in the short term, they're probably not adding value.
Acking flapping issues can feel like playing whack-a-mole. These alerts are a good indicator of a growing problem in your system but can be a source of distraction when you're trying to problem-solve, sometimes prompting SREs to silence pages or blindly ack incoming issues. Unrelated issues can sometimes get lost in piles of flapping notifications, which can be a risk to your team's ability to notice important problems.
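One way to keep flapping from turning into whack-a-mole is to count how often an alert changes state within a recent window and treat it as a single ongoing problem once that count gets high. The sketch below is only an illustration of that idea: the `FlapDetector` class, the 10-minute window, and the threshold of four transitions are all assumptions made for this example, not settings from any particular monitoring tool.

```python
from collections import deque


class FlapDetector:
    """Flags an alert as flapping when it changes state too often in a window.

    Hypothetical helper for illustration; window and threshold are assumptions.
    """

    def __init__(self, window_seconds=600, max_transitions=4):
        self.window_seconds = window_seconds
        self.max_transitions = max_transitions
        self._last_state = None
        self._transitions = deque()  # timestamps of recent state changes

    def observe(self, timestamp, state):
        """Record a state ("open" or "closed"); return True if flapping."""
        if self._last_state is not None and state != self._last_state:
            self._transitions.append(timestamp)
        self._last_state = state

        # Forget state changes that fell out of the sliding window.
        cutoff = timestamp - self.window_seconds
        while self._transitions and self._transitions[0] < cutoff:
            self._transitions.popleft()

        return len(self._transitions) >= self.max_transitions


detector = FlapDetector()
events = [(0, "open"), (60, "closed"), (120, "open"), (180, "closed"), (240, "open")]
flags = [detector.observe(ts, state) for ts, state in events]
```

With these example events, the alert is only flagged on the last observation, once it has changed state four times within ten minutes. A flagged alert could then be grouped into one incident rather than paging on every open and close.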
Similar to flapping alerts, but more a symptom of redundant monitoring configuration than an underlying production issue, duplicate alerts can be another source of pager fatigue. You're aware of the problem after the first notification, so additional alerts letting you know that it's still there can add frustration.
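Duplicate alerts can often be suppressed by remembering when you last paged for a given problem and staying quiet until some time has passed. The following is a minimal sketch of that approach, assuming an alert can be identified by a (service, check) pair; the `Deduplicator` class and the 15-minute suppression period are hypothetical choices for this example.

```python
class Deduplicator:
    """Suppresses repeat pages for the same (service, check) pair.

    Hypothetical helper for illustration; the suppression period is an assumption.
    """

    def __init__(self, suppress_seconds=900):
        self.suppress_seconds = suppress_seconds
        self._last_paged = {}  # (service, check) -> timestamp of last page sent

    def should_page(self, timestamp, service, check):
        """Return True if this alert should page, False if it is a duplicate."""
        key = (service, check)
        last = self._last_paged.get(key)
        if last is not None and timestamp - last < self.suppress_seconds:
            return False  # already paged recently for this exact problem
        self._last_paged[key] = timestamp
        return True


dedup = Deduplicator()
first = dedup.should_page(0, "api", "latency")       # new problem
repeat = dedup.should_page(300, "api", "latency")    # same problem, 5 minutes later
other = dedup.should_page(300, "db", "disk")         # different problem
```

Here the first and third alerts page, while the repeat five minutes later is suppressed. In practice the identifying key would depend on how your monitoring tools label alerts.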
These are the toughest but possibly most important sources of noise to identify. Getting to the root cause of issues is way faster with all of the context about the impact of the issue across your full stack, and missing this context can lead you down rabbit holes of investigation and troubleshooting that aren't worth your time.
Take a quick scroll through your team's pages from the past day or week and think about each one. How many fit into one of these categories? Noisy pages like these create distractions, build frustration, and hide real problems, and as the complexity of modern production systems continues to grow, the volume will only increase.
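If you want to put a number on that review, a quick tally is enough. The sketch below assumes you have hand-labeled each recent page with a short tag; the tags used here ("duplicate", "flapping", "actionable") and the `noise_share` function are hypothetical and only for illustration.

```python
from collections import Counter


def noise_share(pages):
    """Return (count per label, fraction of pages not labeled "actionable").

    `pages` is a list of (page_id, label) pairs; the labels are assumptions.
    """
    counts = Counter(label for _, label in pages)
    noisy = sum(n for label, n in counts.items() if label != "actionable")
    share = noisy / len(pages) if pages else 0.0
    return counts, share


# Example data, invented for this sketch.
pages = [
    ("p1", "duplicate"),
    ("p2", "actionable"),
    ("p3", "flapping"),
    ("p4", "duplicate"),
]
counts, share = noise_share(pages)
```

For this made-up sample, three of the four pages count as noise, so `share` is 0.75.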
Implementing an AIOps platform, like New Relic AI, can help you tackle alert noise across your stack and create a continuously improving, streamlined system for correlating and prioritizing incidents. Many layers of machine learning-driven filters and logic power New Relic AI. A correlation engine looks for all of these sources of noise. It also adapts to continually provide more relevant alerts, reducing pager fatigue and empowering your team to stay focused on important issues. Learn more about New Relic AI (currently in private beta) today.