Alert Fatigue: Tuning OCI Alarms

There is a predictable failure pattern in monitoring, and it has nothing to do with whether the tooling works. A team turns on alarms for everything they can think of, the alarms fire constantly, and within a few weeks the people on call have learned that almost every page is something they can safely ignore. The pages keep arriving, the team keeps dismissing them, and then one day a page that genuinely matters arrives and is dismissed along with the rest. This is alert fatigue, and it quietly undermines the entire reason monitoring exists. The fix is not more alarms or better tooling. It is fewer, sharper alarms that each carry a clear meaning, and a deliberate practice of tuning the ones you keep.

Why alert fatigue happens

Alert fatigue is a human response to a flood of low-value signals. When alarms are easy to create, the natural instinct is to add one for every metric that might ever be interesting, and the result is a stream of notifications that mostly do not require any action. Each individual alarm seemed reasonable when it was created, but the sum of them trains the team to treat paging as noise. The deeper problem is that an alarm carries an implicit promise: when it fires, a human should do something. Once that promise is broken often enough, the team stops believing it. From then on every alarm, including the important ones, is met with the same shrug.

It is worth being clear about the cost. The cost is not the wasted minutes spent dismissing a page, though those add up. The cost is the erosion of trust in the signal itself, because a monitoring system that cries wolf is worse than no monitoring at all. No monitoring at least leaves the team alert to the possibility that they are flying blind. A noisy system lulls them into a false confidence that everything is covered while teaching them to look away.

The two ways an alarm goes wrong

Every poorly tuned alarm fails in one of two ways, and it helps to name them. A false positive is an alarm that fires when nothing is actually wrong, and it is the direct cause of fatigue. A false negative is the opposite, an alarm that stays quiet when something genuinely is wrong, and it is the silent failure that the team only discovers after an outage. Tuning is the work of pushing both of these down at once, and the tension between them is what makes tuning a craft rather than a setting. Loosen a threshold to cut false positives and you risk introducing false negatives. The goal is to find thresholds and conditions that catch the real problems without firing on the normal variation that is not a problem at all.

Failure mode	What happens	Consequence
False positive	Alarm fires when nothing is wrong	Fatigue, the team learns to ignore pages
False negative	Alarm stays silent during a real problem	An outage no one was warned about
Well tuned	Alarm fires only on conditions that need a human	Trust in the signal is preserved
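
The tension between the two failure modes can be made concrete with a small sketch. The readings and the "real problem" labels below are invented for illustration, and the `Sample` type and `confusion_counts` helper are hypothetical names rather than anything from OCI. The point is only to show how moving a simple threshold trades one kind of error for the other.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    value: float        # metric reading, e.g. CPU utilisation in percent
    real_problem: bool  # hypothetical ground truth taken from incident review


def confusion_counts(samples, threshold):
    """Count false positives and false negatives for a 'value > threshold' alarm."""
    false_pos = sum(1 for s in samples if s.value > threshold and not s.real_problem)
    false_neg = sum(1 for s in samples if s.value <= threshold and s.real_problem)
    return false_pos, false_neg


# Invented readings, for illustration only.
samples = [
    Sample(55, False), Sample(72, False), Sample(78, False), Sample(83, False),
    Sample(88, True), Sample(91, False), Sample(96, True), Sample(99, True),
]

for threshold in (70, 85, 95):
    fp, fn = confusion_counts(samples, threshold)
    print(f"threshold={threshold}: false positives={fp}, false negatives={fn}")
```

With this made-up data, the lowest threshold produces four false positives and no misses, while the highest produces no false positives but misses one real problem. The middle value is the kind of compromise that tuning looks for.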

The test every alarm should pass

There is a single question that decides whether an alarm earns its place: when this fires, does a human need to do something now? If the honest answer is no, the alarm should not page anyone. It might still be worth recording, it might belong on a dashboard, it might feed a report, but it should not interrupt a person. This test sounds obvious, and yet most noisy alarm estates fail it, because they were built by asking the wrong question, "what can we measure?", rather than the right one, "what requires a response?". Applying this one test to an existing set of alarms usually removes a surprising number of them, and the team feels the relief almost immediately.

The companion idea is severity. Not everything that needs a response needs a response at three in the morning. A clear practice separates alarms that must wake someone from alarms that can wait for business hours, and from conditions that are merely informational. The discipline of building good alarms in the first place is covered in setting up OCI alarms and alerts, and tuning is the ongoing work of keeping that set honest.
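
The response test and the severity split can be expressed as one small decision function. This is a minimal sketch, not an OCI feature: the `Route` values and the two yes/no inputs are assumptions chosen to mirror the three tiers described above.

```python
from enum import Enum


class Route(Enum):
    PAGE = "page"            # must wake someone now
    TICKET = "ticket"        # needs a response, but can wait for business hours
    DASHBOARD = "dashboard"  # informational, never interrupts a person


def route_alarm(needs_human_action: bool, can_wait_for_business_hours: bool) -> Route:
    """Apply the response test first, then the severity split."""
    if not needs_human_action:
        return Route.DASHBOARD
    if can_wait_for_business_hours:
        return Route.TICKET
    return Route.PAGE


# Hypothetical alarms, classified by the team during a review.
print(route_alarm(needs_human_action=True, can_wait_for_business_hours=False))
print(route_alarm(needs_human_action=True, can_wait_for_business_hours=True))
print(route_alarm(needs_human_action=False, can_wait_for_business_hours=True))
```

Putting the response test first means that a signal no one needs to act on can never reach the paging channel, whatever severity it was given.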

A framework for tuning OCI alarms

Tuning is most effective when it follows a repeatable loop rather than reacting to whichever page annoyed someone most recently. The steps below describe that loop.

1. Inventory what fires. Pull the history of which alarms paged over the last month and how often each one fired. The worst offenders reveal themselves immediately.
2. Ask the response test of each one. For every frequent alarm, ask whether a human actually did something each time it fired. If not, it is a candidate for change or removal.
3. Adjust thresholds to real behaviour. Set thresholds against how the system actually behaves under normal load, not against round numbers that feel tidy but bear no relation to the workload.
4. Require duration, not a single spike. Many false positives come from momentary spikes. Requiring a condition to hold for a few minutes before firing removes most of them without hiding sustained problems.
5. Route by severity. Send the genuinely urgent to a paging channel and the rest to a review queue or dashboard so nothing pages that does not need to.
6. Review on a schedule. Repeat the loop regularly, because workloads change and a threshold that was right six months ago drifts out of date.
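
The first two steps of the loop lend themselves to a short script. The sketch below assumes the paging history has already been exported as pairs of alarm name and whether anyone acted. That export format, the alarm names, and the 50 percent cut-off are all hypothetical, and no OCI API is called.

```python
from collections import Counter

# Hypothetical export of last month's pages: (alarm name, did a human act?).
history = [
    ("cpu-high", False), ("cpu-high", False), ("cpu-high", False), ("cpu-high", True),
    ("disk-full", True), ("disk-full", True),
    ("heartbeat-missed", False), ("heartbeat-missed", False),
]


def tuning_candidates(history, max_action_rate=0.5):
    """List alarms whose pages were rarely acted on, most frequent first."""
    fired = Counter(name for name, _ in history)
    acted = Counter(name for name, was_actioned in history if was_actioned)
    report = []
    for name, count in fired.most_common():
        rate = acted[name] / count  # share of pages that led to action
        if rate < max_action_rate:
            report.append((name, count, rate))
    return report


for name, count, rate in tuning_candidates(history):
    print(f"{name}: fired {count} times, acted on {rate:.0%}")
```

An alarm that fires often and is rarely acted on is the first candidate for a new threshold, a duration requirement, or a move to a dashboard.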

This loop turns tuning from a frustrated reaction into a maintained discipline. The first pass usually delivers the biggest improvement, but the scheduled review is what keeps fatigue from creeping back as the estate grows and changes.

Thresholds, duration, and normal variation

The single most common cause of false positives is a threshold set against a tidy number rather than against reality. A processor that normally runs warm during business hours will trip a threshold set too low several times a day, and none of those trips mean anything. The remedy is to look at how the metric actually behaves over a representative period and set the threshold above the normal range, so it fires only when behaviour leaves that range. The second most common cause is reacting to a single momentary reading. Systems are bursty, and a one-off spike is rarely a problem. Requiring the condition to persist for several minutes before the alarm fires filters out the noise of normal variation while still catching anything sustained. These two adjustments, realistic thresholds and a duration requirement, resolve the majority of fatigue-inducing pages on their own.
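
Both adjustments can be sketched in a few lines. This is an illustration under stated assumptions rather than a recipe: the nearest-rank percentile, the headroom factor, the one-reading-per-minute cadence, and both function names are choices made for the example, and the readings are invented.

```python
import math


def percentile_threshold(values, pct=99.0, headroom=1.1):
    """Nearest-rank percentile of normal-load readings, scaled by a headroom factor."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1] * headroom


def sustained_breaches(values, threshold, min_consecutive):
    """Indices at which the reading has exceeded the threshold for
    min_consecutive readings in a row (one firing per sustained run)."""
    fired_at = []
    run = 0
    for i, value in enumerate(values):
        run = run + 1 if value > threshold else 0
        if run == min_consecutive:
            fired_at.append(i)
    return fired_at


# Invented CPU readings, assumed to be one per minute.
readings = [40, 45, 95, 42, 44, 92, 94, 96, 97, 50]

print(sustained_breaches(readings, threshold=90, min_consecutive=1))
print(sustained_breaches(readings, threshold=90, min_consecutive=3))
```

With a single-reading rule the alarm fires on the lone spike as well as on the sustained run. Requiring three consecutive readings leaves only the sustained breach, at the cost of firing two minutes later.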

What to do with the alarms you remove

Removing an alarm from the paging path does not mean throwing the signal away. Much of what does not deserve a page still deserves a place on a dashboard, where a human can see it when they are already looking, or in a report that is reviewed periodically. The distinction is between information that must interrupt a person and information that should simply be available. Moving the merely informative onto dashboards preserves the visibility while protecting the paging channel. The same logic connects tuning to the broader question of what good looks like, which is best expressed through service level objectives, because an alarm tied to a meaningful objective is far easier to justify than one tied to an arbitrary metric.

Keeping the signal trustworthy

Alert fatigue is not a tooling problem and it is not solved by buying something. It is solved by the discipline of treating every page as a promise that a human will act, and by ruthlessly removing or reshaping anything that breaks that promise. A monitoring estate where every page is meaningful is a calmer place to work and, more importantly, a safer one, because the team still trusts the signal when it matters most. Tuning is the ongoing work that keeps it that way, and it is part of the wider practice set out in the complete monitoring and observability guide. When the pages have become noise and no one is sure which ones matter any more, our OCI monitoring and observability practice rebuilds the alarm set around the response test described here.


Part of a series
This guide is part of OCI Operations & Observability — our complete pillar guide on the topic.

About the author

Morten Andersen, Co-founder of OCI Specialists: 20 years of enterprise IT experience in OCI migration, security, networking, and 24/7 operations.
