Setting Up OCI Alarms and Alerts

There is a particular kind of dread that comes with a phone buzzing at three in the morning. If the team trusts its alerts, that buzz means something real is wrong and needs attention now, and they get up and deal with it. If the team does not trust its alerts, because nine times out of ten the buzz is noise, then the one time it matters gets ignored along with the rest. Everything good about alerting and everything bad about it flows from this question of trust, and trust is built or destroyed in how the alarms are set up. This guide is about setting up alarms that earn trust, the kind that fire when they should, stay quiet when they should, and carry what a responder needs.

Alert on symptoms, not causes

The most important principle in alerting is to alert on what users feel, not on every internal condition that might contribute to it. Users do not experience CPU utilisation or memory pressure directly. They experience slow responses, errors, and unavailability. An alarm on high CPU might fire when nothing is actually wrong, because the system is busy but coping fine, or it might stay silent while users suffer for a reason CPU never showed. An alarm on the symptom, such as elevated error rate or response time, fires when users are actually affected, regardless of which underlying cause produced it. Symptom-based alerts are both more reliable and less noisy, because they track the thing you actually care about rather than a proxy for it.
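A minimal sketch of the difference, using made-up numbers. The MinuteSample type, the 5% error-rate limit, and the 85% CPU limit are assumptions chosen for illustration, not values taken from OCI.

```python
from dataclasses import dataclass


@dataclass
class MinuteSample:
    """One minute of hypothetical service telemetry."""
    requests: int
    errors: int
    cpu_percent: float


def error_rate(sample: MinuteSample) -> float:
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    if sample.requests == 0:
        return 0.0
    return sample.errors / sample.requests


def symptom_alarm(sample: MinuteSample, max_error_rate: float = 0.05) -> bool:
    """Fires on what users feel: too many failed requests."""
    return error_rate(sample) > max_error_rate


def cause_alarm(sample: MinuteSample, max_cpu_percent: float = 85.0) -> bool:
    """Fires on an internal condition that may or may not hurt users."""
    return sample.cpu_percent > max_cpu_percent


# Busy but serving requests fine: CPU is high, almost nothing fails.
busy_but_coping = MinuteSample(requests=1000, errors=2, cpu_percent=95.0)
# Quiet CPU, but one request in five is failing for some other reason.
quiet_but_failing = MinuteSample(requests=1000, errors=200, cpu_percent=30.0)

for name, sample in [
    ("busy_but_coping", busy_but_coping),
    ("quiet_but_failing", quiet_but_failing),
]:
    print(name, "cause:", cause_alarm(sample), "symptom:", symptom_alarm(sample))
```

In this toy data the CPU alarm fires on the healthy minute and stays silent on the failing one, while the error-rate alarm does the opposite.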

Thresholds and dwell time

Two settings decide whether an alarm is steady or jumpy. The threshold is the value that counts as a problem, and the dwell time is how long the condition must hold before the alarm fires. Both need thought. A threshold set too close to normal operating values will trip on routine variation, while one set too far away will let real problems run before anyone is told. The dwell time guards against transient spikes. A metric that briefly crosses a line and immediately falls back is usually noise, and firing on it teaches people to ignore the alarm. Requiring the condition to persist for a sensible period before firing filters out these blips while still catching sustained problems quickly. Tuning threshold and dwell time together is most of the work of making an alarm trustworthy.
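The interaction between threshold and dwell time can be sketched as a small simulation. This is an illustration only and does not call OCI; the function name first_firing_index and the sample series are hypothetical, and dwell time is expressed as a count of consecutive samples rather than a duration.

```python
from typing import Optional, Sequence


def first_firing_index(
    values: Sequence[float], threshold: float, dwell_samples: int
) -> Optional[int]:
    """Return the index at which the alarm would fire, or None if it never does.

    The alarm fires once `values` has stayed above `threshold` for
    `dwell_samples` consecutive samples.
    """
    if dwell_samples < 1:
        raise ValueError("dwell_samples must be at least 1")
    consecutive = 0
    for index, value in enumerate(values):
        if value > threshold:
            consecutive += 1
            if consecutive >= dwell_samples:
                return index
        else:
            # The condition cleared, so the dwell clock starts again.
            consecutive = 0
    return None


# Two short spikes that fall straight back to normal.
blips = [1.0, 9.0, 1.0, 1.0, 8.0, 1.0]
# A condition that crosses the line and stays there.
sustained = [1.0, 9.0, 9.0, 9.0, 9.0, 9.0]

print(first_firing_index(blips, threshold=5.0, dwell_samples=1))
print(first_firing_index(blips, threshold=5.0, dwell_samples=3))
print(first_firing_index(sustained, threshold=5.0, dwell_samples=3))
```

With a dwell of one sample the blips fire the alarm immediately. With a dwell of three samples the blips never fire, and the sustained problem is still caught on its third consecutive bad sample.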

Setting	Too tight	Too loose
Threshold	Fires on normal variation, becomes noise	Misses real problems until they are severe
Dwell time	Fires on transient blips that self-resolve	Delays the alert on genuine sustained issues
Severity	Everything is urgent, so nothing is	Real urgency buried among low-priority alerts

Notice that severity belongs in the same conversation. If every alarm is marked critical, the marking carries no information and responders cannot triage. Reserve the highest severity for the alerts that truly demand someone wake up, and let the rest sit at a level that says "look at this during the day".

Give the alert context to act on

An alert that says something is wrong without saying what or where forces the responder to start every investigation from scratch, which wastes the most expensive minutes of an incident. A good alert carries context. It names the affected resource, states the condition that fired, indicates the severity, and ideally points toward the next step, such as a relevant dashboard or runbook. The goal is that a responder reading the alert on their phone already knows roughly what they are dealing with and where to look, rather than having to log in and hunt for the basics. This context is set up when the alarm is created, in the message it sends, and the small effort of writing a clear alarm message pays back many times over during every incident it fires for.
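One way to keep messages consistent is to build them from the same four pieces every time. The sketch below is an assumption about how a team might template its alarm messages, not an OCI feature; AlertContext, format_alert, missing_context, the resource name, and the runbook URL are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AlertContext:
    """The pieces a responder needs to see in the alert itself."""
    resource: str      # what is affected
    condition: str     # what fired
    severity: str      # how urgent it is
    runbook_url: str   # where to look next


def missing_context(ctx: AlertContext) -> List[str]:
    """Names of the fields that were left blank."""
    fields = {
        "resource": ctx.resource,
        "condition": ctx.condition,
        "severity": ctx.severity,
        "runbook_url": ctx.runbook_url,
    }
    return [name for name, value in fields.items() if not value.strip()]


def format_alert(ctx: AlertContext) -> str:
    """Render a short message that reads well on a phone."""
    return "\n".join(
        [
            f"[{ctx.severity}] {ctx.resource}",
            f"Condition: {ctx.condition}",
            f"Next step: {ctx.runbook_url}",
        ]
    )


example = AlertContext(
    resource="checkout-api (prod)",
    condition="5xx error rate above 5% for 5 minutes",
    severity="CRITICAL",
    runbook_url="https://runbooks.example.com/checkout-api/errors",
)
print(format_alert(example))
```

Checking missing_context when an alarm is defined catches an empty message before it is discovered during an incident.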

Route alerts to the right place

An alert is only useful if it reaches someone who can act on it, which makes routing as important as the alarm itself. OCI alarms send to notification topics, and those topics fan out to destinations such as email, messaging, or an on-call paging system. The setup should ensure that the right people get the right alerts at the right urgency. A critical production alert should page whoever is on call now, through a channel that will wake them, while a low-priority informational alert should land somewhere it can be reviewed without interrupting anyone. Mixing these up, by paging people for trivia or by sending genuine emergencies to an unread mailbox, undoes all the care that went into the alarm. Routing is covered in more depth alongside notifications and events.
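A routing policy can be written down as a simple lookup from severity to destination. This is a sketch under stated assumptions: the severity names are meant to mirror the levels OCI alarms offer, the topic names are hypothetical placeholders rather than real topic identifiers, and nothing here calls the Notifications service.

```python
# Hypothetical topic names; a real setup would hold topic identifiers here.
ROUTES = {
    "CRITICAL": "oncall-pager-topic",        # wakes whoever is on call
    "ERROR": "team-chat-topic",              # seen quickly during working hours
    "WARNING": "daily-review-email-topic",   # reviewed without interrupting anyone
    "INFO": "daily-review-email-topic",
}


def destination_for(severity: str) -> str:
    """Return the topic an alarm of this severity should publish to.

    Unknown severities raise instead of falling back to a default, so a
    typo cannot quietly send an emergency to an unread mailbox.
    """
    try:
        return ROUTES[severity.upper()]
    except KeyError:
        raise ValueError(f"no route defined for severity {severity!r}") from None


print(destination_for("critical"))
print(destination_for("INFO"))
```

Failing loudly on an unmapped severity is a deliberate choice in this sketch, since a silent default is how genuine emergencies end up in the wrong place.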

A framework for setting up an alarm

Each new alarm benefits from a consistent thought process. The steps below keep the result trustworthy.

1. Start from the symptom. Decide what user-facing problem this alarm exists to catch, and pick a metric that reflects it.
2. Set the threshold with margin. Choose a value clearly outside normal operation, so routine variation does not trip it.
3. Add dwell time. Require the condition to persist long enough to mean something, filtering out transient blips.
4. Assign honest severity. Mark it at the level that reflects how urgently a human must respond, reserving critical for the real thing.
5. Write a useful message. Include the resource, the condition, and a pointer to the next step, so the alert is actionable on its own.
6. Route it correctly. Send it to the destination and urgency that match its severity, and confirm it actually arrives.

Running every alarm through this sequence produces a set of alerts that responders trust, which is the whole point. An alarm built carelessly is a future false alarm waiting to erode that trust.
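The framework above can be sketched as a review checklist that runs against a proposed alarm before it is created. Everything here is hypothetical: AlarmSpec is not an OCI type, the field names are invented for illustration, and the check assumes an alarm that fires when the metric rises above its threshold.

```python
from dataclasses import dataclass
from typing import List

KNOWN_SEVERITIES = {"CRITICAL", "ERROR", "WARNING", "INFO"}


@dataclass
class AlarmSpec:
    """A proposed alarm, described in the terms the framework uses."""
    name: str
    metric_is_symptom: bool   # does the metric reflect what users feel?
    threshold: float          # alarm fires above this value
    normal_peak: float        # highest value seen in routine operation
    dwell_minutes: int        # how long the condition must hold
    severity: str
    message: str
    destination: str


def review(spec: AlarmSpec) -> List[str]:
    """Return the framework steps this alarm fails; empty means it passes."""
    problems = []
    if not spec.metric_is_symptom:
        problems.append("metric is not a user-facing symptom")
    if spec.threshold <= spec.normal_peak:
        problems.append("threshold sits inside normal operation")
    if spec.dwell_minutes < 1:
        problems.append("no dwell time")
    if spec.severity not in KNOWN_SEVERITIES:
        problems.append("severity is not a known level")
    if not spec.message.strip():
        problems.append("message is empty")
    if not spec.destination.strip():
        problems.append("no destination")
    return problems


careful = AlarmSpec(
    name="checkout-error-rate",
    metric_is_symptom=True,
    threshold=0.05,
    normal_peak=0.01,
    dwell_minutes=5,
    severity="CRITICAL",
    message="checkout-api error rate above 5% for 5 minutes, see runbook",
    destination="oncall-pager-topic",
)
careless = AlarmSpec(
    name="cpu-high",
    metric_is_symptom=False,
    threshold=70.0,
    normal_peak=85.0,
    dwell_minutes=0,
    severity="CRITICAL",
    message="",
    destination="",
)

print(review(careful))
print(review(careless))
```

A check like this cannot judge whether a severity is honest or a message is actually useful, so it supplements the thought process rather than replacing it.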

Alarms are never finished

The final thing to understand is that alarm setup is not a one-time task. Systems change, thresholds that were right become wrong, alarms that mattered become obsolete, and new failure modes appear that need new alerts. An alarm set that is created once and never revisited slowly drifts into noise as the world moves on around it. The healthy practice is to review alarms regularly, retiring the ones that no longer earn their keep and tuning the ones that fire too often or too late. This ongoing tuning is the subject of the deeper discussion of alert fatigue and tuning, and it connects to defining SLOs and SLIs, which give alarms a principled target to aim at. Alarms sit on top of the Monitoring service and form part of the wider practice in the complete monitoring and observability guide. When you want an alerting setup people actually trust, our OCI monitoring and observability practice builds it to the discipline described here.
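The regular review described above can be supported by a very small report over each alarm's recent history. This is a sketch only: FiringStats, the "actionable" count (firings where a responder actually had to do something), and the 50% cut-off are assumptions, and the numbers would have to come from the team's own incident records.

```python
from dataclasses import dataclass


@dataclass
class FiringStats:
    """Hypothetical per-alarm history for one review period."""
    name: str
    firings: int      # how many times the alarm fired
    actionable: int   # how many of those needed a human to act


def recommendation(stats: FiringStats, min_actionable_ratio: float = 0.5) -> str:
    """Suggest what the review should look at for this alarm."""
    if stats.firings == 0:
        # Silent alarms may be healthy or obsolete; a person has to decide.
        return "check relevance"
    ratio = stats.actionable / stats.firings
    if ratio < min_actionable_ratio:
        return "tune"
    return "keep"


history = [
    FiringStats(name="checkout-error-rate", firings=4, actionable=4),
    FiringStats(name="cpu-high", firings=20, actionable=2),
    FiringStats(name="legacy-batch-lag", firings=0, actionable=0),
]

for stats in history:
    print(stats.name, "->", recommendation(stats))
```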

Part of a series
This guide is part of OCI Operations & Observability, our complete pillar guide on the topic.

About the author

Morten Andersen, Co-founder of OCI Specialists, has 20 years of enterprise IT experience in OCI migration, security, networking, and 24/7 operations.
