Defining SLOs and SLIs on OCI

Ask three people on a team how reliable a service is and you will get three answers, all of them feelings. One thinks it is fine because nobody has complained lately. One thinks it is shaky because of a bad week last month. One has no opinion because they never look. None of them can settle the question, because there is no number to settle it with. This is the situation service level objectives exist to fix. They replace the feeling with a target and the argument with a measurement, and once a team has them, the conversation about reliability changes from opinion to evidence.

The vocabulary trips people up, so it is worth being precise. A service level indicator, or SLI, is a measurement of how the service is doing, such as the proportion of requests served successfully. A service level objective, or SLO, is the target for that indicator, such as serving 99.9 percent of requests successfully over a month. A service level agreement, or SLA, is a contractual promise to a customer, usually looser than the internal objective, with consequences if it is broken. This article is about the first two, the ones a team uses to run a service well, on OCI specifically.

Choosing indicators that reflect the user experience

The most common mistake is to measure what is easy rather than what matters. CPU utilisation is easy to measure but tells you little about the user, because a user does not care how busy the processor is, only whether their request worked and arrived quickly. Good indicators measure the experience the service delivers. For a request driven service, the indicators that matter are availability, the proportion of requests that succeed, and latency, the proportion of requests served faster than some threshold. For a data pipeline, the indicators might be freshness, how recent the data is, and correctness, the proportion of records processed without error. In every case the test is the same: does this indicator move when the user's experience gets worse? If it does, it is a good SLI. If it does not, it is a distraction.

The shape of a good SLI

The clearest indicators are expressed as a ratio of good events to total events, because that produces a percentage that is easy to reason about and easy to set a target on. The table below shows the common indicators in this form.

Indicator	Good events	Total events	What it protects
Availability	Successful responses	All responses	The service works
Latency	Responses under the threshold	All responses	The service is fast enough
Freshness	Records newer than the limit	All records	The data is current
Quality	Records processed without error	All records	The output is correct

On OCI, the raw material for these indicators comes from several places. Request counts and error counts come from load balancer and service metrics in the monitoring service. Latency distributions for applications come from application performance monitoring, which records the timing of real requests. Logs from the logging service can supply counts of specific outcomes that metrics do not capture. The skill is in combining these into a clean ratio that reflects what users feel.
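
As a minimal sketch of the good-over-total form, the Python below computes an availability and a latency indicator from counts and timings that are assumed to have already been exported from your metrics or logs. The `Sli` class, the `latency_sli` helper, the 300 ms threshold, and the sample numbers are all hypothetical, chosen for illustration; nothing here calls an OCI API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sli:
    """A service level indicator expressed as good events over total events."""
    name: str
    good: int   # events that met the bar
    total: int  # every event in the measurement window

    @property
    def ratio(self) -> float:
        # Assumption: a window with no traffic counts as fully good,
        # because there was no user experience to degrade.
        if self.total == 0:
            return 1.0
        return self.good / self.total


def latency_sli(durations_ms: list[float], threshold_ms: float) -> Sli:
    """Count requests served faster than the threshold as good events."""
    good = sum(1 for d in durations_ms if d < threshold_ms)
    return Sli("latency", good, len(durations_ms))


# Hypothetical sample data standing in for exported request counts and timings.
availability = Sli("availability", good=99_950, total=100_000)
latency = latency_sli([120.0, 80.0, 450.0, 95.0], threshold_ms=300.0)

print(f"{availability.name}: {availability.ratio:.4f}")
print(f"{latency.name}: {latency.ratio:.4f}")
```

Keeping the good and total counts separate, rather than storing only the percentage, makes it possible to add windows together later without averaging averages.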

Setting an objective that is honest

Once you have an indicator, you set a target. The instinct is to aim high, to promise 99.99 percent because more nines sound better. This is a trap. Every additional nine costs a great deal more to achieve and constrains how the team can work, and a target set higher than the service genuinely needs wastes effort that could go elsewhere. The right target is the level at which users are happy and below which they are not, which is usually lower than engineers guess. Look at what the service actually delivers today, ask whether users are content at that level, and set the objective near there rather than at an aspirational figure nobody can sustain. An objective you routinely miss is worse than none, because it teaches the team to ignore it.
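
To make the cost of each extra nine concrete, here is a small sketch that converts a target into the downtime it tolerates. It assumes a 30 day window and treats the objective as time based, which is a simplification of the request based indicators described above; the function name `allowed_downtime` is illustrative.

```python
from datetime import timedelta

WINDOW = timedelta(days=30)  # assumed measurement window


def allowed_downtime(target: float, window: timedelta = WINDOW) -> timedelta:
    """Return how much of the window may be bad while still meeting the target."""
    return window * (1.0 - target)


for target in (0.99, 0.999, 0.9999):
    minutes = allowed_downtime(target).total_seconds() / 60
    print(f"{target:.2%} allows about {minutes:.1f} minutes per 30 days")
```

Each added nine cuts the allowance by a factor of ten: roughly 7.2 hours at 99 percent, 43.2 minutes at 99.9 percent, and a little over 4 minutes at 99.99 percent.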

Error budgets and what they buy you

The most useful idea that comes with objectives is the error budget. If the objective is to succeed on 99.9 percent of requests, then the remaining 0.1 percent is a budget of allowed failure. This reframes reliability from a demand for perfection into a quantity to be spent wisely. When the budget is healthy, the team can move quickly, ship changes, and take risks, because there is room for the occasional failure. When the budget is nearly exhausted, the team slows down, freezes risky changes, and focuses on stability until the budget recovers. The error budget turns the eternal tension between shipping features and keeping things stable into a number that decides it, and that is a far calmer way to run a service than arguing each time.
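
The arithmetic can be sketched in a few lines of Python. The `ErrorBudget` class, the `decision` helper, the traffic figures, and the 75 percent freeze threshold are assumptions made for illustration; a team would pick its own policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    target: float      # objective as a fraction, e.g. 0.999
    total_events: int  # all requests in the window so far
    bad_events: int    # requests that failed

    @property
    def allowed(self) -> float:
        """Failures the objective tolerates for this much traffic."""
        return (1.0 - self.target) * self.total_events

    @property
    def remaining(self) -> float:
        """Failures still available; negative means the objective is missed."""
        return self.allowed - self.bad_events

    @property
    def consumed_fraction(self) -> float:
        if self.allowed == 0:
            return 0.0 if self.bad_events == 0 else float("inf")
        return self.bad_events / self.allowed


def decision(budget: ErrorBudget, freeze_above: float = 0.75) -> str:
    """Hypothetical policy: stabilise once most of the budget is spent."""
    return "stabilise" if budget.consumed_fraction >= freeze_above else "ship"


# Hypothetical month: two million requests against a 99.9 percent objective.
healthy = ErrorBudget(target=0.999, total_events=2_000_000, bad_events=1_200)
strained = ErrorBudget(target=0.999, total_events=2_000_000, bad_events=1_800)

print(f"allowed failures: {healthy.allowed:.0f}")
print(f"healthy month: {healthy.consumed_fraction:.0%} spent, {decision(healthy)}")
print(f"strained month: {strained.consumed_fraction:.0%} spent, {decision(strained)}")
```

With these numbers the budget is about 2,000 failed requests. Spending 1,200 of them leaves room to keep shipping, while spending 1,800 crosses the assumed threshold and tips the decision toward stability.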

A framework for introducing SLOs

Teams that adopt objectives well tend to follow a path like the one below.

Pick one service and one journey. Start with a single important service and its main user journey rather than trying to cover everything at once.
Choose two or three indicators. Usually availability and latency. Resist adding more until these are working.
Measure before you target. Watch the indicators for a few weeks to learn what the service actually delivers.
Set the objective near reality. Pick a target at the level where users are content, not an aspirational figure.
Wire the budget to alerting. Alert when the budget is burning fast, so that the objective drives your alarms instead of sitting beside them.
Review monthly. Look at whether the objective was met, adjust the target if it was wrong, and act on the trend.
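
The "burning fast" check in the alerting step can be sketched as a burn rate: the observed error rate divided by the error rate the objective allows. A burn rate of 1 spends the budget exactly over the window, and higher values spend it proportionally sooner. The function names, the 30 day window, and the alert threshold of 14.4 (a commonly cited starting point for a one hour lookback, equal to spending 2 percent of a 30 day budget in an hour) are assumptions to tune, not fixed rules.

```python
WINDOW_HOURS = 30 * 24  # assumed 30 day objective window


def burn_rate(bad: int, total: int, target: float) -> float:
    """Observed error rate relative to the rate the objective allows."""
    if total == 0:
        return 0.0
    return (bad / total) / (1.0 - target)


def hours_to_exhaustion(rate: float, window_hours: float = WINDOW_HOURS) -> float:
    """How long a full budget lasts if this burn rate continues."""
    return float("inf") if rate <= 0 else window_hours / rate


def should_alert(rate: float, threshold: float = 14.4) -> bool:
    """Assumed fast-burn threshold for a one hour lookback."""
    return rate > threshold


# Hypothetical last hour of traffic against a 99.9 percent objective.
calm = burn_rate(bad=5, total=5_000, target=0.999)
incident = burn_rate(bad=100, total=5_000, target=0.999)

for label, rate in (("calm", calm), ("incident", incident)):
    print(f"{label}: burn rate {rate:.1f}, "
          f"budget lasts {hours_to_exhaustion(rate):.0f} h, "
          f"alert={should_alert(rate)}")
```

In the incident example, 2 percent of requests fail against an allowance of 0.1 percent, so the budget burns twenty times too fast and a full month's allowance would be gone in about 36 hours.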

Making objectives part of daily work

An objective that lives in a document changes nothing. An objective that drives a chart on the team dashboard and a burn rate alert that fires when the budget drains too fast becomes part of how the team works. The indicator becomes the headline number on the service health view, the budget becomes the input to the decision about whether to ship or to stabilise, and the monthly review becomes the moment the team learns whether its sense of reliability matches the measurement. This is the difference between a maturing observability practice and one that simply collects data: the objectives are defined around what users actually experience rather than what is easy to chart.

Part of a series
This guide is part of OCI Operations & Observability — our complete pillar guide on the topic.

About the author

Morten Andersen, Co-founder of OCI Specialists, has 20 years of enterprise IT experience in OCI migration, security, networking, and 24/7 operations.
