OCI Health Checks and Probes

There is a difference between a service that is running and a service that is working, and it is a difference that matters enormously. A process can be alive, consuming memory, holding a port open, and yet be completely unable to do its job because the database it depends on has gone away or because it has wedged itself into a state it cannot recover from. To anything watching only whether the process exists, this service looks fine. To the users trying to use it, it is broken. Health checks exist to close this gap. They ask the service not whether it is running but whether it can actually do its work, and the answer is what lets the platform make good decisions about where to send traffic.

What a health check actually does

A health check is a question the platform asks a service on a regular schedule, and the service answers by saying it is healthy or it is not. The question can be as simple as whether the service responds at all, or as meaningful as whether the service can reach its dependencies and perform a representative piece of work. The platform uses the answer to decide whether to keep sending traffic to that instance. If a service reports itself unhealthy, traffic is routed elsewhere, to the instances that are still working, and users never see the failure. This is the quiet machinery that keeps a system available even when individual parts of it are failing, and it depends entirely on the checks being honest about what healthy means.
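The loop described above can be sketched in a few lines. This is a minimal illustration, not how any particular load balancer is implemented: the Instance class, the run_checks and routable functions, and the three example instances are all hypothetical names invented for the sketch.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Instance:
    name: str
    check: Callable[[], bool]  # hypothetical: returns True when the instance says it is healthy
    healthy: bool = True


def run_checks(instances: List[Instance]) -> None:
    """Ask every instance the health question and record its answer."""
    for inst in instances:
        try:
            inst.healthy = bool(inst.check())
        except Exception:
            # A check that raises (timeout, refused connection) counts as unhealthy.
            inst.healthy = False


def routable(instances: List[Instance]) -> List[str]:
    """Names of the instances that should currently receive traffic."""
    return [inst.name for inst in instances if inst.healthy]


def broken_check() -> bool:
    raise ConnectionError("database unreachable")


pool = [
    Instance("a", check=lambda: True),    # healthy
    Instance("b", check=broken_check),    # check itself fails
    Instance("c", check=lambda: False),   # reports itself unhealthy
]
run_checks(pool)
print(routable(pool))  # ['a']
```

In a real platform run_checks would run on a schedule and usually require several consecutive failures before removing an instance, which this sketch leaves out.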

Liveness and readiness are different questions

The most important distinction in health checking is between two questions that are easy to confuse. Liveness asks whether the service is alive at all, whether the process is running and responding, and a failed liveness check usually means the right response is to restart the thing because it has wedged. Readiness asks something different, whether the service is ready to receive traffic right now, which it might not be even when it is perfectly alive, for example while it is still starting up or while it is temporarily overloaded. A service that fails readiness should have traffic withheld from it until it recovers, but it should not be restarted, because restarting a service that is merely busy makes the problem worse. Conflating these two leads to systems that restart healthy services and send traffic to ones that cannot handle it.

Check type	Question it asks	Right response to failure
Liveness	Is the process alive and responding at all?	Restart the instance
Readiness	Is the service ready to take traffic now?	Withhold traffic until it recovers
Startup	Has the service finished starting up?	Wait before applying the other checks

Restarting a service that is merely busy does not help. It makes the overload worse.
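The table above amounts to a small decision rule, which can be written out as follows. This is a sketch of the logic only; the Action enum and the decide function are hypothetical names, and a real orchestrator applies thresholds and timeouts around each answer.

```python
from enum import Enum


class Action(Enum):
    WAIT = "wait for startup"
    RESTART = "restart the instance"
    WITHHOLD_TRAFFIC = "withhold traffic"
    SEND_TRAFFIC = "send traffic"


def decide(started: bool, live: bool, ready: bool) -> Action:
    """Map the three check results to the platform's response."""
    if not started:
        # Startup has not finished: do not judge liveness or readiness yet.
        return Action.WAIT
    if not live:
        # The process is wedged: restarting is the only thing that helps.
        return Action.RESTART
    if not ready:
        # Alive but busy or degraded: leave it running, just stop routing to it.
        return Action.WITHHOLD_TRAFFIC
    return Action.SEND_TRAFFIC


print(decide(started=True, live=True, ready=False).value)  # withhold traffic
```

The order of the tests is the point: a failed readiness check never reaches the restart branch, so a service that is merely busy is left alone.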

Shallow and deep checks

A second axis is how much a health check actually verifies. A shallow check confirms only that the service responds, which proves the process is alive but says nothing about whether it can do useful work. A deep check goes further, exercising the service's real dependencies, confirming it can reach its database or its downstream services and perform a representative operation. Deep checks catch the case where a service is running fine but cut off from something it needs, which a shallow check would miss entirely. The tradeoff is that a deep check is heavier and, if written carelessly, can itself cause problems, for example by hammering a database every few seconds or by reporting the whole service unhealthy because one nonessential dependency is slow. The art is writing a check deep enough to be meaningful without making it a source of load or a single point of failure.
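One way to get a deep check without the two problems named above is to cache its result for a few seconds and to let only critical dependencies decide the outcome. The sketch below assumes exactly that design; DeepCheck, ping_db, ping_recommendations, and the five-second lifetime are hypothetical choices for illustration, and the clock is injectable only so the example can be run without waiting.

```python
import time
from typing import Callable, Dict, Optional, Tuple

Probe = Callable[[], bool]


class DeepCheck:
    def __init__(self, critical: Dict[str, Probe], optional: Dict[str, Probe],
                 ttl_seconds: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._critical = critical
        self._optional = optional
        self._ttl = ttl_seconds
        self._clock = clock
        self._cached: Optional[Tuple[float, bool, Dict[str, bool]]] = None

    @staticmethod
    def _probe(fn: Probe) -> bool:
        try:
            return bool(fn())
        except Exception:
            return False  # an error or timeout counts as a failed dependency

    def result(self) -> Tuple[bool, Dict[str, bool]]:
        now = self._clock()
        # Reuse a recent answer so frequent polling does not hammer dependencies.
        if self._cached is not None and now - self._cached[0] < self._ttl:
            return self._cached[1], self._cached[2]
        probes = {**self._critical, **self._optional}
        details = {name: self._probe(fn) for name, fn in probes.items()}
        # Only critical dependencies decide health; optional ones are reported.
        healthy = all(details[name] for name in self._critical)
        self._cached = (now, healthy, details)
        return healthy, details


calls = {"db": 0}


def ping_db() -> bool:
    calls["db"] += 1
    return True


def ping_recommendations() -> bool:
    raise TimeoutError("optional dependency is slow")


fake_now = [0.0]
check = DeepCheck(critical={"db": ping_db},
                  optional={"recommendations": ping_recommendations},
                  ttl_seconds=5.0, clock=lambda: fake_now[0])

first = check.result()    # probes the dependencies
fake_now[0] = 2.0
second = check.result()   # served from cache, no new probe
fake_now[0] = 6.0
third = check.result()    # cache expired, probes again
print(first, calls["db"])
```

The slow optional dependency shows up in the details but does not mark the service unhealthy, and three polls cost the database only two probes.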

A framework for writing health checks that mean something

Health checks pay off when they are designed deliberately rather than copied from a default. The steps below describe how to write checks that reflect reality.

1. Separate liveness from readiness. Decide for each service what it means to be alive versus ready, and implement the two as distinct checks with distinct consequences.
2. Make readiness reflect real capacity. Have the readiness check report not ready when the service is genuinely unable to take more work, so traffic is steered away during overload.
3. Check the dependencies that matter. Verify the dependencies the service truly cannot work without, but do not fail the whole service because an optional one is slow.
4. Keep checks cheap. A check runs constantly, so it must be light enough that running it often does not itself become a load problem.
5. Allow for startup. Give services time to finish starting before liveness checks begin failing them, so a slow start is not mistaken for a crash.

Checks built this way give the platform accurate information, which is the whole point, because every routing and restart decision the platform makes is only as good as the answers the checks provide.
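On the service side, separating liveness from readiness and tying readiness to capacity might look like the sketch below, which uses only the Python standard library. The /livez and /readyz paths, the ServiceState class, and the limit of 100 in-flight requests are assumptions made for the example, not a convention the platform requires.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class ServiceState:
    """Hypothetical in-process state that the probes read."""

    def __init__(self, max_in_flight: int) -> None:
        self.started = False      # set to True once initialisation has finished
        self.in_flight = 0        # requests currently being handled
        self.max_in_flight = max_in_flight

    def ready(self) -> bool:
        # Ready means startup has finished and there is spare capacity.
        return self.started and self.in_flight < self.max_in_flight


STATE = ServiceState(max_in_flight=100)


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path == "/livez":
            # Liveness: answering at all is the proof. No dependency calls here.
            self._reply(200, b"alive")
        elif self.path == "/readyz":
            ok = STATE.ready()
            self._reply(200 if ok else 503, b"ready" if ok else b"not ready")
        else:
            self._reply(404, b"not found")

    def _reply(self, status: int, body: bytes) -> None:
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format: str, *args) -> None:
        pass  # keep probe traffic out of the request log


def probe(port: int, path: str) -> int:
    """Return the HTTP status a probe of `path` would see."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=2)
    try:
        conn.request("GET", path)
        resp = conn.getresponse()
        resp.read()
        return resp.status
    finally:
        conn.close()


server = ThreadingHTTPServer(("127.0.0.1", 0), ProbeHandler)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

during_startup = (probe(port, "/livez"), probe(port, "/readyz"))
STATE.started = True
after_startup = (probe(port, "/livez"), probe(port, "/readyz"))
STATE.in_flight = 100  # simulate the service at full capacity
overloaded = (probe(port, "/livez"), probe(port, "/readyz"))

server.shutdown()
server.server_close()
print(during_startup, after_startup, overloaded)
```

Liveness stays green in all three phases while readiness turns red during startup and under overload, so a platform acting on these answers would withhold traffic in both cases without restarting anything.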

Where health checks fit the bigger picture

Health checks are the foundation on which higher-level availability rests. They are what lets a load balancer stop sending traffic to a failed instance, what lets Kubernetes on OKE know when a pod is ready to serve and when it should be replaced, and what makes automated recovery possible at all. They are also a source of signal for monitoring, because the rate at which instances are failing their checks is itself a useful metric, one that often warns of trouble before users notice. Tied into alarms, a rising failure rate becomes an early warning. The checks are individually simple, but they underpin a great deal of what makes a system resilient.
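The failure rate mentioned above can be computed over a sliding window of recent check results. The sketch below is independent of any real monitoring or alarm service; FailureRateWindow, the window of ten results, and the 30 percent threshold are hypothetical values chosen for illustration.

```python
from collections import deque
from typing import Deque


class FailureRateWindow:
    """Tracks the share of failed checks among the most recent results."""

    def __init__(self, size: int, alarm_threshold: float) -> None:
        self._results: Deque[bool] = deque(maxlen=size)  # oldest results fall off
        self._threshold = alarm_threshold

    def record(self, passed: bool) -> None:
        self._results.append(passed)

    def failure_rate(self) -> float:
        if not self._results:
            return 0.0
        failures = sum(1 for passed in self._results if not passed)
        return failures / len(self._results)

    def alarming(self) -> bool:
        return self.failure_rate() >= self._threshold


window = FailureRateWindow(size=10, alarm_threshold=0.3)
for passed in [True] * 8 + [False] * 2:
    window.record(passed)
early = (window.failure_rate(), window.alarming())

# Two more failures push out the two oldest passes: 6 passes, 4 failures.
for passed in [False, False]:
    window.record(passed)
later = (window.failure_rate(), window.alarming())

print(early, later)  # (0.2, False) (0.4, True)
```

Alarming on the rate rather than on any single failed check is what makes this an early warning: one instance failing is routine, a climbing rate is not.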

Telling running from working

The value of health checks comes down to a single capability, the ability to tell the difference between a service that is running and a service that is working, and to act on that difference automatically. Done well, with liveness and readiness kept distinct and checks that verify enough to be meaningful without becoming a burden, they let a system route around its own failures so smoothly that users never know anything went wrong. Done badly, they restart healthy services and feed traffic to broken ones. The difference is entirely in the care taken to make the checks reflect reality.


Part of a series
This guide is part of OCI Operations & Observability — our complete pillar guide on the topic.

About the author

Morten Andersen, Co-founder of OCI Specialists, with 20 years of enterprise IT experience in OCI migration, security, networking, and 24/7 operations.
