Monitoring OKE Workloads

Monitoring a traditional server is straightforward in one respect. The server has a name, it stays where it is, and you watch it. A workload running on OCI Kubernetes Engine, usually called OKE, offers no such stability. The same application might run as five pods this minute and eight pods the next, those pods might be scheduled onto different nodes as the cluster rebalances, and a pod that you were watching can simply cease to exist when it is replaced by a newer one. This is by design and it is the source of the platform's resilience, but it changes monitoring fundamentally. You cannot watch instances, because instances are ephemeral. You have to watch the service the instances provide, and you have to understand the layers beneath it.

The layers of an OKE cluster

An OKE workload sits in a stack, and trouble can originate at any level, so monitoring has to cover all of them. At the bottom are the worker nodes, the compute instances on which everything runs, and these are watched much like any server, for processor, memory, and disk pressure. Above them sits the cluster itself, the Kubernetes control plane and the scheduling decisions it makes, where you watch whether pods are being placed successfully and whether the cluster has the capacity it needs. Above that are the workloads, the deployments and pods running your application, where you watch whether the desired number of replicas is actually running and healthy. At the top is the application itself, the code inside the containers, whose behaviour is watched in the same way as any application. A problem at any layer can present as a symptom at another, which is why a partial view is so misleading on Kubernetes.

Layer	What you watch	Typical symptom of trouble
Worker nodes	Processor, memory, disk, node readiness	Pods evicted or unschedulable
Cluster	Scheduling, capacity, control plane health	Pods stuck pending
Workloads	Replica count, restarts, pod health	Crash loops, missing replicas
Application	Request latency, errors, throughput	Slow or failing requests

Watch the service, not the pod

The central shift in mindset is to stop caring about any individual pod and start caring about the service as a whole. The question that matters is not whether a particular pod is healthy but whether the desired number of healthy pods is running and serving requests. If one pod dies and Kubernetes immediately replaces it, that is the platform working as intended, and an alarm that fires on the death of a single pod is pure noise. What deserves attention is a service that cannot maintain its desired state: replicas that are missing and not being replaced, pods that crash repeatedly, or a cluster that cannot schedule what it has been asked to run. Framing monitoring around the desired state of the service rather than the fate of individual pods is what keeps an OKE monitoring setup sane.

Stop watching the pod. Watch whether the desired number of healthy pods is running and serving.
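
As a rough illustration of alarming on desired state rather than on individual pods, the sketch below flags a service only when a replica gap persists across several consecutive observations. The ReplicaSample type, the service_degraded function, and the default of three samples are all assumptions made for this example. They are not part of Kubernetes or any OCI service, and a real setup would feed in values taken from its own metrics source.

```python
from dataclasses import dataclass


@dataclass
class ReplicaSample:
    """One observation of a service (hypothetical shape, not a Kubernetes type)."""
    desired: int  # replicas the deployment asks for
    ready: int    # replicas currently ready to serve


def service_degraded(samples: list[ReplicaSample], sustained: int = 3) -> bool:
    """Return True only if the last `sustained` samples all show a replica gap.

    A short-lived gap, such as one pod dying and being replaced, is ignored.
    """
    if len(samples) < sustained:
        return False
    recent = samples[-sustained:]
    return all(s.ready < s.desired for s in recent)


# One pod dies and is replaced: the gap appears once and then closes.
healthy_churn = [ReplicaSample(5, 5), ReplicaSample(5, 4), ReplicaSample(5, 5)]

# Replicas go missing and stay missing: this is the case worth an alarm.
stuck = [ReplicaSample(5, 5), ReplicaSample(5, 3),
         ReplicaSample(5, 3), ReplicaSample(5, 3)]

print(service_degraded(healthy_churn))  # False
print(service_degraded(stuck))          # True
```

The point of the design is that the death of a single pod never appears in the input at all. Only the gap between desired and ready replicas does, and only a sustained gap raises the flag.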

The signals that matter on OKE

A few signals carry most of the value when monitoring OKE, and they map to the questions an operator actually asks. The first is replica health, the gap between how many pods should be running and how many actually are, because a persistent gap means the service is degraded. The second is restart behaviour, because a pod that keeps restarting is in a crash loop and something is wrong with it or its configuration. The third is scheduling, whether pods are stuck in a pending state because the cluster has nowhere to put them, which points to a capacity problem at the node level. The fourth is node pressure, the resource exhaustion on workers that causes Kubernetes to evict pods to protect the node. Watching these four tells you whether the cluster is keeping its promises, and they are far more useful than a flood of per-pod metrics.
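
The four signals can be sketched as a single check over a snapshot of cluster state. The ClusterSnapshot fields, the evaluate_signals function, and the restart threshold of five are assumptions chosen for illustration. How the snapshot is populated, whether from the Kubernetes API, a metrics pipeline, or OCI Monitoring, is deliberately left out.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterSnapshot:
    """Hypothetical point-in-time view of one service and its cluster."""
    desired_replicas: int
    ready_replicas: int
    restart_counts: dict[str, int] = field(default_factory=dict)  # pod -> restarts
    pending_pods: int = 0
    nodes_under_pressure: list[str] = field(default_factory=list)


def evaluate_signals(snap: ClusterSnapshot, restart_threshold: int = 5) -> list[str]:
    """Return one finding per signal that looks unhealthy, in a fixed order."""
    findings = []

    # 1. Replica health: gap between desired and ready replicas.
    gap = snap.desired_replicas - snap.ready_replicas
    if gap > 0:
        findings.append(f"replica gap: {gap} missing")

    # 2. Restart behaviour: pods restarting often enough to suggest a crash loop.
    looping = sorted(p for p, n in snap.restart_counts.items()
                     if n >= restart_threshold)
    if looping:
        findings.append(f"possible crash loop: {', '.join(looping)}")

    # 3. Scheduling: pods the cluster has nowhere to place.
    if snap.pending_pods > 0:
        findings.append(f"scheduling: {snap.pending_pods} pending")

    # 4. Node pressure: workers running out of resources.
    if snap.nodes_under_pressure:
        findings.append(f"node pressure: {', '.join(snap.nodes_under_pressure)}")

    return findings


snap = ClusterSnapshot(
    desired_replicas=8,
    ready_replicas=6,
    restart_counts={"web-a": 7, "web-b": 0},
    pending_pods=2,
    nodes_under_pressure=["node-1"],
)
findings = evaluate_signals(snap)
for finding in findings:
    print(finding)
```

The pod and node names here are invented. A healthy snapshot returns an empty list, which is the property that keeps this kind of check quiet during normal churn.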

A framework for monitoring OKE

Setting up OKE monitoring is most effective when approached layer by layer, from the foundation upward.

1. Cover the nodes. Watch worker node resources and readiness so you know when the foundation is under pressure before pods start being evicted.
2. Watch scheduling and capacity. Track pending pods and cluster capacity so you catch the case where there is nowhere to run new work.
3. Track desired state. Alarm on the gap between desired and actual replicas and on repeated restarts, the signals that the service itself is degraded.
4. Instrument the application. Add application performance monitoring so you can see request latency and errors inside the containers, not just their existence.
5. Centralise the logs. Ship pod logs to a central place before the pods disappear, because a pod that is deleted or evicted takes its local logs with it.

This order builds coverage from the platform up to the application, so that when something goes wrong you can see which layer it started at rather than guessing from a single symptom.
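
One way to use layered coverage during an incident is to start looking at the lowest layer that is currently alarming. The sketch below encodes that heuristic. The layer names follow the table earlier in this guide, while the likely_origin function and the idea of passing in a set of alarming layers are assumptions for illustration. The result is a place to begin investigating, not proof of root cause.

```python
from typing import Optional

# Layers ordered from the foundation upward, as in the table above.
LAYERS = ("worker nodes", "cluster", "workloads", "application")


def likely_origin(alarming: set[str]) -> Optional[str]:
    """Return the lowest layer with an active alarm, or None if all are quiet.

    Heuristic only: trouble low in the stack often shows up as symptoms
    higher up, so the lowest alarming layer is a sensible starting point.
    """
    for layer in LAYERS:
        if layer in alarming:
            return layer
    return None


# Slow requests, missing replicas, and node pressure all at once:
# begin with the nodes rather than the application symptom.
print(likely_origin({"application", "workloads", "worker nodes"}))  # worker nodes

# Only the application is alarming: the platform beneath it looks healthy.
print(likely_origin({"application"}))  # application
```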

Logs from things that disappear

One detail trips up almost everyone new to monitoring containers. When a pod is destroyed, its logs go with it unless they were collected first. On a fixed server, logs sit on disk and you can read them after the fact. On Kubernetes, the pod that crashed and was replaced an hour ago is gone, and so is its evidence, unless its logs were being shipped continuously to a central store as they were produced. This makes centralised logging not a nicety but a requirement for any serious OKE monitoring, because it is often the only way to investigate something that has already vanished.
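
The difference between shipping logs as they are produced and collecting them afterwards can be shown with a small simulation. Everything here is a stand-in: CentralStore is an in-memory list rather than a real logging service, ship_logs is a hypothetical forwarder, and the crashing generator only imitates a pod that disappears partway through its work.

```python
from typing import Iterable


class CentralStore:
    """In-memory stand-in for a central log store (hypothetical)."""

    def __init__(self) -> None:
        self.records: list[tuple[str, str]] = []

    def append(self, pod: str, line: str) -> None:
        self.records.append((pod, line))


def ship_logs(pod: str, stream: Iterable[str], store: CentralStore) -> None:
    """Forward each line to the store as soon as it is read.

    Nothing is buffered until the end, so lines already emitted
    survive even if the source fails partway through.
    """
    for line in stream:
        store.append(pod, line.rstrip("\n"))


def crashing_pod():
    """Simulated pod output that ends abruptly."""
    yield "starting up\n"
    yield "connecting to database\n"
    raise RuntimeError("pod crashed")  # stands in for the pod disappearing


store = CentralStore()
try:
    ship_logs("web-7f9c", crashing_pod(), store)
except RuntimeError:
    pass  # the pod is gone, but its earlier lines were already shipped

for pod, line in store.records:
    print(pod, line)
```

The two lines emitted before the failure remain in the store afterwards, which is the property a real log pipeline has to provide.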

Keeping observability stable on a moving target

Monitoring OKE comes down to accepting that the individual pieces are transient and building observability around the things that persist, which are the service, its desired state, and the centrally collected record of what happened. Watch the service rather than the instance, cover every layer from node to application, and make sure the evidence survives the disappearance of the thing that produced it. Done this way, the constant churn of a Kubernetes cluster stops being a monitoring problem and becomes simply the normal background against which a clear picture of service health is maintained. This sits within the wider practice in the complete monitoring and observability guide, and connects closely to health checks and probes that tell Kubernetes itself when a pod is ready. When you are running containers on OKE and want monitoring that survives the churn, our OCI monitoring and observability practice builds it around the service rather than the pod.

Part of a series
This guide is part of OCI Operations & Observability — our complete pillar guide on the topic.

About the author

Morten Andersen, Co-founder of OCI Specialists. 20 years of enterprise IT experience in OCI migration, security, networking, and 24/7 operations.