Most OKE problems in production trace back to architecture decisions made early and never revisited. A cluster that was stood up quickly to prove a point becomes the cluster that runs the business, with a single node pool, no spread across failure boundaries and a public API endpoint nobody meant to leave open. Architecting an OKE cluster deliberately is not complicated, but it does require thinking about failure, growth and isolation before the first workload lands. This article lays out how to design an OKE cluster that holds up.
It sits within our OKE and containers series and assumes you have already stood up a cluster by following getting started with OKE.
Every OKE cluster has a managed control plane that Oracle operates and worker capacity that you operate. Architecture is mostly about the worker side and about how the cluster connects to the rest of your network, because control plane availability is Oracle's responsibility, and on enhanced clusters it carries a financially backed SLA. Your job is to make sure that when something fails on the worker side or in a single location, the cluster keeps serving.
OCI gives you two failure boundaries to design against. Availability domains are physically separate data centres within a region, and fault domains are isolated groups of hardware within a single availability domain. Some regions have multiple availability domains and some have one. The architecture goal is simple: never let a single failure boundary take your cluster down.
| Failure boundary | What it protects against | How to use it |
|---|---|---|
| Fault domain | Hardware, rack or power failure within a data centre | Spread node pool nodes across all fault domains |
| Availability domain | Loss of an entire data centre | Spread nodes across availability domains where the region has more than one |
| Region | Loss of an entire region | Run a second cluster in another region for disaster recovery |
In a region with multiple availability domains, spread your worker nodes across availability domains and let OKE place them across fault domains within each. In a region with a single availability domain, you cannot protect against the loss of that data centre within one cluster, so fault domain spread and a disaster recovery region carry more weight.
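The effect of spreading can be made concrete with a small calculation. The following is a minimal sketch, not how OKE itself places nodes: it assumes a simple round-robin placement, three fault domains per availability domain, and hypothetical availability domain names such as `AD-1`. The functions `spread_nodes` and `largest_share_lost` are illustrative names only.

```python
from collections import Counter


def spread_nodes(node_count, availability_domains, fault_domains_per_ad=3):
    """Round-robin nodes across ADs, then across fault domains inside each AD.

    Assumption: this mimics an even spread for illustration; it is not
    the placement logic OKE uses.
    """
    if node_count < 0 or not availability_domains or fault_domains_per_ad < 1:
        raise ValueError("need a non-negative node count, an AD and a fault domain")
    placed_in_ad = Counter()
    placements = []
    for i in range(node_count):
        ad = availability_domains[i % len(availability_domains)]
        fd_number = placed_in_ad[ad] % fault_domains_per_ad + 1
        placed_in_ad[ad] += 1
        placements.append((ad, f"FAULT-DOMAIN-{fd_number}"))
    return placements


def largest_share_lost(placements, boundary):
    """Fraction of nodes lost if the most heavily loaded boundary fails."""
    if not placements:
        raise ValueError("no placements to evaluate")
    if boundary == "availability_domain":
        keys = [ad for ad, _ in placements]
    elif boundary == "fault_domain":
        # A fault domain is only unique together with its AD.
        keys = list(placements)
    else:
        raise ValueError("boundary must be 'availability_domain' or 'fault_domain'")
    return max(Counter(keys).values()) / len(placements)


# Hypothetical AD names; real names are tenancy and region specific.
multi_ad = spread_nodes(6, ["AD-1", "AD-2", "AD-3"])
single_ad = spread_nodes(6, ["AD-1"])
print(largest_share_lost(multi_ad, "availability_domain"))
print(largest_share_lost(single_ad, "availability_domain"))
print(largest_share_lost(single_ad, "fault_domain"))
```

With six nodes, losing one availability domain in a three availability domain region removes a third of the capacity, while in a single availability domain region it removes all of it. That gap is why the disaster recovery region matters more in the second case.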
A node pool is a group of identical worker nodes. The instinct is to run one big pool for everything, but separate pools matched to workload needs are almost always better. Stateless web services, batch jobs, GPU-accelerated machine learning and memory-heavy applications all have different shape requirements, and mixing them in one pool means compromising on shape for everything. Separate pools let you size each for its workload and scale them independently.
A common production pattern is a general pool of balanced shapes for stateless services, a separate pool of larger or specialised shapes for heavier workloads, and virtual nodes for bursty or unpredictable work. The mix of managed node pools and virtual nodes is covered in OKE virtual nodes explained, and sizing pools for cost is in OKE cost optimization.
The Kubernetes API endpoint is how everyone and everything talks to the cluster, and where it lives is a security decision with architectural consequences. A public endpoint is reachable from the internet, which is convenient and dangerous. A private endpoint is reachable only from within your network, which is safer and the right default for production. If you need to reach a private endpoint from outside, you do it through a bastion or a VPN rather than exposing it. This choice ties into the broader security model in OKE security best practices.
An OKE cluster lives inside a virtual cloud network, and the subnet layout matters. You separate the subnets for the API endpoint, the worker nodes and the load balancers, and you size the pod subnet generously if you use VCN-native pod networking, because every pod takes an IP from it. Running out of pod IPs is a painful, avoidable failure. Security lists and network security groups then control what can talk to what. The networking model in full is in OKE networking explained.
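Pod subnet sizing comes down to simple arithmetic, sketched here with Python's standard `ipaddress` module. The sketch assumes three addresses are reserved in every subnet and that subnets range from /30 to /16; check both against current OCI documentation. The function names, the 30 percent headroom and the example figures are illustrative, not recommendations.

```python
import ipaddress

# Assumption: three addresses per subnet are reserved and unusable by pods.
RESERVED_PER_SUBNET = 3


def usable_ips(cidr):
    """Number of addresses in the subnet that pods could actually take."""
    network = ipaddress.ip_network(cidr)
    return max(network.num_addresses - RESERVED_PER_SUBNET, 0)


def smallest_prefix_for(pod_count, headroom_pct=30):
    """Longest prefix (smallest subnet) that fits pod_count plus headroom."""
    if pod_count < 0 or headroom_pct < 0:
        raise ValueError("pod_count and headroom_pct must be non-negative")
    # Integer ceiling division avoids floating point rounding surprises.
    with_headroom = -(-pod_count * (100 + headroom_pct) // 100)
    needed = with_headroom + RESERVED_PER_SUBNET
    for prefix in range(30, 15, -1):  # try /30 first, down to /16
        if 2 ** (32 - prefix) >= needed:
            return prefix
    raise ValueError("pod count does not fit in a single /16 subnet")


# Hypothetical cluster: 50 nodes at up to 31 pods each.
peak_pods = 50 * 31
print(usable_ips("10.0.32.0/24"))
print(smallest_prefix_for(peak_pods))
```

Running the numbers for the cluster you expect in a year, rather than the one you have today, is usually what reveals that a /24 pod subnet is too small.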
Architecture should anticipate growth without overbuilding on day one. The autoscaler handles node count growth within a pool, so you do not need to pre-provision capacity, but you do need headroom in the underlying limits: enough IP addresses in the pod subnet, high enough service limits in the tenancy, and node shapes that can scale to the load you expect. Designing the network and limits for the cluster you will have in a year, while running the capacity you need today, is the balance to strike. Scaling mechanics are in autoscaling OKE workloads.
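One way to keep that balance honest is to compare the autoscaler's maximum against the limits it would run into. The sketch below is a hypothetical planning aid, not an OCI API: `GrowthPlan`, `Limits` and every number in it are assumptions for illustration, and a real check would read the actual subnet size and tenancy limits.

```python
from dataclasses import dataclass


@dataclass
class GrowthPlan:
    """Where the cluster is expected to be in a year (illustrative fields)."""
    max_nodes: int
    pods_per_node: int
    ocpus_per_node: int


@dataclass
class Limits:
    """Ceilings the plan must fit under (illustrative fields)."""
    pod_subnet_usable_ips: int
    tenancy_ocpu_limit: int


def headroom_report(plan, limits):
    """Spare capacity per limit at full scale; negative means a shortfall."""
    peak_pods = plan.max_nodes * plan.pods_per_node
    peak_ocpus = plan.max_nodes * plan.ocpus_per_node
    return {
        "pod_ips": limits.pod_subnet_usable_ips - peak_pods,
        "ocpus": limits.tenancy_ocpu_limit - peak_ocpus,
    }


def shortfalls(plan, limits):
    """Names of the limits the plan would exceed, sorted for stable output."""
    report = headroom_report(plan, limits)
    return sorted(name for name, spare in report.items() if spare < 0)


# Hypothetical figures: a /22 pod subnet and a 200 OCPU tenancy limit.
plan = GrowthPlan(max_nodes=40, pods_per_node=31, ocpus_per_node=4)
limits = Limits(pod_subnet_usable_ips=1021, tenancy_ocpu_limit=200)
print(headroom_report(plan, limits))
print(shortfalls(plan, limits))
```

In this example the compute limit has room to spare but the pod subnet does not, which is the kind of gap that is cheap to fix before the cluster is built and awkward to fix afterwards.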
A recurring architecture question is whether to run one large shared cluster or several smaller ones. Several smaller clusters give stronger isolation between environments and teams, a contained blast radius and independent upgrade schedules, at the cost of more clusters to operate. One large cluster is cheaper to run and easier to observe, but a problem in it affects everyone. Most mature estates separate at least production from non-production into different clusters, and often separate by team or by sensitivity beyond that. The right answer depends on your isolation needs and your operational capacity.
A well-architected OKE cluster is one where no single failure boundary causes an outage, where workloads run on capacity sized for their needs, where the API endpoint is not exposed, and where the network has room to grow. None of that is exotic, but all of it is far easier to build in at the start than to retrofit. Once the architecture is set, the operational disciplines of scaling, security, delivery and observability sit on top of it cleanly. Continue with OKE networking explained and OKE security best practices.
The OKE solution practice designs cluster architectures to this reference and builds them on a fixed project fee, with managed operations available afterward.