Designing for High Availability on OCI

Most outages are not the cloud failing. They are an architecture that assumed nothing would fail. High availability on OCI is the discipline of removing single points of failure so the loss of one component is not the loss of the service.

Published Oct 9, 2024 · OCI Specialists · 11 min read
Most outages are not caused by the cloud failing. They are caused by an architecture that assumed nothing would fail. A single instance in a single availability domain serving a workload that the business depends on is a design decision, and it is a decision that holds right up until the moment the underlying hardware, the network path, or the wider domain has a bad day. High availability on OCI is the discipline of removing single points of failure so that the loss of one component does not become the loss of the service. This article sets out how to design for it.

High availability is not a feature you switch on. It is a property that emerges from how you place and connect components, and it costs real money and real complexity, so it has to be applied where the business value justifies it rather than everywhere by reflex. The sections below explain the building blocks OCI gives you, the patterns that combine them, and how to decide how much availability a given workload actually needs. This article builds on the foundations in OCI Landing Zone and Architecture: A Complete Guide and the design discipline in OCI Well Architected Framework Explained.

The building blocks: regions, availability domains and fault domains

OCI gives you three levels of physical separation, and understanding them is the whole basis of availability design. A region is a geographic location, and regions are entirely independent of each other, which makes them the unit of separation for disaster recovery. Within a region there are one or more availability domains, which are isolated data centres with independent power, cooling and networking, so a problem in one availability domain is very unlikely to affect another. Within each availability domain there are three fault domains, which are groupings of hardware that fail independently, so spreading instances across fault domains protects against the loss of a rack or a hardware group. Designing for availability means choosing, for each workload, how far across these levels you spread its components.

Level | Protects against | Typical use
Fault domain | Loss of a hardware group within a data centre | Always, it is free to spread across them
Availability domain | Loss of a whole data centre | Production workloads needing high uptime
Region | Loss of an entire geographic location | Disaster recovery and the most critical services

Availability is not a switch. It is a series of placement decisions, and each level you protect against costs more than the last.
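
To put rough numbers on why spreading helps, the sketch below uses the standard redundancy arithmetic: if each copy of a component is available a fraction a of the time, and failures are independent, then n copies where any one is enough give 1 - (1 - a)^n. The independence assumption is the important caveat, because copies that share a rack, a data centre or a region do not fail independently, which is exactly why the levels above exist. The function names and the 0.99 figure in the usage note are illustrative, not OCI figures.

```python
def combined_availability(single: float, replicas: int) -> float:
    """Availability of `replicas` independent copies when one healthy copy is enough.

    Assumes failures are independent, which only holds when the copies
    do not share hardware, power or network.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be a probability between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The service is down only when every copy is down at the same time.
    return 1.0 - (1.0 - single) ** replicas


def downtime_minutes_per_year(availability: float) -> float:
    """Expected downtime in minutes over a 365 day year."""
    return (1.0 - availability) * 365 * 24 * 60
```

With an illustrative single copy availability of 0.99, combined_availability(0.99, 2) comes to roughly 0.9999, so expected downtime falls from about 5,256 minutes a year to about 53. Real figures will be worse wherever the copies share a failure mode.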

Spreading compute across fault domains and availability domains

The simplest and cheapest availability win is to run more than one instance and spread them across fault domains, because the spread costs nothing extra and removes the single hardware group as a point of failure. The next step is to spread across availability domains, which protects against the loss of a whole data centre but requires that the workload can run in more than one domain at once, with a load balancer distributing traffic across them. Not every region has multiple availability domains, so this pattern depends on the region you chose, which is one reason region selection, covered in Multi Region Architecture on OCI, matters so much for availability.
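
As a minimal sketch of the spreading logic, the function below assigns instances round robin across availability domains first and fault domains second. It is a planning aid only and does not call any OCI API. The labels such as "AD-1" and "FD-1" are placeholders, since real availability domain names are specific to a tenancy.

```python
from itertools import cycle, product
from typing import List, Sequence, Tuple


def spread_placements(
    count: int,
    availability_domains: Sequence[str],
    fault_domains: Sequence[str],
) -> List[Tuple[str, str]]:
    """Return an (availability domain, fault domain) pair for each instance.

    Consecutive instances land in different availability domains first,
    then in different fault domains, wrapping around when slots run out.
    """
    if count < 0:
        raise ValueError("count must not be negative")
    if not availability_domains or not fault_domains:
        raise ValueError("at least one availability domain and one fault domain are required")
    # product() varies its last argument fastest, so the availability
    # domain changes on every step and the fault domain on every full pass.
    slots = [(ad, fd) for fd, ad in product(fault_domains, availability_domains)]
    return [slot for slot, _ in zip(cycle(slots), range(count))]
```

In a region with a single availability domain the same function still spreads instances across fault domains, which is the protection that remains available there.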

Load balancing as the front door

A load balancer is the component that makes multiple instances look like one service, distributing requests across healthy instances and removing unhealthy ones from rotation. It is the front door to a highly available compute tier, and it should itself be highly available, which on OCI means it spans availability domains in regions that have more than one, so that the front door does not become the single point of failure you worked to eliminate behind it. Health checks are central here, because the load balancer can only route around a failed instance if it can detect the failure. Designing meaningful health checks that reflect whether the application is genuinely serving traffic is therefore as important as the load balancer itself.
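
The behaviour described above can be sketched in a few lines. This is a toy model of round robin routing with health checks, not the OCI Load Balancer service or its API, and the Backend and RoundRobinBalancer names are hypothetical. The probe is meant to stand for a check that exercises the application itself rather than only confirming the host is reachable.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Backend:
    name: str
    probe: Callable[[], bool]  # should return True only if the app is really serving
    healthy: bool = True


class RoundRobinBalancer:
    def __init__(self, backends: Iterable[Backend]) -> None:
        self._backends: List[Backend] = list(backends)
        self._next = 0

    def run_health_checks(self) -> None:
        """Refresh each backend's health; a probe that raises counts as a failure."""
        for backend in self._backends:
            try:
                backend.healthy = bool(backend.probe())
            except Exception:
                backend.healthy = False

    def pick(self) -> Backend:
        """Return the next healthy backend, skipping any marked unhealthy."""
        total = len(self._backends)
        for offset in range(total):
            candidate = self._backends[(self._next + offset) % total]
            if candidate.healthy:
                self._next = (self._next + offset + 1) % total
                return candidate
        raise RuntimeError("no healthy backends available")
```

The point to notice is that pick() only knows what the last health check told it. A probe that returns True while the application is failing requests keeps a broken instance in rotation.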

Database availability

The database is usually the hardest part of an availability design, because it holds state and state cannot simply be duplicated across instances the way stateless compute can. OCI offers several answers depending on the database, from clustering technologies that keep multiple database instances serving the same data, to standby databases that replicate from a primary and can take over if it fails. The right choice depends on how much downtime and how much data loss the workload can tolerate, expressed as recovery time and recovery point objectives, which is the same framing used for the broader continuity planning in our disaster recovery work. Getting the database tier right is often where most of the availability budget goes, and rightly so, because it is usually where the business risk concentrates.
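
One way to make the recovery objective framing concrete is to write the candidate patterns down with the recovery time and recovery point each is expected to achieve, then filter by what the workload can tolerate. The sketch below does that. Every number in EXAMPLE_PATTERNS is a placeholder chosen for illustration, not a measurement or an Oracle figure, and the real values have to come from your own design and testing.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List, Sequence


@dataclass(frozen=True)
class DatabasePattern:
    name: str
    expected_rto: timedelta  # how long until service is restored
    expected_rpo: timedelta  # how much recent data may be lost
    relative_cost: int       # higher means more expensive; unitless


# Placeholder catalogue: the figures are assumptions for illustration only.
EXAMPLE_PATTERNS = [
    DatabasePattern("standby database", timedelta(minutes=15), timedelta(minutes=1), 2),
    DatabasePattern("clustered database", timedelta(minutes=1), timedelta(0), 3),
]


def patterns_meeting(
    objective_rto: timedelta,
    objective_rpo: timedelta,
    patterns: Sequence[DatabasePattern],
) -> List[DatabasePattern]:
    """Return the patterns that satisfy both objectives, cheapest first."""
    fits = [
        p for p in patterns
        if p.expected_rto <= objective_rto and p.expected_rpo <= objective_rpo
    ]
    return sorted(fits, key=lambda p: p.relative_cost)
```

An empty result is useful information in its own right: it means either the objectives or the budget has to move before the design can proceed.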

Stateless design makes availability easier

Workloads that hold no state in the compute tier are far easier to make highly available, because any instance can serve any request and a failed instance can be replaced without losing anything. The architectural move that pays off here is pushing state out of the compute tier and into managed services such as databases, object storage or caches, so that the compute tier becomes a herd of interchangeable instances rather than a set of irreplaceable ones. This is the same principle that underpins the scaling patterns in Scaling Patterns on OCI, and it is one of the highest leverage decisions in the whole design, because it makes both availability and scaling dramatically simpler.
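
A small sketch shows what interchangeable means in practice. The instance below keeps nothing between requests; everything it needs lives in a shared store. InMemoryStore is a stand-in, assumed here purely so the example runs on its own, for whatever external service holds the state in a real design, and all names are hypothetical.

```python
from typing import Dict


class InMemoryStore:
    """Stand-in for an external store such as a database or cache."""

    def __init__(self) -> None:
        self._data: Dict[str, int] = {}

    def get(self, key: str) -> int:
        return self._data.get(key, 0)

    def put(self, key: str, value: int) -> None:
        self._data[key] = value


class Instance:
    """A compute instance that holds no session state of its own."""

    def __init__(self, name: str, store: InMemoryStore) -> None:
        self.name = name
        self._store = store  # shared and external to the instance

    def add_to_cart(self, session_id: str) -> int:
        # Read, update and write back through the store on every request.
        # A real store would need an atomic increment to be safe when
        # several instances serve the same session at once.
        count = self._store.get(session_id) + 1
        self._store.put(session_id, count)
        return count
```

If one instance handles the first request for a session and is then lost, a second instance sharing the same store carries on from the same count, which is the property that lets a load balancer send any request anywhere.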

Health checks, failover and testing

An availability design that has never been tested is a hypothesis, not a guarantee. The components may be spread correctly and the failover configured, but until you have actually removed an instance, an availability domain or a database primary and watched the service stay up, you do not know that the design works. Building in regular failover testing, ideally automated, is what turns an availability design from something that looks right on a diagram into something the business can rely on. The same applies to the health checks that drive failover, which should be tested to confirm they detect real failures rather than only the obvious ones.
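
An automated drill can be as simple as a loop that injects one failure at a time, checks whether the service is still up, and always restores afterwards. The harness below is a sketch under that assumption. The inject, restore and service_is_up callables are hypothetical hooks that you would wire to your own tooling, and a real drill would also wait for failover to complete before probing.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# A scenario is (name, inject failure, restore component).
Scenario = Tuple[str, Callable[[], None], Callable[[], None]]


@dataclass
class DrillResult:
    scenario: str
    service_stayed_up: bool


def run_failover_drill(
    scenarios: Sequence[Scenario],
    service_is_up: Callable[[], bool],
) -> List[DrillResult]:
    """Run each failure scenario in turn and record whether the service survived."""
    results: List[DrillResult] = []
    for name, inject, restore in scenarios:
        # Start from a healthy baseline so each result is attributable
        # to the failure injected in this scenario.
        if not service_is_up():
            raise RuntimeError(f"service is not healthy before scenario {name!r}")
        inject()
        try:
            results.append(DrillResult(name, service_is_up()))
        finally:
            restore()  # always put the component back, even if the probe raised
    return results
```

Any scenario that comes back with service_stayed_up set to False is a single point of failure the diagram did not show.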

A framework for designing high availability

  1. Define the objectives for the workload: how much downtime and data loss it can tolerate.
  2. Remove the cheap single points of failure by spreading compute across fault domains.
  3. Spread across availability domains where the uptime requirement justifies it.
  4. Put a highly available load balancer in front of the compute tier.
  5. Choose a database availability pattern matched to the recovery objectives.
  6. Push state out of compute so instances become interchangeable.
  7. Test failover regularly so the design is proven, not assumed.

Matching availability to business value

The temptation with availability is to apply the highest level everywhere, but that is expensive and usually unnecessary, because not every workload carries the same business risk. A development environment does not need multi availability domain redundancy, and a reporting system that can be down for an hour does not need the same design as a payment path that cannot be down at all. The discipline is to tier your workloads by how much an outage actually costs the business, and apply availability investment in proportion, so the money goes where the risk is. This tiering conversation is one we have early in an engagement, because it shapes the whole architecture and the whole bill, and it prevents both the under engineered design that fails and the over engineered design that wastes money.
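
The tiering itself can be expressed as a simple rule once the business has put a cost on an hour of outage. The sketch below maps that cost to the furthest level of separation worth paying for. The thresholds are deliberately left as required arguments, because they are a business decision and any default would be an invention, and the Tier names are hypothetical labels for the three levels described earlier.

```python
from enum import Enum


class Tier(Enum):
    FAULT_DOMAINS = "spread across fault domains"
    AVAILABILITY_DOMAINS = "spread across availability domains"
    REGIONS = "spread across regions"


def tier_for(
    outage_cost_per_hour: float,
    availability_domain_threshold: float,
    region_threshold: float,
) -> Tier:
    """Pick the furthest level of separation the outage cost justifies.

    Both thresholds are supplied by the caller in the same currency as
    the outage cost; they are assumptions, not OCI guidance.
    """
    if outage_cost_per_hour < 0:
        raise ValueError("outage cost must not be negative")
    if availability_domain_threshold > region_threshold:
        raise ValueError("the region threshold must not be below the availability domain threshold")
    if outage_cost_per_hour >= region_threshold:
        return Tier.REGIONS
    if outage_cost_per_hour >= availability_domain_threshold:
        return Tier.AVAILABILITY_DOMAINS
    return Tier.FAULT_DOMAINS
```

Fault domain spreading is the floor rather than an option, since it costs nothing extra, so even the lowest tier returns it.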

Where this fits the engagement

Designing high availability is part of our OCI Consulting and Advisory work, where we tier workloads and design the availability each one needs. It connects directly to our Disaster Recovery and HA practice, which covers the continuity across regions that sits above availability within a region. The aim is a design where the loss of any single component is a non event for the business, achieved at a cost that matches the value of the workloads it protects.
