Designing Fault Domains on OCI

Published Dec 17, 2025 · Updated May 26, 2026 · 8 min read · OCI Specialists, independent OCI services

[Image: server racks in a data centre, representing fault domains]

The protection that costs nothing extra

Fault domains are one of the most useful resilience features on OCI, and one of the most overlooked, partly because they are free. A fault domain is a grouping of hardware within a single availability domain, arranged so that a hardware failure or a maintenance event affects only one fault domain at a time. By spreading your instances across fault domains, you protect against the loss of a rack or a maintenance window without leaving the availability domain and without paying for a second region.

This article explains what fault domains are, how they relate to availability domains, and how to design with them so a single hardware event does not take your application down.

Fault domains, availability domains, and regions

OCI organises infrastructure in a hierarchy, and using it well means knowing which level protects against what. A region is a geographic location. Inside a region there are one or more availability domains, each made up of one or more isolated data centres. Inside each availability domain there are three fault domains, which are groupings of hardware that fail independently. Each level guards against a larger and rarer kind of failure, and a complete design considers all three.

| Level | Isolates against | Typical use |
| --- | --- | --- |
| Fault domain | Rack or hardware failure, maintenance | Spread instances within an AD |
| Availability domain | Data centre failure | Spread across ADs where available |
| Region | Regional outage | Cross region DR |

The mistake is reaching for the heavy, expensive levels while ignoring the free one. Spreading across fault domains is the first thing to get right, because it costs nothing and removes a whole class of single points of failure.

How fault domains protect you

When OCI performs maintenance on the underlying hardware, it does so one fault domain at a time. When a hardware failure occurs, it is contained within a fault domain. So if you run two or three instances of a service and place each in a different fault domain, a maintenance event or a hardware failure can only ever take one of them out, leaving the others serving. Place all your instances in one fault domain and a single event takes the whole service down. The protection is entirely in how you distribute, and it is yours for free if you ask for it.
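The difference between a spread placement and a stacked one can be made concrete with a few lines of Python. This is a minimal sketch rather than anything OCI provides: the instance names and the `placement` dictionary are hypothetical, and the fault domain labels follow the `FAULT-DOMAIN-1` to `FAULT-DOMAIN-3` naming that OCI uses.

```python
from collections import Counter


def survivors_after_fd_loss(placement: dict[str, str], failed_fd: str) -> list[str]:
    """Return the instances still running after one fault domain goes down."""
    return sorted(name for name, fd in placement.items() if fd != failed_fd)


def worst_case_survivors(placement: dict[str, str]) -> int:
    """Smallest number of instances left after losing any single fault domain."""
    per_fd = Counter(placement.values())
    # The worst case is losing the fault domain that holds the most instances.
    return len(placement) - max(per_fd.values(), default=0)


# Hypothetical placements: one spread across fault domains, one stacked in a single one.
spread = {"web-1": "FAULT-DOMAIN-1", "web-2": "FAULT-DOMAIN-2", "web-3": "FAULT-DOMAIN-3"}
stacked = {"web-1": "FAULT-DOMAIN-1", "web-2": "FAULT-DOMAIN-1", "web-3": "FAULT-DOMAIN-1"}

print(worst_case_survivors(spread))   # two instances keep serving
print(worst_case_survivors(stacked))  # nothing is left
```

The number that matters is the worst case: a placement is only as resilient as what remains when its most crowded fault domain disappears.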

Designing with fault domains

The pattern is simple. For any service you want to keep available, run more than one instance and distribute them across the three fault domains. Put a load balancer in front so traffic flows only to healthy instances. When you use instance pools, configure the pool to spread its members across fault domains automatically rather than letting them land together. The same logic applies to database nodes in a cluster, where the nodes should sit in different fault domains so a hardware event cannot take the whole cluster.
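One simple way to get an even distribution is round-robin assignment. The sketch below assumes you control placement yourself and only computes the intended fault domain for each instance; the function name and instance names are illustrative, and passing the result to whatever tool launches the instances is left out.

```python
from itertools import cycle

# OCI labels the three fault domains in each availability domain like this.
FAULT_DOMAINS = ("FAULT-DOMAIN-1", "FAULT-DOMAIN-2", "FAULT-DOMAIN-3")


def assign_fault_domains(instance_names: list[str]) -> dict[str, str]:
    """Assign instances round-robin so per-domain counts differ by at most one."""
    # zip stops when the instance names run out, so cycling forever is safe here.
    return dict(zip(instance_names, cycle(FAULT_DOMAINS)))


placement = assign_fault_domains(["app-1", "app-2", "app-3", "app-4"])
for name, fault_domain in placement.items():
    print(name, fault_domain)
```

With four instances the fourth wraps back to the first fault domain, so the worst single event still removes at most two of the four.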

Make the distribution explicit in your infrastructure as code so it survives changes and rebuilds. A fault domain spread that exists only because someone placed instances carefully by hand will quietly collapse the next time the environment is recreated.

Where fault domains stop and availability domains begin

Fault domains protect against hardware and maintenance events inside one availability domain. They do not protect against the loss of the whole availability domain. In regions that offer multiple availability domains, the stronger pattern spreads across availability domains as well, so the failure of an entire data centre still leaves the service running. The trade-off is that traffic and storage behave differently across availability domains than within one (a block volume, for example, lives in a single availability domain and can only be attached to instances in that same domain), so design deliberately. The next step up, surviving the loss of a region, is covered in the cross region DR and availability domains articles.
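Where a region does offer several availability domains, the two levels can be combined. The following is a rough sketch under stated assumptions: the availability domain names `AD-1` and `AD-2` are placeholders (real names are specific to each tenancy), and the function simply alternates availability domains first and moves to the next fault domain once every availability domain has been used.

```python
FAULT_DOMAINS = ("FAULT-DOMAIN-1", "FAULT-DOMAIN-2", "FAULT-DOMAIN-3")


def spread_across_ads_and_fds(
    instance_names: list[str], availability_domains: list[str]
) -> dict[str, tuple[str, str]]:
    """Map each instance to an (availability domain, fault domain) pair."""
    if not availability_domains:
        raise ValueError("at least one availability domain is required")
    ad_count = len(availability_domains)
    placement = {}
    for index, name in enumerate(instance_names):
        # Alternate availability domains first, then advance the fault domain.
        ad = availability_domains[index % ad_count]
        fd = FAULT_DOMAINS[(index // ad_count) % len(FAULT_DOMAINS)]
        placement[name] = (ad, fd)
    return placement


for name, slot in spread_across_ads_and_fds(["db-1", "db-2", "db-3", "db-4"], ["AD-1", "AD-2"]).items():
    print(name, slot)
```

In a region with a single availability domain the same function degrades to a plain fault domain spread, which matches the advice above.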

A layered resilience model

Think of resilience as layers you add in order of cost. First, spread across fault domains, which is free and removes hardware and maintenance single points of failure. Second, where the region supports it, spread across availability domains to survive a data centre loss. Third, replicate to a second region to survive a regional outage. Each layer addresses a rarer and more severe failure, and you add only the layers the workload justifies. Starting at the top, paying for cross region before you have even used fault domains, is the most common and least efficient way to build resilience.
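The layering can be written down as a small decision function. This is an illustrative sketch, not a sizing tool: the `Workload` fields and layer names are invented for the example, and the fallback for single availability domain regions (going straight to a second region when a workload must survive a data centre loss) is an assumption of this sketch rather than something the model above prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    must_survive_data_centre_loss: bool
    must_survive_regional_outage: bool


def resilience_layers(workload: Workload, availability_domains_in_region: int) -> list[str]:
    """Return the layers a workload needs, cheapest first."""
    layers = ["fault domain spread"]  # free, so every workload gets it
    if workload.must_survive_data_centre_loss:
        if availability_domains_in_region > 1:
            layers.append("availability domain spread")
        else:
            # Assumption: with one availability domain, only a second region
            # covers the loss of the data centre.
            layers.append("cross region replication")
    if workload.must_survive_regional_outage and "cross region replication" not in layers:
        layers.append("cross region replication")
    return layers


print(resilience_layers(Workload("reporting", False, False), 3))
print(resilience_layers(Workload("payments", True, True), 3))
```

Every result starts with the fault domain layer, which is the point of the model: the expensive layers are additions to it, never substitutes for it.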

Putting it into practice

Audit your current estate and check that every service you care about is actually spread across fault domains, rather than assuming it happened by default. Put a load balancer in front, confirm the instance pool distribution, and encode the placement in infrastructure as code so it stays true. Then decide which services also warrant availability domain spread and which warrant cross region recovery, following the objectives in RTO and RPO planning and the full design in the disaster recovery pillar.

Common mistakes with fault domains

The most common mistake is assuming you have fault domain protection when you do not. Instances launched without a deliberate placement can land in the same fault domain, so two web servers you believe are resilient may both disappear in a single maintenance event. Always check the actual distribution rather than trusting that it happened by default. The second mistake is protecting the application tier but forgetting the data tier, leaving a database whose nodes share a fault domain as the real single point of failure.

The third mistake is letting the distribution decay. A careful manual placement collapses the next time the environment is rebuilt or scaled, because the new instances do not inherit the intent. Encoding the placement in infrastructure as code is the only way to keep it true over the life of the estate.

Fault domains and stateful services

Stateless tiers are easy to spread across fault domains because any instance can serve any request. Stateful services need more care. A clustered database should place its nodes in different fault domains so a hardware event cannot take the whole cluster, and the same applies to any quorum based system where losing several members at once would halt the service. When you design a stateful tier, ask explicitly which fault domain each member sits in and what happens if one fault domain is lost, rather than assuming the clustering software handles placement for you.
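For a quorum based system, the question "what happens if one fault domain is lost" reduces to simple arithmetic. The sketch below assumes a strict majority quorum (more than half of all members must remain), which is common but not universal, so check what your clustering software actually requires. The function name is hypothetical.

```python
from collections import Counter


def survives_any_single_fd_loss(member_fault_domains: list[str]) -> bool:
    """True if a strict majority of members remains after losing any one fault domain."""
    total = len(member_fault_domains)
    if total == 0:
        return False
    quorum = total // 2 + 1  # assumption: strict majority quorum
    # The worst case is losing the fault domain that holds the most members.
    largest_group = max(Counter(member_fault_domains).values())
    return total - largest_group >= quorum


# Three members, one per fault domain: any single loss leaves two of three.
print(survives_any_single_fd_loss(["FAULT-DOMAIN-1", "FAULT-DOMAIN-2", "FAULT-DOMAIN-3"]))
# Three members with two sharing a fault domain: losing that domain leaves one of three.
print(survives_any_single_fd_loss(["FAULT-DOMAIN-1", "FAULT-DOMAIN-1", "FAULT-DOMAIN-2"]))
```

The second case is the trap: the cluster has three nodes and looks redundant, yet one hardware event can still halt it.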

Verifying your fault domain posture

Make fault domain distribution something you verify, not something you hope for. Periodically audit each service you care about and confirm its instances are genuinely spread, because drift creeps in as environments change. Build the check into your operational reviews so a regression is caught early rather than discovered during the maintenance window that takes the service down. The discipline you apply to watching replication lag in object storage applies equally to placement: if it matters, measure it continuously.
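A periodic audit can be as small as the sketch below. It assumes you can export an inventory as a list of records with a `service` label and a `fault_domain` value; both field names are hypothetical, and how you produce the inventory (console export, CLI, SDK) is left open.

```python
from collections import defaultdict


def audit_fault_domain_spread(
    instances: list[dict[str, str]], min_fault_domains: int = 2
) -> dict[str, list[str]]:
    """Return services whose instances occupy fewer than min_fault_domains fault domains."""
    fds_by_service: dict[str, set[str]] = defaultdict(set)
    for instance in instances:
        fds_by_service[instance["service"]].add(instance["fault_domain"])
    return {
        service: sorted(fault_domains)
        for service, fault_domains in sorted(fds_by_service.items())
        if len(fault_domains) < min_fault_domains
    }


# Hypothetical inventory: "web" is spread, "api" has drifted into one fault domain.
inventory = [
    {"name": "web-1", "service": "web", "fault_domain": "FAULT-DOMAIN-1"},
    {"name": "web-2", "service": "web", "fault_domain": "FAULT-DOMAIN-2"},
    {"name": "api-1", "service": "api", "fault_domain": "FAULT-DOMAIN-3"},
    {"name": "api-2", "service": "api", "fault_domain": "FAULT-DOMAIN-3"},
]

for service, fault_domains in audit_fault_domain_spread(inventory).items():
    print(f"{service} is concentrated in {fault_domains}")
```

Run on a schedule, a non-empty result is the regression signal: it names the services that have lost their spread before a maintenance event finds them.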

Where fault domains fit the plan

Fault domain distribution is the foundation of OCI resilience, the free first layer beneath availability domain spread and cross region recovery. Get it right everywhere, then climb the ladder only as far as each workload justifies, following the objectives in RTO and RPO planning and the full design in the disaster recovery pillar. The next rung up is covered in availability domains and resilience. When we design resilience as part of a managed service, fault domain distribution is the first thing we verify because it is the cheapest resilience you will ever buy. To audit your estate, book an OCI assessment.
