Disaster recovery and high availability are two terms that get used as if they meant the same thing, and confusing them is how organisations end up with an expensive design that does not actually protect the business. High availability keeps a workload running through the everyday failures that happen all the time: a failed server, a rebooted host, a single component fault. Disaster recovery brings a workload back after a larger event that takes out a whole site or region. You need both, and Oracle Cloud Infrastructure gives you a clear set of building blocks for each. This pillar guide is the map. It explains the resilience hierarchy on OCI from fault domains up to cross region recovery, the database and application techniques that sit on top, how to set targets that match the business, and how to test the design so it works when it is needed. The detailed articles in this cluster go deeper on each piece, and they are linked throughout.
The cleanest way to hold the distinction is by the size of the failure each one answers. High availability is about surviving local, frequent failures with little or no interruption, usually inside a single region, through redundancy and automatic failover. Disaster recovery is about recovering from rare, large failures, the loss of a region or a major corruption event, by failing over to infrastructure somewhere else. High availability is measured in seconds of interruption and is mostly automatic. Disaster recovery is measured in the time to recover and the data you can afford to lose, and it usually involves a deliberate, tested process. A complete design uses both layers, because each protects against failures the other cannot.
OCI gives you a layered set of fault isolation boundaries, and a good design uses them deliberately rather than by accident. From smallest to largest the layers are fault domains, availability domains, and regions, and each protects against a wider blast radius than the last.
A fault domain is a grouping of hardware within a single availability domain, engineered so that a hardware or maintenance event affecting one fault domain does not affect the others. Spreading instances across fault domains is the cheapest resilience you can buy, because it stays inside one availability domain and incurs no cross site cost, yet it protects against the most common class of hardware failure. Every multi instance tier should distribute across fault domains by default. Our article on designing fault domains on OCI covers the placement patterns in detail.
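The placement idea above can be sketched in a few lines of Python. This is a minimal illustration rather than a provisioning script: the instance names are hypothetical, the helper functions are invented for this example, and the only OCI-specific detail assumed is the standard naming of the three fault domains in each availability domain.

```python
from collections import Counter
from itertools import cycle

# Fault domain names as they appear within an OCI availability domain.
FAULT_DOMAINS = ["FAULT-DOMAIN-1", "FAULT-DOMAIN-2", "FAULT-DOMAIN-3"]


def assign_fault_domains(instance_names, fault_domains=FAULT_DOMAINS):
    """Round-robin each instance onto a fault domain so the tier is spread evenly."""
    rotation = cycle(fault_domains)
    return {name: next(rotation) for name in instance_names}


def largest_share_lost(placement):
    """Fraction of the tier lost if the most heavily loaded fault domain fails."""
    if not placement:
        return 0.0
    counts = Counter(placement.values())
    return max(counts.values()) / len(placement)


if __name__ == "__main__":
    web_tier = [f"web-{i}" for i in range(1, 6)]  # hypothetical instance names
    placement = assign_fault_domains(web_tier)
    for name, fault_domain in placement.items():
        print(name, fault_domain)
    print(f"worst-case share lost: {largest_share_lost(placement):.0%}")
```

With five instances the worst case is losing two of them, which is why a tier sized to survive on the remaining capacity keeps serving through a single fault domain event.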
An availability domain is one or more isolated data centres within a region, with its own power, cooling, and network, so a failure in one availability domain does not cascade to another. Spreading a workload across availability domains, where the region offers more than one, protects against the loss of an entire data centre. This is the backbone of high availability for serious workloads and is what qualifies many services for their higher availability commitments. The companion piece on availability domains and resilience explains how to design across them.
A region is a geographically separate location, and pairing regions is how you protect against an event that takes out an entire region. Cross region designs carry more cost and complexity because data has to be replicated over distance, but they are the only protection against regional loss and are essential for the most critical workloads. Our article on cross region DR on OCI goes through the patterns and tradeoffs.
| Layer | Protects against | Cost and complexity | Primary use |
|---|---|---|---|
| Fault domain | Hardware and maintenance events | Lowest, no cross site cost | Baseline HA for every tier |
| Availability domain | Loss of a data centre | Low to moderate | High availability |
| Region | Loss of an entire region | Highest | Disaster recovery |
Every resilience design should start from two numbers, and skipping them is the most common and most expensive mistake. Recovery time objective, or RTO, is how long the business can tolerate being down. Recovery point objective, or RPO, is how much data the business can afford to lose, measured as a window of time. A workload that must be back in minutes with near zero data loss needs a very different and far more expensive design than one that can be down for hours and lose a few minutes of data. Setting these targets per workload, with the business rather than the technology team, is what turns resilience from a guess into an engineering specification. Our guide to RTO and RPO planning for OCI shows how to gather and apply these numbers.
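To show how these two numbers become something you can check a design against, here is a small Python sketch. The `RecoveryTarget` class and both functions are hypothetical names made up for this example, and the model is deliberately simplified: it assumes periodic backups can lose up to one full interval, asynchronous replication can lose up to its current lag, and the smaller of the two exposures applies when both exist.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryTarget:
    workload: str
    rto: timedelta  # longest outage the business can tolerate
    rpo: timedelta  # largest window of lost data the business can tolerate


def worst_case_data_loss(backup_interval=None, replication_lag=None):
    """Worst-case window of lost data under the simplified model described above."""
    exposures = [x for x in (backup_interval, replication_lag) if x is not None]
    if not exposures:
        raise ValueError("no protection mechanism given")
    return min(exposures)


def meets_rpo(target, backup_interval=None, replication_lag=None):
    """True when the worst-case data loss fits inside the workload's RPO."""
    return worst_case_data_loss(backup_interval, replication_lag) <= target.rpo


if __name__ == "__main__":
    # Hypothetical workload and targets, agreed with the business.
    orders = RecoveryTarget("orders", rto=timedelta(minutes=30), rpo=timedelta(minutes=5))
    print(meets_rpo(orders, backup_interval=timedelta(hours=24)))
    print(meets_rpo(orders, replication_lag=timedelta(seconds=30)))
```

A nightly backup alone cannot satisfy a five minute RPO in this model, while a replica lagging by thirty seconds can, which is the kind of conclusion the targets are meant to force early.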
For most enterprises the database is the part that matters most, because it holds the state everything else depends on. OCI offers strong, native protection for Oracle Database at every level. Within a region, Real Application Clusters and automated failover keep the database available through node failures. Across availability domains and regions, Oracle Data Guard maintains a synchronised standby that can take over with minimal data loss, and is the workhorse of Oracle disaster recovery. On Exadata and Autonomous Database these capabilities are deeply integrated and largely managed for you. The detailed articles on Data Guard on OCI and high availability for Oracle Database on OCI explain how to design and operate them.
Databases are not the whole estate. Application tiers need to be reproducible and redeployable, which is where infrastructure as code earns its place, because a region you can rebuild from definitions recovers far faster than one you have to assemble by hand. Object storage supports cross region replication so that unstructured data and backups exist in more than one place, covered in object storage replication for DR. Block volumes and file systems need their own backup and replication policies. And OCI Full Stack Disaster Recovery coordinates the failover of an entire application stack, not just the database, which we cover in OCI full stack disaster recovery. Backups underpin all of it, and the strategy for them is in backup strategies for OCI workloads.
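One way to keep the non-database parts of the estate honest is a simple coverage audit over an inventory. The Python sketch below is illustrative only: the component kinds, the mechanism labels, and the `ACCEPTED` mapping are assumptions invented for this example, not OCI resource types or API values, and a real inventory would come from your own tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    kind: str               # hypothetical category, e.g. "database" or "app_tier"
    protections: frozenset  # mechanisms recorded for this component

# Hypothetical mapping from component kind to the mechanisms that give it
# a copy, or a rebuild path, outside the primary region.
ACCEPTED = {
    "database": {"data_guard_standby", "cross_region_backup"},
    "object_storage": {"cross_region_replication"},
    "block_volume": {"cross_region_replication", "cross_region_backup"},
    "file_system": {"cross_region_replication", "cross_region_backup"},
    "app_tier": {"infrastructure_as_code"},
}


def unprotected(components):
    """Names of components with no accepted recovery mechanism for their kind."""
    return [
        c.name
        for c in components
        if not (c.protections & ACCEPTED.get(c.kind, set()))
    ]


if __name__ == "__main__":
    estate = [  # hypothetical inventory
        Component("erp-db", "database", frozenset({"data_guard_standby"})),
        Component("erp-app", "app_tier", frozenset({"infrastructure_as_code"})),
        Component("invoices", "object_storage", frozenset()),
    ]
    print(unprotected(estate))
```

Anything the audit returns is a component that would not come back after a regional loss, however well the database itself is protected.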
For the most critical workloads, where even a short recovery time is unacceptable, an active active design runs the workload live in more than one location at once, so the loss of one site is absorbed without a failover event at all. This is the most resilient and the most demanding pattern, requiring careful handling of data consistency and traffic distribution, and it is not justified for most workloads. Knowing when it is warranted is as important as knowing how to build it, and we explore both in active active architecture on OCI.
The single biggest failure in disaster recovery is not the design but the lack of testing. A plan that has never been exercised is a hope, and hopes fail at the worst moment. Regular, realistic DR tests prove that the failover actually works, that the RTO and RPO targets are met in practice, and that the team knows the runbook under pressure. Testing also surfaces the small gaps (an unreplicated config, a missing DNS change, a permission that was never granted) that turn a clean design into a failed recovery. Our article on DR testing on OCI sets out how to run tests that prove the design without disrupting production. Specific workloads have their own playbooks, such as DR for EBS on OCI.
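A drill only proves something if its results are compared against the targets. As a rough sketch of that comparison, the Python below derives the achieved RTO and RPO from three timestamps recorded during a test. The `DrillRecord` fields and function names are hypothetical, and the sketch assumes the team can identify the newest write that survived the failover.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DrillRecord:
    failure_declared: datetime      # when the simulated outage began
    service_restored: datetime      # when the workload served traffic again
    last_recovered_write: datetime  # newest data present after failover


def achieved_rto(drill):
    """Elapsed time from the declared failure to restored service."""
    return drill.service_restored - drill.failure_declared


def achieved_rpo(drill):
    """Window of data written before the failure that did not survive it."""
    return max(drill.failure_declared - drill.last_recovered_write, timedelta(0))


def evaluate_drill(drill, rto_target, rpo_target):
    """Compare what the drill achieved with the agreed targets."""
    rto, rpo = achieved_rto(drill), achieved_rpo(drill)
    return {
        "achieved_rto": rto,
        "achieved_rpo": rpo,
        "rto_met": rto <= rto_target,
        "rpo_met": rpo <= rpo_target,
    }


if __name__ == "__main__":
    drill = DrillRecord(  # hypothetical drill timings
        failure_declared=datetime(2024, 1, 1, 10, 0, 0),
        service_restored=datetime(2024, 1, 1, 10, 42, 0),
        last_recovered_write=datetime(2024, 1, 1, 9, 59, 30),
    )
    print(evaluate_drill(drill, timedelta(hours=1), timedelta(seconds=15)))
```

Recording these figures for every drill gives a trend over time, so a slowly growing recovery time is noticed in a test rather than during a real event.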
Resilience is a spectrum, and the goal is not maximum resilience everywhere but the right resilience for each workload. A tier one revenue system may justify cross region active active. A reporting database may be perfectly safe with daily backups and a multi hour recovery. Spending active active money on a workload that tolerates hours of downtime is waste, and spending backup only money on a system the business cannot live without is negligence. The design framework exists to match the spend to the value, workload by workload, and that matching is where independent advice pays for itself.
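The matching exercise can be written down as a simple rule, which makes the reasoning explicit and easy to challenge. The Python sketch below is only an illustration: the three thresholds are placeholder values chosen for this example, not recommendations, and each organisation would set its own with the business.

```python
from datetime import timedelta

# Illustrative thresholds only; these are assumptions, not guidance.
ACTIVE_ACTIVE_RTO = timedelta(minutes=5)
STANDBY_RTO = timedelta(hours=1)
STANDBY_RPO = timedelta(minutes=15)


def suggest_pattern(rto, rpo):
    """Pick the least demanding pattern that still fits the workload's targets."""
    if rto <= ACTIVE_ACTIVE_RTO:
        return "active active across regions"
    if rto <= STANDBY_RTO or rpo <= STANDBY_RPO:
        return "cross region standby with replication"
    return "backup and restore"


if __name__ == "__main__":
    # Hypothetical workloads with targets agreed per workload.
    workloads = {
        "payments": (timedelta(minutes=2), timedelta(0)),
        "erp": (timedelta(minutes=30), timedelta(minutes=1)),
        "reporting": (timedelta(hours=8), timedelta(hours=24)),
    }
    for name, (rto, rpo) in workloads.items():
        print(name, "->", suggest_pattern(rto, rpo))
```

The value of writing the rule down is less the output than the conversation it forces about where each threshold sits and what each step up costs.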
Resilient design on OCI is a layered discipline: set RTO and RPO first, use fault domains, availability domains, and regions deliberately, protect the database with RAC and Data Guard, make the application reproducible, and test the whole thing regularly. Match the spend to the value of each workload and you get protection without waste. Continue with OCI full stack disaster recovery, Data Guard on OCI explained, cross region DR on OCI and RTO and RPO planning for OCI. Our disaster recovery and HA practice designs, builds, and tests resilient OCI estates.