EBS Disaster Recovery on OCI · OCI Specialists

Disaster recovery for Oracle E-Business Suite (EBS) is often the weakest part of an otherwise sound estate. EBS is a multi-tier application, and protecting only the database leaves the application tier exposed. Many EBS estates have a database standby and call it disaster recovery, yet have never built or tested the application tier recovery, so the estate has the appearance of protection without the capability. Real disaster recovery for EBS protects both tiers and is rehearsed until the recovery time is known rather than assumed.

This article sets out how to design and test disaster recovery for EBS on OCI: the database standby, the application tier recovery, the recovery targets that drive the design, and the rehearsal discipline that makes it real. It builds on our EBS on OCI architecture guide and is part of our series on running Oracle applications on OCI.

Disaster recovery is more than the database

The defining feature of EBS disaster recovery is that EBS has two tiers to protect, not one. The database tier holds the data and is protected with a standby in a second region. The application tier holds the customizations, configuration, and integrations, and it must be brought up in the second region for the application to function. A design that protects only the database can recover the data but cannot serve users, because there is no application tier in the recovery region to run against the recovered database.

This is the trap many EBS estates fall into. The database standby is straightforward to configure and is often done well, while the application tier recovery is harder and is frequently left undone. The result is an estate that looks protected on paper but cannot actually recover, because half the application is missing in the recovery region. A sound EBS disaster recovery design treats both tiers as first-class.

Protecting the database tier

The EBS database is protected with Data Guard, which maintains a continuously updated standby copy of the database in the second region. Because the standby is kept current with the primary, the recovery point is small, meaning very little data is at risk if the primary region is lost. On failover, the standby is promoted to primary and the application tier in the second region is brought up against it.

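Data Guard is well understood and reliable, which is why the database side of EBS disaster recovery is usually the part done well. The key decisions are the protection mode and how the standby is used between disasters. The protection mode (Maximum Performance, Maximum Availability, or Maximum Protection) trades commit latency on the primary for a smaller recovery point, a cost that grows with the distance between regions. Using the standby between disasters, for example for read-only reporting with Active Data Guard or for testing as a snapshot standby, can improve its return on investment. These are decisions worth making deliberately rather than accepting the defaults.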
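To make the protection mode decision concrete, the sketch below maps a business RPO to one of the three Data Guard modes. The mode names and the DGMGRL command `EDIT CONFIGURATION SET PROTECTION MODE AS ...` are Data Guard's own; the function `choose_protection_mode`, its `rpo_seconds` parameter, and the `halt_primary_ok` flag are hypothetical, and the rule that a non-zero RPO allows asynchronous transport assumes the measured transport lag stays inside that RPO.

```python
from dataclasses import dataclass

# Protection mode names as the Data Guard broker (DGMGRL) spells them.
MAX_PROTECTION = "MaxProtection"
MAX_AVAILABILITY = "MaxAvailability"
MAX_PERFORMANCE = "MaxPerformance"


@dataclass(frozen=True)
class ProtectionChoice:
    mode: str            # DGMGRL protection mode name
    redo_transport: str  # LogXptMode the standby needs for this mode
    dgmgrl_command: str  # broker command an operator would run


def choose_protection_mode(rpo_seconds: float, halt_primary_ok: bool) -> ProtectionChoice:
    """Map a business RPO to a Data Guard protection mode (hypothetical helper).

    rpo_seconds == 0 means no committed transaction may be lost, which needs
    synchronous redo transport. halt_primary_ok records whether the business
    accepts the primary stopping when no synchronized standby is reachable,
    which is how Maximum Protection behaves.
    """
    if rpo_seconds < 0:
        raise ValueError("RPO cannot be negative")
    if rpo_seconds == 0 and halt_primary_ok:
        mode, transport = MAX_PROTECTION, "SYNC"
    elif rpo_seconds == 0:
        mode, transport = MAX_AVAILABILITY, "SYNC"
    else:
        # Assumes measured transport lag stays inside the RPO; ASYNC avoids
        # adding a cross-region round trip to every commit on the primary.
        mode, transport = MAX_PERFORMANCE, "ASYNC"
    command = f"EDIT CONFIGURATION SET PROTECTION MODE AS {mode};"
    return ProtectionChoice(mode, transport, command)


if __name__ == "__main__":
    choice = choose_protection_mode(rpo_seconds=0, halt_primary_ok=False)
    print(choice.mode, choice.redo_transport)
    print(choice.dgmgrl_command)
```

In a real configuration the standby's `LogXptMode` property has to be set to a synchronous value before the broker will accept Maximum Availability or Maximum Protection, so the transport change comes first and the mode change second.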

Protecting the application tier

The application tier is where EBS disaster recovery is won or lost. The right approach is to define the application tier as infrastructure as code so it can be stood up in the second region reliably, either kept warm and ready or built from code on failover, depending on the recovery time the business requires. Defining it as code also guards against standby drift, where a hand-maintained recovery environment slowly diverges from production until it no longer works when needed.
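Drift can be checked mechanically by comparing the application tier definition held in code with what is actually deployed in the recovery region. This is a minimal sketch: the setting names such as `app_node_count` and `patch_bundle` are hypothetical, and a real check would read the two dictionaries from the infrastructure-as-code definition and from the live environment rather than from literals.

```python
from typing import Any, Dict, List, Optional, Tuple

# (setting, value in code, value observed in the recovery region)
Drift = Tuple[str, Optional[Any], Optional[Any]]


def detect_drift(desired: Dict[str, Any], observed: Dict[str, Any]) -> List[Drift]:
    """List every setting where the recovery environment differs from code.

    A setting defined in code but absent from the environment is reported
    with observed=None; a setting found only in the environment (a hand-made
    change) is reported with desired=None.
    """
    drift: List[Drift] = []
    for key in sorted(desired):
        if key not in observed or observed[key] != desired[key]:
            drift.append((key, desired[key], observed.get(key)))
    for key in sorted(set(observed) - set(desired)):
        drift.append((key, None, observed[key]))
    return drift


if __name__ == "__main__":
    # Hypothetical settings for illustration only.
    in_code = {"app_node_count": 2, "patch_bundle": "2024-Q2", "web_tier_shape": "shape-a"}
    deployed = {"app_node_count": 2, "patch_bundle": "2023-Q4", "debug_logging": True}
    for setting, expected, actual in detect_drift(in_code, deployed):
        print(f"{setting}: code={expected!r} deployed={actual!r}")
```

An empty result is the signal that the recovery environment still matches production's definition; anything else is drift to correct before it is discovered during a real event.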

The choice between a warm application tier and one built on failover is driven by the recovery time target. A demanding target, where the business needs the application back within minutes, justifies keeping a warm application tier ready in the second region. A more relaxed target, where a longer recovery is acceptable, allows the cheaper approach of keeping only the database standby and building the application tier from code when needed. Either way, the application tier definition must be kept current with production.
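The warm-versus-build decision reduces to a time budget: building the application tier from code is viable only if database failover, the build, and validation together fit inside the recovery time target. The sketch below assumes every duration is an estimate to be replaced by times measured in rehearsal; the function name and the 0.8 safety `margin` are hypothetical.

```python
def choose_app_tier_approach(
    rto_minutes: float,
    db_failover_minutes: float,
    app_build_minutes: float,
    warm_start_minutes: float,
    validation_minutes: float,
    margin: float = 0.8,
) -> str:
    """Compare both recovery paths against the RTO (hypothetical helper).

    margin keeps headroom: a path only qualifies if it fits in margin * RTO.
    """
    budget = rto_minutes * margin
    build_path = db_failover_minutes + app_build_minutes + validation_minutes
    warm_path = db_failover_minutes + warm_start_minutes + validation_minutes
    if build_path <= budget:
        return "build from code on failover"  # the cheaper path fits the RTO
    if warm_path <= budget:
        return "keep a warm application tier"
    return "RTO not met by either path; revisit the design or the target"


if __name__ == "__main__":
    # Illustrative estimates in minutes, not measurements.
    for rto in (240, 90, 60):
        print(rto, choose_app_tier_approach(rto, 15, 90, 10, 30))
```

With those illustrative numbers, a four-hour target leaves room to build from code, a 90-minute target needs a warm tier, and a one-hour target cannot be met by either path without shortening a step.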

An EBS failover that has never been tested is an assumption, not a capability. The rehearsal is where disaster recovery becomes real.

Recovery targets drive the design

The two recovery targets, the recovery time objective (RTO) and the recovery point objective (RPO), drive every other decision in the design. The RTO is how quickly the estate must be back; the RPO is how much data, measured as a window of time, the business can afford to lose. These are business decisions rather than technical ones, and they should be set by the business based on the cost of downtime and data loss, then used to shape the technical design.

Setting these targets deliberately avoids the two failure modes of disaster recovery design. The first is over-engineering, where a relaxed business need is met with an expensive warm standby that the business did not require. The second is under-engineering, where a demanding business need is met with a design that cannot actually meet it. Matching the design to the stated targets is what makes the disaster recovery both adequate and economical.

Recovery profile	Database approach	Application tier approach	Relative cost
Demanding RTO and RPO	Data Guard, current standby	Warm application tier ready	Higher
Moderate RTO and RPO	Data Guard standby	Built from code on failover	Medium
Relaxed RTO and RPO	Replicated backups	Built from code on failover	Lower
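The table's three rows are common combinations, but the two targets mostly drive different columns: the RPO decides the database approach and the RTO decides the application tier approach. A minimal sketch of that mapping follows; the function name and all threshold defaults (`warm_tier_rto_hours`, `near_zero_rpo_minutes`, `backup_copy_interval_minutes`, `backup_restore_hours`) are hypothetical and should come from the business targets and measured restore times.

```python
from typing import Tuple


def recommend_approach(
    rto_hours: float,
    rpo_minutes: float,
    warm_tier_rto_hours: float = 1.0,
    near_zero_rpo_minutes: float = 1.0,
    backup_copy_interval_minutes: float = 24 * 60,
    backup_restore_hours: float = 8.0,
) -> Tuple[str, str]:
    """Return (database approach, application tier approach) using the table's labels."""
    if rpo_minutes <= near_zero_rpo_minutes:
        database = "Data Guard, current standby"
    elif rpo_minutes < backup_copy_interval_minutes or rto_hours < backup_restore_hours:
        # Either backups are copied too rarely for the RPO, or restoring
        # them would take longer than the RTO allows.
        database = "Data Guard standby"
    else:
        database = "Replicated backups"

    if rto_hours <= warm_tier_rto_hours:
        app_tier = "Warm application tier ready"
    else:
        app_tier = "Built from code on failover"
    return database, app_tier


if __name__ == "__main__":
    print(recommend_approach(rto_hours=8, rpo_minutes=30))
```

Treating the targets separately also exposes mixed cases the table does not list, such as a relaxed RPO with a short RTO, where backups alone would restore too slowly.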

Rehearsing the failover

A disaster recovery design is only as good as its last successful rehearsal. The failover must be tested regularly, in a way that exercises both the database promotion and the application tier recovery, so the recovery time is measured rather than assumed and any problems are found in a rehearsal rather than a real disaster. The rehearsal also validates that the application works correctly in the recovery region, that the integrations connect, and that users can actually use the system after failover.
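Measuring the recovery time rather than assuming it means timing each rehearsal step and comparing the total with the RTO. The sketch below assumes each step is wrapped in a callable returning `True` on success; the step names and the `run_rehearsal` helper are hypothetical, and the `clock` parameter exists so the timing can be tested without waiting.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple


@dataclass
class RehearsalReport:
    steps: List[Tuple[str, float, bool]] = field(default_factory=list)  # (name, seconds, ok)
    total_seconds: float = 0.0
    completed: bool = False
    met_rto: bool = False


def run_rehearsal(
    steps: Sequence[Tuple[str, Callable[[], bool]]],
    rto_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> RehearsalReport:
    """Run each rehearsal step in order, time it, and stop at the first failure."""
    report = RehearsalReport()
    start = clock()
    for name, action in steps:
        step_start = clock()
        ok = action()
        report.steps.append((name, clock() - step_start, ok))
        if not ok:
            break  # later steps depend on this one, so stop and report
    report.total_seconds = clock() - start
    report.completed = len(report.steps) == len(steps) and all(ok for _, _, ok in report.steps)
    report.met_rto = report.completed and report.total_seconds <= rto_seconds
    return report
```

A real rehearsal would pass steps such as promoting the standby, starting the application tier, checking integrations, and a user smoke test; keeping the per-step timings over successive rehearsals shows which step is eating into the RTO.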

Rehearsing also keeps the recovery environment current. An environment that is rehearsed regularly cannot drift far from production, because the rehearsal would expose the drift. An environment that is never rehearsed drifts silently until the day it is needed, which is the worst possible time to discover that it no longer works. Regular rehearsal is the discipline that separates real disaster recovery from the appearance of it, a theme we return to in the DR runbooks throughout our disaster recovery guides.

Common EBS disaster recovery mistakes

The most common EBS disaster recovery mistakes follow directly from the two-tier nature of the application. The first and most serious is protecting only the database and neglecting the application tier, which leaves the estate unable to actually recover. The second is hand-building the recovery environment rather than defining it as code, which leads to drift and an environment that fails when needed. The third is never rehearsing the failover, so the recovery time is an assumption and the first real test is a real disaster.

The fourth is setting no explicit recovery targets, which leaves the design without a yardstick and tends to produce something that is both more expensive and less reliable than a deliberate design would be. Avoiding these four is most of what separates a real EBS disaster recovery capability from a paper one. Our wider guide to common DR mistakes on OCI covers the patterns that apply across workloads.

How disaster recovery fits the operating model

Disaster recovery is not a one-time build but an ongoing part of the operating model. The recovery environment has to be maintained as production changes, the failover has to be rehearsed on a schedule, and the recovery targets have to be revisited as the business need evolves. This ongoing care is why disaster recovery sits naturally within a managed service, where the rehearsal and maintenance are part of the run rather than a project that is done once and then neglected.

Treating disaster recovery as a living capability rather than a finished artifact is what keeps it working over the years. The migration to OCI is the natural moment to build it, the operating model is what keeps it real, and the rehearsal is the proof that it works. Our EBS on OCI workload service builds and maintains disaster recovery as part of running the estate, and the broader pattern is in our disaster recovery and high availability solution.

Documenting the runbook

A disaster recovery design needs a runbook that anyone on the team can follow under pressure, not knowledge held in one engineer's head. The runbook sets out the steps to promote the database standby, bring up the application tier in the recovery region, reconnect the integrations, validate the application, and redirect users. It also records the decisions that have to be made during a real event, such as who authorizes the failover and how the failback will be handled once the primary region is available again.
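Holding the runbook as structured data makes it possible to check it automatically, for example that every step has an owner and that nothing changes production state before the failover is authorized. This is a minimal sketch: the `RunbookStep` fields, the role names, and `check_runbook` are hypothetical, and the steps follow the order described above.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class RunbookStep:
    name: str
    owner: str                   # a role, so the runbook survives staff changes
    is_authorization: bool = False
    changes_state: bool = False  # promotes, redirects, or otherwise commits the failover


# Hypothetical failover runbook following the steps described above.
FAILOVER_RUNBOOK = [
    RunbookStep("Authorize failover", "DR incident lead", is_authorization=True),
    RunbookStep("Promote database standby", "On-call DBA", changes_state=True),
    RunbookStep("Bring up application tier in recovery region", "Platform engineer", changes_state=True),
    RunbookStep("Reconnect integrations", "Integration owner", changes_state=True),
    RunbookStep("Validate application", "EBS functional lead"),
    RunbookStep("Redirect users", "Network engineer", changes_state=True),
]


def check_runbook(steps: Sequence[RunbookStep]) -> List[str]:
    """Return problems that would trip up someone following the runbook under pressure."""
    problems: List[str] = []
    authorized = False
    seen = set()
    for step in steps:
        if step.name in seen:
            problems.append(f"duplicate step: {step.name}")
        seen.add(step.name)
        if not step.owner.strip():
            problems.append(f"no owner: {step.name}")
        if step.is_authorization:
            authorized = True
        elif step.changes_state and not authorized:
            problems.append(f"runs before authorization: {step.name}")
    return problems
```

Running a check like this whenever the runbook changes catches structural mistakes early; whether the steps actually work is still something only the rehearsal can confirm.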

The runbook is exercised during every rehearsal, which keeps it current and confirms it is complete and accurate. A runbook that is written once and never tested tends to be wrong by the time it is needed, because the estate has changed and the steps no longer match. Tying the runbook to the rehearsal schedule keeps it honest. Our wider guidance on DR runbooks for OCI covers how to structure a runbook that holds up under real pressure.

Testing without disrupting production

A common worry about disaster recovery rehearsal is that the test itself will disrupt production, and a good design removes that worry. On OCI the application tier in the recovery region can be brought up from code into an isolated environment for testing, and the database standby can be opened for validation in a way that does not affect the primary. This means the failover can be exercised realistically without taking production down, so the rehearsal can happen on a regular schedule rather than being avoided because it is too disruptive.
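One way to open the standby for validation without touching the primary is Data Guard's snapshot standby: it opens read-write while still receiving redo from the primary, then discards the test changes when converted back. The sketch below builds a rehearsal plan around the real DGMGRL `CONVERT DATABASE ... TO SNAPSHOT STANDBY` and `... TO PHYSICAL STANDBY` commands; the function name, the manual step wording, and the name-validation pattern are assumptions.

```python
import re
from typing import List, Tuple

Step = Tuple[str, str]  # ("dgmgrl", command) or ("manual", instruction)

# Conservative assumption about acceptable names; adjust to your naming standard.
_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def snapshot_rehearsal_plan(standby: str, test_stack: str) -> List[Step]:
    """Ordered plan for a rehearsal that leaves the primary untouched (hypothetical helper)."""
    if not _NAME.fullmatch(standby):
        raise ValueError(f"unexpected database name: {standby!r}")
    return [
        # Opens the standby read-write; redo is still received but not applied.
        ("dgmgrl", f"CONVERT DATABASE '{standby}' TO SNAPSHOT STANDBY;"),
        ("manual", f"Build the isolated application tier from code ({test_stack})"),
        ("manual", "Validate the application and integrations against the snapshot standby"),
        ("manual", f"Tear down the isolated application tier ({test_stack})"),
        # Discards the test changes and applies the redo received meanwhile.
        ("dgmgrl", f"CONVERT DATABASE '{standby}' TO PHYSICAL STANDBY;"),
    ]


if __name__ == "__main__":
    for kind, text in snapshot_rehearsal_plan("EBSPROD_STBY", "ebs-dr-test"):
        print(f"[{kind}] {text}")
```

The trade-off is that while the standby is a snapshot standby, received redo waits unapplied, so a real failover during the test window would take longer; keeping the window short limits that exposure.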

Removing the disruption removes the main excuse for not rehearsing, which is why this matters. A disaster recovery capability that can be tested safely will be tested regularly, and one that cannot be tested safely tends not to be tested at all. Designing the rehearsal to be non-disruptive is therefore part of designing a disaster recovery capability that is actually maintained over time rather than allowed to decay.

Failback after the event

Disaster recovery planning often stops at the failover and neglects the failback, which is the return to the primary region once it is available again. A complete design covers both directions, because an estate running in the recovery region after a real event eventually needs to return to its normal home, and that return is itself a controlled procedure with its own data synchronization and cutover. Planning the failback in advance means the team is not improvising the return under the lingering pressure of a recent disaster.

The failback uses the same disciplines as the failover: a database synchronization back to the primary, an application tier brought up from code, validation, and a controlled redirection of users. Treating failback as a first-class part of the design, rehearsed alongside the failover, is what makes the disaster recovery capability complete rather than a one-way trip that leaves the estate stranded in the recovery region.
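With Data Guard, the database half of failback usually means turning the old primary into a standby of the new one and then switching roles back. The DGMGRL commands `REINSTATE DATABASE` and `SWITCHOVER TO` are real; reinstatement relies on Flashback Database having been enabled on the old primary before the failover, otherwise it has to be re-created as a standby. The helper name, the manual step wording, and their exact order are assumptions in this sketch.

```python
import re
from typing import List, Tuple

Step = Tuple[str, str]  # ("dgmgrl", command) or ("manual", instruction)
_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]*")  # conservative name check (assumption)


def failback_plan(original_primary: str, flashback_was_enabled: bool) -> List[Step]:
    """Ordered failback steps after a Data Guard failover (hypothetical helper)."""
    if not _NAME.fullmatch(original_primary):
        raise ValueError(f"unexpected database name: {original_primary!r}")
    if flashback_was_enabled:
        # Flashback Database on the old primary lets the broker turn it into a standby.
        first: Step = ("dgmgrl", f"REINSTATE DATABASE '{original_primary}';")
    else:
        # Without Flashback Database the old primary has to be re-created as a standby.
        first = ("manual", f"Re-create {original_primary} as a standby from the current primary")
    return [
        first,
        ("manual", f"Wait until redo is being applied on {original_primary} with minimal lag"),
        ("manual", "Stop the application tier in the recovery region"),
        # A switchover, unlike a failover, swaps roles without data loss.
        ("dgmgrl", f"SWITCHOVER TO '{original_primary}';"),
        ("manual", "Bring up the application tier in the primary region from code"),
        ("manual", "Validate the application and integrations"),
        ("manual", "Redirect users back to the primary region"),
    ]
```

After the switchover the database in the recovery region becomes the standby again, so the estate ends up back in its original protected configuration rather than merely back in its original region.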

Part of a series
This guide is part of OCI Disaster Recovery, our complete pillar guide on the topic.

About the author

Morten Andersen, Co-founder of OCI Specialists: 20 years of enterprise IT experience in OCI migration, security, networking, and 24/7 operations.

Moving Oracle workloads to OCI, or already running on OCI and not sure the architecture or the spend is right? Most teams bring in a specialist before they commit to a region, a shape, or a Universal Credits number. OCISpecialists.com plans the landing zone, runs the migration, and manages the estate after go-live, on a fixed project fee, a managed monthly retainer, or a cost-optimization fee paid only on verified savings.