DR Runbooks for OCI · OCI Specialists

A runbook is the document a tired engineer opens during the worst hour of their quarter. If it is vague, out of date, or assumes knowledge that walked out the door last year, the recovery slows down exactly when speed matters most. A good runbook is the opposite: precise, current, and written so that the person executing it does not have to think, only follow.

This article sets out what a disaster recovery runbook for Oracle Cloud Infrastructure must contain, how to structure it so it holds up under stress, and how to keep it alive rather than letting it rot. Even in a highly automated estate the runbook still matters, because automation needs a human owner who understands what it does and a fallback for the steps the machine does not cover.

What a runbook is for

The purpose of a runbook is to remove judgement from execution. During a real incident, the people running the recovery are stressed, possibly woken from sleep, and working against the clock. Every decision they have to make is a chance to make the wrong one. A good runbook front-loads all the thinking into calm moments before the incident, so that during the incident there is nothing left to decide except whether to start.

This is also why a runbook and an automation plan are partners, not competitors. The automation executes the precise steps. The runbook tells the human what the automation does, when to invoke it, how to confirm it worked, and what to do if a step fails. Our guide to DR automation on OCI covers the execution engine; this article covers the human side that wraps it.

The anatomy of a working runbook

Every DR runbook for an OCI workload should contain the same core sections, in the same order, so that anyone who has used one can use them all.

Scope and trigger. Which workload this covers, and the exact conditions under which it is invoked. Ambiguity here causes both late and premature failovers.
Roles and contacts. Who declares the disaster, who executes, who communicates, and how to reach each of them, with a named backup for every role.
Pre-checks. The state to confirm before starting: replication health, standby readiness, and any condition that would make failover unsafe.
The failover sequence. The ordered steps, each one a single unambiguous action with the exact command, console path, or automation plan to run.
Validation. How to confirm the workload is actually serving correctly in the standby region, not just that the steps ran.
Communication. What to tell whom, and when, during and after the event.
Failback. How to return to the primary region once it recovers, which is often harder than the failover itself.
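
One way to hold every runbook to the same sections in the same order is a small check that runs whenever a runbook changes. The sketch below is illustrative only: the section identifiers and the `check_sections` helper are assumptions made for this example, not part of any OCI tooling.

```python
# Section identifiers mirror the list above; the exact spellings are
# an assumption made for this example.
REQUIRED_SECTIONS = [
    "scope_and_trigger",
    "roles_and_contacts",
    "pre_checks",
    "failover_sequence",
    "validation",
    "communication",
    "failback",
]


def check_sections(headings):
    """Return a list of problems: missing sections, or sections out of order."""
    problems = [
        f"missing section: {name}"
        for name in REQUIRED_SECTIONS
        if name not in headings
    ]
    # Compare the order actually used with the required order,
    # ignoring any extra headings the runbook may contain.
    present = [h for h in headings if h in REQUIRED_SECTIONS]
    expected = [name for name in REQUIRED_SECTIONS if name in headings]
    if present != expected:
        problems.append("sections are out of order")
    return problems
```

An empty list means the runbook has all seven sections in the agreed order. Extra headings are tolerated, so a team can add workload-specific material without failing the check.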

Write every step so that someone who has never seen the system can execute it correctly. That person might be you, at four in the morning, a year from now.

Writing steps that survive pressure

The quality of the failover sequence is what separates a useful runbook from a dangerous one. Each step should be a single action with a single expected result. Instead of a step that says "configure the database", give the exact procedure or the exact automation plan name. Include the expected output so the executor can confirm success before moving on. Where a step is irreversible, mark it clearly, because knowing that a step cannot be undone changes how carefully it is done.

Number the steps and never rely on the reader inferring order from prose. Put the exact OCI region names, compartment names, and resource names in the document, because the standby region is unfamiliar territory and guessing a name is how mistakes happen. If a step depends on the output of a previous step, say so explicitly. The discipline is to assume the reader knows nothing and is under stress, because on the day, that assumption will often be close to true.
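
These rules are mechanical enough to check automatically. The following is a minimal sketch, assuming steps are kept as structured data; the `Step` fields, the `VAGUE_WORDS` list, and the `lint_steps` and `render` helpers are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Step:
    number: int
    action: str            # exact command, console path, or automation plan name
    expected_result: str   # what the executor should see on success
    irreversible: bool = False
    depends_on: tuple = ()  # numbers of earlier steps whose output is needed


# Illustrative list of phrases that usually signal a vague step.
VAGUE_WORDS = ("configure", "set up", "as needed", "appropriate")


def lint_steps(steps):
    """Return a list of problems found in an ordered failover sequence."""
    problems = []
    for position, step in enumerate(steps, start=1):
        if step.number != position:
            problems.append(f"step {step.number}: expected number {position}")
        if not step.expected_result.strip():
            problems.append(f"step {step.number}: no expected result")
        if any(word in step.action.lower() for word in VAGUE_WORDS):
            problems.append(f"step {step.number}: action may be too vague")
        for dep in step.depends_on:
            if dep >= step.number:
                problems.append(
                    f"step {step.number}: depends on step {dep}, which is not earlier"
                )
    return problems


def render(step):
    """Format a step for the document, flagging irreversible ones."""
    marker = " [IRREVERSIBLE]" if step.irreversible else ""
    return f"{step.number}.{marker} {step.action}\n   Expect: {step.expected_result}"
```

A word list will never catch every vague step, so a check like this complements a walkthrough by a human reader rather than replacing it.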

Runbook formats compared

Format	Strength	Weakness
Wiki page	Easy to edit, searchable	Rots silently, no version control, easy to leave half-edited
Version-controlled document in the repo	Changes are reviewed, history is visible, lives next to the code	Needs discipline to keep open during an incident
Runbook tied to an automation plan	The document and the executed steps cannot drift apart	Requires investment in automation first
Printed copy in a binder	Survives a total platform outage	Out of date the moment it is printed unless refreshed

The best practice for most teams is a version-controlled document stored with the infrastructure code, kept in step with an automation plan, plus a recent exported copy held somewhere reachable if the primary documentation system is itself unavailable. That last point matters more than teams expect: if your runbook lives only in a system hosted in the region that just failed, you have no runbook.

Keeping the runbook alive

A runbook is only trustworthy if it matches reality, and reality changes constantly. The way to keep it current is to tie its maintenance to two events: every architecture change and every rehearsal. When the architecture changes, updating the runbook is part of the change, not an afterthought. When you rehearse, every step that did not work exactly as written is a defect in the runbook to be fixed before the next rehearsal.
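
If the runbook is stored alongside the infrastructure code, the first of those two events can be enforced in review. The sketch below assumes a hypothetical repository layout (`terraform/` for infrastructure, `docs/runbooks/` for runbooks) and a hypothetical helper, `runbook_update_missing`, that a pipeline could call with the list of files changed in a pull request.

```python
from fnmatch import fnmatchcase

# Hypothetical repository layout; adjust the patterns to your own repo.
INFRA_PATTERNS = ["terraform/*.tf", "network/*"]
RUNBOOK_PATTERNS = ["docs/runbooks/*"]


def _matches(path, patterns):
    # fnmatchcase treats "*" as matching any characters, including "/".
    return any(fnmatchcase(path, pattern) for pattern in patterns)


def runbook_update_missing(changed_paths):
    """True when infrastructure changed but no runbook file did."""
    infra_changed = any(_matches(p, INFRA_PATTERNS) for p in changed_paths)
    runbook_changed = any(_matches(p, RUNBOOK_PATTERNS) for p in changed_paths)
    return infra_changed and not runbook_changed
```

A check like this cannot tell whether the runbook edit is correct, only that the author was prompted to make one. The rehearsal remains the real test.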

This is why testing and runbooks are inseparable. A runbook that has not been executed against the real environment recently is a work of fiction, however carefully written. Our guide to DR testing on OCI describes the rehearsal cadence that keeps runbooks honest, and our note on common DR mistakes on OCI covers the failure modes that an untested runbook hides.

Failback: the step everyone forgets

Most runbooks describe how to fail over to the standby region in detail and then stop, as if the emergency ends there. But running in the standby region is not a steady state. It is a temporary posture, and at some point you have to return to the primary. Failback is often more complex than failover because the data has changed while you were running in the standby, and that change has to be reflected back to the primary before you switch.

A complete runbook treats failback as a first-class procedure with the same rigor as failover: pre-checks, an ordered sequence, validation, and communication. Teams that skip this end up improvising the return, which carries its own risk of data loss. Plan the round trip, not just the outbound leg.
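
A failback pre-check can be written as a gate that lists everything still blocking the return. This is a sketch under stated assumptions: the three inputs would come from your own monitoring, the `failback_blockers` name is hypothetical, and the 60-second default lag threshold is a placeholder, not a recommendation.

```python
def failback_blockers(primary_healthy, reverse_replication_running,
                      reverse_lag_seconds, max_lag_seconds=60):
    """Return the reasons failback must not start yet (empty list = go)."""
    blockers = []
    if not primary_healthy:
        blockers.append("primary region is not confirmed healthy")
    if not reverse_replication_running:
        blockers.append("standby-to-primary replication is not running")
    elif reverse_lag_seconds > max_lag_seconds:
        # Data written in the standby has not yet reached the primary.
        blockers.append(
            f"replication lag {reverse_lag_seconds}s exceeds {max_lag_seconds}s"
        )
    return blockers
```

Returning every blocker at once, rather than stopping at the first, lets the executor see the full picture before deciding how long the return will take.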

From document to capability

A runbook on its own is not disaster recovery; it is one part of a capability that also includes replication, automation, testing, and clear ownership. But it is the part that ties the human to the machine, and a recovery that is otherwise well engineered can still fail if the person at the keyboard does not know what to do. Invest in the runbook the way you invest in the architecture, keep it current the way you keep the code current, and rehearse against it until following it is routine. That is what turns a plan on paper into a recovery you can count on. For the full estate view, see our pillar guide to disaster recovery and high availability on OCI.

Communication is part of the runbook

A technical recovery that succeeds while customers are left in the dark is only half a success. The runbook should include the communication plan as a first-class section: who is told what, through which channel, and when, during and after the event. This includes internal communication so the wider organization knows what is happening, and external communication so customers, partners and, where relevant, regulators are informed appropriately.

The reason to put this in the runbook rather than leave it to judgement is the same reason the technical steps are scripted: under pressure, people communicate badly or not at all unless they have a prepared plan to follow. Prepared message templates, a clear list of who contacts whom, and defined timing remove the cognitive load of inventing communication during a crisis. The technical recovery and the communication should proceed in parallel, each driven by the runbook, so neither waits on the other.
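
Prepared message templates can be as simple as text with named placeholders, filled in at the moment of the incident. The sketch below uses Python's standard `string.Template`; the template names, wording, and field names are invented for illustration.

```python
from string import Template

# Illustrative templates; the wording and field names are assumptions.
TEMPLATES = {
    "internal_start": Template(
        "DR declared for $workload at $time UTC. "
        "Failing over from $primary to $standby. "
        "Next update at $next_update UTC."
    ),
    "customer_start": Template(
        "We are investigating an issue affecting $workload. "
        "Next update at $next_update UTC."
    ),
}


def render_message(name, **fields):
    """Fill in a prepared template.

    Template.substitute raises KeyError if a placeholder has no value,
    so an incomplete message cannot be sent by accident.
    """
    return TEMPLATES[name].substitute(**fields)
```

Using `substitute` rather than `safe_substitute` is a deliberate choice here: a message that still contains a raw `$next_update` is worse than an error the sender has to fix.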

Storing the runbook where you can reach it

A runbook that lives only in a system hosted in the region that just failed is no runbook at all, because you cannot open it when you need it. This sounds obvious and is missed constantly. The runbook, and the contact details it contains, must be reachable when the primary environment is down. That means a copy outside the affected region, whether in a separate documentation system, a repository hosted independently, or a regularly refreshed exported copy held somewhere reliably reachable.

The same applies to access credentials and the means to authenticate to the recovery environment. If the only path to the standby region runs through an identity system in the failed region, the recovery stalls at the first step. Designing the runbook's own availability is part of designing the recovery, and it is tested the same way: by confirming, during a rehearsal, that the runbook and the access it requires are reachable with the primary region assumed gone.
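
The "primary region assumed gone" test can be expressed as a simple inventory check. In this sketch, every `Dependency` entry and its `kind` label are hypothetical, and the region identifiers are example values only; the point is that each kind of dependency needs at least one copy outside the failed region.

```python
from dataclasses import dataclass


@dataclass
class Dependency:
    name: str
    kind: str    # e.g. "runbook copy", "identity provider", "credential store"
    region: str  # region hosting it, or "external" if hosted outside OCI


def unreachable_kinds(dependencies, failed_region):
    """Return the kinds of dependency with no copy outside the failed region."""
    all_kinds = {d.kind for d in dependencies}
    surviving = {d.kind for d in dependencies if d.region != failed_region}
    return sorted(all_kinds - surviving)


# Example inventory with illustrative names and regions.
inventory = [
    Dependency("wiki", "runbook copy", "eu-frankfurt-1"),
    Dependency("exported PDF", "runbook copy", "external"),
    Dependency("primary identity domain", "identity provider", "eu-frankfurt-1"),
]
gaps = unreachable_kinds(inventory, failed_region="eu-frankfurt-1")
```

With this inventory, `gaps` contains only `"identity provider"`: the runbook has a surviving copy, but the only way to authenticate runs through the failed region. An inventory check complements the rehearsal but does not replace actually opening the runbook and logging in with the primary treated as unavailable.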

Right-sizing the level of detail

There is a balance to strike in how detailed a runbook is. Too vague, and it fails the test of being executable by someone under stress. Too detailed, and it becomes so long that nobody maintains it and the important steps are buried in trivia. The right level is one where each step is a single unambiguous action with its expected result, and the document is short enough to be kept current and read end to end during an incident.

A useful test is to hand the runbook to someone who did not write it and is not deeply familiar with the system, and ask them to walk through it in a rehearsal. The points where they hesitate, ask a question, or guess are the points where the runbook is too vague. The sections they skim because they are obvious are candidates for trimming. This kind of review, done as part of regular testing, keeps the runbook at the right level over time. Our guide to DR testing on OCI covers how to fold runbook review into rehearsals.

Ownership keeps the runbook alive

Every runbook needs a named owner who is responsible for keeping it current, or it will rot. Without clear ownership, updating the runbook falls to nobody, and the document drifts away from reality until it is useless. The owner does not have to write every change, but they are accountable for the runbook matching the system it describes, and for ensuring it is rehearsed on schedule. Tying this ownership to the same team that owns the workload keeps the runbook close to the people who know when the system changes.
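
Ownership and rehearsal dates are easy to record as metadata and check on a schedule. A minimal sketch, assuming each runbook carries an owner name and a last-rehearsed date; the `runbook_health` helper is hypothetical and the 90-day default is a placeholder for whatever cadence your team has agreed.

```python
from datetime import date


def runbook_health(owner, last_rehearsed, today, max_age_days=90):
    """Return a list of problems with a runbook's ownership or rehearsal age.

    owner          -- name of the accountable person or team, or "" if unset
    last_rehearsed -- date of the last rehearsal, or None if never rehearsed
    """
    problems = []
    if not owner.strip():
        problems.append("no named owner")
    if last_rehearsed is None:
        problems.append("never rehearsed")
    else:
        age_days = (today - last_rehearsed).days
        if age_days > max_age_days:
            problems.append(f"last rehearsed {age_days} days ago")
    return problems
```

Passing `today` in explicitly, rather than calling `date.today()` inside the function, keeps the check deterministic and easy to test.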

The first hour matters most

The structure of a good runbook reflects the reality that the first hour of an incident is where most of the damage is done or avoided. The opening sections (the trigger conditions, the decision authority, and the first few failover steps) are the ones that get used under the most pressure and with the least clarity. These deserve the most care, the clearest wording, and the most rehearsal, because a mistake here cascades through everything that follows.

Putting the most critical and time-sensitive material at the front, in the clearest possible form, and pushing the reference detail toward the back, matches the runbook to how it is actually used. The executor needs the trigger and the first steps immediately and under stress, and can consult the deeper detail more calmly once the recovery is underway. A runbook organized this way serves its reader at the moment of greatest need.

Part of a series
This guide is part of OCI Disaster Recovery — our complete pillar guide on the topic.

About the author

Morten Andersen, Co-founder of OCI Specialists. 20 years of enterprise IT experience in OCI migration, security, networking, and 24/7 operations.
