Change Management for OCI Estates

Ask any experienced operations team where outages actually come from and the answer is almost always the same. Not lightning strikes, not exotic attacks, not hardware that quietly dies. The most common cause of a production incident is a change that someone made. A configuration tweaked, a rule edited, a shape resized, a policy updated, each done with good intentions and each capable of breaking something that was working a minute earlier. Change management is the discipline of making those changes safely, with enough process to catch the damage before it reaches users, and on a live OCI estate it is one of the quietest and most valuable things a managed service brings.

Why change is the riskiest moment

A stable system that nobody touches tends to stay stable. The moment of risk is the moment of change, because that is when a working state is replaced by a new one that has not yet proven itself. The catch is that you cannot simply stop making changes: the estate has to evolve, security patches have to be applied, capacity has to be adjusted, and new applications have to be deployed. So the goal is never zero change. The goal is change that is planned, reviewed, reversible and recorded, so that when something does go wrong, and eventually it will, you know exactly what changed, when, by whom and how to undo it.

Without change management, an estate accumulates undocumented modifications until nobody can say with confidence what the current configuration actually is or why it looks the way it does. This is how estates become fragile. Every undocumented change is a small mystery, and a system full of mysteries is a system nobody dares to touch, which is its own kind of failure.

The three types of change

Mature change management does not treat every change the same way, because that would be either too slow for routine work or too loose for risky work. Instead it sorts changes into categories that each get a proportionate amount of process.

Change type	What it is	Process
Standard	Routine, low risk, well understood, done many times before	Pre approved, follows a known runbook, logged but not individually reviewed
Normal	Non routine change with real risk to assess	Reviewed and approved before it runs, scheduled into a window
Emergency	Urgent change to fix or prevent an incident	Expedited approval, full review after the fact

Standard changes are the routine work of an estate, such as applying a tested patch or adding a known type of resource. Because they are well understood and have been done safely many times, they can be pre approved and run from a runbook without individual sign off, which keeps the operation moving. Normal changes carry real risk and so are reviewed and approved before they run, then scheduled into a window where the impact of a problem is contained. Emergency changes are the exception, made urgently to stop or prevent an incident, and they get expedited approval up front with a full review afterward to confirm the right call was made.
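The sorting described above can be sketched in a few lines of Python. This is a minimal illustration rather than a definitive policy: the `ChangeRequest` fields and the two boolean flags are hypothetical names invented for the example, and a real estate would decide category membership with more nuance than two yes or no questions.

```python
from dataclasses import dataclass
from enum import Enum


class ChangeType(Enum):
    STANDARD = "standard"
    NORMAL = "normal"
    EMERGENCY = "emergency"


# Process summaries paraphrased from the table above.
PROCESS = {
    ChangeType.STANDARD: "pre approved, run from a known runbook, logged",
    ChangeType.NORMAL: "reviewed and approved before it runs, scheduled into a window",
    ChangeType.EMERGENCY: "expedited approval, full review after the fact",
}


@dataclass(frozen=True)
class ChangeRequest:
    summary: str
    # Hypothetical flag: a tested runbook exists and this task has been done safely before.
    has_proven_runbook: bool
    # Hypothetical flag: the change is needed now to fix or prevent an incident.
    fixes_or_prevents_incident: bool


def classify(change: ChangeRequest) -> ChangeType:
    """Sort a change into one of the three categories."""
    if change.fixes_or_prevents_incident:
        return ChangeType.EMERGENCY
    if change.has_proven_runbook:
        return ChangeType.STANDARD
    # Anything non routine defaults to the reviewed path.
    return ChangeType.NORMAL


if __name__ == "__main__":
    request = ChangeRequest("apply tested patch", True, False)
    kind = classify(request)
    print(kind.value, "->", PROCESS[kind])
```

The one deliberate design choice is the default: a change that cannot show a proven runbook falls through to Normal, so nothing reaches the pre approved path by accident.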


What a good change record contains

The record is the heart of change management, and a weak record makes the whole process theatre. A change that is approved but poorly described provides no real protection, because when something goes wrong the record cannot answer the questions that matter. A good change record captures what is being changed and why, who requested and approved it, when it will run and in what window, what the expected impact is, how the change will be verified as successful, and crucially how it will be rolled back if it is not. That last point, the rollback plan, is the one most often skipped and most often needed. A change without a tested way back is a change that can turn a small problem into a long outage.
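One way to keep records from going thin is to make the required fields explicit and refuse a record that leaves any of them blank. The sketch below assumes a simple in-memory record. The field names are illustrative choices that mirror the list above, not the schema of any particular ticketing tool.

```python
from dataclasses import dataclass, fields
from datetime import datetime


@dataclass
class ChangeRecord:
    what: str
    why: str
    requested_by: str
    approved_by: str
    window_start: datetime
    window_end: datetime
    expected_impact: str
    verification: str
    rollback_plan: str


def missing_fields(record: ChangeRecord) -> list[str]:
    """Return the names of text fields that are empty or only whitespace."""
    missing = []
    for field in fields(record):
        value = getattr(record, field.name)
        if isinstance(value, str) and not value.strip():
            missing.append(field.name)
    return missing


if __name__ == "__main__":
    record = ChangeRecord(
        what="Resize database host",
        why="CPU saturation at month end",
        requested_by="alice",
        approved_by="bob",
        window_start=datetime(2025, 1, 11, 22, 0),
        window_end=datetime(2025, 1, 11, 23, 0),
        expected_impact="Brief restart of one node",
        verification="Health check passes and CPU stays below threshold",
        rollback_plan="",  # the field most often skipped
    )
    print(missing_fields(record))
```

A check like this only proves that a rollback plan was written down, not that it was tested. That part still needs a person.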

The change lifecycle

A normal change moves through a predictable sequence, and the value of the process is that each step catches problems the previous step missed.

Request. The change is proposed and described, including its purpose, scope and expected impact.
Assess. Someone other than the requester reviews the risk, the blast radius and the rollback plan.
Approve. The change is authorised, or sent back for more detail, by whoever owns the risk.
Schedule. The change is placed in a window where impact is contained and the right people are available.
Implement. The change is made, ideally from a runbook, with the implementer watching for the expected result.
Verify. The system is checked against the success criteria defined in the request, not just assumed to be fine.
Review. The change is recorded as complete, and anything that went sideways feeds back into the process.

The discipline is in the separation of duties at the assess and approve steps. The person who wants to make a change is the worst person to judge whether it is safe, because they already believe in it. A second set of eyes catches the assumptions the requester cannot see, which is the entire point of review.
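The lifecycle and its separation of duties can be modelled as a small state machine. This is a sketch under simplifying assumptions: the class and exception names are hypothetical, the stages only move forward, and the "sent back for more detail" path at the approve step is not modelled.

```python
from enum import Enum


class Stage(Enum):
    REQUEST = 1
    ASSESS = 2
    APPROVE = 3
    SCHEDULE = 4
    IMPLEMENT = 5
    VERIFY = 6
    REVIEW = 7


class SeparationOfDutiesError(Exception):
    """Raised when the requester tries to assess or approve their own change."""


class Change:
    def __init__(self, requester: str) -> None:
        self.requester = requester
        self.stage = Stage.REQUEST
        # Who performed each stage, in order.
        self.history = [(Stage.REQUEST, requester)]

    def advance(self, actor: str) -> Stage:
        """Move the change to the next stage, performed by `actor`."""
        if self.stage is Stage.REVIEW:
            raise ValueError("change has already completed its lifecycle")
        next_stage = Stage(self.stage.value + 1)
        # The requester may not perform the assess or approve steps.
        if next_stage in (Stage.ASSESS, Stage.APPROVE) and actor == self.requester:
            raise SeparationOfDutiesError(
                f"{actor} requested this change and cannot perform {next_stage.name}"
            )
        self.stage = next_stage
        self.history.append((next_stage, actor))
        return next_stage


if __name__ == "__main__":
    change = Change(requester="alice")
    change.advance("bob")    # assess
    change.advance("carol")  # approve
    change.advance("alice")  # schedule: the requester may do this step
    print(change.stage.name)
```

The requester can still schedule and implement the change. The rule applies only to the two steps where their belief in the change would bias the judgment.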

Change windows and blast radius

Even a well reviewed change can go wrong, so timing matters. Running a risky change at the busiest hour of the business day, when an error reaches the maximum number of users and the maximum amount of revenue, is asking for the worst possible version of a bad outcome. Change windows exist to shrink the blast radius by running changes when impact is naturally lowest, and by making sure the people who can fix a problem are awake and available rather than scrambling in the middle of the night. On a global estate the right window is not always obvious, because quiet hours in one region are peak hours in another, and a managed service that understands the workload picks windows that genuinely minimise risk rather than ones that are merely convenient.
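The point about global estates can be made concrete with a small calculation. The sketch below assumes each region is described by a whole hour UTC offset and a list of 24 load values indexed by local hour. The office hours profile and the two example regions are invented for illustration, and the sketch ignores staff availability, which matters just as much when picking a real window.

```python
def office_hours_profile(busy: int = 10, quiet: int = 1) -> list[int]:
    """Hypothetical load by local hour: busy from 09:00 to 16:59, quiet otherwise."""
    return [busy if 9 <= hour < 17 else quiet for hour in range(24)]


def combined_load(utc_hour: int, regions: list[tuple[int, list[int]]]) -> int:
    """Sum every region's load at the local hour that corresponds to utc_hour."""
    return sum(profile[(utc_hour + offset) % 24] for offset, profile in regions)


def quietest_utc_hours(regions: list[tuple[int, list[int]]]) -> list[int]:
    """Return every UTC hour whose combined load equals the minimum."""
    loads = [combined_load(hour, regions) for hour in range(24)]
    lowest = min(loads)
    return [hour for hour, load in enumerate(loads) if load == lowest]


if __name__ == "__main__":
    profile = office_hours_profile()
    one_region = [(0, profile)]
    two_regions = [(0, profile), (8, profile)]  # second region is at UTC+8
    print(quietest_utc_hours(one_region))
    print(quietest_utc_hours(two_regions))
```

With one region, every hour outside the working day is a candidate. Adding a second region eight hours ahead rules out the early morning hours that looked quiet from the first region alone, which is exactly why a window chosen for convenience in one place can be a peak hour somewhere else.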

Configuration drift, the silent enemy

Change management is the front line, but the threat that slips past it most quietly is configuration drift. Drift is what happens when the live state of the estate slowly diverges from what the records and the infrastructure code say it should be, usually through small manual changes made outside the process. Each deviation is minor on its own, but they accumulate into an estate that no longer matches its own documentation, which undermines every assumption the operations team makes. The defence is to make infrastructure as code the source of truth, to apply changes through that code rather than by hand, and to detect drift when it appears so it can be corrected before it spreads. This connects change management directly to runbook automation, because a change executed by a tested, automated runbook drifts far less than one typed by hand under pressure.
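At its core, drift detection is a comparison between the state the code declares and the state that is actually running. The sketch below assumes both are available as flat dictionaries of settings. The keys and values are hypothetical, and in practice the live side would come from the cloud provider's API or from an infrastructure as code tool rather than a hand built dictionary.

```python
# Sentinel marking a setting that is absent on one side of the comparison.
MISSING = object()


def detect_drift(desired: dict, live: dict) -> dict:
    """Return {setting: (desired_value, live_value)} for every setting that differs."""
    drift = {}
    for key in sorted(desired.keys() | live.keys()):
        want = desired.get(key, MISSING)
        have = live.get(key, MISSING)
        if want != have:
            drift[key] = (want, have)
    return drift


if __name__ == "__main__":
    # Hypothetical settings, as declared in the infrastructure code.
    desired = {"ocpus": 2, "backup_enabled": True}
    # Hypothetical live state after two manual changes made outside the process.
    live = {"ocpus": 4, "backup_enabled": True, "debug_port": 8080}

    for setting, (want, have) in detect_drift(desired, live).items():
        want_text = "absent" if want is MISSING else want
        have_text = "absent" if have is MISSING else have
        print(f"{setting}: code says {want_text}, live is {have_text}")
```

Settings that exist only on the live side are reported as well, since a resource or rule added by hand is as much a drift as a value that was edited.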

Why change management pairs with managed services

Change management is exactly the kind of discipline that erodes under pressure when a team is stretched. When everyone is busy and a change seems urgent, the temptation to skip the review and just make it is strong, and the day it works is the day the habit forms. A managed service holds the line because the process is the service, not an overhead on top of it. The reviews happen because reviewing is the job, the records are kept because records are the deliverable, and the windows are honoured because honouring them is what a calm operation looks like. This is one of the reasons proactive operations stay stable while reactive ones lurch from incident to incident.

Change management is one discipline within a complete operational practice. For the full scope, see the complete guide to OCI managed services; for how the day to day work is staffed, see in house versus managed OCI operations. When you want change handled with this kind of process on your estate, our OCI managed services practice runs it as part of the standard operation.

Part of a series
This guide is part of OCI Operations & Observability — our complete pillar guide on the topic.

About the author

Morten Andersen, Co-founder of OCI Specialists: 20 years of enterprise IT experience in OCI migration, security, networking, and 24/7 operations.
