Lessons from a Successfully Failed Disaster Recovery and Failover Test

Conducted during a busy release weekend, the failover test exposed gaps not in the technology itself, but in coordination and communication. While production ultimately stayed unaffected, the situation quickly escalated as subcontractors weren’t aligned, assumptions didn’t match reality, and information didn’t flow when it mattered most.

We unpack how a well-intentioned test turned into a coordination challenge, where timing, dependencies, and unclear responsibilities created confusion across teams. It’s a story about how resilience isn’t just about systems and infrastructure, but also about people, processes, and making sure everyone is on the same page — especially when things are supposed to “just be a test.”

Listen now on Apple Music, Spotify, Deezer, YouTube, or wherever you get your panic attacks.

Welcome to another episode of IT Horror Stories—today, we bring you an epic saga from the trenches of enterprise IT. This is not the story of a smoothly executed disaster recovery test. Nope, this is about the failover that, well… technically succeeded and failed at the same time. Grab your coffee, your stress ball, and some popcorn—let’s get into what happens when everyone does things “by the book” and it all still goes sideways.


Setting the Scene

Jack Smith: Welcome back, everyone. This is Horror Stories with Jack Smith. I’m Jack Smith and, across the table, I have Bob.

Bob: Hi guys and girls and everybody in between, however you like to roll. Everybody is welcome. Every single one…

Let’s get comfy. This time, we’re talking about a release and failover process that went so “by the book” that it left us laughing, crying, and desperately searching for someone to blame other than ourselves… and possibly our supplier’s suppliers’ suppliers.


Corporate Environments: Big Fish, Bigger Pond

It all happened about a decade ago (give or take—we try not to count those years too closely). Our protagonist, Bob, was working for a large enterprise—think way larger than your average small business. We’re talking about organizations that don’t just have suppliers, they have suppliers with suppliers, and sometimes you need a full map to know who called who.

The Cast of Characters:

  • The Company: Handles its own app management and development. Architects, analysts, developers, testers—all present.
  • Third-Party Devs and Testers: Because, you know, offshore is cheaper.
  • An Infrastructure Supplier: Because who wants to own all those racks and blinking lights?
  • Hosting/Data Center Provider: The suppliers’ supplier. You see where this is going.

This is a story from an environment where more people are involved in launches than a small nation’s moon program.

Quote Highlight:

“A tad bigger than a small business, indeed. And essentially the story here happened about 10 years ago. I tend to forget my age or I don’t want to remember my age.”


Failover Planning: Herding Cats With Clipboards

1. Annual Failover—“Because Regulators Said So”

Every year, in this regulated industry, the company had to prove that its failover system, disaster recovery setup, and business continuity machines actually worked. If you’re just picturing flipping a switch—stop. We’re talking:

  • At least two data centers
  • A planned failover to the backup center for a full week of real business work
  • A “fail back” to primary at the end (fingers crossed)
  • Different server types: mainframe, AS/400, Windows boxes, a dash of cloud, and some SaaS
  • Dozens (sometimes hundreds) of people babysitting releases because… well, see above

Quote Highlight:

“So, so far nothing out of, you know, ordinary. That sounds dangerous. No, indeed, indeed.”

2. Release Weekends: The Ritual

  • Four times a year: Rolling changes from UAT (User Acceptance Testing) to Prod
  • Up to 300 people in the building on a Sunday
  • Saturday = IT’s technical deploy, basic checks
  • Sunday = Business teams swarm in and try to break stuff (er, validate)
  • Every step mapped out, every checklist ticked

Agile? Kind of. Waterfall? Mostly. Budget? Let’s just say: big.


Double Disaster: Because One Isn’t Enough

So, this particular year, management had a bright idea:

“Since we already have all these people around for the release weekend, why not combine the disaster recovery test with one of those weekends? We get two birds with one stone. Cheaper for us!”

What Could Go Wrong?

  • Infrastructure provider: Let’s align their release weekend with ours
  • Double activity weekend: Application failover and infrastructure partner’s UAT-to-production moves at the same time
  • Us: “Isn’t that asking for trouble?” (Answer: oh yes.)

Critical Failure: “Works for Me!”

Start of the Weekend

Saturday morning. Caffeine. Cautious optimism. Everyone following the plan to the letter:

  • UAT-to-production promotions are in motion
  • Infrastructure partner is busy, but in their own “non-overlapping” world
  • Failover to backup data center: check. All green lights.

And Then… Validation Time

Nothing. Works.
Not a single login screen. Not even the “Citrix is starting up” spinner.

“Validation part, nothing worked. And when I say nothing, I mean absolutely nothing worked. We couldn’t even log in anymore. It was a Citrix environment. We could not log in into our own accounts anymore. Everything broke.”

  • 11pm. Everyone’s tired
  • Can’t log in, can’t check servers, can’t run tests
  • Incident call time: Us, our incident managers, infrastructure partners, a growing Zoom call… you know the drill

The Realization Moment

Now, here’s where the magic happens:

“Looks Fine to Us!”

Infrastructure and hosting partners both look.
“All our systems are green!”
“We can see the servers, network is up!”

But…

We can’t even ping the backup data center.
No traffic gets through. It’s as dead as a doornail.

“Luckily, several people on our side do have some infrastructure knowledge. So we didn’t have the rights, but the infrastructure part gave us the rights so we could perform some network traces and stuff like that. And essentially we came to the conclusion that we had no network connectivity in the secondary data center.”
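
The episode doesn’t spell out the exact traces the team ran, but a reachability sweep along the lines of the sketch below is enough to reach the same conclusion. It’s a minimal Python sketch with hypothetical host names, not the actual environment from the story.

```python
# Minimal reachability sweep (sketch). Host names and ports are hypothetical
# placeholders, not the real environment from the episode.
import socket

# Key endpoints in the backup data center that should answer after a failover.
ENDPOINTS = [
    ("citrix-gw.dc2.example.internal", 443),      # Citrix gateway
    ("app-frontend.dc2.example.internal", 443),   # application front end
    ("db-listener.dc2.example.internal", 1521),   # database listener
]

def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    for host, port in ENDPOINTS:
        status = "OK" if is_reachable(host, port) else "NO ROUTE / NO ANSWER"
        print(f"{host}:{port} -> {status}")
    # In the story, every endpoint came back unreachable even though the supplier
    # dashboards showed all servers green: the apps were up, the network was not.
```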

Hosting Provider’s Turn

After two hours of hair-pulling conference calls, the hosting partner’s tech eventually pipes up: “Which data center are you trying to connect to?”

  • Us: “Number two—the backup.”
  • Hosting provider: “Oh. Uh. Hold on a second.”

That, friends, is the sound of “Oh shit.”


Split Brain: When Systems Can’t Decide

So, it turns out:

  • Hosting provider thought: Since the company and the infrastructure partner were already active that weekend, why not ALSO run their own disaster recovery test? So they failed the network over from the backup data center back to the primary.

So, in the end:

  • One data center: All apps and servers ready to respond—but no network.
  • The other data center: Just network—no apps, no services.

Both suppliers ran their failover scripts at the exact same time—in opposite directions.

Quote Highlight:

“Ouch. We had one data center with zero applications and services running with network, and we had another data center with everything running, throwing errors everywhere because we had no network. But still, you could claim that both failovers were a success.”

The “No Going Back” Policy

To add insult to injury, both the company and the suppliers had a policy:

“A failover can’t fail. Once you start, there’s no way to abort and roll back.”

So, the backup had to be resurrected—there was no option to simply “try again” on another day.


The Recovery: No Rest for the Weary

Hot Potato: Who Goes First?

  • Suppliers’ suppliers: “Let’s finish our disaster recovery from backup back to primary. When done, we’ll return the network and you can do your business.”
  • Us: Wait for them to finish, then start our mad dash through IT validations and business validation.

Time Lost

  • Their tests wrapped up by early Sunday afternoon (~2pm). Only then could we start our actual work.
  • That meant 12–15 hours behind schedule. The precious sleep-time window for IT folks? Gone.
  • IT validation rushed through in just a few hours.
  • Business validation crammed in, starting at 6pm Sunday evening.

Result:

  • Final all-clear came at 6:30am Monday morning.
  • Office doors open at 7:30; business as usual starts at 8:00.
  • Relief, exhaustion, and a sense of “that could have gone a lot worse.”

Lessons Learned: Communication Breakdown

Looking back, nothing in the internal plan was wrong:

  • Roadbooks? ✔️
  • Checklists? ✔️
  • Stakeholder communications? ✔️

But in the supplier chain, the left hand and the right hand weren’t talking. And nobody had a holistic view.

Quote Highlight:

“It was just a case of at the supplier side, left hand didn’t really talk to the right hand.”

Main Causes

  1. Assumptions:
    • Hosting provider assumed their disaster recovery was “low impact.”
    • Infrastructure partner assumed the same.
  2. Missed Communication:
    • Each group only told their next-in-line about big changes.
    • Anything “non-impactful to customers” didn’t make it onto our change log.
  3. Size = Complexity:
    • With big organizations, it’s easy for a critical memo to never reach the exact people who need it.
  4. Holy Change Boards:
    • Yes, there were change boards. But if a change is considered low-risk, no notification is needed. Until it isn’t.

Failovers, Due Diligence, and Expensive Sleep Loss

The Magic Question

From that point on, Bob always demanded, before any major event:

“Who are your suppliers? Can I get confirmation from EVERYONE along the chain—provider, infrastructure, hosting—that nothing will be happening, no matter how low-impact it might seem?”

Sure, most of the time, it’s just a few calls or emails. But it’s so much easier than losing a weekend, paying for a dozen hotel rooms, and living on vending machine food.
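
If you want that question to be more than a verbal round of nods, track the answers explicitly. Below is a minimal Python sketch (supplier names are invented) that only reports “clear to proceed” once every party in the chain, sub-suppliers included, has confirmed a change freeze for the weekend.

```python
# Sketch of a "freeze confirmation" tracker for a failover weekend.
# Supplier names are invented for illustration, not taken from the story.
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class Supplier:
    name: str
    confirmed_freeze: bool = False                  # explicit "nothing will happen" sign-off
    subsuppliers: list[Supplier] = field(default_factory=list)

    def chain_confirmed(self) -> bool:
        """True only if this supplier AND everyone below it has confirmed the freeze."""
        return self.confirmed_freeze and all(s.chain_confirmed() for s in self.subsuppliers)

hosting = Supplier("Hosting / data-center provider")
infra = Supplier("Infrastructure partner", subsuppliers=[hosting])
chain = Supplier("Own organization", confirmed_freeze=True, subsuppliers=[infra])

infra.confirmed_freeze = True     # the infrastructure partner signed off...
print(chain.chain_confirmed())    # ...but the hosting provider never did -> False, don't start yet
```

Nine times out of ten the confirmations come back within a day; the point is that the one missing sign-off becomes visible before the weekend instead of at 11pm on Saturday.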

Resistance

Getting everyone to sign off isn’t always easy:

  • Sometimes, risk acceptance comes from leadership: “We’ll just take the risk this time.”
  • In that case: “Fine, just sign right here and say you’re aware.”

9 out of 10 times nothing happens. The 10th? Well, it makes for an epic horror story.

Change Boards: A Blessing and a Curse

  • Not every little change should be communicated—otherwise, everyone drowns in emails.
  • But if it even touches the backup data center, network infrastructure, or overlaps with a failover weekend?
    • Communicate. Please. (A minimal sketch of such a rule follows below.)
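
One way to apply that rule consistently is to encode it, rather than leaving “low impact” to each party’s judgement. A minimal sketch, with made-up change attributes and dates:

```python
# Sketch: flag changes that must be broadcast to the whole supplier chain, even if
# the requester labels them "low impact". Attributes and dates are made up.
from dataclasses import dataclass
from datetime import date

FREEZE_WINDOWS = [(date(2025, 3, 15), date(2025, 3, 17))]   # e.g. a failover weekend

@dataclass
class Change:
    summary: str
    touches_backup_datacenter: bool
    touches_network: bool
    planned_date: date
    declared_impact: str = "low"

def must_notify_chain(change: Change) -> bool:
    """Notify everyone if the change touches the backup data center, the network,
    or overlaps a freeze window, regardless of the declared impact."""
    in_freeze = any(start <= change.planned_date <= end for start, end in FREEZE_WINDOWS)
    return change.touches_backup_datacenter or change.touches_network or in_freeze

dr_test = Change("Hosting provider DR test", touches_backup_datacenter=True,
                 touches_network=True, planned_date=date(2025, 3, 15))
print(must_notify_chain(dr_test))   # True, even though the provider would call it "low impact"
```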

Conclusion: TL;DRs and Takeaways

So what’s the real lesson of “The Failover That Failed Successfully”?

  • Talk. It doesn’t matter if you own the infrastructure or outsource it. If you’re tied to suppliers, and especially suppliers with suppliers, demand explicit “nothing will happen” confirmation before doing anything critical.
  • Don’t stack disaster recovery tests unless you have an absolute, air-tight chain of communication.
  • Be wary of “low impact” changes. To you, it’s a checkbox. To someone else, it’s the trigger for a 14-hour crisis.
  • Check your suppliers’ suppliers. And get them all in the (virtual) room.
  • Never assume “no impact” means “absolutely no impact.”

Final Quote Highlight:

“If you have something extremely critical, validate the entire chain. Even if it’s just for your own peace of mind. It’s usually a few mails, a few calls and you can sleep tight and you can avoid these type of things.”


Some Visuals and Checklists

Quick supplier-chain diagram template

```mermaid
graph TD
  A["You (Organization)"]
  B["Infrastructure Partner"]
  C["Hosting/Datacenter Provider"]
  D["Suppliers' Supplier"]
  A --> B
  B --> C
  C --> D
  click A href "mailto:your-it-team@example.com"
  click B href "https://supplier-team-portal.com"
```

Release Weekend To-Do List

  • [x] Align release weekends and failover weekends
  • [x] Confirm all change boards know about critical events
  • [x] Send explicit “no activity” requests to every supplier and sub-supplier
  • [x] Schedule overlap check meeting for all parties
  • [x] Build in buffer time (and sleep time)
  • [x] Organize on-call rotations and hotel rooms as backup
  • [x] Double-check assumptions, especially “low risk” ones

Wrap-Up: Why We Share These Stories

Some IT war stories are funny in hindsight. Some are painful. This one’s a bit of both—with a solid “don’t let this happen to you” message for every ops engineer, project manager, or C-level exec who thinks, “Let’s just combine tasks to save money!”


Final Words

You don’t always need a highly polished, expensive “lessons learned” write-up to stay safe. Sometimes all you need is to ask, and ask again. That, and maybe don’t do parallel failover drills unless someone signs in triplicate.

Thanks for reading. You are one of us.

“On paper we did everything right. We had the road book. Everything was communicated, everything was validated. It was just a case of at the supplier side, left hand didn’t really talk to the right hand.”

Stay safe. Communicate. And always check the backup data center.


