In my time as a staffer within the Department of Defense, and as a contractor for numerous government customers, I found almost nothing more widely misunderstood than “DR/COOP.”
Whether it was an unwillingness to acknowledge the differences between the two, incorrect assumptions about each, or the game of primary-responsibility hot potato too often played by IT and mission owners, few ever seemed to have a complete grasp of it. The result was a false sense of security at best and, in many cases, irreparable damage.
In an effort to help your agency avoid a similar fate and develop realistic and functional disaster recovery and continuity of operations plans, I’ll walk through three of the most common mistakes I still see and highlight ways to avoid making them.
Ignoring the reality that ‘Everything fails, all the time’
The first and most common DR/COOP mistake is simply denying the need to plan for failures and instead attempting to design an architecture that “never fails.” Inevitably, what gets created is an architecture that simply cannot tolerate, or gracefully respond to, any failure.
As Werner Vogels, CTO of Amazon, puts it, “Everything fails, all the time.” Ignoring this basic principle generally results in a system that more closely resembles a house of cards than an impenetrable fortress. Instead, it’s critical to build a system that automatically heals itself when minor problems arise and can be easily reconstituted in the event of a catastrophic failure.
To properly plan for failure, assume failures will occur and ensure you have built plans for how these failures will be handled. Use automation where possible and consider the implications of failure during each step of the design process.
In some cases, however, automation isn’t possible, and in others, a manual process may be preferred. You don’t want a computer to make a binary decision to initiate a two-hour failover process when it was just a failed package installation that can be backed out in five minutes. In an age of trying to apply artificial intelligence and machine learning to all the things, a little human intellect can go a long way.
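One way to put that principle into practice is to automate recovery for minor faults while keeping a human in the loop for major failover decisions. The sketch below is purely illustrative: the function names, thresholds and the `probe`/`human_approves_failover` callables are hypothetical stand-ins, not any particular tool's API.

```python
import time

# Hypothetical thresholds -- tune these for your own environment.
TRANSIENT_RETRIES = 3   # probes before declaring the service down
RETRY_DELAY_SECS = 1    # pause between probes; transient blips often clear

def check_service(probe):
    """Probe the service a few times before concluding it is really down."""
    for _ in range(TRANSIENT_RETRIES):
        if probe():
            return "healthy"
        time.sleep(RETRY_DELAY_SECS)
    return "down"

def decide(probe, scope_is_widespread, human_approves_failover):
    """Self-heal localized faults; gate a lengthy DR failover on a human."""
    if check_service(probe) == "healthy":
        return "no action"
    if not scope_is_widespread:
        # Localized fault (e.g., a bad package install): automate the fix.
        return "auto-restart"
    # Widespread outage: a two-hour failover should not start on autopilot.
    if human_approves_failover():
        return "initiate DR failover"
    return "hold for manual remediation"
```

The design choice here mirrors the point above: automation handles the binary, low-risk decisions, while the expensive, hard-to-reverse failover waits for human judgment.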
Conflating the two
Coming in a very close second is the mistake of conflating DR plans and COOP plans, and assuming that by having one you have the other. This couldn’t be further from the truth. Disaster recovery is focused on IT, whereas continuity of operations involves maintaining delivery of all essential aspects of government despite disruptive events. Disaster recovery is usually a subset of continuity of operations, but the latter applies more broadly to people, processes and functions than to technology.
Continuity of operations is how your organization operates during a major disruption, while disaster recovery is how you reconstitute an IT system after a major disruption. Each IT system likely has its own disaster recovery plan accounting for multiple disaster scenarios, while your entire organization may only have a single continuity of operations plan.
Confusing HA, FT and DR
IT and the federal government both love their alphabet soup. With all the acronyms, it’s easy to confuse or misuse them, especially those closely related like HA, FT and DR — high availability, fault tolerance and disaster recovery.
Often, I hear people attempt to “rank” these options, such as: HA > FT > DR. Then they reason: high availability is the best, and I already have that, so I don’t need disaster recovery. Right? Wrong. DR plans are created to address major disasters, not service interruptions. Fault tolerance and high availability address localized losses of individual system functionality. Disaster recovery is about widespread losses of functionality, potentially even external to your highly available or fault-tolerant system.
For example, what happens when your organization’s core router goes down? In most well-architected networks, the core router is configured as part of an FT or HA pair. But what happens if they both fail? Your entire network goes down. How would that happen? Something that impacts your entire headquarters building, such as a fire or flood. An overzealous backhoe operator, perhaps? (I’ve actually been through that with more than one organization.)
This is a legitimate disaster, one that no FT or HA system design could realistically account for. You might be thinking: Well, that’s why my application runs in the cloud! Sure, your application continues chugging along, but nobody in your office, or anywhere on your corporate network, can get to the cloud.
In this specific scenario, you would probably need to activate your COOP plan to enable alternate work locations until your headquarters is restored to normal operations, as well as your DR plans to restore access to IT systems.
While those three mistakes are some of the most common, they’re hardly the only ones. Here’s a list of a few more that didn’t make the cut; I’ll bet many of you can relate:
- Impossible RPO/RTO targets
- Everything is “Mission Essential”
- Ignoring cost as a factor
- DR environments more robust than production
- My COOP plan has my entire staff working from a location with no network access to my DR site
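On the first of those, a quick arithmetic check can expose impossible RPO/RTO targets before a disaster does. This is a minimal sketch under illustrative assumptions (the function names and all of the numbers are hypothetical): worst-case data loss equals the gap between backups, and a recovery-time objective is only credible if a rehearsed restore fits inside it.

```python
def rpo_feasible(rpo_minutes, backup_interval_minutes):
    """Worst-case data loss is the interval between backups; it must fit the RPO."""
    return backup_interval_minutes <= rpo_minutes

def rto_feasible(rto_minutes, measured_restore_minutes):
    """An RTO is only credible if a rehearsed restore actually completes within it."""
    return measured_restore_minutes <= rto_minutes

# Illustrative targets: a 15-minute RPO against hourly backups cannot be met,
# while a measured 2-hour restore comfortably fits a 4-hour RTO.
print(rpo_feasible(15, 60))    # backups too infrequent for the stated RPO
print(rto_feasible(240, 120))  # restore rehearsal fits the stated RTO
```

Checks this simple are often skipped, which is exactly how targets end up on paper that the backup cadence and restore process were never capable of meeting.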
Avoid the mistakes: work with a trusted partner
Continuity of operations and disaster recovery planning remain vitally important, even in the cloud. They are not interchangeable. Organizations need both. When in the assessment phase of planning, make sure you think about all possible failures, and don’t be tempted to believe that high availability and fault tolerance supplant the need for disaster recovery.
If your organization is looking for additional guidance, Rackspace can assist. We offer unbiased expertise across a range of leading cloud infrastructure technologies, built on a compliance-ready framework and backed by ongoing managed operations, continuous monitoring, compliance documentation and audit assistance. We take a security-first approach, working with each of our government and commercial customers to ensure defense in depth and a comprehensive plan for disaster recovery.