VM Replication & Resiliency: Three Common Hurdles For SMBs Part 3: Failover Testing

Geographic redundancy is not just for big enterprises. Small and medium-sized businesses (SMBs) can take advantage of it to protect their critical apps and keep downtime to a minimum. How, you ask? Well, if you’re running the apps on VMware virtualization, then VM replication technology and expert managed hosting are a good place to start.

In this three-part blog series, I’ll cover the following common challenges that IT managers face when considering a resiliency solution.

Top 3 Challenges:

  1. Cost
  2. Complexity
  3. Failover Testing

You’re Free to Test, But Testing Isn’t Free
Remember my first blog installment? I defined failover as the process of switching to the backup infrastructure in the secondary DC after a major disruption causes the apps in the primary data center to become unavailable. Testing failover and the subsequent failback can be challenging, especially for SMBs. It requires time, resources and a ton of planning.

It also involves risk. With a full failover/failback test, you’re putting your production workloads on the line. What happens if the failover, well, fails? Or if the failback doesn’t bring up your primary production environment as expected? This uncertainty is precisely why extensive planning must happen.

Every full failover test carries a substantial cost: the time it takes to plan, the personnel on hand to manage the failover and failback, and any charges from the service provider for performing the test.
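To put a rough number on that, here is a back-of-the-envelope sketch in Python. The function name, the parameters and every figure in the example call are hypothetical placeholders, not real pricing; swap in your own hours, rates and provider fees.

```python
def full_test_cost(planning_hours: float, staff_hours: float,
                   hourly_rate: float, provider_fee: float) -> float:
    """Estimate the cost of one full failover/failback test.

    All inputs are assumptions you supply: time spent planning, time
    staff spend on hand during the test, a blended hourly rate, and
    whatever the service provider charges for the test.
    """
    return (planning_hours + staff_hours) * hourly_rate + provider_fee


# Hypothetical example values only
cost = full_test_cost(planning_hours=40, staff_hours=24,
                      hourly_rate=75.0, provider_fee=1500.0)
print(f"Estimated cost per full test: ${cost:,.2f}")
```

Multiply that result by the number of full tests you plan per year and you have a starting point for the budget conversation.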

Testing…1, 2, 3?
How often should you fully test the failover? Unfortunately, the answer is, it depends. Testing is needed at an interval that makes sense for your company and your budget. Your production environment is in flux – data is changing and growing, new code is being pushed for apps, operating systems are being patched, hypervisors and VMs are being added, bandwidth requirements are increasing, etc. As part of any sound DR strategy, it is recommended that you execute the failover runbook as part of a real-world test of the failover process.

While there is no substitute for a real-world test between data centers, there is a way to supplement this occasional drill with more frequent snapshot-based tests. You can quickly and affordably simulate how your replicated production VMs would respond if they were restarted in a different DC.

Some replication software and managed services can create a snapshot of the critical VMs being replicated, provided that you have enough extra storage space in the redundant infrastructure. This test occurs only in the secondary DC and doesn’t involve your production environment, which removes the risk and extensive planning required for a full failover test.

Take a look at the graphic below. It represents a snapshot-based failover test of the replicated VM 2. You’ll notice that the replication process continues uninterrupted, and the replicated VM 2 remains powered off. In Data Center 2, a snapshot of the offline VM is created, then powered on, tested and finally deleted. This test is quick, easy and doesn’t require planning or an IT team on hand. This test is contained in a sandbox environment and doesn’t affect your production VMs or even the replication process.
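If it helps to see the sequence as code, here is a minimal Python sketch of the steps described above. It is an in-memory model only: `ReplicaVM`, `snapshot_failover_test` and the `health_check` callback are hypothetical names made up for illustration, and none of this calls a real VMware or replication API.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ReplicaVM:
    """Hypothetical stand-in for a replicated VM in the secondary DC."""
    name: str
    powered_on: bool = False          # the replica itself stays powered off
    replication_active: bool = True   # replication keeps running throughout
    snapshots: List[str] = field(default_factory=list)


def snapshot_failover_test(vm: ReplicaVM,
                           health_check: Callable[[str], bool]) -> bool:
    """Snapshot the offline replica, test the snapshot, then delete it."""
    snap = f"{vm.name}-test-snapshot"
    vm.snapshots.append(snap)          # 1. create a snapshot of the offline VM
    try:
        # 2-3. the callback stands in for powering on the snapshot in the
        # sandbox and checking that the critical apps start up
        return health_check(snap)
    finally:
        vm.snapshots.remove(snap)      # 4. always delete the test snapshot


vm2 = ReplicaVM(name="vm2")
passed = snapshot_failover_test(vm2, health_check=lambda snap: True)
print(passed, vm2.powered_on, vm2.replication_active, vm2.snapshots)
```

Notice that the replica's `powered_on` and `replication_active` flags are never touched, and the snapshot is cleaned up even if the check raises an error. In a real environment, the health check would be whatever your replication software or managed service provides for booting and validating the snapshot.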

Although SMBs should still perform a full DC-to-DC failover and failback test, snapshot-based tests can be done quickly and often. When major changes are replicated to the VMs in the secondary DC, an SMB can do a quick check to see if its critical apps will start up and run properly. More importantly, it can be done at no cost, with minimal team distraction and zero risk to the production environment.

Quick recap…

  • A full failover/failback test is challenging for an SMB
    • Requires time, resources and tons of planning
    • Puts the production environment at risk of becoming unavailable
  • Testing still needs to happen because the production environment is changing/growing
    • SMBs can supplement full failover tests with more frequent snapshot-based tests
    • Snapshot-based failover tests remove risk to production and require very little time and team resources
    • It’s a cost-effective way to test major changes to critical replicated VMs

Want to learn more about VM replication and resiliency, and how to overcome these hurdles? Check out this presentation on SlideShare: VM Replication Is Your Lifeline When Disaster Strikes.

Brent is a Product Manager for the VMware Cloud Practice. Prior to product management, he owned solution marketing across the Rackspace Managed Private Cloud portfolio. A Racker since 2012, Brent has more than 15 years in software, cloud and managed IT services focusing on product lifecycle management, product strategy and roadmap, implementation of agile principles, go-to-market strategy, content marketing and sales/partner enablement. He hails from Southern California, but currently lives in Austin. While working for DreamHost in Los Angeles, Brent founded the OpenStack LA user group. On weekends, you might find him on a stand up paddle board floating down Lady Bird Lake, or riding his mountain bike down the trail. You can follow his cloudy interests on Twitter @BrentScotten.


