I hope you’ve been following our in-depth Building a Disaster Recovery Solution using VMware vCenter™ Site Recovery Manager™ (SRM) series. Today, I’m back to walk through the fourth and final leg of our journey. We’ve come a long way, and you should have a nice disaster recovery (DR) solution in place with test plans and milestones for your testing. Like a fire extinguisher, hopefully you’ll never actually have to use them.
We all wish DR was a simple “set it and forget it” activity like programming an old VCR or DVR, but it doesn’t quite work that way. Systems are fluid. They change from day to day with virtual machines (VMs) being spun up and torn down so irregularly that it’s hard to keep track. But we NEED to keep track so we’re ready if, and when, disaster strikes. For those of you who are more visual learners, check out this video.
Maintaining the beast takes a little time
If you remember back to the beginning, we talked a lot about planning and creating Visio diagrams to map everything. Now, any time your systems change, you need to pull out those docs and update them. Will this be tedious? It might be. Is it difficult? Absolutely not. If you get in the habit, making updates is no big deal. If you have existing change control teams and processes, hopefully they help keep these docs current, either by pestering you to do it or by doing it themselves. Either way, these docs need to stay current so you know for sure that everything is protected and every integration is documented.
The most common change you’ll probably see is a VM being added to the SRM environment with nothing else done afterward. Yes, the VM is replicated if you configured VMware vSphere® Replication™ for it or placed it on a replicated datastore, but SRM hasn’t added it to “the list” yet. You’ll need to go into the protection group and either select the “Configure All” button or configure protection for that VM individually. Luckily, this is one of the easiest changes to track. Do the same thing when decommissioning a VM.
When you delete a VM in vCenter, an alert pops up reminding you that another application manages this VM. That should be a prompt to go into SRM and remove protection for that VM, but more often than not, it doesn’t get done. Fortunately, this is the less harmful kind of drift in a DR event: SRM will try to import and recover a non-existent VM, which stops the recovery and extends your recovery time, but that’s still better than expecting SRM to recover a VM that someone forgot to configure in the first place. You’ll have to remove the deleted VM from the plan and rerun the recovery.
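This kind of drift is easy to catch with a periodic check. Here’s a minimal Python sketch of the idea, with hard-coded VM names standing in for what you’d actually pull from the vCenter and SRM APIs; it flags drift in both directions, VMs replicated but never protected, and VMs still protected but already deleted:

```python
def find_protection_drift(replicated_vms, protected_vms):
    """Return (unprotected, stale) where:
    - unprotected: replicated VMs never configured in a protection group
    - stale: VMs still protected in SRM but no longer in the inventory
    """
    replicated = set(replicated_vms)
    protected = set(protected_vms)
    unprotected = sorted(replicated - protected)   # needs "Configure All"
    stale = sorted(protected - replicated)         # remove protection
    return unprotected, stale

# Hypothetical inventories for illustration; "web02" was deleted in
# vCenter but never removed from SRM, and "app02" was never protected.
replicated = ["app01", "app02", "db01", "web01"]
protected = ["app01", "db01", "web01", "web02"]

unprotected, stale = find_protection_drift(replicated, protected)
print("Needs protection:", unprotected)   # ['app02']
print("Stale protection:", stale)         # ['web02']
```

Run on a schedule, a report like this turns “someone forgot” into a ticket instead of a surprise during a recovery.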
Another thing you may see sporadically is the creation of additional port groups at the source site and SRM-protected VMs using them. This goes back to change control and keeping your docs up to date. Any time a new VM requires a new network, the corresponding port group should be created at the target side and mappings put in place. Otherwise, protecting that VM in SRM will fail because the mapping is missing. Failures…well…they fail, and you don’t want your systems to fail.
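One way to catch this before it bites you is to compare the port groups your protected VMs actually use against the network mappings configured in SRM. This Python sketch uses hypothetical port group names; the real data would come from your vCenter inventory and SRM’s inventory mappings:

```python
def missing_network_mappings(vm_port_groups, network_mappings):
    """Return the source port groups used by protected VMs that have no
    SRM network mapping to the recovery site."""
    used = set(vm_port_groups.values())
    return sorted(pg for pg in used if pg not in network_mappings)

# Hypothetical data: which port group each protected VM uses, and the
# source -> target mappings currently configured in SRM. PG-DB is a new
# port group nobody mapped yet.
vm_port_groups = {"app01": "PG-App", "db01": "PG-DB", "web01": "PG-Web"}
network_mappings = {"PG-App": "DR-PG-App", "PG-Web": "DR-PG-Web"}

print(missing_network_mappings(vm_port_groups, network_mappings))  # ['PG-DB']
```

Anything this returns is a port group that needs a counterpart created at the target side and a mapping put in place before the VM can be protected.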
Since we’re talking about adding new VMs to SRM, a typically associated task is adding datastores. To make sure everything works well, ask your storage team to make sure this new volume or LUN is added to replication if you’re using array-based replication. Once you know that’s done, add the datastore to the hosts. One word of caution: Before you start adding VMs to the new datastore, go into the Array Manager in SRM and refresh it, validating that you can see this new datastore in the Devices tab. While you’re in SRM, move down to the protection groups section and add your new datastore to the protection group or create a new one, depending on your plan. Now you’re ready to add VMs to this new datastore and protect them in SRM.
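The steps above amount to a pre-flight checklist, which you could sketch along these lines. The datastore and protection group names are made up, and the inputs stand in for what you’d confirm with your storage team, the Array Manager Devices tab, and SRM’s protection groups:

```python
def datastore_preflight(datastore, replicated, array_devices, protection_groups):
    """Return a list of problems to fix before placing VMs on a newly
    added datastore."""
    problems = []
    if datastore not in replicated:
        problems.append("volume/LUN is not in the replication relationship")
    if datastore not in array_devices:
        problems.append("not visible in the Devices tab (refresh the Array Manager)")
    if not any(datastore in members for members in protection_groups.values()):
        problems.append("not a member of any protection group")
    return problems

# Hypothetical state: replication is done and the device shows up, but
# the protection group was never updated.
print(datastore_preflight(
    "DS-New-01",
    replicated={"DS-New-01"},
    array_devices={"DS-New-01", "DS-Old-01"},
    protection_groups={"PG-Prod": ["DS-Old-01"]},
))  # ['not a member of any protection group']
```

An empty list means you’re clear to start adding VMs to the new datastore and protecting them.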
It’s rare to remove datastores, but it does happen, so let’s cover that scenario, too. If you’re upgrading to a shiny new storage frame, you get to add new datastores and decommission the old ones. The process is the same, worked backwards. After using Storage vMotion to move everything over to the new datastores, unmount the old datastore and completely remove it from vCenter. Next, remove it from replication and delete the volume or LUN directly on the storage array. Then go back into SRM and refresh the Array Manager, validating in the Devices tab that the datastore no longer shows up. For grins, you can even check the protection group, but I’ll bet it’s gone already.
One thing to note early on: if you create separate folders for VMs in the VMs and Templates view (e.g., for apps, priorities, ACLs, etc.), make sure you have 1:1 mappings for those folders at the target site. Otherwise, you may fail over your carefully segregated VMs into a single folder, and failing back to your production data center will likewise land all of the VMs in one folder.
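To spot this before a test failover, you can check whether several source folders collapse onto one target folder. A minimal sketch with made-up folder names:

```python
from collections import defaultdict

def collapsed_folder_mappings(folder_mappings):
    """Given source-folder -> target-folder mappings, return any target
    folders that multiple source folders collapse into."""
    by_target = defaultdict(list)
    for source, target in folder_mappings.items():
        by_target[target].append(source)
    return {t: sorted(s) for t, s in by_target.items() if len(s) > 1}

# Hypothetical mappings: App and DB both map to one catch-all target
# folder instead of each having a dedicated counterpart.
mappings = {
    "Prod/App": "DR/Everything",
    "Prod/DB": "DR/Everything",
    "Prod/Web": "DR/Web",
}
print(collapsed_folder_mappings(mappings))
# {'DR/Everything': ['Prod/App', 'Prod/DB']}
```

Any entry in the result is a target folder you should split out so your folder structure survives a failover and failback.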
Migrating an environment is tricky!
SRM is designed for enterprises with two data centers (DCs)—a production and a DR site. It doesn’t include functionality for migrating between different pairings. For example, if you wanted to change where a specific application or server fails over from either your production or DR site to another DC—so actually migrate from pairing 1 (DC1-to-DC2) to pairing 2 (DC1-to-DC3)—the process would be painful because you can’t export the configuration from one vCenter and import it into a new one. You have to practically rebuild the entire thing. It would be simpler to just break the connection and reconfigure the SRM connection between DC1’s vCenter and DC3’s vCenter. At Rackspace, we have several SRM pairings around the world to go from one DC to another, and reconfiguration is something we readily handle for our customers.
It’s easier to build new SRM VMs to upgrade the guest OS version than to migrate between SRM pairings. To migrate to new SRM servers, say from Windows Server 2008 R2 to 2012 R2, you just build the new machines and point the SRM installer at the same database via the ODBC connection. Of course, you’ll need to make sure your network ACLs allow communication, possibly including access to the storage devices if you have them locked down. You’ll also have to check your Array Managers to make sure they’re connected. After the upgrade, you’ll need to edit each Array Manager to resupply the username and password used by the Storage Replication Adapter (SRA) so it can connect.
Because you are migrating to new SRM boxes, any customizations you made to the SRAs will also have to be replicated. For example, the NetApp and EMC RecoverPoint SRAs default to non-SSL on port 80 for communication, so a fresh install will require editing NetApp’s ONTAP config or running RecoverPoint’s command.pl to set it for port 443.
Keeping SRM up to date is simple
VMware is rolling out SRM updates and new versions quickly. I feel like I just deployed SRM 5.1, then upgraded to 5.5. Now 5.8 is here and 6.0 is just around the corner! Fortunately, the upgrade process is a piece of cake as long as you keep in mind the integrations you need to update along the way. That said, you can’t simply upgrade SRM by itself: you have to upgrade vCenter first, then SRM. Ideally, you’ll go through the infrastructure items first, including SSO, the Inventory Service, and the Web Client, then vCenter, and finally SRM before moving on to the ESXi hosts. When I refer to vCenter, I’m referencing all of its dependencies as well.
The upgrade process looks like this:
- Upgrade protected/source vCenter
- Upgrade protected/source SRM
- Install newer SRAs – if newer versions exist
- Upgrade protected/source vSphere Replication appliance and any VR servers
- Upgrade recovery/target vCenter
- Upgrade recovery/target SRM
- Install newer SRAs
- Upgrade recovery/target vSphere Replication appliance and any VR servers
- Re-establish the site pairings (both SRM and VR)
- Upgrade *recovery/target* ESXi hosts
- Upgrade protected/source ESXi hosts
- Upgrade virtual hardware
- Upgrade VMware Tools
I’m sure you know why it’s necessary to upgrade the recovery site ESXi hosts first, but just in case: the primary reason is that you don’t want to accidentally upgrade your VMs at the protected site to a virtual hardware version newer than what the ESXi hosts at your recovery site support. If you do, SRM may be unable to import those VMs in a disaster, and that would be a mess. Upgrading the recovery site hosts first, then the protected site, keeps both sites level, and saving the virtual hardware and VMware Tools upgrades for last is an added layer of protection.
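As a sanity check before touching virtual hardware, you can compare the hardware version you plan to move VMs to against what every recovery-site host supports. In this sketch, the ESXi-version-to-hardware-version table reflects the common pairings (hardware version 9 for 5.1, 10 for 5.5, 11 for 6.0), but verify the values against your exact builds:

```python
# Assumed ESXi version -> maximum supported virtual hardware version.
# Verify these pairings against VMware's compatibility tables for your
# exact builds before relying on them.
MAX_HW_VERSION = {"5.1": 9, "5.5": 10, "6.0": 11}

def safe_to_upgrade_vm_hardware(target_hw_version, recovery_host_versions):
    """A virtual hardware upgrade at the protected site is only safe
    once every recovery-site host can run that hardware version."""
    return all(MAX_HW_VERSION[host] >= target_hw_version
               for host in recovery_host_versions)

# One recovery host still on 5.1 blocks a move to hardware version 10.
print(safe_to_upgrade_vm_hardware(10, ["5.1", "5.5"]))  # False
print(safe_to_upgrade_vm_hardware(10, ["5.5", "5.5"]))  # True
```

If this check fails, finish upgrading the recovery-site hosts before letting anyone upgrade virtual hardware at the protected site.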
After I upgraded SRM at both locations to 5.5.1 (I may have to write a follow-up on upgrading to 5.8), I found it best to pair the connection again at both locations. It’s easy to do: go into SRM, select Cancel when prompted to log in to the remote site, and click Configure Connection. Do this at both sites to ensure DC1’s vCenter and SRM have accepted DC2’s certificate, and vice versa. When I ran the upgrade, I kept getting a random “invalid state” error despite all services running. Once I reconfigured the connection at both sites, the error message disappeared.
Sharing our DR experiences is valuable
Wow, we’ve completely covered DR from planning and architecting to deploying, maintaining and upgrading! I hope you’ve been along for the whole series.
If you have any questions or comments about what we’ve covered or about your organization’s DR experience, I encourage you to post them in our community forums or in the comments below. I’m happy to help where I can! I enjoyed writing this series and hope to keep connecting with you on your DR journey.