Building a Disaster Recovery Solution using Site Recovery Manager, Part 3: Deploying

In my last post, I outlined how to architect a solid VMware vCenter™ Site Recovery Manager™ (SRM) solution. At the beginning of this series, I reviewed how to leverage SRM, and also discussed the importance of the proper planning needed before using SRM. In this post, I’ll cover how to actually build and deploy an SRM-based solution. A video on this tutorial is available below.

Before I walk you through the build process, let’s first establish the resources that you’ll need for the SRM solution. In this scenario, you’ll size the management VMs as follows:

  • vCenter Server Appliance – 2 vCPUs, 8GB of vRAM, 25GB & 100GB disks
  • SRM & vSphere Update Manager (VUM) – Windows Server 2012 R2, 2 vCPUs, 8GB of vRAM, 2x 64GB disks
  • VMware vCenter Operations Manager (vCOps) Analytics – 2 vCPUs, 9GB of vRAM, 12GB & 200GB disks
  • VMware vCenter Operations Manager (vCOps) UI – 2 vCPUs, 7GB of vRAM, 12GB & 120GB disks
  • VMware vSphere Replication – 2 vCPUs, 4GB of vRAM, 10GB & 2GB disks
  • Database server – 4 vCPUs, 32GB of vRAM, 64GB & 256GB disks

To accommodate the VM sizes above, you’ll need 14 cores if you want a 1:1 ratio of physical to virtual cores, 68GB of RAM and 928GB of storage. In an earlier post, I mentioned having separate management clusters at both sites. These management clusters should be sized the same, so you’ll need the same amount of resources available at both the protected and recovery sites. You could comfortably use two Dell R730s for the management cluster. Why Dell, you ask? I’ve had my hands inside Dell, Cisco UCS, HP, IBM and white boxes (blades & pizza boxes), and honestly, I just prefer Dell’s hardware to the others. For this exercise, I went with Dell for two reasons: 1) RAID1 on the SD cards and 2) their PERC (RAID controller) supports a hybrid setup of RAID arrays and pass-through disks. This makes life easier if you want an internal/local datastore in a RAID configuration and raw disks passed through to ESXi for VSAN. One last suggestion – check with your hardware vendor to get the best RAM layout for your performance needs.
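To double-check the totals quoted above, here is a minimal sketch that sums the per-VM figures from the list. The dictionary keys are just labels. Note that the disks exactly as listed add up to 929GB, 1GB more than the 928GB quoted, which makes no practical difference to the sizing.

```python
# Management VM sizes copied from the list above: (vCPUs, RAM in GB, disks in GB).
MANAGEMENT_VMS = {
    "vCenter Server Appliance": (2, 8, [25, 100]),
    "SRM & VUM": (2, 8, [64, 64]),
    "vCOps Analytics": (2, 9, [12, 200]),
    "vCOps UI": (2, 7, [12, 120]),
    "vSphere Replication": (2, 4, [10, 2]),
    "Database server": (4, 32, [64, 256]),
}


def total_requirements(vms):
    """Return (vCPUs, RAM GB, disk GB) summed across all VMs."""
    vcpus = sum(spec[0] for spec in vms.values())
    ram_gb = sum(spec[1] for spec in vms.values())
    disk_gb = sum(sum(spec[2]) for spec in vms.values())
    return vcpus, ram_gb, disk_gb


vcpus, ram_gb, disk_gb = total_requirements(MANAGEMENT_VMS)
print(f"{vcpus} cores at 1:1, {ram_gb}GB RAM, {disk_gb}GB disk")
```

Keeping the sizes in one table like this makes it easy to rerun the numbers when a VM is resized or a new management VM is added.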

You might be wondering, could I use VSAN for this? Absolutely! It’s pretty well known that you need at least three nodes for VSAN, but I’d recommend going with four. This allows you to place one node into maintenance mode and still maintain required minimum policies (or lose a node and let VSAN rebuild to meet those policies). Of course, you should also know that VMware recommends 10Gb Ethernet for VSAN, with 1Gb being the minimum. With that said, I’d never recommend using 1Gb Ethernet with VSAN in any production environment.
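The three-versus-four node reasoning can be sketched in a few lines. This assumes mirrored VSAN objects, where tolerating n failures takes 2n + 1 hosts (for n = 1, two replicas plus a witness). The function names are hypothetical and only illustrate the arithmetic.

```python
def min_hosts_for_ftt(ftt):
    """Hosts needed to tolerate `ftt` failures with mirrored objects (2n + 1)."""
    return 2 * ftt + 1


def policy_still_met(total_hosts, hosts_unavailable, ftt=1):
    """True if the remaining hosts can still satisfy the FTT policy."""
    return total_hosts - hosts_unavailable >= min_hosts_for_ftt(ftt)


for nodes in (3, 4):
    ok = policy_still_met(nodes, hosts_unavailable=1)
    print(f"{nodes} nodes, one in maintenance mode: FTT=1 satisfied -> {ok}")
```

With three nodes, taking one out for maintenance drops the cluster below the minimum for FTT=1. The fourth node is what buys the headroom.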

Based on the management VMs above, I’d recommend the following hardware specs:

  • Processor – 2x Intel Xeon E5-2630 v3 eight-core 2.4GHz processors
  • Memory – 16x 16GB DDR4 2133MHz ECC Registered RAM
  • Boot – 2x 16GB SD cards
  • Storage – 2x 800GB 6Gb/s SAS SSD, 14x 600GB 15k SAS HDD
  • Networking – 2x 10GBASE-T & 2x 1000BASE-T in NDC slot plus 2x 10GBASE-T in PCIe slot

In raw storage space, this configuration gives you roughly 8TB per box, or 32TB for our VSAN datastore. Of course, the VMs will use more than that, but it’s still a good starting point that you can expand on as needed. The default Failures To Tolerate (FTT) is one, even if no policy is set. With that setting, there will be a replica of every VM disk on your VSAN. The VMs need 928GB of disk, plus 68GB for swap files that match their RAM, so you’d actually need 996GB of storage. On top of that come a few KBs for other files, logs, etc., so you should add 10 percent for overhead, bringing the total to 1096GB. Double that for the FTT=1 replicas and you’ll use a rough estimate of 2200GB of space (or 2.15TB). That leaves 30TB on the VSAN datastore, or roughly 15TB of VM disk space.
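The capacity estimate above can be reproduced with a short calculation. This is a sketch under two assumptions: only the HDDs count toward VSAN capacity (which matches the roughly 8TB per box figure), and FTT=1 means every object is stored twice. The exact results land slightly off the rounded figures in the text, for example 33,600GB raw rather than 32TB.

```python
NODES = 4
HDDS_PER_NODE = 14
HDD_GB = 600            # capacity tier; SSDs are assumed not to add capacity
VM_DISK_GB = 928        # figure quoted in the post
VM_RAM_GB = 68          # counted as storage in the post's estimate
OVERHEAD = 0.10         # 10 percent for logs and other files
FTT = 1                 # one extra replica of every object


def raw_capacity_gb(nodes=NODES):
    """Raw HDD capacity across the given number of nodes."""
    return nodes * HDDS_PER_NODE * HDD_GB


def consumed_gb():
    """Space the management VMs consume once overhead and replicas are included."""
    single_copy = (VM_DISK_GB + VM_RAM_GB) * (1 + OVERHEAD)
    return single_copy * (FTT + 1)


remaining = raw_capacity_gb() - consumed_gb()
print(f"raw per node: {raw_capacity_gb(1)}GB, cluster: {raw_capacity_gb()}GB")
print(f"consumed: {consumed_gb():.0f}GB, remaining raw: {remaining:.0f}GB")
print(f"usable VM space at FTT=1: {remaining / (FTT + 1):.0f}GB")
```

Changing `FTT` or `NODES` shows quickly how a stricter policy or an extra host shifts the usable space.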

Alternatively, you could go with NetApp® SnapMirror® or EMC® RecoverPoint to handle the VM replication. Both have pros and cons, but in this scenario, VMware vSphere Replication (VR) and VSAN are better fits for this size environment. In larger-scale environments, SnapMirror or RecoverPoint would perform better.

In this environment, we have four nodes, each requiring four 10GbE cables, at least one 1GbE port for out-of-band (OOB) management, and possibly a second 1GbE port if you want a separate ESXi management port. This build uses that optional ESXi management port. If you want to add even more HA, you could spread those four nodes across four racks/cabinets, each with dual PSUs, but that’s up to you (or your data center team). VSAN 6.0 is rack aware, so spreading the nodes out and configuring VSAN properly will get you the desired HA or fault tolerance.

So let’s begin with our build!

As I describe building out the protected site, picture the recovery site being built out in parallel, with both sites receiving the same changes unless otherwise noted.

The four nodes have arrived in your data center, and have ESXi installed. If you’re starting from scratch, you’ll need some local storage to first build the vCenter Server appliance and create your VSAN cluster. It will be easier to set up the hosts before adding the cluster or enabling VSAN because you’ll first need to set up the vSwitch, port groups and VMkernel adapters (vmks) properly.

Now import your vCenter Server appliance (VCSA). Your configuration will be (or at least should be) custom to your environment, so I’m not going to touch on that. After it’s created and online, log into the vSphere Web Client and create your cluster, then import your four new hosts. Before enabling VSAN, you’ll need to first set up some of the networking.

You’ll set up a vSphere Distributed Switch (VDS) and leave one local vSphere Standard Switch (VSS) for your networking. On the VDS, you’ll create two port groups and split the four 10GbE ports between them. Next, you’ll create a vmk on the first port group for vMotion and management traffic. The second port group will be only for VSAN with one vmk dedicated for VSAN.

To keep this scenario simple, you’ll use a single VSAN port group with both 10GbE NICs and a single vmk. Multiple vmks on the same VLAN/subnet are not supported and don’t provide a performance increase. Make sure you enable Jumbo Frames! If you’re spanning switches, you’ll need to make sure multicast (IGMP snooping) is enabled and properly configured.

The local VSS will have one port group backed by only one physical NIC. There, you’ll create an additional vmk for management traffic only. This way you should have multiple paths for management, preferring to use the 1Gb vmk with the 10Gb vmk as a backup.
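The networking layout described above can be written down as data and checked before you touch a host. This is only a sketch: the port group, uplink, and vmk names are hypothetical, an MTU of 9000 is assumed for jumbo frames, and nothing here talks to vCenter.

```python
from dataclasses import dataclass, field


@dataclass
class Vmk:
    name: str
    services: set
    mtu: int = 1500


@dataclass
class PortGroup:
    name: str
    switch: str                 # "VDS" or "VSS"
    uplinks: list
    vmks: list = field(default_factory=list)


# Hypothetical names; the structure follows the layout described above.
LAYOUT = [
    PortGroup("pg-mgmt-vmotion", "VDS", ["10g-a", "10g-b"],
              [Vmk("vmk0", {"management", "vmotion"})]),
    PortGroup("pg-vsan", "VDS", ["10g-c", "10g-d"],
              [Vmk("vmk1", {"vsan"}, mtu=9000)]),
    PortGroup("pg-mgmt-1g", "VSS", ["1g-a"],
              [Vmk("vmk2", {"management"})]),
]


def check_layout(port_groups):
    """Return a list of problems; an empty list means the layout passes."""
    problems = []
    vmks = [(pg, vmk) for pg in port_groups for vmk in pg.vmks]
    vsan = [(pg, vmk) for pg, vmk in vmks if "vsan" in vmk.services]
    mgmt = [vmk for _, vmk in vmks if "management" in vmk.services]
    if len(vsan) != 1:
        problems.append("expected exactly one VSAN vmk")
    for pg, vmk in vsan:
        if vmk.services != {"vsan"}:
            problems.append(f"{vmk.name} should carry VSAN traffic only")
        if vmk.mtu < 9000:
            problems.append(f"{vmk.name} is not using jumbo frames")
        if len(pg.uplinks) != 2:
            problems.append(f"{pg.name} should have both 10GbE uplinks")
    if len(mgmt) < 2:
        problems.append("expected two management paths (1Gb and 10Gb)")
    return problems


print(check_layout(LAYOUT) or "layout OK")
```

Because both sites are built in parallel, the same table can describe the protected and recovery sites, which helps keep them from drifting apart.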

Any VM port groups that are needed can be created and backed by the non-VSAN pNICs. I know best practice is to separate vMotion as well, but we don’t have the ports available for that. You could technically cable up 2x 1Gb ports instead of just one – Dell’s mezzanine card has 2x 10Gb and 2x 1Gb – and use them for management and vMotion, but why not use the 10Gb ports if they’re available?

Now edit the cluster settings and enable VSAN. Which setting is best, automatic or manual? That’s up to you. Personally, I would choose manual because I’m a control freak and want to manage how everything’s set up. Once the VSAN cluster is enabled and online, you should see the datastore shared across the cluster and everything should be working. As far as the policies go, I’d leave FTT unset (remember, the default is already 1). You could create a striping policy for your database server or other high-IO VMs. VSAN storage policies are applied to the VM, not the overall datastore, so you match policies to specific VMs as appropriate.
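To illustrate how per-VM policies layer on top of the defaults, here is a small model. The policy name, VM names, and the stripe width of 2 are hypothetical examples. The defaults (FTT of 1, one stripe per object) reflect VSAN's out-of-the-box behavior.

```python
# VSAN defaults apply to any VM without an explicit policy.
DEFAULT_POLICY = {"failures_to_tolerate": 1, "stripe_width": 1}

# Hypothetical policy and VM names; the stripe width of 2 is only an example.
POLICIES = {"high-io": {"stripe_width": 2}}
VM_POLICY = {"db01": "high-io"}


def effective_policy(vm_name):
    """Merge a VM's assigned policy (if any) over the defaults."""
    overrides = POLICIES.get(VM_POLICY.get(vm_name), {})
    return {**DEFAULT_POLICY, **overrides}


for vm in ("db01", "srm01"):
    print(vm, effective_policy(vm))
```

The database server picks up the striping override while keeping the default FTT, and every other VM falls back to the defaults untouched.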

You will need to provision two Windows OS VMs – one for SRM and VUM, and one for the database server. Hopefully you have a Windows template you can deploy from or some other automated provisioning process. You’ll need to deploy the database server first for the VUM & SRM databases. Next you’ll deploy VUM, which still requires a 32-bit data source name (DSN), followed by SRM. When installing SRM, make sure the database password doesn’t include a double quote (") and doesn’t end in an exclamation point (!), as those will break the installer. Also, if you’re choosing to relocate the installation directory, make sure there are no spaces in the directory path, as that breaks the vSphere Replication installation. This is true for versions 5.5 and 5.8, but it’s untested in 6.0.
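The installer pitfalls above are easy to check up front. The following is a minimal sketch of such a pre-flight check. The function name and the sample values are hypothetical, and it covers only the three rules mentioned in this post.

```python
def srm_install_problems(db_password, install_dir):
    """Return the installer pitfalls from the post that these inputs would hit."""
    problems = []
    if '"' in db_password:
        problems.append("database password contains a double quote")
    if db_password.endswith("!"):
        problems.append("database password ends with an exclamation point")
    if " " in install_dir:
        problems.append("installation path contains a space")
    return problems


# Hypothetical values for illustration only.
print(srm_install_problems("S3cure!", r"C:\Program Files\SRM"))
print(srm_install_problems("S3cure!x", r"C:\SRM"))
```

An exclamation point in the middle of the password passes. Only a trailing one is flagged, in line with the rule as stated.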

Now that you’ve deployed vCenter Server, SRM, and VUM, you can deploy vSphere Replication (VR) and vCOps.

The foundation has been laid and you can add additional VMs like Active Directory (AD) servers, web and app servers, databases, and others. The next decision you’ll need to make is which VMs to protect with SRM. Take a look back at my first post in this series for more information on how to determine which VMs to protect with SRM, and which VMs you’ll want to skip (e.g. databases and Active Directory).

Assuming you know which VMs you’ll protect with SRM, you first have to configure replication for each one. Create three protection groups for apps categorized as either tier 1, 2 or 3. Although AD and databases are tier 0, they will not be included in the SRM protection groups.

Next, you should create four recovery plans. Why four, you ask? There is one for each tier, since a VM can only be a member of one protection group. There’s also a fourth plan for a full failover containing all three protection groups, which works because a protection group can be a member of multiple recovery plans. These plans should be set to fail over into our production networks at the target site, and the test networks should also be set to production networks. Why? Because testing SRM should involve testing your entire DR run book, not just SRM functionality.
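The relationship between VMs, protection groups, and recovery plans can be modeled as plain data. This is a sketch with hypothetical VM and plan names, not SRM's own object model. It checks that no VM sits in two protection groups and expands a plan into the VMs it covers.

```python
# Hypothetical VM names; each VM belongs to exactly one protection group.
PROTECTION_GROUPS = {
    "tier1": ["web01", "web02"],
    "tier2": ["app01"],
    "tier3": ["report01"],
}

# One plan per tier plus a full-failover plan that reuses all three groups.
RECOVERY_PLANS = {
    "tier1-only": ["tier1"],
    "tier2-only": ["tier2"],
    "tier3-only": ["tier3"],
    "full-failover": ["tier1", "tier2", "tier3"],
}


def vms_in_multiple_groups(groups):
    """Return VMs that appear in more than one protection group."""
    seen, duplicates = set(), set()
    for members in groups.values():
        for vm in members:
            (duplicates if vm in seen else seen).add(vm)
    return duplicates


def vms_in_plan(plan_name):
    """Expand a recovery plan into the VMs it would fail over."""
    return [vm for group in RECOVERY_PLANS[plan_name]
            for vm in PROTECTION_GROUPS[group]]


print(vms_in_multiple_groups(PROTECTION_GROUPS))
print(vms_in_plan("full-failover"))
```

Each protection group shows up in two plans, its own tier plan and the full failover, while each VM still has exactly one home.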

In the tier 1 protection group, you need to set the boot priority to 1. Similarly, you need to apply boot priorities 2 and 3 to tiers 2 and 3 respectively. You might wonder why all tier 1 VMs need to be in the same priority group. It’s because the boot priority is an attribute of the VM, and follows the VM across all recovery plans. By setting it up this way, when you run the full recovery, your tier 1 VMs will boot first, then tier 2, and lastly tier 3.
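Because the boot priority travels with the VM, the start order of any plan falls out of a simple sort. The sketch below uses hypothetical VM names and assumes the priority number simply mirrors the tier, as described above.

```python
# Hypothetical VMs: (name, tier). Boot priority mirrors the tier number.
VMS = [("report01", 3), ("web01", 1), ("app01", 2), ("web02", 1)]

# The priority is stored per VM, so every plan containing the VM sees it.
BOOT_PRIORITY = {name: tier for name, tier in VMS}


def boot_order(vm_names):
    """Group VMs by boot priority, lowest number first."""
    order = {}
    for name in sorted(vm_names, key=lambda n: (BOOT_PRIORITY[n], n)):
        order.setdefault(BOOT_PRIORITY[name], []).append(name)
    return order


print(boot_order(BOOT_PRIORITY))        # full failover: all tiers
print(boot_order(["app01"]))            # tier 2 plan on its own
```

The same per-VM priorities drive both the single-tier plans and the full failover, so there is nothing extra to maintain per plan.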

That was fun! Long, but fun! Of course, there’s nothing quick about architecting a DR solution, but in the end, it’s quite rewarding.

Keep an eye out for my next post on maintaining and upgrading!

Luke Huckaba is a 2015 vExpert and Virtualization Architect who specializes in VMware products and works heavily with Site Recovery Manager (SRM). He is a VMware Certified Professional (VCP), writes custom PowerShell/PowerCLI automation scripts, was the first user presenter at the San Antonio VMware User Group (VMUG) and is now the SATXVMUG lead. Before finding his way home to Rackspace, Luke was an Infrastructure Architecture Engineer focusing on robust disaster recovery solutions and resilient VMware infrastructures. Luke has also collaborated with other VMware users around the globe to help build solutions in others’ environments.

