Building A Disaster Recovery Solution Using Site Recovery Manager Part 2: Architecting

In my last post, I described how to use VMware® vCenter Site Recovery Manager to orchestrate the failover of a group of virtual machines (VMs) from one location to another. I also discussed the importance of planning when it comes to using Site Recovery Manager.

This post outlines how to properly architect a solid Site Recovery Manager solution.

Getting Started

If you followed the planning post properly, several of your ducks should already be in a row. You’ve identified which apps should be protected and established some form of ranking system (I’ll call it tiering). You know every integration those apps depend on, and each integration is fully documented along with its design.


Here is a list of additional things that you need to know before getting started:

  • Total protected VMs
  • Total size of replicated data
  • Number of networks needed
  • How many different tiers and apps you have
  • A good idea of the total resources you’ll need at your recovery site
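If you keep your planning inventory in a machine-readable form, the totals above fall out of a few lines of code. The sketch below assumes a hypothetical `ProtectedVM` record exported from your planning spreadsheet; the field names and sample VMs are illustrative only.

```python
from dataclasses import dataclass

# Hypothetical inventory record; in practice these fields would come from
# your planning spreadsheet or an export from vCenter.
@dataclass
class ProtectedVM:
    name: str
    app: str
    tier: int
    replicated_gb: float  # size of the VM's replicated disks
    network: str          # port group the VM is attached to

def summarize(vms):
    """Return the planning totals for a list of protected VMs."""
    return {
        "total_vms": len(vms),
        "replicated_gb": sum(vm.replicated_gb for vm in vms),
        "networks": len({vm.network for vm in vms}),
        "tiers": len({vm.tier for vm in vms}),
        "apps": len({vm.app for vm in vms}),
    }

inventory = [
    ProtectedVM("sql01", "erp", 1, 500.0, "VLAN100-DB"),
    ProtectedVM("app01", "erp", 1, 80.0, "VLAN200-App"),
    ProtectedVM("web01", "intranet", 2, 40.0, "VLAN300-Web"),
]
print(summarize(inventory))
```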

If you remember from the planning post, you should expect to have the same amount of storage at both sites. To reduce hardware costs, you may use fewer physical compute resources (hosts, CPU and RAM) at the recovery site. However, running your protected workloads at a higher consolidation ratio than at your protected site may degrade performance after a failover. How far you over-allocate is up to you. There is no specific sweet spot; base the decision on the workloads you have running in your environment.
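One common way to express a consolidation ratio is vCPUs per physical core. The sketch below compares the two sites for the same protected workload; the host and core counts are hypothetical, and you could run the same comparison for RAM.

```python
def consolidation_ratio(total_vcpus: int, hosts: int, cores_per_host: int) -> float:
    """vCPU-to-physical-core ratio for a cluster."""
    return total_vcpus / (hosts * cores_per_host)

# Hypothetical example: the same 400 protected vCPUs run on 8 hosts at the
# protected site but only 5 hosts at the recovery site.
VCPUS = 400
protected = consolidation_ratio(VCPUS, hosts=8, cores_per_host=20)
recovery = consolidation_ratio(VCPUS, hosts=5, cores_per_host=20)
print(f"protected {protected:.1f}:1, recovery {recovery:.1f}:1")
# protected 2.5:1, recovery 4.0:1
```

A jump from 2.5:1 to 4.0:1 is the kind of change worth testing against your busiest workloads before you commit to fewer recovery hosts.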

vCenter Installations and Requirements

In a perfect world, both sites have a management cluster where all VMware vSphere VMs and integrations are running.

Simple Install

During the “Simple Install,” all services are installed on a single VM: vCenter Server, vCenter Single Sign-On (SSO), vCenter Inventory Service and the vSphere Web Client. The minimum requirements are 2 vCPUs, 12GB of RAM and 60GB of disk space (100GB recommended). The disk space requirement doesn’t include the database, so if SQL Server will also be installed on the same VM, you’ll need to add its resource requirements.

If you are considering the vCenter Appliance (a nice all-in-one option that supports up to 100 hosts and 3,000 VMs using the internal database), be aware that it’s a Linux appliance. That means you’ll still need a Windows VM to install Site Recovery Manager and vSphere Update Manager (VUM).

Custom Install

If you plan to perform a custom install and put vCenter on its own VM, a basic install of vCenter by itself requires 2 vCPUs, 4GB of RAM and a mere 4GB of disk space. That minimum is for up to 50 hosts and 500 powered-on VMs, so whether it’s enough depends on the size of your environment. For what it calls a large deployment of 300 hosts and 3,000 powered-on VMs, VMware recommends 4 vCPUs, 8GB of RAM and 10GB of disk space. For an extra-large environment of 1,000 hosts and 10,000 powered-on VMs, VMware recommends doubling those CPU and RAM figures. Remember, this is for installing vCenter alone. It doesn’t include vCenter SSO, vCenter Inventory Service or the vSphere Web Client.
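The vCenter figures above can be captured as a small lookup. This sketch assumes each profile’s host and VM counts are upper limits and, because no profile between 50 and 300 hosts is listed here, it simply moves up to the next one. The extra-large disk figure isn’t stated, so it is left as `None`.

```python
from typing import NamedTuple, Optional

class Profile(NamedTuple):
    name: str
    max_hosts: int
    max_powered_on_vms: int
    vcpus: int
    ram_gb: int
    disk_gb: Optional[int]  # None where no figure is given

# vCenter Server alone (no SSO, Inventory Service or Web Client).
# Extra-large doubles the large CPU and RAM figures.
PROFILES = [
    Profile("minimum", 50, 500, 2, 4, 4),
    Profile("large", 300, 3_000, 4, 8, 10),
    Profile("extra-large", 1_000, 10_000, 8, 16, None),
]

def pick_profile(hosts: int, powered_on_vms: int) -> Profile:
    """Return the smallest listed profile that covers both counts."""
    for p in PROFILES:
        if hosts <= p.max_hosts and powered_on_vms <= p.max_powered_on_vms:
            return p
    raise ValueError("environment exceeds the extra-large profile")

print(pick_profile(40, 450).name)     # minimum
print(pick_profile(120, 1_500).name)  # large (no medium profile listed)
```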

vCenter SSO requires 2 vCPUs, 3GB of RAM and 2GB of disk space if you plan on installing it on its own machine or building a High Availability (HA) setup, which I do recommend! vCenter Inventory Service requires 2 vCPUs and 3GB of RAM, but its disk space varies from 5GB to 60GB. A typical install requires 15GB of disk space; for highly active environments, such as VDI with lots of VMs being spun up or deleted, you will want to invest in more disk space. If you have more than 400 hosts or 4,000 VMs and lots of VM activity (vMotions, power-ons, power-offs, etc.), I strongly suggest you opt for 60GB.
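The Inventory Service rule of thumb above can be written as a small function. This is a sketch of the author’s guideline, not a VMware formula; whether you have “lots of VM activity” is a judgment call passed in as a flag.

```python
def inventory_service_disk_gb(hosts: int, vms: int, high_activity: bool) -> int:
    """Pick Inventory Service disk space using the rule of thumb above.

    The range is 5GB to 60GB; 15GB is the typical install.
    """
    if high_activity and (hosts > 400 or vms > 4_000):
        return 60  # large and busy (vMotions, power operations, VDI churn)
    return 15      # typical; consider more for busy smaller environments

print(inventory_service_disk_gb(120, 1_500, high_activity=True))   # 15
print(inventory_service_disk_gb(450, 5_000, high_activity=True))   # 60
print(inventory_service_disk_gb(450, 5_000, high_activity=False))  # 15
```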

vSphere Web Client requirements are very similar: 2 vCPUs, 2GB of RAM and 2GB of disk space. The difference is that you can tune the Web Client’s performance through its Java heap size. VMware recommends a 1GB JVM heap, but I’ve set it higher for better performance. The right value really depends on your environment’s size.

And don’t forget that these requirements are in addition to your OS requirements.
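To see what a custom install adds up to, the sketch below sums the per-component minimums quoted above (using the typical 15GB for Inventory Service) and adds a per-VM OS overhead. The OS figures are hypothetical parameters; substitute your own OS requirements.

```python
# Minimum figures quoted above for each component of a custom install:
# (vCPUs, RAM in GB, disk in GB). Inventory Service uses the typical 15GB.
COMPONENTS = {
    "vCenter Server": (2, 4, 4),
    "vCenter SSO": (2, 3, 2),
    "vCenter Inventory Service": (2, 3, 15),
    "vSphere Web Client": (2, 2, 2),
}

def custom_install_totals(os_ram_gb: int = 0, os_disk_gb: int = 0):
    """Sum the component minimums, adding OS overhead once per VM.

    Assumes each component runs on its own VM, so the (hypothetical)
    OS RAM and disk overhead is added once per component.
    """
    vcpus = sum(c[0] for c in COMPONENTS.values())
    ram_gb = sum(c[1] + os_ram_gb for c in COMPONENTS.values())
    disk_gb = sum(c[2] + os_disk_gb for c in COMPONENTS.values())
    return vcpus, ram_gb, disk_gb

print(custom_install_totals())                            # (8, 12, 23)
print(custom_install_totals(os_ram_gb=2, os_disk_gb=40))  # (8, 20, 183)
```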

So to recap…

  • For a Custom Install – You need 7 VMs for vSphere capabilities (vCenter, SSO, Inventory, etc.), each with a minimum of 2 vCPUs and 4GB of RAM, plus disk space. You still need a database VM, bringing your total to 8 VMs. That puts you at a minimum of 16 vCPUs and 32GB of RAM.
  • For a Simple Install – You need one VM for vSphere (assuming you install Site Recovery Manager on it, as well). You still need a database server, but the Simple Install reduces management overhead. Your single VM may be pretty large, but still manageable.

So should you do Simple Install instead of Custom Install? Well, I leave that up to you to architect. If you have no requirement to separate roles, then don’t over-complicate things; it’s plenty complicated on its own!

Site Recovery Manager Installation and Location

Now that you have a clear picture of your vSphere requirements, you need to size your physical hosts accordingly. I recommend that you build a separate management cluster for all things vSphere and its associated dependencies. At a minimum, you’ll want two hosts. Why? For HA, of course! You can’t have HA without two hosts.
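A quick way to sanity-check a management cluster is to make sure the management VMs still fit after losing a host. This sketch looks only at RAM and assumes one spare host (N+1); the demand and host sizes are hypothetical, and you would repeat the check for CPU.

```python
import math

def hosts_for_ha(mgmt_ram_gb: float, host_ram_gb: float, spare_hosts: int = 1) -> int:
    """Hosts needed so management VMs still fit after `spare_hosts` failures.

    RAM only; never returns fewer than two hosts, since HA needs two.
    """
    needed = math.ceil(mgmt_ram_gb / host_ram_gb)
    return max(2, needed + spare_hosts)

print(hosts_for_ha(80, 128))   # 2
print(hosts_for_ha(300, 128))  # 4
```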

In this cluster, you should have all of your VMware components, as well as databases and any other management VMs that you want to keep separate from your standard workload or compute cluster. If you don’t want to have a management cluster, you can co-mingle everything. It’s just cleaner and easier to manage when your key infrastructure VMs are separate.

For example, imagine that your single cluster is 16 nodes and vCenter dies. There are 16 possible hosts where that VM could be running. With PowerCLI, you can connect to all 16 hosts at once to find the VM and issue a reboot (been there, done that). But if your management cluster has just 2 nodes, it’s much easier to locate that vCenter VM without vCenter.
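In practice you would do this with PowerCLI, connecting directly to each ESXi host with `Connect-VIServer` and looking for the VM with `Get-VM`. The Python below only models the search: `list_vms` is a hypothetical callback standing in for those direct-to-host connections, and the host names are made up. It counts how many hosts you query before finding the VM.

```python
from typing import Callable, Iterable, Optional, Tuple

def find_vm(hosts: Iterable[str], vm_name: str,
            list_vms: Callable[[str], Iterable[str]]) -> Tuple[Optional[str], int]:
    """Query hosts one by one; return (host running the VM, hosts queried)."""
    queried = 0
    for host in hosts:
        queried += 1
        if vm_name in list_vms(host):
            return host, queried
    return None, queried

# Simulated per-host inventories standing in for direct-to-host connections.
compute = {f"esx{i:02d}": [] for i in range(1, 17)}
compute["esx14"] = ["vcenter01"]
mgmt = {"mgmt01": [], "mgmt02": ["vcenter01"]}

print(find_vm(compute, "vcenter01", compute.__getitem__))  # ('esx14', 14)
print(find_vm(mgmt, "vcenter01", mgmt.__getitem__))        # ('mgmt02', 2)
```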

Storage Implications

There’s one more thing for you to do at this point (if you haven’t done so already): contact your storage vendor and ask for a deep dive into how its replication works, not just during normal operation but also during a Site Recovery Manager test. I mentioned this in my planning post, but it’s worth repeating. During a Site Recovery Manager test, EMC splits the journal into two parts: it grants access to a snapshot of the replicated data and keeps the test’s changes (deltas) in the journal. Meanwhile, replicated changes keep arriving and are stored in that same journal, and there is a hard split between the two portions. In contrast, NetApp can share space on the filer, so there’s no hard-stop limit (unless, of course, you completely fill it up!).
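To see why a hard split matters, here is a simplified model comparing a journal with a fixed split against a shared pool. The split percentage and sizes are hypothetical, and this is not how either vendor actually accounts for space; it only illustrates the failure mode, so ask your vendor for the real numbers.

```python
def fits_hard_split(journal_gb: float, test_percent: int,
                    incoming_gb: float, test_writes_gb: float) -> bool:
    """Journal divided into fixed portions for incoming changes and test writes."""
    test_cap = journal_gb * test_percent / 100
    incoming_cap = journal_gb - test_cap
    return incoming_gb <= incoming_cap and test_writes_gb <= test_cap

def fits_shared(pool_gb: float, incoming_gb: float, test_writes_gb: float) -> bool:
    """Both kinds of change draw from one pool until it is full."""
    return incoming_gb + test_writes_gb <= pool_gb

# Hypothetical: 100GB of space, 40GB of incoming changes, 30GB of test writes.
print(fits_hard_split(100, 20, 40, 30))  # False: the test portion is only 20GB
print(fits_shared(100, 40, 30))          # True: 70GB of 100GB used
```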

In your design documentation, specify all replication traffic and where it will run; this will help when you deploy the actual solution. Don’t forget that you’ll also need a placeholder datastore. It can be small (1.3GB), because placeholder VMs are only a few KB each, so even 1,000 of them take up only a few MB.
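The placeholder arithmetic is easy to check. The 4KB per placeholder VM below is an assumed figure standing in for “a few KB.”

```python
def placeholder_space_mb(vm_count: int, kb_per_placeholder: float) -> float:
    """Approximate space used by placeholder VMs, in MB."""
    return vm_count * kb_per_placeholder / 1024

# Assumed 4KB per placeholder VM.
print(placeholder_space_mb(1_000, 4))  # 3.90625
```

Roughly 4MB out of a 1.3GB datastore leaves plenty of headroom.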

In the planning post, I also should have recommended that you separate your VM swap files from regular VM files. When you power on a VM, vSphere creates a .vswp file equal in size to the VM’s configured RAM (minus any memory reservation). Chances are, you’re not swapping at the host level. If you are, stop reading, go add more RAM to your cluster, then come back. Why does this matter? If you’ve allocated 128GB of virtual RAM to powered-on VMs, that’s up to 128GB of swap files on your replicated datastore that have to be replicated as well. Set your cluster’s swap file location to the datastore specified by the host, then go to each host and configure a vswap datastore.
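vSphere sizes each .vswp file at the VM’s configured memory minus its memory reservation, so you can estimate how much swap data would otherwise land on a replicated datastore. The VM list below is hypothetical.

```python
def vswp_gb(configured_ram_gb: float, reservation_gb: float = 0) -> float:
    """Swap file created at power-on: configured memory minus reservation."""
    return max(configured_ram_gb - reservation_gb, 0)

# Hypothetical powered-on VMs: (configured RAM GB, memory reservation GB)
vms = [(16, 0), (32, 8), (64, 0), (16, 0)]
swap_total = sum(vswp_gb(ram, res) for ram, res in vms)
print(swap_total)  # 120
```

Without a separate vswap datastore, those 120GB would be replicated along with the VMs themselves.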

How much should you allocate for vswap? Well, again, that’s up to you. The rule of thumb I use is double the total physical RAM available in a cluster. If you have four (4) nodes, each with 128GB of RAM, that’s 512GB of physical RAM, so a vswap datastore of 1TB will likely suffice. In order to fill it up, you’ll have to allocate 1TB of RAM to powered-on VMs. That’s basically 64 VMs with 16GB of RAM each, so filling it up can be easier than you think. In the virtual world, we overprovision all of the time, so keep that in mind, as YMMV (your mileage may vary).
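The rule of thumb and the fill-up example above work out as follows. The multiplier of 2 is the author’s rule of thumb, not a VMware requirement.

```python
def vswap_datastore_gb(hosts: int, ram_per_host_gb: int, multiplier: int = 2) -> int:
    """Rule of thumb: size the vswap datastore at double the cluster's physical RAM."""
    return hosts * ram_per_host_gb * multiplier

size = vswap_datastore_gb(4, 128)
print(size)        # 1024 (1TB)
print(size // 16)  # 64 VMs with 16GB of RAM each would fill it
```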

If your storage team colleagues haven’t already, they need to create the following LUNs or Volumes now:

  • Datastore(s) for non-Site Recovery Manager VMs
  • Datastore(s) for Site Recovery Manager VMs
  • Placeholder Datastore
  • vswap Datastore

They also need to either create a journal LUN with adequate space for testing or create a volume with enough snap reserve.

In your architectural documentation, be sure you have identified how the VMs will be broken down. If you’ve separated them into tiers and plan to fail them over independently, you’ll need each tier on its own LUN or volume. Make sure this is in place now, before proceeding. If you’re using NetApp, include a keyword such as “replicated,” “snapmirror” or “SRM” in the names of replicated volumes, but do not include any form of those keywords in the names of non-replicated volumes (e.g., avoid “non-SRM,” which still contains “SRM”). When you actually install everything, you’ll use this to your advantage.
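If you record the planned volumes in your documentation, you can check both rules mechanically: one tier per replicated volume, and names whose keywords match their replication status. This sketch uses a hypothetical plan format and a simple case-insensitive substring match on “replicated,” “snapmirror” and “srm”; it shows why a name like `vol_non_srm` causes trouble.

```python
REPLICATED_KEYWORDS = ("replicated", "snapmirror", "srm")

def looks_replicated(volume_name: str) -> bool:
    """Case-insensitive substring match against the replication keywords."""
    name = volume_name.lower()
    return any(k in name for k in REPLICATED_KEYWORDS)

def check_plan(volumes):
    """volumes maps name -> (is_replicated, set of tiers); returns problems."""
    problems = []
    for name, (replicated, tiers) in volumes.items():
        if looks_replicated(name) != replicated:
            problems.append(f"{name}: name does not match replication status")
        if replicated and len(tiers) > 1:
            problems.append(f"{name}: holds tiers {sorted(tiers)}; use one tier per volume")
    return problems

# Hypothetical plan.
plan = {
    "vol_srm_tier1": (True, {1}),
    "vol_srm_mixed": (True, {2, 3}),
    "vol_non_srm": (False, set()),  # contains "srm", so it is misclassified
    "vol_general": (False, set()),
}
for problem in check_plan(plan):
    print(problem)
```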

Network Implications

Be sure you also have a list of the port groups at the protected site and how they correspond to the port groups at the recovery site. You will need to keep this mapping up to date as you add networks to your environment.
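Keeping that mapping as data makes gaps easy to spot when networks are added. The sketch below uses hypothetical port group names and reports protected port groups with no mapping, plus mappings that point at a recovery port group that doesn’t exist.

```python
def mapping_gaps(protected: set, mapping: dict, recovery: set):
    """Return (unmapped protected port groups, mappings to missing targets)."""
    missing = sorted(pg for pg in protected if pg not in mapping)
    dangling = sorted(pg for pg, target in mapping.items() if target not in recovery)
    return missing, dangling

# Hypothetical port groups at each site.
protected_pgs = {"VLAN100-DB", "VLAN200-App", "VLAN300-Web"}
recovery_pgs = {"DR-VLAN100-DB", "DR-VLAN200-App"}
mapping = {"VLAN100-DB": "DR-VLAN100-DB", "VLAN200-App": "DR-VLAN200-App"}

print(mapping_gaps(protected_pgs, mapping, recovery_pgs))
# (['VLAN300-Web'], [])
```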

A Complete Architecture

You know how many VMs you’ll need for your vSphere solution and the total resource requirements. You know how many physical hosts you’ll need and how they’re laid out. You’ve documented how your storage will be carved up, and lastly, you have a clear picture of the network at both locations and how they correspond to one another. Your architecture is in place.

Next comes my favorite part—actually building this out!

Stay tuned for Part 3 of “Building a DR Solution with Site Recovery Manager.”

Luke Huckaba is a 2015 vExpert and Virtualization Architect who specializes in VMware products and works heavily with the Site Recovery Manager (SRM) product. He is a VMware Certified Professional (VCP), writes custom automation scripts in PowerShell/PowerCLI, was the first user presenter at the San Antonio VMware User Group (VMUG) and is now the SATXVMUG lead. Before finding his way home to Rackspace, Luke was an Infrastructure Architecture Engineer focusing on a robust disaster recovery solution and resilient VMware infrastructures. Luke has also collaborated with other VMware users around the globe to help build solutions in others’ environments.

