Disaster Recovery in the Cloud

Whether you use the Cloud or dedicated servers, you should always make sure you have a plan for your configuration in the event that something goes wrong. This is a series of posts based on a discussion I had with Aaron Scheel, a solutions engineer here at Rackspace. 

Anyone who has used a personal computer has encountered a blue screen or the beachball of death. Put simply, technology sometimes breaks. It is therefore important to understand how your business would respond when an undesired event occurs, and to have a Disaster Recovery (DR) plan in place.

While the term gets tossed around in our industry and can mean different things, the distinguishing feature of a DR plan is that a disaster has actually occurred in your environment. In other words, your environment is down or experiencing some level of service issues.

There are three different ways that you can incorporate disaster recovery into your cloud configuration: image snapshots, file system/database backups and replication with manual failover.

Image Snapshots

The cloud offers a unique option to recover by recreating the server from what is called an image snapshot. The snapshot not only has your files and data that are located on the server, but also contains the configured OS and services that you have loaded.

With a click of a button, your server can be rebuilt with the particular OS and stack that you had installed along with all the data from the previous snapshot. This means that you do not have to go through all the configuration steps of the server to get it ready for your application and data.

There is one caveat to note: you can only restore from an image snapshot that is less than 160GB. On Windows, this 160GB refers to the total drive size, whereas on Linux it refers to the consumed data space on the server. The best practice, however, is to keep the total drive size at 160GB or less on any server you wish to restore from a snapshot, so there is no way of going over; this corresponds to the drive size associated with 4GB RAM server instances.
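The Windows versus Linux rule above can be written down as a small check. This is only a sketch: the function name and parameters are made up for illustration, and it assumes a server at exactly 160GB still qualifies (the 4GB RAM instance size suggests it does, but treat that as an assumption).

```python
SNAPSHOT_LIMIT_GB = 160  # limit quoted in this post


def can_restore_from_snapshot(os_family, total_drive_gb, used_gb):
    """Return True if the server fits within the image snapshot restore limit.

    Windows is judged on total drive size, Linux on consumed data space.
    Assumption: exactly 160GB is treated as allowed.
    """
    family = os_family.lower()
    if family == "windows":
        measured_gb = total_drive_gb
    elif family == "linux":
        measured_gb = used_gb
    else:
        raise ValueError(f"unknown OS family: {os_family!r}")
    return measured_gb <= SNAPSHOT_LIMIT_GB


# A 320GB drive holding 100GB of data passes on Linux but not on Windows.
linux_ok = can_restore_from_snapshot("linux", total_drive_gb=320, used_gb=100)
windows_ok = can_restore_from_snapshot("windows", total_drive_gb=320, used_gb=100)
```

Keeping the total drive size at 160GB or less, as recommended above, makes the check pass on either OS.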

“When you are dealing with data larger than 160GB, you are typically dealing with databases or large image file repositories. When you are dealing with databases, you have to start thinking about recovery time, and how long it takes to recover from the disaster. If we were to go beyond 160GB on an image restore, that’s going to be a very slow restore and will be on the magnitude of an unrealistic timeline for the restore, considering that a simple DB or file restore quite often is all that’s needed,” Scheel explains.

While restoring from an image snapshot can be a good recovery method for companies whose servers have less than 160GB of storage, companies that are near or over that amount, or that prefer a different recovery method even below that limit, should employ another DR plan.

File System/Database Backups

Similar to traditional dedicated hosting, customers can run file system and database backups of their configurations. These backups can be scheduled anywhere from once a week to once a day, but you should strategically consider what to back up.

“You want to stop backing up the OS and unnecessary files and you start backing up the necessary flat files that are on your system. This reduces your backup time because you are not backing up the OS, and also helps to reduce restore times as they’re only restoring exactly what’s needed,” Scheel explains.
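As a rough illustration of backing up only what is needed, here is a minimal sketch using Python's standard tarfile module. The function name, the paths you pass in and the excluded suffixes are all hypothetical; this is not the tooling Rackspace uses, just one way to archive selected application directories while leaving out the OS and throwaway files.

```python
import tarfile
from pathlib import Path


def back_up_paths(paths, archive_path, exclude_suffixes=(".log", ".tmp")):
    """Write only the listed application paths to a gzip-compressed tar archive."""

    def skip_unneeded(tarinfo):
        # Returning None tells tarfile to leave this member out of the archive.
        if tarinfo.isfile() and tarinfo.name.endswith(tuple(exclude_suffixes)):
            return None
        return tarinfo

    with tarfile.open(archive_path, "w:gz") as archive:
        for path in paths:
            path = Path(path)
            # arcname stores the directory under its own name, not the full path.
            archive.add(str(path), arcname=path.name, filter=skip_unneeded)
    return archive_path
```

You would call it with the handful of directories your application actually needs, for example a web root and an uploads folder, rather than the whole drive.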

A key feature of file system and database backups is that they are not done in real time. Because of this, they are a strong option for customers whose data does not change frequently, or who can tolerate losing up to a day’s worth of data.

If there is a catastrophic event that you must recover from, you can restore your site from the file system and database backup. However, depending on the amount of data that you have to restore, this option could take a while (another reason to be strategic about what you back up).
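To get a feel for how restore time grows with data size, a back-of-the-envelope estimate helps. The sketch below is simple arithmetic, and the throughput figure is a hypothetical placeholder, not a measured or published number; plug in whatever rate you observe in your own restore tests.

```python
def estimate_restore_hours(data_gb, throughput_mb_per_s):
    """Estimate hours needed to restore data_gb at a sustained transfer rate."""
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    seconds = data_gb * 1024 / throughput_mb_per_s  # 1 GB = 1024 MB
    return seconds / 3600


# Hypothetical rate of 50 MB/s: 160GB takes roughly 55 minutes of pure transfer.
full_drive_hours = estimate_restore_hours(160, 50)
# Restoring only 20GB of application data at the same rate is eight times faster.
data_only_hours = estimate_restore_hours(20, 50)
```

The estimate covers data transfer only; it does not include the time to rebuild and reconfigure the server.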

Additionally, you must reconfigure your servers to support your application, up to and including re-installing the application if a full server rebuild is required. A DR plan that relies solely on file system/database backups is best suited to customers who have a certain degree of tolerance for downtime.

Replication with Manual Failover

Customers who desire a quick restore time should consider setting up replication of both their file system and database. Replication ensures that all the data in both their file system and database is copied over to another server that can assume a role to serve up production traffic.

“It’s a business decision as to how often customers replicate the data. If the customer can take five minutes of data loss, then we try to sync every five minutes,” Scheel explains. “For file systems we use tools like Robocopy or Lsync to try to get down even to the second as data changes. We use either Replication (Linux) or Mirroring (Windows) for databases.”
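To make the idea of syncing on an interval concrete, here is a minimal one-way file sync sketch in Python. It is a stand-in for illustration only, not how Robocopy or Lsync work internally, and the function name and directory arguments are hypothetical. It copies files that are new or changed and does not remove files deleted at the source.

```python
import shutil
from pathlib import Path


def sync_once(source, replica):
    """Copy new or changed files from source to replica; return what was copied."""
    source, replica = Path(source), Path(replica)
    copied = []
    for src in sorted(source.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source)
        dst = replica / rel
        if dst.exists():
            s, d = src.stat(), dst.stat()
            # Same size and not newer than the replica copy: nothing to do.
            if s.st_size == d.st_size and s.st_mtime <= d.st_mtime:
                continue
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copy2 also preserves the modification time
        copied.append(rel.as_posix())
    return copied
```

Calling sync_once on a schedule, for example every 300 seconds, would match a five-minute tolerance for data loss; a shorter interval narrows that window at the cost of more frequent copying.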

While replication does require additional servers to be put online, the cloud has enabled businesses to get this additional hardware in an economical way. If the customer has the tolerance for the amount of downtime associated with a manual failover, the servers that are housing the replicated data can be smaller than their production counterparts.

For example, a customer may want the performance of an 8GB RAM server for their configuration, but the amount of data that sits on that server could fit on the storage space of a 2GB RAM server. The customer could mirror the data on the 8GB RAM server to the 2GB RAM server. Taking advantage of the elasticity of the cloud, if the customer had to fail over to the smaller 2GB server, they could resize it up to an 8GB RAM server to achieve the same level of performance, requiring simply the amount of time necessary for the resize to be queued, started and processed.

“That brings their DR up to an 8GB level and they see the same performance they saw on their previous server that just died. That essentially gives them a running replacement that was down with data in place. There is a time window to take into account for the resize, so if that is not acceptable, they can simply run another 8GB VM to replicate to,” Scheel says.
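The trade-off Scheel describes comes down to two questions: does the data fit on the smaller server, and is the resize window acceptable? A sketch of that decision follows. Every number you pass in (disk size of the small server, resize time, tolerated downtime) is something you would need to measure or look up yourself; none of them come from this post.

```python
def choose_standby(data_gb, small_disk_gb, resize_minutes, tolerated_downtime_minutes):
    """Pick between a small standby that is resized on failover and a full-size one."""
    if data_gb > small_disk_gb:
        # The replicated data would not fit on the smaller server.
        return "full-size"
    if resize_minutes > tolerated_downtime_minutes:
        # Waiting for the resize would exceed the downtime the business accepts.
        return "full-size"
    return "small-then-resize"


# Hypothetical figures: 60GB of data, an 80GB small server, a 20-minute resize.
relaxed = choose_standby(60, 80, resize_minutes=20, tolerated_downtime_minutes=60)
strict = choose_standby(60, 80, resize_minutes=20, tolerated_downtime_minutes=10)
```

With an hour of tolerated downtime the smaller standby is enough; with only ten minutes, replicating to a full-size server is the safer choice.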

What’s Right for Your Business

Different businesses have different needs, and our solutions engineering Rackers have the expertise to guide you to the best option. To understand what solution works for you, it would be best to give us a call and chat with one of these Rackers in more detail.

Are you looking for more information? Be sure to read our previous post where we discussed some of the high-level differences between Disaster Recovery and High Availability in the cloud. Check out our post next week where we will discuss High Availability in more detail.

Aaron Scheel is a solutions engineer Racker who has advised a number of our customers with strategies to ensure that their businesses have the ability to withstand an adverse event.

Garrett Heath develops content and supports customers on the Rackspace Social Media team. His previous experience includes technical project management in the cloud, content marketing and social media marketing. He enjoys writing about how the cloud is spurring innovation and telling stories about the people behind the tech. You can also read his work at MarketingBytes.io. In his free time, Garrett writes about food and local San Antonio culture at SA Flavor.


