Rackspace is the hosting provider of choice for small mom-and-pop businesses and Fortune 500 companies alike. But who hosts the hosting provider? I sat down with Victor Palma, a web scale engineer on the Rackspace Critical Sites team, to find out how we host our website to ensure it doesn’t go down.
The Critical Sites service offering is part of the Managed Hosting solution at Rackspace, and it is specifically designed for companies such as Mazda, for which an online presence is mission-critical. This offering has an aggressive Service Level Agreement that includes:
• 100% Production Platform Uptime Guarantee
• Five-Minute Notification Guarantee
• 1-Hour Hardware Replacement Guarantee
• Double Money-Back Payout
In setting up our Rackspace.com configuration, the Critical Sites team prepares for several types of challenges that our site may encounter.
Changes to the Site
“One of the biggest challenges is that our site changes often. We have a talented development team and they often change the look and feel for the site. For example, during Halloween the site was skinned with zombies, werewolves and vampires to fit the theme of Halloween. However, you have to ensure that when you make changes, those changes are seamless,” Palma explained.
As more changes are introduced to a site, the potential for those changes to have an adverse effect increases. One of the things that Rackspace does to manage this risk is Release Engineering, which is based on both Webistrano and Capistrano.
“This is a Graphical User Interface (GUI) where the developers can input code changes themselves without having to wait for system administrators,” Palma said. The changes are versioned so that Rackspace can always revert to the state before a change was made.
Furthermore, Rackspace has QA and Demo environments that we use to review changes before they hit production. These environments closely resemble our live environment so that we can ensure that any updates to the website or backend perform properly before being deployed.
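To make the idea of versioned changes concrete, here is a minimal Python sketch of a release history that can step back to the previous state. It is an illustration only, not how Webistrano or Capistrano are implemented; the class name `ReleaseHistory` and the release labels are hypothetical.

```python
class ReleaseHistory:
    """Ordered record of deployed versions, newest last (hypothetical helper)."""

    def __init__(self):
        self._releases = []

    def deploy(self, version):
        # Every deploy is recorded, so earlier states are never lost.
        self._releases.append(version)
        return version

    @property
    def current(self):
        return self._releases[-1] if self._releases else None

    def rollback(self):
        # Reverting needs at least one earlier release to fall back to.
        if len(self._releases) < 2:
            raise RuntimeError("no earlier release to roll back to")
        self._releases.pop()
        return self.current


history = ReleaseHistory()
history.deploy("release-1")
history.deploy("release-2-halloween-skin")  # made-up label
restored = history.rollback()               # back to the state before the change
```

Because every deploy is kept in order, undoing a bad change is a matter of pointing back at the previous entry rather than rebuilding the old state by hand.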
Increase of Traffic
If a company were to have a popular Black Friday sale or run a Super Bowl advertisement, it might experience an increase in traffic from higher-than-usual demand. This is an example of a “good” type of increased traffic. However, the site could also see an increase in traffic that is malicious, such as a Distributed Denial-of-Service (DDoS) attack. Rackspace.com has to be prepared for both types of increased traffic.
To determine whether Rackspace.com is receiving a higher-than-usual amount of traffic, the Critical Sites team uses several tools, including Spectrum and eHealth.
“We monitor everything from load balancers to firewalls and switches,” Palma said. “Spectrum is like a single pane of glass that allows us to see all the alerts, such as if we begin getting a lot of traffic, or if our servers are hitting a high amount of connections.” This could trigger either major or minor alerts, and the Critical Sites team uses its expertise to diagnose and interpret those alerts.
In conjunction with Spectrum, Rackspace uses eHealth to trend the environment and spot unusual connection levels at particular times of the day. “These tools are very sophisticated and can trend the normal behaviors for a given time. For example, eHealth will recognize when we are performing a scheduled backup and will not send us an alarm when the CPU spikes up.” Additionally, eHealth recognizes when there is too little traffic, allowing the team to verify whether there is an appropriate load or if something abnormal is occurring.
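The time-of-day trending described here can be sketched in a few lines of Python. This is a simplified stand-in under stated assumptions, not eHealth's actual algorithm: the function `is_unusual`, the three-standard-deviation tolerance, and all sample readings are hypothetical.

```python
from statistics import mean, stdev

# Hypothetical CPU readings (percent) from previous days, grouped by hour.
cpu_history_by_hour = {
    2: [88.0, 91.0, 90.0, 89.0, 92.0],   # backup window: high CPU is normal
    14: [30.0, 32.0, 29.0, 31.0, 28.0],  # mid-afternoon: low CPU is normal
}

def is_unusual(hour, observed, tolerance=3.0):
    """True when `observed` is far above OR below the norm for that hour."""
    samples = cpu_history_by_hour[hour]
    baseline = mean(samples)
    spread = stdev(samples)
    if spread == 0:
        return observed != baseline
    return abs(observed - baseline) > tolerance * spread

backup_spike = is_unusual(2, 91.0)      # False: expected during the backup
afternoon_spike = is_unusual(14, 91.0)  # True: same reading, wrong time of day
too_quiet = is_unusual(14, 5.0)         # True: far below the usual load
```

Comparing against a per-hour baseline rather than one fixed threshold is what lets the same reading be ignored during a backup yet flagged in the afternoon, and it catches suspiciously low readings as well as high ones.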
If the Critical Sites team determines that the traffic is malicious, they use a DDoS Mitigation tool to help filter hostile traffic. The tool essentially drops the malicious traffic while ensuring that the legitimate traffic reaches its destination. This DDoS Mitigation tool helps ensure that our website can stay up, even during the course of an attack.
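The article does not say how the DDoS Mitigation tool tells hostile traffic from legitimate traffic, and real products use far richer signals. As a deliberately simple illustration of the "drop the bad, pass the good" idea, the Python sketch below filters by request count per source within one time window. The function name, the limit, and the addresses (taken from the reserved documentation ranges) are all assumptions.

```python
from collections import Counter

def drop_noisy_sources(requests, max_per_source):
    """Keep only requests from sources at or under the per-window limit.

    `requests` is a list of (source_ip, path) tuples seen in one time window.
    """
    counts = Counter(source for source, _path in requests)
    return [(source, path) for source, path in requests
            if counts[source] <= max_per_source]

# Hypothetical window: one source floods the site, two behave normally.
window = [("203.0.113.9", "/")] * 500 + [
    ("192.0.2.10", "/cloud"),
    ("198.51.100.7", "/managed"),
]
delivered = drop_noisy_sources(window, max_per_source=50)
```

In this toy window only the two well-behaved visitors are delivered, while all 500 requests from the flooding source are dropped.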
Hardware Failure
When a hardware failure occurs, we may have to take a server out of rotation. There are a couple of ways that we ensure Rackspace.com can remain up in the event that we drop a node.
At the datacenter level, the Critical Sites team has created a High Availability (HA) configuration for our site. “The HA config is set to have redundant load balancers and firewalls. If a machine does fall out of rotation, we have enough system resources in a web farm to handle a high amount of traffic.”
Additionally, the team has set up an environment to ensure geographical redundancy. This gives them the opportunity to fail over to a completely separate datacenter in the event of a catastrophic failure at the primary datacenter.
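The point about having enough resources in the web farm when a machine falls out of rotation comes down to simple arithmetic, sketched below in Python. The function name and the capacity figures are hypothetical and do not describe the actual Rackspace.com farm.

```python
def can_absorb_failure(node_capacities, peak_load, failures=1):
    """True if the farm still covers `peak_load` after losing nodes.

    Assumes the worst case: the largest-capacity nodes are the ones that fail.
    Capacities and load share one unit, e.g. requests per second.
    """
    keep = max(0, len(node_capacities) - failures)
    surviving = sorted(node_capacities)[:keep]  # ascending, so largest are dropped
    return sum(surviving) >= peak_load

# Hypothetical farm: four web nodes rated at 500 requests per second each.
farm = [500, 500, 500, 500]
one_down = can_absorb_failure(farm, peak_load=1400)              # 1500 >= 1400
two_down = can_absorb_failure(farm, peak_load=1400, failures=2)  # 1000 < 1400
```

With these made-up numbers the farm rides out a single node failure at peak, but not two at once, which is the kind of case where failing over to a second datacenter becomes the fallback.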
Testing and Disaster Recovery Plan
While there are challenges with hosting Rackspace.com, Palma says that it is important to test the environment and have a disaster recovery plan. The Critical Sites team uses SOASTA to perform load tests of our configuration. “We need to understand how much traffic we can handle. The team has created a configuration that should handle a certain amount of theoretical traffic, but it is important to prove that it can indeed handle that actual amount of traffic.”
Palma emphasizes the importance not only of having a disaster recovery plan, but also of testing that it works. “Testing covers a lot of ground. It verifies that you have the right data, that the plan works if the event comes up and is a way to ensure business continuity.”
Rackspace.com Powered by Fanatical Support
There is a lot of technology that powers Rackspace.com and keeps us up and running, but one key ingredient is the Rackers who care for our site. Fanatical Support fuels Rackspace.com in the same way that it powers our customers.
“The tools are only as good as the people who stand behind them. It’s like having a good car – you might have the fastest car out there but if you don’t know how to drive it, then it’s just a regular car.”