At Rackspace we pride ourselves on Fanatical Support in all we do. The public face of Fanatical Support is having knowledgeable people ready to help customers when they need it. But behind the scenes, Fanatical Support means leveraging technology to improve every aspect of the customer experience.
The Rackspace Cloud Monitoring team has been working on ways to improve the lives of our customers by developing tools that will not only deliver a better support experience, but will allow us to deliver the features that our customers need quickly and reliably while avoiding service interruptions. Today we are open sourcing Dreadnot, a piece of technology that enables the continuous deployment of software.
Most people who come from a background in large-scale software have a horror story about a failed deployment. As a reaction to this, many companies have procedures for releasing new builds and manually testing them. The Cloud Monitoring team chose a different path.
Rather than deploying less frequently with more manual testing, we deploy more frequently, relying upon a culture of test-driven development, code review and extensive quality assurance automation to catch bugs early and minimize service interruptions. Our maxim is that a new engineer should be able to push code into production on their first day on the job.
This helps us bring the Fanatical Experience to our product: keeping the time between a reported bug and its production fix as short as possible ensures a continuous flow of improvement to our software. One might assume that the extra time spent building more sophisticated testing and infrastructure is wasted, but it pays off with more efficiency in other areas.
For example, when dealing with slower deployments, there is a constant overhead associated with diverging release and development branches, which becomes more complicated as more members work on new features simultaneously.
As the change surface of a release increases in size, it becomes more difficult to ensure that any pushed build contains the desired fixes and features without breaking old functionality. Smaller incremental builds make it easier to track down production problems because each one contains only a small number of changes, so we do not have to work out which of the twenty features in a release actually broke.
Continuous deployment doesn’t have to mean constantly dumping our code to all our servers and waiting for something to break. Since its inception, the Cloud Monitoring product has been designed to withstand widespread system failure. Our motto is “First to know. Last one standing.” Monitoring can’t be allowed to fail during a major data center failure – disaster scenarios are when our customers need monitoring to work the most. Rackers around the globe work to ensure major failures are as rare as possible, but the monitoring system must always be prepared.
Much of Cloud Monitoring’s resilience comes in the form of cross-region redundancy. Data points are gathered from five data centers around the globe, and every data point is independently processed in three. This gives us valuable options when it comes to deploying. We can take a data center offline with no customer impact, upgrade services running there, then bring it back online while carefully monitoring the impact of the upgrade.
Of course, this doesn’t mean we can push broken code without causing problems; it merely increases our chances of detecting certain classes of issues before they impact customers.
One key to maintaining our developer culture is having a single canonical way of doing things. There should be one correct way to run tests and one correct way to bundle our code. That is not to say each of these processes is simple; a cross-region rolling deployment is a multi-step process. But as engineers, what do we do when we encounter repetitive multi-step processes?
The Cloud Monitoring team started by using Etsy’s Deployinator, but it didn’t meet our needs perfectly. Deployinator was developed for a single-region product and took some shortcuts, but the basic ideas were sound. We were also looking at using Deployinator for multiple products inside Rackspace, and each team was faced with creating many customizations in Deployinator to fit the models we desired. As a result, we developed a new project, called Dreadnot, that we are open sourcing today.
In Dreadnot there is the concept of a Stack, a series of tasks to deploy a specific piece of software. The Stack defines how a deployment works and what code is being deployed. For example, in Cloud Monitoring we have one stack for our monitoring pollers and another for our API services.
Under each Stack is a set of Regions. For each Region, we track the currently deployed version of the Stack alongside the most recent version available on GitHub.
Under each Region you can see the complete history of a deployment in that region. For an individual deployment, you can go back and view the entire log with all the details, or view the diff link for the changes that happened in Git.
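To make the Stack, Region and deployment history hierarchy concrete, here is a minimal sketch in Python. It is not taken from Dreadnot's source; the class names, field names and the region names used in the usage note are assumptions chosen only to mirror the description above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Deployment:
    """One deployment attempt in a region (hypothetical record shape)."""
    from_revision: Optional[str]
    to_revision: str
    succeeded: bool
    log: List[str] = field(default_factory=list)


@dataclass
class Region:
    """A region tracks what is deployed there and keeps its full history."""
    name: str
    deployed_revision: Optional[str] = None
    history: List[Deployment] = field(default_factory=list)

    def record(self, to_revision: str, succeeded: bool, log: List[str]) -> Deployment:
        # Store the attempt; only a successful one changes the deployed revision.
        deployment = Deployment(self.deployed_revision, to_revision, succeeded, list(log))
        self.history.append(deployment)
        if succeeded:
            self.deployed_revision = to_revision
        return deployment


@dataclass
class Stack:
    """A stack names the software being deployed and the regions it runs in."""
    name: str
    latest_revision: str  # most recent revision available in the repository
    regions: Dict[str, Region] = field(default_factory=dict)

    def out_of_date_regions(self) -> List[str]:
        # Regions whose deployed revision differs from the latest available one.
        return [r.name for r in self.regions.values()
                if r.deployed_revision != self.latest_revision]
```

For example, a Stack with regions "ord" and "lon" (made-up names) where only "lon" runs an older revision would report "lon" from out_of_date_regions() until a successful deployment is recorded there.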
We built in a deep integration with our infrastructure to ensure both a high quality product and a seamless user experience during the deployment of a Stack. Dreadnot finds the target revision SHA we wish to deploy from Git, and then talks to Buildbot about that specific revision. It then ensures that all test cases have passed in Buildbot, and that a tarball has been generated for this revision. If these builds haven’t started, Dreadnot will trigger Buildbot to build them, then wait for Buildbot to complete the tests and make sure the release tarball is available.
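The build check described above amounts to a gate: trigger whatever has not started, then wait until the tests have passed and the tarball exists. The Python sketch below illustrates that logic only. The BuildServer interface, the builder names "tests" and "tarball", the status strings and the timeout values are all assumptions for illustration; they are not Buildbot's real API.

```python
import time
from typing import Callable, Iterable, Protocol


class BuildServer(Protocol):
    """Hypothetical stand-in for a build system client (not Buildbot's API)."""

    def status(self, builder: str, revision: str) -> str:
        """Return one of 'missing', 'running', 'passed' or 'failed'."""
        ...

    def trigger(self, builder: str, revision: str) -> None:
        """Start a build of the given revision on the given builder."""
        ...


def ensure_built(
    server: BuildServer,
    revision: str,
    builders: Iterable[str] = ("tests", "tarball"),
    poll_seconds: float = 5.0,
    timeout_seconds: float = 1800.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> None:
    """Block until every builder has passed for this revision, or raise."""
    builders = list(builders)
    deadline = clock() + timeout_seconds

    # Trigger any build that has not been started for this revision.
    for builder in builders:
        if server.status(builder, revision) == "missing":
            server.trigger(builder, revision)

    # Poll until all builds have passed; abort on the first failure.
    pending = set(builders)
    while pending:
        for builder in sorted(pending):
            state = server.status(builder, revision)
            if state == "failed":
                raise RuntimeError(f"{builder} failed for revision {revision}")
            if state == "passed":
                pending.discard(builder)
        if pending:
            if clock() >= deadline:
                raise TimeoutError(f"timed out waiting for {sorted(pending)}")
            sleep(poll_seconds)
```

Passing sleep and clock as parameters is a design choice of this sketch: it lets the gate be exercised in tests without real waiting.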
Once the build is tested and ready, Dreadnot reconfigures the load balancers in the target region. Using the balancer-manager feature of mod_proxy, it drains requests to the local API servers and sends all traffic to API servers located in a different region. This temporarily increases the request latency for some customers, but they experience zero downtime at the HTTP API level.
Dreadnot then modifies a databag in our Chef server that references the revision it built for this deployment. It then uses parallel SSH to execute chef-client on the machines in the desired region. Dreadnot uses a triggered chef-client command instead of daemon mode because we wanted it to control exactly when other non-code changes are made. Both code and configuration management changes introduce risk into the environment. The Cloud Monitoring team believes the best time to roll out Chef recipe changes is when customer traffic is already shifted to another region, so we wanted to treat recipe changes like a code deployment.
Our Chef recipe downloads the remote tarball from our build servers, extracts it, updates a symbolic link and begins restarting services. Once all of the chef-client runs are complete, Dreadnot runs tests against the upgraded servers and validates that the new version is running successfully.
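Triggering chef-client on every machine in a region at once, then waiting for all runs to finish, can be sketched in Python as follows. This is an illustration rather than Dreadnot's implementation: the function names, the concurrency limit, and the assumption of key-based SSH with passwordless sudo are all hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List


def run_chef_client(host: str) -> int:
    """Run chef-client once on a host over SSH and return its exit code.

    Assumes key-based SSH access and passwordless sudo on the target host.
    """
    result = subprocess.run(
        ["ssh", host, "sudo", "chef-client"],
        capture_output=True,
        text=True,
    )
    return result.returncode


def converge_region(
    hosts: Iterable[str],
    runner: Callable[[str], int] = run_chef_client,
    max_parallel: int = 10,
) -> Dict[str, int]:
    """Run the given runner on all hosts in parallel; map host to exit code."""
    hosts = list(hosts)
    if not hosts:
        return {}
    with ThreadPoolExecutor(max_workers=min(max_parallel, len(hosts))) as pool:
        # pool.map preserves input order, so codes line up with hosts.
        codes = list(pool.map(runner, hosts))
    return dict(zip(hosts, codes))


def failed_hosts(results: Dict[str, int]) -> List[str]:
    """Hosts whose run exited non-zero, sorted for stable reporting."""
    return sorted(host for host, code in results.items() if code != 0)
```

The runner is injectable so the orchestration can be tested without SSH. A deployment would proceed to its validation tests only when failed_hosts() returns an empty list.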
If any of these steps fail, Dreadnot will stop to wait for human intervention and continue directing traffic to another region. Dreadnot was developed to assist with the most common multiple-region deployments. However, for more complicated deployments, or for deployments that experience a fatal error, you can proceed manually without interference from Dreadnot.
Assuming everything worked, Dreadnot then reconfigures the load balancers to bring traffic back to the region it just upgraded. This process is then repeated for the remaining regions.
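Taken together, the per-region steps form a rolling loop that halts at the first failure and leaves the failed region drained. The Python sketch below shows that control flow under stated assumptions: the four callables (drain, upgrade, verify, restore) are hypothetical hooks standing in for the load balancer, Chef and test steps, and the function is not part of Dreadnot.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Step = Callable[[str], None]  # a hook that acts on one region, raising on error


def rolling_deploy(
    regions: Iterable[str],
    drain: Step,
    upgrade: Step,
    verify: Step,
    restore: Step,
) -> Tuple[List[str], Optional[str]]:
    """Upgrade regions one at a time.

    Returns (completed_regions, failed_region). failed_region is None when
    every region was upgraded and had its traffic restored.
    """
    completed: List[str] = []
    for region in regions:
        drain(region)  # shift customer traffic to another region first
        try:
            upgrade(region)
            verify(region)
        except Exception:
            # Stop for human intervention. Traffic is deliberately NOT
            # restored, so customers keep being served from other regions.
            return completed, region
        restore(region)  # bring traffic back only after verification passes
        completed.append(region)
    return completed, None
```

If the upgrade of the second of three regions fails, the first region is reported as completed, the second is returned as the failed region with its traffic still redirected, and the third is never touched.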
We handle staging and other environments by giving each its own separate Dreadnot instance. We chose this partly for security reasons and partly to prevent accidents, so that testing and staging infrastructure are completely isolated from production.
Dreadnot is open sourced under the Apache License version 2.0, and we hope it can be useful in deploying your own projects. Rackspace has started using it on two different product teams inside the company, so while there are still some areas that could be made more generic and more features to add, we believe Dreadnot is at a good starting point. Our team would love to see ideas from the community and pull requests to help make Dreadnot more helpful to everyone.
The Rackspace Cloud Monitoring team is fanatical about continuous deployment. We love being able to iterate quickly on our product and believe that our customers will get the best experience possible by doing so. If you would like to try out the Rackspace Cloud Monitoring product, be sure to fill out the Private Beta application survey. Additionally, we are hiring folks who are interested in solving these kinds of problems from the inside.