When I tell people that I work with Erlang at Rackspace, they generally assume that I work on cloud offerings or OpenStack. In fact, my team, the Foundation Development and Automation Team, is part of the Infrastructure Services group, and we support the traditional dedicated hosting business. As far as I know, we are the only team at Rackspace with a mission haiku:
Adapt and collaborate
Essentially, we build automation tools and APIs. My project team focuses on automating network devices. The basic tasks we support include backing up devices, pushing out updates to large groups of devices and generating base configurations for new devices. Recently, we have been working on more advanced tools: the API that backs the Firewall Manager on MyRackspace.com and the APIs for pulling topology information from switches and changing VLAN assignments on switch ports. Our earlier work was focused on providing full-stack automation tools, whereas now we are providing APIs for others to build tools upon. This change in focus allows us to work more efficiently, integrate with other automation efforts within the company and grow with the business.
There are currently about 50,000 network devices in our database. Breaking that number down a bit, we have firewalls, load balancers and switches. The single largest group of devices is the infrastructure switches, which account for more than half of the devices we manage. These devices are spread across eight datacenters in North America, Europe and Asia.
There are three challenges that we face when automating these devices. The first is performance at scale. Our responsibility for these devices differs depending on the type of device and its role in the network infrastructure. We don’t, for example, back up all 50,000 devices every night. That said, working with even a significant fraction of these devices requires a high level of parallelism. Some of the devices are quite slow, and the time spent waiting for the device or in network I/O is the primary bottleneck. Since we cannot speed up communication with individual devices, we need to communicate with several devices at once to speed up the overall process.
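To make the point about I/O-bound work concrete, here is a minimal sketch in Python (not our production code, which is Ruby and Erlang). The `backup_device` function, the device names and the 50 ms delay are all made up for illustration; the sleep stands in for time spent waiting on a slow device.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def backup_device(name, delay=0.05):
    """Hypothetical stand-in for a device session; the sleep models I/O wait."""
    time.sleep(delay)
    return f"{name}: ok"


def backup_all(devices, max_workers=20):
    """Talk to up to max_workers devices at once; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(backup_device, devices))


devices = [f"switch-{i}" for i in range(40)]

start = time.perf_counter()
results = backup_all(devices)
elapsed = time.perf_counter() - start

# One at a time, 40 devices at 50 ms each would need about 2 seconds.
print(f"{len(results)} devices in {elapsed:.2f}s")
```

Because each task spends nearly all of its time waiting, adding workers shortens the total run without making any single device faster.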
The second challenge is dealing with the differences between devices from different vendors and even differences between code versions on the same hardware. The lowest common denominator of device automation is logging in over SSH or Telnet and interacting with a command line session. We can also utilize SNMP, but the devices vary significantly in their level of support. Vendors will supply their own management interfaces and APIs that are often better than the alternatives, but if we utilize them we must do so in a way that provides a uniform interface to our clients.
The final challenge we face is transparency. If we cannot communicate with a device, or if an automation process fails, we need to know about the failure and, if possible, why it occurred. We also need to keep a record of all device interactions for troubleshooting and auditing purposes.
When I joined the team we had several Ruby on Rails applications that provided the basic features we needed. Unfortunately, those applications had significantly overlapping feature sets, yet each was an information silo. This made it difficult to add features or fix bugs, especially since we were a very small team. The first step was to combine the smaller applications into one comprehensive application. This eliminated a significant amount of code duplication and allowed our team to be more effective.
At this point we also replaced our MySQL database with MongoDB. While MySQL is a fine product, we felt that it was introducing friction into our development process. As a small team we needed to be able to focus on solving the core business problems and MongoDB allowed us to do that in a way MySQL did not.
We still had issues with scale; the Ruby applications just couldn’t handle communicating with a large number of devices in parallel. The next step was to take the core Ruby code that we used to talk to devices and wrap it in a robust framework written in Erlang. This framework, which we call FireEngine, allows us to talk to a large number of devices at once, with more transparency than we have ever had. Our Rails application is now just a UI for the information in our database, while FireEngine does the heavy lifting of communicating with the devices on the backend.
There are three pieces of code that we developed while working on FireEngine that we are especially proud of. The first is a library that evolved from the original device communication code. We still have to communicate with a lot of devices by automating a command-line session. This library has evolved to the point where it can handle SSH1, SSH2 and Telnet transparently. Client code doesn’t need to know how to communicate with the device, only what to do once communication is established.
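The idea of hiding the protocol from client code can be sketched roughly as follows. This is a Python illustration only: the class names, the `protocol` field and the fake transports are hypothetical, and no real network connection is made.

```python
from abc import ABC, abstractmethod


class Transport(ABC):
    """What client code sees: a session that accepts commands."""

    @abstractmethod
    def send(self, command: str) -> str:
        ...


class FakeSSH2(Transport):
    # Hypothetical transport; a real one would open an SSH2 session.
    def send(self, command: str) -> str:
        return f"[ssh2] {command}"


class FakeTelnet(Transport):
    # Hypothetical transport; a real one would open a Telnet session.
    def send(self, command: str) -> str:
        return f"[telnet] {command}"


# Assumed lookup table from a device's protocol to its transport class.
TRANSPORTS = {"ssh2": FakeSSH2, "telnet": FakeTelnet}


def connect(device: dict) -> Transport:
    """Pick the transport for a device so callers never have to."""
    return TRANSPORTS[device["protocol"]]()


def show_version(device: dict) -> str:
    """Client code: says what to do, not how to reach the device."""
    return connect(device).send("show version")


print(show_version({"name": "fw-01", "protocol": "ssh2"}))
print(show_version({"name": "sw-07", "protocol": "telnet"}))
```

The `show_version` function is identical for both devices; only `connect` knows which protocol is in play.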
Earlier I mentioned that we use Ruby and Erlang together. The second piece of code is an Erlang library that allows us to seamlessly call Ruby functions from Erlang code. The code is simple, leveraging some basic features of Erlang to start an instance of Ruby and send it instructions, but it is very powerful. We are able to use Ruby where it makes sense within our Erlang application.
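The general pattern of starting an interpreter and sending it instructions can be sketched like this. It is an illustration in Python rather than our Erlang library: a parent process starts a child interpreter and exchanges one JSON message per line over its standard input and output. The message format, the `Worker` class and the function names are assumptions made for the example.

```python
import json
import subprocess
import sys

# Child program: reads one JSON request per line, writes one JSON reply per line.
CHILD_SOURCE = r'''
import json, sys
FUNCTIONS = {"upcase": str.upper, "reverse": lambda s: s[::-1]}
for line in sys.stdin:
    request = json.loads(line)
    result = FUNCTIONS[request["fn"]](request["arg"])
    sys.stdout.write(json.dumps({"result": result}) + "\n")
    sys.stdout.flush()
'''


class Worker:
    """Owns one long-lived child interpreter and forwards calls to it."""

    def __init__(self):
        self.proc = subprocess.Popen(
            [sys.executable, "-c", CHILD_SOURCE],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
        )

    def call(self, fn, arg):
        # Send one request line, then block until the reply line arrives.
        self.proc.stdin.write(json.dumps({"fn": fn, "arg": arg}) + "\n")
        self.proc.stdin.flush()
        return json.loads(self.proc.stdout.readline())["result"]

    def close(self):
        # Closing stdin ends the child's read loop, so it exits on its own.
        self.proc.stdin.close()
        self.proc.wait()
        self.proc.stdout.close()


worker = Worker()
upper = worker.call("upcase", "fw-01")
backwards = worker.call("reverse", "abc")
worker.close()
print(upper, backwards)
```

Keeping the child process alive between calls avoids paying the interpreter start-up cost on every request.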
The third piece of code is a batch processing framework. Whenever we need to perform a large job, such as backing up all the customer devices in a datacenter, we use this library. Each job consists of a coordinator, a configurable number of asynchronous workers and a callback module that specifies exactly what we want each worker to do. If a worker crashes, the coordinator logs the event and starts a new one to replace it. We can scale the number of workers up or down as necessary, and since the logic for a job is specified in a simple callback module, we can create new types of jobs very easily.
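The shape of such a job can be sketched as follows. This is a simplified Python illustration using threads, not the Erlang framework itself; `BackupJob`, `Coordinator`, the event names and the device names are all hypothetical.

```python
import queue
import threading


class BackupJob:
    """Hypothetical callback module: what a worker does with one item."""

    def process(self, device):
        if device == "bad-device":
            raise RuntimeError("connection refused")
        return f"backed up {device}"


class Coordinator:
    def __init__(self, callback, items, num_workers=3):
        self.callback = callback
        self.num_workers = num_workers
        self.todo = queue.Queue()
        for item in items:
            self.todo.put(item)
        self.events = queue.Queue()  # workers report to the coordinator here
        self.results = {}
        self.crash_log = []

    def _worker(self):
        while True:
            try:
                item = self.todo.get_nowait()
            except queue.Empty:
                self.events.put(("exit", None, None))
                return
            try:
                outcome = self.callback.process(item)
            except Exception as exc:
                # A failure ends this worker; the coordinator replaces it.
                self.events.put(("crash", item, str(exc)))
                return
            self.events.put(("done", item, outcome))

    def _spawn(self):
        threading.Thread(target=self._worker, daemon=True).start()

    def run(self):
        for _ in range(self.num_workers):
            self._spawn()
        alive = self.num_workers
        while alive:
            kind, item, detail = self.events.get()
            if kind == "done":
                self.results[item] = detail
            elif kind == "crash":
                self.crash_log.append((item, detail))  # log the event
                self._spawn()  # start a replacement; worker count unchanged
            else:
                alive -= 1  # a worker found the queue empty and exited
        return self.results


job = Coordinator(BackupJob(), ["fw-1", "bad-device", "sw-1", "lb-1"], num_workers=2)
results = job.run()
print(results)
print(job.crash_log)
```

A new kind of job only needs a new class with a `process` method; the coordinator and worker logic stay the same.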
The results of our work have been very satisfying. Device interactions are now fast and reliable. Our largest datacenter, with just under 7,000 customer network devices, backs up in 39 minutes with very few errors. With the added transparency, we know not only when a device fails to communicate but also why. All device interactions are logged with details of what happened. This has proven to be an invaluable diagnostic and auditing tool.
We have made some bold technology decisions over the last couple of years to meet the challenges we faced. However, these decisions were not taken lightly. We had to make a convincing argument to management and then prove the technology with prototypes and working code. The decisions have proven to be the right ones and we are now ready to meet the next generation of automation challenges within Rackspace.
Phil Toland presented this topic in a talk at the Erlang Factory at the end of March as a case study of how Rackspace uses Erlang. Read Phil’s previous post, Overview of the Erlang Programming Language, and if you are interested in working on Erlang at Rackspace, check out the relevant job postings.