Rackspace Cloud Files is a massively scalable cloud object storage system built using OpenStack Swift. As with other OpenStack projects, Swift is written in Python to take advantage of its rich technical features, fast delivery of capabilities and quick ramp-up time for developers to contribute.
As Cloud Files grew, however, we started to experience a number of scaling problems and limitations with Python. At the OpenStack Summit in Tokyo on Tues., Oct. 27, Rackspace Principal Engineer Michael Barton and Senior Software Developer David Goetz will describe how Hummingbird, written in Go, has dramatically improved performance in the two production environments where it’s deployed so far, in our Virginia and London data centers.
We hope you’ll join us for the session. But for those who can’t make it, read on to understand the problems Swift was experiencing, and see just how much Hummingbird has improved performance.
Rackspace started work on the Swift Storage Project in 2009 and launched it in 2010 alongside the Nova Compute Project as part of the original OpenStack platform. Swift is broken into four main services, one public facing and three internal. This separation is necessary for a horizontally scalable product, and it also leaves a lot of flexibility when architecting the hardware deployment for Swift. The four main services are:
- Proxy Services (public facing)
- Object Services
- Container Services
- Account Services
Since launch, Cloud Files has grown to store hundreds of billions of objects for hundreds of thousands of customers in six global data centers. It is used as a standard object store as well as a backend for Glance (the OpenStack Image Service), Hadoop, and various software backup solutions, and as the origin for a large CDN deployment. The team has pushed Swift to its limits, and while it has performed amazingly well over the years, it has gradually and steadily begun to experience a number of scaling problems that have become increasingly complex to fix.
A Swift History of Running Object Storage at Scale
Rackspace open-sourced Swift in 2010, and the Object Services layer, namely the object server and replicator, has remained relatively unchanged since it was originally written. As Cloud Files grew, it started to run into a number of scaling problems and limitations with Python. For example, Python is limited in how well it supports concurrency and disk I/O, and this can create bottlenecks at the object server layer.
We considered many different approaches, such as configuring a larger number of object workers, modifying concurrency settings and building in event loops. Even so, when a node began to fail at the hardware or disk level, the Python object service would exacerbate the problem. The object server could become completely unresponsive or, equally bad, continue to accept connections but serve the data very slowly. The result was a percentage of requests taking a long time to complete. We found ourselves spending a huge amount of time running and optimizing the service at the expense of new features.
To overcome the operational issues we were experiencing, Michael Barton, a principal engineer on the Cloud Files team, began experimenting with rewriting the Object Server in various alternative languages and frameworks. Initial benchmarks for Go were very promising, so Barton, along with David Goetz, a senior software developer on Cloud Files, presented a case to Rackspace Engineering leaders seeking resources to pursue rewriting and deploying Object Services in Go, under the project codename Hummingbird.
Once the team got the go-ahead, our next steps were to make Hummingbird functionally equivalent to Swift at the API level and to create a test suite for comparing the two at the Object Server and Object Replicator level. Next, we deployed to a single production node in our Virginia data center, then extended to a four-node zone, then half the cluster and finally the entire region. This was a purposeful, slow release over a number of months in mid-2015, taking advantage of Swift’s modularity.
We have seen extremely positive performance improvements in the two production environments where Hummingbird is deployed today, namely our Virginia and London data centers. Below is a sample of data showing some of these improvements.
First, take a look at Read Timeouts. A read timeout is triggered when our proxy server is reading from our object layer while serving a GET request and has to wait too long for data. Specifically, if the object server fails to send a byte within 19 seconds of the last byte, a read timeout error is raised. In general, the proxy server is able to recover from this timeout by querying a separate copy of the object, but the client will experience a long lag of inactivity while waiting for the initial response.
Here we see a dramatic decline post deployment of Hummingbird in our Virginia region:
This dramatic drop was due in large part to two things. First, the Hummingbird object server does not become unresponsive under heavy load the way the Python server would. Second, the single-process server model we were able to use with Go let us implement a simple but effective means of error limiting slow drives. The proxy server still uses its existing failover model, but it can now fail over almost instantaneously rather than waiting on the timeout.
The Go object server is also roughly twice as fast as the Python server, as you can see in this graph, which ignores the very slow timed-out requests as outliers:
In general, our team has been extremely pleased with the results so far. We have taken a huge step toward providing a better experience using Cloud Files for all workload types.
We were able to pick up Golang quickly and port over a majority of the needed features in a relatively short amount of time, and Hummingbird is currently a feature branch of OpenStack Swift. Another huge goal of the project is to begin the next generation of our replication framework. We’ve made a lot of progress, and we are looking forward to the next stage in development.
Want to learn more about Hummingbird? Check out the new feature branch on GitHub. Want to work on it? We’re hiring! Join Rackspace and help solve the complex challenges of running a massively scalable cloud object store.