Earlier this year, Rackspace Cloud Metrics moved our production system from running on virtual cloud machines to running on Rackspace OnMetal. The outcome is a more reliable system that is — miraculously — cheaper.
Cloud Metrics is a multi-tenant software-as-a-service (SaaS) that offers a flexible and affordable platform for storing and serving time-series metrics. It provides a REST API for metrics ingestion and retrieval. In addition, it provides out-of-the-box integration with popular open-source tools like statsd, collectd and Grafana. The software that powers this service is an open-source project named ‘Blueflood,’ built on top of Apache Cassandra.
For our operational needs, Rackspace Cloud Metrics requires a large, stable cluster of machines with gobs of storage space. We worry about scale and stability so our users don’t have to. Our focus on scale and stability is precisely our motivation for moving to OnMetal.
We spent a lot of time tending to the needs of our previous cluster. To our customers, the service was relatively stable, but at the cost of much manpower. In turn, this cut into our engineering efforts to improve the system. For a couple of reasons which I’ll explain below, we expected OnMetal servers to improve the reliability of our service and give back some much-needed engineering time.
We had long coveted the computing capacity available with OnMetal but expected it would increase our costs. However, when we did the math we found it was actually cheaper (yes, cheaper!) than our existing setup. At this point we realized we had to move to OnMetal.
Our original footprint
Before we moved to OnMetal, we operated on Rackspace 60GB ‘Performance 2’ servers, now referred to as ‘I/O Virtual Servers’: virtual machines running on Rackspace’s OpenStack cloud. We also had a good amount of Cloud Block Storage (CBS) attached to each server.
The performance of our virtual servers was more than enough for our needs. In fact, Cassandra rarely presented much load for these machines, and the servers were generally pretty bored. The rationale for overprovisioning CPU and memory was to get the 600GB of storage space that these machines provided.
But even these machines didn’t provide enough space for our needs. Rather than spin up an endless supply of servers, we augmented the storage that came with these machines with another 800GB per server of SATA Rackspace Cloud Block Storage. We connected everything with Rackspace Cloud Networks and had a big, giant cluster for our metrics.
Problems with the old cluster
We operated the cluster in this split state for quite some time: some of the data locally, some of it on CBS. But this setup made the cluster very sensitive. Any blip in our Cloud Networks might cause a disconnect with CBS, which would put a Cassandra node into a bad state. We could always recover from these issues (Cassandra is very good at that), but it caused a lot of unnecessary work for our team.
We also realized that with all of our Cloud Block Storage, our footprint was actually more expensive than a similarly sized cluster of OnMetal I/O nodes. At consumer prices, the cost of one of our 60GB servers plus 800GB of Cloud Block Storage is roughly $2,050 a month. Meanwhile, the cost of a single OnMetal I/O node with no Cloud Block Storage is roughly $1,780 a month. For our 32-node cluster, that works out to more than $100,000 in savings annually.
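For anyone who wants to check our math, here is a minimal sketch of the calculation. It uses only the rounded consumer prices quoted above and assumes a one-for-one node swap; the function and constant names are ours, made up for illustration.

```python
# Rounded consumer prices quoted in this post, in USD per month.
VIRTUAL_NODE_MONTHLY = 2050  # one 60GB server plus 800GB of Cloud Block Storage
ONMETAL_NODE_MONTHLY = 1780  # one OnMetal I/O node, no Cloud Block Storage
NODE_COUNT = 32              # size of our cluster


def annual_savings(old_monthly, new_monthly, nodes):
    """Yearly savings when every node is swapped one-for-one."""
    return (old_monthly - new_monthly) * nodes * 12


savings = annual_savings(VIRTUAL_NODE_MONTHLY, ONMETAL_NODE_MONTHLY, NODE_COUNT)
print(f"Per node, per month: ${VIRTUAL_NODE_MONTHLY - ONMETAL_NODE_MONTHLY}")
print(f"Whole cluster, per year: ${savings:,}")
```

With these inputs the per-node difference is $270 a month, which comes to $103,680 a year across 32 nodes.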
The move to OnMetal
We essentially did a one-for-one swap of Performance 2 servers to new OnMetal servers. Because OnMetal doesn’t yet have Cloud Network capability, we had to redo some wiring in the Cassandra configuration and iptables, but that was about the most complicated part of the move.
By switching over to OnMetal, our performance improved dramatically. The graphs below were taken from the period when we switched everything to OnMetal — check out those improvements. The ClientRequest read and write latency improved many times over. The graphs below show the mean, 95th and 99th percentiles for all reads and writes:
As you can see, the improvement is pretty dramatic. The mean and the 99th percentile for reads improved threefold, while the 95th percentile is virtually indistinguishable from the mean. For writes, we went from measuring everything in milliseconds to measuring everything in microseconds.
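If you are less familiar with these statistics, here is a rough sketch of how a mean, a 95th percentile and a 99th percentile can be derived from raw latency samples. The samples are randomly generated for illustration and are not our measurements, and the `summarize` helper is hypothetical; Cassandra computes its own ClientRequest metrics internally.

```python
import random
import statistics


def summarize(latencies_ms):
    """Return the mean, 95th and 99th percentile of a list of latencies."""
    # quantiles(n=100) returns the 99 cut points for percentiles 1 through 99.
    cuts = statistics.quantiles(latencies_ms, n=100)
    return {
        "mean": statistics.fmean(latencies_ms),
        "p95": cuts[94],  # 95th percentile
        "p99": cuts[98],  # 99th percentile
    }


# Made-up samples: exponentially distributed with a 2 ms average.
random.seed(42)
samples = [random.expovariate(1 / 2.0) for _ in range(10_000)]

summary = summarize(samples)
for name, value in summary.items():
    print(f"{name}: {value:.2f} ms")
```

The percentiles matter because a mean can hide a slow tail: the 99th percentile tells you what the slowest one percent of requests experienced.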
The improvements really weren’t that surprising. First, OnMetal has better performance characteristics than the virtual machines we were using. Second, our old Cassandra setup was non-standard, with Cloud Block Storage augmenting our local storage. With OnMetal, all the storage is local to the box and, lo and behold, local storage is faster! Still, it was nice to see confirmation that these machines, which have no virtualization layer and whose performance is not impacted by noisy neighbors, help us operate much, much more efficiently.
The main challenge we did encounter was that a couple of the OnMetal nodes would reboot at random times. This only affected a handful of servers, most of them in our staging environment. For those occasions, we asked the OnMetal team to put the machine into maintenance mode so it would not get reprovisioned when we deleted it. So far, simply removing a bad node and spinning up a new one has solved our problems.
Since this was a fairly serious issue, we worked directly with the OnMetal team to isolate the problem. Fortunately, because Cassandra is wonderfully resilient, a box going down for 30 to 45 seconds while it reboots isn’t that big of a deal, so we left one rebooting node so that we could research the problem. Eventually, the OnMetal team discovered that a firmware upgrade to the fan table solved the issue. The last of our rebooting servers has run smoothly for the past several weeks now. The OnMetal team has since deployed the fix throughout the entire fleet of OnMetal servers.
The only other thing we miss from our Performance Cloud days is direct console access. With our old public cloud servers, if there was ever a problem ssh’ing to a machine, we could always go through mycloud.rackspace.com and pull up the Java console app for that server. Since our OnMetal cluster has been very stable, we haven’t had the need for it. Still, the lack of this fallback concerns the more cautious members of my team.
What did we gain?
- More reliability: by removing our dependencies on Cloud Block Storage and Cloud Networks, we reduced the number of services that could affect our operation. Plus, our OnMetal machines are much less likely to be affected by noisy neighbors, so they have a better chance of remaining stable.
- Simpler setup: along the same lines as above, without Cloud Block Storage and Cloud Networks, our entire infrastructure is just simpler to set up and maintain.
- Reduced costs: simply put, it’s cheaper. I know, it sounds crazy! You get real, live hardware spun up just like it’s a cloud server, but for less money than the equivalent cloud server plus cloud block storage.
- More performance: we have more than double the RAM now, and even though our old servers had 16 virtual cores to OnMetal I/O nodes’ 10 cores, the performance of these new boxes far outpaces the old.
- More storage: even with Cloud Block Storage, we only provisioned 1.4TB of storage per server. The OnMetal I/O nodes come ready with 1.6TB, so we increased our capacity by 14 percent without even trying.
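The storage figure in the last bullet can be checked the same way. This is a small sketch using the per-server numbers from this post, assuming decimal units (1TB = 1000GB); the constant names are ours.

```python
# Per-server storage figures from this post, in GB.
LOCAL_GB = 600     # local storage on each old virtual server
CBS_GB = 800       # Cloud Block Storage attached to each old server
ONMETAL_GB = 1600  # storage on each OnMetal I/O node

old_total_gb = LOCAL_GB + CBS_GB
increase = (ONMETAL_GB - old_total_gb) / old_total_gb

print(f"Old footprint per server: {old_total_gb} GB")
print(f"Capacity increase: {increase:.0%}")
```

An extra 200GB on top of 1,400GB is a gain of roughly 14 percent per server.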
Finally, we gained real machines. We aren’t just being nostalgic for the good ol’ pre-cloud days: there is something comforting in the thought that our servers are independent, whole servers not beholden to the whim and mercy of some cruel hypervisor.
So far, the difference for the team between using real machines and using virtual servers is night and day. We spend less time working on our infrastructure and more time working on moving our product forward. We’ll make that trade, especially if we save money in the process, any day of the week.