The Cassandra Project

You may have heard about the Cassandra distributed database in recent articles or conferences. I’d like to explain what advantages Cassandra offers over traditional relational databases like MySQL or Oracle and why Rackspace has committed resources to the Cassandra project.

The Cassandra project was started by Facebook in 2007 to scale their internal applications, particularly Inbox Search. Earlier this year, they released it to the Apache Incubator, where people from the wider community could get involved and start contributing. This allowed the project to move in a direction that serves general needs rather than only Facebook’s.

In March, I became the first outside committer to this Apache Incubator project. Eric Evans from Rackspace and Jun Rao from IBM Research soon followed, and we recently added Chris Goffinet from Digg. The community has grown from 5 people in the IRC channel in December to over 60.

Distributed vs. Relational Databases

Traditional relational databases are 30 years old, are well understood, and have a huge ecosystem of tools around them. For that reason, they are a compelling option when building your application. Postgres, MySQL, and Oracle are all relational databases that model a schema as entities and the relations between those entities. That’s a good, powerful programming model with interesting theoretical properties. But companies with large amounts of data have already gone past what you can reasonably fit on a single machine, even on high-end hardware. Once data is spread across multiple machines, the CAP theorem makes it provably impossible to keep both strong consistency, which the ACID guarantees of the relational model rely on, and availability when the network between those machines partitions. Even if you’re willing to give up availability, scaling reads (via caching and replication) is difficult with relational databases, and scaling writes by partitioning is either very expensive, very painful from an application programming and operations standpoint, or both.

Cassandra takes the approach that, since you have to give up some parts of the relational model to scale anyway, it is worth starting over and rethinking the design. That means adding things like transparent replication and failover, built-in partitioning and load balancing, multiple data center support, and the ability to add capacity without ever disturbing applications running against the database.
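
To make built-in partitioning and adding capacity a little more concrete, here is a minimal Python sketch of consistent hashing, the general technique that Dynamo-style stores use to spread keys across machines. It is an illustration only, not Cassandra’s actual code: the Ring class, the node names, the replica count, and the use of MD5 to place keys are all assumptions made for this example.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a string to a position on the ring (MD5 is an assumption, used only to spread keys)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy consistent-hash ring: a key lives on the next `replicas` nodes clockwise."""

    def __init__(self, nodes, replicas=2):
        self.replicas = replicas
        # Each node owns one position on the ring, kept in sorted order.
        self.entries = sorted((token(n), n) for n in nodes)

    def add_node(self, node):
        # Adding capacity only inserts one new position; nothing else is rebuilt.
        bisect.insort(self.entries, (token(node), node))

    def nodes_for(self, key):
        # Find the first node clockwise from the key, then walk on for the replicas.
        tokens = [t for t, _ in self.entries]
        start = bisect.bisect_right(tokens, token(key)) % len(self.entries)
        count = min(self.replicas, len(self.entries))
        return [self.entries[(start + i) % len(self.entries)][1] for i in range(count)]


# Hypothetical three-node cluster that grows to four nodes.
ring = Ring(["node-a", "node-b", "node-c"], replicas=2)
keys = [f"user:{i}" for i in range(1000)]
before = {k: ring.nodes_for(k)[0] for k in keys}
ring.add_node("node-d")
after = {k: ring.nodes_for(k)[0] for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys changed primary node")
```

In this toy version, the only keys that change primary node are the ones that fall into the new node’s slice of the ring. Everything else stays where it was, which is the property that lets a cluster grow without reshuffling the whole dataset.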

Rackspace’s Involvement

The original Facebook team has been busy elsewhere, so the community has had to step up and take the initiative in moving Cassandra forward. Cassandra is open source and I don’t want to downplay others’ contributions, including those from IBM Research, Digg, and Twitter as well as other companies and individuals, but I’m proud that Rackspace’s support has been instrumental in adding many important new features, fixing bugs, and getting out new releases.

Here are 3 reasons why Rackspace has committed resources:

1-    As stated in previous posts by Erik Carlin, we are committed to an Open Cloud. With Amazon’s SimpleDB or Google App Engine’s datastore, you’re locked in. Cassandra presents an open alternative: you can write against Cassandra and deploy anywhere. That’s important.

2-    We have a suite of Cloud products that go beyond raw Cloud Servers. Cassandra is interesting to us because we can use it under the hood to improve Cloud Sites and Cloud Files. And people are already starting to ask, “When can I just go to Rackspace and deploy a preconfigured Cassandra cluster?” It’s still early, but that’s definitely something we’re looking at.

3-    Rackspace itself has a ton of data that we generate from our switches and routers and the rest of our infrastructure. Right now we are getting by with traditional monitoring and logging technologies, searching those logs and so forth. Cassandra will help us a lot with that as our volumes continue to increase. Our Mail & Apps products are also very interested in using Cassandra to store mail messages and other data.

Finally, I want to emphasize that Cassandra is not a magic bullet. You can’t just take your SQL app, put it on Cassandra, and expect it to work. It’s a different programming model: instead of modeling entities and relationships and just adding indexes to get performance, you need to think at a more basic level, “What information do I need to retrieve from each query?”, and model your Cassandra schema accordingly. It’s a different way of thinking and does require new code to be written. It’s very much for people who have a lot of data that doesn’t fit on a single machine and are feeling the pain from traditional approaches to scaling that.
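
As a rough illustration of that query-first way of thinking, the Python sketch below uses plain dictionaries rather than any real Cassandra client API. The inbox example, the field names, and the store and read_inbox helpers are all hypothetical. The point is only that the layout is chosen to answer one question, “show a user’s inbox, newest first,” with a single row lookup.

```python
from collections import defaultdict

# Hypothetical messages, as they might sit in one normalized relational table.
messages = [
    {"id": 1, "sender": "alice", "recipient": "bob", "sent_at": 100, "subject": "hi"},
    {"id": 2, "sender": "carol", "recipient": "bob", "sent_at": 105, "subject": "lunch?"},
    {"id": 3, "sender": "bob", "recipient": "alice", "sent_at": 110, "subject": "re: hi"},
]

# Query-first layout: one row per recipient, columns keyed by (sent_at, id).
# The structure mirrors the question being asked, not the entities involved.
inbox = defaultdict(dict)


def store(message):
    """Write the message under the row that the inbox query will read."""
    inbox[message["recipient"]][(message["sent_at"], message["id"])] = message


def read_inbox(user, limit=10):
    """Answer the query from a single row: newest messages first."""
    row = inbox.get(user, {})
    newest_first = sorted(row, reverse=True)[:limit]
    return [row[column] for column in newest_first]


for m in messages:
    store(m)

print([m["subject"] for m in read_inbox("bob")])  # ['lunch?', 'hi']
```

A second query, such as “show everything a user has sent,” would get its own row layout written at the same time, instead of a new index on the original table. That duplication is the new code the different model asks for.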

We plan to write some other posts in the future detailing what a switch might look like for some sample applications.


  1. What goals might you want from a shared-data system?

    – Strong Consistency: all clients see the same view, even in presence of updates
    – High Availability: all clients can find some replica of the data, even in the presence of failures
    – Partition-tolerance: the system properties hold even when the system is partitioned

    The dichotomy is really consistency vs. concurrency. It is possible to have both, even in large, distributed databases. You’re probably not familiar with Digital Equipment Corporation’s RDBMS named “Rdb”. It’s now owned by Oracle. Rdb implements both transaction access models. The implementor chooses (using declarative locking syntax) the lock model appropriate to the circumstances. I believe that Ingres also implemented these locking models.
    You might also want to add in the “… not a magic bullet.” category that DBMS like Cassandra, Hadoop &c are poor choices for systems that must implement serializable history.

  2. Fred,
    Nothing is a magic bullet, for everything. Ever.

    With that in mind, Rdb has its place, so does Cassandra, and Cassandra will not replace traditional SQL anytime soon. But the promise of a data store I can throw data at, and then consume it later in any fashion I want, is very appealing.

    Not having to pay ludicrous licensing fees is also appealing to my investors 🙂

  3. This is exciting, glad to see more support behind Cassandra!

    As a minor point though, the GAE lock-in is not as bad as it might appear. For one, you could adopt the AppScale project and run your own GAE cluster… on top of cassandra!

  4. The irony, though, is that you shouldn’t just choose Cassandra because your current RDBMS isn’t performing.

    To quote: “just adding indexes to get performance, you need to think at a more basic level: ‘What information do I need to retrieve from each query?’”

    The biggest performance gain you can get is to ask that question of any db (relational or not). The easiest way to get increased performance is to stop doing unnecessary calls to the db.

    Have Fun

    • Why shouldn’t you choose Cassandra (or a similar product) when your RDBMS truly isn’t performing? Are you just trying to make the point that it’s possible to optimize an RDBMS? Sure, a lot of people have efficiencies that could be gained in their systems, but when you have a web-scale dataset, it’s just not going to fit on one machine, no matter how hard you try.

  5. […] been fascinated with the articles, more and more over the last 3 years, arguing for a new style of storing data: Distributed vs. Relational […]

  6. How’s the progress on this coming, Jonathan? A Facebook game my team and I have been working on as a recreational side project for the past year is nearing completion, and we have been using Google App Engine up till now for our database needs. Unfortunately, they JUST unveiled absolutely horrendous Datastore pricing that essentially charges on a PER TUPLE RETURNED BASIS, which rips our game to pieces due to its high number of database interactions. I really think this game could be revolutionary, but it’s impossible for us to use Google’s HRD solution in its current form.

    We could definitely port the game to use Cassandra’s structure due to the similarity, but there are no good hosting solutions available out there!

    Save us Rackspace!

