by Gary Dusbabek
Apache Cassandra is a fully distributed, highly scalable, sparse-table database. It combines Dynamo’s fully distributed design with Bigtable’s schema-free, ColumnFamily-based data model. Client-tunable eventual consistency lets users achieve a high degree of consistency without sacrificing cluster availability or data redundancy.
Many users arrive at Cassandra after reaching the limits of what they can affordably accomplish with a traditional relational database (RDBMS). However, Cassandra is not a drop-in replacement for MySQL or Oracle. It has some features that relational systems lack and is missing some features found in relational systems. If you understand your application well and are willing to think about problems differently, Cassandra might be a tool worth exploring.
Distributed and Scalable
Cassandra’s decentralized approach means every node in a Cassandra cluster is the same. Adding nodes to an existing cluster is relatively easy: make sure your storage settings are correct, then start up the new node. Cassandra takes care of deciding which ranges of data the node is responsible for and replicating the data to it. If you require more control, you are free to perform every step of this process manually.
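To make the idea of a node being "responsible for a range" concrete, here is a minimal sketch of a token ring. It is an illustration only, not Cassandra's code: the `Ring` class, the `token_for` helper, the 32-bit token space and the use of MD5 as a stable hash are all assumptions made for the example.

```python
import bisect
import hashlib


def token_for(key: str) -> int:
    """Map a row key to a position on a 32-bit ring (hypothetical size)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


class Ring:
    """Each node owns the keys between the previous node's token and its own."""

    def __init__(self):
        self._tokens = []   # node tokens, kept sorted
        self._owners = {}   # token -> node name

    def add_node(self, name: str, token: int) -> None:
        bisect.insort(self._tokens, token)
        self._owners[token] = name

    def owner_of(self, key: str) -> str:
        if not self._tokens:
            raise ValueError("ring has no nodes")
        # First node token at or after the key's token, wrapping around.
        i = bisect.bisect_left(self._tokens, token_for(key))
        if i == len(self._tokens):
            i = 0
        return self._owners[self._tokens[i]]


ring = Ring()
ring.add_node("node-a", 0)
ring.add_node("node-b", 2 ** 31)
print(ring.owner_of("user42"))

# A new node takes over part of one existing range; other keys stay put.
ring.add_node("node-c", 2 ** 30)
print(ring.owner_of("user42"))
```

In this sketch, adding `node-c` only moves keys that previously belonged to `node-b`, which is why growing the cluster does not require reshuffling all of the data.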
Schema-free Sparse Table
If you are used to SQL tables, Cassandra’s data model is probably the biggest mental hurdle to overcome. One of the easiest ways to conceptualize the Cassandra data model is to imagine many rows, each row containing a list. You are free to add and remove items from these lists, or to ask Cassandra for the values from sections of these lists (we call them slices).
One of the ramifications of being “sparse” is that Cassandra has no notion of NULL—a key-value pair is simply present in a row or it is not. You are free, however, to store a column name with no value associated with it to indicate NULL (the absence of data) to your application.
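The "many rows, each containing a list" picture can be sketched in a few lines. This is a conceptual model only, not Cassandra's API: the `Row` class and its method names are hypothetical, and the sketch assumes columns are kept sorted by name and that a slice includes both of its end points.

```python
import bisect


class Row:
    """One row: a list of (column name, value) pairs kept sorted by name."""

    def __init__(self):
        self._names = []    # sorted column names
        self._values = {}   # column name -> value

    def insert(self, name: str, value: str = "") -> None:
        # An empty value is allowed: the column name alone can carry meaning.
        if name not in self._values:
            bisect.insort(self._names, name)
        self._values[name] = value

    def remove(self, name: str) -> None:
        if name in self._values:
            del self._values[name]
            self._names.remove(name)

    def slice(self, start: str, finish: str):
        """Return the columns whose names fall between start and finish."""
        lo = bisect.bisect_left(self._names, start)
        hi = bisect.bisect_right(self._names, finish)
        return [(n, self._values[n]) for n in self._names[lo:hi]]

    def __contains__(self, name: str) -> bool:
        return name in self._values


# Rows are independent: they need not share any columns.
rows = {"alice": Row(), "bob": Row()}
rows["alice"].insert("city", "Austin")
rows["alice"].insert("email", "alice@example.com")
rows["alice"].insert("phone")              # present, but with no value
rows["bob"].insert("email", "bob@example.com")

print(rows["alice"].slice("city", "email"))
print("city" in rows["bob"])               # absent, not NULL
```

Note that `bob` has no `city` column at all; nothing is stored to mark it as missing, which is what "sparse" means here.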
Shedding Features
Cassandra does away with several RDBMS features in the name of performance and scalability. Notably, you will have to do without robust transactions, ad-hoc queries, joins or flexible indexes. These aren’t limitations, though. Just ask some of the visible companies using Cassandra to build their applications. These include Facebook, Digg, Twitter and Reddit, to name a few.
Why You Would Use It
If your application has a very large dataset, needs high write throughput and requires distributing redundant copies of your data across servers, racks or datacenters, you should consider using Cassandra. Writes are fast because Cassandra’s write path has been optimized to avoid random disk accesses. Server-side caching makes reads fast as well, if you need it and have the RAM to spare. Cassandra lets you specify where your data is replicated and how many nodes it is replicated to. This makes Cassandra very fault-tolerant.
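One way to see why a write path can avoid random disk access is a log-structured design: append every write to a log, update an in-memory table, and periodically write that table out in one sequential pass. The sketch below is a toy version of that idea, not Cassandra's implementation; `TinyStore`, its file names and the `memtable_limit` parameter are assumptions made for illustration.

```python
import json
import os
import tempfile


class TinyStore:
    """Toy log-structured store: append-only log, in-memory table, sorted flushes."""

    def __init__(self, directory: str, memtable_limit: int = 3):
        self.directory = directory
        self.memtable_limit = memtable_limit
        self.memtable = {}
        self.flushed = []  # paths of flushed files, oldest first
        log_path = os.path.join(directory, "commit.log")
        self.log = open(log_path, "a", encoding="utf-8")

    def write(self, key: str, value) -> None:
        # Sequential append for durability, then a cheap in-memory update.
        self.log.write(json.dumps([key, value]) + "\n")
        self.log.flush()
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self) -> None:
        # Write the whole table once, sorted by key; the file is never modified again.
        path = os.path.join(self.directory, "data-%d.json" % len(self.flushed))
        with open(path, "w", encoding="utf-8") as f:
            for key in sorted(self.memtable):
                f.write(json.dumps([key, self.memtable[key]]) + "\n")
        self.flushed.append(path)
        self.memtable = {}

    def read(self, key: str):
        # Newest data wins: check memory first, then flushed files newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for path in reversed(self.flushed):
            with open(path, encoding="utf-8") as f:
                for line in f:
                    k, v = json.loads(line)
                    if k == key:
                        return v
        return None

    def close(self) -> None:
        self.log.close()


with tempfile.TemporaryDirectory() as directory:
    store = TinyStore(directory, memtable_limit=2)
    store.write("alice", "Austin")
    store.write("bob", "Boston")      # reaches the limit and triggers a flush
    store.write("alice", "Atlanta")   # newer value lives in memory
    print(store.read("alice"), store.read("bob"))
    store.close()
```

Writes in this sketch never seek within an existing file; the trade-off is that a read may have to consult several files, which is where caching helps.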
Rackspace Supports Cassandra Development
Besides having our own internal uses for Cassandra, we at Rackspace believe it is important for you to have the ability to develop your application for Cassandra and deploy it anywhere, not just with a specific cloud provider. Rackspace is committed to an Open Cloud. We currently employ two programmers to work on Cassandra full-time along with several other part-time contributors.
There are three ways you can get more information on Cassandra:
1. IRC. #cassandra for general questions, or #cassandra-dev if you are a programmer looking for answers relating to the codebase.
2. Apache mailing lists. Those interested may subscribe to either a user- or developer-related list, or both. Find out more by sending mail to firstname.lastname@example.org and email@example.com.
3. My email: firstname.lastname@example.org